Writing an HTML parser wasn't as hard as I thought it'd be


Robert Maas, see http://tinyurl.com/uh3t

For years I've had needs for parsing HTML, but avoided writing a
full HTML parser because I thought it'd be too much work. So
instead I wrote various hacks that gleaned particular data from
special formats of HTML files (such as Yahoo! Mail folders and
individual messages) while ignoring the bulk of the HTML file.

But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all of them flaky and bug-ridden, I finally decided
to <cliche>bite the bullet</cliche> and write a genuine HTML parser.

Yesterday (Wednesday) I started work on the tokenizer, using one of
my small Web pages from years ago as the test data:
<http://www.rawbw.com/~rem/WAP.html>
As I was using TDD (Test-Driven Development), I discovered that the
file was still using the *wrong* syntax <p /> to make blank lines
between parts of the text, so I changed those to valid markup, and
my HTML tokenizer then worked successfully on the file; I finished
that much last night.

Then I switched to using the Google-Groups Advanced-Search Web page
as test data, and finally got the tokenizer working for it after a
few more hours' work today (Thursday).
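
In case anyone wants the general shape of the thing, it boils down
to something like this (a from-memory sketch written for this post,
not my actual code, and it ignores comments, entities, quoted ">"
inside attribute values, and the other fun stuff a real tokenizer
has to cope with):

  (defun tokenize-html (string)
    ;; Sketch only: split STRING into a flat list of (:TEXT string),
    ;; (:TAG name attr-string) and (:ENDTAG name) tokens.
    (let ((tokens '()) (pos 0) (len (length string)))
      (loop while (< pos len) do
        (let ((lt (position #\< string :start pos)))
          (cond ((null lt)                ; no more tags: the rest is text
                 (push (list :text (subseq string pos)) tokens)
                 (setf pos len))
                (t
                 (when (> lt pos)         ; text before the next tag
                   (push (list :text (subseq string pos lt)) tokens))
                 (let* ((gt (or (position #\> string :start lt) len))
                        (body (subseq string (1+ lt) gt))
                        (endp (and (plusp (length body))
                                   (char= (char body 0) #\/)))
                        (body (if endp (subseq body 1) body))
                        (space (position-if
                                (lambda (c)
                                  (member c '(#\Space #\Tab #\Newline)))
                                body))
                        (name (string-downcase (subseq body 0 space)))
                        (attrs (if space (subseq body (1+ space)) "")))
                   (push (if endp
                             (list :endtag name)
                             (list :tag name attrs))
                         tokens)
                   (setf pos (1+ gt)))))))
      (nreverse tokens)))

  ;; (tokenize-html "<FONT SIZE=2>hello</FONT><BR>")
  ;; => ((:TAG "font" "SIZE=2") (:TEXT "hello") (:ENDTAG "font")
  ;;     (:TAG "br" ""))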

Then I wrote the routine to take the list of tokens and find all
matching pairs of open tag and closing tag, replacing them with a
single container cell that included everything between the tags.
For example (:TAG "font" ...) (:TEXT "hello") (:INPUT ...) (:ENDTAG "font")
would be replaced by (:CONTAIN "font" (...) ((:TEXT "hello") (:INPUT ...))).
I single-stepped it at the level of full collapses, all the way to
the end of the test file, so I could watch it and get a feel for
what was happening. It worked perfectly the first time, but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> that were opened but never closed, and
also lots of <p> <p> <p> that weren't closed either. Even some
unclosed elements of tables.
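
The pairing pass itself is roughly this (again just a sketch for
this post rather than the real routine, which also has to know
about <br>, <input>, <option> and friends):

  (defun collapse-tokens (tokens)
    ;; Sketch only: pair up (:TAG ...) / (:ENDTAG ...) tokens into
    ;; (:CONTAIN name attrs (child...)) cells; opening tags that are
    ;; never closed are left behind as bare (:TAG ...) cells, and
    ;; stray close tags with no matching open are dropped.
    (let ((stack (list (list nil nil '())))) ; frames: (name attrs rev-children)
      (flet ((emit (x) (push x (third (first stack)))))
        (dolist (tok tokens)
          (case (first tok)
            (:tag (push (list (second tok) (third tok) '()) stack))
            (:endtag
             (let ((name (second tok)))
               (when (find name stack :key #'first :test #'equal)
                 ;; unwind to the matching open tag; intervening
                 ;; unclosed opens degrade to bare :TAG cells
                 (loop
                   (destructuring-bind (fname fattrs rev-children) (pop stack)
                     (let ((children (nreverse rev-children)))
                       (cond ((equal fname name)
                              (emit (list :contain fname fattrs children))
                              (return))
                             (t
                              (emit (list :tag fname fattrs))
                              (mapc #'emit children)))))))))
            (t (emit tok))))
        ;; anything still open at end of input also degrades to :TAG
        (loop while (rest stack) do
          (destructuring-bind (fname fattrs rev-children) (pop stack)
            (emit (list :tag fname fattrs))
            (mapc #'emit (nreverse rev-children))))
        (nreverse (third (first stack))))))

  ;; (collapse-tokens (tokenize-html "<b><font size=2>hello</font>"))
  ;; => ((:TAG "b" "") (:CONTAIN "font" "size=2" ((:TEXT "hello"))))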

Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
<http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
Any place you see a :TAG, that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the W3C validator says about the HTML?
<http://validator.w3.org/check?uri=http://www.google.com/advanced_group_search?hl=en>
Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors are because the doctype
declaration claims it's XHTML Transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder how many fewer errors would show up in the W3C
validator if all the tags were changed to lower case? Modified GG page:
<http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
<http://validator.w3.org/check?uri=http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
Result: Failed validation, 693 errors
Hmmm, this validation error concerns me:
145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire WebPage source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.
 

Toby A Inkster

Robert said:
But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all of them flaky and bug-ridden, I finally decided
to <cliche>bite the bullet</cliche> and write a genuine HTML parser.

Congratulations. Real parsers are fun.

But wouldn't it have been a bit easier to reuse one of the many existing
parsers? e.g. http://opensource.franz.com/xmlutils/xmlutils-dist/phtml.htm
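
From memory (so treat the package name and the exact shape of the
result as approximate rather than gospel), usage is about this
simple:

  ;; phtml, from Franz's xmlutils; returns LHTML-style list structure
  (net.html.parser:parse-html "<p>Hi <b>there</b>")
  ;; => something like ((:P "Hi " (:B "there")))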

--
Toby A Inkster BSc (Hons) ARCS
http://tobyinkster.co.uk/
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
 

Tim Bradshaw

Robert said:
but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> that were opened but never closed, and
also lots of <p> <p> <p> that weren't closed either. Even some
unclosed elements of tables.

Depending on the version of HTML (on the DTD in use), omitted closing
tags may be perfectly legal. SGML has many options that allow omission
of tags, both closing and opening. This is one of the things that XML
did away with, since it makes it impossible to build a parse tree for the
document unless you know the DTD. So obviously they are not omissible
for any document claiming to be XHTML, I think.

P, for instance, has an omissible close tag in HTML 4.01.
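
(For reference, the declaration in the HTML 4.01 DTD is

  <!ELEMENT P - O (%inline;)*  -- paragraph -->

and the "O" in the second position is what says the end tag may be
omitted; the "-" before it says the start tag is required.)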

--tim
 

John Thingstad

Robert said:
Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
<http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
Any place you see a :TAG, that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the W3C validator says about the HTML?
<http://validator.w3.org/check?uri=http://www.google.com/advanced_group_search?hl=en>
Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors are because the doctype
declaration claims it's XHTML Transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder how many fewer errors would show up in the W3C
validator if all the tags were changed to lower case? Modified GG page:
<http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
<http://validator.w3.org/check?uri=http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
Result: Failed validation, 693 errors
Hmmm, this validation error concerns me:
145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire WebPage source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.

As an ex-employee of Opera I can say that writing a Web Browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
SGML is more difficult to parse. Then there is the fact that many
sites rely on errors in the HTML being handled just like in
Microsoft Explorer. I can't count the number of times I heard that Opera
was broken, only to find that it was an HTML error on the web site that
Explorer got around.
 

Thomas F. Burdick

John Thingstad said:
As an ex-employee of Opera I can say that writing a Web Browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.

This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.
 

dpapathanasiou

Thomas F. Burdick said:
This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.

But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.
 

Pascal Costanza

dpapathanasiou said:
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.

If early browsers had rejected incorrect html, the web would have never
been that successful.

What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).

Stupid error messages from stupid technology are a hindrance, not an enabler.


Pascal
 

Tim Bradshaw

Pascal said:
If early browsers had rejected incorrect html, the web would have never
been that successful.

What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).

Stupid error messages from stupid technology are a hindrance, not an enabler.

Well said.
 

Ben C

Tim Bradshaw said:
Well said.

But completely wrong.

If the stupid technology tells you at once what the error is, you fix it,
and then you are less confused four hours later when something doesn't
display the way you were expecting and you eventually track it down to a
missing closing tag somewhere.

It's not as if the authors _want_ to use incorrectly nested tags.
They're just careless mistakes that we all make and that are trivial to
fix if they're pointed out at once, but that take hours if you have to
work back from their eventual consequences. Fixing them sooner rather
than later helps the author more than anyone else.

I can only see a case for not reporting errors where it is close to
certain that they will not have consequences. In most systems such
errors are classified as "Warnings" and can be turned off.
 

Jonathan N. Little

Ben said:
But completely wrong.

If the stupid technology tells you at once what the error is, you fix it,
and then you are less confused four hours later when something doesn't
display the way you were expecting and you eventually track it down to a
missing closing tag somewhere.

Agree totally. That is why IE is abominable for debugging. It is "so
good" at second-guessing intent when junk is thrown at it that it
fails miserably when it gets valid markup...
 

Thomas A. Russ

dpapathanasiou said:
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.

On the other hand, it could also be argued that, especially early on,
before web authoring tools existed, such laxity contributed to the
widespread adoption of html. By making the renderer not particularly
picky about the input, it made it easier for authors to hand create the
html pages without the frustration of having things get rejected and not
appear at all.

That provided a nicer development environment (somewhat reminiscent of
Lisp environments), where things would work, even if not every part of
the document were well-formed and correct. The author could then go
back and fix the places that didn't work. That would be true even if
correct rendering were strict, but I do think that laxness in
enforcement of the standards helped the spread of html in the early
days.
 

Andy Dingley

dpapathanasiou said:
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec,

We'd also still be using HTML 1.0, as the legacy problems would stifle
any change to the standard.
 

Paul Wallich

Andy said:
We'd also still be using HTML 1.0, as the legacy problems would stifle
any change to the standard.

Remember that originally no one was supposed to write HTML. It was
supposed to be produced automagically by design tools and transducers
operating on existing formatted documents.

You know, the same way that no one is supposed to write in assembler.

paul
 

dorayme

dpapathanasiou said:
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.

It's a nice fantasy that a zero tolerance policy would work. Face
it, someone would bring out a competitor that tolerated faults
and everyone would rush to use that one instead.
 

mbstevens

dorayme said:
It's a nice fantasy that a zero tolerance policy would work. Face
it, someone would bring out a competitor that tolerated faults
and everyone would rush to use that one instead.

You need to set up a switch 0-9 for how much crap code the parser will
accept.
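
Something along those lines, say (names invented on the spot, and in
the Lisp the original parser is written in), where 0 rejects anything
questionable and 9 swallows whatever it's fed:

  (defparameter *html-lenience* 5
    "0 (strict) through 9 (accept any crap).")

  (defun report-html-problem (severity format-string &rest args)
    ;; SEVERITY is 0-9: how bad this particular piece of crap is.
    (cond ((> severity *html-lenience*)
           (error "HTML error: ~?" format-string args))
          ((> severity (floor *html-lenience* 2))
           (warn "HTML warning: ~?" format-string args))
          (t nil)))                  ; low enough: accept it silently

  ;; e.g. (report-html-problem 7 "end tag </~A> omitted" "font")

Then every place the parser trips over junk, it calls that with a
severity instead of deciding for itself.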
 

John Thingstad

On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see

As an ex-employee of Opera I can say that writing a Web Browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
SGML is more difficult to parse. Then there is the fact that many
sites rely on errors in the HTML being handled just like in
Microsoft Explorer. I can't count the number of times I heard that Opera
was broken, only to find that it was an HTML error on the web site that
Explorer got around.

I am a bit reluctant to reply to this one.
Suffice it to day I was warning him about the difficulties.
I don't know or care about the difficulties about creating a web browser.
Of course that is not exactly true, hence the reluctance.
 

John Thingstad

I am a bit reluctant to reply to this one.
Suffice it to day I was warning him about the difficulties.
I don't know or care about the difficulties about creating a web browser.
Of course that is not exactly true, hence the reluctance.

sorry about the word pun in line two.
 

Robert Maas, see http://tinyurl.com/uh3t

From: (e-mail address removed) (Thomas A. Russ)
it could also be argued that, especially early on, before web
authoring tools existed, such laxity contributed to the
widespread adoption of html. By making the renderer not
particularly picky about the input, it made it easier for authors
to hand create the html pages without the frustration of having
things get rejected and not appear at all.

That part is fine, but what you say next isn't quite right...
That provided a nicer development environment (somewhat
reminiscent of Lisp environments), where things would work, even
if not every part of the document were well-formed and correct.

There are two major aspects of Lisp environments, only one of which
is present in an HTML-coding/viewing environment:
- Tolerant of mistakes: one mistake doesn't abort compilation and
cause totally null output except for compiler diagnostics. TRUE
- Interactive R-E-P loop whereby you can instantly see the result
of each line of code as you write it, and after a mistake
immediately modify your attempt until you get it right before
moving on to the next line of code. NO!!
The interactive model for any Web-based service (HTML, CGI, PHP,
etc.) is very different from Lisp's (or Java's BeanShell's) R-E-P.
Web-based services always deal with an entire application, even if
some parts aren't yet written, either missing or stubbed to show
where they *would* be. The entire application is always re-started
each time you try one new line of code, and you must manually
search to the bottom of the output to see where it is, which is
more work for the human visual processing system than just watching
the R-E-P scroll by where your input is immediately followed by the
corresponding output. (And if you insert new code *between*
existing code, then it's even more effort to scroll to where the
new effect should be located to see how it looks when rendered.)

As an example of this difference without changing languages, I
write both R-E-P applications and CGI applications using Common
Lisp. Whenever I am writing a CGI application, it's a lot more
hassle, because of the totally different debugging environment. I'm
constantly fighting it somehow. Depending on the application, or
which part of the application I'm writing, I use one of two
strategies:
- If I'm writing a totally simple application, I copy an old CGI
launchpad (a .cgi file which does nothing except invoke CMUCL
with the appropriate Lisp file to load) and change the name of the
file (the name of the Lisp file to load), then I create a dummy Lisp
file which does nothing except make the call to my library
routine to generate the CGI/MIME header for either TEXT/PLAIN or
TEXT/HTML, print a banner, and exit (a minimal sketch of such a
stub appears after this list). Then I immediately
start the Web browser to make sure I have at least that "hello
world" piece of trivia working at all before I go on. Then I add
one new line of code at a time and immediately re-load the Web
page, to force re-execution of the entire program up to that
point, and scroll if necessary to bring the result of the new
line of code on-screen. I include a lot of FORTRAN-style
debugging printouts to explicitly show me the result of each new
line of code even if that result wouldn't normally be shown
during normal running of the finished application. After a few
more lines of code have been added and FORTRAN-style
debug-printed out, I start to comment out some of the early
debug-print statements that I'm tired of seeing over and over.
This necessity to add a print statement for virtually every new
line of code is a big nuisance compared to the R-E-P loop where
that always happens by default, and commenting out the print
statements later is a nuisance compared to the R-E-P loop, where
old printout simply scrolls off screen all by itself.
- Whenever I write a significant D/P (data-processing algorithm),
to avoid the hassle described above, I usually develop the
entire algorithm in the normal R-E-P loop, then port it to CGI
using the above technique at the very end, so only the interface
from CGI and the toplevel calls to various algorithms need be
debugged in the CGI environment with FORTRAN-style print
statements etc. If the algorithm needs the results from a HTML
form, sometimes I first write a dummy application which does
nothing except call the library function to decode the
urlencoded form contents, then print the association list to
screen. Then I run it once, copy the association list from
screen and paste into the R-E-P debug environment. Then after
the algorithm using that data has been completely debugged, I
splice a call to the algorithm back into the CGI application and
finish up debugging there.
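
To make that concrete, the whole "hello world" stub plus the
form-contents dump amounts to something like this (a from-memory
sketch written for this post, not my actual library routines; in
real use the query string comes from the CGI QUERY_STRING
environment variable via whatever getenv facility the Lisp
provides):

  (defun url-decode (string)
    ;; Decode + and %XX escapes in one urlencoded field.
    (with-output-to-string (out)
      (let ((i 0) (n (length string)))
        (loop while (< i n) do
          (let ((c (char string i)))
            (case c
              (#\+ (write-char #\Space out) (incf i))
              (#\% (write-char (code-char (parse-integer string
                                                         :start (+ i 1)
                                                         :end (+ i 3)
                                                         :radix 16))
                               out)
                   (incf i 3))
              (t (write-char c out) (incf i))))))))

  (defun query-string-alist (query)
    ;; Split an urlencoded QUERY_STRING into an association list
    ;; of (name . value) strings.
    (loop for start = 0 then (1+ amp)
          for amp = (position #\& query :start start)
          for pair = (subseq query start amp)
          for eq = (position #\= pair)
          unless (zerop (length pair))
            collect (cons (url-decode (subseq pair 0 eq))
                          (url-decode (if eq (subseq pair (1+ eq)) "")))
          while amp))

  (defun cgi-toplevel (query)
    ;; What the .cgi launchpad ends up calling after loading the file.
    (format t "Content-Type: text/plain~%~%") ; CGI/MIME header + blank line
    (format t "hello world~%")
    ;; FORTRAN-style debug printout of the decoded form contents,
    ;; ready to be copied back into the R-E-P loop:
    (format t "~S~%" (query-string-alist query)))

  ;; (cgi-toplevel "as_q=lisp&num=10") prints the header followed by
  ;;   hello world
  ;;   (("as_q" . "lisp") ("num" . "10"))
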
The point is that debugging in a Web-refresh-to-restart-whole-program
environment is so painful compared to R-E-P that I avoid it as much
as possible. But with HTML (or PHP), there's no alternative. There
simply is no way, that I know of anyway, to develop new code in an
R-E-P sort of environment.

Now to be fair, in HTML nearly *every* line of code written (not
counting stylesheets, which are recent compared to "early HTML"
discussion here) produces some visual effect which is physically
located in the same relationships to other visual effects as the
physical relationship of the corresponding source (HTML) code. So
we never have to add extra "print" statements and later comment
them out. At most we might sometimes have to add extra visual
characters around white space just to show whether the white space
is really there, since white space at end of line doesn't show
visually. But still, the need to type input in one window and then
switch to another window and deliberately invoke a page-reload
command and then wait for a network transaction (even if working on
local server) before seeing the result, and *not* seeing the source
code and visual effect together on one screen where the eye can
dart back and forth to spot what mistake in source caused the bad
output, is a significant pain during development, which is why your glib
comparison between HTML code development and Lisp R-E-P code
development just isn't true.

Now if somebody could figure out a way to "block" pieces of HTML
code so that it would be possible for a development environment to
alternate showing source code and rendered output within a single
window, and in fact the programmer could type the source directly
onto this intermix-window, either typing a new block of code at the
bottom, or editing an old block of code, that would make it like
the Lisp R-E-P. But then since HTML is primarily a visual-effect
language, and what is really being debugged is the way text looks
nice laid out on a page, the interspersed source would ruin the
visual effect and in some ways make debugging more difficult. So
maybe instead it could use a variation of the idea whereby the main
display screen shows exactly the rendered output, but aside it is
the source screen, with blocks of code mapped to blocks of
presentation via connecting brackets, somewhat like this:

PRESENTATION                     SOURCE
--------------------------+    +----------------------------------------
Hi, this is a paragraph   |    | <p>Hi,
of rendered text, all     +----+ this is a paragraph of rendered text,
nicely aligned. I wonder  |    | all nicely aligned.
if it will work?          |    | I wonder if it will work?</p>
--------------------------+    +----------------------------------------

But of course, although that might help today's HTML authors if
somebody created such a tool, no such tool existed back in the
early days we're talking about here, so my argument about the pain
of HTML coding compared to Lisp R-E-P stands.

(Also, it may be difficult to work with tables using the idea of
sequential blocks of HTML source, in fact the whole idea may be
useless for such "interesting" (in Chinese sense) coding.)

You then said:
The author could then go back and fix the places that didn't
work.

Which is rather different from Lisp R-E-P development, where you
hardly ever have to go *back* to fix stuff that didn't work, rather
you fix it immediately while it's still the latest thing you wrote.
If you try to write a whole bunch of Lisp code without bothering to
test each part individually, and *then* you try to run the whole
mess, what happens is similar to what happens when programming in
C, the very first thing that bombs out causes nothing after it to
be properly tested at all. This is a significant difference between
HTML (and other formatting languages, where the various parts of
the script are rather independent), vs. any programming language
where later processing is heavily dependent on earlier results.

Now for a real bear, try PHP: It works *only* in a Web environment,
so you can't try it in an interactive environment as you could with
Lisp or Perl, but it's a true programming language, where later
processing steps are heavily dependent on early results, so you
can't just throw together a lot of stuff (as with HTML) and debug
all the independent pieces in any sequence you want. You are
essentially forced to use that painful style of development I
described as the first (least preferred) style of CGI programming.

Back to the main topic: One thing, for the early days, which might
have bridged the gap between sloppy first-cut HTML where the
browser guesses what you really meant (and different browsers guess
differently) and good HTML, would be a way of switching "pedantic"
mode off and on. But hardly any C programmers ever use the pedantic
mode, so why should we expect HTML authors to do so either??

The bottom line is that there's a conflict between ease of
first-cut authoring that made HTML so popular in the early days,
and strict following of the specs to make proper HTML source, and I
don't see any easy solution. Maybe the validation services (such as
W3C provides), together with a "hall of shame" for the worst
offenders at HTML that grossly fails validation, would coerce some
decent fraction of authors to eventually fix their original HTML to
become proper HTML?? (Or maybe Google could do validation on all
Web sites it indexes, and demote any site that fails validation, so
it doesn't show up in the first page of search results, and the
more severely a Web page fails validation the further down the
search results it's forced? If Google can fight the government of
the USA regarding invasion of privacy of users, maybe they can try
my idea here too?? Google *is* the 800 pound gorilla of the Web,
and if they applied reward/punishment to good/bad Web authors, I
think it would have a definite effect. Unfortunately, Google is one
of the worst offenders, as I noted the other day. Never mind...)

Anybody want to join me in building a Hall of Shame for HTML
authors, starting with Google's grossly bad HTML (declared as
transitional XHTML which is totally bogus, ain't even close to
XHTML)?
 

Robert Maas, see http://tinyurl.com/uh3t

From: "Thomas F. Burdick said:
Face it, HTML is a markup language historically created directly
by humans, which means you *will* get good content with syntax
errors by authors who will not fix it.

I'm not talking about occasionally crappy HTML in personal Web
pages. I'm talking about bugs in software that generate the same
crappy HTML millions of times per day, every time anyone anywhere
in the world asks Google to perform a search, the same crappy
mistake in *every* copy of the form emitted by Google's search
engine. Also, the toplevel forms to invoke Google's search engines,
which are fetched via bookmarks or links millions of times per day.
A teensy bit of effort to fix those forms and form-generating
software would fix many millions of Web pages delivered per day.
 
