need help reading source code: HTML::Parser

ioneabu · Dec 31, 2004

I was curious about why using regex for parsing HTML was so terrible,
at least in simple cases. I can see why line breaks can complicate
things, but with the relatively small size of most HTML files and power
of today's computers, it should not be a big deal to load the whole
file into a string and remove the line breaks first.

In doing a little searching through the newsgroup, I found a lot of
people saying HTML parsing with regex is always a bad idea but not
explaining clearly why.

My next thought was to read through the code of HTML::Parser and get a
general idea of how they do it or at least how complicated the process
really is.

I used IE 6 to look at the source at cpan.org and the ctrl-f find
command to search through the document. It seems that all of the work
is done in a sub named parse. For example:

$p->parse();

I have searched up and down the source for HTML::Parser and I cannot
find a sub parse. There is a sub parse_file which calls parse.

I searched for any use, require, or do statements and found:

require HTML::Entities;

which I thought might be useful, but was not what I was looking for.

So where is this parse sub? If it is not in HTML::Parser, where is it
and how is HTML::Parser importing it?

Thanks!

wana

Paul Lalli · Dec 31, 2004

> I have searched up and down the source for HTML::Parser and I cannot
> find a sub parse. There is a sub parse_file which calls parse.

HTML::Parser is not a pure-Perl module. It consists of C code as well.
On the CPAN results page, click 'manifest', and then the Parser.xs file.
About halfway down this code, you will see the `void parse(self,chunk)`
function.

Paul Lalli

Matt Garrish · Dec 31, 2004

> I was curious about why using regex for parsing HTML was so terrible,
> at least in simple cases. I can see why line breaks can complicate
> things, but with the relatively small size of most HTML files and power
> of today's computers, it should not be a big deal to load the whole
> file into a string and remove the line breaks first.

Line breaks are the least of your worries when it comes to parsing HTML. The
biggest problem very often is that the file you are dealing with does not
conform to any DTD (and trying to parse unparsable data isn't fun). It's
rare to find a web designer who knows what a DTD is, let alone understands
why they're important. Combine that with the fact that browsers are so loose
in how they render HTML that they accept just about any data construct, and
you should get an idea of why no one looks forward to parsing HTML documents.

As to your specific question about regular expressions, a single expression
just won't work. Parsing is far too complicated a task to be done with a
single expression. You have to account for implied tags (more for SGML),
tags that open but have an implied closing tag, tags that open and close,
tags with any number of attributes, nested tagging, commented-out sections
of tagging, etc., etc., etc.

I could drone on and on, but hopefully you're starting to get an idea of why
it's discouraged. You're always free to try and hand roll a series of
regexes to parse out the text you're after, but just don't be surprised when
it starts failing (and don't call it a parser!).

Matt

ioneabu · Dec 31, 2004

Thanks!

I have been looking through it for a little while now. I am trying to
find where they define SV (typedef or struct?) since it is used
everywhere. The Perl code is much better documented than the C code.

At least I can see for myself that HTML parsing is definitely not a
trivial problem. I will not try it myself anymore. I found out the
hard way that even if it looks like my regex is parsing properly, it's
hard to prove that it is picking up everything it is supposed to.

wana (a humbled amateur programmer)

Peter Wyzl · Jan 1, 2005

:I was curious about why using regex for parsing HTML was so terrible,
: at least in simple cases. I can see why line breaks can complicate
: things, but with the relatively small size of most HTML files and power
: of today's computers, it should not be a big deal to load the whole
: file into a string and remove the line breaks first.
:
: In doing a little searching through the newsgroup, I found a lot of
: people saying HTML parsing with regex is always a bad idea but not
: explaining clearly why.

General parsing of HTML is almost impossible because of the huge range of
potential variations. Specific tasks can be accomplished, often with
several regexen, but 'A' regex to parse HTML is generally not possible.
However, if you have a specific, defined task with clearly consistent HTML
(i.e. from a particular site) then it can be quicker to process this with a
couple of regexen than to use a parser. It is a matter of using tools
appropriate to the task. General processing of HTML requires a parser, and
regexes are not suitable.

HTH

Juha Laiho · Jan 1, 2005

(e-mail address removed) said:

> I have been looking through it for a little while now. I am trying to
> find where they define SV (typedef or struct?) since it is used
> everywhere.

SV is defined in perl.h (within your perl library directory), though it only
resolves to struct STRUCT_SV, which is defined in sv.h in the same directory.

    struct STRUCT_SV {      /* struct sv { */
        void*   sv_any;     /* pointer to something */
        U32     sv_refcnt;  /* how many references to us */
        U32     sv_flags;   /* what we are */
    };

... and as for documentation, see the perlguts, perlxstut and perlxs documents.

Sherm Pendley · Jan 1, 2005

> I have been looking through it for a little while now. I am trying to
> find where they define SV (typedef or struct?) since it is used
> everywhere.

Have a look at 'perldoc perl' - the section titled "Internals and C
Language Interface" lists a number of pods that are of interest.

sherm--
