need help reading source code: HTML::Parser

ioneabu

I was curious about why using regex for parsing HTML was so terrible,
at least in simple cases. I can see why line breaks can complicate
things, but with the relatively small size of most HTML files and power
of today's computers, it should not be a big deal to load the whole
file into a string and remove the line breaks first.

In doing a little searching through the newsgroup, I found a lot of
people saying HTML parsing with regex is always a bad idea but not
explaining clearly why.

My next thought was to read through the code of HTML::Parser and get a
general idea of how they do it or at least how complicated the process
really is.

I used IE 6 to look at the source at cpan.org and the ctrl-f find
command to search through the document. It seems that all of the work
is done in a sub named parse. For example:

$p->parse();

I have searched up and down the source for HTML::Parser and I cannot
find a sub parse. There is a sub parse_file which calls parse.

I searched for any use, require, or do statements and found:

require HTML::Entities;

which I thought might be useful, but was not what I was looking for.

So where is this parse sub? If it is not in HTML::Parser, where is it
and how is HTML::Parser importing it?

Thanks!

wana
 
Paul Lalli

> I have searched up and down the source for HTML::Parser and I cannot
> find a sub parse. There is a sub parse_file which calls parse.

HTML::Parser is not a pure-Perl module. It consists of C code as well.
On the CPAN results page, click 'manifest', and then the Parser.xs file.
About halfway down this code, you will see the `void parse(self,chunk)`
function.
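
If you want to confirm this from Perl itself rather than by reading the C,
you can ask the interpreter whether a given sub is compiled code. This is
only a sketch using the core B module: is_xsub is a made-up helper name,
and the HTML::Parser check at the end runs only if that module happens to
be installed on your machine.

```perl
use strict;
use warnings;
use B ();
use List::Util ();

# Hypothetical helper: true if the code ref points at compiled (XS) code
# rather than at a sub written in Perl.
sub is_xsub {
    my ($code_ref) = @_;
    return B::svref_2object($code_ref)->XSUB ? 1 : 0;
}

sub written_in_perl { return 42 }

print "written_in_perl: ", is_xsub(\&written_in_perl), "\n";
print "List::Util::sum: ", is_xsub(\&List::Util::sum), "\n";

# Assumption: HTML::Parser may or may not be installed, so guard the check.
if (eval { require HTML::Parser; 1 }) {
    my $parse = HTML::Parser->can('parse');
    print "HTML::Parser::parse: ",
        ($parse ? is_xsub($parse) : 'not found'), "\n";
}
```

A sub that reports true here has no Perl source to find, which would explain
why searching the .pm file for "sub parse" comes up empty.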

Paul Lalli
 
Matt Garrish

> I was curious about why using regex for parsing HTML was so terrible,
> at least in simple cases. I can see why line breaks can complicate
> things, but with the relatively small size of most HTML files and power
> of today's computers, it should not be a big deal to load the whole
> file into a string and remove the line breaks first.

Line breaks are the least of your worries when it comes to parsing HTML. The
biggest problem very often is that the file you are dealing with does not
conform to any DTD (and trying to parse unparsable data isn't fun). It's
rare to find a web designer who knows what a DTD is, let alone understands
why they're important. Combine that with the fact that browsers are so loose
in how they render HTML that they accept just about any data construct, and
you should get an idea of why no one looks forward to parsing HTML documents.

As to your specific question about regular expressions, a single expression
just won't work. Parsing is far too complicated a task to be done with a
single expression. You have to account for implied tags (more for SGML),
tags that open but have an implied closing tag, tags that open and close,
tags with any number of attributes, nested tagging, commented out sections
of tagging, etc., etc., etc.

I could drone on and on, but hopefully you're starting to get an idea of why
it's discouraged. You're always free to try and hand roll a series of
regexes to parse out the text you're after, but just don't be surprised when
it starts failing (and don't call it a parser!).
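
To make a couple of those cases concrete, here is a minimal sketch of the
kind of regex people usually start with, run against inputs a browser would
accept. The helper name strip_tags_naive and the sample strings are made up
for illustration; only core Perl is used.

```perl
use strict;
use warnings;

# Naive approach: treat everything from "<" to the next ">" as a tag.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# Hypothetical inputs.
my @samples = (
    '<p>plain text</p>',                      # the easy case
    '<img alt="a > b" src="x.png">caption',   # ">" inside an attribute value
    '<!-- <b>old</b> --> new',                # tags inside a comment
);

for my $html (@samples) {
    printf "%-40s => [%s]\n", $html, strip_tags_naive($html);
}
# <p>plain text</p>                        => [plain text]
# <img alt="a > b" src="x.png">caption     => [ b" src="x.png">caption]
# <!-- <b>old</b> --> new                  => [old --> new]
```

The first case looks fine, which is what makes this approach tempting. The
second cuts the tag off at the ">" inside the attribute, and the third leaks
commented-out text into the result.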

Matt
 
ioneabu

Thanks!

I have been looking through it for a little while now. I am trying to
find where they define SV (typedef or struct?) since it is used
everywhere. The Perl code is much better documented than the C code.

At least I can see for myself that HTML parsing is definitely not a
trivial problem. I will not try it myself anymore. I found out the
hard way that even if it looks like my regex is parsing properly, it's
hard to prove that it is picking up everything it is supposed to.

wana (a humbled amateur programmer)
 
Peter Wyzl

: I was curious about why using regex for parsing HTML was so terrible,
: at least in simple cases. I can see why line breaks can complicate
: things, but with the relatively small size of most HTML files and power
: of today's computers, it should not be a big deal to load the whole
: file into a string and remove the line breaks first.
:
: In doing a little searching through the newsgroup, I found a lot of
: people saying HTML parsing with regex is always a bad idea but not
: explaining clearly why.

General parsing of HTML with regexes is almost impossible because of the huge
range of potential variations. Specific tasks can be accomplished, often with
several regexen, but a single regex that parses HTML is generally not possible.
However, if you have a specific, well-defined task with clearly consistent HTML
(i.e., from a particular site), then it can be quicker to process it with a
couple of regexen than to use a parser. It is a matter of using tools
appropriate to the task. General processing of HTML requires a parser;
regexes are not suitable for that.
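
As an illustration of the narrow case, here is a rough sketch that assumes a
page where every row has exactly the same layout. The markup, the class
names and the %price_for hash are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical page fragment: every row is generated from one template.
my $page = <<'HTML';
<tr><td class="name">Widget</td><td class="price">4.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
HTML

# One targeted pattern for one known layout: extraction, not parsing.
my %price_for;
while ($page =~ m{<td class="name">([^<]+)</td><td class="price">([\d.]+)</td>}g) {
    $price_for{$1} = $2;
}

print "$_ costs $price_for{$_}\n" for sort keys %price_for;
# Gadget costs 12.00
# Widget costs 4.50
```

This holds up only as long as the site keeps emitting that exact markup. An
extra attribute, reordered cells or a line break inside a row would make the
pattern silently skip rows.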

HTH
 
Juha Laiho

(e-mail address removed) said:
> I have been looking through it for a little while now. I am trying to
> find where they define SV (typedef or struct?) since it is used
> everywhere.

SV is a typedef in perl.h (within your Perl library directory), though it only
resolves to struct STRUCT_SV, which is defined in sv.h in the same directory.

struct STRUCT_SV {          /* struct sv { */
    void*   sv_any;         /* pointer to something */
    U32     sv_refcnt;      /* how many references to us */
    U32     sv_flags;       /* what we are */
};

... and as for documentation, see the perlguts, perlxstut and perlxs documents.
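
Two of those fields can also be observed from Perl without touching C. What
follows is just a sketch using the core B module, whose REFCNT and FLAGS
methods report the values held in sv_refcnt and sv_flags; refcount_of is a
made-up helper name.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: read the reference count the interpreter keeps
# for whatever the given reference points at.
sub refcount_of {
    my ($ref) = @_;
    return B::svref_2object($ref)->REFCNT;
}

my $value  = "hello";
my $before = refcount_of(\$value);

my $extra_ref = \$value;               # one more reference to the same SV
my $after     = refcount_of(\$value);

# The absolute numbers include temporaries created by the measurement
# itself, so only the difference between the two readings is meaningful.
printf "refcount went from %d to %d\n", $before, $after;
printf "flags: 0x%x\n", B::svref_2object(\$value)->FLAGS;
```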
 
Sherm Pendley

> I have been looking through it for a little while now. I am trying to
> find where they define SV (typedef or struct?) since it is used
> everywhere.

Have a look at 'perldoc perl' - the section titled "Internals and C
Language Interface" lists a number of pods that are of interest.

sherm--
 
