ioneabu
I was curious about why using regex for parsing HTML was so terrible,
at least in simple cases. I can see why line breaks can complicate
things, but with the relatively small size of most HTML files and the power
of today's computers, it should not be a big deal to load the whole
file into a string and remove the line breaks first.
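To make it concrete, here is a rough sketch of the kind of simple case I have in mind. The sample HTML, the variable names, and the pattern are all made up for illustration, and it only tries to pull the href and the text out of a plain anchor tag; it is not meant as a general solution.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be a whole
# HTML file read into one string.
my $html = <<'END_HTML';
<html><body>
<a href="http://www.cpan.org/"
   title="CPAN">Comprehensive Perl
Archive Network</a>
</body></html>
END_HTML

# Replace the line breaks with spaces so that a tag split across
# lines can be matched as if it were on one line.
(my $flat = $html) =~ s/\r?\n/ /g;

# Naive pattern: capture the href value and the link text of each anchor.
my @links;
while ($flat =~ m{<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>}gi) {
    push @links, [$1, $2];
}

printf "%s => %s\n", @$_ for @links;
```

This handles the sample above, but I have only tried to cover the simplest form of the tag (double-quoted attribute values, no nesting).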
In doing a little searching through the newsgroup, I found a lot of
people saying HTML parsing with regex is always a bad idea but not
explaining clearly why.
My next thought was to read through the code of HTML::Parser and get a
general idea of how they do it or at least how complicated the process
really is.
I used IE 6 to look at the source at cpan.org and the ctrl-f find
command to search through the document. It seems that all of the work
is done in a sub named parse. For example:
$p->parse();
I have searched up and down the source for HTML::Parser and I cannot
find a sub parse. There is a sub parse_file which calls parse.
I searched for any use, require, or do statements and found:
require HTML::Entities;
which I thought might be useful, but was not what I was looking for.
So where is this parse sub? If it is not in HTML::Parser, where is it
and how is HTML::Parser importing it?
Thanks!
wana