HTML::Parser

Zebee Johnstone · Aug 26, 2004

Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst, I have no idea what they are doing and why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

So I look at the code in the examples dir, and hanchors appears to be
the closest to what I want to do - which is get a set of links and their
associated text. But it appears to possibly be recursing, it's getting
things passed that appear to be hashes to the subroutines, but are
passed as strings....

I want to understand it, to work through it, so I can make my own or
modify it but can't work out what it's doing. I don't get the program
flow. I think because I don't see how it reads the files or works
out $attr->{href} (or why that's a bare word), or if start_handler's
being called once or many times. Or really what's happening at all!

#!/usr/bin/perl -w

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&a_start_handler, "self,tagname,attr"],
    report_tags => [qw(a img)],
);

$p->parse_file(shift || die) || die $!;

sub a_start_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "a";
    return unless exists $attr->{href};
    print "A $attr->{href}\n";

    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&img_handler);
    $self->handler(end => \&a_end_handler, "self,tagname");
}

sub img_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "");
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";

    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}

Zebee

Tassilo v. Parseval · Aug 26, 2004

Also sprach Zebee Johnstone:

Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst, I have no idea what they are doing and why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

One problem with HTML::Parser appears to be its two available
interfaces. The description of the provided methods in the perldocs
isn't always quite clear about which API version a method relates to.

Maybe

<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

will help you. It deals with the old interface (subclassing) which I
find more convenient and easier to use.

Tassilo

Zebee Johnstone · Aug 27, 2004

In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200

Tassilo v. Parseval said:
<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

will help you. It deals with the old interface (subclassing) which I
find more convenient and easier to use.

Thanks!

Zebee

Zebee Johnstone · Aug 27, 2004

In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200

Tassilo v. Parseval said:
<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
    my($self, $origtext, $is_cdata) = @_;
    print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

Is there a way to associate the tag text with the tag, and only
use that?

So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

Ideally, I'd like to call the text subroutine from the start subroutine,
and pass it a hash to put the text value in. And have it return that
hash.

It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine. In the examples, it seems to use
(text => [], '@{dtext}' ) as args to the text handler, but I've no
idea where those come from at all, or what they are, or how to use them.
I have the "$self" object, which I can pass to a subroutine but no idea
how to get the things I need from it.

Zebee

Eric Bohlman · Aug 27, 2004

Is there a way to associate the tag text with the tag, and only
use that?

You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.
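
To make the "pull" idea concrete, here is a minimal sketch that asks for tokens one at a time with get_token and records what kind each one is. The sample HTML string and the @seen array are made up for illustration; the token layout ("S" for start tag, "E" for end tag, "T" for text, with the tag name or text in the second slot) is as described in the HTML::TokeParser documentation.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document; a reference to a scalar is parsed as literal HTML.
my $html = '<p>See <a href="http://www.google.com">Google</a> now</p>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @seen;    # illustration only: a log of the tokens we pulled
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if    ($type eq 'S') { push @seen, "start $token->[1]" }   # start tag name
    elsif ($type eq 'E') { push @seen, "end $token->[1]" }     # end tag name
    elsif ($type eq 'T') { push @seen, "text [$token->[1]]" }  # raw text
}
print "$_\n" for @seen;
```

Your loop decides when the next token is read, so the program flow stays in your own code, much like reading lines with while (<FH>).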

Zebee Johnstone · Aug 27, 2004

In comp.lang.perl.misc on 27 Aug 2004 04:06:58 GMT

Eric Bohlman said:
You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.

Bingo! Much easier to use and understand. Thanks.

Zebee

Tassilo v. Parseval · Aug 27, 2004

Also sprach Zebee Johnstone:

In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200

Tassilo v. Parseval said:

<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
    my($self, $origtext, $is_cdata) = @_;
    print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

More specifically, you get all the plain text elements of the HTML file.

Is there a way to associate the tag text with the tag, and only
use that?

Yes, by keeping track of which tag the parser is currently in.

So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

Ideally, I'd like to call the text subroutine from the start subroutine,
and pass it a hash to put the text value in. And have it return that
hash.

Those are handlers and they can't have such a return value. But you have
an object (the HTML::Parser object) in which you can store the data:

#!/usr/bin/perl -w

package MyParser;

use strict;
use base qw/HTML::Parser/;

sub start {
    my ($self, $tagname, $attr) = @_;
    if ($tagname eq 'a') {
        # store the URL as key of a new hash-ref
        # associated text not yet known, therefore undef
        push @{ $self->{a} }, { $attr->{href} => undef };
        $self->{in_a} = $attr->{ href };
    }
}

sub end {
    my ($self, $tagname) = @_;
    delete $self->{in_a} if $tagname eq 'a';
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        # text is between <a> and </a>
        $self->{a}->[-1]->{ $self->{in_a} } = $text;
    }
}

package main;

use Data::Dumper;
my $html = <<EOHTML;
<html>
<body>
<a href="http://www.first.com" target="bla">First link</a>
<a href="http://www.second.com">Second link</a>
</body>
</html>
EOHTML

my $p = MyParser->new;
$p->parse($html);
$p->eof;    # signal end of document so any buffered text is flushed
print Dumper $p->{a};
__END__
$VAR1 = [
          {
            'http://www.first.com' => 'First link'
          },
          {
            'http://www.second.com' => 'Second link'
          }
        ];

It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine.

Handlers don't call each other. It's HTML::Parser's parse routines that
call the handlers whenever they encounter a start or end tag, a text
block or a comment. Handlers are called as soon as the event happens.
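
To see that call order for yourself, here is a small sketch using the version 3 handler interface. Each handler only logs the event it was called for. The @events array and the sample HTML are made up for illustration; the argspec names tagname and dtext come from the HTML::Parser documentation.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;    # illustration only: a log of handler calls, in order

my $p = HTML::Parser->new(
    api_version => 3,
    # Each handler is [ code-ref, argspec ]; the argspec says what gets passed in @_.
    start_h => [ sub { push @events, "start <$_[0]>" }, 'tagname' ],
    text_h  => [ sub { push @events, "text [$_[0]]"  }, 'dtext'   ],
    end_h   => [ sub { push @events, "end </$_[0]>"  }, 'tagname' ],
);

# The parser, not our code, decides when each handler runs.
$p->parse('<a href="http://www.google.com">Google</a>');
$p->eof;

print "$_\n" for @events;
```

The start handler runs once per start tag, the text handler once per chunk of text, and the end handler once per end tag, in document order.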

In the examples, it seems to use (text => [], '@{dtext}' ) as args to
the text handler, but I've no idea where those come from at all, or
what they are, or how to use them. I have the "$self" object, which I
can pass to a subroutine but no idea how to get the things I need from
it.

This $self object is the object you create with 'HTML::Parser->new'. By
default it doesn't contain useful information. It holds the state of the
parser. But, as shown above, you can abuse it as a cheap way of keeping
your own state. All I did was inject two new member variables into
the object: $self->{in_a}, which holds the URL while the parser is inside
an <a> tag; otherwise this field does not exist. It is deleted in the
end-handler when $tagname is 'a'.

The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded in there and pushed
onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.
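
As for the (text => [], '@{dtext}') arguments in the hanchors example: in the version 3 interface a handler does not have to be a subroutine. As far as I read the HTML::Parser docs, it can also be an array reference, in which case the parser pushes the event data onto that array instead of calling anything, and wrapping the argspec in '@{...}' makes it push the plain values rather than one array-ref per event. A minimal sketch, with @chunks and the sample HTML as made-up names:

```perl
use strict;
use warnings;
use HTML::Parser;

my @chunks;    # illustration only: receives the decoded text of every text event

my $p = HTML::Parser->new(api_version => 3);

# Array-ref handler: nothing is called, the text is simply collected.
$p->handler(text => \@chunks, '@{dtext}');

$p->parse('<p>Hello <b>world</b></p>');
$p->eof;

my $text = join('', @chunks);
print "$text\n";
```

This is why a_end_handler in the example can later fetch the array with $self->handler("text") and join its contents.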

Tassilo

Zebee Johnstone · Aug 27, 2004

Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200

Tassilo v. Parseval said:
my ($self, $tagname, $attr) = @_;
if ($tagname eq 'a') {
# store the URL as key of a new hash-ref
# associated text not yet known, therefore undef
push @{ $self->{a} }, { $attr->{href} => undef };

OK, given your explanation below, I think I get this.

sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
# text is between <a> and </a>
$self->{a}->[-1]->{ $self->{in_a} } = $text;

Why -1? I don't understand this line at all...

The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded in there and pushed
onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.

Zebee

Tassilo v. Parseval · Aug 27, 2004

Also sprach Zebee Johnstone:

Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

Your question is mostly about the data-structure that is used here. So
that would make it a perldsc/perlreftut/perlref-question.

In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200

Tassilo v. Parseval said:

my ($self, $tagname, $attr) = @_;
if ($tagname eq 'a') {
# store the URL as key of a new hash-ref
# associated text not yet known, therefore undef
push @{ $self->{a} }, { $attr->{href} => undef };

OK, given your explanation below, I think I get this.

sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
# text is between <a> and </a>
$self->{a}->[-1]->{ $self->{in_a} } = $text;

Why -1? I don't understand this line at all...

Previously I did this:

push @{ $self->{a} }, { $attr->{href} => undef };

This means: $self->{a} is an array-reference. The hash-reference

{ $attr->{href} => undef }

is pushed onto this array-ref which means it is now the last element.

However, the hash-ref is incomplete. The value associated with the key
$attr->{href} is undef because we can't yet know the text enclosed in
<a> and </a>. But later we will (namely in the text() handler).

Once text() is called, it checks whether we are between <a> and </a>. If
we are, we finally have the text portion we wanted. We know that the
incomplete hash-reference is the last element in @{ $self->{a} }, so we
can reach it with:

$self->{a}->[-1]

which is our previously created hash-reference. Only the value is
updated. The key was stored in $self->{in_a}:

$self->{a}->[-1]->{ $self->{in_a} } = $text;
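
The same steps can be tried without any parser at all. The following sketch uses a plain hash-ref named $self as a stand-in for the parser object (an assumption for illustration) and does by hand what start() and text() do for two links:

```perl
use strict;
use warnings;

my $self = {};    # stand-in for the parser object, illustration only

# What start() does for the first <a>: push an incomplete pair, remember the key.
push @{ $self->{a} }, { 'http://www.first.com' => undef };
$self->{in_a} = 'http://www.first.com';

# What text() does: [-1] is the last element, i.e. the pair we just pushed.
$self->{a}->[-1]->{ $self->{in_a} } = 'First link';

# Second link: after this push, [-1] refers to the new pair, not the first one.
push @{ $self->{a} }, { 'http://www.second.com' => undef };
$self->{in_a} = 'http://www.second.com';
$self->{a}->[-1]->{ $self->{in_a} } = 'Second link';

for my $pair (@{ $self->{a} }) {
    my ($url, $text) = %$pair;    # each hash holds exactly one key/value pair
    print "$url => $text\n";
}
```

A negative index counts from the end of the array, so [-1] always means "whatever was pushed most recently".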

I admit that the data-structure I used is not ideal. If you are sure
that the URLs defined in <a> tags are unique, you can do away with the
array-ref altogether:

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'a') {
        $self->{in_a} = $attr->{href};
    }
}

sub text {
    my ($self, $text) = @_;
    if (exists $self->{in_a}) {
        $self->{a}->{ $self->{in_a} } = $text;
        delete $self->{in_a};
    }
}

We didn't need the end-handler as I just realized. We can also delete
$self->{in_a} in text().
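
One caveat worth checking against your own pages: deleting $self->{in_a} in text() keeps only the first text chunk, and a link containing markup such as <b> produces several text events. The following is a sketch of a variant that keeps the end handler and appends each chunk instead. It uses the version 3 interface, and %links, $current and the on_* subroutine names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my %links;      # URL => accumulated link text (assumes URLs are unique)
my $current;    # href of the <a> we are currently inside, undef otherwise

sub on_start {
    my ($tag, $attr) = @_;
    return unless $tag eq 'a' && defined $attr->{href};
    $current = $attr->{href};
    $links{$current} = '';
}

sub on_text {
    my ($text) = @_;
    # Append rather than assign: there may be several chunks per link.
    $links{$current} .= $text if defined $current;
}

sub on_end {
    my ($tag) = @_;
    undef $current if $tag eq 'a';
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ \&on_start, 'tagname,attr' ],
    text_h  => [ \&on_text,  'dtext'        ],
    end_h   => [ \&on_end,   'tagname'      ],
);

$p->parse('<a href="http://www.first.com">First <b>bold</b> link</a>');
$p->eof;

print "$_ => $links{$_}\n" for sort keys %links;
```

With the sample input the stored text should be "First bold link", since the three chunks "First ", "bold" and " link" are appended in order.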

Tassilo

Bart Lateur · Aug 27, 2004

Zebee said:
Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.

wfsp · Aug 27, 2004

Bart Lateur said:
The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.

HTML::TokeParser doc has an example:
"This example extracts all links from a document. It will print one line for
each link, containing the URL and the textual description between the
<A>...</A> tags:

use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");

while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}"
