HTML::Parser

Zebee Johnstone

Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst, I have no idea what they are doing and why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

So I look at the code in the examples dir, and hanchors appears to be
the closest to what I want to do - which is get a set of links and their
associated text. But it appears to possibly be recursing, it's getting
things passed that appear to be hashes to the subroutines, but are
passed as strings....

I want to understand it, to work through it, so I can make my own or
modify it but can't work out what it's doing. I don't get the program
flow. I think because I don't see how it reads the files or works
out $attr->{href} (or why that's a bare word), or if start_handler's
being called once or many times. Or really what's happening at all!





#!/usr/bin/perl -w

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
start_h => [\&a_start_handler, "self,tagname,attr"],
report_tags => [qw(a img)],
);


$p->parse_file(shift || die) || die $!;

sub a_start_handler
{
my($self, $tag, $attr) = @_;
return unless $tag eq "a";
return unless exists $attr->{href};
print "A $attr->{href}\n";

$self->handler(text => [], '@{dtext}' );
$self->handler(start => \&img_handler);
$self->handler(end => \&a_end_handler, "self,tagname");
}

sub img_handler
{
my($self, $tag, $attr) = @_;
return unless $tag eq "img";
push(@{$self->handler("text")}, $attr->{alt} || "");
}

sub a_end_handler
{
my($self, $tag) = @_;
my $text = join("", @{$self->handler("text")});
$text =~ s/^\s+//;
$text =~ s/\s+$//;
$text =~ s/\s+/ /g;
print "T $text\n";

$self->handler("text", undef);
$self->handler("start", \&a_start_handler);
$self->handler("end", undef);
}


Zebee
 
Tassilo v. Parseval

Also sprach Zebee Johnstone:
Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

I've looked at the examples, but using them is cargo cult programming at
its worst, I have no idea what they are doing and why.

I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.

One problem with HTML::Parser appears to be its two available
interfaces. The description of the provided methods in the perldocs
isn't always clear about which API version a method relates to.

Maybe

<http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

will help you. It deals with the old interface (subclassing) which I
find more convenient and easier to use.

Tassilo
 
Zebee Johnstone

In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval said:

I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
my($self, $origtext, $is_cdata) = @_;
print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

Is there a way to associate the tag text with the tag, and only
use that?

So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

Ideally, I'd like to call the text subroutine from the start subroutine,
and pass it a hash to put the text value in. And have it return that
hash.

It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine. In the examples, it seems to use
(text => [], '@{dtext}' ) as args to the text handler, but I've no
idea where those come from at all, or what they are, or how to use them.
I have the "$self" object, which I can pass to a subroutine but no idea
how to get the things I need from it.

Zebee
 
Eric Bohlman

Is there a way to associate the tag text with the tag, and only
use that?

You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.
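
As a rough sketch of that "pull" style, the loop below asks HTML::TokeParser for one token at a time and records what kind it was. It assumes the module is installed; the sample HTML string and the @seen array are made up for illustration. Each token is an array reference whose first element is a type code ("S" for a start tag, "E" for an end tag, "T" for text).

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input; passing a reference to a string parses the string itself.
my $html = '<p>See <a href="http://www.google.com"> Google </a> now</p>';
my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser: $!";

my @seen;
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if    ($type eq 'S') { push @seen, "start $token->[1]" }  # $token->[1] is the tag name
    elsif ($type eq 'E') { push @seen, "end $token->[1]" }
    elsif ($type eq 'T') { push @seen, "text" }               # $token->[1] would be the text
}
print "$_\n" for @seen;
```

Nothing happens until your own loop asks for the next token, which is why it reads much like a while loop over the lines of a file.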
 
Zebee Johnstone

In comp.lang.perl.misc on 27 Aug 2004 04:06:58 GMT
Eric Bohlman said:
You might want to try HTML::TokeParser instead (it's included with the
HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
rather than it calling your code in response to tags and text, you call it
to get the next "token" which can be a start tag, text, end tag, etc. and
then decide what to do with it. Using it is similar to reading through a
file in a loop.


Bingo! Much easier to use and understand. Thanks.

Zebee
 
Tassilo v. Parseval

Also sprach Zebee Johnstone:
In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
Tassilo v. Parseval said:

I understand more now about it, but your tutorial doesn't cover the
text, which I need.

If I print out all the text elements:

sub text {
my($self, $origtext, $is_cdata) = @_;
print "text [$origtext] \n";
}

then I get the text associated with the tags I'm after, but I get a lot
of other text as well.

More specifically, you get all the plain text elements of the HTML file.
Is there a way to associate the tag text with the tag, and only
use that?

Yes, by keeping track of which tag the parser is currently in.
So a bit of HTML
<a href="http://www.google.com"> Google </a> would have "Google"
associated with "http://www.google.com"?

Ideally, I'd like to call the text subroutine from the start subroutine,
and pass it a hash to put the text value in. And have it return that
hash.

Those are handlers and they can't have such a return value. But you have
an object (the HTML::Parser object) in which you can store the data:

#!/usr/bin/perl -w

package MyParser;

use strict;
use base qw/HTML::Parser/;

sub start {
my ($self, $tagname, $attr) = @_;
if ($tagname eq 'a') {
# store the URL as key of a new hash-ref
# associated text not yet known, therefore undef
push @{ $self->{a} }, { $attr->{href} => undef };
$self->{in_a} = $attr->{ href };
}
}

sub end {
my ($self, $tagname) = @_;
delete $self->{in_a} if $tagname eq 'a';
}

sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
# text is between <a> and </a>
$self->{a}->[-1]->{ $self->{in_a} } = $text;
}
}

package main;

use Data::Dumper;
my $html = <<EOHTML;
<html>
<body>
<a href="http://www.first.com" target="bla">First link</a>
<a href="http://www.second.com">Second link</a>
</body>
</html>
EOHTML

my $p = MyParser->new;
$p->parse($html);
print Dumper $p->{a};
__END__
$VAR1 = [
{
'http://www.first.com' => 'First link'
},
{
'http://www.second.com' => 'Second link'
}
];
It isn't clear to me what items the start subroutine knows about that
it can pass to the text subroutine.

Handlers don't call each other. It's HTML::Parser's parse routines that
call the handlers whenever they encounter a start or end tag, a text
block or a comment. Handlers are called as soon as the event happens.
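
To make that call order visible, here is a minimal sketch using the version 3 interface, assuming HTML::Parser is installed. The @events array and the anonymous subs are only illustrative. The second element of each handler pair is the argspec: it tells the parser which values to pass to the sub.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;    # illustrative log of what the parser called, in order
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @events, "start $_[0]" }, 'tagname' ],
    end_h   => [ sub { push @events, "end $_[0]" },   'tagname' ],
    text_h  => [ sub { push @events, "text" },        'dtext'   ],
);

$p->parse('<a href="http://www.google.com"> Google </a>');
$p->eof;       # tell the parser there is no more input

print "$_\n" for @events;
```

None of the subs is called by your own code. The parser calls each one while parse() is running.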
In the examples, it seems to use (text => [], '@{dtext}' ) as args to
the text handler, but I've no idea where those come from at all, or
what they are, or how to use them. I have the "$self" object, which I
can pass to a subroutine but no idea how to get the things I need from
it.

This $self object is the object you create with 'HTML::Parser->new'. By
default it doesn't contain useful information. It holds the state of the
parser. But, as shown above, you can abuse it as a cheap way of keeping
your own state. All I did was inject two new member variables into
the object: $self->{in_a}, which holds the URL while inside an <a>
tag; otherwise this field does not exist. It is deleted in the
end-handler when $tagname is 'a'.

The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded in there and pushed
onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.
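
As for the (text => [], '@{dtext}') call in the hanchors example: as far as I read the HTML::Parser documentation, a handler may be an array reference instead of a code reference. The parser then pushes the values named by the argspec onto that array, and wrapping the argspec in '@{...}' pushes the values themselves rather than one array-ref per event. A small sketch under that assumption (the @chunks name and the sample HTML are made up):

```perl
use strict;
use warnings;
use HTML::Parser;

my @chunks;    # the parser appends each decoded text piece here
my $p = HTML::Parser->new(api_version => 3);
$p->handler(text => \@chunks, '@{dtext}');

$p->parse('<p>Hello <b>big</b> world</p>');
$p->eof;

my $joined = join '', @chunks;
print "$joined\n";
```

In hanchors, $self->handler("text") later returns that same array reference, which is how a_end_handler gets at the collected text.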

Tassilo
 
Zebee Johnstone

Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
Tassilo v. Parseval said:
my ($self, $tagname, $attr) = @_;
if ($tagname eq 'a') {
# store the URL as key of a new hash-ref
# associated text not yet known, therefore undef
push @{ $self->{a} }, { $attr->{href} => undef };

OK, given your explanation below, I think I get this.
sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
# text is between <a> and </a>
$self->{a}->[-1]->{ $self->{in_a} } = $text;

Why -1? I don't understand this line at all...
The second one is $self->{a}. This one is an array-ref of
hash-references. Each new URL/text pair is recorded in there and pushed
onto this array.

When '$p->parse' returns you look at '$p->{a}' and there you have the
data you want to extract.

Zebee
 
Tassilo v. Parseval

Also sprach Zebee Johnstone:
Bear with me please, I'm still getting to grips with a lot of notation
and ideas...

If that means I need to go read something to understand, please point me
at it!

Your question is mostly about the data-structure that is used here. So
that would make it a perldsc/perlreftut/perlref-question.
In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
Tassilo v. Parseval said:
my ($self, $tagname, $attr) = @_;
if ($tagname eq 'a') {
# store the URL as key of a new hash-ref
# associated text not yet known, therefore undef
push @{ $self->{a} }, { $attr->{href} => undef };

OK, given your explanation below, I think I get this.
sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
# text is between <a> and </a>
$self->{a}->[-1]->{ $self->{in_a} } = $text;

Why -1? I don't understand this line at all...

Previously I did this:

push @{ $self->{a} }, { $attr->{href} => undef };

This means: $self->{a} is an array-reference. The hash-reference

{ $attr->{href} => undef }

is pushed onto this array-ref which means it is now the last element.

However, the hash-ref is incomplete. The value associated with the key
$attr->{href} is undef because we can't yet know the text enclosed in
<a> and </a>. But later we will (namely in the text() handler).

Once text() is called, it checks whether we are between <a> and </a>. If
we are, we finally have the text portion we wanted. We know that the
incomplete hash-reference is the last element in @{ $self->{a} }. And so
it becomes:

$self->{a}->[-1]

which is our previously created hash-reference. Only the value is
updated. The key was stored in $self->{in_a}:

$self->{a}->[-1]->{ $self->{in_a} } = $text;
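
The same steps can be tried without any parser at all. In this sketch a plain hash-ref stands in for the parser object, and the URL and link text are made-up sample values:

```perl
use strict;
use warnings;

my $self = {};                        # stand-in for the parser object
my $href = 'http://www.google.com';   # hypothetical sample URL

# What start() does: push an incomplete hash-ref and remember the key.
push @{ $self->{a} }, { $href => undef };
$self->{in_a} = $href;

# What text() does: [-1] is the last element of the array, i.e. the
# hash-ref that was just pushed; fill in its value.
$self->{a}->[-1]->{ $self->{in_a} } = 'Google';

print "$href => $self->{a}[-1]{$href}\n";   # prints: http://www.google.com => Google
```

A negative index counts from the end of the array, so [-1] always means "the most recently pushed element", however many links came before.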

I admit that the data-structure I used is not ideal. If you are sure
that the URLs defined in <a> tags are unique, you can do away with the
array-ref altogether:

sub start {
my ($self, $tag, $attr) = @_;
if ($tag eq 'a') {
$self->{in_a} = $attr->{href};
}
}

sub text {
my ($self, $text) = @_;
if (exists $self->{in_a}) {
$self->{a}->{ $self->{in_a} } = $text;
delete $self->{in_a};
}
}

We didn't need the end-handler as I just realized. We can also delete
$self->{in_a} in text().
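
One caveat with deleting $self->{in_a} in text(): if the link text contains further markup, as in <a href="...">First <b>link</b></a>, text() is called once per piece and only the first piece is kept. A sketch of a variant that appends every piece and keeps the end-handler, assuming unique URLs as above (the package name LinkText and the sample HTML are made up):

```perl
package LinkText;    # hypothetical subclass name
use strict;
use warnings;
use base qw(HTML::Parser);

sub start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'a' && defined $attr->{href};
    $self->{in_a} = $attr->{href};
    $self->{a}{ $attr->{href} } = '';    # start with empty text for this URL
}

sub text {
    my ($self, $text) = @_;
    # append, because text() may be called several times inside one <a>
    $self->{a}{ $self->{in_a} } .= $text if exists $self->{in_a};
}

sub end {
    my ($self, $tag) = @_;
    delete $self->{in_a} if $tag eq 'a';
}

package main;

my $p = LinkText->new;
$p->parse('<a href="http://www.first.com">First <b>link</b></a> other text');
$p->eof;

my %links = %{ $p->{a} };
print "$_ => $links{$_}\n" for sort keys %links;
```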

Tassilo
 
Bart Lateur

Zebee said:
Are there any tutorials or explanations of HTML::Parser?

I've read the perldoc and I don't understand it. It's gibberish to me.

The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.
 
wfsp

Bart Lateur said:
The best intro on the subject, IMO, is gellyfish's old tutorial.

<http://www.gellyfish.com/htexamples/>

Now, if after going through this, you decide that callback-oriented
programming isn't your cup of tea, you might also want to take a look at
the alternative approach, token stream oriented: using HTML::TokeParser,
or a bit more high-level: HTML::TokeParser::Simple. There, you read
tokens (a tag, a piece of plain text) from an HTML source one at a time,
like lines from a file.

HTML::TokeParser doc has an example:
"This example extracts all links from a document. It will print one line for
each link, containing the URL and the textual description between the
<A>...</A> tags:

use HTML::TokeParser;
$p = HTML::TokeParser->new(shift||"index.html");
while (my $token = $p->get_tag("a")) {
my $url = $token->[1]{href} || "-";
my $text = $p->get_trimmed_text("/a");
print "$url\t$text\n";
}"
 
