Zebee Johnstone
Are there any tutorials or explanations of HTML::Parser?
I've read the perldoc and I don't understand it. It's gibberish to me.
I've looked at the examples, but using them is cargo cult programming at
its worst: I have no idea what they are doing or why.
I understand I create an object. I understand I can then use this to do
things, but as soon as it talks about handlers, it loses me.
So I look at the code in the examples dir, and hanchors appears to be
the closest to what I want to do - which is get a set of links and their
associated text. But it appears to be recursing, and the subroutines
are being passed things that look like hashes, yet they are specified
as strings....
I want to understand it and work through it so I can write my own or
modify it, but I can't work out what it's doing. I don't get the program
flow, I think because I don't see how it reads the file, how it works
out $attr->{href} (or why href is a bare word there), or whether
a_start_handler is called once or many times. Or really what's
happening at all!
#!/usr/bin/perl -w

# This program will print out all <a href=".."> links in a
# document together with the text that goes with it.

use HTML::Parser;

my $p = HTML::Parser->new(api_version => 3,
    start_h     => [\&a_start_handler, "self,tagname,attr"],
    report_tags => [qw(a img)],
);
$p->parse_file(shift || die) || die $!;

sub a_start_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "a";
    return unless exists $attr->{href};
    print "A $attr->{href}\n";
    $self->handler(text  => [], '@{dtext}' );
    $self->handler(start => \&img_handler);
    $self->handler(end   => \&a_end_handler, "self,tagname");
}

sub img_handler
{
    my($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "");
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}

Zebee