HTML::Parser

Discussion in 'Perl Misc' started by Zebee Johnstone, Aug 26, 2004.

  1. Are there any tutorials or explanations of HTML::Parser?

    I've read the perldoc and I don't understand it. It's gibberish to me.

    I've looked at the examples, but using them is cargo cult programming at
    its worst, I have no idea what they are doing and why.

    I understand I create an object. I understand I can then use this to do
    things, but as soon as it talks about handlers, it loses me.

    So I look at the code in the examples dir, and hanchors appears to be
    the closest to what I want to do - which is get a set of links and their
    associated text. But it appears to possibly be recursing, it's getting
    things passed that appear to be hashes to the subroutines, but are
    passed as strings....

    I want to understand it, to work through it, so I can make my own or
    modify it but can't work out what it's doing. I don't get the program
    flow. I think because I don't see how it reads the files or works
    out $attr->{href} (or why that's a bare word), or if start_handler's
    being called once or many times. Or really what's happening at all!





    #!/usr/bin/perl -w

    # This program will print out all <a href=".."> links in a
    # document together with the text that goes with it.

    use HTML::Parser;

    my $p = HTML::Parser->new(api_version => 3,
        start_h => [\&a_start_handler, "self,tagname,attr"],
        report_tags => [qw(a img)],
    );


    $p->parse_file(shift || die) || die $!;

    sub a_start_handler
    {
        my($self, $tag, $attr) = @_;
        return unless $tag eq "a";
        return unless exists $attr->{href};
        print "A $attr->{href}\n";

        $self->handler(text => [], '@{dtext}' );
        $self->handler(start => \&img_handler);
        $self->handler(end => \&a_end_handler, "self,tagname");
    }

    sub img_handler
    {
        my($self, $tag, $attr) = @_;
        return unless $tag eq "img";
        push(@{$self->handler("text")}, $attr->{alt} || "");
    }

    sub a_end_handler
    {
        my($self, $tag) = @_;
        my $text = join("", @{$self->handler("text")});
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        $text =~ s/\s+/ /g;
        print "T $text\n";

        $self->handler("text", undef);
        $self->handler("start", \&a_start_handler);
        $self->handler("end", undef);
    }


    Zebee

    --
    Zebee Johnstone (), proud holder of
    aus.motorcycles Poser Permit #1.
    "Motorcycles are like peanuts... who can stop at just one?"
    Zebee Johnstone, Aug 26, 2004
    #1

  2. Also sprach Zebee Johnstone:

    > Are there any tutorials or explanations of HTML::Parser?
    >
    > I've read the perldoc and I don't understand it. It's gibberish to me.
    >
    > I've looked at the examples, but using them is cargo cult programming at
    > its worst, I have no idea what they are doing and why.
    >
    > I understand I create an object. I understand I can then use this to do
    > things, but as soon as it talks about handlers, it loses me.


    One problem with HTML::Parser appears to be its two available
    interfaces. The description of the provided methods in the perldocs
    isn't always quite clear about which API version a method relates to.

    Maybe

    <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

    will help you. It deals with the old interface (subclassing) which I
    find more convenient and easier to use.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Aug 26, 2004
    #2

  3. In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
    Tassilo v. Parseval <> wrote:
    >
    >
    > <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>
    >
    > will help you. It deals with the old interface (subclassing) which I
    > find more convenient and easier to use.


    Thanks!


    Zebee
    Zebee Johnstone, Aug 27, 2004
    #3
  4. In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
    Tassilo v. Parseval <> wrote:
    >
    > <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>


    I understand more now about it, but your tutorial doesn't cover the
    text, which I need.

    If I print out all the text elements:

    sub text {
        my($self, $origtext, $is_cdata) = @_;
        print "text [$origtext] \n";
    }

    then I get the text associated with the tags I'm after, but I get a lot
    of other text as well.

    Is there a way to associate the tag text with the tag, and only
    use that?

    So a bit of HTML
    <a href="http://www.google.com"> Google </a> would have "Google"
    associated with "http://www.google.com"?

    ideally, I'd like to call the text subroutine from the start subroutine,
    and pass it a hash to put the text value in. And have it return that
    hash.

    It isn't clear to me what items the start subroutine knows about that
    it can pass to the text subroutine. In the examples, it seems to use
    (text => [], '@{dtext}' ) as args to the text handler, but I've no
    idea where those come from at all, or what they are, or how to use them.
    I have the "$self" object, which I can pass to a subroutine but no idea
    how to get the things I need from it.

    Zebee
    Zebee Johnstone, Aug 27, 2004
    #4
  5. Zebee Johnstone wrote:

    > Is there a way to associate the tag text with the tag, and only
    > use that?


    You might want to try HTML::TokeParser instead (it's included with the
    HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
    rather than it calling your code in response to tags and text, you call it
    to get the next "token" which can be a start tag, text, end tag, etc. and
    then decide what to do with it. Using it is similar to reading through a
    file in a loop.
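
    A minimal sketch of that pull-style loop, assuming HTML::TokeParser is
    installed. The sample HTML and the variable names are made up for
    illustration. Each call to get_token returns an array-ref whose first
    element says what kind of token it is ("S" for a start tag, "E" for an
    end tag, "T" for text).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample document (invented for this sketch).
my $html = <<'EOHTML';
<p>Intro text</p>
<a href="http://www.first.com">First link</a>
<a href="http://www.second.com">Second <b>link</b></a>
EOHTML

# Passing a scalar reference parses the string instead of a file.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @links;      # collected [url, text] pairs
my $current;    # the pair being filled in while inside <a>...</a>

while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' and $token->[1] eq 'a') {
        # Start tag: ["S", $tag, \%attr, ...]
        $current = [ $token->[2]{href}, '' ];
    }
    elsif ($type eq 'T' and $current) {
        # Text: ["T", $text, ...]; append, since text can come in pieces
        $current->[1] .= $token->[1];
    }
    elsif ($type eq 'E' and $token->[1] eq 'a' and $current) {
        # End tag: ["E", $tag, ...]
        push @links, $current;
        undef $current;
    }
}

print "$_->[0]\t$_->[1]\n" for @links;
```

    The program decides when to ask for the next token, so the "am I inside
    a link?" state lives in an ordinary variable rather than in the parser
    object.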
    Eric Bohlman, Aug 27, 2004
    #5
  6. In comp.lang.perl.misc on 27 Aug 2004 04:06:58 GMT
    Eric Bohlman <> wrote:
    > You might want to try HTML::TokeParser instead (it's included with the
    > HTML::Parser distribution). It's a "pull" parser rather than a "push" one;
    > rather than it calling your code in response to tags and text, you call it
    > to get the next "token" which can be a start tag, text, end tag, etc. and
    > then decide what to do with it. Using it is similar to reading through a
    > file in a loop.



    Bingo! Much easier to use and understand. Thanks.

    Zebee

    --
    Zebee Johnstone (), proud holder of
    aus.motorcycles Poser Permit #1.
    "Motorcycles are like peanuts... who can stop at just one?"
    Zebee Johnstone, Aug 27, 2004
    #6
  7. Also sprach Zebee Johnstone:

    > In comp.lang.perl.misc on Thu, 26 Aug 2004 07:46:55 +0200
    > Tassilo v. Parseval <> wrote:
    >>
    >> <http://www.unisolve.com.au/perlmeme/tutorials/html_parser.html>

    >
    > I understand more now about it, but your tutorial doesn't cover the
    > text, which I need.
    >
    > If I print out all the text elements:
    >
    > sub text {
    > my($self, $origtext, $is_cdata) = @_;
    > print "text [$origtext] \n";
    > }
    >
    > then I get the text associated with the tags I'm after, but I get a lot
    > of other text as well.


    More specifically, you get all the plain text elements of the HTML file.

    > Is there a way to associate the tag text with the tag, and only
    > use that?


    Yes, by keeping track of which tag the parser is currently in.

    > So a bit of HTML
    > <a href="http://www.google.com"> Google </a> would have "Google"
    > associated with "http://www.google.com"?
    >
    > ideally, I'd like to call the text subroutine from the start subroutine,
    > and pass it a hash to put the text value in. And have it return that
    > hash.


    Those are handlers and they can't have such a return value. But you have
    an object (the HTML::Parser object) in which you can store the data:

    #!/usr/bin/perl -w

    package MyParser;

    use strict;
    use base qw/HTML::Parser/;

    sub start {
        my ($self, $tagname, $attr) = @_;
        if ($tagname eq 'a') {
            # store the URL as key of a new hash-ref
            # associated text not yet known, therefore undef
            push @{ $self->{a} }, { $attr->{href} => undef };
            $self->{in_a} = $attr->{ href };
        }
    }

    sub end {
        my ($self, $tagname) = @_;
        delete $self->{in_a} if $tagname eq 'a';
    }

    sub text {
        my ($self, $text) = @_;
        if (exists $self->{in_a}) {
            # text is between <a> and </a>
            $self->{a}->[-1]->{ $self->{in_a} } = $text;
        }
    }

    package main;

    use Data::Dumper;
    my $html = <<EOHTML;
    <html>
    <body>
    <a href="http://www.first.com" target="bla">First link</a>
    <a href="http://www.second.com">Second link</a>
    </body>
    </html>
    EOHTML

    my $p = MyParser->new;
    $p->parse($html);
    print Dumper $p->{a};
    __END__
    $VAR1 = [
              {
                'http://www.first.com' => 'First link'
              },
              {
                'http://www.second.com' => 'Second link'
              }
            ];

    > It isn't clear to me what items the start subroutine knows about that
    > it can pass to the text subroutine.


    Handlers don't call each other. It's HTML::Parser's parse-routines that
    call the handlers whenever they encounter a start or end tag, a text
    block or a comment. Handlers are called as-soon-as-event-happens.
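
    To make the calling order visible, here is a small sketch using the
    version 3 interface, assuming HTML::Parser is installed. The @events
    array and the anonymous subs are invented for illustration; each handler
    only records that it was called.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;    # log of handler calls, in the order the parser made them

my $p = HTML::Parser->new(
    api_version => 3,
    # Each handler is [ code-ref, argspec ]. The argspec string names
    # the values the parser passes to the code-ref, in that order.
    start_h => [ sub { my ($tag, $attr) = @_;
                       push @events, "start $tag ($attr->{href})" },
                 'tagname,attr' ],
    text_h  => [ sub { my ($text) = @_;
                       push @events, "text '$text'" },
                 'dtext' ],
    end_h   => [ sub { my ($tag) = @_;
                       push @events, "end $tag" },
                 'tagname' ],
);

$p->parse('<a href="http://www.google.com">Google</a>');
$p->eof;    # tell the parser there is no more input

print "$_\n" for @events;
# start a (http://www.google.com)
# text 'Google'
# end a
```

    None of the three subs calls another one. The parser works through the
    input and invokes whichever handler matches what it has just seen.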

    > In the examples, it seems to use (text => [], '@{dtext}' ) as args to
    > the text handler, but I've no idea where those come from at all, or
    > what they are, or how to use them. I have the "$self" object, which I
    > can pass to a subroutine but no idea how to get the things I need from
    > it.


    This $self object is the object you create with 'HTML::Parser->new'. By
    default it doesn't contain useful information. It holds the state of the
    parser. But, as shown above, you can abuse it as a cheap way of keeping
    your own state. All I did was inject two new member variables into
    the object. The first is $self->{in_a}, which holds the URL while the
    parser is inside an <a> tag; otherwise this field does not exist. It is
    deleted in the end-handler when $tagname is 'a'.

    The second one is $self->{a}. This one is an array-ref of
    hash-references. Each new URL/text pair is recorded in there and pushed
    onto this array.

    When '$p->parse' returns you look at '$p->{a}' and there you have the
    data you want to extract.
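
    As for the (text => [], '@{dtext}') arguments in the hanchors example:
    as far as the HTML::Parser documentation describes it, a handler may be
    an array-ref instead of a code-ref, in which case the parser pushes the
    argspec values onto that array rather than calling anything. The sketch
    below assumes HTML::Parser is installed; @chunks and the sample HTML are
    made up.

```perl
use strict;
use warnings;
use HTML::Parser;

my @chunks;    # accumulator: the parser pushes onto this array itself

my $p = HTML::Parser->new(api_version => 3);

# An array-ref in place of a code-ref means "call nothing, just push
# the argspec values here". Wrapping the argspec in @{ } pushes the
# values flat instead of as one array-ref per event. "dtext" is the
# text with HTML entities decoded.
$p->handler(text => \@chunks, '@{dtext}');

$p->parse('<p>Fish &amp; <b>chips</b></p>');
$p->eof;

# Called with only the event name, handler() returns the handler that
# is currently installed, which here is the accumulator array.
my $accumulated = $p->handler('text');

print join('', @$accumulated), "\n";    # Fish & chips
```

    That is why hanchors can later write @{$self->handler("text")}: it is
    fetching the array it installed in the start handler and joining
    whatever text the parser pushed onto it in the meantime.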

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Aug 27, 2004
    #7
  8. Bear with me please, I'm still getting to grips with a lot of notation
    and ideas...

    If that means I need to go read something to understand, please point me
    at it!

    In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
    Tassilo v. Parseval <> wrote:
    > my ($self, $tagname, $attr) = @_;
    > if ($tagname eq 'a') {
    > # store the URL as key of a new hash-ref
    > # associated text not yet known, therefore undef
    > push @{ $self->{a} }, { $attr->{href} => undef };


    OK, given your explanation below, I think I get this.

    > sub text {
    > my ($self, $text) = @_;
    > if (exists $self->{in_a}) {
    > # text is between <a> and </a>
    > $self->{a}->[-1]->{ $self->{in_a} } = $text;


    Why -1? I don't understand this line at all...
    >
    > The second one is $self->{a}. This one is an array-ref of
    > hash-references. Each new URL/text pair is recorded in there and pushed
    > onto this array.
    >
    > When '$p->parse' returns you look at '$p->{a}' and there you have the
    > data you want to extract.
    >


    Zebee
    Zebee Johnstone, Aug 27, 2004
    #8
  9. Also sprach Zebee Johnstone:

    > Bear with me please, I'm still getting to grips with a lot of notation
    > and ideas...
    >
    > If that means I need to go read something to understand, please point me
    > at it!


    Your question is mostly about the data-structure that is used here. So
    that would make it a perldsc/perlreftut/perlref-question.

    > In comp.lang.perl.misc on Fri, 27 Aug 2004 07:41:55 +0200
    > Tassilo v. Parseval <> wrote:
    >> my ($self, $tagname, $attr) = @_;
    >> if ($tagname eq 'a') {
    >> # store the URL as key of a new hash-ref
    >> # associated text not yet known, therefore undef
    >> push @{ $self->{a} }, { $attr->{href} => undef };

    >
    > OK, given your explanation below, I think I get this.
    >
    >> sub text {
    >> my ($self, $text) = @_;
    >> if (exists $self->{in_a}) {
    >> # text is between <a> and </a>
    >> $self->{a}->[-1]->{ $self->{in_a} } = $text;

    >
    > Why -1? I don't understand this line at all...


    Previously I did this:

    push @{ $self->{a} }, { $attr->{href} => undef };

    This means: $self->{a} is an array-reference. The hash-reference

    { $attr->{href} => undef }

    is pushed onto this array-ref which means it is now the last element.

    However, the hash-ref is incomplete. The value associated with the key
    $attr->{href} is undef because we can't yet know the text enclosed in
    <a> and </a>. But later we will (namely in the text() handler).

    Once text is called, it's checked that we are inside <a>|</a>. If we
    are, we finally have the text portion we wanted. We know that the
    incomplete hash-reference is the last element in @{ $self->{a} }. And so
    it becomes:

    $self->{a}->[-1]

    which is our previously created hash-reference. Only the value is
    updated. The key was stored in $self->{in_a}:

    $self->{a}->[-1]->{ $self->{in_a} } = $text;
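
    The same steps can be tried without any parser at all. This sketch uses
    a plain hash-ref as a stand-in for the parser object; the URLs and text
    are sample values only.

```perl
use strict;
use warnings;

my $self = {};    # stand-in for the parser object, which is a hash-ref

# What start() does: append an incomplete { url => undef } record.
push @{ $self->{a} }, { 'http://www.first.com'  => undef };
push @{ $self->{a} }, { 'http://www.second.com' => undef };
$self->{in_a} = 'http://www.second.com';

# What text() does: index -1 is the last element of the array, i.e.
# the record added by the most recent push.
$self->{a}->[-1]->{ $self->{in_a} } = 'Second link';

print $self->{a}->[-1]->{'http://www.second.com'}, "\n";    # Second link
print scalar @{ $self->{a} }, "\n";                         # 2
```

    The first record still has undef as its value, because only the last
    element was touched.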

    I admit that the data-structure I used is not ideal. If you are sure
    that the URLs defined in <a> tags are unique, you can do away with the
    array-ref altogether:

    sub start {
        my ($self, $tag, $attr) = @_;
        if ($tag eq 'a') {
            $self->{in_a} = $attr->{href};
        }
    }

    sub text {
        my ($self, $text) = @_;
        if (exists $self->{in_a}) {
            $self->{a}->{ $self->{in_a} } = $text;
            delete $self->{in_a};
        }
    }

    We didn't need the end-handler as I just realized. We can also delete
    $self->{in_a} in text().
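
    Put together as a complete script, the simplified version might look
    like the sketch below. It assumes HTML::Parser is installed; the class
    name LinkParser and the sample HTML are hypothetical. Because in_a is
    deleted on the first text chunk, only the text up to the first nested
    tag inside a link is kept.

```perl
package LinkParser;    # hypothetical class name for this sketch
use strict;
use warnings;
use base 'HTML::Parser';

sub start {
    my ($self, $tag, $attr) = @_;
    # Remember the URL only for <a> tags that actually carry an href.
    $self->{in_a} = $attr->{href} if $tag eq 'a' and defined $attr->{href};
}

sub text {
    my ($self, $text) = @_;
    return unless exists $self->{in_a};
    # Store url => text, then forget that we are inside a link.
    my $url = delete $self->{in_a};
    $self->{a}{$url} = $text;
}

package main;

# new() without arguments selects the subclassing interface, so the
# start() and text() methods above are called by the parser.
my $p = LinkParser->new;
$p->parse('<a href="http://www.first.com">First link</a> and '
        . '<a href="http://www.second.com">Second link</a>');
$p->eof;

for my $url (sort keys %{ $p->{a} }) {
    print "$url => $p->{a}{$url}\n";
}
```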

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Aug 27, 2004
    #9
  10. Zebee Johnstone wrote:

    >Are there any tutorials or explanations of HTML::Parser?
    >
    >I've read the perldoc and I don't understand it. It's gibberish to me.


    The best intro on the subject, IMO, is gellyfish's old tutorial.

    <http://www.gellyfish.com/htexamples/>

    Now, if after going through this, you decide that callback-oriented
    programming isn't your cup of tea, you might also want to take a look at
    the alternative approach, token stream oriented: using HTML::TokeParser,
    or a bit more high-level: HTML::TokeParser::Simple. There, you read
    tokens (a tag, a piece of plain text) from an HTML source one at a time,
    like lines from a file.
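
    A rough sketch of the higher-level variant, assuming
    HTML::TokeParser::Simple is installed from CPAN. The sample HTML is made
    up. Tokens are objects here, so the code asks each one what it is
    instead of inspecting array positions.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (invented for this sketch).
my $html = '<p>See <a href="http://www.gellyfish.com/htexamples/">'
         . 'the tutorial</a>.</p>';

# A scalar reference parses the string rather than a file.
my $p = HTML::TokeParser::Simple->new(\$html);

my ($url, $text);
my @found;    # "url<TAB>text" lines, one per link

while (my $token = $p->get_token) {
    if ($token->is_start_tag('a')) {
        $url  = $token->get_attr('href');
        $text = '';
    }
    elsif ($token->is_end_tag('a') and defined $url) {
        push @found, "$url\t$text";
        undef $url;
    }
    elsif ($token->is_text and defined $url) {
        # as_is returns the token's original source text
        $text .= $token->as_is;
    }
}

print "$_\n" for @found;
```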

    --
    Bart.
    Bart Lateur, Aug 27, 2004
    #10
  11. "Bart Lateur" wrote:
    > Zebee Johnstone wrote:
    >
    > >Are there any tutorials or explanations of HTML::Parser?
    > >
    > >I've read the perldoc and I don't understand it. It's gibberish to me.

    >
    > The best intro on the subject, IMO, is gellyfish's old tutorial.
    >
    > <http://www.gellyfish.com/htexamples/>
    >
    > Now, if after going through this, you decide that callback-oriented
    > programming isn't your cup of tea, you might also want to take a look at
    > the alternative approach, token stream oriented: using HTML::TokeParser,
    > or a bit more high-level: HTML::TokeParser::Simple. There, you read
    > tokens (a tag, a piece of plain text) from an HTML source one at a time,
    > like lines from a file.
    >
    > --
    > Bart.


    HTML::TokeParser doc has an example:
    "This example extracts all links from a document. It will print one line for
    each link, containing the URL and the textual description between the
    <A>...</A> tags:

    use HTML::TokeParser;
    $p = HTML::TokeParser->new(shift||"index.html");

    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        print "$url\t$text\n";
    }"
    wfsp, Aug 27, 2004
    #11
