Parsing HTML - using HTML::TreeBuilder

Discussion in 'Perl Misc' started by olson_ord@yahoo.it, Oct 5, 2006.

  1. Guest

    Hi,
    I am trying to use Perl to parse a web page, and I cannot get
    started. I hope someone can help me.
    I searched online and found that I am supposed to use
    HTML::TreeBuilder. In the example below I am trying to get the text in
    the "H2" tags. From the documentation there seem to be two
    ways to do this (please correct me if I am wrong): look_down() and
    find_by_tag_name(). The latter is rather old.
    I have used the former to look for images (just as a test) and the
    latter to look for the "H2" tags. In both cases the number of
    H2s or images comes out as 0.
    What am I doing wrong here - or is there an easier way to get the
    text in a HTML tag. I would be grateful for any help.

    Regards,
    Rio

    --------------- Code -------------------------
    use strict;
    use LWP::UserAgent;
    use LWP::Simple;
    use URI::Escape;
    use HTTP::Request::Common;
    use HTML::TreeBuilder;
    my $url = "http://wordlist.gredic.com/kaleidoscope";
    my $html = get($url);
    # print $html;

    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file($html);

    ## --- Trial 1 ----------------
    my @imgs = $tree->look_down( _tag => 'img');

    ## --- Trial 2 ----------------
    my $elements = $tree->elementify();

    my @word = $elements->find_by_tag_name('h2');

    ## --- Results ----------------
    print "H2 Words = " . @word . "\n";
    print "Imgs = " . @imgs . "\n";

    # At the end need to free up the memory
    $tree->delete;
    print "completed script\n";
    --------- End of Code ---------------------

    P.S. The above is not my actual code - but a working example to
    demonstrate my question
     
    , Oct 5, 2006
    #1

  2. Paul Lalli Guest

    wrote:
    > What am I doing wrong here - or is there an easier way to get the
    > text in a HTML tag.


    I personally prefer HTML::TokeParser for parsing HTML, but TIMTOWTDI

    > use strict;


    You forgot:
    use warnings;

    > use LWP::UserAgent;
    > use LWP::Simple;


    You generally don't use both of these. . .

    > use URI::Escape;
    > use HTTP::Request::Common;
    > use HTML::TreeBuilder;
    > my $url = "http://wordlist.gredic.com/kaleidoscope";
    > my $html = get($url);


    This function returns the actual HTML content of the URL.

    > # print $html;
    >
    > my $tree = HTML::TreeBuilder->new();
    > $tree->parse_file($html);


    This attempts to find a file named by the string in $html and parse
    that file. Obviously, no such file exists.

    You want
    $tree->parse($html);
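
    A minimal sketch of that fix, assuming an inline HTML string in
    place of the page fetched with get($url) (the sample markup is made
    up for illustration). When feeding a string through parse(), the
    parser should also be told that the input is finished by calling
    eof(); after that, look_down() finds both the images and the h2
    elements, and as_text() gives the text inside each heading.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the content that get($url) would return.
my $html = <<'HTML';
<html><body>
<h2>kaleidoscope</h2>
<p><img src="a.png" alt="a"><img src="b.png" alt="b"></p>
<h2>second heading</h2>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # parse a string of HTML, not a file name
$tree->eof;            # signal that the whole document has been supplied

# In list context look_down() returns every matching element.
my @imgs = $tree->look_down( _tag => 'img' );
my @h2   = $tree->look_down( _tag => 'h2' );

# Collect the heading text before the tree is freed.
my @h2_text = map { $_->as_text } @h2;

print "Imgs = ", scalar(@imgs), "\n";
print "H2 Words = ", scalar(@h2), "\n";
print "  $_\n" for @h2_text;

# Free the tree once the data has been extracted.
$tree->delete;
```

    parse_file() remains the right call when the HTML really is in a
    file on disk; it takes a file name or a file handle.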

    Paul Lalli
     
    Paul Lalli, Oct 5, 2006
    #2

  4. Guest

    Dear Paul,
    Thanks a lot for taking the time to answer. I am not new to
    programming (I use C++ for my work), but I am new to Perl. Yes, now
    at least I have got this initial part to work. I expect I will have
    more questions in the future.
    If you prefer to use HTML::TokeParser I would love to look at it
    myself. So if you have some handy tutorials on using the TokeParser
    then it would be helpful for me. (Right now I could only locate
    something at http://www.perlmonks.org/index.pl?node_id=99254, which
    I will look at later.)
    Thanks again,
    O.O.



    Paul Lalli wrote:

    >
    > I personally prefer HTML::TokeParser for parsing HTML, but TIMTOWTDI
    >


    > > my $tree = HTML::TreeBuilder->new();
    > > $tree->parse_file($html);

    >
    > This attempts to find a file named by the string in $html and parse
    > that file. Obviously, no such file exists.
    >
    > You want
    > $tree->parse($html);
    >
    > Paul Lalli
     
    , Oct 6, 2006
    #4
  5. Paul Lalli Guest

    wrote:

    > If you prefer to use HTML::TokeParser I would love to look at it
    > myself. So if you have some handy tutorials on using the TokeParser
    > then it would be helpful for me.


    I don't know about tutorials, but the documentation for the module is
    pretty decent:
    http://search.cpan.org/~gaas/HTML-Parser-3.55/lib/HTML/TokeParser.pm

    Paul Lalli
     
    Paul Lalli, Oct 6, 2006
    #5
  6. Guest

    Thanks a lot Paul.
    I looked at the documentation HTML::TokeParser and it does not tell me
    if there is an easy way to find a certain token (e.g. "h2") i.e. It
    seems that I would have to start from the beginning and then scan all
    the tokens until I reach the required token. (I am basically looking
    for a find() function - or something similar.)
    Thanks a lot for your help.
    Regards,
    O.O.

    Paul Lalli wrote:
    > wrote:
    >
    > > If you prefer to use HTML::TokeParser I would love to look at it
    > > myself. So if you have some handy tutorials on using the TokeParser
    > > then it would be helpful for me.

    >
    > I don't know about tutorials, but the documentation for the module is
    > pretty decent:
    > http://search.cpan.org/~gaas/HTML-Parser-3.55/lib/HTML/TokeParser.pm
    >
    > Paul Lalli
     
    , Oct 6, 2006
    #6
  7. DJ Stunks Guest

    wrote:
    > Thanks a lot Paul.
    > I looked at the documentation HTML::TokeParser and it does not tell me
    > if there is an easy way to find a certain token (e.g. "h2") i.e. It
    > seems that I would have to start from the beginning and then scan all
    > the tokens until I reach the required token. (I am basically looking
    > for a find() function - or something similar.)


    Look a little harder, dude. it's (basically) 2 lines of code:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use LWP::Simple;
    use HTML::TokeParser;

    my $url = 'http://wordlist.gredic.com/kaleidoscope';
    my $html = get( $url );

    my $p = HTML::TokeParser->new( \$html );

    while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
        printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
    }

    __END__
     
    DJ Stunks, Oct 6, 2006
    #7
  8. Guest

    Thanks DJ.
    I had thought of using a while statement (from looking at the tutorial
    I mentioned above). This would make my code look like a series of while
    statements. I think I would stick to using HTML::TreeBuilder - i.e.
    Just because I have almost got my code working using that.
    Thanks to you and Paul for your help.
    O.O.

    P.S. To other readers (who are unfamiliar with Perl, like myself):
    consider using a last statement in the while loop if you only need
    the first match, i.e.

    while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
        printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
        last;
    }

    -- so that you can process the file further. (Perl calls the 'break'
    statement 'last').
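
    To illustrate "process the file further": HTML::TokeParser keeps its
    position in the document between calls, so after stopping at the
    first h2 the next get_tag() call continues from that point. The
    sketch below assumes an inline HTML string (made up for
    illustration) and uses a plain if in place of a while loop with
    last, which has the same effect when only one match is wanted.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document standing in for the fetched page.
my $html = <<'HTML';
<html><body>
<h2>kaleidoscope</h2>
<p>See <a href="/next">the next word</a>.</p>
<h2>second heading</h2>
</body></html>
HTML

# Passing a reference to a scalar makes TokeParser read from the string.
my $p = HTML::TokeParser->new( \$html );

# Take only the first h2 and remember its text.
my $first_h2;
if ( my $tag = $p->get_tag('h2') ) {
    $first_h2 = $p->get_trimmed_text;
}

# The parser is now positioned just after that heading's text,
# so this finds the first link that follows it.
my ( $href, $link_text );
if ( my $tag = $p->get_tag('a') ) {
    $href      = $tag->[1]{href};        # attribute hash is element 1
    $link_text = $p->get_trimmed_text;
}

print "$first_h2\n";
print "$href: $link_text\n";
```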


    DJ Stunks wrote:
    > wrote:
    > > Thanks a lot Paul.
    > > I looked at the documentation HTML::TokeParser and it does not tell me
    > > if there is an easy way to find a certain token (e.g. "h2") i.e. It
    > > seems that I would have to start from the beginning and then scan all
    > > the tokens until I reach the required token. (I am basically looking
    > > for a find() function - or something similar.)

    >
    > Look a little harder, dude. it's (basically) 2 lines of code:
    >
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use LWP::Simple;
    > use HTML::TokeParser;
    >
    > my $url = 'http://wordlist.gredic.com/kaleidoscope';
    > my $html = get( $url );
    >
    > my $p = HTML::TokeParser->new( \$html );
    >
    > while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
    > printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
    > }
    >
    > __END__
     
    , Oct 6, 2006
    #8