HTML::TreeBuilder issue

Discussion in 'Perl Misc' started by Dean Karres, Feb 5, 2009.

  1. Dean Karres

    Dean Karres Guest

    Hi, I posted something similar to this over in comp.lang.perl.modules
    but 1) it looks like that group is not very active and 2) it does not
    look like TreeBuilder is being actively maintained -- CPAN has open
    bug reports on TreeBuilder that are over a year old.

    So what I want to do is harvest some info from a lot of web pages on
    my site. I want to grab the first H1 tag in the file and then, if
    there are H2 tags that follow the H1, I want the "sub"-H2s. If there
    are more H1s or if there are H2s that precede the first H1 or follow
    any secondary H1 then they should be ignored.
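
    A minimal sketch of that selection rule, assuming well-formed HTML. It
    takes a different route from the script further down: rather than walking
    the children of BODY, it asks look_down for every H1 and H2 in document
    order and keeps only the H2s between the first H1 and the next one. The
    sample markup and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative, well-formed input (not the test file from this post).
my $html = <<'HTML';
<html><head><title>thing</title></head>
<body>
<h2>Before the first H1 (ignored)</h2>
<h1>A Title</h1>
<h2>Nav topic 1</h2>
<h2><a name="foo2"></a>Nav topic 2</h2>
<h1>Second Title (ignored)</h1>
<h2>After the second H1 (ignored)</h2>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# In list context look_down returns every match, in document order.
my @headings = $tree->look_down( sub { $_[0]->tag =~ /\Ah[12]\z/ } );

my ( $title, @subtopics );
for my $h (@headings) {
    if ( $h->tag eq 'h1' ) {
        last if defined $title;    # a second H1 ends the section
        $title = $h->as_trimmed_text;
    }
    elsif ( defined $title ) {     # an H2 that follows the first H1
        push @subtopics, $h->as_trimmed_text;
    }
}

print "$title\n" if defined $title;
print "  $_\n" for @subtopics;

$tree->delete;
```

    Because the headings are found wherever they sit in the tree, this does
    not care whether an H2 is wrapped in an A tag or is a direct child of BODY.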

    There are some outstanding issues with the existence and location of
    "A" tags either inside or surrounding the H2 tags but I don't think
    that has anything to do with my problem.

    My problem comes from the sample html file:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html;
    charset=utf-8" />
    <title>thing</title>
    <link href="index.css" rel="stylesheet" type="text/css" />
    </head>

    <body>
    <h1>A Title</h1>

    <a name="foo"></a><h2>Nav topic 1</h2>

    <h2><a name="foo2"></a>Nav topic 2</h2>

    <h2><a name="foo3">Nav topic 3</a</h2>

    <h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
    a></a></h2>

    <h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
    a></h2>

    <a name="foo"><h2>Nav topic 6</h2></a>

    <a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
    </a> </a>

    </body>
    </html>


    This was just a trivial test case to see if I was using TreeBuilder
    correctly. If I have 0-6 H2s then the script works as I expect it
    to. If there are 7+ H2s the script fails on the 7th H2 with the
    message:

    Can't call method "look_down" without a package or object
    reference at /my/script.cgi line 88

    I have marked line 88 below in a comment.

    A thing I just noticed is, if I insert a P tag and paragraph between
    the 6th and 7th H2 then the whole thing works.

    This sure feels like a bug.

    Anyway, the code segment is below. The evals were put in during a
    frustrating few minutes yesterday:


    eval { $body = $tree->look_down('_tag', 'body'); };
    die __LINE__ . ": " . $@ if $@;

    die "$ARGV[0] is missing a BODY tag\n" if (! $body);

    eval { @bodyElementList = $body->content_list(); };
    die __LINE__ . ": " . $@ if $@;

    for (my $i = 0; $i <= $#bodyElementList; $i++)
    {
        if (! $firstH1)
        {
            eval { $H1 = $bodyElementList[$i]->look_down('_tag', 'h1'); };
            die __LINE__ . ": " . $@ if $@;

            if ($H1)
            {
                print STDOUT "<li><a href=\"\">" . $H1->as_trimmed_text() .
                    "</a></li>\n";

                $firstH1++;
            }
        }
        else
        {
            #
            # LINE 88 below
            #
            eval { $H2 = $bodyElementList[$i]->look_down('_tag', 'h2'); };
            die __LINE__ . ": " . $@ if $@;

            if ($H2)
            {
                if ($h2Count == 0)
                {
                    print STDOUT "<ul>\n";
                }

                if ($H2->is_inside('a'))
                {
                    if (($A = $H2->look_up('_tag', 'a', 'href', qr/.+/)))
                    {
                        print STDOUT " <li><a href=\"" . $A->attr('href') .
                            "\">" . $H2->as_trimmed_text() .
                            "</a></li>\n";
                    }
                    elsif (($A = $H2->look_up('_tag', 'a', 'name', qr/.+/)))
                    {
                        print STDOUT " <li><a href=\"$ARGV[0]/#" .
                            $A->attr('name') . "\">" .
                            $H2->as_trimmed_text() . "</a></li>\n";
                    }
                    else
                    {
                        print STDOUT " <li>" . $H2->as_trimmed_text() .
                            "</li>\n";
                    }
                }
                elsif ($H2->look_down('_tag', 'a'))
                {
                    if (($A = $H2->look_down('_tag', 'a', 'href', qr/.+/)))
                    {
                        print STDOUT " <li><a href=\"" . $A->attr('href') .
                            "\">" . $H2->as_trimmed_text() .
                            "</a></li>\n";
                    }
                    elsif (($A = $H2->look_down('_tag', 'a', 'name', qr/.+/)))
                    {
                        print STDOUT " <li><a href=\"$ARGV[0]/#" .
                            $A->attr('name') . "\">" .
                            $H2->as_trimmed_text() . "</a></li>\n";
                    }
                    else
                    {
                        print STDOUT " <li>" . $H2->as_trimmed_text() .
                            "</li>\n";
                    }
                }
                else
                {
                    print STDOUT " <li>" . $H2->as_trimmed_text() .
                        "</li>\n";
                }

                $h2Count++;
            }
        }
    }

    if ($h2Count)
    {
    print STDOUT "</ul>\n";
    }

    $tree->delete();

    exit(0);
    Dean Karres, Feb 5, 2009
    #1

  2. Dean Karres wrote:

    > Hi I posted something similar to this over in comp.lang.perl.modules



    I saw it there, but thought that you had made it too hard to help,
    so I moved on.

    Have you seen the Posting Guidelines that are posted here frequently?

    They contain many tips that make it more likely that someone will
    take the time to figure out what is going on, like make a short
    and complete program that we can run, use __DATA__ to supply file
    contents, use warnings, use strict, etc...

    It is often a Good Idea to try and make the smallest program
    possible that will still produce the problem you need help with.

    I'll do that much for you at least:

    ---------------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_file(*DATA);

    my $body = eval { $tree->look_down('_tag', 'body'); };
    die __LINE__ . ": " . $@ if $@;

    die "missing a BODY tag\n" unless $body;

    my @bodyElementList = eval { $body->content_list(); };
    die __LINE__ . ": " . $@ if $@;


    foreach my $i ( 0 .. $#bodyElementList )
    {
        warn "i=$i\n";
        my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
        die __LINE__ . ": " . $@ if $@;

    }

    __DATA__
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://
    www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html;
    charset=utf-8" />
    <title>thing</title>
    <link href="index.css" rel="stylesheet" type="text/css" />
    </head>

    <body>
    <h1>A Title</h1>

    <a name="foo"></a><h2>Nav topic 1</h2>

    <h2><a name="foo2"></a>Nav topic 2</h2>

    <h2><a name="foo3">Nav topic 3</a</h2>

    <h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
    a></a></h2>

    <h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
    a></h2>

    <a name="foo"><h2>Nav topic 6</h2></a>

    <a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
    </a> </a>

    </body>
    </html>
    ---------------------------------



    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Feb 5, 2009
    #2

  3. Dean Karres

    Dean Karres Guest

    Oops. I apologize. I am rarely in these groups, and I have poor
    vision, so apart from doing a general search for my issue before
    posting, I did not see the guidelines. Thank you for reformatting the
    question.

    All the best,
    Dean...K...
    Dean Karres, Feb 5, 2009
    #3
  4. smallpond

    smallpond Guest

    On Feb 5, 3:24 pm, Dean Karres wrote:
    > Hi I posted something similar to this over in comp.lang.perl.modules
    >


    Have you run this through a validator? Maybe the module has a problem
    with malformed HTML.
    smallpond, Feb 5, 2009
    #4
  5. Dean Karres

    Dean Karres Guest

    > Have you run this through validator?

    A great idea! The passed-in HTML was very poorly formed... that was
    not part of my test case; I just threw in some nearly random HTML with
    H# tags mixed with A tags.

    After validating and cleaning the HTML, TreeBuilder did exactly what I
    hoped it would.

    Which brings up a different issue. The TreeBuilder docs say there is
    a flag "$root->warn(value)" that "determines whether syntax errors
    during parsing should generate warnings, emitted via Perl's 'warn'
    function." I set this to true and ran the script on the obviously bad
    HTML and saw no warnings.
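
    One thing worth checking, offered as a guess rather than a confirmed
    cause: warnings can only be emitted while the document is being parsed,
    and new_from_file parses as part of construction, so setting the flag on
    the tree it returns comes too late. A minimal sketch that sets the flag
    first and collects whatever the parser reports follows. The @warnings
    array and the shortened markup are illustrative, and nothing here asserts
    which warnings, if any, this particular input produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Collect anything passed to warn() instead of printing it straight away.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Build an empty tree, set options, and only then feed it the document.
my $tree = HTML::TreeBuilder->new;
$tree->warn(1);
my $warn_enabled = $tree->warn;    # read the flag back

# One of the malformed headings from the test file (note the "</a</h2>").
$tree->parse( '<html><head><title>thing</title></head><body>'
            . '<h1>A Title</h1>'
            . '<h2><a name="foo3">Nav topic 3</a</h2>'
            . '</body></html>' );
$tree->eof;

print "warn flag is ", ( $warn_enabled ? "on" : "off" ), "\n";
printf "%d warning(s) captured\n", scalar @warnings;
print "  $_" for @warnings;

$tree->delete;
```

    If the count stays at zero even with the flag set before parsing, the
    parser simply does not treat this kind of markup as worth a warning, and
    an external validator remains the more reliable check.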

    However, I hope I can trust my users to validate their html.

    cheers,
    Dean Karres, Feb 5, 2009
    #5
  6. On 2009-02-05 21:19, Tad J McClellan wrote:
    > Dean Karres wrote:
    >> Hi I posted something similar to this over in comp.lang.perl.modules

    [...]
    > It is often a Good Idea to try and make the smallest program
    > possible that will still produce the problem you need help with.
    >
    > I'll do that much for you at least:
    >
    > ---------------------------------
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    > use HTML::TreeBuilder;
    >
    > my $tree = HTML::TreeBuilder->new_from_file(*DATA);
    >
    > my $body = eval { $tree->look_down('_tag', 'body'); };
    > die __LINE__ . ": " . $@ if $@;
    >
    > die "missing a BODY tag\n" unless $body;
    >
    > my @bodyElementList = eval { $body->content_list(); };
    > die __LINE__ . ": " . $@ if $@;


    | $h->content_list()
    | Returns a list of the child nodes of this element -- i.e., what
    | nodes (elements or text segments) are inside/under this element.
    | (Note that this may be an empty list.)

    Note: "elements or text segments".

    >
    >
    > foreach my $i ( 0 .. $#bodyElementList )
    > {
    > warn "i=$i\n";
    > my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
    > die __LINE__ . ": " . $@ if $@;


    Here the code assumes that all the members of @bodyElementList are
    objects which have a method look_down(). But this is only true of
    elements - text segments are simple strings, not objects.
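
    A minimal sketch of one way to guard against that: skip any child for
    which ref() is false before calling look_down on it. The markup is a
    small made-up document with loose text directly inside BODY, not the
    test file from this thread, and @found is just an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative markup with a text segment between two BODY children.
my $tree = HTML::TreeBuilder->new;
$tree->parse( '<html><head><title>thing</title></head><body>'
            . '<h1>A Title</h1> stray text <h2>Nav topic 1</h2>'
            . '</body></html>' );
$tree->eof;

my $body = $tree->look_down( '_tag', 'body' )
    or die "missing a BODY tag\n";

my @found;
foreach my $child ( $body->content_list ) {
    next unless ref $child;    # text segments are plain strings: skip them
    my $H2 = $child->look_down( '_tag', 'h2' );
    push @found, $H2->as_trimmed_text if $H2;
}

print "$_\n" for @found;

$tree->delete;
```

    With the guard in place the loop never tries to call a method on a
    string, so the "without a package or object reference" error cannot be
    triggered from this line.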


    > __DATA__

    [...]
    > <a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>

                                   ^
    Here is the text segment - the blank between
    '<a href="http://www.usc.edu">' and '<a name="foo">'.


    Tip of the day: Use the perl debugger and/or Data::Dumper.
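
    For instance, a sketch of the Data::Dumper route. Dumping the elements
    themselves is very verbose, so this version first maps each child to a
    short label: the class and tag for objects, the string itself for text
    segments. The markup and the labels are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse( '<html><head><title>thing</title></head><body>'
            . '<h1>A Title</h1> stray text <h2>Nav topic 1</h2>'
            . '</body></html>' );
$tree->eof;

my $body = $tree->look_down( '_tag', 'body' );

# ref() is the class name for an element and empty for a plain string.
my @labels = map { ref $_ ? ref($_) . ' <' . $_->tag . '>' : $_ }
             $body->content_list;

my $dump = Dumper( \@labels );
print $dump;

$tree->delete;
```

    Any entry in the dump that is not labelled with a class name is a text
    segment, and calling look_down on it would fail.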

    hp
    Peter J. Holzer, Feb 7, 2009
    #6
  7. Larry Gates

    Larry Gates Guest

    On Thu, 5 Feb 2009 15:26:32 -0800 (PST), Dean Karres wrote:

    >> Have you run this through validator?

    >
    > A great idea! The passed-in HTML was very poorly formed... that was
    > not part of my test case; I just threw in some nearly random HTML with
    > H# tags mixed with A tags.
    >
    > After validating and cleaning the HTML, TreeBuilder did exactly what I
    > hoped it would.
    >
    > Which brings up a different issue. The TreeBuilder docs say there is
    > a flag "$root->warn(value)" that "determines whether syntax errors
    > during parsing should generate warnings, emitted via Perl's 'warn'
    > function." I set this to true and ran the script on the obviously bad
    > HTML and saw no warnings.
    >
    > However, I hope I can trust my users to validate their html.
    >
    > cheers,


    I would advise you to get the HTML with LWP::Simple (use LWP::Simple;),
    which has yet to fail me, now that I've used it a grand total of twice.

    My current output is:

    C:\MinGW\source> perl tree2.pl
    i=0
    i=1
    i=2
    i=3
    i=4
    i=5
    i=6
    i=7
    i=8
    21: Can't call method "look_down" without a package or object reference at tree2.pl line 20.

    C:\MinGW\source>type tree2.pl
    #!/usr/bin/perl
    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_file(*DATA);

    my $body = eval { $tree->look_down('_tag', 'body'); };
    die __LINE__ . ": " . $@ if $@;

    die "missing a BODY tag\n" unless $body;

    my @bodyElementList = eval { $body->content_list(); };
    die __LINE__ . ": " . $@ if $@;


    foreach my $i ( 0 .. $#bodyElementList )
    {
        warn "i=$i\n";
        my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
        die __LINE__ . ": " . $@ if $@;

    }

    __DATA__
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://
    www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html;
    charset=utf-8" />
    <title>thing</title>
    <link href="index.css" rel="stylesheet" type="text/css" />
    </head>

    <body>
    <h1>A Title</h1>

    <a name="foo"></a><h2>Nav topic 1</h2>

    <h2><a name="foo2"></a>Nav topic 2</h2>

    <h2><a name="foo3">Nav topic 3</a</h2>

    <h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
    a></a></h2>

    <h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
    a></h2>

    <a name="foo"><h2>Nav topic 6</h2></a>

    <a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
    </a> </a>

    </body>
    </html>

    # perl tree2.pl
    C:\MinGW\source>


    --
    larry gates

    Yes, we have consensus that we need 64 bit support. :)
    -- Larry Wall in <>
    Larry Gates, Feb 13, 2009
    #7