HTML::TreeBuilder issue

Dean Karres

Hi, I posted something similar to this over in comp.lang.perl.modules
but 1) it looks like that group is not very active and 2) it does not
look like TreeBuilder is being actively maintained -- CPAN has open
bug reports on TreeBuilder that are over a year old.

So what I want to do is harvest some info from a lot of web pages on
my site. I want to grab the first H1 tag in the file and then, if
there are H2 tags that follow the H1, I want the "sub"-H2s. If there
are more H1s or if there are H2s that precede the first H1 or follow
any secondary H1 then they should be ignored.
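
The selection rule described above could be sketched roughly as follows. This is only an illustration under assumptions: the sample markup is made up and well-formed, and the variable names ($title, @subtopics) are hypothetical. It relies on look_down returning all matches in document order when called in list context.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical, well-formed sample input (not the test file from this post).
my $html = <<'HTML';
<html><body>
<h2>Ignored: before the first H1</h2>
<h1>A Title</h1>
<h2>Nav topic 1</h2>
<h2>Nav topic 2</h2>
<h1>Second Title</h1>
<h2>Ignored: after a later H1</h2>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, look_down returns every matching element in document order.
my @headings = $tree->look_down(sub { $_[0]->tag =~ /^h[12]$/ });

my ($title, @subtopics);
foreach my $h (@headings) {
    if ($h->tag eq 'h1') {
        last if defined $title;    # a second H1 ends the section of interest
        $title = $h->as_trimmed_text();
    }
    elsif (defined $title) {       # H2s before the first H1 are skipped
        push @subtopics, $h->as_trimmed_text();
    }
}

print "$title\n";
print "  $_\n" for @subtopics;

$tree->delete();
```

Walking one flat list of headings avoids indexing into content_list() at all, so it does not depend on how deeply the H2s end up nested under A tags.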

There are some outstanding issues with the existence and location of
"A" tags either inside or surrounding the H2 tags but I don't think
that has anything to do with my problem.

My problem comes from the sample html file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" />
<title>thing</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>

<body>
<h1>A Title</h1>

<a name="foo"></a><h2>Nav topic 1</h2>

<h2><a name="foo2"></a>Nav topic 2</h2>

<h2><a name="foo3">Nav topic 3</a</h2>

<h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
a></a></h2>

<h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
a></h2>

<a name="foo"><h2>Nav topic 6</h2></a>

<a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
</a> </a>

</body>
</html>


This was just a trivial test case to see if I was using TreeBuilder
correctly. If I have 0-6 H2s then the script works as I expect it
to. If there are 7+ H2s the script fails on the 7th H2 with the
message:

Can't call method "look_down" without a package or object
reference at /my/script.cgi line 88

I have marked line 88 below in a comment.

A thing I just noticed is, if I insert a P tag and paragraph between
the 6th and 7th H2 then the whole thing works.

This sure feels like a bug.

Anyway, the code segment is below. The evals were put in during a
frustrating few minutes yesterday:


eval { $body = $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;

die "$ARGV[0] is missing a BODY tag\n" if (! $body);

eval { @bodyElementList = $body->content_list(); };
die __LINE__ . ": " . $@ if $@;

for (my $i = 0; $i <= $#bodyElementList; $i++)
{
    if (! $firstH1)
    {
        eval { $H1 = $bodyElementList[$i]->look_down('_tag', 'h1'); };
        die __LINE__ . ": " . $@ if $@;

        if ($H1)
        {
            print STDOUT "<li><a href=\"\">" . $H1->as_trimmed_text() .
                "</a></li>\n";

            $firstH1++;
        }
    }
    else
    {
        #
        # LINE 88 below
        #
        eval { $H2 = $bodyElementList[$i]->look_down('_tag', 'h2'); };
        die __LINE__ . ": " . $@ if $@;

        if ($H2)
        {
            if ($h2Count == 0)
            {
                print STDOUT "<ul>\n";
            }

            if ($H2->is_inside('a'))
            {
                if (($A = $H2->look_up('_tag', 'a', 'href', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"" . $A->attr('href') .
                        "\">" . $H2->as_trimmed_text() .
                        "</a></li>\n";
                }
                elsif (($A = $H2->look_up('_tag', 'a', 'name', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"$ARGV[0]/#" .
                        $A->attr('name') . "\">" .
                        $H2->as_trimmed_text() . "</a></li>\n";
                }
                else
                {
                    print STDOUT " <li>" . $H2->as_trimmed_text() .
                        "</li>\n";
                }
            }
            elsif ($H2->look_down('_tag', 'a'))
            {
                if (($A = $H2->look_down('_tag', 'a', 'href', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"" . $A->attr('href') .
                        "\">" . $H2->as_trimmed_text() .
                        "</a></li>\n";
                }
                elsif (($A = $H2->look_down('_tag', 'a', 'name', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"$ARGV[0]/#" .
                        $A->attr('name') . "\">" .
                        $H2->as_trimmed_text() . "</a></li>\n";
                }
                else
                {
                    print STDOUT " <li>" . $H2->as_trimmed_text() .
                        "</li>\n";
                }
            }
            else
            {
                print STDOUT " <li>" . $H2->as_trimmed_text() .
                    "</li>\n";
            }

            $h2Count++;
        }
    }
}

if ($h2Count)
{
    print STDOUT "</ul>\n";
}

$tree->delete();

exit(0);
 
Tad J McClellan

Dean Karres said:
Hi I posted something similar to this over in comp.lang.perl.modules


I saw it there, but thought that you had made it too hard to help,
so I moved on.

Have you seen the Posting Guidelines that are posted here frequently?

They contain many tips that make it more likely that someone will
take the time to figure out what is going on, like make a short
and complete program that we can run, use __DATA__ to supply file
contents, use warnings, use strict, etc...

It is often a Good Idea to try and make the smallest program
possible that will still produce the problem you need help with.

I'll do that much for you at least:

---------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

my $body = eval { $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;

die "missing a BODY tag\n" unless $body;

my @bodyElementList = eval { $body->content_list(); };
die __LINE__ . ": " . $@ if $@;


foreach my $i ( 0 .. $#bodyElementList )
{
    warn "i=$i\n";
    my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
    die __LINE__ . ": " . $@ if $@;

}

__DATA__
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://
www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" />
<title>thing</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>

<body>
<h1>A Title</h1>

<a name="foo"></a><h2>Nav topic 1</h2>

<h2><a name="foo2"></a>Nav topic 2</h2>

<h2><a name="foo3">Nav topic 3</a</h2>

<h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
a></a></h2>

<h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
a></h2>

<a name="foo"><h2>Nav topic 6</h2></a>

<a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
</a> </a>

</body>
</html>
---------------------------------
 
Dean Karres

Oops. I apologize. I am rarely in these groups plus I have poor
vision so except for doing a general search for my issue before
posting I did not see the guidelines. Thank you for reformatting the
question.

All the best,
Dean...K...
 
smallpond

Dean Karres said:
My problem comes from the sample html file:
[...]

Have you run this through a validator? Maybe the module has a problem
with malformed HTML.
 
Dean Karres

Have you run this through validator?

A great idea! The HTML I passed in was very poorly formed... that was
not part of my test case; I just threw in some nearly random HTML with
H# tags mixed with A tags.

After validating and cleaning the HTML, TreeBuilder did exactly what I
hoped it would.

Which brings up a different issue. The TreeBuilder docs say there is
a flag "$root->warn(value)" that "determines whether syntax errors
during parsing should generate warnings, emitted via Perl's 'warn'
function." I set this to true and ran the script on the obviously bad
HTML and saw no warnings.

However, I hope I can trust my users to validate their html.

cheers,
 
Peter J. Holzer

Dean Karres said:
Hi I posted something similar to this over in comp.lang.perl.modules
[...]
It is often a Good Idea to try and make the smallest program
possible that will still produce the problem you need help with.

I'll do that much for you at least:

---------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

my $body = eval { $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;

die "missing a BODY tag\n" unless $body;

my @bodyElementList = eval { $body->content_list(); };
die __LINE__ . ": " . $@ if $@;

| $h->content_list()
| Returns a list of the child nodes of this element -- i.e., what
| nodes (elements or text segments) are inside/under this element.
| (Note that this may be an empty list.)

Note: "elements or text segments".

foreach my $i ( 0 .. $#bodyElementList )
{
    warn "i=$i\n";
    my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
    die __LINE__ . ": " . $@ if $@;

Here the code assumes that all the members of @bodyElementList are
objects which have a method look_down(). But this is only true of
elements - text segments are simple strings, not objects.

__DATA__ [...]
<a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
                             ^
Here is the text segment - the blank between
'<a href="http://www.usc.edu">' and '<a name="foo">'.
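
One possible guard, sketched under assumptions (the sample markup and the @h2_texts name are made up for illustration): check ref() on each member of the list before calling a method on it, since only element objects are references and text segments are plain strings.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample: text sits directly under <body>, so content_list()
# returns a mix of element objects and plain strings.
my $html = '<html><body><h1>A Title</h1> stray text '
         . '<h2>Nav topic 1</h2></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $body = $tree->look_down('_tag', 'body')
    or die "missing a BODY tag\n";

my @h2_texts;
foreach my $node ($body->content_list()) {
    # Text segments are plain strings; skip anything that is not a reference.
    next unless ref $node;

    my $h2 = $node->look_down('_tag', 'h2');
    push @h2_texts, $h2->as_trimmed_text() if $h2;
}

print "$_\n" for @h2_texts;

$tree->delete();
```

With the guard in place the loop skips the string instead of dying on it, and the eval wrappers are no longer needed for this particular failure.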


Tip of the day: Use the perl debugger and/or Data::Dumper.
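
As a rough sketch of that tip (the sample markup and the @kinds name are assumptions made for illustration), Data::Dumper with a limited depth shows at a glance which members of the list are objects and which are plain strings:

```perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

# Show only the top level of each node; a full dump of an element is long.
$Data::Dumper::Maxdepth = 1;

# Hypothetical sample with text directly under <body>.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><h1>A Title</h1> stray text <h2>Nav topic 1</h2></body></html>'
);
my $body = $tree->look_down('_tag', 'body')
    or die "missing a BODY tag\n";

my @kinds;
foreach my $node ($body->content_list()) {
    # ref() is the class name for an element and empty for a plain string.
    push @kinds, ref($node) || 'plain string';
    print Dumper($node);
}

print join(', ', @kinds), "\n";

$tree->delete();
```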

hp
 
Larry Gates

Dean Karres said:
[...]
However, I hope I can trust my users to validate their html.

I would advise you to get the html with "use LWP::Simple;", which has
yet to fail me, now that I've used it a grand total of twice.

My current output is:

C:\MinGW\source> perl tree2.pl
i=0
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
21: Can't call method "look_down" without a package or object reference at tree2.pl line 20.

C:\MinGW\source>type tree2.pl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

my $body = eval { $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;

die "missing a BODY tag\n" unless $body;

my @bodyElementList = eval { $body->content_list(); };
die __LINE__ . ": " . $@ if $@;


foreach my $i ( 0 .. $#bodyElementList )
{
warn "i=$i\n";
my $H2 = eval { $bodyElementList[$i]->look_down('_tag', 'h2'); };
die __LINE__ . ": " . $@ if $@;

}

__DATA__
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://
www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8" />
<title>thing</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>

<body>
<h1>A Title</h1>

<a name="foo"></a><h2>Nav topic 1</h2>

<h2><a name="foo2"></a>Nav topic 2</h2>

<h2><a name="foo3">Nav topic 3</a</h2>

<h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</
a></a></h2>

<h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></
a></h2>

<a name="foo"><h2>Nav topic 6</h2></a>

<a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
</a> </a>

</body>
</html>

# perl tree2.pl
C:\MinGW\source>
 
