Dean Karres
Hi, I posted something similar to this over in comp.lang.perl.modules,
but 1) that group does not look very active and 2) TreeBuilder does
not look actively maintained: CPAN has open bug reports on
TreeBuilder that are over a year old.
So what I want to do is harvest some info from a lot of web pages on
my site. I want to grab the first H1 tag in the file and then, if
there are H2 tags that follow that H1, I want those "sub"-H2s. Any
further H1s, and any H2s that precede the first H1 or follow a
later H1, should be ignored.
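To pin that rule down independently of TreeBuilder, here is a minimal sketch that applies it to a flat, document-order list of headings. The select_headings name and the [tag, text] input format are made up for illustration; they are not part of my script.

```perl
use strict;
use warnings;

# Hypothetical helper: takes headings in document order as [tag, text]
# pairs and returns the first H1 text followed by the H2 texts that
# come after it, stopping at the next H1.
sub select_headings {
    my @headings = @_;
    my ($title, @subs);
    for my $h (@headings) {
        my ($tag, $text) = @$h;
        if ($tag eq 'h1') {
            last if defined $title;    # a second H1 ends the section
            $title = $text;
        }
        elsif ($tag eq 'h2' && defined $title) {
            push @subs, $text;         # H2s before the first H1 are skipped
        }
    }
    return ($title, @subs);
}

my ($title, @subs) = select_headings(
    [ h2 => 'Stray' ],
    [ h1 => 'A Title' ],
    [ h2 => 'Nav topic 1' ],
    [ h2 => 'Nav topic 2' ],
    [ h1 => 'Another Title' ],
    [ h2 => 'Ignored' ],
);
print "$title: @subs\n";    # prints "A Title: Nav topic 1 Nav topic 2"
```

The real script has to get the same result out of the parsed tree, which is where the trouble starts.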
There are some outstanding issues with the existence and location of
"A" tags either inside or surrounding the H2 tags but I don't think
that has anything to do with my problem.
My problem shows up with this sample HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>thing</title>
<link href="index.css" rel="stylesheet" type="text/css" />
</head>
<body>
<h1>A Title</h1>
<a name="foo"></a><h2>Nav topic 1</h2>
<h2><a name="foo2"></a>Nav topic 2</h2>
<h2><a name="foo3">Nav topic 3</a</h2>
<h2><a name="foo4"><a href="http://www.harvard.edu">Nav topic 4</a></a></h2>
<h2><a href="http://www.mit.edu"><a name="foo5">Nav topic 5</a></a></h2>
<a name="foo"><h2>Nav topic 6</h2></a>
<a href="http://www.usc.edu"> <a name="foo"> <h2>Nav topic 7</h2>
</a> </a>
</body>
</html>
This was just a trivial test case to see if I was using TreeBuilder
correctly. With 0-6 H2s the script works as I expect it to. With 7
or more H2s the script fails on the 7th H2 with the message:
Can't call method "look_down" without a package or object reference at /my/script.cgi line 88
I have marked line 88 below in a comment.
A thing I just noticed is, if I insert a P tag and paragraph between
the 6th and 7th H2 then the whole thing works.
This sure feels like a bug.
Anyway, the code segment is below. The evals were put in during a
frustrating few minutes yesterday:
eval { $body = $tree->look_down('_tag', 'body'); };
die __LINE__ . ": " . $@ if $@;
die "$ARGV[0] is missing a BODY tag\n" if (! $body);
eval { @bodyElementList = $body->content_list(); };
die __LINE__ . ": " . $@ if $@;
for (my $i = 0; $i <= $#bodyElementList; $i++)
{
    if (! $firstH1)
    {
        eval { $H1 = $bodyElementList[$i]->look_down('_tag', 'h1'); };
        die __LINE__ . ": " . $@ if $@;
        if ($H1)
        {
            print STDOUT "<li><a href=\"\">" . $H1->as_trimmed_text() .
                "</a></li>\n";
            $firstH1++;
        }
    }
    else
    {
        #
        # LINE 88 below
        #
        eval { $H2 = $bodyElementList[$i]->look_down('_tag', 'h2'); };
        die __LINE__ . ": " . $@ if $@;
        if ($H2)
        {
            if ($h2Count == 0)
            {
                print STDOUT "<ul>\n";
            }
            if ($H2->is_inside('a'))
            {
                if (($A = $H2->look_up('_tag', 'a', 'href', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"" . $A->attr('href') .
                        "\">" . $H2->as_trimmed_text() .
                        "</a></li>\n";
                }
                elsif (($A = $H2->look_up('_tag', 'a', 'name', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"$ARGV[0]/#" .
                        $A->attr('name') . "\">" .
                        $H2->as_trimmed_text() . "</a></li>\n";
                }
                else
                {
                    print STDOUT " <li>" . $H2->as_trimmed_text() .
                        "</li>\n";
                }
            }
            elsif ($H2->look_down('_tag', 'a'))
            {
                if (($A = $H2->look_down('_tag', 'a', 'href', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"" . $A->attr('href') .
                        "\">" . $H2->as_trimmed_text() .
                        "</a></li>\n";
                }
                elsif (($A = $H2->look_down('_tag', 'a', 'name', qr/.+/)))
                {
                    print STDOUT " <li><a href=\"$ARGV[0]/#" .
                        $A->attr('name') . "\">" .
                        $H2->as_trimmed_text() . "</a></li>\n";
                }
                else
                {
                    print STDOUT " <li>" . $H2->as_trimmed_text() .
                        "</li>\n";
                }
            }
            else
            {
                print STDOUT " <li>" . $H2->as_trimmed_text() .
                    "</li>\n";
            }
            $h2Count++;
        }
    }
}
if ($h2Count)
{
    print STDOUT "</ul>\n";
}
$tree->delete();
exit(0);