Patrick Joly
I am having a hard time parsing HTML with HTML::TokeParser, specifically
fetching the text up to the next '/p' end tag in a string with the
get_text() method. In my example, get_text('/p') fetches the text up to the
first </p>, but the HTML stream seems to get emptied as a by-product. That is,
I can no longer get any more text after the first get_text() call in a loop.
Any ideas as to what I might be doing wrong in Section A below? I wasted all
afternoon on this and I am spent. TIA!
use strict;
use warnings;
use HTML::TokeParser;
my ($t, $p, $tok, $html);
$html = q!
<p><i>¼ cup olive oil</i></p>
<b>¼ cup canned tuna</b>
<p><i>and chopped liver</i></p>!;
#
# Section - A
# there should be more than 1 iteration, but there isn't
$p = new HTML::TokeParser( \$html );
$p->unbroken_text(1);
my $i = 0;
while (my $txt = $p->get_text( '/p' ) ) {
    print $txt;
    print "\n(iteration" . ++$i . ")\n\n";
}
print "see, no text left after first iteration\n\n";
#
# Section - B
# the following shows there indeed are 2 separate
# '/p' tags and the Text tokens print just fine
$p = new HTML::TokeParser( \$html );
$p->unbroken_text(1);
$i = 0;
while (my $tok = $p->get_token ) {
    print 'Token is: ' . $tok->[0] . " -> " . $tok->[1];
    print "\n(iteration" . ++$i . ")\n\n";
}
__END__
Here is the output I get:
-------------------------
+ cup olive oil
(iteration1)
see, no text left after first iteration
Token is: T ->
(iteration1)
Token is: S -> p
(iteration2)
Token is: S -> i
(iteration3)
Token is: T -> ¼ cup olive oil
(iteration4)
Token is: E -> i
(iteration5)
Token is: E -> p
(iteration6)
Token is: T ->
(iteration7)
Token is: S -> b
(iteration8)
Token is: T -> ¼ cup canned tuna
(iteration9)
Token is: E -> b
(iteration10)
Token is: T ->
(iteration11)
Token is: S -> p
(iteration12)
Token is: S -> i
(iteration13)
Token is: T -> and chopped liver
(iteration14)
Token is: E -> i
(iteration15)
Token is: E -> p
(iteration16)
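
A possible explanation, going by the HTML::TokeParser documentation: get_text('/p') returns the text that occurs before the next </p>, but it leaves the </p> token itself in the stream. The second call in Section A would then see </p> immediately, return an empty string, and end the while loop, so the stream is not actually emptied. The sketch below assumes that reading and consumes each end tag with get_tag('/p') before asking for more text. The helper name text_chunks_before_end_p is made up for illustration and is not part of the module, and the fraction is written as 1/4 to keep the example ASCII-only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of HTML::TokeParser): returns one chunk of
# text for every </p> end tag found in the given HTML string.
sub text_chunks_before_end_p {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    $p->unbroken_text(1);

    my @chunks;
    while (1) {
        # Text up to, but not including, the next </p>.
        my $txt = $p->get_text('/p');
        # Consume the </p> itself; get_tag returns undef at end of input.
        my $end = $p->get_tag('/p');
        last unless defined $end;
        push @chunks, $txt;
    }
    return @chunks;
}

my $html = q!
<p><i>1/4 cup olive oil</i></p>
<b>1/4 cup canned tuna</b>
<p><i>and chopped liver</i></p>!;

my $i = 0;
for my $txt (text_chunks_before_end_p($html)) {
    print $txt;
    print "\n(iteration" . ++$i . ")\n\n";
}
```

Note that the loop no longer tests the returned text for truth, so a chunk that is empty or "0" does not stop it early. Under the same assumption, the second chunk also contains the text of the <b> element, since everything before the second </p> is collected.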