HTML::TokeParser

DVH

Hi,

I'm trying to get HTML::TokeParser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me the error "Can't coerce array into hash
at eurofeed.pl line 31".

Line 31 is:

if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {

The HTML looks like this:

=======================================
<td colspan="2">&nbsp;</td>
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
</td>
</tr>
---------------------------------------------

My script looks like this:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

my $content =
    get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" )
    or die $!;
my $stream = HTML::TokeParser->new( \$content ) or die $!;
my ($tag, $headline, $url);

while ( $tag = $stream->get_tag("a") ) {
    if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {
        $url = $tag->[2]{href} || "--";
        $headline = $stream->get_trimmed_text('/a');
        print $url;
        print $headline;
    }
}
 
Stephen Hildrey

DVH said:
I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')

You probably want ->[1] rather than ->[2]

Regards,
Steve
 
it_says_BALLS_on_your forehead

After searching CPAN for HTML::TokeParser and looking at the
$p->get_tag( @tags ) method, the documentation says:

The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:

[$tag, $attr, $attrseq, $text]

The tagname of end tags are prefixed with "/", i.e. end tag is returned
like this:

["/$tag", $text]

...so you get an array reference back. Why are you adding {class} into
your code?
 
it_says_BALLS_on_your forehead

it_says_BALLS_on_your forehead said:
...so you get an array reference back. why are you adding {class} into
your code?

Ahh, my mistake...

use HTML::TokeParser;
my $p = HTML::TokeParser->new(shift||"index.html");

while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}

...yeah, you need to look at index 1, not index 2.
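
To see why index 1 is the right one, here is a minimal sketch that runs against an inline copy of the HTML from the original post instead of the live site. Going by the documented return value [$tag, $attr, $attrseq, $text], element 1 is the hash reference of attributes, while element 2 is an array reference of attribute names. Treating element 2 as a hash is what triggers the "coerce array into hash" complaint (newer perls may word that error differently). The helper name links_with_class and the second link in the sample markup are assumptions added purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup modelled on the snippet in the original post.
# The second <a> is an assumed extra link, there to show the class filter.
my $html = <<'HTML';
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
<a href="other.html">Some other link</a>
</td>
HTML

# Hypothetical helper: return [url, text] pairs for <a> tags of one class.
sub links_with_class {
    my ($doc, $class) = @_;
    my $stream = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser\n";
    my @found;
    while ( my $tag = $stream->get_tag('a') ) {
        # get_tag returns [$tag, $attr, $attrseq, $text]:
        #   [0] tag name
        #   [1] hash ref of attributes (the one to index with {class})
        #   [2] array ref of attribute names in source order
        #   [3] original text of the tag
        my $attr = $tag->[1];
        next unless defined $attr->{class} and $attr->{class} eq $class;
        my $url  = defined $attr->{href} ? $attr->{href} : '--';
        my $text = $stream->get_trimmed_text('/a');
        push @found, [ $url, $text ];
    }
    return @found;
}

# Print URL and link text, tab-separated, one link per line.
for my $link ( links_with_class($html, 'docSel-titleLink') ) {
    print "$link->[0]\t$link->[1]\n";
}
```

Checking the class with defined before comparing avoids "uninitialized value" warnings for links that have no class attribute at all.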
 
DVH

Stephen Hildrey said:
DVH said:
I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq
'docSel-titleLink')

You probably want ->[1] rather than ->[2]

I did. I had thought it would be tag[2] because I was looking for the third
tag within those brackets, but obviously not.

Thank you, that now works. I have a couple more questions (ah they always
do...)

Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
there a reasonably simple way of getting rid of that? The site is at
http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
if you need to see it.

Secondly, I'm working towards following those hrefs and then parsing
the text I find there. Would I be better off using WWW::Mechanize to do
this?

Thanks again for your help.
 
DVH

it_says_BALLS_on_your forehead said:
...yeah, you need to look at index 1, not index 2.

Thanks. It works with [1].
 
A. Sinan Unur

Stephen Hildrey said:
DVH said:
I'm trying to get tokeparser to fetch a series of hyperlinks and
print the URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array
into hash at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq
'docSel-titleLink')

You probably want ->[1] rather than ->[2]

I did. I had thought it would be tag[2] because I was looking for the
third tag within those brackets, but obviously not.

Thank you, that now works. I have a couple more questions (ah they
always do...)

Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.

ITYM "the HTML contains".

Is there a reasonably simple way of getting rid of that? The site is
at
http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
if you need to see it.

Secondly, I'm working towards getting following those hrefs and then
parsing the text I find there. Would I be better off using
WWW::Mechanize to do this?

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use HTML::LinkExtractor;
use LWP::Simple;

my $url = q{http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en};
my $html = get $url;

die "Cannot get <$url>\n" unless $html;

my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);

for my $link ( @{ $lx->links } ) {
    # Not every link has a class attribute, so check before comparing
    if (defined $link->{class} and $link->{class} eq 'docSel-formatLink') {
        print Dumper $link;
    }
}

__END__
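
On the whitespace question, one possible approach is to strip all whitespace from each href before using it, and then resolve it against the page URL so it can be fetched. The following is only a sketch under stated assumptions: the helper name clean_href and the sample href (including its "&format=HTML" part) are made up for illustration, and the URI module is assumed to be installed (it normally comes along with LWP).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: remove embedded whitespace from an href and,
# if a base URL is given, resolve the result to an absolute URL.
sub clean_href {
    my ($href, $base) = @_;
    (my $clean = $href) =~ s/\s+//g;    # drop spaces, tabs and newlines
    return defined $base
        ? URI->new_abs($clean, $base)->as_string
        : $clean;
}

# Assumed example of an href that is broken across lines in the page source.
my $raw  = "pressReleasesAction.do?reference=EPSO/05/06\n      &format=HTML";
my $base = 'http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en';

print clean_href($raw, $base), "\n";
```

Removing every whitespace character is safe only because literal whitespace is not valid inside a URL anyway. If the site relied on unencoded spaces, they would need to be escaped rather than dropped.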
 
DVH

Sorry for getting back to you three days late, but thanks to both of you.
 
