HTML::TokeParser

DVH · Oct 16, 2005

Hi,

I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {"

The HTML looks like this:

=======================================
<td colspan="2"> </td>
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
</td>
</tr>
---------------------------------------------

My script looks like this:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
use XML::RSS;

my $content =
    get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" )
    or die $!;

my $stream = HTML::TokeParser->new( \$content ) or die $!;

my ($tag, $headline, $url);

while ( $tag = $stream->get_tag("a") ) {
    # This is the line the error message points at
    if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {
        $url = $tag->[2]{href} || "--";
        $headline = $stream->get_trimmed_text('/a');
        print $url;
        print $headline;
    }
}

Stephen Hildrey · Oct 16, 2005

DVH said:
I'm trying to get tokeparser to fetch a series of hyperlinks and print the
URL followed by the link text.

The following script ("eurofeed.pl") gives me "Can't coerce array into hash
at eurofeed.pl line 31"

Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')

You probably want ->[1] rather than ->[2]

Regards,
Steve
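
To see why index 1 is the one to use, here is a minimal sketch that parses the anchor from the original post out of an in-memory string. The variable names ($name, $attr, $attrseq, $text) are only illustrative; they follow the element order that the HTML::TokeParser documentation gives for a start tag returned by get_tag. Index 1 holds a hash reference of attributes, while index 2 holds an array reference listing the attribute names in source order, so writing ->[2]{class} treats an array as a hash. On the Perl versions of the time that produces "Can't coerce array into hash"; newer Perls should report it as "Not a HASH reference".

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The anchor from the original post, on one line for brevity
my $html = '<a title="" class="docSel-titleLink" '
         . 'href="pressReleasesAction.do?reference=EPSO/05/06">My link text here</a>';

# Passing a scalar reference parses the string itself rather than a file
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
my $tag = $p->get_tag("a");

# A start tag comes back as [$tag, $attr, $attrseq, $text]
my ($name, $attr, $attrseq, $text) = @$tag;

print "index 0: $name\n";                 # the tag name
print "index 1: ", ref($attr), "\n";      # HASH: attribute name => value
print "index 2: ", ref($attrseq), "\n";   # ARRAY: attribute names in source order
print "class:   $attr->{class}\n";        # same as $tag->[1]{class}
print "order:   @$attrseq\n";
```

So both the class test and the href lookup in the original script belong on $tag->[1].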

it_says_BALLS_on_your forehead · Oct 16, 2005

after searching on CPAN for HTML::TokeParser, and looking at the
$p->get_tag( @tags ) method,
it looks like:

The tag information is returned as an array reference in the same form
as for $p->get_token above, but the type code (first element) is
missing. A start tag will be returned like this:

[$tag, $attr, $attrseq, $text]
The tagname of end tags are prefixed with "/", i.e. end tag is returned
like this:

["/$tag", $text]

...so you get an array reference back. Why are you adding {class} into
your code?

it_says_BALLS_on_your forehead · Oct 16, 2005

ahh, my mistake...

use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new(shift || "index.html")
    or die "Can't open: $!";

while (my $token = $p->get_tag("a")) {
    # $token->[1] is the attribute hash; $token->[2] is only the attribute order
    my $url  = $token->[1]{href} || "-";
    my $text = $p->get_trimmed_text("/a");
    print "$url\t$text\n";
}

...yeah, you need to look at index 1, not index 2.

DVH · Oct 16, 2005

Stephen Hildrey said:
You probably want ->[1] rather than ->[2]

I did. I had thought it would be tag[2] because I was looking for the third
tag within those brackets, but obviously not.

Thank you, that now works. I have a couple more questions (ah they always
do...)

Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
there a reasonably simple way of getting rid of that? The site is at
http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
if you need to see it.

Secondly, I'm working towards following those hrefs and then parsing
the text I find there. Would I be better off using WWW::Mechanize to do
this?

Thanks again for your help.
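
On the whitespace question, one possible approach is to strip it from the href after pulling it out of the attribute hash. The sketch below is not from the thread. It assumes the whitespace inside the href is never meaningful, and the sample HTML is a hypothetical stand-in for the real page. It also applies the ->[1] fix to the loop from the original script.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample: an href with a line break and a tab inside its value
my $html = qq{<a class="docSel-titleLink" href="pressReleasesAction.do?\n\t reference=EPSO/05/06">\n My link text here\n</a>};

my $stream = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my @links;
while (my $tag = $stream->get_tag("a")) {
    my $attr = $tag->[1];    # attribute hash (index 1, not 2)
    next unless defined $attr->{class}
            and $attr->{class} eq 'docSel-titleLink';

    my $url = defined $attr->{href} ? $attr->{href} : "--";
    $url =~ s/\s+//g;        # remove every run of whitespace from the URL

    # get_trimmed_text already trims and collapses whitespace in the link text
    my $headline = $stream->get_trimmed_text("/a");

    push @links, [$url, $headline];
    print "$url\t$headline\n";
}
```

The hrefs on that page are relative, so they would still need to be resolved against the page URL before they can be fetched.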

DVH · Oct 16, 2005

Thanks. It works with [1].

A. Sinan Unur · Oct 16, 2005

DVH said:
Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.

ITYM "the HTML contains".

DVH said:
Is there a reasonably simple way of getting rid of that?

Secondly, I'm working towards following those hrefs and then
parsing the text I find there. Would I be better off using
WWW::Mechanize to do this?

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use HTML::LinkExtractor;
use LWP::Simple;

my $url = q{http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en};
my $html = get $url;

die "Cannot get <$url>\n" unless $html;

my $lx = HTML::LinkExtractor->new;
$lx->parse(\$html);

for my $link ( @{ $lx->links } ) {
    # Not every link carries a class attribute, so check before comparing
    if (defined $link->{class} and $link->{class} eq 'docSel-formatLink') {
        print Dumper $link;
    }
}

__END__

DVH · Oct 19, 2005

Sorry for getting back to you three days late, but thanks to both of you.

A. Sinan Unur · Oct 19, 2005

DVH said:
Sorry for getting back to you three days late, but thanks to both
of you.

You are welcome. Hope it helped.

Sinan
