HTML::TokeParser

Discussion in 'Perl Misc' started by DVH, Oct 16, 2005.

  1. DVH

    DVH Guest

    Hi,

    I'm trying to get tokeparser to fetch a series of hyperlinks and print the
    URL followed by the link text.

    The following script ("eurofeed.pl") gives me "Can't coerce array into hash
    at eurofeed.pl line 31"

    Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
    {"

    The HTML looks like this:

    =======================================

    <td colspan="2">&nbsp;</td>

    <td align="left" colspan="3">

    <a title="" class="docSel-titleLink"
    href="pressReleasesAction.do?reference=EPSO/05/06">

    My link text here

    </a>

    </td>

    </tr>

    ---------------------------------------------

    My script looks like this:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::TokeParser;
    use XML::RSS;

    my $content = get( "http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=500" ) or die $!;
    my $stream = HTML::TokeParser->new( \$content ) or die $!;
    my ($tag, $headline, $url);

    while ( $tag = $stream->get_tag("a") ) {
        if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink') {
            $url = $tag->[2]{href} || "--";
            $headline = $stream->get_trimmed_text('/a');
            print $url;
            print $headline;
        }
    }
    -----------------------------------------------------------

    I think the problem lies in the ordering of tags, but that's as far as I've
    got with working out what's wrong.
     
    DVH, Oct 16, 2005
    #1

  2. DVH wrote:
    > I'm trying to get tokeparser to fetch a series of hyperlinks and print the
    > URL followed by the link text.
    >
    > The following script ("eurofeed.pl") gives me "Can't coerce array into hash
    > at eurofeed.pl line 31"
    >
    > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')


    You probably want ->[1] rather than ->[2]

    Regards,
    Steve
    --
    Stephen Hildrey
    E-mail: / Tel: +442071931337
    Jabber: / MSN:
     
    Stephen Hildrey, Oct 16, 2005
    #2

  3. DVH wrote:
    > Hi,
    >
    > I'm trying to get tokeparser to fetch a series of hyperlinks and print the
    > URL followed by the link text.
    >
    > The following script ("eurofeed.pl") gives me "Can't coerce array into hash
    > at eurofeed.pl line 31"
    >
    > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
    > {"


    After searching on CPAN for HTML::TokeParser and looking at the
    $p->get_tag( @tags ) method, the documentation says:

    The tag information is returned as an array reference in the same form
    as for $p->get_token above, but the type code (first element) is
    missing. A start tag will be returned like this:

    [$tag, $attr, $attrseq, $text]

    The tagname of end tags are prefixed with "/", i.e. end tag is returned
    like this:

    ["/$tag", $text]

    ...so you get an array reference back. Why are you adding {class} into
    your code?
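
    To make the quoted return shape concrete, here is a minimal sketch that
    reports what kind of value sits in each slot of the array that get_tag()
    hands back. The sample markup is adapted from the original post; the helper
    name tag_slot_types is made up for illustration and is not part of
    HTML::TokeParser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup adapted from the original post.
my $html = <<'HTML';
<td align="left" colspan="3">
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">
My link text here
</a>
</td>
HTML

# Hypothetical helper: return the kind of value in each slot of the
# array reference returned by get_tag() for the first <a> tag.
sub tag_slot_types {
    my ($markup) = @_;
    my $p   = HTML::TokeParser->new( \$markup ) or die "Cannot parse markup";
    my $tag = $p->get_tag('a') or return;

    # ref() is '' for plain strings, 'HASH' or 'ARRAY' for references.
    return map { ref($_) || 'string' } @$tag;
}

print join( ', ', tag_slot_types($html) ), "\n";
```

    This should print "string, HASH, ARRAY, string": slot 1 is the attribute
    hash and slot 2 is the array of attribute names in source order, so
    $tag->[2]{class} treats an array as a hash. Older perls reported that as
    "Can't coerce array into hash"; as far as I know, current perls usually
    say "Not a HASH reference" instead.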
     
    it_says_BALLS_on_your forehead, Oct 16, 2005
    #3
  4. it_says_BALLS_on_your forehead wrote:
    >
    > after searching on CPAN for HTML::TokeParser, and looking at the
    > $p->get_tag( @tags ) method,
    > it looks like:
    >
    > The tag information is returned as an array reference in the same form
    > as for $p->get_token above, but the type code (first element) is
    > missing. A start tag will be returned like this:
    >
    > [$tag, $attr, $attrseq, $text]
    > The tagname of end tags are prefixed with "/", i.e. end tag is returned
    > like this:
    >
    > ["/$tag", $text]
    >
    > ...so you get an array reference back. why are you adding {class} into
    > your code?


    ahh, my mistake...

    use HTML::TokeParser;
    my $p = HTML::TokeParser->new(shift||"index.html");

    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        print "$url\t$text\n";
    }

    ...yeah, you need to look at index 1, not index 2.
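
    Applied to the class filter from the original question, the index fix
    might look like the sketch below. It runs against inline sample markup
    rather than the live site, and the sub name links_with_class is an
    illustrative assumption, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect [href, link text] pairs for every <a>
# start tag whose class attribute equals $class.
sub links_with_class {
    my ( $markup, $class ) = @_;
    my $p = HTML::TokeParser->new( \$markup ) or die "Cannot parse markup";
    my @found;
    while ( my $tag = $p->get_tag('a') ) {
        my $attr = $tag->[1];    # slot 1 holds the attribute hash ref
        next unless defined $attr->{class} && $attr->{class} eq $class;
        my $url  = defined $attr->{href} ? $attr->{href} : '--';
        my $text = $p->get_trimmed_text('/a');
        push @found, [ $url, $text ];
    }
    return @found;
}

# Sample markup adapted from the original post, plus one link to skip.
my $sample = <<'HTML';
<a class="other" href="skip.html">Skipped</a>
<a title="" class="docSel-titleLink"
href="pressReleasesAction.do?reference=EPSO/05/06">

My link text here

</a>
HTML

print "$_->[0]\t$_->[1]\n" for links_with_class( $sample, 'docSel-titleLink' );
```

    Returning the pairs instead of printing inside the loop keeps the parsing
    separate from the output, which may help later when the results go into
    an RSS feed.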
     
    it_says_BALLS_on_your forehead, Oct 16, 2005
    #4
  5. DVH

    DVH Guest

    Stephen Hildrey <> wrote in message
    news:...
    > DVH wrote:
    > > I'm trying to get tokeparser to fetch a series of hyperlinks and print the
    > > URL followed by the link text.
    > >
    > > The following script ("eurofeed.pl") gives me "Can't coerce array into hash
    > > at eurofeed.pl line 31"
    > >
    > > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq 'docSel-titleLink')
    >
    > You probably want ->[1] rather than ->[2]


    I did. I had thought it would be tag[2] because I was looking for the third
    tag within those brackets, but obviously not.

    Thank you, that now works. I have a couple more questions (ah they always
    do...)

    Firstly, the HTML puts a lot of whitespace in the middle of the hrefs. Is
    there a reasonably simple way of getting rid of that? The site is at
    http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
    if you need to see it.

    Secondly, I'm working towards following those hrefs and then parsing
    the text I find there. Would I be better off using WWW::Mechanize to do
    this?

    Thanks again for your help.
     
    DVH, Oct 16, 2005
    #5
  6. DVH

    DVH Guest

    it_says_BALLS_on_your forehead <> wrote in message
    news:...
    >
    > it_says_BALLS_on_your forehead wrote:
    >
    > ahh, my mistake...
    > use HTML::TokeParser;
    > $p = HTML::TokeParser->new(shift||"index.html");
    >
    > while (my $token = $p->get_tag("a")) {
    > my $url = $token->[1]{href} || "-";
    > my $text = $p->get_trimmed_text("/a");
    > print "$url\t$text\n";
    > }
    >
    > ...yeah, you need to look at index 1, not index 2.
    >


    Thanks. It works with [1].
     
    DVH, Oct 16, 2005
    #6
  7. "DVH" <> wrote in
    news:diug96$jfj$-infra.bt.com:

    >
    > Stephen Hildrey <> wrote in message
    > news:...
    >> DVH wrote:
    >> > I'm trying to get tokeparser to fetch a series of hyperlinks and
    >> > print the URL followed by the link text.
    >> >
    >> > The following script ("eurofeed.pl") gives me "Can't coerce array
    >> > into hash at eurofeed.pl line 31"
    >> >
    >> > Line 31 is "if ($tag->[2]{class} and $tag->[2]{class} eq
    >> > 'docSel-titleLink')
    >>
    >> You probably want ->[1] rather than ->[2]

    >
    > I did. I had thought it would be tag[2] because I was looking for the
    > third tag within those brackets, but obviously not.
    >
    > Thank you, that now works. I have a couple more questions (ah they
    > always do...)
    >
    > Firstly, the HTML puts a lot of whitespace in the middle of the hrefs.


    ITYM "the HTML contains".


    > Is there a reasonably simple way of getting rid of that? The site is
    > at
    > http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10
    > if you need to see it.
    >
    > Secondly, I'm working towards getting following those hrefs and then
    > parsing the text I find there. Would I be better off using
    > WWW::Mechanize to do this?


    #!/usr/bin/perl

    use strict;
    use warnings;

    use Data::Dumper;
    use HTML::LinkExtractor;
    use LWP::Simple;

    my $url = q{http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en};
    my $html = get $url;

    die "Cannot get <$url>\n" unless $html;

    my $lx = HTML::LinkExtractor->new;
    $lx->parse(\$html);

    for my $link ( @{ $lx->links } ) {
        # Skip links without a class attribute to avoid "uninitialized" warnings.
        next unless defined $link->{class};
        if ($link->{class} eq 'docSel-formatLink') {
            print Dumper $link;
        }
    }

    __END__
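
    On the whitespace question quoted above, one possible approach is to strip
    the whitespace from each href and then resolve it against the page URL.
    This is only a sketch: it assumes whitespace inside these hrefs is never
    significant, the sub name clean_href is made up for illustration, and the
    raw href below is an invented example of the kind of value described, not
    copied from the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical helper: drop all whitespace from an href (assumes none of
# it is meaningful) and resolve the result against the page's own URL.
sub clean_href {
    my ( $href, $base ) = @_;
    ( my $clean = $href ) =~ s/\s+//g;
    return URI->new_abs( $clean, $base )->as_string;
}

my $base = 'http://europa.eu.int/rapid/recentPressReleasesAction.do?guiLanguage=en&hits=10';

# Invented example of an href with a line break and spaces in the middle.
my $raw = "pressReleasesAction.do?\n    reference=EPSO/05/06";

print clean_href( $raw, $base ), "\n";
```

    The absolute URL that comes back can be passed straight to LWP::Simple's
    get() to fetch the linked page.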

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Oct 16, 2005
    #7
  8. DVH

    DVH Guest


    Sorry for getting back to you three days late, but thanks to both of you.
     
    DVH, Oct 19, 2005
    #8
  9. "DVH" <> wrote in news:dj6a0n$7a8$1
    @nwrdmz01.dmz.ncs.ea.ibs-infra.bt.com:

    > A. Sinan Unur <> wrote in message
    > news:Xns96F1B3F245A6asu1cornelledu@127.0.0.1...

    ....
    > Sorry for getting back to you three days late, but thanks to both
    > of you.


    You are welcome. Hope it helped.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Oct 19, 2005
    #9
