Parsing HTML with HTML::TableExtract

Discussion in 'Perl Misc' started by Ninja Li, Nov 27, 2009.

  1. Ninja Li

    Ninja Li Guest

    Hi,

    I am trying to create a comma-delimited file by parsing HTML from the
    website "http://www.earnings.com/conferencecall.asp?client=cb"
    using the HTML::TableExtract module (thanks to Tad McClellan for the
    introduction). However, I got the following warnings when running
    the script at the end of this post:
    ----------------------
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    Use of uninitialized value in join or string at conference.pl line 25.
    HOGGF.PK
     ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
    Earnings Conference Call,,,4:00 AM
    ................
    ----------------------

    Also notice the large gap between the first value, "HOGGF.PK", and
    the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
    the first field in the original HTML. From what I can see so far, it
    seems the empty values in the fields are not handled correctly. The
    source code is at the end of the post.

    Please advise the root cause and the fix.

    Thanks in advance.

    Nick

    ----------------------------------------------
    Source code:

    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;

    my $html = get 'http://www.earnings.com/conferencecall.asp?client=cb';

    my @headers =
    (
        'SYMBOL',
        'COMPANY',
        'EVENT TITLE',
        'WEBCAST',
        'TRANSCRIPT',
        'TIME'
    );

    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($html);

    foreach my $ts ( $te->tables )
    {
        foreach my $row ( $ts->rows )
        {
            my $csv = join ',', @$row;
            print "$csv\n";
        }
    }
    Ninja Li, Nov 27, 2009
    #1

  2. Ninja Li

    Guest

    On Fri, 27 Nov 2009 14:57:07 -0800 (PST), Ninja Li <> wrote:

    >Hi,
    >
    > I am trying to create a comma-delimited file by parsing HTML from the
    >website "http://www.earnings.com/conferencecall.asp?client=cb"
    >using the HTML::TableExtract module (thanks to Tad McClellan for the
    >introduction). However, I got the following warnings when running
    >the script at the end of this post:
    >----------------------
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >Use of uninitialized value in join or string at conference.pl line 25.
    >HOGGF.PK
    >  ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
    >Earnings Conference Call,,,4:00 AM
    >...............
    >----------------------
    >
    > Also notice the large gap between the first value, "HOGGF.PK", and
    >the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
    >the first field in the original HTML. From what I can see so far, it
    >seems the empty values in the fields are not handled correctly. The
    >source code is at the end of the post.
    >
    > Please advise the root cause and the fix.
    >
    > Thanks in advance.
    >
    > Nick
    >

    What have you done to find out what caused this ridiculous
    number of warnings? Nothing, judging from your code.
    Something is off, WAY off! Something is wrong with your content or
    headers. You have to learn the module, which means actually reading
    the docs for it. Then plan ahead. Look at the source of the HTML.

    This is not rocket science.

    -sln
    , Nov 27, 2009
    #2

  3. On Fri, 27 Nov 2009 14:57:07 -0800 (PST),
    Ninja Li <> wrote:
    > Hi,
    >
    > I am trying to create a comma-delimited file by parsing HTML from the
    > website "http://www.earnings.com/conferencecall.asp?client=cb"
    > using the HTML::TableExtract module (thanks to Tad McClellan for the
    > introduction). However, I got the following warnings when running
    > the script at the end of this post:
    > ----------------------
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > Use of uninitialized value in join or string at conference.pl line 25.
    > HOGGF.PK
    >  ,HOGG ROBINSON GROUP PLC,Half- Year HOGG ROBINSON GROUP PLC
    > Earnings Conference Call,,,4:00 AM
    > ...............


    That is not the only output. I get more.

    > Also notice the large gap between the first value, "HOGGF.PK", and
    > the second, "HOGG ROBINSON GROUP PLC". There are only a few spaces after
    > the first field in the original HTML. From what I can see so far, it


    Check the 'original' HTML again. What's currently at that URL has the
    spaces that you see. I guess they must have changed it since you last
    looked at it.

    > seems the empty values in the fields are not handled correctly. The
    > source code is at the end of the post.


    Define 'correctly'. Or rather, find out what HTML::TableExtract defines
    as correctly, and adjust your expectations to that. Cells without text
    content seem to be returned as undefined values. It's your job to deal
    with that in whichever way you think it should be dealt with.
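
One way to deal with it, as a minimal sketch rather than anything the module prescribes: map undefined cells to empty strings and collapse runs of whitespace before joining. The helper name clean_row and the sample row are hypothetical; in the script from the question, the rows would come from $ts->rows instead.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one extracted row (an array ref that may
# contain undef cells) into a single comma-separated line.
sub clean_row {
    my ($row) = @_;
    my @cells = map {
        my $cell = defined $_ ? $_ : '';   # undef cell -> empty string
        $cell =~ s/\s+/ /g;                # collapse newlines, tabs, space runs
        $cell =~ s/^ | $//g;               # trim one leading/trailing space
        $cell;
    } @$row;
    return join ',', @cells;
}

# Sample row modelled on the output in the question (values are illustrative).
my @row = ( "HOGGF.PK\n   ", 'HOGG ROBINSON GROUP PLC', undef, undef, '4:00 AM' );
print clean_row( \@row ), "\n";
```

This only silences the warnings and tidies the whitespace. Fields that themselves contain commas or quotes would still need real CSV quoting, which a module such as Text::CSV is probably better suited to than a bare join.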

    > Please advise the root cause and the fix.


    If you want, I can send you a contract and rate card.

    Martien
    --
    |
    Martien Verbruggen |
    | Can't say that it is, 'cause it ain't.
    |
    Martien Verbruggen, Nov 28, 2009
    #3