2 problems parsing output from HTML::TableExtract

Discussion in 'Perl Misc' started by Ted Byers, Sep 1, 2009.

  1. Ted Byers

    Ted Byers Guest

    I have to automate parsing email that comes in with its data in an
    HTML file (so I have no control over the content or how it is
    formatted).

    HTML::TableExtract has proved priceless in getting this done.
    However, there are two issues that are giving me grief.

    The first is probably simple, at least for regex experts. There are
    characters in the string that, while not a problem in browsers
    displaying the HTML, are a problem for my attempts to refine this data
    down to text I can work with in my other code. There are a plethora
    of instances of a character that Emacs displays as '\240'. But the
    following statement doesn't remove them.

    $payload_tmp =~ s/\240//g;

    Neither does:
    $payload_tmp =~ s/\\240//g;

    I suspect it is a printer/display control character that results in
    the following text being underlined when displayed using a browser
    like MS IE or Firefox. What I don't know is what value I ought to use
    in my regex to get rid of it.

    I think I know what I can do, to work around this, but I would like to
    know how to construct a regular expression to get rid of it.

    The more important question gets down to how to deal with a warning I
    get on some output produced by HTML::TableExtract.

    In the html I get, there is one table, but without proper table
    headers, and there are two logical tables in this one HTML table
    separated by rows that have no visible values in their cells. Those
    cells without useful data cause problems in the output that manifests
    with a warning message:

    Use of uninitialized value $row in join or string at c:/test_path/
    Email_test_7.pl line 188, <GEN0> line 27252.

    Here is the code block the warning relates to:
    my $te = HTML::TableExtract->new();
    $payload =~ s/\r//g;
    my $payload_tmp = $payload;
    $payload_tmp =~ s/\n//g;
    $payload_tmp =~ s/\240//g;
    $te->parse($payload_tmp);
    my ($ts,$tn);
    $tn = 0;
    foreach $ts ($te->tables) {
        my $row;
        my $rown = 0;
        foreach $row ($ts->rows) {
            next unless defined $row;
            # not sure about this one, but I tried it because the join
            # mentioned in the warning uses @$row
            next unless defined @$row;
            my $fount = @$row;
            next unless defined $fount;
            next if ($fount == 0);
            my $trow = join(',',@$row);
            print "\tRow: $rown\t",$trow,"\n";
            $rown++;
        }
        $tn++;
    }

    Since I know the HTML that is producing this output, I just want to
    skip over and ignore the rows having cells that have no data. Since
    the warning says I have an 'uninitialized value $row in join or
    string', I tried to skip if $row is undefined, and if the row has no
    data, but these tests are not having the desired effect. It is as if
    they weren't there. I don't know why I'd get a message that $row is
    undefined and yet a statement "next unless defined $row;" has no
    effect.

    What did I miss here?
    Ted Byers, Sep 1, 2009
    #1

  2. On 2009-09-01 18:53, Ted Byers <> wrote:
    > I have to automate parsing email that comes in with its data in an
    > HTML file (so I have no control over the content or how it is
    > formatted.
    >
    > HTML::TableExtract has proved priceless in getting this done.
    > However, there are two issues that are giving me grief.
    >
    > The first is probably simple, at least for regex experts. There are
    > characters in the string that, while not a problem in browsers
    > displaying the HTML, are a problem for my attempts to refine this data
    > down to text I can work with in my other code. There are a plethora
    > of instances of a character that Emacs displays as '\240'.


    \240 (\x{A0} in hex) is the non-breaking space.

    > But the following
    > statement doesn't remove them.
    >
    > $payload_tmp =~ s/\240//g;


    This should work, provided the "\240" is there when you do the
    substitution. In HTML, the non-breaking space is often written as
    "&nbsp;". Are you sure that you are looking at the text you are feeding
    to your script and not some processed version?
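
    A minimal sketch of that check, using a made-up sample value and
    assuming the payload is a string of characters (in raw UTF-8 bytes the
    non-breaking space would instead be the two bytes \xC2\xA0). It strips
    both the literal character and the "&nbsp;" entity:

```perl
use strict;
use warnings;

# Hypothetical sample: two literal non-breaking spaces plus one entity.
my $payload_tmp = "Total:\x{A0}\x{A0}42&nbsp;units";

$payload_tmp =~ s/\x{A0}//g;   # same character as \240, written in hex
$payload_tmp =~ s/&nbsp;//g;   # the entity form, if the HTML is still undecoded

print "$payload_tmp\n";        # prints "Total:42units"
```

    If neither substitution changes anything, the text being fed to the
    script probably differs from the text being inspected in the editor.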


    > Neither does:
    > $payload_tmp =~ s/\\240//g;


    This shouldn't.
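
    To see why, compare the two patterns on made-up strings. In a
    pattern, \\240 matches a backslash followed by the digits 2, 4, 0,
    while \240 matches the single character with octal code 240. The
    variable names here are only for illustration:

```perl
use strict;
use warnings;

my $with_char = "a\240b";   # 3 characters: 'a', non-breaking space, 'b'
my $with_text = 'a\240b';   # 6 characters: 'a', '\', '2', '4', '0', 'b'

(my $r1 = $with_char) =~ s/\\240//g;   # no literal backslash here: unchanged
(my $r2 = $with_text) =~ s/\\240//g;   # the literal text is removed
(my $r3 = $with_char) =~ s/\240//g;    # the character itself is removed

printf "%d %d %d\n", length($r1), length($r2), length($r3);   # prints "3 2 2"
```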


    >
    > I suspect it is a printer/display control character that results in
    > the following text being underlined when displayed using a browser
    > like MS IE or Firefox.


    Please read http://www.w3.org/TR/html401/



    [...]

    > Use of uninitialized value $row in join or string at c:/test_path/
    > Email_test_7.pl line 188, <GEN0> line 27252.
    >
    > Here is the code block the warning relates to:

    [...]
    > my $row;
    > my $rown = 0;
    > foreach $row ($ts->rows) {
    > next unless defined $row;
    > next unless defined @$row;# not sure about this one, but I tried it
    > because the join mentioned in the warning uses @$row
    > my $fount = @$row;
    > next unless defined $fount;
    > next if ($fount == 0);
    > my $trow = join(',',@$row);


    I assume this is line 188 because it's the only line with a join in it.
    However I don't see how this line can be reached if $row is undefined.
    Are you sure that this is the code you are running?

    Please post a short, complete script that we can run. If you post a
    short snippet from a longer script it is always possible that the error
    is somewhere else. Also, you will probably find the error while trying
    to make the script as short as possible and won't have to ask at all.

    hp
    Peter J. Holzer, Sep 1, 2009
    #2

  3. Ted Byers

    Guest

    On Tue, 1 Sep 2009 11:53:33 -0700 (PDT), Ted Byers <> wrote:

    <snip>
    >However, there are two issues that are giving me grief.
    >
    >The first is probably simple, at least for regex experts. There are
    >characters in the string that, while not a problem in browsers

    <snip>
    >statement doesn't remove them.
    >
    > $payload_tmp =~ s/\240//g;

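    ^^^^
    \240 in a regex is an octal escape for the character 0xA0; it would be
    a backreference only if the pattern had 240 capture groups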
    >
    >Neither does:
    > $payload_tmp =~ s/\\240//g;

    s/\\240//g;
    works for me when the string holds the literal text \240
    (a backslash followed by 2, 4, 0)

    <snip>
    >The more important question gets down to how to deal with a warning I
    >get on some output produced by HTML::TableExtract.
    >
    >In the html I get, there is one table, but without proper table
    >headers, and there are two logical tables in this one HTML table
    >separated by rows that have no visible values in their cells. Those
    >cells without useful data cause problems in the output that manifests
    >with a warning message:
    >
    >Use of uninitialized value $row in join or string at c:/test_path/
    >Email_test_7.pl line 188, <GEN0> line 27252.
    >
    >Here is the code block the warning relates to:
    > my $te = HTML::TableExtract->new();
    > $payload =~ s/\r//g;
    > my $payload_tmp = $payload;
    > $payload_tmp =~ s/\n//g;
    > $payload_tmp =~ s/\240//g;
    > $te->parse($payload_tmp);
    > my ($ts,$tn);
    > $tn = 0;
    > foreach $ts ($te->tables) {
    > my $row;
    > my $rown = 0;
    > foreach $row ($ts->rows) {
    > next unless defined $row;
    > next unless defined @$row;# not sure about this one, but I tried it
    >because the join mentioned in the warning uses @$row
    > my $fount = @$row;
    > next unless defined $fount;
    > next if ($fount == 0);
    > my $trow = join(',',@$row);
    > print "\tRow: $rown\t",$trow,"\n";
    > $rown++;
    > }
    > $tn++;
    > }

    <snip>
    >What did I miss here?


    See below.
    -sln
    =====================
    use strict;
    use warnings;

    my $string = '
    start\\
    240\\24
    0\\240\\240\\2
    40\\240-end
    ';
    print $string,"\n";
    $string =~ s/\n//g;
    $string =~ s/\\240//g;
    # ^^
    # works for me

    print $string,"\n";


    my $row = [qw{this is a row of data},undef,undef,'end'];
    # ^^^^^ ^^^^^
    # oh no, undefined elements
    # join will give warning
    #
    my $trow = join(',',@$row);
    print "$trow\n";

    # to fix, rip out the undefined elements in a new copy of row.
    # can either strip the undef's:
    my @row_copy = map {defined $_ ? $_ : ()} @$row;
    # or can blank them out:
    # my @row_copy = map {defined $_ ? $_ : ()} @$row;
    #
    $trow = join ',', @row_copy;
    print "$trow\n";

    __END__

    # Lets fix this up, (untested)

    foreach $ts ($te->tables) {
        my $row;
        my $rown = 0;
        foreach $row ($ts->rows) {
            next if !defined($row);
            my @row_copy = map {defined $_ ? $_ : ()} @$row;
            next if !scalar(@row_copy);
            my $trow = join ',', @row_copy;
            print "\tRow: $rown\t",$trow,"\n";
            $rown++;
        }
        $tn++;
    }
    , Sep 1, 2009
    #3
  4. Ted Byers

    Guest

    On Tue, 01 Sep 2009 13:42:17 -0700, wrote:

    ># or can blank them out:
    > # my @row_copy = map {defined $_ ? $_ : ()} @$row;

    ^^
    # my @row_copy = map {defined $_ ? $_ : ''} @$row;

    -sln
    , Sep 1, 2009
    #4
  5. Ted Byers

    Ted Byers Guest

    On Sep 1, 4:46 pm, wrote:
    > On Tue, 01 Sep 2009 13:42:17 -0700, wrote:
    > ># or can blank them out:
    > >    # my @row_copy = map {defined $_ ? $_ : ()} @$row;

    >
    >                                              ^^
    >     # my @row_copy = map {defined $_ ? $_ : ''} @$row;
    >
    > -sln


    Thanks everyone. Problem solved; and I learned a bunch too. ;-)

    Cheers

    Ted
    Ted Byers, Sep 1, 2009
    #5
  6. Ted Byers

    Guest

    On Tue, 01 Sep 2009 13:46:09 -0700, wrote:

    >On Tue, 01 Sep 2009 13:42:17 -0700, wrote:
    >
    >># or can blank them out:
    >> # my @row_copy = map {defined $_ ? $_ : ()} @$row;

    > ^^
    > # my @row_copy = map {defined $_ ? $_ : ''} @$row;
    >
    >-sln


    Should you decide to just define blanks instead of deleting
    the elements, you won't have to create a temporary array,
    just do it in place with this:

    defined $_ or $_ = '' for (@$row);

    So, then the code would look something like this:

    foreach $ts ($te->tables) {
        my $row;
        my $rown = 0;
        foreach $row ($ts->rows) {
            next if (!defined($row) or !@$row);
            defined $_ or $_ = '' for (@$row); # just blank undef's
            my $trow = join ',', @$row;
            print "\tRow: $rown\t",$trow,"\n";
            $rown++;
        }
        $tn++;
    }

    -sln
    , Sep 1, 2009
    #6
  7. Ted Byers

    Uri Guttman Guest

    >>>>> "s" == sln <> writes:

    s> Should you decide to just define blanks instead of deleting
    s> the elements, you won't have to create a temporary array,
    s> just do it in place with this:

    s> defined $_ or $_ = '' for (@$row);

    s> defined $_ or $_ = '' for (@$row); # just blank undef's
    s> $trow = join ',', @$row;

    you can merge those with a map:

    $trow = join ',', map { defined($_) ? $_ : '' } @$row;

    and if you are using 5.10 with the defined or op // that is even
    simpler:

    $trow = join ',', map { $_ // '' } @$row;

    and with 5.10 the for modifier line could also become:

    $_ //= '' for @{$row} ;

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
    Uri Guttman, Sep 1, 2009
    #7
  8. Ted Byers

    Uri Guttman Guest

    >>>>> "BM" == Ben Morrow <> writes:

    BM> Quoth "Uri Guttman" <>:
    >>
    >> $trow = join ',', map { defined($_) ? $_ : '' } @$row;
    >>
    >> $trow = join ',', map { $_ // '' } @$row;


    BM> Meh. Those are all ugly. I much prefer

    BM> {
    BM> no warnings "uninitialized";
    BM> $trow = join ",", @$row;
    BM> }

    that needs a block, and is longer. and i don't like to use the warnings
    pragma unless absolutely necessary. just my style vs yours.

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
    Uri Guttman, Sep 1, 2009
    #8
  9. On 2009-09-01 20:40, Tad J McClellan <> wrote:
    > Peter J. Holzer <> wrote:
    >> On 2009-09-01 18:53, Ted Byers <> wrote:
    >>> Use of uninitialized value $row in join or string at c:/test_path/
    >>> Email_test_7.pl line 188, <GEN0> line 27252.
    >>>
    >>> Here is the code block the warning relates to:

    >> [...]
    >>> foreach $row ($ts->rows) {
    >>> next unless defined $row;

    [...]
    >>> my $trow = join(',',@$row);

    >>
    >> I assume this is line 188 because it's the only line with a join in it.
    >> However I don't see how this line can be reached if $row is undefined.
    >> Are you sure that this is the code you are running?

    >
    >
    > I was confused too. The error message is misleading, it is not $row
    > that is undefined, it is one of the elements in @$row that is undef.


    Ah, yes. That makes sense.
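
    A minimal sketch of that point, with made-up rows standing in for what
    $ts->rows returns: every $row below is a defined array reference, so
    "next unless defined $row" never fires, and the test has to look at
    the cells instead.

```perl
use strict;
use warnings;

# Hypothetical rows: some cells undef, one row all undef, one row empty.
my @rows = ( ['a', undef, 'c'], [undef, undef], [], ['x', 'y'] );

my @out;
for my $row (@rows) {
    next unless defined $row;               # passes for every row above
    next unless grep { defined $_ } @$row;  # skip rows with no defined cell
    push @out, join ',', map { defined($_) ? $_ : '' } @$row;
}

print "$_\n" for @out;   # prints "a,,c" then "x,y"
```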

    hp
    Peter J. Holzer, Sep 1, 2009
    #9