HTML::TableExtract punctuation parsing

Discussion in 'Perl Misc' started by Maqo, May 22, 2005.

  1. Maqo

    Maqo Guest

    Is there any way to prevent HTML::TableExtract from mangling punctuation
    in parsed text? For example, the below code is parsing “don’t come”
    in the target URL as “don’t comeâ€. Is it something about the
    document encoding, or a limitation of the module?

    Many thanks!

    ------------------------------------------------------------

    use LWP::Simple;
    use HTML::TableExtract;

    $URL =
    "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";

    $content = get($URL);
    my $te = new HTML::TableExtract( depth=>1, count=>4, gridmap=>0,
    keep_html=>1);

    $te->parse($content);
    foreach $ts ($te->table_states)
    {
        foreach $row ($ts->rows)
        {
            print $$row[0];
        }
    }
     
    Maqo, May 22, 2005
    #1

  2. Bob Walton

    Bob Walton Guest

    Maqo wrote:

    > Is there any way to prevent HTML::TableExtract from mangling punctuation
    > in parsed text? For example, the below code is parsing “don’t come” in
    > the target URL as “don’t comeâ€. Is it something about the
    > document encoding, or a limitation of the module?

    ....

    > $URL =
    > "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";


    My browser says that web page is Unicode with UTF-8 encoding. If
    you process it as Unicode with UTF-8 encoding, you'll probably be
    fine. Otherwise, as you noted, you'll get gibberish. If you
    view the results of your print() with a Unicode with UTF-8
    viewer, you should be OK, as you are doing nothing that should
    alter the non-ASCII characters.

    Your web browser is probably a good candidate for such a viewer,
    provided you set it to the right character encoding.
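A minimal sketch of what "processing it as UTF-8" might look like in Perl, assuming the core Encode module. The hard-coded byte string is a stand-in for the fetched page content and is not taken from the actual page; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for the phrase with curly quotes and apostrophe,
# standing in for what get($URL) might return (an assumption).
my $bytes = "\xE2\x80\x9Cdon\xE2\x80\x99t come\xE2\x80\x9D";

# Decode the bytes into a Perl character string.
my $text = decode('UTF-8', $bytes);

# Tell Perl to encode characters as UTF-8 when printing.
binmode(STDOUT, ':encoding(UTF-8)');

# The byte string is longer than the decoded character string,
# because each curly quote occupies three bytes in UTF-8.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);
print "$text\n";
```

Viewed in a terminal or browser set to UTF-8, the second line of output should show the curly quotes intact rather than sequences beginning with "â€".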

    ....
    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
     
    Bob Walton, May 23, 2005
    #2

  3. Maqo

    Maqo Guest

    Bob Walton wrote:

    > My browser says that web page is Unicode with UTF-8 encoding. If you
    > process it as Unicode with UTF-8 encoding, you'll probably be fine.
    > Otherwise, as you noted, you'll get gibberish. If you view the results
    > of your print() with a Unicode with UTF-8 viewer, you should be OK, as
    > you are doing nothing that should alter the non-ASCII characters.


    Thanks Bob, that's what I had suspected as well, which is why I can't
    for the life of me understand why this is still giving me gibberish (I
    must be missing something with respect to proper decoding of UTF-8):

    use LWP::UserAgent;
    use HTML::Encoding 'encoding_from_http_message';
    use Encode;

    my $URL =
    "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";

    my $content = LWP::UserAgent->new->get($URL, 'Accept-Charset'=>'UTF-8');
    my $enco = encoding_from_http_message($content);
    my $utf8 = decode($enco => $content->content());
    open (OUT, ">:encoding(utf8)", "out.html");
    print OUT $utf8;
    close (OUT);
     
    Maqo, May 24, 2005
    #3
  4. Bob Walton

    Bob Walton Guest

    Maqo wrote:

    > Bob Walton wrote:
    >
    >> My browser says that web page is Unicode with UTF-8 encoding. If you
    >> process it as Unicode with UTF-8 encoding, you'll probably be fine.
    >> Otherwise, as you noted, you'll get gibberish. If you view the
    >> results of your print() with a Unicode with UTF-8 viewer, you should
    >> be OK, as you are doing nothing that should alter the non-ASCII
    >> characters.

    >
    >
    > Thanks Bob, that's what I had suspected as well, which is why I can't
    > for the life of me understand why this is still giving me gibberish (I
    > must be missing something with respect to proper decoding of UTF-8):
    >
    > use LWP::UserAgent;
    > use HTML::Encoding 'encoding_from_http_message';
    > use Encode;
    >
    > my $URL =
    > "http://www.pimco.com/LeftNav/Late+Breaking+Commentary/IO/2005/IO+May-June+2005.htm";
    >
    >
    > my $content = LWP::UserAgent->new->get($URL, 'Accept-Charset'=>'UTF-8');
    > my $enco = encoding_from_http_message($content);
    > my $utf8 = decode($enco => $content->content());
    > open (OUT, ">:encoding(utf8)", "out.html");
    > print OUT $utf8;
    > close (OUT);


    Well, I'm certainly no expert at all these encodings, but I note
    that when running your program above verbatim, one still ends up
    with "out.html" containing UTF-8 encoded Unicode. In fact,
    out.html is character-for-character identical with the file
    generated from:

    use LWP::Simple;
    open OUT,">out1.html" or die "Oops, $!";
    print OUT get('http://www.p...');
    #[trailing portion of long URL elided]
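The round trip described above can be sketched without any network access. This assumes the core Encode module and uses a hard-coded sample byte string in place of the real page content; encode() here stands in for what an ":encoding(utf8)" output layer does on write.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample UTF-8 bytes standing in for the downloaded page (an assumption).
my $bytes = "\xE2\x80\x9Cdon\xE2\x80\x99t come\xE2\x80\x9D";

# Step 1: decode the bytes into Perl characters, as decode($enco => ...) does.
my $text = decode('UTF-8', $bytes);

# Step 2: encode the characters back to UTF-8, as the output layer would.
my $written = encode('UTF-8', $text);

# Decoding and re-encoding with the same encoding reproduces the input.
print $written eq $bytes ? "identical bytes\n" : "bytes differ\n";
```

This is why the two output files match: the decode step and the output layer cancel each other out, so the file still holds the same UTF-8 bytes the server sent.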

    It seems that what you really want to do is convert the "weird"
    quote and apostrophe characters and the em-dash from Unicode to
    their nearest ASCII equivalents. There is certainly no
    general-purpose converter to take Unicode and make "best guess"
    ASCII out of it (what would it do with Chinese characters, for
    example?). Perl can convert the UTF-8 encoding to true Unicode
    in Perl strings (which apparently is happening with your $utf8
    variable), and one could then use the tr/// operator to convert
    the unwanted codes to the ASCII characters you want to use as
    their approximation.

    For example, try adding this line to your above program just
    after your "my $utf8..." line and before the open():

    $utf8=~tr/\x{2019}\x{201c}\x{201d}\x{2013}/'""-/;

    and see if that will suffice. It appears as if the call to
    decode() from the Encode module is needed to convert the UTF-8
    encoding from the web page to a true Unicode string. It may thus
    be misleading to call it $utf8 -- perhaps $unicode would be more
    descriptive?
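A small self-contained sketch of the tr/// substitution, run against a made-up sample string rather than the real page. The variable names $unicode and $ascii are illustrative.

```perl
use strict;
use warnings;

# Sample character string: curly quotes, curly apostrophe and an en dash.
my $unicode = "\x{201C}don\x{2019}t come\x{201D} \x{2013} now";

# Copy the string, then map each listed code point to an ASCII stand-in:
# U+2019 -> ', U+201C and U+201D -> ", U+2013 -> -
(my $ascii = $unicode) =~ tr/\x{2019}\x{201C}\x{201D}\x{2013}/'""-/;

print "$ascii\n";
```

Note that U+2013 is the en dash. If the page actually contains an em dash (U+2014), that code point would also need to be added to the search list, with a matching replacement character.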

    BTW, you should test your open to ensure it executed
    successfully. The typical paradigm is:

    open(...) or die "Your error message, $!";
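As a hedged illustration of that paradigm, the sketch below writes a short sample string using a three-argument open with a lexical filehandle. The file name comes from the thread; the sample text and the error messages are assumptions.

```perl
use strict;
use warnings;

my $file = 'out.html';   # file name taken from the thread

# Open for writing with a UTF-8 output layer; stop with the OS error on failure.
open(my $out, '>:encoding(UTF-8)', $file)
    or die "Cannot open $file for writing: $!";

# Sample character string; the layer encodes it as UTF-8 on output.
print {$out} "\x{201C}don\x{2019}t come\x{201D}\n";

# close can also fail (for example on a full disk), so check it too.
close($out)
    or die "Cannot close $file: $!";
```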

    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
     
    Bob Walton, May 25, 2005
    #4