Re: Clear the "Wide character in print" warning and leave the output unmangled

Discussion in 'Perl Misc' started by Peter J. Holzer, Nov 3, 2012.

  1. On 2012-11-02 05:23, <> wrote:
    > None of the advice on perlunifaq or elsewhere can both
    > * Clear the "Wide character in print" warning, and
    > * Leave the output non doubly encoded.
    >
    > #!/usr/bin/perl
    >
    > # How to test this program:
    > # $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
    > # $ cat /tmp/o
    > # That will show you any problems it has.


    Thanks for providing a complete script which demonstrates the problem.
    This makes finding the problem simpler. However:

    [...]
    > use WebService::GData::Constants qw(:all);
    > use WebService::GData::YouTube;
    > die 'Specify a user please.' unless my $user = shift;


    I'm not going to create a youtube account just to test this script.
    So I cannot test it.

    Unfortunately, you didn't report where the "Wide character in print"
    warning occurs, either, and it is not obvious to me from the source
    code. I am guessing that it happens in the last loop, because you tried
    to use decode_utf8 there.

    So I'm just giving generic advice here:

    1) Always use “binmode(..., ":encoding(...)");” explicitly on STDIN,
    STDOUT and STDERR. The encoding must be the one your terminal uses,
    so if your terminal supports UTF-8, use that. (For production, you
    might want to use “use open ":locale"”, but for debugging it's best
    to eliminate any source of variable behaviour and hardcode the
    encoding.)

    2) Try to shorten your program further, to make it easier to see where
    the problem is without actually running the program.

    3) When processing character data, convert from (external) byte
    encodings to (internal) character strings as early as possible.

    My guess is that you get some byte encoded data from the
    WebService::GData module. You should decode() this, and you should do
    this as early as possible so that the rest of your code doesn't have
    to care about the encoding. This is especially necessary if you
    combine strings from several sources which might use different
    encodings.

    4) When searching for encoding problems, I like to use this simple
    function to dump strings to stdout:

    sub dumpstr {
        my ($s) = @_;

        print utf8::is_utf8($s) ? "char" : "byte";
        print ":";
        for (split //, $s) {
            printf " %#02x", ord($_);
        }
        print "\n";
    }

    Use it to dump the string that is giving you the warning or that is
    double-encoded. That will usually tell you *what* is wrong with the
    string, but not *why* it is wrong. Then go backwards through the code
    to see where you get the string from. If the string is computed from
    some other string(s) (e.g. concatenation, substring, etc), dump the
    inputs in the same way. Eventually you will have identified the
    source of the "wrong" string, and then you can probably fix it with
    a simple call to decode() right at the source. (If you get the string
    from a module, you might also want to file a bug report).
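
A minimal sketch of points 1) and 3) together, assuming a UTF-8 terminal. The sample bytes in $raw and the helper name dumpstr_to_string are made up for illustration; the helper is the dumpstr function above, changed to return the dump instead of printing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Point 1: hardcode the encoding of the standard handles while debugging.
# UTF-8 is an assumption; use whatever your terminal really uses.
binmode($_, ':encoding(UTF-8)') for \*STDIN, \*STDOUT, \*STDERR;

# Same idea as dumpstr, but returns the dump as a string.
sub dumpstr_to_string {
    my ($s) = @_;
    my $kind = utf8::is_utf8($s) ? 'char' : 'byte';
    return join ' ', "$kind:", map { sprintf '%#02x', ord($_) } split //, $s;
}

# Point 3: decode at the boundary. $raw stands in for bytes that arrived
# from a file, a socket or a module (hypothetical sample data).
my $raw  = "caf\xc3\xa9";             # UTF-8 bytes for "cafe" with e-acute
my $text = decode('UTF-8', $raw);     # character string, 4 characters

print dumpstr_to_string($raw),  "\n"; # byte string, 5 elements
print dumpstr_to_string($text), "\n"; # char string, 4 elements
print "$text\n";                      # encoded exactly once, by the output layer
```

From here on the rest of the program only ever sees $text, so nothing downstream has to know which encoding the input used.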

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Nov 3, 2012

  2. On 2012-11-03 12:08, Peter J. Holzer <> wrote:
    > On 2012-11-02 05:23, <> wrote:
    >> None of the advice on perlunifaq or elsewhere can both
    >> * Clear the "Wide character in print" warning, and
    >> * Leave the output non doubly encoded.
    >>
    >> #!/usr/bin/perl
    >>
    >> # How to test this program:
    >> # $ export restriction=TW; PERLLIB=$HOME/perl5/lib/perl5 ./ytpl jidanni2 > /tmp/o
    >> # $ cat /tmp/o
    >> # That will show you any problems it has.

    >
    > Thanks for providing a complete script which demonstrates the problem.
    > This makes finding the problem simpler. However:
    >
    > [...]
    >> use WebService::GData::Constants qw(:all);
    >> use WebService::GData::YouTube;
    >> die 'Specify a user please.' unless my $user = shift;

    >
    > I'm not going to create a youtube account just to test this script.
    > So I cannot test it.


    I spoke too soon. It turns out that the script can retrieve other
    people's playlists, so I can run it with "jidanni2" as a parameter and
    don't need an account and playlist of my own.


    > Unfortunately, you didn't report where the "Wide character in print"
    > warning occurs, either, and it is not obvious to me from the source
    > code. I am guessing that it happens in the last loop, because you tried
    > to use decode_utf8 there.


    Yes, the guess was correct.


    > My guess is that you get some byte encoded data from the
    > WebService::GData module. You should decode() this, and you should do
    > this as early as possible so that the rest of your code doesn't have
    > to care about the encoding. This is especially necessary if you
    > combine strings from several sources which might use different
    > encodings.


    This guess was also correct. $entry->title returns what looks like a
    UTF-8 encoded string. So what I guess should be "台灣軍機 Taiwan
    Aircrafts found in Google Earth" (char: 0x53f0 0x7063 0x8ecd 0x6a5f 0x20
    0x20 0x20 0x54 0x61 ...) is returned as (char: 0xe5 0x8f 0xb0 0xe7 0x81
    0xa3 0xe8 0xbb 0x8d 0xe6 0xa9 0x9f 0x20 0x20 0x20 0x54 0x61 ...).
    To make it even more confusing, the string is marked as a character
    string (the UTF8 bit is on) instead of a byte string. This is definitely
    a bug in WebService::GData.
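
The symptom can be reproduced without the module. This is only a sketch of one way such a string can come about (UTF-8 bytes that get upgraded instead of decoded); it is an assumption for illustration, not a claim about what WebService::GData does internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $good  = "\x{53f0}\x{7063}";      # the first two characters of the title
my $bytes = encode('UTF-8', $good);  # six bytes, UTF8 flag off

# Upgrading changes only the internal representation: still six code
# points (0xe5 0x8f 0xb0 0xe7 0x81 0xa3), but now the UTF8 flag is on.
my $bad = $bytes;
utf8::upgrade($bad);

# Because every code point is below 0x100, decode() can still repair it.
my $fixed = decode('UTF-8', $bad);

printf "bad:   %d code points, UTF8 flag %s\n",
    length($bad), utf8::is_utf8($bad) ? 'on' : 'off';
printf "fixed: %d code points\n", length($fixed);
```

So the flag alone says nothing about whether a string has been decoded; the code points in the dump are what matter.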

    And even worse, it isn't even reliably wrong: "Rob 'N' Raz Featuring
    Leila K - Got To Get" has a U+200E character just before the -. This is
    probably where the "wide character" warning came from. After I put in an
    appropriate “decode("UTF-8", $entry->title)”, it now dies. So I would
    have to wrap that in an eval {} block or possibly use some heuristics to
    check whether decoding is necessary or not. This is where I stop
    and let you take over.
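
One possible fall-back, as a rough sketch. The helper name maybe_decode_utf8 is hypothetical, and the heuristic is an assumption: a string containing a code point above 0xFF cannot be raw bytes, so it is left alone; anything else is decoded strictly, and the original is kept if that fails. It will guess wrong for a genuinely decoded string that happens to consist only of Latin-1 characters forming valid UTF-8, so treat it as a stopgap until the module is fixed.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode only when the string looks like UTF-8 bytes.
sub maybe_decode_utf8 {
    my ($s) = @_;
    return $s unless defined $s;

    # A code point above 0xFF cannot be a byte, so this is already text.
    return $s if $s =~ /[^\x00-\xff]/;

    # Strict decode; FB_CROAK makes malformed input die instead of being
    # silently replaced, and eval {} catches that.
    my $decoded = eval { decode('UTF-8', $s, FB_CROAK | LEAVE_SRC) };
    return defined $decoded ? $decoded : $s;
}

my $from_bytes = maybe_decode_utf8("\xe5\x8f\xb0");       # becomes U+53F0
my $already    = maybe_decode_utf8("K \x{200e}- Got To"); # left unchanged
my $latin1     = maybe_decode_utf8("caf\xe9");            # not UTF-8, left unchanged
```

In the script this would wrap the calls that return titles, for example maybe_decode_utf8($entry->title).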

    So, to summarize:

    1) Put in “binmode STDOUT, ":encoding(UTF8)";”
    2) Put in “use utf8;”
    3) decode() the return value of $entry->title (and possibly some other
    calls) but be aware that this doesn't always work, so you need a
    fall-back strategy.
    4) Report the bug.

    hp

    Peter J. Holzer, Nov 3, 2012
