Perl 5.8.x, Unicode and In-memory Filehandles

Discussion in 'Perl Misc' started by Bernard Chan, Mar 1, 2006.

  1. Bernard Chan

    Bernard Chan Guest

    Hello all,

    I have just started experimenting with the Unicode capabilities of Perl.
    I am currently working on a Web development project involving both
    output buffering with Perl's open() in-memory filehandles, and Unicode
    handling. Separately they work fine, but I have spent a lot of time
    integrating them onto one platform. Hopefully experts around here may
    give me some insights as to what I have missed.

    I have written a module IO::OutputBuffer which is expected to be used as
    follows:

    $buf_ctx = IO::OutputBuffer::start(\*STDOUT); # start in-memory buffer
    # now STDOUT points to the in-memory buffer
    print "blablabla"; # Everything goes to in-memory buffer
    # Content verified; commit to real STDOUT
    IO::OutputBuffer::flush($buf_ctx);
    # Stop buffering
    IO::OutputBuffer::end($buf_ctx);
    # STDOUT reverted to original

    Because stray output is likely to make Apache-CGI complain, I would like
    to capture all the output, validate it and then eventually commit to the
    actual output stream before the script exits (there is also a similar
    facility for capturing STDERR to log file, but not shown).
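
    The capture-then-commit pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: start_buffer() and end_buffer() are hypothetical helpers, not the IO::OutputBuffer module (whose source is not shown), and a temporary file stands in for the real STDOUT so that the sketch has no side effects.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helpers for illustration; this is NOT the IO::OutputBuffer
# module from the post, whose source is not shown.

# Save the real handle, then point the same glob at an in-memory scalar.
sub start_buffer {
    my ($fh) = @_;
    open(my $saved, '>&', $fh) or die "cannot dup handle: $!";
    close($fh);
    my $buf = '';
    open($fh, '>', \$buf) or die "cannot open in-memory handle: $!";
    return { fh => $fh, saved => $saved, buf => \$buf };
}

# Close the in-memory handle, restore the real one, return what was captured.
sub end_buffer {
    my ($ctx) = @_;
    close($ctx->{fh});
    open($ctx->{fh}, '>&', $ctx->{saved}) or die "cannot restore handle: $!";
    close($ctx->{saved});
    return ${ $ctx->{buf} };
}

# A temporary file stands in for the real STDOUT so the sketch is harmless.
my ($out, $path) = tempfile(UNLINK => 1);

my $ctx = start_buffer($out);
print {$out} "blablabla";            # goes to the in-memory buffer
my $captured = end_buffer($ctx);     # $out points at the file again

# "Validate" the captured text, then commit it to the real handle.
print {$out} $captured if length $captured;
close($out) or die "close failed: $!";
```

    In this sketch the commit step is an ordinary print of the captured string once the real handle has been restored, which avoids resetting and seeking the in-memory handle.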

    Basically, as a next step, I would like to make use of PerlIO layers to
    implement some encoding conversion for clients who do not support UTF-8.
    Otherwise, I may need to use Text::Iconv but I guess if I can use PerlIO
    I will keep using that. For instance, if the user profile (or HTTP
    request header) indicates a preference for Big5, I will do a UTF-8->Big5
    conversion.

    As a test, I added some code inside the buffered region that reads a
    Chinese file encoded in UTF-8. I would like to output its content to
    the client side, performing a simulated conversion to Big5 before
    returning.

    I have minimized the process to a script as short as below:

    ================================================

    #!/usr/bin/perl -w

    binmode(STDOUT, ":encoding(big5)") or die "$!"; # Output encoding

    BEGIN {
    require "require.pl";
    }

    #use IO::OutputBuffer;
    #$b_out = IO::OutputBuffer::start(\*STDOUT);
    my ($io_sys, $BUF);
    open $io_sys, ">&", \*STDOUT; close STDOUT;
    open STDOUT, ">", \$BUF;


    open FILE, "<:encoding(utf8)", "utf8_1.txt";
    @lines = <FILE>;
    close FILE;

    print (join("<br>\n", @lines));

    #IO::OutputBuffer::flush($b_out);
    my $buffered_content = $BUF;
    $BUF = '';
    seek STDOUT, 0, 0;

    print $io_sys $buffered_content;

    ====================================

    However, I cannot get the file content to display in proper Big5.
    Instead, I got what appear to be Unicode code points, as follows:

    Wide character in print at output_minimal.pl line 20.
    "\x{00e7}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00b9}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0081}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e9}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00ab}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0094}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0096}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0087}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00ac}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00a9}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00a6}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00bb}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0085}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00a2}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
    "\x{0097}" does not map to big5-eten at output_minimal.pl line 28.
    UTF-8:
    \x{00e7}\x{00b9}\x{0081}\x{00e9}\x{00ab}\x{0094}\x{00e4}\x{00b8}\x{00ad}\x{00e6}\x{0096}\x{0087}
    <br>

    <br>
    \x{00e6}\x{00b8}\x{00ac}\x{00e8}\x{00a9}\x{00a6}\x{00e4}\x{00bb}\x{00a5}
    UTF-8
    \x{00e8}\x{00bc}\x{00b8}\x{00e5}\x{0085}\x{00a5}\x{00e6}\x{00bc}\x{00a2}\x{00e5}\x{00ad}\x{0097}

    I guess that Perl has erroneously treated the content as non-Unicode and
    thus tries to convert the individual bytes, read as ISO-8859-1, to Big5.
    I have tried inserting utf8::upgrade($buffered_content) and then checked
    with utf8::is_utf8() that the string is indeed flagged as UTF-8.

    Can anyone help me? Thank you.

    Regards,
    Bernard Chan.


     
    Bernard Chan, Mar 1, 2006
    #1

  2. MSG

    MSG Guest


    It seems suspicious that you set your STDOUT to "big5" at the very
    beginning and then open and close STDOUT many times afterwards.
    By the time you print, your STDOUT has already reverted to
    "standard".
    Anyway, the "Wide character" warning indicates that you are outputting
    Unicode to a non-Unicode filehandle.
     
    MSG, Mar 1, 2006
    #2

  3. Bernard Chan

    Bernard Chan Guest

    I am inclined to think this may be related to the in-memory nature of
    the filehandle. In the latest revision of the test script I have tried this:

    ================================================
    #!/usr/bin/perl -w

    BEGIN {
    require "require.pl";
    }

    my ($io_sys, $BUF);
    open $io_sys, ">&", \*STDOUT; close STDOUT;
    open STDOUT, ">:utf8", \$BUF;

    open FILE, "<:encoding(utf8)", "utf8_1.txt";
    @lines = <FILE>;
    close FILE;

    my $buffered_content2 = (join("<br>\n", @lines)); # (1)
    print (join("<br>\n", @lines));

    my $buffered_content = $BUF;
    $BUF = '';
    seek STDOUT, 0, 0;

    binmode($io_sys, ":encoding(big5)");
    print $io_sys $buffered_content2; # (2)
    ================================================

    The modifications are labelled (1) and (2); line (1) is the only line
    actually added. When I print $buffered_content on line (2) as before, I
    get the same output as quoted previously. However, when I change line
    (2) to print $buffered_content2, the output is exactly what I wanted
    (Big5). So there seems to be a difference, even though the expression
    passed to join() is identical in both cases. The only difference is
    that one string was read from the variable backing the in-memory
    buffer, while the other came directly from join().

    I checked that the two strings are byte-for-byte identical, and that
    after utf8::upgrade($buffered_content) both strings are valid UTF-8
    with the UTF-8 flag set, but comparing the two strings with "eq" still
    returns false. There must be something subtle going on here.
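
    One way to investigate the mismatch is to compare lengths rather than flags. The sketch below is an illustration only, using two sample characters instead of the contents of utf8_1.txt. It writes a character string through an in-memory handle opened with ":utf8" and then looks at what the backing scalar holds: the UTF-8 encoded bytes, not the characters. utf8::upgrade() changes only the internal storage of those bytes, whereas Encode::decode() turns them back into characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two CJK characters as a Perl character string (illustrative data only).
my $text = "\x{4e2d}\x{6587}";

# Write them through an in-memory handle with a :utf8 layer.
my $buf = '';
open(my $mem, '>:utf8', \$buf) or die "cannot open in-memory handle: $!";
print {$mem} $text;
close($mem);

# utf8::upgrade() sets the flag, but the string still consists of the same
# characters as before: one per UTF-8 byte, each in the range 0..255.
my $upgraded = $buf;
utf8::upgrade($upgraded);

# decode() interprets the bytes as UTF-8 and yields the original characters.
my $copy    = $buf;
my $decoded = decode('UTF-8', $copy);

printf "text:     %d chars\n", length $text;
printf "buffer:   %d chars\n", length $buf;
printf "upgraded: %d chars, flag %s\n",
    length $upgraded, (utf8::is_utf8($upgraded) ? 'on' : 'off');
printf "decoded:  %d chars\n", length $decoded;
printf "upgraded eq text? %s\n", ($upgraded eq $text    ? 'yes' : 'no');
printf "decoded  eq text? %s\n", ($decoded  eq $text    ? 'yes' : 'no');
```

    If the buffer reports more characters than the original text, the two strings cannot compare equal with "eq", even though their internal byte representations may look the same after upgrading.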

    Can anyone explain why this is so? Thank you in advance.

    Regards,
    Bernard Chan.

     
    Bernard Chan, Mar 1, 2006
    #3
  4. Bernard Chan

    Bernard Chan Guest

    MSG wrote:
    >
    > It seems suspicious that you set your STDOUT to "big5" at the very
    > beginning and then open and close STDOUT many times afterwards. By
    > the time you print, your STDOUT has already reverted to "standard".
    >

    That is because I want to simulate the output buffering trick I would
    normally do with the module described in my previous post: I want to
    hide from later scripts the fact that they are printing to an in-memory
    filehandle. If there is a more elegant way to do this without all this
    trouble, please tell me. Thank you.

    I have removed the initial binmode() from my latest test script (see my
    other post that I am posting in a few minutes). The original intent was
    to set the PerlIO layer on the real STDOUT (not the in-memory one). I
    may be able to avoid this.

    I would also like to ask: if I binmode(STDOUT, "....."), will the PerlIO
    layers it installs be lost when I dup the handle (>&)? You see, I am just
    duping filehandles around to make other routines unaware of the extra
    buffering layer. If the layers are lost in the duped filehandle, then you
    are right, but I couldn't find anything in the docs about this behaviour.
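
    One way to check this empirically, rather than from the docs, is to list the layers on each handle with PerlIO::get_layers(). The sketch below is an illustration only: a temporary file stands in for STDOUT, and the variable names are made up for the example. The error log in the first post ("does not map to big5-eten" at the line that prints to $io_sys) suggests that the duplicate kept the encoding layer there, but the output of this script on the Perl in question is the thing to trust.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A temporary file stands in for STDOUT (assumption for a harmless demo).
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(big5)') or die "binmode failed: $!";

# Duplicate the handle the same way the test script does.
open(my $dup, '>&', $fh) or die "cannot dup handle: $!";

# A freshly opened in-memory handle, as used for the buffer.
my $buf = '';
open(my $mem, '>', \$buf) or die "cannot open in-memory handle: $!";

my @orig_layers = PerlIO::get_layers($fh);
my @dup_layers  = PerlIO::get_layers($dup);
my @mem_layers  = PerlIO::get_layers($mem);

print "original:  @orig_layers\n";
print "duplicate: @dup_layers\n";
print "in-memory: @mem_layers\n";
```

    Whatever the duplicate reports, the freshly opened in-memory handle starts with its own layers, so an encoding set on the old STDOUT does not carry over to the new one.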

    > Anyway, the "Wide character" warning indicates that you are outputting
    > Unicode to a non-Unicode filehandle.


    I have eliminated the wide character warning in the later test, after I
    added ":utf8" to the open() that creates the in-memory filehandle. But
    the problem remains.
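
    A possible way around the remaining problem, sketched under assumptions: treat the contents of the in-memory buffer as UTF-8 bytes, decode them back into characters with Encode::decode(), and only then print them to a handle that carries the ":encoding(big5)" layer. The sample lines and the temporary file standing in for the real output handle are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Sample character data standing in for the lines read from utf8_1.txt.
my @lines    = ("UTF-8: \x{4e2d}\x{6587}\n", "\x{6e2c}\x{8a66}\n");
my $expected = join("<br>\n", @lines);

# Buffer the output in memory; the scalar receives UTF-8 encoded bytes.
my $buf = '';
open(my $mem, '>:utf8', \$buf) or die "cannot open in-memory handle: $!";
print {$mem} $expected;
close($mem);

# Turn the buffered bytes back into a character string.
my $copy  = $buf;
my $chars = decode('UTF-8', $copy);

# A temporary file stands in for the real output handle (assumption).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(big5)') or die "binmode failed: $!";
print {$out} $chars;
close($out) or die "close failed: $!";

# Read the raw bytes back to see what was actually written.
open(my $in, '<:raw', $path) or die "cannot read back: $!";
my $raw = do { local $/; <$in> };
close($in);
printf "wrote %d bytes of Big5\n", length $raw;
```

    The design point is that the decode step sits between the buffer and the final handle, so code that prints into the buffer does not need to know about it.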

    Regards,
    Bernard Chan.

     
    Bernard Chan, Mar 1, 2006
    #4
