Perl 5.8.x, Unicode and In-memory Filehandles

Bernard Chan

Hello all,

I have just started experimenting with the Unicode capabilities of Perl.
I am currently working on a Web development project involving both
output buffering with Perl's open() in-memory filehandles, and Unicode
handling. Separately they work fine, but I have spent a lot of time
trying to make them work together. Hopefully the experts here can
give me some insight into what I have missed.

I have written a module IO::OutputBuffer which is expected to be used as
follows:

$buf_ctx = IO::OutputBuffer::start(\*STDOUT); # start in-memory buffer
# now STDOUT points to the in-memory buffer
print "blablabla"; # Everything goes to in-memory buffer
# Content verified; commit to real STDOUT
IO::OutputBuffer::flush($buf_ctx);
# Stop buffering
IO::OutputBuffer::end($buf_ctx);
# STDOUT reverted to original

Because stray output is likely to make Apache-CGI complain, I would like
to capture all the output, validate it and then commit it to the
actual output stream before the script exits (there is a similar
facility for capturing STDERR to a log file, not shown here).

As a next step, I would like to use PerlIO layers to implement
encoding conversion for clients that do not support UTF-8.
Otherwise I may need to use Text::Iconv, but if PerlIO can do the job
I would rather keep using that. For instance, if the user profile (or the
HTTP request header) indicates that the user prefers Big5, I will do a
UTF-8->Big5 conversion.
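
The UTF-8->Big5 conversion described above can be sketched with the core Encode module, without any filehandles involved. This is only an illustration: the sample octets are an assumption (two Chinese characters, U+4E2D U+6587), not content taken from the real file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample input: the UTF-8 octets for U+4E2D U+6587,
# as they would arrive from a file read without any layer.
my $utf8_octets = "\xE4\xB8\xAD\xE6\x96\x87";

# Step 1: octets -> Perl character string.
my $chars = decode('UTF-8', $utf8_octets);

# Step 2: character string -> Big5 octets for the client.
my $big5_octets = encode('big5', $chars);

printf "%d characters, %d Big5 octets\n",
    length($chars), length($big5_octets);
```

A PerlIO layer such as ":encoding(big5)" performs step 2 implicitly on output, but it expects to be given a character string (the result of step 1), not raw UTF-8 octets.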

As a test, I added some code inside the buffered section that reads
a Chinese file encoded in UTF-8. I would like to
output its content to the client side, with a simulated conversion
to Big5 before it is returned.

I have reduced the problem to the short script below:

================================================

#!/usr/bin/perl -w

binmode(STDOUT, ":encoding(big5)") or die "$!"; # Output encoding

BEGIN {
require "require.pl";
}

#use IO::OutputBuffer;
#$b_out = IO::OutputBuffer::start(\*STDOUT);
my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">", \$BUF;


open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

print (join("<br>\n", @lines));

#IO::OutputBuffer::flush($b_out);
my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

print $io_sys $buffered_content;

====================================

However, I cannot get the file content to display in proper Big5.
Instead, I get what look like individual code points, as follows:

Wide character in print at output_minimal.pl line 20.
"\x{00e7}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0081}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ab}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0094}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0096}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0087}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ac}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a9}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e4}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bb}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00b8}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0085}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e6}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00bc}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00a2}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00e5}" does not map to big5-eten at output_minimal.pl line 28.
"\x{00ad}" does not map to big5-eten at output_minimal.pl line 28.
"\x{0097}" does not map to big5-eten at output_minimal.pl line 28.
UTF-8:
\x{00e7}\x{00b9}\x{0081}\x{00e9}\x{00ab}\x{0094}\x{00e4}\x{00b8}\x{00ad}\x{00e6}\x{0096}\x{0087}
<br>

<br>
\x{00e6}\x{00b8}\x{00ac}\x{00e8}\x{00a9}\x{00a6}\x{00e4}\x{00bb}\x{00a5}
UTF-8
\x{00e8}\x{00bc}\x{00b8}\x{00e5}\x{0085}\x{00a5}\x{00e6}\x{00bc}\x{00a2}\x{00e5}\x{00ad}\x{0097}

I guess that Perl has treated the content as non-Unicode and is
therefore converting the individual bytes, as ISO-8859-1 characters, to
Big5. I have tried inserting utf8::upgrade($buffered_content) and then
checked with utf8::is_utf8() that the string is indeed flagged as UTF-8.
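
A small sketch may help separate two things that are easy to confuse here. utf8::upgrade() only changes how Perl stores a string internally, and utf8::is_utf8() only reports that storage flag; neither interprets octets as UTF-8. utf8::decode() (or Encode::decode) is what turns UTF-8 octets into characters. The sample octets below are an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Assumed sample: the UTF-8 encoding of U+4E2D, held as three octets.
my $octets = "\xE4\xB8\xAD";

# upgrade: same three characters (U+00E4, U+00B8, U+00AD),
# only the internal representation and the flag change.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# decode: the three octets are interpreted as UTF-8,
# giving the single character U+4E2D.
my $decoded = $octets;
utf8::decode($decoded);

printf "upgraded: length %d, flag %d\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0;
printf "decoded:  length %d, first char U+%04X\n",
    length($decoded), ord($decoded);
```

Under this reading, an upgraded string still consists of characters in the U+0080..U+00FF range, which matches the "\x{00e7}" style code points in the messages quoted above.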

Can anyone help me? Thank you.

Regards,
Bernard Chan.


*** Free account sponsored by SecureIX.com ***
*** Encrypt your Internet usage with a free VPN account from http://www.SecureIX.com ***
 
MSG

It seems suspicious that you set your STDOUT to "big5" at the very
beginning and then open and close STDOUT many times afterwards.
By the time you print, your STDOUT has already reverted to
"standard".
Anyway, the "Wide character" warning indicates that you are outputting
Unicode to a non-Unicode filehandle.
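
The warning can be reproduced in isolation with two in-memory filehandles, one opened without a layer and one opened with ":utf8". This is a sketch with an assumed sample string; the variable names are hypothetical and do not come from the script above.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Assumed sample: two characters above U+00FF.
my $text = "\x{4E2D}\x{6587}";

# No layer: the handle expects octets, so print warns.
my $plain = '';
open my $raw_fh, '>', \$plain or die "open: $!";
print {$raw_fh} $text;
close $raw_fh;

# With :utf8 the handle accepts characters and stores UTF-8 octets.
my $layered = '';
open my $utf8_fh, '>:utf8', \$layered or die "open: $!";
print {$utf8_fh} $text;
close $utf8_fh;

print "warning seen: $_" for @warnings;
printf "layered buffer holds %d octets\n", length($layered);
```

Note that even with the ":utf8" layer the buffer scalar ends up holding encoded octets, not a character string.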
 
Bernard Chan

I am inclined to think this may be related to the in-memory nature of
the filehandle. In the latest revision of the test script I have tried this:

================================================
#!/usr/bin/perl -w

BEGIN {
require "require.pl";
}

my ($io_sys, $BUF);
open $io_sys, ">&", \*STDOUT; close STDOUT;
open STDOUT, ">:utf8", \$BUF;

open FILE, "<:encoding(utf8)", "utf8_1.txt";
@lines = <FILE>;
close FILE;

my $buffered_content2 = (join("<br>\n", @lines)); # (1)
print (join("<br>\n", @lines));

my $buffered_content = $BUF;
$BUF = '';
seek STDOUT, 0, 0;

binmode($io_sys, ":encoding(big5)");
print $io_sys $buffered_content2; # (2)
================================================

The modifications are labelled (1) and (2). Line (1) is the only line
actually added. When I print $buffered_content on line (2), as before,
I get the same output as previously quoted. However, when I change
line (2) to print $buffered_content2, the output is exactly what I
wanted (Big5). So there seems to be a difference between the two strings,
even though the expression produced by join() was identical in both
cases. The only difference is that one was read back from the
variable backing the in-memory buffer, while the other came directly
from join().

I checked that the two strings are byte-for-byte identical, and
that after utf8::upgrade($buffered_content) both strings are valid
UTF-8 with the UTF-8 flag set, but comparing the two strings with "eq"
still returns false. I think something subtle is going on here.

Can anyone explain why this is so? Thank you in advance.
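
One plausible reading, not confirmed anywhere in this thread, is that the scalar behind an in-memory filehandle always holds octets: with ">:utf8" it holds the UTF-8 encoding of what was printed, while the join() result is a character string. The sketch below compares the two before and after an explicit decode. The sample lines are an assumption standing in for the content of utf8_1.txt.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed stand-in for the lines read from the UTF-8 file.
my @lines  = ("\x{4E2D}\x{6587}\n", "\x{6E2C}\x{8A66}\n");
my $joined = join("<br>\n", @lines);

# Print the character string through an in-memory handle.
my $buf = '';
open my $mem, '>:utf8', \$buf or die "open: $!";
print {$mem} $joined;
close $mem;

# $buf holds UTF-8 octets; eq compares characters, so it differs.
my $same_before = ($buf eq $joined) ? 1 : 0;

# Decoding the octets recovers the original character string.
my $decoded    = decode('UTF-8', $buf);
my $same_after = ($decoded eq $joined) ? 1 : 0;

print "eq before decode: $same_before, after decode: $same_after\n";
```

If this reading is right, printing the decoded string (rather than the raw buffer) to a handle with ":encoding(big5)" should behave like printing $buffered_content2.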

Regards,
Bernard Chan.

Bernard Chan

MSG said:
It seems suspicious that you set your STDOUT to "big5" at the very
beginning and then open and close STDOUT many times afterwards. By
the time you print, your STDOUT has already reverted to "standard".

That is because I would like to simulate the output buffering trickery I
would normally do with the module described in my previous post, as I
would like to hide from later scripts the fact that they are printing to
an in-memory filehandle. If there is a more elegant way to do this
without all the trouble, please tell me. Thank you.

I have removed the initial binmode() from my latest test script (see my
other post that I am posting in a few minutes). The original intent was
to set the PerlIO layer on the real STDOUT (not the in-memory one). I
may be able to avoid this.

I would also like to ask: if I call binmode(STDOUT, "....."), will the
PerlIO layers installed be lost when I dup the handle (>&)? I am just
duping filehandles around so that other routines are unaware of the extra
buffering layer. If the layers are lost in the duped filehandle, then you
are right, but I could not find anything in the docs about this behaviour.
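
Rather than guessing, the layers on both handles can be inspected with PerlIO::get_layers(). The sketch below uses a temporary file instead of STDOUT so that it is safe to run anywhere; the handle names are hypothetical, and the result may differ between Perl versions, so no particular outcome is claimed here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch handle standing in for the real STDOUT.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(big5)') or die "binmode: $!";

# Dup the handle the same way the test script dups STDOUT.
open my $dup, '>&', $fh or die "dup: $!";

# List the layer stack of each handle.
my @orig_layers = PerlIO::get_layers($fh);
my @dup_layers  = PerlIO::get_layers($dup);

print "original: @orig_layers\n";
print "dup:      @dup_layers\n";
```

Comparing the two printed lists shows directly whether the encoding layer survived the dup on the Perl in use.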

MSG said:
Anyway "wide character" warning indicates that you are outputting
Unicode to a non-Unicode filehandle.

I have eliminated the wide character warning in the later test, after I
added ":utf8" to the open() that creates the in-memory filehandle. But
the problem remains.

Regards,
Bernard Chan.
