Alan J. Flavell said:
Apropos of which, I suppose I ought at some point to repeat with
5.8.1 the tests that I had reported for 5.8.0 in
http://www.google.com/groups?selm=Pine.LNX.4.53.0308170139110.6451%40lxplus005.cern.ch
and related thread, about apparently broken newline handling with
utf-16LE.
Or could you perhaps throw any light, if you're interested, on what I
was seeing there and the subsequent followup?
Right... I've done some testing on this, and I would say it's
definitely a bug... Also that it has nothing to do with utf16le,
specifically; rather that it is a problem with the :crlf layer.
Please excuse the rather long post.
All the tests below have exactly the same results with 5.8.0 and
5.8.2. All tests have been run on i686-linux-thread-multi, but as of
5.8 they ought to give the same results on all platforms, given that
all filehandles are explicitly binmode()d. (I could be wrong: if Win32
systems have :crlf pushed by default then it's *definitely* worth
pushing :raw before you do anything else if you're dealing with utf16.)
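For anyone who wants to try the "push :raw first" idea in isolation, here is a minimal sketch. The sample bytes and the temporary file are my own invention for illustration, not part of the tests that follow; the only assumption is that :raw followed by :encoding(UTF-16LE) is the layer stack you want for reading.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny UTF-16LE sample as raw bytes: "A", U+00A0, CR, LF.
# (Illustrative data only.)
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;    # plain bytes while writing the sample
print {$tmp} "A\x00\xa0\x00\x0d\x00\x0a\x00";
close $tmp or die "close: $!";

# :raw comes first so that any default :crlf layer is out of the way,
# then :encoding decodes the bytes into characters.
open my $in, "<:raw:encoding(UTF-16LE)", $path or die "open: $!";
my $line = <$in>;
close $in;

# Show the code point of each character that was read.
print join(" ", map { sprintf("%04x", ord($_)) } split //, $line), "\n";
```

With no :crlf anywhere in the stack, the carriage return should survive in the decoded line, which is the same behaviour the first run of ./read shows further down.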
First, input. This is a modified version of your script/test file from
the above post. The output has been line-wrapped for posting.
% od -x utf16
0000000 feff 004e 004f 0054 0045 0053 0020 0046
^^^^ BOM (le)
0000020 004f 0052 4120 0041 0044 0044 0049 0054
^^^^ a char >FF
0000040 0049 004f 004e 0041 004c 0020 0041 00a0
a char >7F <FF ^^^^
0000060 0055 004e 0044 002e 000d 000a 000d 000a
DOSish newlines ^^^^-^^^^
0000100
% cat read
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/:fallbacks is_utf8 _utf8_on/;
use PerlIO::encoding;
my $bom = "\x{feff}";
# just so we know what's what
$PerlIO::encoding::fallback = FB_PERLQQ;
binmode STDOUT, ":encoding(ascii)";
# the first argument is the list of layers to use
open my $IN, "<$ARGV[0]", "utf16" or die $!;
$\ = "\n"; $, = " ";
$_ = <$IN>;
print "utf8 flag is", is_utf8($_) ? "on" : "off";
# force utf8 flag on if we were given two arguments
$ARGV[1] and _utf8_on($_), print "forcing utf8";
s/^$bom// and print "snipped BOM";
chomp;
# this is a slightly clearer display format
print map {sprintf "%04x", $_} unpack '(U)*', $_;
print;
__END__
% ./read ":encoding(utf16le)"
utf8 flag is on
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e 000d
DOSish newline not stripped ^^^^
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.
% ./read ":encoding(utf16le):crlf"
utf8 flag is off
00ef 00bb 00bf 004e 004f 0054 0045 0053 0020 0046 004f 0052
^^^^-^^^^-^^^^ this is \x{feff} in utf8
00e4 0084 00a0 0041 0044 0044 0049 0054 0049 004f 004e 0041
^^^^-^^^^-^^^^ ditto \x{4120}
004c 0020 0041 00c2 00a0 0055 004e 0044 002e
DOSish newline is stripped, however ^^
\x{00ef}\x{00bb}\x{00bf}NOTES FOR\x{00e4}\x{0084}\x{00a0}ADDITIONAL
A\x{00c2}\x{00a0}UND.
% ./read ":encoding(utf16le):crlf" 1
utf8 flag is off
forcing utf8
snipped BOM
004e 004f 0054 0045 0053 0020 0046 004f 0052 4120 0041 0044 0044 0049
0054 0049 004f 004e 0041 004c 0020 0041 00a0 0055 004e 0044 002e
NOTES FOR\x{4120}ADDITIONAL A\x{00a0}UND.
So the problem here is that :crlf fails to set the utf8 flag on the
data when it should. Now, output.
% perl -e'binmode STDOUT, ":encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 000a 000a
0000020
% perl -e'binmode STDOUT, ":crlf:encoding(utf16le)";
print "\xa0hello\n\n"' > out
% od -x out
0000000 00a0 0068 0065 006c 006c 006f 0a0d 0d00
0000020 000a
0000022
This is not actually quite such nonsense as it seems: because 'od -x'
byteswaps everything, the file actually ends '6f 00 0d 0a 00 0d 0a 00',
which is the perfectly reasonable result of treating the binary
UTF16 data as text. So we do the :crlf before the UTF16:
% perl -e'binmode STDOUT, ":encoding(utf16le):crlf";
print "\xa0hello\n\n"' > out
Malformed UTF-8 character (unexpected continuation byte 0xa0, with no
preceding start byte) in null operation.
% od -x out
0000000 0000 0068 0065 006c 006c 006f 000d 000a
0000020 000d 000a
0000024
This last would give the desired result, but seems to have the
converse problem from above: that it is trying to treat as utf8 data
that should be treated as bytes.
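One way to sidestep :crlf on output altogether is to do the newline translation yourself on the character string, and only then let :encoding turn it into bytes. The following is a rough sketch of that workaround, not a fix for the layer itself; the temporary file and variable names are just for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# No :crlf here: only :raw plus the encoding layer.
open my $out, ">:raw:encoding(UTF-16LE)", $path or die "open: $!";

my $text = "\xa0hello\n\n";
(my $dos = $text) =~ s/\n/\r\n/g;    # translate newlines on characters
print {$out} $dos;
close $out or die "close: $!";

# Read the file back as plain bytes to see what was actually written.
open my $check, "<:raw", $path or die "open: $!";
my $bytes = do { local $/; <$check> };
close $check;

print join(" ", map { sprintf("%02x", ord($_)) } split //, $bytes), "\n";
```

Because the \r\n pairs exist as characters before encoding, each one ought to come out as the two-byte units 0d 00 and 0a 00, rather than as single bytes spliced into the middle of the UTF-16 stream.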
Having a look at perlio.c suggests to me (though I can't entirely
follow it) that a :crlf layer always has PERLIO_F_UTF8 off, when in
fact it should check the state of the layer below and set itself
accordingly. Having a think about the issues involved suggests to me
that Microsoft should *really* have taken the opportunity of changing
to utf16 to ditch using \r\n... but there we go.
I would seriously consider not using :crlf at all, but instead writing
a :nl layer that maps any of \n, \r, \r\n to \n on input and any of
\n, \r, \r\n to \r\n on output... seems to me that'd be more use, in
general. I guess it would probably be slower.
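To make the mapping concrete, here is a sketch of it as two ordinary functions working on already-decoded strings. The names nl_in and nl_out are hypothetical, and this is not a real PerlIO layer: a proper layer would also have to cope with a \r\n pair split across two buffers, which this ignores.

```perl
use strict;
use warnings;

# Input direction: any of \r\n, \r or \n ends up as \n.
sub nl_in {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # \r\n or a lone \r becomes \n; lone \n is untouched
    return $text;
}

# Output direction: any of \r\n, \r or \n ends up as \r\n.
sub nl_out {
    my ($text) = @_;
    $text =~ s/\r\n?|\n/\r\n/g;    # try \r\n first so it is not doubled
    return $text;
}

my $mixed = "one\r\ntwo\rthree\nfour";
my $unix  = nl_in($mixed);
my $dos   = nl_out($mixed);

# Show the results with the line endings made visible.
for my $s ($unix, $dos) {
    (my $shown = $s) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    print "$shown\n";
}
```

Since both functions operate on characters, they would sit above the encoding step, so the utf8 flag question from the :crlf tests never comes up.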
Ben