Read/write with UCS-2* encodings - Possible???

Discussion in 'Perl Misc' started by Ilya Zakharevich, Feb 17, 2009.

  1. What should one fix to make UCS-2* encodings work on the file handles
    in Perl? E.g., should not

    perl -wlpe "BEGIN{binmode STDOUT, q(:encoding(UCS-2));}" < xyz > xyz1

    just work? Currently, it requires an additional `binmode STDOUT' in
    advance, which changes the semantics. And doing

    binmode STDOUT;
    binmode STDOUT, q(:encoding(UCS-2));
    binmode STDOUT, q(:crlf);

    does not work, since :crlf layer is put AFTER :encoding, not before it
    as expected...

    Another indication is that

    piconv -t UCS-2

    gives wrong results on DOSISH platforms (which is not surprising,
    since the version I have uses q(:encoding(UCS-2))).

    For best results, I would prefer a solution which allows doing

    binmode STDOUT, q(:encoding(UCS-2));

    and

    binmode STDOUT, q(:crlf);

    in arbitrary order so that the result does not depend on the order
    (as now), but works ;-/.

    Thanks,
    Ilya
     
    Ilya Zakharevich, Feb 17, 2009
    #1

  2. Marc Lucksch Guest

    Ilya Zakharevich wrote:
    > What should one fix to make UCS-2* encodings work on the file handles
    > in Perl? E.g., should not


    I have no idea about UCS-2, but I had the same problems with UTF-16.

    > binmode STDOUT;
    > binmode STDOUT, q(:encoding(UCS-2));
    > binmode STDOUT, q(:crlf);
    >
    > does not work, since :crlf layer is put AFTER :encoding, not before it
    > as expected...
    >
    > Another indication is that
    >
    > piconv -t UCS-2
    >
    > gives wrong results on DOSISH platforms (which is not surprising,
    > since the version I have uses q(:encoding(UCS-2))).


    To make perl generate UTF-16 or UTF-32 files that can be used by the
    Windows Editor (CRLF), I used a small trick, which I also put in my
    Sofu module:
    http://search.cpan.org/~maluku/Sofu-0.3/lib/Data/Sofu.pm#NOTE_on_Unicode

    #Write Windows CRLF UTF-16 Files
    open my $fh,">:raw:encoding(UTF-16):crlf:utf8","out.sofu";

    #Write Unix UTF-16 Files
    open my $fh,">:raw:encoding(UTF-16)","out.sofu";
    #Same goes for UTF-32

    print $fh chr(65279); # Print UTF-8 Byte Order Mark (Some programs
                          # want it, some programs die on it...)

    When I tested it, it worked both on Windows and Linux.

    Maybe this helps you

    Marc "Maluku" Lucksch
     
    Marc Lucksch, Feb 17, 2009
    #2

  3. On 2009-02-17 14:16, Marc Lucksch <> wrote:
    > To make perl generate UTF-16, UTF-32 files that can be used by the
    > Windows Editor (CRLF), I used a small trick of which I also put in my
    > Sofu module:
    > http://search.cpan.org/~maluku/Sofu-0.3/lib/Data/Sofu.pm#NOTE_on_Unicode
    >
    > #Write Windows CRLF UTF-16 Files
    > open my $fh,">:raw:encoding(UTF-16):crlf:utf8","out.sofu";


    Why the ":utf8"? It doesn't make sense to me (you want UTF-16, not
    UTF-8, and you most definitely don't want to double-encode), and it
    doesn't seem to make any difference anyway.


    > print $fh chr(65279); #Print UTF-8 Byte Order Mark (Some programs
    > want it, some programs die on it...)


    :encoding(UTF-16) already causes a BOM to be written, so this writes a
    second BOM.
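
    The double BOM can be checked by dumping the bytes that reach the
    file. What follows is only a sketch: the helper name utf16_bytes and
    the use of a temporary file are my own choices for illustration, and
    it assumes a perl where Encode's UTF-16 writes a big-endian BOM.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through :raw:encoding(UTF-16) into a
# temporary file and return the bytes of that file as a hex string.
sub utf16_bytes {
    my ($text) = @_;
    my ($tmp, $name) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, '>:raw:encoding(UTF-16)', $name or die "open $name: $!";
    print {$out} $text;
    close $out or die "close $name: $!";

    open my $in, '<:raw', $name or die "open $name: $!";
    local $/;                     # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return unpack 'H*', $bytes;
}

my $layer_only = utf16_bytes("A");                # BOM comes from the layer
my $explicit   = utf16_bytes(chr(0xFEFF) . "A");  # plus a hand-written U+FEFF

print "layer only:    $layer_only\n";
print "explicit BOM:  $explicit\n";
```

    If the second dump starts with feff twice, the explicit chr(65279)
    is redundant for UTF-16 and only belongs in the UTF-8 case.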

    hp

    PS: Only tested with 5.10.0.
     
    Peter J. Holzer, Feb 17, 2009
    #3
  4. Marc Lucksch Guest

    Peter J. Holzer wrote:
    > On 2009-02-17 14:16, Marc Lucksch <> wrote:
    >> To make perl generate UTF-16, UTF-32 files that can be used by the
    >> Windows Editor (CRLF), I used a small trick of which I also put in my
    >> Sofu module:
    >> http://search.cpan.org/~maluku/Sofu-0.3/lib/Data/Sofu.pm#NOTE_on_Unicode
    >>
    >> #Write Windows CRLF UTF-16 Files
    >> open my $fh,">:raw:encoding(UTF-16):crlf:utf8","out.sofu";

    >
    > Why the ":utf8"? It doesn't make sense to me (you want UTF-16, not
    > UTF-8, and you most definitely don't want to double-encode), and it
    > doesn't seem to make any difference anyway.

    As far as I remember, when I wrote that I was getting a warning about
    wide characters from the :crlf layer. It doesn't happen anymore in
    perl5.10.0 (see below), so it is not needed anymore (it still works, though).

    The :utf8 layer somehow makes the next layer accept wide characters even
    if it shouldn't.

    I tested it with perl5.8.1 when I did that. And the test routine I
    wrote for this also works for perl5.10.0 (in Data::Sofu 0.3).


    >> print $fh chr(65279); #Print UTF-8 Byte Order Mark (Some programs
    >> want it, some programs die on it...)

    >
    > :encoding(UTF-16) already causes a BOM to be written, so this writes a
    > second BOM.
    >

    Yeah, that was my error, I didn't delete that line. There were two
    lines before it describing how to write UTF-8, and that line belonged
    to them.

    Take my old test script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh,">:raw:encoding(UTF-16):crlf:utf8","windows.txt";
    print $fh "Hello\nWorld";
    close $fh;

    open $fh,">:raw:encoding(UTF-16)","unix.txt";
    print $fh "Hello\nWorld";
    close $fh;

    # this is logical, since PerlIO layers are applied from left to right
    open $fh,">:raw:encoding(UTF-16):crlf","logical.txt";
    print $fh "Hello\nWorld";
    close $fh;

    # this is illogical, since PerlIO layers are applied from left to right,
    # but test it anyway
    open $fh,">:raw:crlf:encoding(UTF-16)","unlogical.txt";
    print $fh "Hello\nWorld";
    close $fh;

    # In Windows :crlf is the default
    open $fh,">:encoding(UTF-16)","justutf-16.txt";
    print $fh "Hello\nWorld";
    close $fh;


    Test on Windows with perl5.10.0

    windows.txt:
    Editor:
    Hello
    World
    Vim:
    Hello
    World
    [converted][noeol]

    unix.txt:
    Editor:
    HelloWorld
    Vim:
    Hello
    World
    [unix][converted][noeol]


    And now the others:

    logical.txt:
    Works as the windows.txt one (didn't for me before)
    And there is no more warning. *happy*


    unlogical.txt:
    VIM:
    þÿ\0H\0e\0\l\0l\0o\0
    \0W\0o\0r\0l\0d
    [noeol]
    Editor:
    Hello਀圀漀爀氀

    justutf-16.txt:
    Same as unlogical.txt, which is strange
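
    The difference between the two layer orders is easier to see in the
    raw bytes than in an editor. A minimal sketch, assuming a perl where
    :crlf can be pushed as its own layer (tried only in my head against
    the Unix behaviour); bytes_written is a made-up helper name, and
    UTF-16LE is used instead of UTF-16 just to keep the BOM out of the
    dump:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer string and
# return the resulting file contents as a hex string.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $name) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $name or die "open $name: $!";
    print {$out} $text;
    close $out or die "close $name: $!";

    open my $in, '<:raw', $name or die "open $name: $!";
    local $/;                     # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return unpack 'H*', $bytes;
}

# :crlf on top: LF becomes CR LF first, then both characters are encoded.
my $crlf_on_top = bytes_written(':raw:encoding(UTF-16LE):crlf', "A\n");

# :crlf below: the text is encoded first, then a lone 0x0d byte is
# inserted in front of the 0x0a byte of the encoded newline.
my $crlf_below  = bytes_written(':raw:crlf:encoding(UTF-16LE)', "A\n");

print "crlf on top of encoding: $crlf_on_top\n";   # 41000d000a00
print "crlf below encoding:     $crlf_below\n";    # 41000d0a00
```

    The odd byte count in the second case is what shifts every following
    character by one byte and produces the garbage seen in unlogical.txt.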

    So to conclude:
    open my $fh,">:raw:encoding(UTF-16):crlf:utf8","windows.txt";# working
    open $fh,">:raw:encoding(UTF-16)","unix.txt"; #working for unix

    open $fh,">:raw:encoding(UTF-16):crlf","logical.txt"; #Working in 5.10.0

    When I add "\x{343f}" to the string it still works well in the Editor,
    but my VIM7.2 won't read the files anymore. :(

    Marc "Maluku" Lucksch
     
    Marc Lucksch, Feb 17, 2009
    #4
  5. On 2009-02-17, Ben Morrow <> wrote:
    > (You presumably know you can use
    >
    > binmode STDOUT, ":raw:encoding(UCS-2):crlf";
    >
    > rather than three separate statements?)


    No, I did not. And adding :utf8 at the end fixes a warning as well.

    So now the question boils down to: how to make

    binmode STDOUT, ':encoding(UCS-2)';

    done on a filehandle which is in :crlf mode do the moral equivalent of

    binmode STDOUT, ":raw:encoding(UCS-2):crlf";

    And how would one easily switch the :crlf layer off on such a handle?
    Doing just `binmode' switches off the encoding as well; and my perl does
    not support :lf...

    (When this works, a lot of programs would magically start to work as expected.)

    >> Another indication is that
    >>
    >> piconv -t UCS-2
    >>
    >> gives wrong results on DOSISH platforms (which is not surprising,
    >> since the version I have uses q(:encoding(UCS-2))).

    >
    > That's just a bug in piconv, then. It should binmode its filehandles if
    > it's writing potentially binary data.


    How would it know this? What are the semantics of binmode()? As usual,
    the documentation is close to useless:

    The directives alter the behaviour of the file handle.

    Thanks a lot!!! HOW do they alter the behaviour??? Is the intent to
    be incremental (:crlf does not change the encoding layers), or is the
    semantics to remove all the layers, and add the specified ones?
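
    One way to answer this empirically is to print the layer stack after
    each call with PerlIO::get_layers. The sketch below is only an
    experiment on a scratch file, not a statement about what the
    documentation promises; the helper name layers is made up, and the
    exact names printed may differ between platforms and perl versions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $name) = tempfile(UNLINK => 1);
close $tmp;
open my $fh, '>', $name or die "open $name: $!";

# Hypothetical helper: output-side layer stack of $fh as one string,
# bottom layer first.
sub layers { return join ' ', PerlIO::get_layers($fh, output => 1) }

my $initial = layers();

binmode $fh, ':encoding(UTF-16LE)';   # binmode with a layer argument
my $with_encoding = layers();

binmode $fh, ':crlf';                 # a second, separate call
my $with_crlf = layers();

binmode $fh;                          # plain binmode, no layer argument
my $after_plain = layers();

print "initial:        $initial\n";
print "+ :encoding:    $with_encoding\n";
print "+ :crlf:        $with_crlf\n";
print "plain binmode:  $after_plain\n";
close $fh;
```

    If the third line still lists the encoding layer, binmode with an
    argument is incremental; if the last line no longer lists it, plain
    binmode is the call that strips the stack.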

    >> For best results, I would prefer a solution which allows doing
    >>
    >> binmode STDOUT, q(:encoding(UCS-2));
    >>
    >> and
    >>
    >> binmode STDOUT, q(:crlf);
    >>
    >> in arbitrary order so that the result does not depend on the order
    >> (as now), but works ;-/.

    >
    > That would be... weird. It matters whether the LF->CRLF conversion is
    > done before or after the characters->UCS-2 conversion, since you get
    > different results. There's already more than enough weirdness in PerlIO
    > without adding more.


    No, I think this is not adding weirdness, but using it. IIRC, the
    'binmode' directive is passed through the layers, and they have a
    possibility to handle it.

    The wide-char encoding layer should notice that somebody wants to add
    :crlf, and should not let it pass through itself: the newly created
    layer should be anchored `before' the encoding layer.

    Yours,
    Ilya
     
    Ilya Zakharevich, Feb 18, 2009
    #5
  6. Marc Lucksch Guest

    Ilya Zakharevich wrote:
    > binmode STDOUT, ":raw:encoding(UCS-2):crlf";
    >
    > And how one would easily switch :crlf layer off on such a handle?
    > Doing just `binmode' switches off encoding as well; and my perl does
    > not support :lf...


    binmode STDOUT, ":pop"; #Removes the topmost layer

    See perldoc perlio:

    > :pop
    >
    > A pseudo layer that removes the top-most layer. Gives perl code a way
    > to manipulate the layer stack. Should be considered as experimental.
    > Note that :pop only works on real layers and will not undo the effects
    > of pseudo layers like :utf8 .
     
    Marc Lucksch, Feb 18, 2009
    #6
  7. On 2009-02-18 13:04, Marc Lucksch <> wrote:
    > Ilya Zakharevich wrote:
    >> binmode STDOUT, ":raw:encoding(UCS-2):crlf";
    >>
    >> And how one would easily switch :crlf layer off on such a handle?
    >> Doing just `binmode' switches off encoding as well; and my perl does
    >> not support :lf...

    >
    > binmode STDOUT, ":pop"; #Removes the topmost layer
    >
    > See perldoc perlio:
    >
    > > :pop
    > >
    > > A pseudo layer that removes the top-most layer. Gives perl code a way
    > > to manipulate the layer stack. Should be considered as experimental.
    > > Note that :pop only works on real layers and will not undo the effects

    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    > > of pseudo layers like :utf8 .

    ^^^^^^^^^^^^^^^^^^^^^^^^^^^

    :crlf is a pseudo-layer.
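
    Whether :pop actually takes :crlf off a given handle can be checked
    rather than argued about. This is a sketch on a scratch file, under
    the assumption that :crlf was pushed explicitly on top of the
    encoding layer; on a DOSISH perl, where :crlf is part of the default
    stack, the lists may well look different.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $name) = tempfile(UNLINK => 1);
close $tmp;

# Open with :crlf explicitly pushed above the encoding layer.
open my $fh, '>:raw:encoding(UTF-16LE):crlf', $name
    or die "open $name: $!";

my @before = PerlIO::get_layers($fh, output => 1);
binmode $fh, ':pop';                  # ask PerlIO to drop the top layer
my @after  = PerlIO::get_layers($fh, output => 1);

print "before :pop: @before\n";
print "after  :pop: @after\n";
close $fh;
```

    If crlf disappears from the second list while the encoding entry
    stays, :pop did what was asked for on that perl.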

    hp
     
    Peter J. Holzer, Feb 18, 2009
    #7
  8. On 2009-02-18, Ben Morrow <> wrote:
    >> > I must admit I've never had anything to do with creating Win32-format
    >> > text files. Generally the extent of my knowledge of such things is 'how
    >> > do I get this wretched OS to stop messing up my files?' :).

    >>
    >> This is one of the sides of the problem. *This* part can be fixed if
    >> :encoding(UCS2) would just remove :crlf flag.


    > Yes, but I don't think this will change either.


    Sorry, can't parse your sentence... (Removing :crlf has no bad side
    effects, and a lot of good ones [like some programs suddenly starting
    to produce non-junk ;-].)

    >> By definition, piconv does not work with binary files...


    > Yes it does.


    If you think so, your mental picture of text/vs/binary is IMO
    seriously screwed up... Not surprising if you claim little experience
    with DOSISH systems.

    > 'Text' as used by perl means 'some 8bit extension of ASCII,
    > with reasonably short lines delimited with the OS newline sequence, no
    > non-spacing control characters and no NULs'.


    Nope, `text' means non-`binary'. And `binary' means that preservation
    of the exact layout of bits is paramount. So changing the encoding of
    binary data is a no-no.

    > By this definition UCS-2 is
    > (and most wide encodings are) 'binary'. -B will return true on a UCS-2
    > file, for instance.


    -B makes no sense at all in today's world. Anyway, even when it did,
    it had no relation to :crlf etc.

    >> > While this is certainly possible, I somewhat doubt the behaviour of
    >> > :encoding will be changed now.

    >>
    >> Given that its current behaviour makes no sense, why not?


    > Because perl 5.8 was released a good while ago, so people have been
    > working with and around the current behaviour for some time. Changing it
    > now would almost certainly break working code,


    I claim that removing :crlf can't break any code. I suspect that
    reinserting it on top of :encoding(UCS2) would also have no
    detrimental side effects, but to check this, one needs more knowledge
    about (IMO, completely botched) behaviour of PerlIO.

    In short, if "remove :crlf" worked AFTER "insert :encoding(UCS2)", one
    must make sure that it still does. (Well, given the current pitiful
    state of Perl, it might be that just `applying voodoo programming' may be
    a sufficient justification - just do several experiments on how Perl
    behaves without any inspection of source code...)

    > whereas a new layer on CPAN would not,


    .... and would not fix thousands of programs which do not work now...

    > and would allow those with old perls to get the new behaviour if
    > they want it.


    This does not make any sense to me. The problem is not with "those
    who want it" (there are too few of them), but with "those who need it"
    (read: anyone working on DOSISH, or writing code which can potentially
    be used on DOSISH platforms).

    > It's possible that some sort of flag to :encoding (or to 'use
    > PerlIO::encoding') would be OK.


    No. What is OK is to have correct behaviour. If having correct
    behaviour has a non-0 (but still negligible) chance of breaking old
    stuff, one should be able to request bug-for-bug compatibility by
    environment variable.

    Yours,
    Ilya
     
    Ilya Zakharevich, Feb 20, 2009
    #8