Problem handling a Unicode file

Discussion in 'Perl Misc' started by MoshiachNow, Aug 28, 2006.

  1. MoshiachNow

    MoshiachNow Guest

    Hi,

    I have a file that, when opened in Notepad, looks like this (a sample line):

    [HKEY_LOCAL_MACHINE\

    I know it is some kind of Unicode (I cannot figure out which one), because when
    I print the lines in Perl I get the following:

    [ H K E Y _ L O C A L _ M A C H I N E \

    I basically need to replace some strings inside the file, so I need to
    decode it from Unicode and eventually save it as Unicode again.
    I have tried the following:

    1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters

    2. my $STRING = decode("EBCDIC", $_); # no good, still prints spaces between characters

    Neither of these got me far.
    How do I achieve the above goals (after establishing the exact Unicode
    format)?
    Thanks
    MoshiachNow, Aug 28, 2006
    #1

  2. MoshiachNow wrote:
    > Hi,
    >
    > I have a file that, when opened in Notepad, looks like this (a sample line):
    >
    > [HKEY_LOCAL_MACHINE\
    >
    > I know it is some kind of Unicode (I cannot figure out which one), because when
    > I print the lines in Perl I get the following:
    >
    > [ H K E Y _ L O C A L _ M A C H I N E \
    >
    > I basically need to replace some strings inside the file, so I need to
    > decode it from Unicode and eventually save it as Unicode again.
    > I have tried the following:
    >
    > 1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters


    When Microsoft adopted Unicode it had not yet become clear that utf8
    was the "usual" encoding and they went for utf16le as their default
    encoding.

    open(FILE, "<:encoding(utf16le)", "Araxi.reg") or die $!;

    Actually, when reading you can leave out the 'le', as the BOM will tell
    Perl the byte-order.

    IIRC Windows puts a BOM on utf8 files too so it is in principle
    possible to open a file that could be latin1, utf8, utf16be or utf16le
    and infer the encoding.

    AFAIK there's no simple encoding() in Perl to do this as BOMed utf8
    post-dates the initial implementation of Unicode in Perl.
    Brian McCauley, Aug 28, 2006
    #2
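The BOM-based guessing mentioned in post #2 can be sketched roughly as follows. This is only an illustration, not something from the thread: sniff_bom and open_by_bom are hypothetical helper names, the fallback encoding is whatever the caller passes in, and UTF-32 (whose little-endian BOM also starts with FF FE) is deliberately ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to an Encode name.
# Returns (encoding name, BOM length in bytes), or nothing if no BOM is found.
sub sniff_bom {
    my ($head) = @_;
    return ('UTF-8',    3) if $head =~ /^\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $head =~ /^\xFF\xFE/;
    return ('UTF-16BE', 2) if $head =~ /^\xFE\xFF/;
    return;    # no BOM: could be latin1, BOM-less utf8, ...
}

# Hypothetical helper: open $path for reading with the layer its BOM suggests,
# falling back to $fallback when there is no BOM.
sub open_by_bom {
    my ($path, $fallback) = @_;
    open my $fh, '<:raw', $path or die "open '$path': $!";
    my $head = '';
    read $fh, $head, 3;    # short files simply yield fewer bytes
    my ($enc, $skip) = sniff_bom($head);
    ($enc, $skip) = ($fallback, 0) unless defined $enc;
    seek $fh, $skip, 0 or die "seek '$path': $!";    # position just past the BOM
    binmode $fh, ":encoding($enc)";
    return $fh;
}
```

A call such as open_by_bom('Araxi.reg', 'iso-8859-1') would then return a handle that yields decoded lines with the BOM already skipped.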

  3. MoshiachNow

    MoshiachNow Guest

    Thanks,

    did just that. It reads the file nicely.
    Then I want to replace strings in the file and write it back in utf16
    to Araxi2.reg.
    I use the code below, but the file does not look good in Notepad
    anymore, meaning the format is not exactly utf16 ...

    open (FILE,'<:encoding(utf16)',"Araxi.reg")
        || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
    open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";
    while (<FILE>) {
        print FILE1;
    }
    close FILE;
    close FILE1;

    open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
    #get old server name
    while (<FILE1>) {
        chomp;
        if (/Host/) {
            ($OLDNAME) = m/"Host"="(\w*-\w*)"/;
            #print "OLDNAME=$OLDNAME\n";
            $OLDNAME_SMALL = lc $OLDNAME;
            #print "OLDNAME_SMALL=$OLDNAME_SMALL\n";
            last;
        }
    }
    close FILE1;

    open (FILE,'>:encoding(utf16)',"Araxi2.reg")
        || die "Could not open Araxi2.reg: $!"; #CONVERT A UNICODE FILE TO ASCII
    open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
    while (<FILE1>) {
        s/$OLDNAME/$computer/; #replace capitals
        s/$OLDNAME_SMALL/$computer_small/; #replace small letter names
        print FILE "$_";
    }
    MoshiachNow, Aug 28, 2006
    #3
  4. Dr.Ruud

    Dr.Ruud Guest

    MoshiachNow wrote:

    > I use the code below,but the file does not look good in Notepad
    > anymore,meaning the format is not exactly utf16 ...
    >
    > open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open
    > Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
    > open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";


    You need to use the utf16le layer for the output too.

    #!/usr/bin/perl
    use warnings ;
    use strict ;

    my $fni = 'Araxi.reg' ;
    my $fno = 'Araxi1.reg' ;

    open my $fhi, '<:encoding(utf16)', $fni
    or die "open '$fni', stopped $!" ;

    open my $fho, '>:encoding(utf16)', $fno
    or die "open '$fno', stopped $!" ;


    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Aug 28, 2006
    #4
  5. MoshiachNow wrote:

    > open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open
    > Araxi.reg: $!"; #Read UNICODE FILE TO ASCII


    That comment is highly misleading. It should say "Read utf16 file into
    Unicode".

    The file is in utf16. The strings that are read from it are in Unicode.
    Actually Perl will internally represent the strings in utf8, but
    conceptually they are just Unicode. One thing they certainly are not is
    ASCII. Of course if the data happens to contain no characters beyond
    0x7F then the internal representation of the Unicode string will be
    identical to the equivalent ASCII string.
    Brian McCauley, Aug 28, 2006
    #5
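The distinction drawn in post #5 between the bytes in the file and the characters in the string can be seen with the core Encode module. The following is a small sketch; the sample bytes are made up to resemble the start of the registry line from the first post, without a BOM.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they would sit in a UTF-16LE file (BOM omitted): every ASCII
# character is followed by a NUL byte, which shows up as a gap when printed.
my $bytes = "[\0H\0K\0E\0Y\0";

# Decoding turns the byte string into a string of Unicode characters.
my $chars = decode('UTF-16LE', $bytes);

printf "%d bytes, %d characters\n", length $bytes, length $chars;

# Encoding goes the other way and restores the original bytes.
my $again = encode('UTF-16LE', $chars);
print $again eq $bytes ? "round trip ok\n" : "round trip differs\n";
```

The "spaces" seen when printing the undecoded lines are these NUL bytes, and any search and replace should be done on the decoded character string.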
  6. On 2006-08-28 10:11, MoshiachNow <> wrote:
    > Thanks,
    >
    > did just that.Reads the file nicely.
    > Then I want to replace strings in the file and write it back in utf16
    > to Araxi2.reg.
    > I use the code below,but the file does not look good in Notepad
    > anymore,meaning the format is not exactly utf16 ...


    Notepad needs the BOM at the beginning of the file to recognize it
    is UTF16, so you have to write that:

    > open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open

    print FILE "\x{FEFF}";

    or, if you prefer symbolic names:

    use charnames ':short';
    ....
    print FILE "\N{BOM}";


    hp


    --
    _ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
    |_|_) | Sysadmin WSR | > ist?
    | | | | Was sonst wäre der Sinn des Erfindens?
    __/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd
    Peter J. Holzer, Aug 28, 2006
    #6
  7. Dr.Ruud

    Dr.Ruud Guest

    Peter J. Holzer wrote:

    > Notepad needs the BOM at the beginning of the file to recognize it
    > is UTF16, so you have to write that:


    With "encoding(UTF-16)", the IO-layer takes care of that. But then you
    leave it up to Perl (Encode::perlIO?) to choose between UTF-16LE and
    UTF-16BE. See also perldoc Encode::Unicode.


    At opening, the file is 0 bytes, but after printing a single space, it
    becomes 4 bytes, with the first two holding the BOM:

    #!/usr/bin/perl
    use warnings ;
    use strict ;

    my $fni = 'Araxi.reg' ;
    my $fno = 'Araxi1.reg' ;

    open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;

    open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

    print $fho ' ' ;
    __END__

    Dr.Ruud, Aug 28, 2006
    #7
  8. MoshiachNow

    MoshiachNow Guest

    Thanks,

    I did exactly as advised above, but checking the output in a binary
    display, I see that all bytes are interchanged within the words - see
    below.
    I have also tried "utf16-LE"; this did not help.

    Good utf16 input file:
    FF FE 57 00 69 00 6E 00

    Bad output file:
    FE FF 00 57 00 69 00 6E

    (Indeed, the print FILE "\x{FEFF}"; statement does not look like it is
    required, since it is taken care of internally by Perl.)

    So what can still be wrong?
    MoshiachNow, Aug 29, 2006
    #8
  9. Dr.Ruud

    Dr.Ruud Guest

    MoshiachNow wrote:

    > all bytes are interchanged within the words


    That is the UTF16-LE order, so it would have been wrong if you would
    have seen something else. Do you understand the role of the BOM (Byte
    Order Mark) now?
    http://en.wikipedia.org/wiki/Byte_Order_Mark

    Create a fresh file in Notepad with just the word "test" in it, and do a
    File/Save As..., with Encoding "Unicode", and you'll see that Windows
    defaults to UTF16-LE.

    You'll also find an Encoding "Unicode big-endian" there, that is
    UTF16-BE. But why would you want the bytes in a different order than the
    default for the platform?

    Dr.Ruud, Aug 29, 2006
    #9
  10. MoshiachNow

    MoshiachNow Guest

    Hi,

    I do run exactly this :
    open my $fhi, '<:encoding(UTF-16)', $fni
    or die "open '$fni', stopped $!" ;


    open my $fho, '>:encoding(UTF-16)', $fno
    or die "open '$fno', stopped $!" ;

    and expect input and output files to be in the same order,but they are
    not.

    I DID try adding the following line,it did not help:

    print $fho "\x{FEFF}";
    MoshiachNow, Aug 29, 2006
    #10
  11. Dr.Ruud

    Dr.Ruud Guest

    MoshiachNow wrote:

    > I do run exactly this :
    > open my $fhi, '<:encoding(UTF-16)', $fni
    > or die "open '$fni', stopped $!" ;
    >
    >
    > open my $fho, '>:encoding(UTF-16)', $fno
    > or die "open '$fno', stopped $!" ;
    >
    > and expect input and output files to be in the same order


    Why do you expect that? At input, the BOM rules. At output, the platform
    rules.

    Dr.Ruud, Aug 29, 2006
    #11
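The byte order chosen on output can be checked without touching any file. The Encode::Unicode documentation describes the suffix-less name "UTF-16" as producing big-endian output with a BOM prepended when encoding, independent of the platform, which matches the "bad" dump in post #8. A minimal sketch (hexdump is a helper made up for this illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hexdump { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# "UTF-16" without a suffix: Encode prepends a BOM and writes big-endian.
my $generic = hexdump(encode('UTF-16', 'Win'));

# "UTF-16LE": little-endian, with no BOM unless the string itself carries one.
my $le = hexdump(encode('UTF-16LE', "\x{FEFF}Win"));

print "UTF-16:   $generic\n";    # FE FF 00 57 00 69 00 6E
print "UTF-16LE: $le\n";         # FF FE 57 00 69 00 6E 00
```

So to get the byte order Notepad writes by default, the output encoding has to be named explicitly as UTF-16LE and the BOM has to be supplied as the first character.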
  12. MoshiachNow

    MoshiachNow Guest

    Thanks a lot.
    Read the article, got the idea of the BOM.

    The only thing that got me a valid output file was:

    open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg")
        || die "Could not open Araxi.reg: $!";
    print FILE "\x{FEFF}";

    Any other sequence does not work well.

    Thanks !
    MoshiachNow, Aug 29, 2006
    #12
  13. Dr.Ruud

    Dr.Ruud Guest

    MoshiachNow wrote:

    > The only thing that got me a valid output file was:
    >
    > open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg") || die "Could not
    > open Araxi.reg: $!";
    > print FILE "\x{FEFF}";


    That layout really hurts my eyes. Next time, quote something of the
    article that you reply to, or provide a "> [short summary]".

    #!/usr/bin/perl
    use warnings ;
    use strict ;
    use charnames ':short' ;

    my ($fni, $ei) = ('Araxi.reg' , ':encoding(utf16)') ;
    my ($fno, $eo) = ('Araxi1.reg', ':raw:encoding(utf16le)') ;

    open my $fhi, "<$ei", $fni or die "open '$fni': $!" ;
    open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    print $fho "\N{BOM}" ;

    print $fho "test\n" ;

    # ... etc.


    Your ":raw" is a good solution.
    I tried "binmode $fho" instead, but got a "Wide character print"
    warning. So I put a "use utf8" near the top, but then the BOM was output
    as utf8, it looks like $fho's IO-layer was ignored. A "binmode $fho,
    ':encoding(utf16le)'" might work too, but I am converted to ":raw" now,
    thanks.

    Dr.Ruud, Aug 29, 2006
    #13
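Putting the pieces from posts #12 and #13 together, the whole read, replace and write cycle might look like the sketch below. It is an illustration under assumptions: rewrite_reg, the temporary file names and the host names OLD-PC and NEW-PC are invented here, and the intermediate ASCII file from post #3 is dropped because the decoded lines can be edited directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy $src to $dst, replacing $old with $new,
# and keep the result as UTF-16LE with a BOM.
sub rewrite_reg {
    my ($src, $dst, $old, $new) = @_;

    # Input: the BOM selects the byte order and is consumed by the layer.
    open my $in, '<:raw:encoding(UTF-16)', $src or die "open '$src': $!";

    # Output: explicit little-endian, so the BOM is written by hand.
    open my $out, '>:raw:encoding(UTF-16LE)', $dst or die "open '$dst': $!";
    print {$out} "\x{FEFF}";

    while (my $line = <$in>) {
        $line =~ s/\Q$old\E/$new/g;    # \Q...\E: match the old name literally
        print {$out} $line;            # line endings pass through unchanged
    }
    close $in;
    close $out or die "close '$dst': $!";
}

# Demo on a throwaway input file (the contents are made up).
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.reg", "$dir/out.reg");
open my $seed, '>:raw:encoding(UTF-16LE)', $src or die "open '$src': $!";
print {$seed} "\x{FEFF}\"Host\"=\"OLD-PC\"\r\n";
close $seed or die "close '$src': $!";

rewrite_reg($src, $dst, 'OLD-PC', 'NEW-PC');
```

Because both handles use :raw and no :crlf layer, the "\r\n" pairs in the input are ordinary characters that are copied through untouched, which sidesteps the line-ending question raised later in the thread.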
  14. On 2006-08-29 11:40, Dr.Ruud <> wrote:
    > MoshiachNow schreef:
    >> all bytes are interchanged within the words

    >
    > That

    ^^^^
    Could you quote what you mean by "that"? It makes your posting a
    bit hard to understand.

    > is the UTF16-LE order,


    Nope. The sequence MoshiachNow called "bad" is UTF16-BE.

    [...]
    > You'll also find an Encoding "Unicode big-endian" there, that is
    > UTF16-BE. But why would you want the bytes in a different order than
    > the default for the platform?


    He doesn't. He wants UTF16-LE (what he labeled "good input file") but
    gets UTF16-BE instead.

    hp


    Peter J. Holzer, Aug 29, 2006
    #14
  15. Dr.Ruud

    Dr.Ruud Guest

    Peter J. Holzer wrote:
    > Dr.Ruud:
    >> MoshiachNow:


    >>> all bytes are interchanged within the words

    >>
    >> That

    > ^^^^
    > Could you quote what you mean by "that"? It makes your posting a
    > bit hard to understand.
    >
    >> is the UTF16-LE order,

    >
    > Nope. The sequence MoshiachNow called "bad" is UTF16-BE.


    Sorry for the confusion. My "That" was only the quoted phrase itself
    (and not the meaning that it had in the original posting), to express
    that the interchanged bytes from C<print "\x{FEFF}"> to (binary display)
    "FF FE" was the thing to go for.



    > [...]
    >> You'll also find an Encoding "Unicode big-endian" there, that is
    >> UTF16-BE. But why would you want the bytes in a different order than
    >> the default for the platform?

    >
    > He doesn't. He wants UTF16-LE (what he labeled "good input file") but
    > gets UTF16-BE instead.


    Yes, I mixed up there, I think because I couldn't understand why he
    didn't just go for ':encoding(UTF16)'.


    Sidenote:

    #!/usr/bin/perl
    # Script-ID: utf16.pl
    use warnings ;
    use strict ;

    my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
    open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    print $fho "\n" ;
    __END__

    results in a 5 byte file (Windows, Perl 5.8.8):
    FE FF 00 0D 0A

    Anyone knows a good reason for why that doesn't result in:
    FE FF 00 0D 00 0A
    ?
    (I understand how it happens, but the "why" escapes me.)

    With
    ':raw:encoding(UTF16)'
    and
    print $fho "\r\n"
    one can produce the "right" output of course.

    Dr.Ruud, Aug 30, 2006
    #15
  16. On 2006-08-30 00:23, Dr.Ruud <> wrote:
    > Sidenote:
    >
    > #!/usr/bin/perl
    > # Script-ID: utf16.pl
    > use warnings ;
    > use strict ;
    >
    > my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
    > open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    > print $fho "\n" ;
    > __END__
    >
    > results in a 5 byte file (Windows, Perl 5.8.8):
    > FE FF 00 0D 0A
    >
    > Anyone knows a good reason for why that doesn't result in:
    > FE FF 00 0D 00 0A
    > ?
    > (I understand how it happens, but the "why" escapes me.)


    I think the "why" is a simple bug.

    > With
    > ':raw:encoding(UTF16)'
    > and
    > print $fho "\r\n"
    > one can produce the "right" output of course.


    It looks like the :crlf layer is applied in the wrong place (after
    :encoding(UTF16) instead of before).

    my ($fno, $eo) = ('utf16.txt', ':encoding(UTF-16):crlf') ;
    open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    print $fho "\n" ;

    also produces the right result (for Windows) on Linux, so I guess

    my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
    open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    print $fho "\n" ;

    should work on Windows (don't have a Windows machine at hand to test
    it).

    hp


    Peter J. Holzer, Aug 30, 2006
    #16
  17. Dr.Ruud

    Dr.Ruud Guest

    Peter J. Holzer wrote:
    > Dr.Ruud:


    >> Sidenote:
    >>
    >> #!/usr/bin/perl
    >> # Script-ID: utf16.pl
    >> use warnings ;
    >> use strict ;
    >>
    >> my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
    >> open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    >> print $fho "\n" ;
    >> __END__
    >>
    >> results in a 5 byte file (Windows, Perl 5.8.8):
    >> FE FF 00 0D 0A
    >>
    >> Anyone knows a good reason for why that doesn't result in:
    >> FE FF 00 0D 00 0A
    >> ?
    >> (I understand how it happens, but the "why" escapes me.)

    >
    > I think the "why" is a simple bug.


    Yes, I'll report it. (ticket #40255)


    > I guess
    >
    > my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
    > open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
    > print $fho "\n" ;
    >
    > should work on Windows (don't have a Windows machine at hand to test
    > it).


    Yes, that writes the "platform-proper" 6 bytes.

    Dr.Ruud, Aug 30, 2006
    #17
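The effect of layer order discussed in posts #15 to #17 can be observed by writing a single "\n" through different layer stacks and dumping the resulting bytes. This is a sketch only: bytes_for_newline is a made-up helper, and UTF-16LE is used instead of UTF-16 so that no BOM appears in the dumps.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "\n" through the given layers to a temporary
# file and return the bytes that ended up on disk as space-separated hex.
sub bytes_for_newline {
    my ($layers) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open my $out, ">$layers", $path or die "open '$path': $!";
    print {$out} "\n";
    close $out or die "close '$path': $!";
    open my $in, '<:raw', $path or die "open '$path': $!";
    my $raw = do { local $/; <$in> };
    close $in;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $raw;
}

# :crlf above the encoding: "\n" becomes "\r\n" first, then both get encoded.
print bytes_for_newline(':raw:encoding(UTF-16LE):crlf'), "\n";

# No :crlf at all: the newline is encoded as a single 16-bit code unit.
print bytes_for_newline(':raw:encoding(UTF-16LE)'), "\n";
```

The leading :raw matters on Windows, where it removes the default :crlf layer that would otherwise sit below the encoding and insert a lone 0D byte into the already encoded stream, as in the 5-byte file from post #15.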
