Problem handling a Unicode file

MoshiachNow

Hi,

I have a file that, when opened in Notepad, looks like this (a sample line):

[HKEY_LOCAL_MACHINE\

I know it is some type of Unicode (I cannot figure out which one), since
when I print the lines in Perl I get the following:

[ H K E Y _ L O C A L _ M A C H I N E \

I basically need to replace some strings inside the file, so I need to
decode it from Unicode and eventually save it as Unicode again.
I have tried the following:

1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters

2. my $STRING = decode("EBCDIC", $_); # no good, still prints spaces between characters

None of this got me far.
How do I achieve the above goals (after establishing the exact Unicode
format)?
Thanks
 
Brian McCauley

MoshiachNow said:
1. open(FILE, ":utf8", "Araxi.reg"); # no good, still prints spaces between characters

When Microsoft adopted Unicode it had not yet become clear that utf8
was the "usual" encoding and they went for utf16le as their default
encoding.

open(FILE, "<:encoding(utf16le)", "Araxi.reg") or die $!;

Actually you can leave out the 'le', as the BOM will tell Perl the
byte order.

IIRC Windows puts a BOM on utf8 files too so it is in principle
possible to open a file that could be latin1, utf8, utf16be or utf16le
and infer the encoding.

AFAIK there's no simple encoding() in Perl to do this, as BOMed utf8
post-dates the initial implementation of Unicode in Perl.
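
A minimal sketch of that kind of inference, assuming the file either starts with a UTF-8 or UTF-16 BOM or is otherwise treated as latin1. The helper names guess_encoding_from_bom and open_with_bom_detection are made up for illustration; they are not part of Perl or Encode. UTF-32 BOMs are deliberately not handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the first bytes of a file.
# Returns (encoding name, BOM length in bytes); falls back to latin1.
sub guess_encoding_from_bom {
    my ($head) = @_;
    return ('UTF-8',      3) if $head =~ /^\xEF\xBB\xBF/;
    return ('UTF-16LE',   2) if $head =~ /^\xFF\xFE/;
    return ('UTF-16BE',   2) if $head =~ /^\xFE\xFF/;
    return ('iso-8859-1', 0);
}

# Hypothetical helper: open raw, peek at the BOM, skip it, then push the
# matching :encoding layer so later reads return decoded characters.
sub open_with_bom_detection {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open '$path': $!";
    read $fh, my $head, 3;
    $head = '' unless defined $head;
    my ($enc, $bom_len) = guess_encoding_from_bom($head);
    seek $fh, $bom_len, 0 or die "seek '$path': $!";
    binmode $fh, ":encoding($enc)" or die "binmode '$path': $!";
    return ($fh, $enc);
}

# Try the classifier on a few byte strings.
for my $sample ("\xFF\xFE\x57\x00", "\xFE\xFF\x00\x57", "\xEF\xBB\xBFW", "Win") {
    my ($enc, $bom_len) = guess_encoding_from_bom($sample);
    print "$enc (BOM length $bom_len)\n";
}
```

Because the handle is opened with :raw, no :crlf translation is applied, so on Windows the lines read this way still end in "\r\n".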
 
MoshiachNow

Thanks,

I did just that. It reads the file nicely.
Then I want to replace strings in the file and write it back in utf16
to Araxi2.reg.
I use the code below, but the file does not look good in Notepad
anymore, meaning the format is not exactly utf16 ...

open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE>) {
    print FILE1;
}
close FILE;
close FILE1;

open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
#get old server name
while (<FILE1>) {
    chomp;
    if (/Host/) {
        ($OLDNAME) = m/"Host"="(\w*-\w*)"/;
        #print "OLDNAME=$OLDNAME\n";
        $OLDNAME_SMALL = lc $OLDNAME;
        #print "OLDNAME_SMALL=$OLDNAME_SMALL\n";
        last;
    }
}
close FILE1;

open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open Araxi2.reg: $!"; #CONVERT A UNICODE FILE TO ASCII
open (FILE1,"Araxi1.reg") || die "Could not open Araxi1.reg: $!";
while (<FILE1>) {
    s/$OLDNAME/$computer/; #replace capitals
    s/$OLDNAME_SMALL/$computer_small/; #replace small letters names
    print FILE "$_";
}
 
Dr.Ruud

MoshiachNow schreef:
I use the code below,but the file does not look good in Notepad
anymore,meaning the format is not exactly utf16 ...

open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open
Araxi.reg: $!"; #Read UNICODE FILE TO ASCII
open (FILE1,">Araxi1.reg") || die "Could not open Araxi1.reg: $!";

You need to use the utf16le layer for the output too.

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(utf16)', $fni
or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(utf16)', $fno
or die "open '$fno', stopped $!" ;
 
Brian McCauley

MoshiachNow said:
open (FILE,'<:encoding(utf16)',"Araxi.reg") || die "Could not open
Araxi.reg: $!"; #Read UNICODE FILE TO ASCII

That comment is highly misleading. It should say "Read utf16 file into
Unicode".

The file is in utf16. The strings that are read from it are in Unicode.
Actually Perl will internally represent the strings in utf8, but
conceptually they are just Unicode. One thing they certainly are not is
ASCII. Of course, if the data happens to contain no characters beyond
0x7F, then the internal representation of the Unicode string will be
identical to the equivalent ASCII string.
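
The distinction between encoded bytes and decoded characters can be seen with the core Encode module. This is only an illustrative sketch: the byte string is typed in by hand (a little-endian BOM followed by "Win"), and the variable names are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Eight bytes as they would sit on disk: BOM FF FE, then "Win" in UTF-16LE.
my $utf16_bytes = "\xFF\xFE\x57\x00\x69\x00\x6E\x00";

# decode() turns bytes into a string of Unicode characters; with plain
# 'UTF-16' the BOM selects the byte order and is not part of the result.
my $chars = decode('UTF-16', $utf16_bytes);

printf "%d bytes on disk, %d characters in Perl\n",
    length($utf16_bytes), length($chars);

# All code points here are <= 0x7F, so encoding the characters as UTF-8
# or as ASCII yields the very same bytes.
my $same_bytes = encode('UTF-8', $chars) eq encode('ascii', $chars);
print "UTF-8 and ASCII bytes identical: ", ($same_bytes ? "yes" : "no"), "\n";
```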
 
Peter J. Holzer

Thanks,

I did just that. It reads the file nicely.
Then I want to replace strings in the file and write it back in utf16
to Araxi2.reg.
I use the code below, but the file does not look good in Notepad
anymore, meaning the format is not exactly utf16 ...

Notepad needs the BOM at the beginning of the file to recognize that it
is UTF16, so you have to write that:
open (FILE,'>:encoding(utf16)',"Araxi2.reg") || die "Could not open Araxi2.reg: $!";
print FILE "\x{FEFF}";

or, if you prefer symbolic names:

use charnames ':short';
....
print FILE "\N{BOM}";


hp
 
Dr.Ruud

Peter J. Holzer schreef:
Notepad needs the BOM at the beginning of the file to recognize it
is UTF16, so you have to write that:

With "encoding(UTF-16)", the IO layer takes care of that. But then you
leave it up to Perl (Encode::PerlIO?) to choose between UTF-16LE and
UTF-16BE. See also perldoc Encode::Unicode.


At opening, the file is 0 bytes, but after printing a single space, it
becomes 4 bytes, with the first two holding the BOM:

#!/usr/bin/perl
use warnings ;
use strict ;

my $fni = 'Araxi.reg' ;
my $fno = 'Araxi1.reg' ;

open my $fhi, '<:encoding(UTF-16)', $fni
or die "open '$fni', stopped $!" ;

open my $fho, '>:encoding(UTF-16)', $fno
or die "open '$fno', stopped $!" ;

print $fho ' ' ;
__END__
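
The same thing can be checked without touching any files, by encoding a single space in memory and dumping the bytes. This is a small sketch using the core Encode module; hex_bytes is a made-up helper name. According to perldoc Encode::Unicode, encoding to plain UTF-16 prepends a BOM and writes big-endian, while UTF-16LE and UTF-16BE write no BOM at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $space   = ' ';
my $bom_and = "\x{FEFF} ";    # U+FEFF followed by a space

my $utf16      = hex_bytes(encode('UTF-16',   $space));    # Encode adds the BOM
my $utf16le    = hex_bytes(encode('UTF-16LE', $space));    # no BOM is added
my $le_and_bom = hex_bytes(encode('UTF-16LE', $bom_and));  # BOM written by hand

print "UTF-16              : $utf16\n";
print "UTF-16LE            : $utf16le\n";
print "UTF-16LE, manual BOM: $le_and_bom\n";
```

The third line is the combination that Notepad's "Unicode" encoding corresponds to: an explicit little-endian layer plus a U+FEFF printed first.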
 
MoshiachNow

Thanks,

I did exactly as advised above, but checking the output in a binary
display, I see that the two bytes within each 16-bit word are swapped
(see below).
I have also tried "utf16-LE"; this did not help.

Good utf16 input file:
FF FE 57 00 69 00 6E 00

Bad output file:
FE FF 00 57 00 69 00 6E

(Indeed, the print FILE "\x{FEFF}"; statement does not seem to be
required, since Perl takes care of it internally.)

So what can still be wrong?
 
Dr.Ruud

MoshiachNow schreef:
all bytes are interchanged within the words

That is the UTF16-LE order, so it would have been wrong if you would
have seen something else. Do you understand the role of the BOM (Byte
Order Mark) now?
http://en.wikipedia.org/wiki/Byte_Order_Mark

Create a fresh file in Notepad with just the word "test" in it, and do a
File/Save As..., with Encoding "Unicode", and you'll see that Windows
defaults to UTF16-LE.

You'll also find an Encoding "Unicode big-endian" there, that is
UTF16-BE. But why would you want the bytes in a different order than the
default for the platform?
 
MoshiachNow

Hi,

I run exactly this:
open my $fhi, '<:encoding(UTF-16)', $fni
or die "open '$fni', stopped $!" ;


open my $fho, '>:encoding(UTF-16)', $fno
or die "open '$fno', stopped $!" ;

and expect the input and output files to be in the same byte order, but
they are not.

I DID try adding the following line; it did not help:

print $fho "\x{FEFF}";
 
Dr.Ruud

MoshiachNow schreef:
I do run exactly this :
open my $fhi, '<:encoding(UTF-16)', $fni
or die "open '$fni', stopped $!" ;


open my $fho, '>:encoding(UTF-16)', $fno
or die "open '$fno', stopped $!" ;

and expect input and output files to be in the same order

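Why do you expect that? At input, the BOM rules. At output, Encode picks
the byte order when you do not name one: plain UTF-16 is written
big-endian with a BOM (see perldoc Encode::Unicode), whatever byte order
the input file used.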
 
MoshiachNow

Thanks a lot.
I read the article and got the idea of the BOM.

The only thing that got me a valid output file was:

open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg") || die "Could not open Araxi.reg: $!";
print FILE "\x{FEFF}";

No other sequence worked well.

Thanks !
 
Dr.Ruud

MoshiachNow schreef:
The only thing that got me a valid output file was:

open (FILE, ">:raw:encoding(UTF16-LE)", "Araxi.reg") || die "Could not
open Araxi.reg: $!";
print FILE "\x{FEFF}";

That layout really hurts my eyes. Next time, quote something from the
article that you are replying to, or provide a "> [short summary]".

#!/usr/bin/perl
use warnings ;
use strict ;
use charnames ':short' ;

my ($fni, $ei) = ('Araxi.reg' , ':encoding(utf16)') ;
my ($fno, $eo) = ('Araxi1.reg', ':raw:encoding(utf16le)') ;

open my $fhi, "<$ei", $fni or die "open '$fni': $!" ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\N{BOM}" ;

print $fho "test\n" ;

# ... etc.


Your ":raw" is a good solution.
I tried "binmode $fho" instead, but got a "Wide character print"
warning. So I put a "use utf8" near the top, but then the BOM was output
as utf8, it looks like $fho's IO-layer was ignored. A "binmode $fho,
':encoding(utf16le)'" might work too, but I am converted to ":raw" now,
thanks.
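
Putting the pieces of the thread together, a complete read, replace and write cycle might look like the sketch below. It assumes the input has a BOM, uses :raw on both handles so that "\r\n" passes through as ordinary data, and writes the BOM by hand because the UTF-16LE layer does not add one. The sub name rewrite_utf16le, the temporary file names and the host names OLD-PC and NEW-PC are placeholders, not anything taken from the real Araxi.reg.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Replace $old with $new in a UTF-16 file, writing UTF-16LE with a BOM.
sub rewrite_utf16le {
    my ($in_path, $out_path, $old, $new) = @_;

    # Plain UTF-16 on input lets the BOM in the file decide the byte order.
    open my $in,  '<:raw:encoding(UTF-16)',   $in_path
        or die "open '$in_path': $!";
    open my $out, '>:raw:encoding(UTF-16LE)', $out_path
        or die "open '$out_path': $!";

    print {$out} "\x{FEFF}";            # write the BOM explicitly
    while (my $line = <$in>) {
        $line =~ s/\Q$old\E/$new/g;     # \Q..\E: match the old name literally
        print {$out} $line;             # "\r\n" endings are kept as data
    }
    close $in;
    close $out or die "close '$out_path': $!";
    return;
}

# Demo on a made-up one-line file in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.reg", "$dir/out.reg");

open my $fh, '>:raw', $src or die "open '$src': $!";
print {$fh} encode('UTF-16LE', "\x{FEFF}" . qq("Host"="OLD-PC"\r\n));
close $fh or die "close '$src': $!";

rewrite_utf16le($src, $dst, 'OLD-PC', 'NEW-PC');
print "wrote $dst\n";
```

Reading and writing in a single pass also removes the need for the intermediate Araxi1.reg file used in the earlier post.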
 
Peter J. Holzer

MoshiachNow schreef:

That
^^^^
Could you quote what you mean by "that"? It makes your posting a
bit hard to understand.
is the UTF16-LE order,

Nope. The sequence MoshiachNow called "bad" is UTF16-BE.

[...]
You'll also find an Encoding "Unicode big-endian" there, that is
UTF16-BE. But why would you want the bytes in a different order than
the default for the platform?

He doesn't. He wants UTF16-LE (what he labeled "good input file") but
gets UTF16-BE instead.

hp
 
Dr.Ruud

Peter J. Holzer schreef:
Dr.Ruud:
^^^^
Could you quote what you mean by "that"? It makes your posting a
bit hard to understand.


Nope. The sequence MoshiachNow called "bad" is UTF16-BE.

Sorry for the confusion. My "That" was only the quoted phrase itself
(and not the meaning that it had in the original posting), to express
that the interchanged bytes from C<print "\x{FEFF}"> to (binary display)
"FF FE" was the thing to go for.


[...]
You'll also find an Encoding "Unicode big-endian" there, that is
UTF16-BE. But why would you want the bytes in a different order than
the default for the platform?

He doesn't. He wants UTF16-LE (what he labeled "good input file") but
gets UTF16-BE instead.

Yes, I mixed up there, I think because I couldn't understand why he
didn't just go for ':encoding(UTF16)'.


Sidenote:

#!/usr/bin/perl
# Script-ID: utf16.pl
use warnings ;
use strict ;

my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;
__END__

results in a 5 byte file (Windows, Perl 5.8.8):
FE FF 00 0D 0A

Does anyone know a good reason why that doesn't result in:
FE FF 00 0D 00 0A
?
(I understand how it happens, but the "why" escapes me.)

With
':raw:encoding(UTF16)'
and
print $fho "\r\n"
one can produce the "right" output of course.
 
Peter J. Holzer

Sidenote:

#!/usr/bin/perl
# Script-ID: utf16.pl
use warnings ;
use strict ;

my ($fno, $eo) = ('utf16.txt', ':encoding(UTF16)') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;
__END__

results in a 5 byte file (Windows, Perl 5.8.8):
FE FF 00 0D 0A

Anyone knows a good reason for why that doesn't result in:
FE FF 00 0D 00 0A
?
(I understand how it happens, but the "why" escapes me.)

I think the "why" is a simple bug.
With
':raw:encoding(UTF16)'
and
print $fho "\r\n"
one can produce the "right" output of course.

It looks like the :crlf layer is applied in the wrong place (after
:encoding(UTF16) instead of before).

my ($fno, $eo) = ('utf16.txt', ':encoding(UTF-16):crlf') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;

also produces the right result (for Windows) on Linux, so I guess

my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;

should work on Windows (don't have a Windows machine at hand to test
it).

hp
 
Dr.Ruud

Peter J. Holzer schreef:
Dr.Ruud:

I think the "why" is a simple bug.

Yes, I'll report it. (ticket #40255)

I guess

my ($fno, $eo) = ('utf16.txt', ':raw:encoding(UTF-16):crlf') ;
open my $fho, ">$eo", $fno or die "open '$fno': $!" ;
print $fho "\n" ;

should work on Windows (don't have a Windows machine at hand to test
it).

Yes, that writes the "platform-proper" 6 bytes.
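
For anyone who wants to check the byte count on their own system, here is a small sketch that writes a single "\n" through that layer stack into a temporary file and dumps what landed on disk. The helper name bytes_written is made up, and File::Temp is used only to avoid leaving a file behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write $text through the given layers and return
# the resulting file contents as space-separated hex pairs.
sub bytes_written {
    my ($layers, $text) = @_;
    my $dir  = tempdir(CLEANUP => 1);
    my $path = "$dir/utf16.txt";

    open my $out, ">$layers", $path or die "open '$path': $!";
    print {$out} $text;
    close $out or die "close '$path': $!";

    open my $in, '<:raw', $path or die "open '$path': $!";
    my $bytes = do { local $/; <$in> };    # slurp the whole file
    close $in;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# :crlf sits above :encoding, so "\n" becomes "\r\n" before it is encoded.
my $hex = bytes_written(':raw:encoding(UTF-16):crlf', "\n");
print "$hex\n";
```

With :crlf on top, the newline is expanded while the data is still characters, so both the CR and the LF are encoded as full 16-bit units.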
 
