UTF16 and Control M's

Eileen

Hi,

I have a text file with CTRL-M's. It is encoded as UTF16. When I try
to search for a string in this file, nothing is found. If I remove the
control-m's in vi, my search works. However, I cannot get the
control-m's to be removed using Perl. I've tried:

my $file= "myfile.xml";
while (<IN>) {
s/\cM//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\x{0x0D00}//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\^M//g;
}

and

while (<IN>) {
s/\cM//g;
}

all to no avail. I've tried it on Unix perl as well as Windows perl.
Again, I can remove the characters with vi (using s/^V^M//g).

Does anyone have any ideas on what to do? If I convert the file to
UTF8, the substitution and subsequent searches work. However, I have
several hundred files to deal with, and they are all encoded as UTF16.

Thanks,

Eileen
 
Alan J. Flavell

Sorry, I left out the first part of the script. Here's the full
script:

#!/usr/local/bin/perl -w

We also recommend "use strict;" around here. Take advantage of all of
Perl's opportunities for helping you identify mistakes.

$file = "kono.xml";
^
my

open (IN, $file) or die "cannot open $file\n";

Don't omit "$!" from the error report: it helps to understand the
reason for the failure.

I didn't realize you could specify the encoding of a file in Perl.

Another good reason to [check that you're using at least version
5.8.0 and] take a few moments out to read the introduction to the
new support for Unicode. (In earlier Perls you'd need to explicitly
invoke the relevant module to do this stuff).
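
A minimal sketch of that approach, assuming Perl 5.8 or later. The file name and sample content are made up for illustration; the script first writes a small UTF-16 file (with a BOM and CRLF line ends) so that it can run on its own, then reads it back through an encoding layer and strips the carriage returns:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical file name, for illustration only.
my $file = "demo-utf16.xml";

# Write a small UTF-16LE sample with a BOM and CRLF line ends,
# so that this sketch does not depend on any existing file.
open(my $out, '>:raw', $file) or die "cannot create $file: $!\n";
print {$out} encode('UTF-16LE', "\x{FEFF}<a>one</a>\r\n<b>two</b>\r\n");
close($out) or die "cannot close $file: $!\n";

# Read it back as text. The encoding layer decodes UTF-16 (the BOM
# selects the byte order), so the loop sees characters, not byte pairs.
open(my $in, '<:raw:encoding(UTF-16)', $file)
    or die "cannot open $file: $!\n";

my @lines;
while (my $line = <$in>) {
    $line =~ s/\cM//g;    # CR is now the single character U+000D
    push @lines, $line;
}
close($in);
unlink $file;

# An ordinary search now works on the decoded text.
my $hits = grep { /two/ } @lines;
print "$hits line(s) matched\n";
```

The ":raw" before ":encoding(UTF-16)" is there so that no CRLF translation is applied to the undecoded bytes on Windows. Note also that "$!" appears in every die message.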

The \x{0x0D00} was identified by one of my Unicode editors, and was a
stab in the dark on my part :)

But what have you learned from the experience?

- if you are reading text, and have properly defined the encoding,
then internally your characters can be referenced by their unicode
code point values, _not_ by their externally-encoded bit patterns.

- if, on the other hand, you are reading the data as a bunch of bytes
(i.e. effectively "as binary") then you'd need to handle the byte-pairs
as byte-pairs, not as unicode characters. This is not to be
recommended in current versions of Perl (unless your data is somehow
defective, and you have to write a fixup routine of some kind).

- the new notation e.g. \x{263a} denotes a _wide unicode character_ in
Perl's native unicode representation. That value is the Unicode code
point (in this case the smiley, "U+263a" as the Unicode Consortium's
notation would write it). Don't confuse it with the external coding
representation, which (_if_ you had read utf-16LE coding in binary
format, which I don't recommend) would have been \x3a\x26.
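
The three points above can be sketched in a few lines, assuming the Encode module that ships with Perl 5.8 and later. The byte string is a made-up sample (a smiley followed by CR and LF, as it would be stored in utf-16LE); the variable names are only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample: U+263A, CR, LF as they sit in a utf-16LE file.
my $bytes = "\x3a\x26\x0d\x00\x0a\x00";

# Read "as binary", no element of the string is the code point U+263A.
my $raw_hit = ($bytes =~ /\x{263a}/) ? 1 : 0;

# Stripping CR from the raw bytes removes only one byte of the pair,
# which leaves an odd byte count and misaligns everything after it.
(my $damaged = $bytes) =~ s/\cM//g;
my $damaged_len = length $damaged;

# Decoded first, each character is addressed by its code point.
my $text = decode('UTF-16LE', $bytes);
my $text_hit = ($text =~ /\x{263a}/) ? 1 : 0;
(my $clean = $text) =~ s/\cM//g;

printf "raw match: %d, decoded match: %d\n", $raw_hit, $text_hit;
printf "damaged bytes: %d, clean characters: %d\n",
    $damaged_len, length $clean;
```

The same substitution that damages the byte string works as intended on the decoded text, which is why defining the encoding up front is the easier route.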

hope this helps

(You'd also be advised to take a read of
http://web.presby.edu/~nnqadmin/nnq/nquote.html )


P.S. I have the impression that the regulars around here have nominated
me by default as the character encoding spokesman. I must admit that
I'm sometimes at the edge of my expertise, so I _do_ hope they're
watching closely, and will pounce as necessary if I say something
wrong or explain it badly...
 
