UTF16 and Control M's

Eileen · Jul 2, 2003

Hi,

I have a text file with CTRL-M's. It is encoded as UTF16. When I try
to search for a string in this file, nothing is found. If I remove the
control-m's in vi, my search works. However, I cannot get the
control-m's to be removed using Perl. I've tried:

my $file= "myfile.xml";
while (<IN>) {
s/\cM//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\x{0x0D00}//g;
}

and

my $file= "myfile.xml";
while (<IN>) {
s/\^M//g;
}

and

while (<IN>) {
s/\cM//g;
}

all to no avail. I've tried it on Unix perl as well as Windows perl.
Again, I can remove the characters with vi (using s/^V^M//g).

Does anyone have any ideas on what to do? If I convert the file to
UTF8, the substitution and subsequent searches work. However, I have
several hundred files to deal with, and they are all encoded as UTF16.

Thanks,

Eileen

Michael P. Broida · Jul 2, 2003

The Ctrl-M is a "carriage-return" which is \r in Perl.

Mike

Alan J. Flavell · Jul 3, 2003

The Ctrl-M is a "carriage-return" which is \r in Perl.

Beware of Usenauts bearing TOFU.

Alan J. Flavell · Jul 3, 2003

Sorry, I left out the first part of the script. Here's the full
script:

#!/usr/local/bin/perl -w

We also recommend "use strict;" around here. Take advantage of all of
Perl's opportunities for helping you identify mistakes.

$file = "kono.xml";
^
my

open (IN, $file) or die "cannot open $file\n";

Don't omit "$!" from the error report: it helps to understand the
reason for the failure.
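
Putting those suggestions together, a minimal sketch might look like this. The helper name open_for_reading is made up for illustration; the point is only the "use strict", the lexical "my", and the "$!" in the message:

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): open a file for
# reading, or die with a message that includes the system's reason.
sub open_for_reading {
    my ($path) = @_;
    # $! holds the reason for the failure, e.g. "No such file or directory"
    open(my $in, '<', $path) or die "cannot open $path: $!\n";
    return $in;
}
```

It would be called as: my $in = open_for_reading("kono.xml");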

I didn't realize you could specify the encoding of a file in Perl.

Another good reason to [check that you're using at least version
5.8.0 and] take a few moments out to read the introduction to the
new support for Unicode. (In earlier Perls you'd need to explicitly
invoke the relevant module to do this stuff).
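
For illustration, here is a sketch of specifying the encoding when opening the file, assuming Perl 5.8.0 or later. The helper name read_utf16_lines is invented for this example, and it assumes the file starts with a byte-order mark; without one, name "UTF-16LE" or "UTF-16BE" explicitly instead of "UTF-16".

```perl
use strict;
use warnings;

# Hypothetical helper: read a UTF-16 file (with a byte-order mark) and
# return its lines with the carriage returns removed.
sub read_utf16_lines {
    my ($path) = @_;
    # :raw keeps any CRLF translation away from the 16-bit data;
    # :encoding(UTF-16) decodes the bytes into characters as they are read.
    open(my $in, '<:raw:encoding(UTF-16)', $path)
        or die "cannot open $path: $!\n";
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/\r//g;    # matches now, because $line holds characters
        push @lines, $line;
    }
    close($in);
    return @lines;
}
```

Once the lines have been decoded this way, an ordinary string search on them should behave as it does on the UTF-8 copy of the file.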

the \x{0x0D00} was identified by one of my Unicode editors, and was a
stab in the dark on my part

But what have you learned from the experience?

- if you are reading text, and have properly defined the encoding,
then internally your characters can be referenced by their unicode
code point values, _not_ by their externally-encoded bit patterns.

- if, on the other hand, you are reading the data as a bunch of bytes
(i.e. effectively "as binary") then you'd need to handle the byte-pairs
as byte-pairs, not as unicode characters. This is not to be
recommended in current versions of Perl (unless your data is somehow
defective, and you have to write a fixup routine of some kind).

- the new notation, e.g. \x{263a}, denotes a _wide unicode character_ in
Perl's native unicode representation. That value is the Unicode code
point (in this case the smiley, "U+263a" as the Unicode Consortium's
notation would write it). Don't confuse it with the external coding
representation, which (_if_ you had read utf-16LE coding in binary
format, which I don't recommend) would have been \x3a\x26.
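
The difference between code points and external byte patterns can be seen with the Encode module (bundled with Perl 5.8). This is only a sketch; the variable names are arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $smiley = "\x{263a}";                    # one character, code point U+263A
my $bytes  = encode("UTF-16LE", $smiley);   # external form: "\x3a\x26"

# length($smiley) is 1 (a character); length($bytes) is 2 (bytes).

# A carriage return is U+000D, so its UTF-16LE form is the byte pair 0D 00.
my $cr_bytes = encode("UTF-16LE", "\r");    # "\x0d\x00"

# Decoding goes the other way, from external bytes to characters.
my $text = decode("UTF-16LE", "a\x00\x0d\x00\x0a\x00");   # "a\r\n"
(my $clean = $text) =~ s/\r//g;             # "a\n"
```

So the 0D 00 that an editor shows for a UTF-16LE file is the external coding of the carriage return; after decoding it is simply \r (or \x{000D}).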

hope this helps

(You'd also be advised to take a read of
http://web.presby.edu/~nnqadmin/nnq/nquote.html )

P.S. I have the impression that the regulars around here have nominated
me by default as the character encoding spokesman. I must admit that
I'm sometimes at the edge of my expertise, so I _do_ hope they're
watching closely, and will pounce as necessary if I say something
wrong or explain it badly...
