UTF16 and Control M's

Discussion in 'Perl Misc' started by Eileen, Jul 2, 2003.

  1. Eileen

    Eileen Guest

    Hi,

    I have a text file with CTRL-M's. It is encoded as UTF16. When I try
    to search for a string in this file, nothing is found. If I remove the
    control-m's in vi, my search works. However, I cannot get the
    control-m's to be removed using Perl. I've tried:

    my $file= "myfile.xml";
    while (<IN>) {
    s/\cM//g;
    }

    and

    my $file= "myfile.xml";
    while (<IN>) {
    s/\x{0x0D00}//g;
    }

    and

    my $file= "myfile.xml";
    while (<IN>) {
    s/\^M//g;
    }

    and

    while (<IN>) {
    s/\cM//g;
    }

    all to no avail. I've tried it on Unix perl as well as Windows perl.
    Again, I can remove the characters with vi (using s/^V^M//g).

    Does anyone have any ideas on what to do? If I convert the file to
    UTF8, the substitution and subsequent searches work. However, I have
    several hundred files to deal with, and they are all encoded as UTF16.

    Thanks,

    Eileen
     
    Eileen, Jul 2, 2003
    #1

  2. The Ctrl-M is a "carriage-return" which is \r in Perl.

    Mike
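A small sketch of why \r names the right character but still fails on undecoded data. It assumes an ASCII-based platform and hand-built sample bytes; nothing here comes from Eileen's actual file.

```perl
use strict;
use warnings;

# On ASCII-based platforms these escapes all name the same character, U+000D.
my @cr = ("\r", "\cM", "\x0D", "\x{000D}", chr(13));
my $all_same = !grep { $_ ne "\r" } @cr;
print "all escapes name the same character\n" if $all_same;

# Read as raw bytes, a UTF-16LE carriage return is the byte pair 0D 00.
# Deleting only the 0D byte leaves a stray NUL and misaligns what follows.
my $raw = "a\x00\x0D\x00\x0A\x00";    # "a", CR, LF as UTF-16LE bytes
(my $damaged = $raw) =~ s/\r//g;
print unpack("H*", $damaged), "\n";   # prints 6100000a00
```

So the substitution itself may well be running; it just cannot produce clean text until the data has been decoded from UTF-16.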

    [full quote of Eileen's original post snipped]
     
    Michael P. Broida, Jul 2, 2003
    #2

  3. On Wed, Jul 2, Michael P. Broida staggered uncertainly out onto Usenet
    atop a fullquote:

    > The Ctrl-M is a "carriage-return" which is \r in Perl.


    Beware of Usenauts bearing TOFU.
     
    Alan J. Flavell, Jul 3, 2003
    #3
  4. On Thu, Jul 2, Eileen inscribed on the eternal scroll:

    > Sorry, I left out the first part of the script. Here's the full
    > script:
    >
    > #!/usr/local/bin/perl -w


    We also recommend "use strict;" around here. Take advantage of all of
    Perl's opportunities for helping you identify mistakes.

    > $file = "kono.xml";

      ^
      my

    > open (IN, $file) or die "cannot open $file\n";


    Don't omit "$!" from the error report: it helps to understand the
    reason for the failure.
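The advice so far (strict, a declared variable, "$!" in the error report) might be sketched as follows. The three-argument open, the lexical filehandle, the helper name read_lines and the file name no-such-file.xml are additions for illustration, not something the post asks for.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a file, or die with the reason.
sub read_lines {
    my ($file) = @_;
    # $! holds the system's explanation, e.g. "No such file or directory".
    open(my $in, '<', $file) or die "cannot open $file: $!\n";
    my @lines = <$in>;
    close($in) or die "cannot close $file: $!\n";
    return @lines;
}

# Provoke a failure on purpose to show what the report looks like.
my @lines = eval { read_lines("no-such-file.xml") };
my $err = $@;
print "error was: $err" if $err;
```

With "$!" in the message, a missing file and a permissions problem can be told apart at a glance.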

    > I didn't realize you could specify the encoding of a file in Perl.


    Another good reason to [check that you're using at least version
    5.8.0 and] take a few moments out to read the introduction to the
    new support for Unicode. (In earlier Perls you'd need to explicitly
    invoke the relevant module to do this stuff.)
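A minimal sketch of declaring the encoding when opening the file, assuming Perl 5.8 or later and a file that starts with a byte order mark. The helper name read_without_cr and the demo file utf16_demo.xml are hypothetical. Putting :raw first is an assumption about what is safest on Windows, so that no CRLF translation touches the undecoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read a UTF-16 file and strip carriage returns.
sub read_without_cr {
    my ($file) = @_;
    # :encoding(UTF-16) uses the byte order mark to pick the byte order
    # and hands decoded characters to the program.
    open(my $in, '<:raw:encoding(UTF-16)', $file)
        or die "cannot open $file: $!\n";
    my @lines;
    while (my $line = <$in>) {
        $line =~ s/\r//g;    # \r is now one character, U+000D
        push @lines, $line;
    }
    close($in) or die "cannot close $file: $!\n";
    return @lines;
}

# Self-contained demonstration: write a small UTF-16 file, then read it back.
my $demo = "utf16_demo.xml";
open(my $out, '>:raw', $demo) or die "cannot create $demo: $!\n";
print {$out} encode('UTF-16', "<a>\r\n<b>\r\n");    # encode() adds the BOM
close($out) or die "cannot close $demo: $!\n";

my @lines = read_without_cr($demo);
print @lines;
unlink $demo;
```

For several hundred files, the same helper could be called in a loop over the file names, writing each result back out through a matching output layer.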

    > the \x{0x0D00} was identified by one of my Unicode editors, and was a
    > stab in the dark on my part :)


    But what have you learned from the experience?

    - if you are reading text, and have properly defined the encoding,
    then internally your characters can be referenced by their Unicode
    code point values, _not_ by their externally-encoded bit patterns.

    - if, on the other hand, you are reading the data as a bunch of bytes
    (i.e. effectively "as binary") then you'd need to handle the byte-pairs
    as byte-pairs, not as Unicode characters. This is not to be
    recommended in current versions of Perl (unless your data is somehow
    defective, and you have to write a fixup routine of some kind).

    - the new notation, e.g. \x{263a}, denotes a _wide Unicode character_ in
    Perl's native Unicode representation. That value is the Unicode code
    point (in this case the smiley, "U+263A" as the Unicode Consortium's
    notation would write it). Don't confuse it with the external coding
    representation, which (_if_ you had read UTF-16LE coding in binary
    format, which I don't recommend) would have been \x3a\x26.
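The distinction between code points and external bytes can be checked with the core Encode module. This is only an illustrative sketch, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The smiley as a character: one code point, U+263A.
my $smiley = "\x{263a}";
printf "characters: %d, code point: U+%04X\n", length($smiley), ord($smiley);

# Its external UTF-16LE form: the two bytes 3A 26.
my $bytes = encode('UTF-16LE', $smiley);
print "UTF-16LE bytes: ", unpack("H*", $bytes), "\n";    # prints 3a26

# Going the other way, the byte pair 0D 00 decodes to the single
# character \r. The code point U+0D00 is a different character
# altogether, so a pattern built from the byte values cannot match it.
my $cr = decode('UTF-16LE', "\x0D\x00");
print "decoded byte pair is a carriage return\n" if $cr eq "\r";
```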

    hope this helps

    (You'd also be advised to take a read of
    http://web.presby.edu/~nnqadmin/nnq/nquote.html )


    P.S. I have the impression that the regulars around here have nominated
    me by default as the character encoding spokesman. I must admit that
    I'm sometimes at the edge of my expertise, so I _do_ hope they're
    watching closely, and will pounce as necessary if I say something
    wrong or explain it badly...
     
    Alan J. Flavell, Jul 3, 2003
    #4