Error in Handling Unicode (UTF-16LE) File & String

Discussion in 'Perl Misc' started by iaminsik, May 6, 2008.

  1. iaminsik

    iaminsik Guest

    In most cases, I converted utf-16le files into utf-8 encoding.
    But, I want to handle utf-16le files directly.

    My first source is "read a line from utf-16le file and write it in
    utf-16le encoding".
    It works well.

    ==========================================================
    use utf8;
    use Encode;

    open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
    binmode $infile;
    open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
    binmode $outfile;

    while ($line = <$infile>)
    {
        print $outfile $line;
    }

    close($infile);
    close($outfile);
    ==========================================================

    My second script reads one line, splits it into an array, and prints
    the array one element per line in UTF-16LE.
    It seemed to work at first, but some characters came out broken.
    After a long web search, I found that Unicode::String might solve
    this problem.

    ==========================================================
    use utf8;
    use Encode;

    $\ = "\n";

    open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
    binmode $infile;
    open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
    binmode $outfile;

    while ($line = <$infile>)
    {
        chomp($line);
        @words = split(/[ ]+/, $line);
        foreach $word (@words)
        {
            print $outfile $word;
        }
    }

    close($infile);
    close($outfile);
    ==========================================================

    Using Unicode::String, I wrote a third script, but it still doesn't
    work.
    It seems that reading is OK, but the split function is not.
    Is there any solution?
    ==========================================================
    use utf8;
    use Encode;
    use Unicode::String;
    Unicode::String->stringify_as('utf16');

    $\ = "\n";

    open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
    binmode $infile;
    open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
    binmode $outfile;

    while ($line = <$infile>)
    {
        chomp($line);
        $sep = new Unicode::String ("[ ]+");
        @words = split($sep, $line);
        foreach $word (@words)
        {
            print $outfile $word;
        }
    }

    close($infile);
    close($outfile);
    ==========================================================

    Best Regards.
    Remi
     
    iaminsik, May 6, 2008
    #1

  2. Ben Bullock

    Ben Bullock Guest

    On Tue, 06 May 2008 01:00:50 -0700, iaminsik wrote:

    > In most cases, I converted utf-16le files into utf-8 encoding. But, I
    > want to handle utf-16le files directly.
    >
    > My first source is "read a line from utf-16le file and write it in
    > utf-16le encoding".
    > It works well.


    No it doesn't. Your problems are all in the first file.

    > open ($infile, "<:encoding(UTF-16LE):crlf", "unicodefile.dat");
    > binmode $infile;
    > open ($outfile, ">:raw:encoding(UTF-16LE):crlf", "unicodefile.out");
    > binmode $outfile;


    Do you know what binmode does? You'd better have another look at the
    manual (perldoc -f binmode). The binmode statements here switch OFF all
    the :raw:encoding(UTF... stuff you'd put in the previous lines, which
    explains all the other problems you had.

    To demonstrate, try this:

    #!/usr/local/bin/perl
    use warnings;
    use strict;
    use utf8;
    use Encode;
    binmode STDOUT, "utf8";
    my $utf8 = "モンスター 自惚れ";
    for (qw/file1 file2/) {
        open (my $outfile, ">:raw:encoding(UTF-16LE):crlf", "$_.dat")
            or die $!;
        binmode $outfile if /1/; # do what you did for file1 only
        print $outfile $utf8;
        close $outfile or die $!;
        open (my $infile, "<:encoding(UTF-16LE):crlf", "$_.dat");
        while (my $line = <$infile>)
        {
            print "$_: $line\n";
        }
        close($infile) or die $!;
    }
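
    The effect of a bare binmode can also be checked directly by asking
    Perl which layers sit on the handle. The following is only a sketch:
    the scratch file from File::Temp is an assumption made for
    illustration, and the exact layer names printed depend on the
    platform and Perl build, so no output is shown.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical scratch file, used only to have a handle to inspect.
    my (undef, $path) = tempfile(UNLINK => 1);

    open(my $out, ">:raw:encoding(UTF-16LE):crlf", $path) or die "open: $!";
    my @before = PerlIO::get_layers($out);   # layers as requested in open

    binmode $out;                            # no layer argument, same as ":raw"
    my @after = PerlIO::get_layers($out);    # layers left after binmode

    close($out) or die "close: $!";

    print "before binmode: @before\n";
    print "after binmode:  @after\n";
    ```

    If the encoding layer appears in the first list but not in the
    second, the handle is writing plain bytes again.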


    The reason your code appeared to work is because you never did anything
    with the data. It was actually just reading and writing it as bytes
    without any knowledge of the encoding. As soon as you tried to manipulate
    the data, the problem which had been there all along became visible.
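
    A small illustration of why byte-level handling only breaks once the
    data is manipulated: in UTF-16LE a space is the two bytes 20 00, so
    splitting undecoded bytes on a space removes only half of it. The
    sample string "a b" is made up for this sketch.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # "a b" as UTF-16LE bytes: 61 00 20 00 62 00
    my $bytes = encode("UTF-16LE", "a b");

    # Splitting the undecoded bytes removes only the 0x20 byte of the
    # space, so its 0x00 half stays glued to the front of the next word.
    my @byte_words = split(/[ ]+/, $bytes);

    # Decoding first gives split real characters to work with.
    my @char_words = split(/[ ]+/, decode("UTF-16LE", $bytes));

    printf "bytes: %s\n", join(" | ", map { unpack("H*", $_) } @byte_words);
    printf "chars: %s\n", join(" | ", @char_words);
    # prints:
    #   bytes: 6100 | 006200
    #   chars: a | b
    ```

    The second "word" in the byte version has an odd number of bytes, so
    everything after it is read one byte out of step.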

    P.S. use warnings; use strict; & check the values of open and close as
    above.
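
    Putting the advice together, the word-splitting script from the
    question might look like the sketch below. The helper name
    split_words_utf16le is hypothetical, and the extra ":raw" on the
    input handle is an addition here (it is meant to keep any default
    platform layer from touching the bytes before they are decoded).
    The bare binmode calls are gone, and every open and close is checked.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: copy $in_path to $out_path, one word per line,
    # keeping both files in UTF-16LE. Note there is no bare binmode call.
    sub split_words_utf16le {
        my ($in_path, $out_path) = @_;

        open(my $in, "<:raw:encoding(UTF-16LE):crlf", $in_path)
            or die "Cannot read $in_path: $!";
        open(my $out, ">:raw:encoding(UTF-16LE):crlf", $out_path)
            or die "Cannot write $out_path: $!";

        while (my $line = <$in>) {
            chomp($line);
            # $line holds decoded characters, so split sees real spaces.
            for my $word (split(/[ ]+/, $line)) {
                print {$out} $word, "\n";
            }
        }

        close($in)  or die "Cannot close $in_path: $!";
        close($out) or die "Cannot close $out_path: $!";
    }
    ```

    With the file names from the question it would be called as
    split_words_utf16le("unicodefile.dat", "unicodefile.out");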
     
    Ben Bullock, May 6, 2008
    #2

  3. Ben Bullock

    Ben Bullock Guest

    On Tue, 06 May 2008 10:44:09 +0000, Ben Bullock wrote:

    > open (my $infile, "<:encoding(UTF-16LE):crlf", "$_.dat");


    > P.S. use warnings; use strict; & check the values of open and close as
    > above.


    Oops!
     
    Ben Bullock, May 6, 2008
    #3
  4. Ben Bullock

    Ben Bullock Guest

    iaminsik <> wrote:

    > The first source generates 'wide character warnings',
    > and saves outfile in utf8 format, weirdly.


    It's not weird; you have "use utf8;" there, so it reads in using the
    encoding you specified, then the binmode switches off the output
    formatting, then it prints it out in the default format, which
    generates wide character warnings because you haven't explicitly set
    the mode of the output to anything. Use

    binmode $outfile,"utf8";

    to switch those wide character warnings off.
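
    The warning can be reproduced in isolation. This is a sketch under
    some assumptions: it writes to a scratch file from File::Temp, uses
    the ":encoding(UTF-8)" layer in open rather than the binmode call
    above, and assumes no -C switch or PERL_UNICODE setting is changing
    the default layers.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Collect warnings instead of printing them.
    my @warnings;
    $SIG{__WARN__} = sub { push @warnings, @_ };

    # Hypothetical scratch file for the demonstration.
    my (undef, $path) = tempfile(UNLINK => 1);
    my $text = "\x{30E2}\x{30F3}";   # two katakana characters

    # No encoding layer: Perl warns "Wide character in print".
    open(my $plain, ">", $path) or die "open: $!";
    print {$plain} $text;
    close($plain) or die "close: $!";
    my $without_layer = @warnings;

    # Explicit layer: the same print produces no new warning.
    open(my $encoded, ">:encoding(UTF-8)", $path) or die "open: $!";
    print {$encoded} $text;
    close($encoded) or die "close: $!";
    my $with_layer = @warnings - $without_layer;

    print "warnings without layer: $without_layer\n";
    print "warnings with layer:    $with_layer\n";
    ```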

    > I made 'binmode $outfile;' as a comment line,
    > and it saves outfile in UTF-16LE format I wanted.


    Good news.

    > 3. Several Questions
    > I, a newbie in Perl programming language, couldn't understand two
    > parts in your codes.
    > ========================================================================
    > for (qw/file1 file2/) { <===== what it means? it's a short expression
    > for loop?


    This sets $_ to "file1" then "file2". qw/a b/ equals ('a', 'b').

    > binmode $outfile if /1/; <===== what /l/ means?


    It's not an l it's a 1. "if /1/" has the effect of saying 'if $_ is
    "file1"'. The /1/ detects the character '1' in the name. Try
    experimenting with the code to understand what it does.
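
    A minimal version of the same two idioms, reduced to the loop and
    the match (the array name @matched is made up for this sketch):

    ```perl
    use strict;
    use warnings;

    my @matched;
    for (qw/file1 file2/) {           # $_ is "file1", then "file2"
        push @matched, $_ if /1/;     # same as: if ($_ =~ /1/)
    }
    print "@matched\n";               # prints "file1"
    ```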
     
    Ben Bullock, May 7, 2008
    #4
  5. iaminsik

    iaminsik Guest

    On May 7, 2:46 PM, (Ben Bullock) wrote:
    > iaminsik <> wrote:
    > > The first source generates 'wide character warnings',
    > > and saves outfile in utf8 format, weirdly.

    >
    > It's not weird; you have "use utf8;" there, so it reads in using the
    > encoding you specified, then the binmode switches off the output
    > formatting, then it prints it out in the default format, which
    > generates wide character warnings because you haven't explicitly set
    > the mode of the output to anything. Use
    >
    > binmode $outfile,"utf8";
    >
    > to switch those wide character warnings off.
    >
    > > I made 'binmode $outfile;' as a comment line,
    > > and it saves outfile in UTF-16LE format I wanted.

    >
    > Good news.
    >
    > > 3. Several Questions
    > > I, a newbie in Perl programming language, couldn't understand two
    > > parts in your codes.
    > > ========================================================================
    > > for (qw/file1 file2/) { <===== what it means? it's a short expression
    > > for loop?

    >
    > This sets $_ to "file1" then "file2". qw/a b/ equals ('a', 'b').
    >
    > > binmode $outfile if /1/; <===== what /l/ means?

    >
    > It's not an l it's a 1. "if /1/" has the effect of saying 'if $_ is
    > "file1"'. The /1/ detects the character '1' in the name. Try
    > experimenting with the code to understand what it does.


    Your comment helped me a lot.
    Thanks, Ben!

    Best Regards,
    Remi.
     
    iaminsik, May 8, 2008
    #5