Unicode and Perl

Discussion in 'Perl Misc' started by Bill H, Aug 1, 2006.

  1. Bill H

    Bill H Guest

    I have a perl program that reads in text files and creates web pages
    using the content of the text files. I now have to include text in
    Russian in these text files so I need to save them as unicode. The
    problem that arises is that my perl program can no longer read the
    files due to the fact that all the text is in unicode, not just the
    russian parts. Is there a way of fixing the text so that only the unicode
    parts are in unicode and the rest are in straight text?

    Here is an example of what is being read in:

    QUESTIONS1=1. Я предпочитаю делать что-либо в
    группе
    QUESTIONS2=2. I love learning new skills
    QUESTIONS3=3. My word is my bond
    QUESTIONS4=4. I like being the boss

    I looked at the source of the unicode text and a null character (0) is
    inserted before every non-unicode letter, is there a way of removing
    these nulls?

    Bill H
    Bill H, Aug 1, 2006
    #1

  2. "Bill H" writes:

    > I have a perl program that reads in text files and creates web pages
    > using the content of the text files. I now have to include text in
    > Russian in these text files so I need to save them as unicode. The
    > problem that arises is that my perl program can no longer read the
    > files due to the fact that all the text is in unicode, not just the
    > russian parts. Is there a way of fixing the text so that only the unicode
    > parts are in unicode and the rest are in straight text?
    >
    > Here is an example of what is being read in:
    >
    > QUESTIONS1=1. Я предпочитаю делать что-либо в
    > группе
    > QUESTIONS2=2. I love learning new skills
    > QUESTIONS3=3. My word is my bond
    > QUESTIONS4=4. I like being the boss
    >
    > I looked at the source of the unicode text and a null character (0) is
    > inserted before every non-unicode letter, is there a way of removing
    > these nulls?


    1) Are you sure you created source files in UTF-8? [not UTF-16]

    2) Have you tried to explicitly specify encoding of input and output files?
    #v+
    open(my $fh, "<:encoding(UTF-8)", $file_name)
        or die "Cannot open $file_name: $!";
    #v-

    See "man perlopentut" for more details

    --
    [pl2en: Andrew] Andrzej Adam Filip
    Andrzej Adam Filip, Aug 1, 2006
    #2

  3. Bill H wrote:
    > I have a perl program that reads in text files and creates web pages
    > using the content of the text files. I now have to include text in
    > Russian in these text files so I need to save them as unicode.


    Well, sort of.
    - there are other character sets that allow multilingual text within the
    same file. But Unicode is certainly a good choice.
    - "Unicode" text can be encoded in many different ways. Which one are you
    talking about?

    > The
    > problem that arises is that my perl program can no longer read the
    > files due to the fact that all the text is in unicode, not just the
    > russian parts.


    Well, yes, that's usually what happens. And it's the beauty of Unicode that
    specifically you _don't_ have to encode each language in a different
    character set because it covers them all (well, at least unless you are
    going very exotic).

    > Is there a way of fixing the text so that only the unicode
    > parts are in unicode and the rest are in straight text?


    You are dealing with a language where the characters for this language are
    not in Unicode? Not even in a surrogate set? I find this very hard to
    believe to say the least.

    > Here is an example of what is being read in:
    >
    > QUESTIONS1=1. Я предпочитаю делать что-либо в
    > группе
    > QUESTIONS2=2. I love learning new skills
    > QUESTIONS3=3. My word is my bond
    > QUESTIONS4=4. I like being the boss
    >
    > I looked at the source of the unicode text and a null character (0) is
    > inserted before every non-unicode letter, is there a way of removing
    > these nulls?


    There are no non-Unicode letters in your sample (or maybe my newsreader
    doesn't display them). It appears to me you have Cyrillic and Latin
    characters, and of course both are included in Unicode.
    If there really is a null character before some other characters, as you
    claim, then the software that generated the text is broken.
    Or are you talking about a null byte, maybe? Then chances are you saved your
    file as UTF-16. Unfortunately you didn't show us any of your Perl code,
    so there is no way for us to check whether you are reading the file as
    UTF-16, too. And as I mentioned at the very beginning, you are not telling
    us which encoding you are using, either.
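
    To illustrate the null-byte point, here is a minimal sketch that assumes
    the core Encode module and uses a made-up sample string (not the actual
    file). It encodes plain ASCII text as little-endian UTF-16 and dumps the
    first bytes, so the zero byte paired with every ASCII letter becomes
    visible. In big-endian UTF-16 the zero byte comes first instead.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample line, modelled on the file in the question.
my $text = "QUESTIONS2=2. I love";

# Encode as little-endian UTF-16 without a byte-order mark.
my $bytes = encode("UTF-16LE", $text);

# Dump the first six bytes (the letters "QUE") as hex.
my $hex = join " ", map { sprintf "%02x", ord($_) } split //, substr($bytes, 0, 6);
print "$hex\n";    # prints "51 00 55 00 45 00"

# Decoding with the matching encoding restores the original text,
# so there is no need to strip the zero bytes by hand.
my $round_trip = decode("UTF-16LE", $bytes);
print $round_trip eq $text ? "round trip ok\n" : "mismatch\n";
```

    The zero bytes are part of the encoding, not garbage, so decoding is
    the safer route than deleting them.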

    jue
    Jürgen Exner, Aug 1, 2006
    #3
  4. Mumia W.

    Mumia W. Guest

    On 08/01/2006 08:34 AM, Bill H wrote:
    > I have a perl program that reads in text files and creates web pages
    > using the content of the text files. I now have to include text in
    > Russian in these text files so I need to save them as unicode. The
    > problem that arises is that my perl program can no longer read the
    > files due to the fact that all the text is in unicode, not just the
    > russian parts. Is there a way of fixing the text so that only the unicode
    > parts are in unicode and the rest are in straight text?
    >
    > Here is an example of what is being read in:
    >
    > QUESTIONS1=1. Я предпочитаю делать что-либо в
    > группе
    > QUESTIONS2=2. I love learning new skills
    > QUESTIONS3=3. My word is my bond
    > QUESTIONS4=4. I like being the boss
    >
    > I looked at the source of the unicode text and a null character (0) is
    > inserted before every non-unicode letter, is there a way of removing
    > these nulls?
    >
    > Bill H
    >


    What you posted is clearly UTF-8, but nulls before each ASCII
    character suggest UTF-16 (?). "perldoc -f open", "perldoc
    perluniintro" and "perldoc Encode::Supported" will help you
    figure out the right I/O layer to use when reading those files.
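
    A rough sketch of what such a layer looks like in practice, assuming the
    file really is UTF-16 with a byte-order mark. The sample file is created
    by the script itself, so the file contents and variable names are
    illustrative only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-16 sample file so the sketch is self-contained.
my ($tmp_fh, $file_name) = tempfile(UNLINK => 1);
binmode($tmp_fh, ":raw:encoding(UTF-16)");    # the encoder writes a BOM
print {$tmp_fh} "QUESTIONS1=1. \x{42F} \x{433}\x{440}\x{443}\x{43F}\x{43F}\x{435}\n";
print {$tmp_fh} "QUESTIONS2=2. I love learning new skills\n";
close($tmp_fh) or die "Cannot close $file_name: $!";

# Read it back; the UTF-16 layer uses the BOM to pick the byte order.
open(my $in, "<:raw:encoding(UTF-16)", $file_name)
    or die "Cannot open $file_name: $!";
my @lines = <$in>;
close($in);
chomp @lines;

# Emit the decoded text as UTF-8, e.g. for a web page.
binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n" for @lines;
```

    The ":raw" part keeps newline translation from interfering with the
    two-byte code units. A file written on Windows will still have carriage
    returns at the end of each line after decoding, which may need stripping
    separately.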
    Mumia W., Aug 1, 2006
    #4
  5. Bill H

    Bill H Guest

    Mumia W. wrote:
    > On 08/01/2006 08:34 AM, Bill H wrote:
    > > I have a perl program that reads in text files and creates web pages
    > > using the content of the text files. I now have to include text in
    > > Russian in these text files so I need to save them as unicode. The
    > > problem that arises is that my perl program can no longer read the
    > > files due to the fact that all the text is in unicode, not just the
    > > russian parts. Is there a way of fixing the text so that only the unicode
    > > parts are in unicode and the rest are in straight text?
    > >
    > > Here is an example of what is being read in:
    > >
    > > QUESTIONS1=1. Я предпочитаю делать что-либо в
    > > группе
    > > QUESTIONS2=2. I love learning new skills
    > > QUESTIONS3=3. My word is my bond
    > > QUESTIONS4=4. I like being the boss
    > >
    > > I looked at the source of the unicode text and a null character (0) is
    > > inserted before every non-unicode letter, is there a way of removing
    > > these nulls?
    > >
    > > Bill H
    > >

    >
    > What you posted is clearly UTF-8, but nulls before each ASCII
    > character suggest UTF-16 (?). "perldoc -f open", "perldoc
    > perluniintro" and "perldoc Encode::Supported" will help you
    > figure out the right I/O layer to use when reading those files.


    Thanks to everyone for all the suggestions. The Unicode file is made by
    saving the text file with Windows XP's Notepad. I will try these
    suggestions.

    Bill H
    Bill H, Aug 1, 2006
    #5
  6. Matt Garrish

    Matt Garrish Guest

    Bill H wrote:

    > Mumia W. wrote:
    > > On 08/01/2006 08:34 AM, Bill H wrote:
    > >
    > > What you posted is clearly UTF-8, but nulls before each ASCII
    > > character suggest UTF-16 (?). "perldoc -f open", "perldoc
    > > perluniintro" and "perldoc Encode::Supported" will help you
    > > figure out the right I/O layer to use when reading those files.

    >
    > Thanks to everyone for all the suggestions. The Unicode file is made by
    > saving the text file with Windows XP's Notepad. I will try these
    > suggestions.
    >


    What does that tell anyone, though? Notepad will let you save in UTF-8
    and in little- and big-endian UTF-16 (which it likes to call "Unicode").
    Which encoding did you choose?
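
    If the choice is not known, one way to find out is to look at the first
    bytes of the file. The sketch below assumes the file starts with a
    byte-order mark (which Notepad normally writes for these encodings).
    The helper name guess_encoding_from_bom is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the encoding from a leading byte-order mark.
# Returns nothing (undef in scalar context) when no known BOM is present.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    return "UTF-8"    if $bytes =~ /\A\xEF\xBB\xBF/;
    return "UTF-16LE" if $bytes =~ /\A\xFF\xFE/;
    return "UTF-16BE" if $bytes =~ /\A\xFE\xFF/;
    return;
}

# Sample byte strings: a BOM followed by the letter "Q".
print guess_encoding_from_bom("\xFF\xFEQ\x00"), "\n";    # prints "UTF-16LE"
print guess_encoding_from_bom("\xFE\xFF\x00Q"), "\n";    # prints "UTF-16BE"
print guess_encoding_from_bom("\xEF\xBB\xBFQ"), "\n";    # prints "UTF-8"
```

    For a real file, the bytes would come from opening it with the ":raw"
    layer and reading the first three bytes. A file without a BOM gives no
    answer this way, and a UTF-32 little-endian BOM would be mistaken for
    UTF-16LE, so this is only a first check.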

    Matt
    Matt Garrish, Aug 1, 2006
    #6
