Unicode and Perl

Bill H · Aug 1, 2006

I have a Perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files, so I need to save them as Unicode. The
problem is that my Perl program can no longer read the files, because
all the text is in Unicode, not just the Russian parts. Is there a way
of fixing the text so that only the Russian parts are in Unicode and
the rest is in plain text?

Here is an example of what is being read in:

QUESTIONS1=1. Ð¯ Ð¿Ñ€ÐµÐ´Ð¿Ð¾Ñ‡Ð¸Ñ‚Ð°ÑŽ Ð´ÐµÐ»Ð°Ñ‚ÑŒ Ñ‡Ñ‚Ð¾-Ð»Ð¸Ð±Ð¾ Ð²
Ð³Ñ€ÑƒÐ¿Ð¿Ðµ
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the Unicode text and a null character (0) is
inserted before every non-Unicode letter. Is there a way of removing
these nulls?

Bill H

Andrzej Adam Filip · Aug 1, 2006

1) Are you sure you created the source files in UTF-8 (not UTF-16)?

2) Have you tried explicitly specifying the encoding of the input and
output files?
#v+
open(my $fh, "<:encoding(UTF-8)", $file_name)
    or die "Cannot open $file_name: $!";
#v-

See "man perlopentut" for more details
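
If the files turn out to be UTF-16 rather than UTF-8, the same idea applies with a different layer name. The following is only a sketch: the sub name, the file names in the usage note, and the choice of input encoding are assumptions, not something the original poster confirmed. The input layer decodes bytes to characters on read, and the output layer encodes them as UTF-8 on write.

```perl
use strict;
use warnings;

# Hypothetical helper: re-encode a text file to UTF-8.
# $from_enc is an assumption about the input, e.g. 'UTF-16' (needs a
# byte order mark) or 'UTF-16LE' / 'UTF-16BE' when there is none.
sub convert_to_utf8 {
    my ($in_name, $out_name, $from_enc) = @_;

    # :raw keeps any platform CRLF translation away from the UTF-16 bytes.
    open(my $in, "<:raw:encoding($from_enc)", $in_name)
        or die "Cannot open $in_name: $!";
    open(my $out, ">:raw:encoding(UTF-8)", $out_name)
        or die "Cannot open $out_name: $!";

    while (my $line = <$in>) {
        $line =~ s/\r\n\z/\n/;    # turn Windows CRLF line ends into LF
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_name: $!";
    return;
}
```

A call might look like `convert_to_utf8('questions.txt', 'questions-utf8.txt', 'UTF-16')`, where both file names are made up for illustration.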

Jürgen Exner · Aug 1, 2006

Bill said:
I have a perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files so I need to save them as unicode.

Well, sort of.
- There are other character sets that allow multilingual text within the
same file, but Unicode is certainly a good choice.
- "Unicode" text can be encoded in many different ways. Which one are you
talking about?

The
problem that arises is that my perl program can no longer read the
files due to the fact that all the text is in unicode, not just the
russian parts.

Well, yes, that's usually what happens. And it's the beauty of Unicode that
specifically you _don't_ have to encode each language in a different
character set because it covers them all (well, at least unless you are
going very exotic).

Is there a way of fixing the text so that only the unicode
parts are in unicode and the rest are in straight text?

You are dealing with a language where the characters for this language are
not in Unicode? Not even in a surrogate set? I find this very hard to
believe to say the least.

Here is an example of what is being read in:

QUESTIONS1=1. ? ??????????? ?????? ???-???? ?
??????
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the unicode text and a null character (0) is
inserted before every non-unicode letter, is there a way of removing
these nulls?

There are no non-Unicode letters in your sample (or maybe my newsreader
doesn't display them). It appears to me you have Cyrillic and Latin
characters, and of course both are included in Unicode.
If there really is a null character before some other characters, as you
claim, then the software that generated the text is broken.
Or are you talking about a null byte, maybe? Then chances are you saved your
file as UTF-16. Unfortunately you didn't show us any of your Perl code,
so there is no way for us to check whether you are reading the file as
UTF-16, too. And as I mentioned at the very beginning, you are not telling
us which encoding you are using, either.
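
To see where such null bytes would come from, here is a small sketch that dumps the bytes of one short string in three encodings. The helper name and the sample string are made up for illustration. Only `encode` from the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the encoded bytes of $text as hex pairs.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $sample = "I \x{42F}";    # Latin "I", a space, Cyrillic capital "Ya"

print "UTF-16LE: ", hex_bytes('UTF-16LE', $sample), "\n";
print "UTF-16BE: ", hex_bytes('UTF-16BE', $sample), "\n";
print "UTF-8:    ", hex_bytes('UTF-8',    $sample), "\n";
```

In both UTF-16 forms every ASCII character takes two bytes, one of which is 00, while the Cyrillic letter has no null byte at all. In UTF-8 the ASCII characters stay single bytes and only the Cyrillic letter becomes two bytes (d0 af).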

jue

Mumia W. · Aug 1, 2006

What you posted is clearly UTF-8, but nulls before each ASCII
character suggest UTF-16 (?). "perldoc -f open", "perldoc
perluniintro" and "perldoc Encode::Supported" will help you
figure out the right I/O layer to use when reading those files.

Bill H · Aug 1, 2006

Mumia said:
What you posted is clearly UTF-8, but nulls before each ASCII
character suggest UTF-16 (?). "perldoc -f open", "perldoc
perluniintro" and "perldoc Encode::Supported" will help you
figure out the right I/O layer to use when reading those files.

Thanks to everyone for all the suggestions. The Unicode files are made
by saving the text with Windows XP's Notepad. I will try these
suggestions.

Bill H

Matt Garrish · Aug 1, 2006

Bill said:
Thanks to everyone for all the suggestions. The Unicode files are made
by saving the text with Windows XP's Notepad. I will try these
suggestions.

What does that tell anyone, though? Notepad will let you save in UTF-8
and in little- and big-endian UTF-16 (which it likes to call "Unicode").
Which encoding did you choose?
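
If you no longer remember, the first bytes of the file may tell you. This is a rough sketch with a made-up sub name, and it assumes the file still starts with the byte order mark that Notepad normally writes for its Unicode formats. It only looks at the first three bytes, so it cannot recognise files saved without a BOM, and a UTF-32LE file would be mistaken for UTF-16LE.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a file's encoding from its byte order mark.
# Returns 'UTF-8', 'UTF-16LE' or 'UTF-16BE', or undef if no BOM is found.
sub guess_encoding_from_bom {
    my ($file_name) = @_;

    open(my $fh, '<:raw', $file_name)
        or die "Cannot open $file_name: $!";
    my $got = read($fh, my $head, 3);
    die "Cannot read $file_name: $!" unless defined $got;
    close($fh);

    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;    # no BOM: could be plain ASCII, a legacy code page, ...
}
```

The returned name can be used in an I/O layer such as `"<:raw:encoding($name)"`. For the two UTF-16 variants the BOM then arrives as a leading U+FEFF character in the first line and needs to be stripped by hand.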

Matt
