Unicode and Perl

Bill H

I have a Perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files, so I need to save them as Unicode. The
problem that arises is that my Perl program can no longer read the
files, because all of the text is in Unicode, not just the
Russian parts. Is there a way of fixing the text so that only the Unicode
parts are in Unicode and the rest is in plain text?

Here is an example of what is being read in:

QUESTIONS1=1. Я предпочитаю делать что-либо в
группе
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the Unicode text, and a null character (0) is
inserted before every non-Unicode letter. Is there a way of removing
these nulls?

Bill H
 
Andrzej Adam Filip

Bill H said:
I have a perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files so I need to save them as unicode. The
problem that arises is that my perl program can no longer read the
files due to the fact that all the text is in unicode, not just the
russian parts. Is there a way fixing the text so that only the unicode
parts are in unicode and the rest are in straight text.

Here is an example of what is being read in:

QUESTIONS1=1. Я предпочитаю делать что-либо в
группе
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the unicode text and a null character (0) is
inserted before every non-unicode letter, is there a way of removing
these nulls?

1) Are you sure you created the source files in UTF-8 (not UTF-16)?

2) Have you tried explicitly specifying the encoding of the input and output files?
#v+
open(my $fh, '<:encoding(UTF-8)', $file_name) or die "Cannot open $file_name: $!";
#v-

See "man perlopentut" for more details
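
To make the point about specifying the encoding on both input and output concrete, here is a minimal sketch that re-encodes a file as UTF-8. It is only an illustration: the sub name convert_to_utf8 and its arguments are made up for this example, and the input encoding is an assumption you have to supply yourself (for instance 'UTF-16' if the file starts with a byte order mark). In UTF-8 the plain English parts stay one byte per character, so the null bytes disappear.

```perl
use strict;
use warnings;

# Re-encode a text file as UTF-8 (hypothetical helper for illustration).
# $in_encoding is an assumption about how the file was saved, e.g. 'UTF-16'
# for a file that starts with a byte order mark.
sub convert_to_utf8 {
    my ($in_path, $out_path, $in_encoding) = @_;

    # :raw first, so no CRLF translation touches the undecoded bytes.
    open(my $in, "<:raw:encoding($in_encoding)", $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:raw:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;        # drop LF or CRLF line endings
        print {$out} $line, "\n";    # write the line back with a plain LF
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}
```

Called as convert_to_utf8('questions.txt', 'questions.utf8.txt', 'UTF-16'), where both file names are placeholders, this would leave the original file untouched and write a UTF-8 copy next to it.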
 
Jürgen Exner

Bill said:
I have a perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files so I need to save them as unicode.

Well, sort of.
- there are other character sets that allow multi-lingual text within the
same file. But Unicode is certainly a good choice
- "Unicode" text can be encoded in many different ways. Which one are you
talking about?
The problem that arises is that my perl program can no longer read the
files due to the fact that all the text is in unicode, not just the
russian parts.

Well, yes, that's usually what happens. And it's the beauty of Unicode that
specifically you _don't_ have to encode each language in a different
character set because it covers them all (well, at least unless you are
going very exotic).
Is there a way fixing the text so that only the unicode
parts are in unicode and the rest are in straight text.

You are dealing with a language where the characters for this language are
not in Unicode? Not even in a surrogate set? I find this very hard to
believe to say the least.
Here is an example of what is being read in:

QUESTIONS1=1. ? ??????????? ?????? ???-???? ?
??????
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the unicode text and a null character (0) is
inserted before every non-unicode letter, is there a way of removing
these nulls?

There are no non-Unicode letters in your sample (or maybe my newsreader
doesn't display them). It appears to me that you have Cyrillic and Latin
characters, and of course both are included in Unicode.
If there really is a null character before some other characters, as you
claim, then the software that generated the text is broken.
Or are you talking about a null byte, maybe? Then chances are you saved your
file as UTF-16. Unfortunately you didn't show us any of your Perl code,
so there is no way for us to check whether you are reading the file as
UTF-16, too. And as I mentioned at the very beginning, you are not telling
us which encoding you are using, either.
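
To illustrate where such null bytes come from, here is a small sketch that prints the bytes of the same three characters in UTF-8 and in both flavours of UTF-16. The sample string and the helper name hex_bytes are made up for this illustration; Encode is a core Perl module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of $string as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split(//, encode($encoding, $string));
}

# Latin "I", a space, and the Cyrillic capital letter U+042F.
my $sample = "I \x{42F}";

for my $encoding ('UTF-8', 'UTF-16LE', 'UTF-16BE') {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, $sample);
}
# UTF-8    49 20 D0 AF
# UTF-16LE 49 00 20 00 2F 04
# UTF-16BE 00 49 00 20 04 2F
```

In both UTF-16 forms every ASCII character carries a 00 byte, while the Cyrillic letter does not, which would match the description of nulls appearing only next to the English letters.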

jue
 
Mumia W.

I have a perl program that reads in text files and creates web pages
using the content of the text files. I now have to include text in
Russian in these text files so I need to save them as unicode. The
problem that arises is that my perl program can no longer read the
files due to the fact that all the text is in unicode, not just the
russian parts. Is there a way fixing the text so that only the unicode
parts are in unicode and the rest are in straight text.

Here is an example of what is being read in:

QUESTIONS1=1. Я предпочитаю делать что-либо в
группе
QUESTIONS2=2. I love learning new skills
QUESTIONS3=3. My word is my bond
QUESTIONS4=4. I like being the boss

I looked at the source of the unicode text and a null character (0) is
inserted before every non-unicode letter, is there a way of removing
these nulls?

Bill H

What you posted is clearly UTF-8, but nulls before each ASCII
character suggest UTF-16 (?). "perldoc -f open", "perldoc
perluniintro" and "perldoc Encode::Supported" will help you
figure out the right I/O layer to use when reading those files.
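
As a rough sketch of what reading such a file through an I/O layer could look like, assuming one KEY=VALUE entry per line as in the sample (the sub name read_questions is invented for this example, and the encoding name passed in is whatever the file was really saved as):

```perl
use strict;
use warnings;

# Read KEY=VALUE lines from a text file saved in the given encoding and
# return them as a hash reference of decoded Perl strings.
sub read_questions {
    my ($path, $encoding) = @_;

    # :raw first, so no CRLF translation touches the undecoded bytes.
    open(my $fh, "<:raw:encoding($encoding)", $path)
        or die "Cannot open $path: $!";

    my %questions;
    while (my $line = <$fh>) {
        $line =~ s/\A\x{FEFF}// if $. == 1;   # drop a leftover byte order mark
        $line =~ s/\r?\n\z//;                 # drop LF or CRLF line endings
        next unless $line =~ /^(\w+)=(.*)$/;  # skip lines that are not KEY=VALUE
        $questions{$1} = $2;
    }
    close($fh);
    return \%questions;
}
```

With 'UTF-16' as the encoding, the decoder relies on the byte order mark at the start of the file to pick the byte order. The values come back as ordinary Perl character strings, so the output handle for the web page needs its own layer, such as '>:encoding(UTF-8)'.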
 
Bill H

Mumia said:
What you posted is clearly UTF-8, but nulls before each ASCII
character suggest UTF-16 (?). "perldoc -f open", "perldoc
perluniintro" and "perldoc Encode::Supported" will help you
figure out the right I/O layer to use when reading those files.

Thanks to everyone for all the suggestions. The Unicode is made by
saving the text file with Windows XP's Notepad. I will try these
suggestions.

Bill H
 
Matt Garrish

Bill said:
Thanks to everyone for all the suggestions. The Unicode is made by
saving the text file with Windows XP's Notepad. I will try these
suggestions.

What does that tell anyone, though? Notepad will let you save in UTF-8
and in little- and big-endian UTF-16 (which it likes to call "Unicode").
Which encoding did you choose?
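
If you are not sure which one was chosen, a quick check of the first bytes of the file can usually tell, assuming the editor wrote a byte order mark (Notepad does for these options). The following is only a sketch, and the sub name encoding_from_bom is made up for this example:

```perl
use strict;
use warnings;

# Guess a file's encoding from its byte order mark (BOM), if it has one.
# Returns 'UTF-8', 'UTF-16LE' or 'UTF-16BE', or nothing when no BOM is found.
sub encoding_from_bom {
    my ($path) = @_;

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);    # the longest BOM checked here is 3 bytes
    die "Cannot read $path: $!" unless defined $got;
    close($fh);

    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;                              # no BOM: encoding unknown
}
```

A file without a BOM may still be UTF-8 or a legacy code page, so an empty result only means the check could not decide.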

Matt
 
