Interpreting non-ascii characters.

ddtl

Hello everybody,

I want to create a script that reads the files in the
current directory and renames them according to some
scheme. The file names are in Russian; sometimes
the names are encoded as win-1251, sometimes as koi8-r, etc.
I want to read in each file name and convert it to a list for
further processing. The problem is that Python treats
non-ascii characters as multibyte characters. For
example, the hex code for "Small Character A" in koi8-r is
0xc1, but Python interprets it as the byte sequence
\xd0, \xb1.

What can I do so that Python interprets non-ascii
characters correctly?
 
John Machin

> Hello everybody,
>
> I want to create a script that reads the files in the
> current directory and renames them according to some
> scheme. The file names are in Russian; sometimes
> the names are encoded as win-1251, sometimes as koi8-r, etc.

You have a file system with 8-bit file names with no indication of
'codepage' or 'encoding', either globally or per file? Which operating
system are you using?

> I want to read in each file name and convert it to a list for
> further processing.

Read the file name from a text file? Or do you mean using e.g. glob.glob()
or os.listdir()?

What do you mean by "convert it to a list"? Do you mean 'foo.txt' -> ['f',
'o', ... etc.]??? Why?

> The problem is that Python treats
> non-ascii characters as multibyte characters. For
> example, the hex code for "Small Character A" in koi8-r is
> 0xc1, but Python interprets it as the byte sequence
> \xd0, \xb1.

Python is very unlikely to do that all by itself. Please show us the
script or whatever evidence you have. I strongly suggest that
immediately after "reading" a file name, you do
    print repr(file_name)
NOT
    print file_name
so that you can see *exactly* what you've got.
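
One way to follow this advice, as a minimal sketch in Python 3 syntax (an assumption; the thread itself uses the Python 2 print statement): pass a bytes path to os.listdir so the names come back as undecoded bytes, then print their repr. The helper name raw_names is made up for illustration.

```python
import os

def raw_names(directory=b"."):
    """Return the entries of `directory` as raw, undecoded bytes."""
    # A bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(directory))

for file_name in raw_names():
    # repr() shows every non-ASCII byte as a \xNN escape.
    print(repr(file_name))
```

Looking at the escapes rather than at rendered glyphs takes the terminal's own encoding out of the picture.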

Are you sure about the \xb1??? Consider this:
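
A check of this kind, sketched here in Python 3 (an assumption; the thread uses Python 2), decodes the KOI8-R byte from the question and re-encodes the resulting letter as UTF-8:

```python
raw = b"\xc1"                       # the KOI8-R byte from the question
letter = raw.decode("koi8_r")       # U+0430, CYRILLIC SMALL LETTER A
as_utf8 = letter.encode("utf-8")    # the same letter as UTF-8 bytes
print(repr(as_utf8))                # b'\xd0\xb0'
```
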
What do you get when you do that?

> What can I do so that Python interprets non-ascii
> characters correctly?

Know how your non-ascii characters are encoded. Tell Python what to do
with them.

Read this:
http://www.amk.ca/python/howto/unicode

Hope this helps,
John
 
Omari Norman

> I want to create a script that reads the files in the
> current directory and renames them according to some
> scheme. The file names are in Russian; sometimes
> the names are encoded as win-1251, sometimes as koi8-r, etc.
> I want to read in each file name and convert it to a list for
> further processing. The problem is that Python treats

Apparently os.listdir returns a list of Unicode objects if the pathname
you give it is a Unicode object. So, Python should then convert the
Russian filenames to Unicode, using whatever encoding is necessary. (I
don't know, however, how Python would know what to do if the filenames
are in a bunch of different encodings, as you say.)

If you can get the filenames into Unicode, then you can manipulate them
however you like.
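
As a sketch of that idea in Python 3 (an assumption; there, os.listdir returns str names for a str path and bytes names for a bytes path): once a raw name is decoded with the encoding known for its directory, it is ordinary text and splits into characters rather than bytes. The helper to_text and the sample name are hypothetical.

```python
def to_text(raw_name, encoding):
    """Decode a raw file name with the encoding known for its directory."""
    return raw_name.decode(encoding)

# Hypothetical sample: "тест.txt" as a koi8-r system would store it.
raw = "тест.txt".encode("koi8_r")

name = to_text(raw, "koi8_r")
chars = list(name)          # one list item per character, not per byte
print(chars)
```

If the encoding passed in is wrong, decode either raises UnicodeDecodeError or silently yields the wrong letters, so the encoding has to be known per directory rather than guessed.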
 
ddtl


I have a bunch of directories and files from different systems
(each directory contains files from the same system) which are
encoded differently (though all of them are in Russian), so the
following encodings are present: koi8-r, win-1251, utf-8, etc.
I want to transliterate the names into plain ASCII so that they
are readable regardless of the system. Personally I use both
Linux and Windows. So what I do is read each file name using os.listdir,
convert it to a list ('foo.txt' => ['f', 'o', ... , 't'], except that
the file names are in Russian), transliterate it (some letters in Russian
have to be transliterated into 2 or even 3 Latin letters),
and then rename the file.
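
The transliteration step could look roughly like this in Python 3 (an assumption; the thread uses Python 2). The table is an illustrative choice, not a standard scheme, and the names TRANSLIT and transliterate are hypothetical. It works on already decoded text, and characters missing from the table pass through unchanged.

```python
# Illustrative mapping only; several letters expand to 2 or 3 Latin letters.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

def transliterate(name):
    """Replace Cyrillic letters in `name` with Latin equivalents."""
    out = []
    for ch in name:
        latin = TRANSLIT.get(ch.lower(), ch)
        if ch.isupper():
            # Keep the capitalisation of upper-case input letters.
            latin = latin.capitalize()
        out.append(latin)
    return "".join(out)

print(transliterate("Щука.txt"))
```

The result could then be passed to os.rename together with the original name.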

It seems I have solved the problem after all. I thought
that my Windows (2000) used win-1251 and Linux used koi8-r, and
because of that I couldn't understand the strange
codes I got while experimenting with locally created Cyrillic
file names. In fact Linux uses utf-8 and Windows uses cp866;
once I understood that and read the article you suggested, I
solved the problem.
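
A quick way to recognise such codes, sketched in Python 3 (an assumption; the thread uses Python 2): encode one known letter with each candidate codec and compare the bytes with what repr shows for a real file name. Python's codec name for win-1251 is cp1251, and sys.getfilesystemencoding() reports what the running interpreter uses for file names.

```python
import sys

# What this interpreter uses to convert file names to and from bytes.
print(sys.getfilesystemencoding())

letter = "\u0430"   # CYRILLIC SMALL LETTER A
codes = {codec: letter.encode(codec)
         for codec in ("koi8_r", "cp1251", "cp866", "utf-8")}
for codec, raw in codes.items():
    print(codec, repr(raw))
```
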

Thanks.
 
