Detecting filename-encoding (on WinXP)?

  • Thread starter Tim N. van der Leeuw

Tim N. van der Leeuw

Hi,

I have a need to store directory and filenames in a database. For the
database I chose to use UTF-8 encoding; but the actual encoding used is
probably immaterial: whichever coding I take, I'll run into this issue
eventually.

At first my code worked until I ran into a directory full of Cyrillic
characters and my program blew up.

So now what I need to know is, how do I find out in what encoding a
particular filename is? Is there a portable way for doing this? And if
not, then what is the non-portable way for doing this on Windows?
(WinXP)
(If there's only a non-portable way then I'll worry about porting it
later, if and when this program will ever have a need to run on a
Unix-like environment)


Many thanks in advance,

--Tim
 

Magnus Lycka

Tim said:
Hi,

I have a need to store directory and filenames in a database. For the
database I chose to use UTF-8 encoding; but the actual encoding used is
probably immaterial: whichever coding I take, I'll run into this issue
eventually.

At first my code worked until I ran into a directory full of Cyrillic
characters and my program blew up.

How did you find the files? Did you pass a Unicode path as argument
to os.listdir()? See http://www.python.org/peps/pep-0277.html
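To illustrate the PEP 277 behaviour referred to above, here is a minimal sketch. It assumes a modern Python 3, where `str` is already Unicode (in the Python 2 of this thread's era you would pass a `u'...'` path instead). The helper name `list_names` and the sample filename are made up for the example.

```python
import os
import tempfile


def list_names(directory):
    """Return the entries of `directory`, sorted.

    The result type mirrors the argument type: a text (Unicode) path
    gives text names, a bytes path gives raw bytes names.
    """
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file with a Cyrillic name, created only for the demo.
        open(os.path.join(tmp, "тест.txt"), "w").close()
        print(list_names(tmp))               # text names
        print(list_names(os.fsencode(tmp)))  # bytes names
```

With a text path the names come back already decoded, so there is no filename encoding left to detect.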
 

Tim N. van der Leeuw

Hi Magnus,

I get the filename from a URL, which is probably not any kind of
unicode string but just a plain ASCII string. It should be possible to
convert this to a unicode string -- I'll try it right away to see if
this works.

Thanks!

--Tim
 

Christos Georgiou

So now what I need to know is, how do I find out in what encoding a
particular filename is? Is there a portable way for doing this?

You said the filename comes as data, and not as contents of os.listdir(),
right?

You can only know (almost for certain) which encodings the filename is
*not* in (by looping over encodings and marking those where .decode fails).

If it were ordinary text rather than a filename, you could be more
successful in guessing, by testing characters in pairs and their
frequencies. (Btw, it's been a long time since I requested example texts
in various encodings for my encoding-guessing app, but I was sent only one.)
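The elimination approach (looping over encodings and marking those where .decode fails) could be sketched roughly as follows. This is only an illustration in Python 3: the function name `possible_encodings`, the candidate list, and the sample bytes are all assumptions for the example.

```python
def possible_encodings(raw, candidates):
    """Return the candidates under which `raw` decodes without error.

    Success does not prove an encoding is the right one; failure
    does prove it is the wrong one.
    """
    survivors = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # ruled out: these bytes are invalid in this encoding
        survivors.append(name)
    return survivors


if __name__ == "__main__":
    # Hypothetical input: a Cyrillic word stored as cp1251 bytes.
    raw = "привет".encode("cp1251")
    print(possible_encodings(raw, ["ascii", "utf-8", "cp1251", "latin-1"]))
```

For this sample, ascii and utf-8 are ruled out while cp1251 and latin-1 both survive. latin-1 accepts every byte value, which shows why the method can only narrow the field and never confirm a single answer.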
 

Tim N. van der Leeuw

Actually, the directory name comes in as a URL, and so far I've had no
problems simply creating a unicode string from it, which I can pass to
os.walk() to get proper unicode filenames back.
Then I can encode them into UTF-8 and pass them to the database layer,
and it all works.

cheers,

--Tim
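The approach described in the post above (URL to unicode string, then os.walk(), then UTF-8 for the database) might look something like this in Python 3. It is a sketch under assumptions: the thread does not show the real code, so the function names are hypothetical, and the URL path is assumed to be percent-encoded UTF-8 that maps directly to a local path.

```python
import os
from urllib.parse import unquote, urlparse


def directory_from_url(url):
    """Turn the path part of a URL into a text (Unicode) directory name.

    Assumption: the path is percent-encoded UTF-8 and maps directly to a
    local path. Windows drive letters would need extra handling.
    """
    return unquote(urlparse(url).path)


def utf8_paths(directory):
    """Walk `directory` and return every file path encoded as UTF-8."""
    encoded = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # os.walk() yields text names for a text path, so a plain
            # .encode() is all that is needed before storing them.
            encoded.append(os.path.join(root, name).encode("utf-8"))
    return sorted(encoded)
```

Passing a text path into os.walk() is what keeps the names in Unicode throughout, so the only explicit encoding step is the final one before the database layer.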
 
