Preserving unicode filename encoding

Julien Phalip

Hi,

I've noticed that the encoding of non-ascii filenames can be inconsistent between platforms when using the built-in open() function to create files.

For example, on a Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f') gets encoded as u'ş' (u's\u0327'). Note how the two characters look exactly the same but are encoded differently. The original character uses only one code (u'\u015f'), but the resulting character that is saved on the file system will be made of a combination of two codes: the letter 's' followed by a diacritical cedilla (u's\u0327'). (You can learn more about diacritics in [1].) On the Mac, however, the original encoding is always preserved.

This issue was also discussed in a blog post by Ned Batchelder [2]. One suggested approach is to normalize the filename; however, this could result in loss of information (what if, for example, the original filename did contain combining diacritics and we wanted to preserve them?).

Ideally, it would be preferable to preserve the original encoding. Is that possible or is that completely out of Python's control?

Thanks a lot,

Julien

[1] http://en.wikipedia.org/wiki/Combining_diacritic#Unicode_ranges
[2] http://nedbatchelder.com/blog/201106/filenames_with_accents.html
 
Nobody

I've noticed that the encoding of non-ascii filenames can be inconsistent
between platforms when using the built-in open() function to create files.

For example, on a Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f')
gets encoded as u'ş' (u's\u0327'). Note how the two characters look
exactly the same but are encoded differently. The original character uses
only one code (u'\u015f'), but the resulting character that is saved on
the file system will be made of a combination of two codes: the letter 's'
followed by a diacritical cedilla (u's\u0327'). (You can learn more about
diacritics in [1]). On the Mac, however, the original encoding is always
preserved.

This issue was also discussed in a blog post by Ned Batchelder [2].

You are conflating two distinct issues here: representation (how a
given "character" is represented as a Unicode string) and encoding (how a
given Unicode string is represented as a byte string).

E.g. you state:
For example, on a Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f')
gets encoded as u'ş' (u's\u0327').

which is incorrect.

The latter isn't an "encoding" of the former. They are alternate Unicode
representations of the same character. The former uses a pre-composed
character (LATIN SMALL LETTER S WITH CEDILLA) while the latter uses a
letter 's' with a combining accent (COMBINING CEDILLA).
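
To make the difference concrete, here is a small sketch (Python 3 assumed; the helper name describe() is made up for illustration) that lists the Unicode name of every code point in each of the two strings:

```python
import unicodedata

precomposed = "\u015f"   # one code point
decomposed = "s\u0327"   # base letter followed by a combining mark


def describe(text):
    """Return the Unicode name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(describe(precomposed))      # a single name
print(describe(decomposed))       # two names: the letter, then the combining mark
print(precomposed == decomposed)  # the strings differ even though they render alike
```

Both strings are already Unicode at this point; no byte encoding has been involved yet.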

Unlike the Mac, neither Unix nor Windows will automatically normalise
Unicode strings. A Unix filename is a sequence of bytes, nothing more and
nothing less. This is part of the reason why Unix filenames are case
sensitive: case applies to characters, and the kernel doesn't know which
characters, if any, those bytes are meant to represent.

Python will convert a Unicode string to a sequence of bytes using the
filesystem encoding. If the encoding is UTF-8, then u'\u015f'
will be encoded as b'\xc5\x9f', while u's\u0327' will be encoded as
b's\xcc\xa7'.
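
As a quick check of the byte sequences above, the following sketch (Python 3 assumed) encodes both strings explicitly as UTF-8. The filesystem encoding on a given machine may differ, so it is only printed here, not relied upon:

```python
import sys

precomposed = "\u015f"
decomposed = "s\u0327"

# Encode explicitly as UTF-8 so the result does not depend on the local setup.
utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")

print(utf8_precomposed)  # b'\xc5\x9f'
print(utf8_decomposed)   # b's\xcc\xa7'

# The encoding Python actually uses for filenames on this machine.
print(sys.getfilesystemencoding())
```

On a system whose filesystem encoding is UTF-8, these are the bytes that end up in the directory entry.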

If you want to convert a Unicode string to a given normalisation, you can
use unicodedata.normalize(), e.g.:
    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u's\u0327') == u'\u015f'
    True
    >>> unicodedata.normalize('NFD', u'\u015f') == u's\u0327'
    True

However: if you want to access an existing file, you must use the filename
as it appears on disc. On Unix and Windows, it's perfectly possible to
have two files named e.g. '\u015f.txt' and 's\u0327.txt' in the same
directory. Which one gets opened depends upon the exact sequence of
Unicode codepoints passed to open().
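
One possible way to cope with this, if you do not know which representation was used when the file was created, is to list the directory and compare names after normalising both sides. This is only a sketch (Python 3 assumed); match_name() and find_on_disk() are hypothetical helpers, not part of the standard library:

```python
import os
import unicodedata


def match_name(wanted, candidates):
    """Return the first candidate equal to wanted after NFC normalisation, else None."""
    target = unicodedata.normalize("NFC", wanted)
    for candidate in candidates:
        if unicodedata.normalize("NFC", candidate) == target:
            # Return the candidate unchanged: this is the spelling to pass to open().
            return candidate
    return None


def find_on_disk(directory, wanted):
    """Look up the exact on-disk spelling of wanted inside directory."""
    return match_name(wanted, os.listdir(directory))
```

For example, match_name(u'\u015f.txt', [u's\u0327.txt']) returns u's\u0327.txt', which is the name to use when opening the file. If both spellings exist in the same directory, the lookup is ambiguous and this sketch simply returns whichever one it sees first.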

The situation is different on the Mac, where system libraries
automatically impose a specific representation on filenames, and will
normalise Unicode strings to that representation.
 
