Re: Unicode File Names

Discussion in 'Python' started by John Machin, Oct 17, 2008.

  1. John Machin

    John Machin Guest

    On Oct 17, 11:43 am, Jordan <> wrote:
    > I've got a bunch of files with Japanese characters in their names and
    > os.listdir() replaces those characters with ?'s. I'm trying to open
    > the files several steps later, and obviously Python isn't going to
    > find '01-????.jpg' (formerly '01-ひらがな.jpg') because it doesn't exist.
    > I'm not sure where in the process I'm able to stop that from
    > happening. Thanks.


    The Fine Manual says:
    """
    listdir( path)

    Return a list containing the names of the entries in the directory.
    The list is in arbitrary order. It does not include the special
    entries '.' and '..' even if they are present in the directory.
    Availability: Macintosh, Unix, Windows.
    Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a
    Unicode object, the result will be a list of Unicode objects.
    """

    Are you unsure whether your version of Python is 2.3 or later?
     
    John Machin, Oct 17, 2008
    #1

  2. John Machin

    John Machin Guest

    On Oct 17, 12:52 pm, Jordan <> wrote:
    > On Oct 16, 9:20 pm, John Machin <> wrote:
    >
    >
    >
    > > On Oct 17, 11:43 am, Jordan <> wrote:

    >
    > > > I've got a bunch of files with Japanese characters in their names and
    > > > os.listdir() replaces those characters with ?'s. I'm trying to open
    > > > the files several steps later, and obviously Python isn't going to
    > > > find '01-????.jpg' (formerly '01-ひらがな.jpg') because it doesn't exist.
    > > > I'm not sure where in the process I'm able to stop that from
    > > > happening. Thanks.

    >
    > > The Fine Manual says:
    > > """
    > > listdir( path)

    >
    > > Return a list containing the names of the entries in the directory.
    > > The list is in arbitrary order. It does not include the special
    > > entries '.' and '..' even if they are present in the directory.
    > > Availability: Macintosh, Unix, Windows.
    > > Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a
    > > Unicode object, the result will be a list of Unicode objects.
    > > """

    >
    > > Are you unsure whether your version of Python is 2.3 or later?

    >
    > *** Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32
    > bit (Intel)] on win32. *** says my interpreter
    >
    > when it says "if path is a Unicode object...", does that mean the path
    > name must have a Unicode char?


    If path is a Unicode [should read unicode] object of length > 0, then
    *all* characters in path are by definition unicode characters.

    Where are you getting your path from? If you are doing
    os.listdir(r'c:\test') then do os.listdir(ur'c:\test'). If you are
    getting it from the command line or somehow else as a variable, instead
    of os.listdir(path), try os.listdir(unicode(path)). If that fails with a
    message like "UnicodeDecodeError: 'ascii' codec can't decode .....",
    then you'll need something like
    os.listdir(unicode(path, encoding='cp1252')) # cp1252 being the most likely suspect :)
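    A minimal sketch of the advice above, written so that it also runs on
    Python 3 (where str is already Unicode and unicode() no longer exists).
    The helper name listdir_text and the fallback to the filesystem encoding
    are assumptions made for illustration; they are not part of the os module:

```python
import os
import sys


def listdir_text(path, encoding=None):
    """Return directory entries as text (unicode) strings.

    Hypothetical helper: a byte-string path is decoded first, so that
    os.listdir() receives a text path and hands back text names.
    """
    if isinstance(path, bytes):
        # On Python 2, bytes is an alias for str; on Python 3 it is bytes.
        fallback = sys.getfilesystemencoding() or sys.getdefaultencoding()
        path = path.decode(encoding or fallback)
    return os.listdir(path)
```

    Passing encoding='cp1252' corresponds to the last suggestion above; which
    code page is right depends on where the byte string came from.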

    I strongly suggest that you read this:
    http://www.amk.ca/python/howto/unicode
    which contains lots of useful information, including an answer to your
    original question.
     
    John Machin, Oct 17, 2008
    #2

  3. John Machin

    John Machin Guest

    On Oct 17, 2:56 pm, Jordan <> wrote:
    > I'm not quite sure now if the problem is me, windows, or zipfile
    > (which I kinda failed to mention before). Using
    > os.listdir(unicode(os.listdir()))


    You mean os.listdir(unicode(os.getcwd())), I presume.


    > seems to have been a step in the
    > right direction (thanks Chris and John). When testing things in the
    > python interpreter, I don't seem to hit issues after using the above
    > mentioned line.
    >
    >
    > >>> l = os.listdir(unicode(os.getcwd()))
    > >>> l
    >
    > u'01-\u3072\u3089\u304c\u306a.jpg'
    > u'02-\u3072\u3089\u304c\u306a.jpg'
    > u'03-\u3072\u3089\u304c\u306a.jpg'
    >
    > >>> for thing in l:
    > ...     print thing
    > 01-ひらがな.jpg
    > 02-ひらがな.jpg
    > 03-ひらがな.jpg
    >
    > 
    > Yay.
    >
    > Having a file that tries "for thing in l: print thing" fails with:


    >
    > File "C:\Python25\Lib\encodings\cp437.py", line 12, in encode
    > return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode characters in
    > position 13-16: character maps to <undefined>
    >
    > I'm perfectly willing to let command prompt refuse to print that (it's
    > debugging only) if the next issue was resolved >_>:


    use print repr(thing) for debugging.
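    An alternative sketch for debugging output, assuming the console code
    page is cp437 as in the traceback above: encode with the
    'backslashreplace' error handler so that only the characters the code
    page lacks are escaped. The helper name console_safe is made up for this
    example. (On Python 3, repr() no longer escapes non-ASCII characters;
    ascii() is the equivalent there.)

```python
def console_safe(text, encoding='cp437'):
    """Return text with every character the given code page cannot
    represent replaced by a backslash escape (hypothetical helper)."""
    return text.encode(encoding, 'backslashreplace').decode(encoding)


name = u'01-\u3072\u3089\u304c\u306a.jpg'
# The hiragana are not in cp437, so they come out as \uXXXX escapes.
print(console_safe(name))
```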

    >
    > """
    > Note: There is no official file name encoding for ZIP files. If you
    > have unicode file names, please convert them to byte strings in your
    > desired encoding before passing them to write(). WinZip interprets all
    > file names as encoded in CP437, also known as DOS Latin.
    > """
    >
    > I'm simply not sure what this means and how to deal with it.


    Step 1:
    Read appendix D of http://www.pkware.com/documents/casestudies/APPNOTE.TXT

    Step 2:
    Note the change history at the start of that document:
    """
    6.3.0 -Added tape positioning storage 09/29/2006
    parameters
    [snip]
    -Added option for Unicode filename
    storage
    """

    Step 3: Read http://bugs.python.org/issue1734346

    Step 4: Either wait for Python 2.7 or apply the patch to your own copy
    of zipfile ...
     
    John Machin, Oct 17, 2008
    #3
  4. > Step 4: Either wait for Python 2.7 or apply the patch to your own copy
    > of zipfile ...


    Actually, this is released in Python 2.6, see r62724.
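    A small sketch of what that support looks like from the outside, assuming
    a zipfile module that sets general purpose bit 11 (0x800) for names it
    stores as UTF-8. The function name roundtrip_names is invented for the
    example, and the with-statement form needs Python 2.7 or 3.2+:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8


def roundtrip_names(names):
    """Write empty members with the given names to an in-memory zip and
    return (name, utf8_flag_set) pairs as read back from the archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        for name in names:
            zf.writestr(name, b'')
    with zipfile.ZipFile(buf) as zf:
        return [(info.filename, bool(info.flag_bits & UTF8_FLAG))
                for info in zf.infolist()]
```

    Calling it with an ASCII name and a Japanese name shows which members get
    the flag; the Japanese name should come back unchanged.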

    Regards,
    Martin
     
    Martin v. Löwis, Oct 17, 2008
    #4
  5. John Machin

    John Machin Guest

    On Oct 17, 6:32 pm, "Martin v. Löwis" <> wrote:
    > > Step 4: Either wait for Python 2.7 or apply the patch to your own copy
    > > of zipfile ...

    >
    > Actually, this is released in Python 2.6, see r62724.


    Hi Martin,

    That's good. I was led astray by the fact that the 2.6 docs still
    contain the note that the OP asked about: "There is no official file
    name encoding for ZIP files. If you have unicode file names, you must
    convert them to byte strings in your desired encoding before passing
    them to write(). WinZip interprets all file names as encoded in CP437,
    also known as DOS Latin."

    The first sentence was and is bafflegab, the second didn't mention the
    portability issues arising from its suggestion (and is now not true),
    and the third needs explanation or omission. I believe that WinZip has
    supported utf8 since v11.2.

    Should the note be removed, or should it say something like "Unicode
    file names are supported. New in Python 2.6."? Is there anything else
    that should be mentioned?

    More on cp437: I see where you mentioned to the patch author that a
    unicode string should be encoded in cp437 if possible, but this was
    not done -- it first tries ascii. What are your views on what encoding
    should be assumed if the utf8 flag is not set?

    Cheers,
    John
     
    John Machin, Oct 17, 2008
    #5
  6. Mark Tolonen

    Mark Tolonen Guest

    "Jordan" <> wrote in message
    news:...
    >>>> l = os.listdir(unicode(os.getcwd()))


    Other options to get the same result:

    l = os.listdir(os.getcwdu())
    l = os.listdir(u'.')

    Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0. Since
    the behavior is now identical it seems os.getcwdu() should be dropped.

    -Mark
     
    Mark Tolonen, Oct 17, 2008
    #6
  7. > Should the note be removed, or should it say something like "Unicode
    > file names are supported. New in Python 2.6."? Is there anything else
    > that should be mentioned?


    The note should be corrected, documenting the behaviour implemented.

    > More on cp437: I see where you mentioned to the patch author that a
    > unicode string should be encoded in cp437 if possible, but this was
    > not done -- it first tries ascii. What are your views on what encoding
    > should be assumed if the utf8 flag is not set?


    There isn't any standard that is widely followed (just as the note that
    you declared bafflegab says). While APPNOTE.TXT specifies it as cp437,
    implementations often ignore that, because a) they didn't know, and b)
    cp437 was too limited for what they want to do. So we see all kinds of
    alternative implementations - often involving the locale's code page
    (and on Windows, both OEMCP and ACP get used - often just as a side
    effect of whatever internal representation the applications use).

    In 2.x, Python doesn't need to decide, so when opening a zip file, the
    file names get reported as byte strings unless they have the UTF-8
    bit set (in which case they get decoded). In 3.x, file names (in the
    zipfile module) uniformly use the (unicode) character string type, hence
    that version implements the spec, by decoding as 437.
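    The reading rule described above can be sketched as follows. This is an
    illustration of the rule, not the zipfile source; decode_zip_name is a
    hypothetical name:

```python
UTF8_FLAG = 0x800  # general purpose bit 11 in the zip header


def decode_zip_name(raw, flag_bits):
    """Decode a stored file name: UTF-8 if the flag is set, else cp437."""
    if flag_bits & UTF8_FLAG:
        return raw.decode('utf-8')
    # Every byte value maps to some cp437 character, so this never raises,
    # but the result is wrong if the writer used another code page.
    return raw.decode('cp437')
```

    The same bytes give different names depending on the flag, which is why
    a writer that ignores the spec produces mangled names on the reader's side.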

    Upon encoding, choosing between ASCII and CP437 has trade-offs. Notice
    how both formally comply with the spec, as ASCII is a subset of
    CP437 (i.e. even though it uses the ASCII codec, it *still* encodes
    as CP437). The trade-offs can be studied by looking at three groups
    of file names:
    - pure ASCII; choice does not matter (both ascii and cp437 can
      encode the file name, and both get the same result)
    - arbitrary string containing non-CP437 characters; choice does
      not matter (neither ascii nor cp437 can encode, so the UTF-8
      bit must be used)
    - others; here are the tradeoffs. Pro ASCII: receiver can unambiguously
      reproduce the original file name, as the UTF-8 bit will be set.
      Pro CP437: old software (unaware of the UTF-8 bit) has a chance
      of correctly guessing the file name (if it followed APPNOTE.TXT).
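    The three groups, and the ASCII-first choice on writing, can be sketched
    like this. Both function names are made up for the example, and the code
    only mirrors the strategy described in this post:

```python
UTF8_FLAG = 0x800  # general purpose bit 11 in the zip header


def encode_zip_name(name):
    """ASCII-first strategy: plain ASCII if possible, else UTF-8 plus flag."""
    try:
        return name.encode('ascii'), 0
    except UnicodeEncodeError:
        return name.encode('utf-8'), UTF8_FLAG


def name_group(name):
    """Say which of the three groups a file name falls into."""
    for codec, label in (('ascii', 'pure ASCII'),
                         ('cp437', 'CP437 but not ASCII')):
        try:
            name.encode(codec)
            return label
        except UnicodeEncodeError:
            pass
    return 'outside CP437'
```

    Only names in the middle group are affected by the choice: with the
    ASCII-first strategy they are stored as UTF-8 with the flag set, where a
    CP437-first strategy would have stored them as single cp437 bytes.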

    I (now) prefer the tradeoff being taken, as it's the one that
    produces more reliable results in the long run (i.e. when more
    and more zip readers support UTF-8).

    Regards,
    Martin
     
    Martin v. Löwis, Oct 18, 2008
    #7
  8. John Machin

    John Machin Guest

    On Oct 18, 5:57 pm, "Martin v. Löwis" <> wrote:
    > > Should the note be removed, or should it say something like "Unicode
    > > file names are supported. New in Python 2.6."? Is there anything else
    > > that should be mentioned?

    >
    > The note should be corrected, documenting the behaviour implemented.
    >
    > > More on cp437: I see where you mentioned to the patch author that a
    > > unicode string should be encoded in cp437 if possible, but this was
    > > not done -- it first tries ascii. What are your views on what encoding
    > > should be assumed if the utf8 flag is not set?

    >


    [lots of enlightenment snipped]

    Thanks heaps, Martin.
    Cheers,
    John
     
    John Machin, Oct 18, 2008
    #8
  9. > Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0.
    > Since the behavior is now identical it seems os.getcwdu() should be
    > dropped.


    It is dropped, and os.getcwdb() has been added.
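    For reference, a short Python 3 sketch of the pair as it ended up.
    os.fsdecode() is used here as an assumption about the environment; it
    needs Python 3.2 or later:

```python
import os

cwd_text = os.getcwd()    # str (Unicode) on Python 3
cwd_bytes = os.getcwdb()  # the same directory as a bytes object

# os.fsdecode() converts the bytes form using the filesystem encoding.
same = os.fsdecode(cwd_bytes) == cwd_text
print(type(cwd_text).__name__, type(cwd_bytes).__name__, same)
```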

    Regards,
    Martin
     
    Martin v. Löwis, Oct 18, 2008
    #9
  10. Mark Tolonen

    Mark Tolonen Guest

    "Martin v. Löwis" <> wrote in message
    news:48f9de43$0$5124$...
    >> Oddly, os.getcwd() and os.getcwdu() both still exist in Python 3.0.
    >> Since the behavior is now identical it seems os.getcwdu() should be
    >> dropped.

    >
    > It is dropped, and os.getcwdb() has been added.


    That must have changed after 3.0rc1; I seem to remember reading about
    it in another thread:

    Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> [s for s in dir(os) if 'cwd' in s]

    ['getcwd', 'getcwdu']

    -Mark
     
    Mark Tolonen, Oct 18, 2008
    #10
