Printing Filenames with non-Ascii-Characters

Discussion in 'Python' started by Marian Aldenhövel, Feb 1, 2005.

  1. Hi,

    I am very new to Python and have run into the following problem. If I do
    something like

    dir = os.listdir(somepath)
    for d in dir:
        print d

    The program fails for filenames that contain non-ascii characters.

    'ascii' codec can't encode characters in position 33-34:

    I have noticed that this seems to be a very common problem. I have read a lot
    of postings regarding it but not really found a solution. Is there a simple
    one?

    What I specifically do not understand is why Python wants to interpret the
    string as ASCII at all. Where is this setting hidden?

    I am running Python 2.3.4 on Windows XP and I want to run the program on
    Debian sarge later.

    Ciao, MM
    --
    Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
    http://www.marian-aldenhoevel.de
    "There is a procedure to follow in these cases, and if followed it can
    pretty well guarantee a generous measure of success, success here
    defined as survival with major extremities remaining attached."
    Marian Aldenhövel, Feb 1, 2005
    #1

  2. aurora

    On Tue, 01 Feb 2005 20:28:11 +0100, Marian Aldenhövel wrote:

    > Hi,
    >
    > I am very new to Python and have run into the following problem. If I do
    > something like
    >
    > dir = os.listdir(somepath)
    > for d in dir:
    >     print d
    >
    > The program fails for filenames that contain non-ascii characters.
    >
    > 'ascii' codec can't encode characters in position 33-34:
    >
    > I have noticed that this seems to be a very common problem. I have read a lot
    > of postings regarding it but not really found a solution. Is there a simple
    > one?


    The English Windows command prompt uses the cp437 charset. To print it, use

    print d.encode('cp437')

    The issue is that a terminal understands only a certain character set. If
    you have a unicode string, like d in your case, you have to encode it
    before it can be printed. (We really need a native unicode terminal!!!) If
    you don't encode, Python will do it for you. The default encoding is ASCII,
    so any string that contains a non-ASCII character will give you trouble. In
    my opinion Python is too conservative in using the 'strict' encoding, which
    gives users unaware of unicode a lot of woes.
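A small sketch of the point about 'strict' encoding, written for Python 3 (where `str` is the Unicode type and `bytes` the encoded one; in the Python 2 of this thread those are `unicode` and `str`). The filename is borrowed from later in the thread, and `encode_for_terminal` is a hypothetical helper name, not something Python provides.

```python
def encode_for_terminal(name, encoding="ascii", errors="replace"):
    """Encode a text filename for a byte-oriented terminal.

    With errors="strict" an unencodable character raises UnicodeEncodeError;
    with errors="replace" it becomes "?" and the program keeps running.
    """
    return name.encode(encoding, errors)


name = "straßenschild.png"

# Strict encoding to ASCII fails: this is the error from the first post.
try:
    encode_for_terminal(name, errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace" gives garbled output instead of a crash.
garbled = encode_for_terminal(name)

# A code page that contains the character encodes it without loss.
cp437_bytes = encode_for_terminal(name, encoding="cp437", errors="strict")
```

Whether garbled output is acceptable depends on the program; for a directory listing it may be, for data that is written back to disk it usually is not.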

    So how did you get a unicode d to start with? If 'somepath' is unicode,
    os.listdir returns a list of unicode strings. So why is somepath unicode?
    Either you have entered a unicode literal or it comes from some other
    source. One possible source is an XML parser, which returns strings in
    unicode.

    Windows NT supports unicode filenames. I'm not sure about Linux. The result
    may differ slightly.






    >
    > What I specifically do not understand is why Python wants to interpret the
    > string as ASCII at all. Where is this setting hidden?
    >
    > I am running Python 2.3.4 on Windows XP and I want to run the program on
    > Debian sarge later.
    >
    > Ciao, MM
    aurora, Feb 1, 2005
    #2

  3. Serge Orlov

    Marian Aldenhövel wrote:
    > Hi,
    >
    > I am very new to Python and have run into the following problem. If I do
    > something like
    >
    > dir = os.listdir(somepath)
    > for d in dir:
    >     print d
    >
    > The program fails for filenames that contain non-ascii characters.
    >
    > 'ascii' codec can't encode characters in position 33-34:
    >
    > I have noticed that this seems to be a very common problem. I have read a lot
    > of postings regarding it but not really found a solution. Is there a simple
    > one?


    No :) You're trying to deal with legacy terminals; you can't reliably
    print unicode characters across various terminals. It's not really
    Python's fault.

    >
    > What I specifically do not understand is why Python wants to interpret the
    > string as ASCII at all. Where is this setting hidden?


    http://www.python.org/moin/PrintFails Let me know if it's not clear. It
    would be great if other people fixed/improved this page.

    > I am running Python 2.3.4 on Windows XP and I want to run the program on
    > Debian sarge later.


    You need a cross-platform terminal that supports unicode output.
    Sergey.
    Serge Orlov, Feb 1, 2005
    #3
  4. Marian Aldenhövel wrote:
    > Hi,
    >
    > I am very new to Python and have run into the following problem. If I do
    > something like
    >
    > dir = os.listdir(somepath)
    > for d in dir:
    >     print d
    >
    > The program fails for filenames that contain non-ascii characters.
    >
    > 'ascii' codec can't encode characters in position 33-34:


    If you read this carefully, you'll notice that Python has tried and
    failed to *encode* a decoded ( = unicode) string using the 'ascii'
    codec. IOW, d seems to be bound to a unicode string. Which is unexpected
    unless maybe the argument passed to os.listdir (somepath) is a Unicode
    string, too. (If given a Unicode string as argument, os.listdir will
    return the list as a list of unicode names).
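The rule that the type of the argument decides the type of the returned names can be checked directly. This is a minimal sketch for Python 3, where the distinction is `str` versus `bytes` rather than the `unicode` versus `str` of Python 2; the scratch directory and the file name `test.py` are made up for the illustration.

```python
import os
import shutil
import tempfile

# Create a scratch directory holding a single file.
scratch = tempfile.mkdtemp()
with open(os.path.join(scratch, "test.py"), "w"):
    pass

# A text path yields text names; a bytes path yields bytes names.
text_names = os.listdir(scratch)
byte_names = os.listdir(os.fsencode(scratch))

# Remove the scratch directory again.
shutil.rmtree(scratch)
```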

    If you're printing to the console, modern Pythons will try to guess the
    console's encoding (e.g. cp850). I would expect a UnicodeEncodeError if
    the print fails because the characters do not map to the console's
    encoding, not the error you're seeing.

    How *are* you running the program? In the console (cmd.exe)? Or from
    some IDE?

    >
    > I have noticed that this seems to be a very common problem. I have read a lot
    > of postings regarding it but not really found a solution. Is there a simple
    > one?
    >
    > What I specifically do not understand is why Python wants to interpret the
    > string as ASCII at all. Where is this setting hidden?


    Don't be tempted to ever change sys.defaultencoding in site.py, this is
    site specific, meaning that if you ever distribute them, programs
    relying on this setting may fail on other people's Python installations.

    --
    Vincent Wehren

    >
    > I am running Python 2.3.4 on Windows XP and I want to run the program on
    > Debian sarge later.
    >
    > Ciao, MM
    vincent wehren, Feb 1, 2005
    #4
  5. Hi,

    Thank you very much, you have collectively cleared up some of the confusion.

    > The English Windows command prompt uses the cp437 charset.


    To be exact, my Windows is German, but I am not outputting to the command
    prompt window. I am using Eclipse with the pydev plugin as development
    platform and the output is redirected to the console view in the IDE. I am
    not sure how this affects the problem and have since tried a vanilla console
    too. The problem stays the same, though.

    I wonder what surprises are waiting for me when I first move this to my
    Linux box :). I believe it uses UTF-8 throughout.

    > print d.encode('cp437')


    So I would have to specify the encoding on every call to print? I am sure to
    forget and I don't like the program dying, in my case garbled output would be
    much more acceptable.

    Is there some global way of forcing an encoding instead of the default
    'ascii'? I have found references to setencoding() but this seems to have gone
    away.

    > The issue is that a terminal understands only a certain character set.


    I have experimented a bit now and I can make it work using encode(). The
    eclipse console uses a different encoding than my windows command prompt, by
    the way. I am sure this can be configured somewhere but I do not really care
    at the moment.

    > If you have a unicode string, like d in your case, you have to encode it
    > before it can be printed.


    I got that now.

    So encode() is a method of a unicode string, right? I come from a background
    of statically typed languages so I am a bit queasy when I am not allowed to
    explicitly specify type.

    How can I, maybe by print()-ing something, find out what type d actually
    is? Just to make sure and get a better feeling for the system?

    Should d at any time not be a unicode string but some other flavour of string,
    will encode() still work? Or do I need to write a function myPrint() that
    distinguishes them by type and calls encode() only for unicode strings?

    > So how did you get a unicode d to start with?


    I have asked myself this question before after reading the docs for
    os.listdir(). But I have no way of finding out what type d really is (see
    question above :)). So I was dead-reckoning.

    Can I force a string to be of a certain type? Like

    nonunicode=unicode.encode("specialencoding")

    How would I do it the other way round? From encoded representation to full
    unicode?
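The two directions asked about here can be sketched roughly as follows. This is written for Python 3, where text is `str` and its encoded form is `bytes` (in Python 2 the same roles are played by `unicode` and `str`). `to_text` is a hypothetical helper name, and cp1252 is only an assumed example encoding.

```python
def to_text(value, encoding="cp1252"):
    """Return value as text, decoding only if it is an encoded byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Text to encoded bytes, and back again.
encoded = "Übersetzung.rtf".encode("cp1252")
decoded = encoded.decode("cp1252")

# type() reports which flavour a value is.
kinds = (type(encoded).__name__, type(decoded).__name__)

# The helper accepts either flavour and always hands back text.
same = to_text(encoded) == to_text(decoded)
```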

    > If 'somepath' is unicode, os.listdir returns a list of unicode.
    > So why is somepath unicode?


    > One possible source is an XML parser, which returns strings in unicode.


    I get a root-directory from XML and I walk the filesystem from there. That
    explains it.

    > Windows NT supports unicode filenames. I'm not sure about Linux. The
    > result may differ slightly.


    I think I will worry about that later. I can create files using German umlauts
    on the Linux box. I am sure I will find a way to move those names into my
    Python program.

    I will not move data between the systems so there will not be much of
    a problem.

    Ciao, MM
    --
    Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
    http://www.marian-aldenhoevel.de
    "There is a procedure to follow in these cases, and if followed it can
    pretty well guarantee a generous measure of success, success here
    defined as survival with major extremities remaining attached."
    Marian Aldenhövel, Feb 2, 2005
    #5
  6. Hi,

    > Don't be tempted to ever change sys.defaultencoding in site.py, this is
    > site specific, meaning that if you ever distribute them, programs
    > relying on this setting may fail on other people's Python installations.


    But wouldn't that be correct in my case?

    > If you're printing to the console, modern Pythons will try to guess the
    > console's encoding (e.g. cp850).


    But it seems to have guessed wrong. I don't blame it, I would not know of
    any way to reliably figure out this setting.

    My console can print the filenames in question fine, I can verify that by
    simply listing the directory, so it can display more than plain ascii.
    The error message seems to indicate that ascii is used as target.

    So if I were to fix this in site.py to configure whatever encoding is
    actually used on my system, I could print() my filenames without explicitly
    calling encode()?

    If the program then fails on other people's installations that would mean
    one of two things:

    1) They have not configured their encoding correctly.
    2) The data to be printed cannot be encoded. This is unlikely as it comes
    from a local filename.

    So wouldn't fixing site.py be the right thing to do? To enable Python to print
    everything that can actually be printed and not barf at things it could print
    but cannot because it defaults to plain ascii?

    Ciao, MM
    --
    Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
    http://www.marian-aldenhoevel.de
    "There is a procedure to follow in these cases, and if followed it can
    pretty well guarantee a generous measure of success, success here
    defined as survival with major extremities remaining attached."
    Marian Aldenhövel, Feb 2, 2005
    #6
  7. Max M

    Marian Aldenhövel wrote:

    > > If you're printing to the console, modern Pythons will try to guess the
    > > console's encoding (e.g. cp850).
    >
    > But it seems to have guessed wrong. I don't blame it, I would not know of
    > any way to reliably figure out this setting.


    Have you set the coding cookie in your file?

    Try adding this as the first or second line.

    # -*- coding: cp850 -*-

    Python will then know how your file is encoded

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
    Max M, Feb 2, 2005
    #7
  8. Hi,

    > Have you set the coding cookie in your file?


    Yes. I set it to UTF-8 as that's what I use for all my development.

    > Try adding this as the first or second line.
    >
    > # -*- coding: cp850 -*-
    >
    > Python will then know how your file is encoded


    That is relevant to the encoding of source-files, right? How does it affect
    printing to standard out?

    If it did, I would expect UTF-8 data on my console. That would be fine: it
    can encode everything and, as I have written in another posting, in my case
    garbled data is better than termination of my program.

    But it uses 'ascii', at least if I can believe the error message it gave.

    Ciao, MM
    --
    Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
    http://www.marian-aldenhoevel.de
    "There is a procedure to follow in these cases, and if followed it can
    pretty well guarantee a generous measure of success, success here
    defined as survival with major extremities remaining attached."
    Marian Aldenhövel, Feb 2, 2005
    #8
  9. Marian Aldenhövel wrote:
    >
    > But wouldn't that be correct in my case?
    >


    This is what I get inside Eclipse using pydev when I run:

    <code>
    import os
    dirname = "c:/test"
    print dirname
    for fname in os.listdir(dirname):
        print fname
        if os.path.isfile(fname):
            print fname
    </code>:

    c:/test
    straßenschild.png
    test.py
    Übersetzung.rtf


    This is what I get passing a unicode argument to os.listdir:

    <code>
    import os
    dirname = u"c:/test"
    print dirname # will print fine, all ascii subset compatible
    for fname in os.listdir(dirname):
        print fname
        if os.path.isfile(fname):
            print fname
    </code>

    c:/test
    Traceback (most recent call last):
      File "C:\Programme\eclipse\workspace\myFirstProject\pythonFile.py", line 5, in ?
        print fname
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 4: ordinal not in range(128)

    which is probably what you are getting, right?

    You are trying to write *Unicode* objects containing characters outside
    the ASCII range (0-127) to a byte-oriented output without telling Python
    the appropriate encoding to use. Inside Eclipse, Python will always use
    ascii and never guess.

    import os
    dirname = u"c:/test"
    print dirname
    for fname in os.listdir(dirname):
        print type(fname)

    c:/test
    <type 'unicode'>
    <type 'unicode'>
    <type 'unicode'>



    so finally:
    <code>
    import os
    dirname = u"c:/test"
    print dirname
    for fname in os.listdir(dirname):
        print fname.encode("mbcs")
    </code>

    gives:

    c:/test
    straßenschild.png
    test.py
    Übersetzung.rtf

    Instead of "mbcs", which should be available on all Windows systems, you
    could have used "cp1252" when working on a German locale; inside Eclipse
    even "utf-16-le" would work, underscoring that the way the 'output
    device' handles encodings is decisive. I know this all seems awkward at
    first, but Python's drive towards uncompromising explicitness pays off
    big time when you're dealing with multilingual data.
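To make the remark about the output device concrete, here is a small sketch in Python 3 syntax (the thread itself uses Python 2). It encodes one of the filenames above with three of the codecs mentioned and then shows what happens when bytes meant for one device are interpreted by another. The variable names are illustrative only.

```python
name = "straßenschild.png"

# The same text becomes different bytes under each codec.
encoded = {codec: name.encode(codec) for codec in ("cp1252", "cp850", "utf-16-le")}

# Bytes produced for a cp1252 device, displayed the way a cp850 console
# would interpret them: the non-ASCII character comes out wrong.
shown_by_cp850_console = encoded["cp1252"].decode("cp850")

# Only the matching codec recovers the original name.
round_trip = encoded["cp850"].decode("cp850")
```

The ASCII letters survive in the two single-byte code pages, which is why this kind of mismatch often goes unnoticed until the first umlaut shows up.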

    --
    Vincent Wehren
    vincent wehren, Feb 2, 2005
    #9
  10. aurora

    > > print d.encode('cp437')
    >
    > So I would have to specify the encoding on every call to print? I am sure to
    > forget and I don't like the program dying, in my case garbled output would be
    > much more acceptable.


    Marian, I'm with you. You never know whether you have put encode() in all
    the right places, and there is no static type checking to help you. So the
    short answer is to set a different default in sitecustomize.py. I'm trying
    to write up something about unicode in Python, once I understand what's
    going on inside...
    aurora, Feb 3, 2005
    #10
  11. Hi,

    > Python's drive towards uncompromising explicitness pays off
    > big time when you're dealing with multilingual data.


    Except for the very implicit choice of 'ascii' as an encoding when
    it cannot make a good guess of course :).

    All in all I agree, however.

    Ciao, MM
    --
    Marian Aldenhövel, Rosenhain 23, 53123 Bonn. +49 228 624013.
    http://www.marian-aldenhoevel.de
    "There is a procedure to follow in these cases, and if followed it can
    pretty well guarantee a generous measure of success, success here
    defined as survival with major extremities remaining attached."
    Marian Aldenhövel, Feb 3, 2005
    #11
  12. Marian Aldenhövel wrote:
    > Hi,
    >
    > > Python's drive towards uncompromising explicitness pays off
    > > big time when you're dealing with multilingual data.
    >
    > Except for the very implicit choice of 'ascii' as an encoding when
    > it cannot make a good guess of course :).


    Since 'ascii' is a legal subset of Unicode and of most prevailing
    encodings, this is the only sensible thing to do. It is outside of the
    ascii range where characters become ambiguous and need additional
    interpretation. Where other languages might ignore the problem at hand
    and send garbled data or replaced characters to the output, Python at
    least lets you respond to conversion problems/errors.

    >
    > All in all I agree, however.


    That's good to hear ;)

    --
    Vincent Wehren



    >
    > Ciao, MM
    vincent wehren, Feb 4, 2005
    #12
  13. Marian Aldenhövel wrote:
    > dir = os.listdir(somepath)
    > for d in dir:
    >     print d
    >
    > The program fails for filenames that contain non-ascii characters.
    >
    > 'ascii' codec can't encode characters in position 33-34:


    I cannot reproduce this. On my system, all such file names print just
    fine, in Python 2.3.4.

    Are you using the Windows XP cmd.exe window to perform this experiment?
    What (language) version of XP are you using?
    What are your regional settings?
    What is sys.stdout.encoding?
    What is the result of locale.setlocale(locale.LC_ALL, "")?

    > What I specifically do not understand is why Python wants to interpret the
    > string as ASCII at all. Where is this setting hidden?


    If this is in cmd.exe (or equivalent), Python
    1. verifies that the output is indeed a terminal
    2. if it is, determines what code page the terminal uses
    3. tries to find a codec for this code page
    4. if this fails, falls back to ASCII
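The four steps above can be sketched as a small function. This is an illustration in Python 3 syntax, not the interpreter's actual code: `choose_output_encoding`, `FakeTerminal` and the `reported_code_page` parameter are hypothetical names, and the assumption that a non-terminal stream also ends up with ASCII is taken from the behaviour described elsewhere in this thread.

```python
import codecs
import io


def choose_output_encoding(stream, reported_code_page):
    """Pick an output encoding following the four steps listed above.

    reported_code_page stands in for whatever the terminal reports; the real
    lookup is platform-specific and not modelled here.
    """
    if not stream.isatty():                 # step 1: is it a terminal?
        return "ascii"                      # assumption: no guess is made
    try:
        codecs.lookup(reported_code_page)   # steps 2 and 3: find a codec
    except LookupError:
        return "ascii"                      # step 4: fall back to ASCII
    return reported_code_page


class FakeTerminal:
    """Minimal stand-in for a stream attached to a terminal."""

    def isatty(self):
        return True


on_console = choose_output_encoding(FakeTerminal(), "cp850")
unknown_page = choose_output_encoding(FakeTerminal(), "cp99999")
redirected = choose_output_encoding(io.StringIO(), "cp850")
```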

    In your case, step 4 must have happened. It's not clear to me why this
    happened, unless you are using a Japanese version of XP (or some other
    version for which Python does not have a codec).

    > I am running Python 2.3.4 on Windows XP and I want to run the program on
    > Debian sarge later.


    In Linux, make sure that LANG is set to a value that allows Python to
    infer the encoding of the terminal.

    Regards,
    Martin
    Martin v. Löwis, Feb 8, 2005
    #13
  14. Marian Aldenhövel wrote:
    > > If you're printing to the console, modern Pythons will try to guess the
    > > console's encoding (e.g. cp850).
    >
    > But it seems to have guessed wrong. I don't blame it, I would not know of
    > any way to reliably figure out this setting.


    It's actually very easy. Python invokes GetConsoleOutputCP() to find out
    the encoding of the console (if the output is to a console, as
    determined by isatty()).

    > My console can print the filenames in question fine, I can verify that by
    > simply listing the directory, so it can display more than plain ascii.
    > The error message seems to indicate that ascii is used as target.


    Yes, because that is the fallback.

    > So if I were to fix this in site.py to configure whatever encoding is
    > actually used on my system, I could print() my filenames without explicitly
    > calling encode()?


    Yes. However, you cannot put a reasonable value in there, because
    different parts of your system use different encodings. In particular,
    the console likely uses CP 850, whereas the rest of your system
    likely uses CP 1252.

    > So wouldn't fixing site.py be the right thing to do?


    No. If they put CP850 into sitecustomize, Unicode in user interfaces
    (menus etc.) might be displayed as mojibake, as the user interface
    will likely assume CP1252, not CP850.

    Regards,
    Martin
    Martin v. Löwis, Feb 8, 2005
    #14