UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

Discussion in 'Python' started by Mike Dee, Feb 22, 2005.

  1. Mike Dee

    Mike Dee Guest

    A very very basic UTF-8 question that's driving me nuts:

    If I have this in the beginning of my Python script in Linux:

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-

    should I - or should I not - be able to use non-ASCII characters
    in strings and in Tk GUI button labels and GUI window titles and in
    raw_input data without Python returning wrong case in manipulated
    strings and/or gibberished characters in Tk GUI title?



    With non-ASCII characters I mean (ISO-8859-1??) stuff like the
    German / Swedish / Finnish / etc "umlauted" letter A (= a diaeresis;
    that is, an 'A' with two dots above it, or an O with two dots above.)

    In Linux in the Tk(?) GUI of my 'program' I get an uppercase "A"
    with a tilde above - followed by a general currency symbol ['spider'].
    That is, two wrong characters where a small umlauted letter "a"
    should be.

    But in Windows XP exactly the *same* code (The initiating "#!/usr/bin
    /env python" and all..) works just fine in the Tk GUI - non-ascii
    characters showing just as they should. (The code in both cases is
    without any u' prefixes in strings.)



    I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I
    have saved my 'source code' in UTF-8 format and I have tried to read
    *a lot* of information about Unicode and I have heard it said many
    times that Python handles unicode very well -- so why can it be so
    bl**dy difficult to get an umlauted (two-dotted) letter a to be
    properly handled by Python 2.3? In Windows I have Python 2.4 - but the
    following case-insanity applies for Windows-Python as well:

    For example, if I do this in my Linux konsole (no difference whether it
    be in KDE Konsole window or the non-gui one via CTRL-ALT-F2):

    >>>aoumlautxyz="äöxyz" # ä = umlauted a, ö = umlauted o
    >>>print aoumlautxyz.upper()


    then the resulting string is NOT all upper case - it is a lowercase
    umlauted a, then a lowercase umlauted o then uppercase XYZ

    And command:

    >>> print aoumlautxyz.title()


    ...results in a string where a-umlaut, o-umlaut and yz are lowercase and
    only the Z in the middle is uppercase.

    And this:

    >>>print aoumlautxyz.lower()

    ...prints o.k.


    Am I missing something very basic here? Earlier there was a difference in
    my results between running the scripts in the CTRL-ALT-F2 konsole and the
    KDE one, but I think running unicode_start & installing a unicode console
    font at some point ironed that one out.

    If this is due to some strange library, could someone please give me a
    push to a spot where to read about fixing it? Or am I just too stupid,
    and that's it. (I bet that really is what it boils down to..)


    <rant>

    I cannot be the only (non-pro) person in Europe who might need to use non-
    ASCII characters in GUI titles / button labels, in strings provided by the
    users of the software with raw_input (like person's name that begins with
    an umlauted letter or includes one or several of them) ..in comments, and
    so on.

    How would you go about making a script where a) the user types in any text
    (that might or might not include umlauted characters) and b) that text then
    gets uppercased, lowercased or "titled" and c) printed?

    Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
    and then just like get the user's raw_input, mangle it about with .title()
    or some such tool, and then just spit it out with a print statement?

    One can hardly expect the users to type characters like
    unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc', or
    u"äöü".encode('utf-8') or
    whatnot, and encode & decode to and fro till the cows come home just to
    get a letter or two in their name to show up correctly.

    It's a shame that the Linux Cookbook, Learning Python 2nd ed, Absolute
    beginners guide to Python, Running Linux, Linux in a Nutshell, Suse 9.2
    Pro manuals and the online documentation I have bumped into with Google
    (like in unicode.org or python.org or even the Python Programming Faq
    1.3.9 / Unicode error) do not contain enough - or simple enough -
    information for a Python/Linux newbie to get 'it'.

    For what it's worth, in Kmail my encoding is ISO-8859-1. I tried that
    coding in my KDE and my Python scripts earlier too, but it was
    no better; actually that was why I started this Unicode sh.. ..thing.


    Am I beyond hope?

    </rant>


    Mike d
     
    Mike Dee, Feb 22, 2005
    #1

  2. Do Re Mi chel La Si Do, Feb 22, 2005
    #2

  3. Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

    Mike Dee wrote:
    > If I have this in the beginning of my Python script in Linux:
    >
    > #!/usr/bin/env python
    > # -*- coding: UTF-8 -*-
    >
    > should I - or should I not - be able to use non-ASCII characters
    > in strings and in Tk GUI button labels and GUI window titles and in
    > raw_input data without Python returning wrong case in manipulated
    > strings and/or gibberished characters in Tk GUI title?


    If you use byte strings, you should expect moji-bake. The coding
    declaration primarily affects Unicode literals, and has little
    effect on byte string literals. So try putting a "u" in front
    of each such string.

    > With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the
    > German / Swedish / Finnish / etc "umlauted" letter A (= a diaresis;
    > that is an 'A' with two dots above it, or an O with two dots above.)


    You explicitly requested that these characters are *not* ISO-8859-1,
    by saying that you want them as UTF-8. The LATIN CAPITAL LETTER A WITH
    DIAERESIS can be encoded in many different character sets, e.g.
    ISO-8859-15, windows-1252, UTF-8, UTF-16, euc-jp, T.101, ...

    In different encodings, different byte sequences are used to represent
    the same character. If you pass a byte string to Tk, it does not know
    which encoding you meant to use (this is known in the Python source,
    but lost on the way to Tk). So it guesses ISO-8859-1; this guess is
    wrong because it really is UTF-8 in your case.

    OTOH, if you use a Unicode string, it is very clear what internal
    representation each character has.
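
    The symptom Mike describes - an A-tilde followed by a currency symbol - falls
    straight out of this guessing game. A small illustrative sketch (mine, not from
    the original post; written in byte/Unicode syntax that also runs on modern Python):

```python
# One character, two different byte representations, and the garbage
# ("moji-bake") that results from a wrong guess.
ch = u'\xe4'                          # LATIN SMALL LETTER A WITH DIAERESIS (ä)

utf8_bytes = ch.encode('utf-8')      # two bytes: 0xC3 0xA4
latin1_bytes = ch.encode('latin-1')  # one byte:  0xE4

# Tk receives the two UTF-8 bytes but guesses ISO-8859-1:
wrong = utf8_bytes.decode('latin-1')
print(wrong)  # 'Ã¤' -- an A with a tilde, then the currency sign
```

    Decoding 0xC3 as ISO-8859-1 gives the A-tilde, and 0xA4 gives the
    currency sign: exactly the two wrong characters reported above.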

    > How would you go about making a script where a) the user types in any text
    > (that might or might not include umlauted characters) and b) that text then
    > gets uppercased, lowercased or "titled" and c) printed?


    Use Unicode.

    > Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
    > and then just like get the user's raw_input, mangle it about with .title()
    > or some such tool, and then just spit it out with a print statement?


    No.

    > One can hardly expect the users to type characters like unicode('\xc3\
    > xa4\xc3\xb6\xc3\xbc', 'utf-8')u'\xe4\xf6\xfc' u"äöü".encode('utf-8') or
    > whatnot, and encode & decode to and fro till the cows come home just to
    > get a letter or two in their name to show up correctly.


    This is not necessary.
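
    The whole round trip amounts to: decode the user's bytes once, do the case
    work on the Unicode object, encode once for output. A sketch (the helper name
    and the sample input are mine, not from the thread):

```python
def titlecase_bytes(raw, encoding='utf-8'):
    # Hypothetical helper: in Python 2 this is what you would do with the
    # bytes raw_input() returns; the terminal's actual encoding is usually
    # available as sys.stdin.encoding.
    text = raw.decode(encoding)           # bytes -> Unicode, exactly once
    return text.title().encode(encoding)  # Unicode -> bytes, exactly once

name = u'äiti östen'.encode('utf-8')      # stands in for the user's input
print(titlecase_bytes(name).decode('utf-8'))  # Äiti Östen
```

    No encoding-and-decoding "till the cows come home" - one decode on the
    way in, one encode on the way out.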

    > Am I beyond hope?


    Perhaps not. You should, however, familiarize yourself with the notion
    of character encodings, and how the same character can have different
    byte representations, and the same byte representation can have different
    interpretations as a character. If libraries disagree on how to
    interpret bytes as characters, you get moji-bake ("ghost characters";
    a Japanese term for the problem, as Japanese users have been familiar
    with it for a long time).

    The Python Unicode type solves these problems for good, but you
    need to use it correctly.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 22, 2005
    #3
  4. Mike Dee

    Serge Orlov Guest

    Mike Dee wrote:
    > [snip wrestling with byte strings]


    In addition to Martin's reply I just want to add two notes:
    1. The interactive console in Python 2.3 has a bug that was fixed
    in 2.4, so you can't enter unicode strings at the prompt:

    C:\Python24>python.exe
    >>> a=u'абв'
    >>> a

    u'\u0430\u0431\u0432'

    C:\Python23>python.exe
    >>> a=u'абв'
    >>> a

    u'\xa0\xa1\xa2'

    In 2.3 you need to use the decode method to get unicode strings:
    >>> import sys
    >>> a2='абв'.decode(sys.stdin.encoding)
    >>> a2

    u'\u0430\u0431\u0432'

    2. Suse ships a buggy build of Python, so title() doesn't work
    properly; see the discussion at http://tinyurl.com/4k3au

    >>> print aoumlautxyz.title()

    ÄÖXyz

    You will need to call setlocale to help you:

    >>> import locale
    >>> locale.setlocale(locale.LC_ALL,'')

    'en_US.utf-8'
    >>> print aoumlautxyz.title()

    Äöxyz

    Serge.
     
    Serge Orlov, Feb 22, 2005
    #4
  5. Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

    A long time ago, in a galaxy far, far away, Mike Dee wrote:
    > A very very basic UTF-8 question that's driving me nuts:
    >
    > If I have this in the beginning of my Python script in Linux:
    >
    > #!/usr/bin/env python
    > # -*- coding: UTF-8 -*-
    >
    > should I - or should I not - be able to use non-ASCII characters
    > in strings and in Tk GUI button labels and GUI window titles and in
    > raw_input data without Python returning wrong case in manipulated
    > strings and/or gibberished characters in Tk GUI title?

    ....

    I'd recommend reading "The Absolute Minimum Every Software Developer
    Absolutely, Positively Must Know About Unicode and Character Sets (No
    Excuses!)", by Joel Spolsky:
    - http://www.joelonsoftware.com/articles/Unicode.html

    It's not based on Python (nor any other language either...), but I find
    it *very* useful.

    Regards,

    --
    Mariano
     
    Mariano Draghi, Feb 22, 2005
    #5
  6. Mike Dee

    Fuzzyman Guest

    Mike Dee wrote:
    > A very very basic UTF-8 question that's driving me nuts:
    >
    > If I have this in the beginning of my Python script in Linux:
    >
    > #!/usr/bin/env python
    > # -*- coding: UTF-8 -*-
    >
    > should I - or should I not - be able to use non-ASCII characters
    > in strings and in Tk GUI button labels and GUI window titles and in
    > raw_input data without Python returning wrong case in manipulated
    > strings and/or gibberished characters in Tk GUI title?
    >
    >
    >

    [snip..]

    Yet another reply... :)

    My understanding is that the encoding declaration (as above) only
    applies to the source code - and will not make your string literals
    into unicode objects, nor set the default encoding for the interpreter.


    This will mean string literals in your source code will be encoded as
    UTF8 - if you handle them with normal string operations you might get
    funny results.
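
    The "funny results" are easy to see just by counting: a bare literal in a
    UTF-8 source file is a byte string, so byte-oriented operations see six bytes
    where the author meant three characters. A small sketch of my own (using an
    explicit encode so it runs the same on any Python):

```python
s = u'äöü'.encode('utf-8')  # what a bare 'äöü' literal is in a UTF-8 file (Python 2)

print(len(s))       # 6 -- six bytes
print(len(u'äöü'))  # 3 -- three characters

print(s[:1])        # half a character: the lone byte 0xC3, not 'ä'
```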


    Regards,

    Fuzzy
    http://www.voidspace.org.uk/python/index.shtml
     
    Fuzzyman, Feb 22, 2005
    #6
  7. Mike Dee

    Max M Guest

    Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

    Fuzzyman wrote:
    > Mike Dee wrote:


    >>#!/usr/bin/env python
    >># -*- coding: UTF-8 -*-


    > This will mean string literals in your source code will be encoded as
    > UTF8 - if you handle them with normal string operations you might get
    > funny results.


    It means that you don't have to explicitly set the encoding on strings.

    If your coding isn't set you must write:

    ust = 'æøå'.decode('utf-8')

    If it is set, you can just write:

    ust = u'æøå'

    And this string will automatically be utf-8 encoded:

    st = 'æøå'

    So you should be able to convert it to unicode without giving an encoding:

    ust = unicode(st)

    --

    hilsen/regards Max M, Denmark

    http://www.mxm.dk/
    IT's Mad Science
     
    Max M, Feb 22, 2005
    #7
  8. Mike Dee

    Fuzzyman Guest

    Max M wrote:
    > Fuzzyman wrote:
    > > Mike Dee wrote:
    >
    > >>#!/usr/bin/env python
    > >># -*- coding: UTF-8 -*-
    >
    > > This will mean string literals in your source code will be encoded as
    > > UTF8 - if you handle them with normal string operations you might get
    > > funny results.
    >
    > It means that you don't have to explicitly set the encoding on strings.
    >
    > If your coding isn't set you must write:
    >
    > ust = 'æøå'.decode('utf-8')


    Which is now deprecated, isn't it? (including encoded string literals
    in source without declaring an encoding).

    > If it is set, you can just write:
    >
    > ust = u'æøå'
    >
    > And this string will automatically be utf-8 encoded:
    >
    > st = 'æøå'
    >
    > So you should be able to convert it to unicode without giving an
    > encoding:
    >
    > ust = unicode(st)

    So all your non-unicode string literals will be utf-8 encoded. Normal
    string operations will handle them with the default encoding, which is
    likely to be something else. A likely source of confusion, unless you
    handle everything as unicode.

    But then I suppose if you have any non-ascii characters in your source
    code you *must* be explicit about what encoding they are in, or you are
    asking for trouble.

    Regards,


    Fuzzy
    http://www.voidspace.org.uk/python/index.shtml

    > --
    >
    > hilsen/regards Max M, Denmark
    >
    > http://www.mxm.dk/
    > IT's Mad Science
     
    Fuzzyman, Feb 22, 2005
    #8
  9. Mike Dee

    Duncan Booth Guest

    Max M wrote:

    > And this string will automatically be utf-8 encoded:
    >
    > st = 'æøå'
    >
    > So you should be able to convert it to unicode without giving an
    > encoding:
    >
    > ust = unicode(st)
    >

    No.

    Strings have no knowledge of their encoding. As you describe the string
    will be utf-8 encoded, but you still have to tell it that when you decode
    it.
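
    Duncan's point in a sketch (mine, not from the post): the bytes carry no label,
    so Python 2's unicode(st), which falls back to the default ASCII codec, fails,
    while an explicit decode succeeds:

```python
raw = u'æøå'.encode('utf-8')  # six anonymous bytes; nothing records "utf-8"

# Python 2's unicode(raw) would try the default ASCII codec; the
# equivalent explicit call fails the same way:
try:
    raw.decode('ascii')
except UnicodeDecodeError:
    print('bytes do not know their own encoding')

print(raw.decode('utf-8') == u'æøå')  # True -- once you *tell* it utf-8
```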
     
    Duncan Booth, Feb 22, 2005
    #9
  10. Mike Dee

    Paul Boddie Guest

    Mike Dee <> wrote in message news:<cve1vh$c37$>...
    > A very very basic UTF-8 question that's driving me nuts:
    >
    > If I have this in the beginning of my Python script in Linux:
    >
    > #!/usr/bin/env python
    > # -*- coding: UTF-8 -*-
    >
    > should I - or should I not - be able to use non-ASCII characters
    > in strings


    For string literals, with the "coding" declaration, Python will accept
    that the bytes sitting in your source file inside the string literal
    didn't get there by accident - ie. that you meant to put the bytes
    [0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5] into a string when you entered
    "æøå" in a UTF-8-enabled text editor. (Apologies if you don't see the
    three Scandinavian characters properly there in your preferred
    newsreader.)

    For Unicode literals (eg. u"æøå" in that same UTF-8-enabled text
    editor), Python will not only accept the above but also use the
    "coding" declaration to produce a Unicode object which unambiguously
    represents the sequence of characters - ie. something that can be
    used/processed to expose the intended characters in your program at
    run-time without any confusion about which characters are being
    represented.

    > and in Tk GUI button labels and GUI window titles and in
    > raw_input data without Python returning wrong case in manipulated
    > strings and/or gibberished characters in Tk GUI title?


    This is the challenging part. Having just experimented with using both
    string literals and Unicode literals with Tkinter on a Fedora Core 3
    system, with a program edited in a UTF-8 environment and with a
    ubiquitous UTF-8-based locale, it's actually possible to write
    non-ASCII characters into those literals and to get Tkinter to display
    them as I intended, but I've struck lucky with that particular
    combination - a not entirely unintended consequence of the Red Hat
    people going all out for UTF-8 everywhere (see below for more on
    that).

    Consider this snippet (with that "coding" declaration at the top):

    button1["text"] = "æøå"
    button2["text"] = u"æøå"

    In an environment with UTF-8-enabled editors, my program running in a
    UTF-8 locale, and with Tk supporting treating things as UTF-8 (I would
    imagine), I see what I intended. But what if I choose to edit my
    program in an editor employing a different encoding? Let's say I enter
    the program in an editor employing the mac-iceland encoding, even
    declaring it in the "coding" declaration at the top of the file.
    Running the program now yields a very strange label for the first
    button, but a correct label for the second one.

    What happens is that with a non-UTF-8 source file, running in a UTF-8
    locale with the same Tk as before, the text for the first button
    consists of a sequence of bytes that Tk then interprets incorrectly
    (probably as ISO-8859-1 as a sort of failsafe mode when it doesn't
    think the text is encoded using UTF-8), whereas the text for the
    second button is encoded from the unambiguous Unicode representation
    and is not misunderstood by Tk.
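
    The editor-encoding trap can be made concrete (my sketch; Python ships a
    mac_iceland codec): the same three characters become entirely different byte
    sequences under the two encodings, and only the UTF-8 bytes are something a
    UTF-8-expecting Tk can make sense of.

```python
text = u'æøå'

utf8 = text.encode('utf-8')       # the bytes listed below: C3 A6 C3 B8 C3 A5
mac = text.encode('mac_iceland')  # a different byte sequence entirely

print(utf8)         # b'\xc3\xa6\xc3\xb8\xc3\xa5'
print(mac != utf8)  # True -- same characters, incompatible byte strings
```

    A Unicode literal sidesteps this: whatever the source encoding, the
    literal decodes to the same character sequence before Tk ever sees it.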

    Now, one might argue that I should change the locale to suit the
    encoding of the text file, but it soon becomes very impractical to
    take this approach. Besides, I don't think mac-iceland (an admittedly
    bizarre example) forms part of a supported locale on the system I have
    access to.

    > With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the
    > German / Swedish / Finnish / etc "umlauted" letter A (= a diaresis;
    > that is an 'A' with two dots above it, or an O with two dots above.)
    >
    > In Linux in the Tk(?) GUI of my 'program' I get an uppercase "A"
    > with a tilde above - followed by a general currency symbol ['spider'].
    > That is, two wrong characters where a small umlauted letter "a"
    > should be.


    That sort of demonstrates that the bytes used to represent your
    character are produced by a UTF-8 encoding of that character. Sadly,
    Tk then chooses to interpret them as ISO-8859-1, I guess. One thing to
    verify is whether Tk is aware of anything other than ISO-8859-1 on
    your system; another thing is to use Unicode objects and literals to at
    least avoid the guessing games.

    > But in Windows XP exactly the *same* code (The initiating "#!/usr/bin
    > /env python" and all..) works just fine in the Tk GUI - non-ascii
    > characters showing just as they should. (The code in both cases is
    > without any u' prefixes in strings.)


    It's pretty much a rule with internationalised applications in Python
    that Unicode is the way to go, even if it seems hard at first. This
    means that you should use Unicode literals in your programs should the
    need arise - I can't say it does very often in my case.

    > I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I
    > have saved my 'source code' in UTF-8 format and I have tried to read
    > *a lot* of information about Unicode and I have heard it said many
    > times that Python handles unicode very well -- so why can it be so
    > bl**dy difficult to get an umlauted (two-dotted) letter a to be
    > properly handled by Python 2.3? In Windows I have Python 2.4 - but the
    > following case-insanity applies for Windows-Python as well:


    Python does handle Unicode pretty well, but it's getting the data in
    and out of Python, combined with other components and their means of
    presenting/representing that data which is usually the problem.

    > For example, if I do this in my Linux konsole (no difference whether it
    > be in KDE Konsole window or the non-gui one via CTRL-ALT-F2):
    >
    > >>>aoumlautxyz="äöxyz" # ä = umlauted a, ö = umlauted o
    > >>>print aoumlautxyz.upper()

    >
    > then the resulting string is NOT all upper case - it is a lowercase
    > umlauted a, then a lowercase umlauted o then uppercase XYZ


    In this case, you're using normal strings which are effectively
    locale-dependent byte strings. I guess that what happens is that the
    byte sequence gets passed to the system's string processing routines
    which then fail to convert the non-ASCII characters to upper case
    according to their understanding of what the bytes actually mean in
    terms of being characters. I'll accept that consoles can be pretty
    nasty in exposing the right encoding to Python (search for "setlocale"
    in the comp.lang.python archives) and that without some trickery you
    won't get decent results. However, by employing Unicode objects and
    explicitly identifying the encoding of the input, you should get the
    results you are looking for:

    aoumlautxyz=unicode("äöxyz", "utf-8")

    This assumes that UTF-8 really is the encoding of the input.
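
    That caveat can be handled by asking the terminal rather than hard-coding the
    encoding. A sketch of mine (with the encoding pinned to UTF-8 here so the
    snippet is self-contained):

```python
import sys

# In a real program you would use the terminal's own encoding,
# e.g. sys.stdin.encoding (which may be None when input is piped),
# rather than hard-coding utf-8 as done here for illustration.
encoding = 'utf-8'
raw = u'äöxyz'.encode(encoding)     # stands in for the bytes raw_input() returns
aoumlautxyz = raw.decode(encoding)  # unicode(raw, encoding) in Python 2
print(aoumlautxyz.upper())          # ÄÖXYZ -- the case change Mike wanted
```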

    [...]

    > One can hardly expect the users to type characters like
    > unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc', or
    > u"äöü".encode('utf-8') or
    > whatnot, and encode & decode to and fro till the cows come home just to
    > get a letter or two in their name to show up correctly.


    With a "coding" declaration and Unicode literals, you won't even need
    to use the unicode constructor. Again, I must confess that much of the
    data I work with doesn't originate in the program code itself, so I
    rarely need to even think of this issue.

    > It's a shame that the Linux Cookbook, Learning Python 2nd ed, Absolute
    > beginners guide to Python, Running Linux, Linux in a Nutshell, Suse 9.2
    > Pro manuals and the online documentation I have bumped into with Google
    > (like in unicode.org or python.org or even the Python Programming Faq
    > 1.3.9 / Unicode error) do not contain enough - or simple enough -
    > information for a Python/Linux newbie to get 'it'.


    One side-effect of the "big push" to UTF-8 amongst the Linux
    distribution vendors/maintainers is the evasion of issues such as
    filesystem encodings and "real" Unicode at the system level. In
    Python, when you have a Unicode object, you are dealing with idealised
    sequences of characters, whereas in many system and library APIs out
    there you either get back a sequence of anonymous bytes or a sequence
    of UTF-8 bytes that people are pretending is Unicode, right up until
    the point where someone recompiles the software to use UTF-16 instead,
    thus causing havoc. Anyone who has needed to expose filesystems
    created by Linux distributions before the UTF-8 "big push" to later
    distributions can attest to the fact that the "see no evil" brass
    monkey is wearing a T-shirt with "UTF-8" written on it.

    Paul
     
    Paul Boddie, Feb 22, 2005
    #10
  11. Mike Dee

    Serge Orlov Guest

    Paul Boddie wrote:
    > One side-effect of the "big push" to UTF-8 amongst the Linux
    > distribution vendors/maintainers is the evasion of issues such as
    > filesystem encodings and "real" Unicode at the system level. In
    > Python, when you have a Unicode object, you are dealing with
    > idealised
    > sequences of characters, whereas in many system and library APIs out
    > there you either get back a sequence of anonymous bytes or a sequence
    > of UTF-8 bytes that people are pretending is Unicode, right up until
    > the point where someone recompiles the software to use UTF-16
    > instead,
    > thus causing havoc. Anyone who has needed to expose filesystems
    > created by Linux distributions before the UTF-8 "big push" to later
    > distributions can attest to the fact that the "see no evil" brass
    > monkey is wearing a T-shirt with "UTF-8" written on it.


    Unfortunately the monkey is painted in the air with a stick, so
    not everyone can see it. Python can't. Given a random linux system
    how can you tell if the monkey has pushed it already or not?

    Serge.
     
    Serge Orlov, Feb 22, 2005
    #11
  12. Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux & Windows XP

    Fuzzyman wrote:
    >>ust = 'æøå'.decode('utf-8')
    >>

    >
    > Which is now deprecated, isn't it? (including encoded string literals
    > in source without declaring an encoding).


    Not having an encoding declaration while having non-ASCII characters
    in source code is deprecated.

    Having non-ASCII characters in string literals is not deprecated
    (assuming there is an encoding declaration in the source); trusting
    that the string literals are utf-8-encoded (and decoding them as
    such) is fine.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 23, 2005
    #12
  13. Mike Dee

    Paul Boddie Guest

    "Serge Orlov" <> wrote in message news:<>...
    > Paul Boddie wrote:
    > > Anyone who has needed to expose filesystems
    > > created by Linux distributions before the UTF-8 "big push" to later
    > > distributions can attest to the fact that the "see no evil" brass
    > > monkey is wearing a T-shirt with "UTF-8" written on it.

    >
    > Unfortunately the monkey is painted in the air with a stick, so
    > not everyone can see it. Python can't. Given a random linux system
    > how can you tell if the monkey has pushed it already or not?


    That's a good question. See this article for an example of the
    frustration caused:

    http://groups.google.no/groups?selm=b1npav$cci$&output=gplain

    Paul
     
    Paul Boddie, Feb 23, 2005
    #13