Does strxfrm work with unicode strings?

Discussion in 'Python' started by nicolas.riesch@genevoise.ch, Jun 17, 2005.

  1. Guest

    I am trying to use strxfrm with unicode strings, but it does not work.
    This is what I did:

    >>> import locale
    >>> s=u'\u00e9'
    >>> print s
    é
    >>> locale.setlocale(locale.LC_ALL, '')
    'French_Switzerland.1252'
    >>> locale.strxfrm(s)

    Traceback (most recent call last):
      File "<pyshell#20>", line 1, in -toplevel-
        locale.strxfrm(s)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
    position 0: ordinal not in range(128)
    >>>


    Does anyone see what I did wrong?
    , Jun 17, 2005
    #1

  2. Gerald Klix Guest

    How about:

    import locale
    s=u'\u00e9'
    print s

    locale.setlocale(locale.LC_ALL, '')

    locale.strxfrm( s.encode( "latin-1" ) )

    ---
    HTH,
    Gerald

    wrote:
    > I am trying to use strxfrm with unicode strings, but it does not work.
    > This is what I did:
    >
    > >>> import locale
    > >>> s=u'\u00e9'
    > >>> print s
    > é
    > >>> locale.setlocale(locale.LC_ALL, '')
    > 'French_Switzerland.1252'
    > >>> locale.strxfrm(s)
    >
    > Traceback (most recent call last):
    > File "<pyshell#20>", line 1, in -toplevel-
    > locale.strxfrm(s)
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
    > position 0: ordinal not in range(128)
    >
    > Does anyone see what I did wrong?


    --
    GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634
    Gerald Klix, Jun 17, 2005
    #2

  3. Guest

    Gruëzi, Gerald ;-)

    Well, OK, but I don't understand why I should first convert a pure
    unicode string into a byte string.
    The encoding (here, latin-1) seems an arbitrary choice.

    Your solution works, but is it a workaround or the real way to use
    strxfrm?
    It seems a little artificial to me, but perhaps I haven't understood
    something ...

    Does this mean that you cannot pass a unicode string to strxfrm?

    Bonne journée !
    , Jun 17, 2005
    #3
  4. Gerald Klix Guest

    Sali Nicolas :)),
    please see below for my answers.

    wrote:
    > Gruëzi, Gerald ;-)
    >
    > Well, ok, but I don't understand why I should first convert a pure
    > unicode string into a byte string.
    > The encoding ( here, latin-1) seems an arbitrary choice.

    Well, "latin-1" is simply the only encoding that I know works on
    my xterm and that I can type without spelling errors :)
    >
    > Your solution works, but is it a workaround or the real way to use
    > strxfrm ?
    > It seems a little artificial to me, but perhaps I haven't understood
    > something ...

    In Python 2.3.4 I had some strange encounters with the locale module.
    In the end I considered it broken, at least when it came to currency
    formatting.
    >
    > Does this mean that you cannot pass a unicode string to strxfrm ?

    This works here for my home-grown Python 2.4 on Jurassic Debian Woody:

    import locale
    s=u'\u00e9'
    print s

    print locale.setlocale(locale.LC_ALL, '')
    print repr( locale.strxfrm( s.encode( "latin-1" ) ) )
    print repr( locale.strxfrm( s.encode( "utf-8" ) ) )

    The output is rather strange:

    é
    de_DE
    "\x10\x01\x05\x01\x02\x01'@/locale"
    "\x0c\x01\x0c\x01\x04\x01'@/locale"

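    One possible reading of the two different results (an assumption on
    my part, since I have not checked which character set de_DE uses on
    this machine): if the locale is a Latin-1 one, the two UTF-8 bytes
    for é are treated as two separate characters. A small sketch of just
    the encoding step, without calling strxfrm at all:

```python
# Sketch only: looks at the bytes handed to strxfrm, not at strxfrm itself.
s = u'\u00e9'

latin1_bytes = s.encode('latin-1')   # a single byte
utf8_bytes = s.encode('utf-8')       # two bytes

# Under a Latin-1 locale (assumed here), the C library would read the
# UTF-8 bytes as two characters rather than as one e-acute.
misread = utf8_bytes.decode('latin-1')

print(repr(latin1_bytes), repr(utf8_bytes), len(misread))
```
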
    Another (not so) weird thing happens when I unset LANG.

    bear@special:~ > unset LANG
    bear@special:~ > python2.4 ttt.py
    Traceback (most recent call last):
    File "ttt.py", line 3, in ?
    print s
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
    position 0: ordinal not in range(128)

    Actually, it is weirder that printing works with LANG=de_DE.
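
    As far as I can tell, the traceback above is consistent with print
    encoding the unicode string for the terminal and falling back to
    ASCII when no locale is set. A minimal sketch of that check;
    can_encode is a hypothetical helper name, not part of the standard
    library:

```python
import sys

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

s = u'\u00e9'
# The encoding may be None when output does not go to a terminal.
print(getattr(sys.stdout, 'encoding', None))
print(can_encode(s, 'ascii'))    # False: e-acute is outside ASCII
print(can_encode(s, 'latin-1'))  # True
```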

    Back to your question. A quick glance at the C source of
    _localemodule.c reveals:

    if (!PyArg_ParseTuple(args, "s:strxfrm", &s))

    So yes, strxfrm does not accept unicode!

    I am inclined to consider this a bug.
    At least it is not consistent with strcoll.
    strcoll accepts either 2 strings or 2 unicode strings,
    at least when HAVE_WCSCOLL was defined when Python
    was compiled on your platform.
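
    Until strxfrm accepts unicode, one way to hide the encoding step is
    a small wrapper. This is only a sketch: strxfrm_text and
    locale_sorted are hypothetical names, it assumes the locale's
    preferred encoding can represent every character in the string, and
    the second branch assumes a Python version where strxfrm takes text
    directly:

```python
import locale
import sys

def strxfrm_text(text):
    """Return a collation key for a unicode string."""
    if sys.version_info[0] < 3:
        # Python 2: strxfrm wants a byte string, so encode with the
        # encoding of the current locale instead of a hard-coded one.
        return locale.strxfrm(text.encode(locale.getpreferredencoding()))
    # Python 3 (assumption): strxfrm accepts text as-is.
    return locale.strxfrm(text)

def locale_sorted(words):
    """Sort unicode strings by the current LC_COLLATE rules."""
    return sorted(words, key=strxfrm_text)

# Call locale.setlocale(locale.LC_ALL, '') first to get the user's rules.
print(locale_sorted([u'b', u'a', u'c']))
```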

    BTW: Which platform do you use?

    HTH,
    Gerald

    PS: If you have access to irc, you can also ask at
    irc://irc.freenode.net#python.de.



    --
    GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634
    Gerald Klix, Jun 17, 2005
    #4
  5. Magnus Lycka Guest

    wrote:
    > Gruëzi, Gerald ;-)
    >
    > Well, ok, but I don't understand why I should first convert a pure
    > unicode string into a byte string.
    > The encoding ( here, latin-1) seems an arbitrary choice.


    Yes. The correct choice would be 'cp1252', not 'latin-1',
    since that's what your locale setting indicates.
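
    For é the two encodings happen to give the same byte, which is why
    latin-1 appeared to work. They differ for other characters, such as
    the euro sign. A small sketch of the difference; asking the locale
    module for the encoding (locale.getpreferredencoding) is a
    suggestion on my part, not something tested on your machine:

```python
import locale

e_acute = u'\u00e9'
euro = u'\u20ac'

# Same byte in both code pages, so latin-1 works by accident here.
print(e_acute.encode('latin-1') == e_acute.encode('cp1252'))  # True

# cp1252 has a euro sign at 0x80; latin-1 has no euro sign at all.
print(repr(euro.encode('cp1252')))
try:
    euro.encode('latin-1')
except UnicodeEncodeError:
    print('latin-1 cannot encode the euro sign')

# The encoding the locale itself reports, instead of a hard-coded name.
print(locale.getpreferredencoding())
```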

    It seems to me that Python is on a journey from the ASCII
    world to the Unicode world, and it will take a few more
    versions before it gets there. Going from 2.2 to 2.3 was
    a bumpy part of the ride, and it's still not smooth.

    Just try to use raw_input with national characters. As far
    as I remember, it hasn't worked (on Windows at least) since
    2.2.

    The clear improvement as of 2.3 is that if you print unicode
    strings to stdout, they will look correct both in the GUI
    and in text mode (cmd.exe). That never worked before, since
    Windows uses different code pages in the GUI and in text
    mode (which is supposed to be DOS compatible).
    Magnus Lycka, Jun 21, 2005
    #5
