strxfrm works with unicode string ?

N

nicolas.riesch

I am trying to use strxfm with unicode strings, but it does not work.
This is what I did:

Traceback (most recent call last):
File "<pyshell#20>", line 1, in -toplevel-
locale.strxfrm(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)
Someone sees what I did wrong ?
 
G

Gerald Klix

How about:

import locale
s=u'\u00e9'
print s


locale.setlocale(locale.LC_ALL, '')


locale.strxfrm( s.encode( "latin-1" ) )
 
N

nicolas.riesch

Gruëzi, Gerald ;-)

Well, ok, but I don't understand why I should first convert a pure
unicode string into a byte string.
The encoding ( here, latin-1) seems an arbitrary choice.

Your solution works, but is it a workaround or the real way to use
strxfrm ?
It seems a little artificial to me, but perhaps I haven't understood
something ...

Does this mean that you cannot pass a unicode string to strxfrm ?

Bonne journée !
 
G

Gerald Klix

Sali Nicolas :)),
please see below for my answers.

Gruëzi, Gerald ;-)

Well, ok, but I don't understand why I should first convert a pure
unicode string into a byte string.
The encoding ( here, latin-1) seems an arbitrary choice.
Well "latin-1" is only encoding, about which I know that it works on
my xterm and which I can type without spelling errors :)
Your solution works, but is it a workaround or the real way to use
strxfrm ?
It seems a little artificial to me, but perhaps I haven't understood
something ...
In Python 2.3.4 I had some strange encounters with the locale module,
In the end I considered it broken, at least when it came to currency
formating.
Does this mean that you cannot pass a unicode string to strxfrm ?
This works here for my home-grown python 2.4 on Jurrasic Debian Woody:

import locale
s=u'\u00e9'
print s

print locale.setlocale(locale.LC_ALL, '')
print repr( locale.strxfrm( s.encode( "latin-1" ) ) )
print repr( locale.strxfrm( s.encode( "utf-8" ) ) )

The output is rather strange:

é
de_DE
"\x10\x01\x05\x01\x02\x01'@/locale"
"\x0c\x01\x0c\x01\x04\x01'@/locale"

Another (not so) weird thing happens when I unset LANG.

bear@special:~ > unset LANG
bear@special:~ > python2.4 ttt.py
Traceback (most recent call last):
File "ttt.py", line 3, in ?
print s
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 0: ordinal not in range(128)

Acually it's more weird, that printing works with LANG=de_DE.

Back to your question. A quick glance at the C-sources of the
_localemodule.c reveals:

if (!PyArg_ParseTuple(args, "s:strxfrm", &s))

So yes, strxfrm does not accept unicode!

I am inclined to consider this a bug.
A least it is not consistent with strcoll.
Strcoll accepts either 2 strings or 2 unicode strings,
at least when HAVE_WCSCOLL was defined when python
was compiled on your plattform.

BTW: Which platform do you use?

HTH,
Gerald

PS: If you have access to irc, you can also ask at
irc://irc.freenode.net#python.de.
 
M

Magnus Lycka

Gruëzi, Gerald ;-)

Well, ok, but I don't understand why I should first convert a pure
unicode string into a byte string.
The encoding ( here, latin-1) seems an arbitrary choice.

Yes. The correct choice would be 'cp1252', not 'latin-1',
since that's what your locale setting indicates.

It seems to me that Python is on a journey from the ASCII
world to the Unicode world, and it will take a few more
versions before it gets there. Going from 2.2 to 2.3 was
a bumpy part of the ride, and it's still not smooth.

Just try to use raw_input with national characters. As far
as I remember it hasn't worked (on windows at least) since
2.2.

The clear improvement from 2.3 is that if you print unicode
strings to stdout, they will look correct both in the GUI
and in text mode (cmd.exe). That never worked before since
Windows use different code pages in Windows and in the text
mode (which is supposed to be DOS compatible).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

Unicode 2
SMTPHandler and Unicode 13
Short confusing example with unicode, print, and __str__ 0
Thinking Unicode 0
Ascii to Unicode. 4
Unicode error 19
helping with unicode 4
Right solution to unicode error? 21

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top