Undeterministic strxfrm?

Tuomas · Sep 4, 2007

Python 2.4.3 (#3, Jun 4 2006, 09:19:30)
[GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> def key(s):
...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
...     return locale.strxfrm(s.encode('utf8'))
...
>>> first=key(u'maupassant guy')
>>> first==key(u'maupassant guy')
False
>>> first
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
>>> key(u'maupassant guy')
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'

Maybe this is enough for a sort order, but I need to be able to detect
equal strings too. Any hints or explanations?

Gabriel Genellina · Sep 4, 2007

I can't use your same locale, but with my own locale settings, I get
consistent results:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
'Spanish_Argentina.1252'
py> def key(s):
...     return locale.strxfrm(s.encode('utf8'))
...
py> first=key(u'maupassant guy')
py> print repr(first)
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print repr(key(u'maupassant guy'))
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print first==key(u'maupassant guy')
True

Same thing with Python 2.4.4

Tuomas · Sep 4, 2007

Gabriel said:
On Tue, 04 Sep 2007 07:34:54 -0300, Tuomas wrote:

Python 2.4.3 (#3, Jun 4 2006, 09:19:30)
[GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> def key(s):
...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
...     return locale.strxfrm(s.encode('utf8'))
...
>>> first=key(u'maupassant guy')
>>> first==key(u'maupassant guy')
False
>>> first
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
>>> key(u'maupassant guy')
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12 $\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'

Maybe this is enough for a sort order, but I need to be able to detect
equal strings too. Any hints or explanations?

I can't use your same locale, but with my own locale settings, I get
consistent results:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
'Spanish_Argentina.1252'
py> def key(s):
...     return locale.strxfrm(s.encode('utf8'))
...

Because I am writing a multi-language application, I need to place the
locale setting inside the key function. I am actually implementing a
binary search in a locale-sorted list of strings, and I need to be able to
count on stable strxfrm results even if the program has switched to another
locale in the meantime. Could repeated calls to setlocale cause problems?
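
The approach described above (set the locale inside the key function, then binary-search a list sorted by strxfrm keys) could be sketched roughly as follows. This is only an illustration written for Python 3, not the poster's actual code: the helper names make_key, build_index and find are hypothetical, and it assumes a Python build in which locale.strxfrm returns the same key for the same input. Note that setlocale changes process-wide state and is generally not thread-safe.

```python
import bisect
import locale


def make_key(locale_name):
    """Return a key function that collates under `locale_name` (hypothetical helper)."""
    def key(s):
        # setlocale changes process-wide state, so set it on every call in
        # case other code switched to a different locale in the meantime.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return locale.strxfrm(s)
    return key


def build_index(strings, key):
    """Sort `strings` by collation key; return parallel lists of keys and strings."""
    pairs = sorted((key(s), s) for s in strings)
    return [k for k, _ in pairs], [s for _, s in pairs]


def find(keys, values, needle, key):
    """Binary-search for `needle`; return the stored string or None."""
    k = key(needle)
    i = bisect.bisect_left(keys, k)
    # A match is detected by equality of the transformed keys.
    if i < len(keys) and keys[i] == k:
        return values[i]
    return None


# Usage with the portable "C" locale; substitute e.g. 'fi_FI.utf8' if installed.
key = make_key("C")
keys, values = build_index(["zola emile", "maupassant guy", "hugo victor"], key)
print(find(keys, values, "maupassant guy", key))
print(find(keys, values, "balzac honore", key))
```

Storing the transformed keys once means only the search string has to be transformed per lookup, which is the main reason to prefer strxfrm over strcoll here.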

Gabriel said:
py> first=key(u'maupassant guy')
py> print repr(first)
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print repr(key(u'maupassant guy'))
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print first==key(u'maupassant guy')
True

Same thing with Python 2.4.4

I get the same instability with my locale 'fi_FI.utf8' too, so I am
wondering whether the source of the problem is the C library or the Python
wrapper around it. The differences in strxfrm results for identical input
are always in the last few bytes of the result.

Peter Otten · Sep 4, 2007

On Tue, 04 Sep 2007 19:54:57 +0000, Tuomas wrote:

I get the same instability with my locale 'fi_FI.utf8' too, so I am
wondering whether the source of the problem is the C library or the Python
wrapper around it. The differences in strxfrm results for identical input
are always in the last few bytes of the result.

Python seems to be the culprit as there is a relatively recent
strxfrm-related bugfix, see

http://svn.python.org/view/python/trunk/Modules/_localemodule.c?rev=54669

If I understand it correctly the error makes it likely that the resulting
string has trailing garbage characters.

Peter

Tuomas · Sep 5, 2007

Peter said:
Python seems to be the culprit as there is a relatively recent
strxfrm-related bugfix, see

http://svn.python.org/view/python/trunk/Modules/_localemodule.c?rev=54669

Thanks Peter. Can't find it, do you have the issue number?

Peter said:
If I understand it correctly the error makes it likely that the resulting
string has trailing garbage characters.

Reading rev 54669, it seems to me that the bug is not fixed. The man page
says:

STRXFRM(3): ... size_t strxfrm(char *dest, const char *src, size_t n);
... The first n characters of the transformed string
are placed in dest. The transformation is based on the program’s
current locale for category LC_COLLATE.
... The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating ‘\0’
character. If the value returned is n or more, the contents of dest are
*indeterminate*.

According to the man page, Python should know the size of the result it
expects and not trust the size strxfrm returns. I don't completely
understand the collation algorithm, but it should offer different levels
of collation, so Python too should offer those levels as a second
parameter. However, strxfrm doesn't take any more parameters either,
although there is another function, strcasecmp. So Python should be able to
calculate the expected size before calling strxfrm or strcasecmp. I
don't know how that is possible. Maybe strcoll knows better and I should
drop strxfrm and use strcoll instead. That costs converting the search key
at every step of the search.
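
For illustration, the strcoll alternative mentioned above might look something like the following Python 3 sketch. It is an assumption-laden example rather than code from the thread: strcoll_search is a hypothetical name, and the "C" locale is used only so the example runs anywhere. Each probe calls locale.strcoll on the untransformed strings, which is the per-step cost referred to above.

```python
import functools
import locale


def strcoll_search(sorted_items, needle):
    """Binary search using locale.strcoll; return the index of a match or -1."""
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        # strcoll returns <0, 0 or >0, like C's strcoll.
        c = locale.strcoll(sorted_items[mid], needle)
        if c == 0:
            return mid
        if c < 0:
            lo = mid + 1
        else:
            hi = mid
    return -1


# The list must be sorted with the same collation that the search uses.
locale.setlocale(locale.LC_COLLATE, "C")
names = sorted(["zola emile", "maupassant guy", "hugo victor"],
               key=functools.cmp_to_key(locale.strcoll))
print(strcoll_search(names, "maupassant guy"))
```

Because no transformed keys are stored, this variant does not depend on strxfrm output being byte-for-byte reproducible.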

Tuomas

Gabriel Genellina · Sep 5, 2007

Tuomas said:
Thanks Peter. Can't find it, do you have the issue number?

I think it's not in the issue tracker - see
http://xforce.iss.net/xforce/xfdb/34060
The fix is already in 2.5.1
http://www.python.org/download/releases/2.5.1/NEWS.txt

Tuomas said:
Reading rev 54669, it seems to me that the bug is not fixed. The man page
says:

STRXFRM(3): ... size_t strxfrm(char *dest, const char *src, size_t n);
... The first n characters of the transformed string
are placed in dest. The transformation is based on the program’s
current locale for category LC_COLLATE.
... The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating ‘\0’
character. If the value returned is n or more, the contents of dest are
*indeterminate*.

According to the man page, Python should know the size of the result it
expects and not trust the size strxfrm returns. I don't completely
understand the collation algorithm, but it should offer different levels
of collation, so Python too should offer those levels as a second
parameter. However, strxfrm doesn't take any more parameters either,
although there is another function, strcasecmp. So Python should be able to
calculate the expected size before calling strxfrm or strcasecmp. I
don't know how that is possible. Maybe strcoll knows better and I should
drop strxfrm and use strcoll instead. That costs converting the search key
at every step of the search.

No. That's why strxfrm is called twice: the first call returns the required
buffer size, the buffer is resized, and strxfrm is called again. That's a
rather common sequence when buffer sizes are not known in advance.
[Note that it is `dest` that is indeterminate, NOT the function's return
value, which is always the required buffer size.]
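
As a rough illustration of that two-call sequence (this is not the code in _localemodule.c), here is a Python 3 sketch that calls the C library's strxfrm through ctypes. It assumes a POSIX system where ctypes can load the C library; the name strxfrm_two_pass is made up for this example.

```python
import ctypes
import ctypes.util
import locale

# Load the C library. If find_library returns None, CDLL(None) falls back to
# the symbols already loaded into the process (assumption: POSIX platform).
_libc = ctypes.CDLL(ctypes.util.find_library("c"))
_libc.strxfrm.restype = ctypes.c_size_t
_libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]


def strxfrm_two_pass(src):
    """Transform the bytes `src` using the two-call strxfrm pattern."""
    # First call: n == 0, so dest may be NULL; only the return value is used.
    needed = _libc.strxfrm(None, src, 0)
    # Second call: the buffer holds `needed` bytes plus the terminating NUL.
    buf = ctypes.create_string_buffer(needed + 1)
    written = _libc.strxfrm(buf, src, needed + 1)
    # Keep exactly `written` bytes so nothing past the result is returned.
    return buf.raw[:written]


# Usage under the "C" locale; other locales give locale-specific key bytes.
locale.setlocale(locale.LC_COLLATE, "C")
print(repr(strxfrm_two_pass(b"maupassant guy")))
```

The important detail is that the return value excludes the terminating NUL, so the buffer needs one extra byte and the result must be cut at the reported length.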

Tuomas · Sep 5, 2007

Gabriel said:
I think it's not in the issue tracker - see
http://xforce.iss.net/xforce/xfdb/34060
The fix is already in 2.5.1
http://www.python.org/download/releases/2.5.1/NEWS.txt

Thanks Gabriel, I'll try Python 2.5.1.

Gabriel said:
No. That's why strxfrm is called twice: the first call returns the
required buffer size, the buffer is resized, and strxfrm is called
again. That's a rather common sequence when buffer sizes are not known
in advance.
[Note that it is `dest` that is indeterminate, NOT the function's return
value, which is always the required buffer size.]

OK, I drew conclusions too quickly from the man page text without knowing
the details.

Tuomas
