Undeterministic strxfrm?


Tuomas

Python 2.4.3 (#3, Jun 4 2006, 09:19:30)
[GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> def key(s):
...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
...     return locale.strxfrm(s.encode('utf8'))
...
>>> first=key(u'maupassant guy')
>>> first==key(u'maupassant guy')
False
>>> first
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
$\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
>>> key(u'maupassant guy')
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
$\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'

Maybe this is enough for a sort order, but I need to be able to detect
equal strings too. Any hints/explanations?
 

Gabriel Genellina


I can't use your same locale, but with my own locale settings, I get
consistent results:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
'Spanish_Argentina.1252'
py> def key(s):
...     return locale.strxfrm(s.encode('utf8'))
...
py> first=key(u'maupassant guy')
py> print repr(first)
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02
\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print repr(key(u'maupassant guy'))
'\x0eQ\x0e\x02\x0e\x9f\x0e~\x0e\x02\x0e\x91\x0e\x91\x0e\x02\x0ep\x0e\x99\x07\x02
\x0e%\x0e\x9f\x0e\xa7\x01\x01\x01\x01'
py> print first==key(u'maupassant guy')
True

Same thing with Python 2.4.4
 

Tuomas

Gabriel said:
On Tue, 04 Sep 2007 07:34:54 -0300, Tuomas wrote:
Python 2.4.3 (#3, Jun 4 2006, 09:19:30)
[GCC 4.0.0 20050519 (Red Hat 4.0.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> def key(s):
...     locale.setlocale(locale.LC_COLLATE, 'en_US.utf8')
...     return locale.strxfrm(s.encode('utf8'))
...
>>> first=key(u'maupassant guy')
>>> first==key(u'maupassant guy')
False
>>> first
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
$\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xf5\xb79'
>>> key(u'maupassant guy')
'\x18\x0c \x1b\x0c\x1e\x1e\x0c\x19\x1f\x12
$\x01\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x01\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x02\x01\xb5'

Maybe this is enough for a sort order, but I need to be able to detect
equal strings too. Any hints/explanations?


I can't use your same locale, but with my own locale settings, I get
consistent results:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_COLLATE, 'Spanish_Argentina')
'Spanish_Argentina.1252'
py> def key(s):
...     return locale.strxfrm(s.encode('utf8'))
...

Because I am writing a multi-language application I need to place the
locale setting inside the key function. Actually I am implementing a
binary search in a locale-sorted list of strings and should be able to
count on stable results from strxfrm despite possibly visiting another
locale in the meantime. Could repeated calls to setlocale cause some problems?
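
The lookup described here can be sketched as follows. This is only an illustration under stated assumptions: it targets Python 3 (where `locale.strxfrm` takes `str` rather than encoded bytes), it relies on whatever `LC_COLLATE` is currently active instead of calling `setlocale` inside the key function, and `build_index` and `contains` are hypothetical helper names, not anything from the thread. Keys produced under one locale are only comparable with keys produced under the same locale, so the locale must not change between building the index and searching it.

```python
import bisect
import locale


def build_index(words):
    """Return (keys, values), both ordered by the current locale's collation keys."""
    pairs = sorted((locale.strxfrm(w), w) for w in words)
    return [k for k, _ in pairs], [w for _, w in pairs]


def contains(keys, word):
    """Binary-search the precomputed keys for the collation key of `word`."""
    k = locale.strxfrm(word)
    i = bisect.bisect_left(keys, k)
    # Equality of keys is what "catching equals" relies on, so the
    # transform has to be deterministic for a given locale.
    return i < len(keys) and keys[i] == k


keys, values = build_index(["zola emile", "maupassant guy", "hugo victor"])
print(contains(keys, "maupassant guy"))
print(contains(keys, "verne jules"))
```

The Python documentation for the `locale` module notes that `setlocale` affects the whole process and is not thread-safe, which is one reason to prefer setting it once per index rather than on every key computation.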

I get the same instability with my locale 'fi_FI.utf8' too, so I am
wondering if the source of the problem is the C library or the Python wrapper
around it. Differences in strxfrm results for identical source are
always in the last few bytes of the results.
 

Peter Otten

On Tue, 04 Sep 2007 19:54:57 +0000, Tuomas wrote:
I get the same instability with my locale 'fi_FI.utf8' too, so I am
wondering if the source of the problem is the C library or the Python wrapper
around it. Differences in strxfrm results for identical source are
always in the last few bytes of the results.

Python seems to be the culprit as there is a relatively recent
strxfrm-related bugfix, see

http://svn.python.org/view/python/trunk/Modules/_localemodule.c?rev=54669

If I understand it correctly the error makes it likely that the resulting
string has trailing garbage characters.

Peter
 

Tuomas

Peter said:
Python seems to be the culprit as there is a relatively recent
strxfrm-related bugfix, see

http://svn.python.org/view/python/trunk/Modules/_localemodule.c?rev=54669

Thanks Peter. Can't find it, do you have the issue number?

Peter said:
If I understand it correctly the error makes it likely that the resulting
string has trailing garbage characters.

Reading the rev 54669 it seems to me that the bug is not fixed. Man says:

STRXFRM(3): ... size_t strxfrm(char *dest, const char *src, size_t n);
... The first n characters of the transformed string
are placed in dest. The transformation is based on the program’s
current locale for category LC_COLLATE.
... The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating ‘\0’
character. If the value returned is n or more, the contents of dest are
*indeterminate*.

According to the man pages, Python should know the size of the result it
expects and not trust the size strxfrm returns. I don't completely
understand the collation algorithm, but it should offer different levels
of collation. So Python, too, should offer those levels as a second
parameter. However, strxfrm doesn't offer more parameters either, except
that there is another function, strcasecmp. So Python should be able to
calculate the expected size before calling strxfrm or strcasecmp. I
don't know how that is possible. Maybe strcoll knows better and I should
drop strxfrm and use strcoll instead. That costs converting the search key
in every step of the search.
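
The strcoll alternative mentioned above might look like the sketch below. It assumes Python 3, uses the currently active `LC_COLLATE`, and `find` is a hypothetical helper name. No transformed keys are stored; instead every probe of the binary search hands both untransformed strings to `locale.strcoll`, which is the per-step cost referred to above.

```python
import functools
import locale


def find(sorted_words, target):
    """Return the index of an entry that collates equal to `target`, or -1."""
    lo, hi = 0, len(sorted_words)
    while lo < hi:
        mid = (lo + hi) // 2
        c = locale.strcoll(sorted_words[mid], target)
        if c == 0:
            return mid
        if c < 0:
            lo = mid + 1
        else:
            hi = mid
    return -1


# The list must be sorted with the same comparison that the search uses.
words = sorted(["zola emile", "maupassant guy", "hugo victor"],
               key=functools.cmp_to_key(locale.strcoll))
print(find(words, "maupassant guy"))
print(find(words, "verne jules"))
```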

Tuomas
 

Gabriel Genellina

Tuomas said:
Thanks Peter. Can't find it, do you have the issue number?

I think it's not in the issue tracker - see
http://xforce.iss.net/xforce/xfdb/34060
The fix is already in 2.5.1
http://www.python.org/download/releases/2.5.1/NEWS.txt

Tuomas said:
Reading the rev 54669 it seems to me that the bug is not fixed. Man
says:

STRXFRM(3): ... size_t strxfrm(char *dest, const char *src, size_t n);
... The first n characters of the transformed string
are placed in dest. The transformation is based on the program’s
current locale for category LC_COLLATE.
... The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating ‘\0’
character. If the value returned is n or more, the contents of dest are
*indeterminate*.

According to the man pages, Python should know the size of the result it
expects and not trust the size strxfrm returns.

No. That's why strxfrm is called twice: the first call returns the required
buffer size, the buffer is resized, and strxfrm is called again. That's a
rather common sequence when buffer sizes are not known in advance.
[Note that it is `dest` that is indeterminate, NOT the function's return
value, which is always the required buffer size]
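
The two-call sequence can be tried out from Python through ctypes. This is a sketch of the pattern only, not the code in _localemodule.c, and it assumes a POSIX system whose C library ctypes can load; `xfrm` is a hypothetical helper name. The first call passes a size of 0 purely to learn the required length, and the second call supplies a buffer with room for that length plus the terminating NUL, so the returned value is always smaller than n and the contents of `dest` are well defined.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX C library. If find_library returns None, CDLL(None)
# opens the running process itself, where strxfrm is normally visible too.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strxfrm.restype = ctypes.c_size_t
libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]


def xfrm(src):
    """Transform the bytes `src` under the process's current LC_COLLATE."""
    # First call: n is 0, so nothing is written (dest may be NULL);
    # the return value is the length needed, excluding the NUL.
    needed = libc.strxfrm(None, src, 0)
    # Second call: the buffer holds the whole result plus the NUL.
    buf = ctypes.create_string_buffer(needed + 1)
    written = libc.strxfrm(buf, src, needed + 1)
    return buf.raw[:written]


print(xfrm(b"maupassant guy") == xfrm(b"maupassant guy"))
```

Slicing the buffer to the returned length is what keeps bytes beyond the transformed string out of the result.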
 

Tuomas

Gabriel said:
I think it's not in the issue tracker - see
http://xforce.iss.net/xforce/xfdb/34060
The fix is already in 2.5.1
http://www.python.org/download/releases/2.5.1/NEWS.txt

Thanks Gabriel, I'll try Python 2.5.1.

Gabriel said:
No. That's why strxfrm is called twice: the first call returns the
required buffer size, the buffer is resized, and strxfrm is called
again. That's a rather common sequence when buffer sizes are not known
in advance.
[Note that it is `dest` that is indeterminate, NOT the function's return
value, which is always the required buffer size]

OK, I drew my conclusions from the man text too quickly, without knowing
the details.

Tuomas
 
