sre is broken in SuSE 9.2

Denis S. Otkidach

On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:

Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
BTW, this character is correctly recognized as a lowercase letter: 'Ll'

I've looked through all SuSE patches applied, but found nothing related.
What is the reason for broken behavior? Incorrect configure options?
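
For readers who want to check their own interpreter, here is a minimal sketch of the two checks described above. It is written for a current Python 3, where str patterns are Unicode-aware by default, so it is not the original Python 2.3 session. The sample string is the one quoted later in this thread; the helper names `word_runs` and `categories` are made up for illustration, and using `unicodedata.category` assumes that is where the 'Ll' above came from.

```python
import re
import unicodedata

# Sample from later in this thread: micro sign, masculine ordinal,
# a-umlaut, Cyrillic small a. All four are letters in Unicode.
SAMPLE = '\xb5\xba\xe4\u0430'


def word_runs(text):
    """Return the runs of word characters found in text."""
    # re.UNICODE is redundant for str patterns on Python 3; it is passed
    # only to mirror the flag discussed in the post.
    return re.compile(r'\w+', re.UNICODE).findall(text)


def categories(text):
    """Return the Unicode general category of each character."""
    return [unicodedata.category(ch) for ch in text]


if __name__ == '__main__':
    print(word_runs(SAMPLE))
    print(categories(SAMPLE))
```

On a correctly working build, `word_runs(SAMPLE)` returns the whole sample as a single match. A build showing the problem from this thread would split or drop the non-ASCII letters.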
 
Serge Orlov

Denis said:
On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:

Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
BTW, this character is correctly recognized as a lowercase letter: 'Ll'

I've looked through all SuSE patches applied, but found nothing related.
What is the reason for broken behavior? Incorrect configure options?

I can get the same results on RedHat's Python 2.2.3 if I pass the re.L
option; it looks like this option is implicitly set on SuSE.

Serge
 
Denis S. Otkidach

On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:
[...]
I can get the same results on RedHat's python 2.2.3 if I pass re.L
option, it looks like this option is implicitly set in Suse.

Looks like you are right:
<_sre.SRE_Match object at 0x40375560>

But I see nothing related to an implicit re.L option in their patches, and
the sources themselves are the same as on other platforms. I'd prefer
to find the source of the problem.
 
Daniel Dittmar

Denis said:
On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:

I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
RedHat), check sys.maxunicode.

This is not an explanation, but perhaps a hint where to look.

Daniel
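
A minimal sketch of the check Daniel suggests. The function name `unicode_build_width` is made up for illustration. Note that this distinction only exists on older interpreters: since Python 3.3, `sys.maxunicode` is always 0x10FFFF, so the sketch will always report UCS4 there.

```python
import sys


def unicode_build_width():
    """Guess the internal Unicode width from sys.maxunicode."""
    # 0xFFFF indicated a narrow (UCS2) build on old interpreters,
    # 0x10FFFF a wide (UCS4) one.
    return 'UCS4' if sys.maxunicode > 0xFFFF else 'UCS2'


if __name__ == '__main__':
    print(hex(sys.maxunicode), unicode_build_width())
```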
 
Denis S. Otkidach

I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
RedHat), check sys.maxunicode.

This is not an explanation, but perhaps a hint where to look.

Yes, it uses UCS4. But the Debian build with UCS4 works fine, so this is
not the problem. Could the --with-wctype-functions configure option be the
source of the problem?
 
Fredrik Lundh

Denis said:
Yes, it uses UCS4. But debian build with UCS4 works fine, so this is
not a problem. Can --with-wctype-functions configure option be the
source of problem?

yes.

that option disables Python's own Unicode database, and relies on the C library's
wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
for all environments.

is this an official SuSE release? do they often release stuff that hasn't been tested
at all?

</F>
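
One way to see whether a given interpreter was configured with this option is to look at the recorded configure arguments. This is only a sketch: the helper name `built_with_wctype_functions` is hypothetical, the `sysconfig` module did not exist in Python 2.3 (there one would have to go through `distutils.sysconfig` instead), and `CONFIG_ARGS` is only recorded by Unix-style builds.

```python
import sysconfig


def built_with_wctype_functions(config_args=None):
    """Report whether the configure arguments mention --with-wctype-functions."""
    if config_args is None:
        # CONFIG_ARGS holds the arguments passed to ./configure on
        # Unix-style builds; it may be None on other platforms.
        config_args = sysconfig.get_config_var('CONFIG_ARGS') or ''
    return '--with-wctype-functions' in config_args


if __name__ == '__main__':
    print(built_with_wctype_functions())
```

The optional argument exists only so the check can be tried against a pasted configure line from another machine.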
 
Serge Orlov

Denis said:
On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:
[...]
I can get the same results on RedHat's python 2.2.3 if I pass re.L
option, it looks like this option is implicitly set in Suse.

Looks like you are right:
<_sre.SRE_Match object at 0x40375560>

But I see nothing related to implicit re.L option in their patches
and the sources themselves are the same as on other platforms. I'd
prefer to find the source of problem.

I found that

print u'\xc4'.isalpha()
import locale
print locale.getlocale()

produces different results on Suse (python 2.3.3)

False
(None, None)


and RedHat (python 2.2.3)

1
(None, None)

Serge.
 
Denis S. Otkidach

yes.

that option disables Python's own Unicode database, and relies on the C library's
wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
for all environments.

is this an official SuSE release? do they often release stuff that hasn't been tested
at all?

Yes, it's an official release:
# rpm -qi python
Name        : python                        Relocations: (not relocatable)
Version     : 2.3.4                         Vendor: SUSE LINUX AG, Nuernberg, Germany
Release     : 3                             Build Date: Tue Oct 5 02:28:25 2004
Install date: Fri Jan 28 13:53:49 2005      Build Host: gambey.suse.de
Group       : Development/Languages/Python  Source RPM: python-2.3.4-3.src.rpm
Size        : 15108594                      License: Artistic License, Other License(s), see package
Signature   : DSA/SHA1, Tue Oct 5 02:42:38 2004, Key ID a84edae89c800aca
Packager    : http://www.suse.de/feedback
URL         : http://www.python.org/
Summary     : Python Interpreter
<snip>

BTW, where have they found something with Artistic License in Python?
 
Serge Orlov

Denis said:
On all platforms \w matches all unicode letters when used with flag
re.UNICODE, but this doesn't work on SuSE 9.2:

Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
BTW, this character is correctly recognized as a lowercase letter: 'Ll'

I've looked through all SuSE patches applied, but found nothing
related. What is the reason for broken behavior? Incorrect
configure options?

To summarize the discussion: either it's a bug in glibc, or there is an
option to specify a modern POSIX locale. The POSIX locale consists of
characters from the portable character set, and Unicode is certainly
portable.

Serge.
 
Fredrik Lundh

Peter said:
What about the environment variable LANG? I have SuSE 9.1 and
LANG = de_DE.UTF-8. Your example is running well on my computer.

Python's Unicode subsystem shouldn't depend on the system's LANG
setting.

</F>
 
Serge Orlov

Peter said:
What about the environment variable LANG? I have SuSE 9.1 and
LANG = de_DE.UTF-8. Your example is running well on my computer.

This thread is only about problems with LANG=C or LANG=POSIX; it's not
about other locales. Other locales are working as expected.

Serge.
 
Peter Maas

Serge said:
Denis S. Otkidach wrote:
To summarize the discussion: either it's a bug in glibc or there is an
option to specify modern POSIX locale. POSIX locale consist of
characters from the portable character set, unicode is certainly
portable.

What about the environment variable LANG? I have SuSE 9.1 and
LANG = de_DE.UTF-8. Your example is running well on my computer.
 
Denis S. Otkidach

This thread is about problems only with LANG=C or LANG=POSIX, it's not
about other locales. Other locales are working as expected.

You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
doesn't pass. $LANG doesn't matter if I don't call setlocale.
Fortunately, setting any non-C locale solves the problem for all (I
believe) unicode characters:
re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']
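
The workaround described here can be sketched as follows. The exact setlocale call used in the post is not shown in the thread, so `locale.setlocale(locale.LC_ALL, '')` is an assumption, and the function name `findall_words_with_user_locale` is made up for illustration. The code is written for Python 3, where \w on str patterns does not depend on the locale, so there the call is harmless rather than necessary.

```python
import locale
import re

# Same sample as in the post: micro sign, masculine ordinal,
# a-umlaut, Cyrillic small a.
SAMPLE = '\xb5\xba\xe4\u0430'


def findall_words_with_user_locale(text):
    """Switch to the environment's locale if possible, then match word runs."""
    try:
        # '' selects the locale named by LANG / LC_* in the environment.
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # The environment names a locale this system does not provide;
        # keep whatever locale is currently active.
        pass
    return re.compile(r'\w+', re.UNICODE).findall(text)


if __name__ == '__main__':
    print(findall_words_with_user_locale(SAMPLE))
```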
 
Serge Orlov

Denis said:
You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
doesn't pass.

I meant "only with C or POSIX locales" when I wrote "only with LANG=C
or LANG=POSIX". My bad.

$LANG doesn't matter if I don't call setlocale.

Sure.

Fortunately, setting any non-C locale solves the problem for all (I
believe) unicode characters:
[u'\xb5\xba\xe4\u0430']

I can't find a strict definition of isalpha, but I believe the average
C program shouldn't care about the current locale's alphabet, so isalpha
covers the union of all supported characters in all alphabets.

Serge.
 
Fredrik Lundh

Serge said:
re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average
C program shouldn't care about the current locale alphabet, so isalpha
is a union of all supported characters in all alphabets

nope. isalpha() depends on the locale, as do all other ctype functions
(this also applies to wctype, on some platforms).

</F>
 
Fredrik Lundh

Serge said:
re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average
C program shouldn't care about the current locale alphabet, so isalpha
is a union of all supported characters in all alphabets

btw, what does isalpha have to do with this example?

</F>
 
Serge Orlov

Martin v. Löwis said:
Yes, but U+00E4 is not in the portable character set. The portable
character set is defined here:

http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html

Thanks for the link. They write (in 1997 or earlier ?):

The wide-character value for each member of the Portable
Character Set will equal its value when used as the lone character
in an integer character constant. Wide-character codes for other
characters are locale- and *implementation-dependent*

Emphasis is mine. So how many libc implementations with
non-unicode wide-character codes do we have in 2005?
I'm really interested to know.

Serge.
 
Serge Orlov

Fredrik said:
Serge said:
re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average
C program shouldn't care about the current locale alphabet, so
isalpha is a union of all supported characters in all alphabets

nope. isalpha() depends on the locale, as does all other ctype
functions (this also applies to wctype, on some platforms).

I mean "all supported characters in all alphabets [in the current
locale]". For example, in ru_RU.koi8-r isalpha should return
true for characters in the English and Russian alphabets; in
ru_RU.koi8-u, for characters in the English, Russian and Ukrainian
alphabets; in ru_RU.utf-8, for all alphabetic characters in Unicode
supported by the implementation. IMHO iswalpha in the POSIX
locale could return true for all alphabetic characters in Unicode
instead of being limited to the English alphabet.

Serge.
 
Serge Orlov

Fredrik said:
Serge said:
re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average
C program shouldn't care about the current locale alphabet, so
isalpha is a union of all supported characters in all alphabets

btw, what does isalpha have to do with this example?

It has to do with this thread. u'\xe4'.isalpha() returns false in
Suse. It's in the same boat as \w

Serge.
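
To see whether isalpha and \w agree on a given interpreter, a small side-by-side report can help. This is a sketch for Python 3; the name `letter_report` is hypothetical and not something used in this thread.

```python
import re

WORD = re.compile(r'\w', re.UNICODE)


def letter_report(text):
    """Map each character to (str.isalpha result, whether the word class matches it)."""
    return {ch: (ch.isalpha(), WORD.match(ch) is not None) for ch in text}


if __name__ == '__main__':
    for ch, (alpha, word) in letter_report('\xe4a_').items():
        print(repr(ch), 'isalpha:', alpha, 'word char:', word)
```

The underscore is included to show that the two are related but not identical: it is a word character for \w without being alphabetic.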
 
