sre is broken in SuSE 9.2

Fredrik Lundh · Feb 12, 2005

Serge said:
The wide-character value for each member of the Portable
Character Set will equal its value when used as the lone character
in an integer character constant. Wide-character codes for other
characters are locale- and *implementation-dependent*

Emphasis is mine.

the relevant part for this thread is *locale-*. if wctype depends on the
locale, it cannot be used for generic build. (custom interpreters are
another thing, but they shouldn't be shipped as "python").

</F>

Denis S. Otkidach · Feb 12, 2005

>>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find the strict definition of isalpha, but I believe average
C program shouldn't care about the current locale alphabet, so isalpha
is a union of all supported characters in all alphabets

btw, what does isalpha have to do with this example?

The same problem is with isalpha. In most distributions:....
True True True True

And in SuSE 9.2:....
False False False False
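
For comparison, here is a minimal sketch of the same two checks in Python 3 syntax (an assumption; the thread itself uses Python 2 literals such as ur'...'). As far as I know, current Python 3 releases take these classifications from the interpreter's bundled Unicode tables rather than from the C library, so the result should not vary by distribution. The variable names are only for illustration.

```python
import re

# The four characters from the example above: MICRO SIGN, MASCULINE
# ORDINAL INDICATOR, LATIN SMALL LETTER A WITH DIAERESIS and
# CYRILLIC SMALL LETTER A.
text = '\xb5\xba\xe4\u0430'

# In Python 3, str patterns use Unicode matching by default,
# so re.U does not need to be passed explicitly.
matches = re.compile(r'\w+').findall(text)

# One isalpha() result per character, as in the listing above.
alpha_flags = [ch.isalpha() for ch in text]

print(matches)
print(*alpha_flags)
```

If all four characters are treated as letters, the regex returns the whole string as a single match and the second line prints four True values, which is the behavior reported above for most distributions.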

Denis S. Otkidach · Feb 12, 2005

the relevant part for this thread is *locale-*. if wctype depends on
the locale, it cannot be used for generic build. (custom interpreters
are another thing, but they shouldn't be shipped as "python").

You are right. But the isalpha behavior looks strange to me anyway: why
is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
locale, but not in C?

Martin v. Löwis · Feb 13, 2005

Serge said:
Emphasis is mine. So how many libc implementations with
non-unicode wide-character codes do we have in 2005?

Solaris has supported 2-byte wchar_t implementations for many
years, and so I believe did HP-UX and AIX.

ISO C99 defines a constant __STDC_ISO_10646__ which an
implementation can use to indicate that wchar_t uses
Unicode (aka ISO 10646) in all locales. Very few
implementations define this constant at this time, though.

Regards,
Martin

Martin v. Löwis · Feb 13, 2005

Denis said:
You are right. But the isalpha behavior looks strange to me anyway: why
is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
locale, but not in C?

In glibc, all "real" locales are based on
/usr/share/locale/i18n/locales/i18n, e.g. for de_DE through

LC_CTYPE
copy "i18n"

i18n includes U+0430 as a character, through

lower /
....
% TABLE 11 CYRILLIC/
<U0430>..<U045F>;<U0461>..(2)..<U047F>;/

This makes U+0430 a letter in all locales including i18n
(unless locally overridden). This entire approach apparently
is based on ISO 14652, which, in section 4.3.3, introduces
the "i18n" LC_CTYPE category.
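
As a cross-check that does not depend on any C locale, the Unicode character database can be queried from Python. This is only a sketch in Python 3 syntax: it uses the standard unicodedata module, not glibc's locale files, and it reads the `..(2)..` notation in the quoted line as "every second code point", which is my assumption about that syntax.

```python
import unicodedata

ch = '\u0430'
name = unicodedata.name(ch)          # official Unicode character name
category = unicodedata.category(ch)  # general category; 'Ll' means lowercase letter

# Code points listed in the quoted "lower" line:
#   <U0430>..<U045F>        contiguous range
#   <U0461>..(2)..<U047F>   assumed to mean every second code point
quoted = list(range(0x0430, 0x0460)) + list(range(0x0461, 0x0480, 2))
all_lowercase = all(unicodedata.category(chr(cp)) == 'Ll' for cp in quoted)

print(name, category, all_lowercase)
```

Under that reading, every code point in the quoted ranges is a lowercase letter in the Unicode database as well, which is consistent with i18n listing them under `lower`.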

Why the C locale does not use i18n, I don't know. Most likely,
the intention is that the "C" locale works without any
additional data files - you should ask the glibc developers.
OTOH, there is a definition file POSIX for what appears
to be the POSIX locale.

I'd like to point out that this implementation is potentially
in violation of ISO 14652; annex A.2.2 says that the notion
of a POSIX locale is replaced with the i18n FDCC-set. So
accordingly, I would expect that i18n is used in POSIX as
well - see for yourself that it isn't in glibc 2.3.2.

Again, I suggest asking the glibc developers why
this is so.

Regards,
Martin
