sre is broken in SuSE 9.2

Fredrik Lundh

Serge said:
The wide-character value for each member of the Portable
Character Set will equal its value when used as the lone character
in an integer character constant. Wide-character codes for other
characters are locale- and *implementation-dependent*

Emphasis is mine.

the relevant part for this thread is *locale-*. if wctype depends on the
locale, it cannot be used for a generic build. (custom interpreters are
another thing, but they shouldn't be shipped as "python").

</F>
 
Denis S. Otkidach

re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
[u'\xb5\xba\xe4\u0430']

I can't find a strict definition of isalpha, but I believe the average
C program shouldn't care about the current locale's alphabet, so isalpha
is a union of all supported characters in all alphabets.

btw, what does isalpha have to do with this example?

The same problem occurs with isalpha. In most distributions:....
True True True True

And in SuSE 9.2:....
False False False False
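
For comparison, here is a minimal sketch of the same check in Python 3 syntax (the thread itself uses Python 2). It assumes a Python 3 interpreter, where `re` and `str.isalpha` are expected to consult the interpreter's bundled Unicode database rather than the platform's wctype functions, so the result should not vary between distributions. The variable names are illustrative only.

```python
import re
import unicodedata

# The four characters from the example above: MICRO SIGN, MASCULINE ORDINAL
# INDICATOR, LATIN SMALL LETTER A WITH DIAERESIS, CYRILLIC SMALL LETTER A.
sample = "\xb5\xba\xe4\u0430"

# In Python 3, str patterns are Unicode-aware by default, so re.U is not needed.
words = re.findall(r"\w+", sample)

# Per-character isalpha results, mirroring the four booleans in the report.
alpha_flags = [ch.isalpha() for ch in sample]

# Show what the Unicode database says about each character.
for ch in sample:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"category={unicodedata.category(ch)}")

print(words)
print(alpha_flags)
```

If all four characters are matched as a single word and every flag is true, the interpreter is classifying characters independently of the C library's locale tables.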
 
Denis S. Otkidach

the relevant part for this thread is *locale-*. if wctype depends on
the locale, it cannot be used for generic build. (custom interpreters
are another thing, but they shouldn't be shipped as "python").

You are right. But the isalpha behavior looks strange to me anyway: why
is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
locale, but not in the C locale?
 
Martin v. Löwis

Serge said:
Emphasis is mine. So how many libc implementations with
non-unicode wide-character codes do we have in 2005?

Solaris has supported 2-byte wchar_t implementations for many
years, and so, I believe, have HP-UX and AIX.

ISO C99 defines a constant __STDC_ISO_10646__ which an
implementation can use to indicate that wchar_t uses
Unicode (aka ISO 10646) in all locales. Very few
implementations define this constant at this time, though.
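
As a rough illustration, the width of wchar_t on the current platform can be inspected from Python through `ctypes`. This is only a sketch: it reports the size of the type and says nothing about whether the stored values are Unicode code points, which is what `__STDC_ISO_10646__` is meant to signal and which cannot be queried this way.

```python
import ctypes

# ctypes.c_wchar maps to the platform's C wchar_t type.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {wchar_size} bytes wide on this platform")
```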

Regards,
Martin
 
Martin v. Löwis

Denis said:
You are right. But the isalpha behavior looks strange to me anyway: why
is the Cyrillic character '\u0430' recognized as alphabetic in the de_DE
locale, but not in the C locale?

In glibc, all "real" locales are based on
/usr/share/locale/i18n/locales/i18n, e.g. for de_DE through

LC_CTYPE
copy "i18n"

i18n includes U+0430 as a character, through

lower /
....
% TABLE 11 CYRILLIC/
<U0430>..<U045F>;<U0461>..(2)..<U047F>;/

This makes U+0430 a letter in all locales including i18n
(unless locally overridden). This entire approach apparently
is based on ISO 14652, which, in section 4.3.3, introduces
the "i18n" LC_CTYPE category.
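
To make the range notation in that excerpt concrete, here is a small sketch that expands it into code points. It assumes that `<UXXXX>..<UYYYY>` denotes an inclusive range and that `..(2)..` denotes an inclusive range with a step of 2; `expand_ranges` is a hypothetical helper written for this illustration, not part of glibc or Python.

```python
import re

# One list item: a single code point, a plain range, or a stepped range.
ITEM_RE = re.compile(
    r"<U([0-9A-Fa-f]+)>"                            # start code point
    r"(?:\.\.(?:\((\d+)\)\.\.)?<U([0-9A-Fa-f]+)>)?"  # optional step and end
)


def expand_ranges(spec):
    """Expand a locale-definition code point list into a set of integers."""
    code_points = set()
    # Items are separated by ';'; a trailing '/' is a line continuation.
    for item in spec.replace("/", "").split(";"):
        item = item.strip()
        if not item:
            continue
        match = ITEM_RE.fullmatch(item)
        if match is None:
            raise ValueError(f"unrecognized item: {item!r}")
        start = int(match.group(1), 16)
        step = int(match.group(2)) if match.group(2) else 1
        end = int(match.group(3), 16) if match.group(3) else start
        code_points.update(range(start, end + 1, step))
    return code_points


# The Cyrillic line quoted from the i18n definition above.
cyrillic_lower = expand_ranges("<U0430>..<U045F>;<U0461>..(2)..<U047F>;/")

print(0x0430 in cyrillic_lower)  # U+0430 is covered by the first range
print(0x0462 in cyrillic_lower)  # skipped by the step-2 range
```

Under these assumptions U+0430 falls in the first range, which is why every locale that copies "i18n" treats it as a lowercase letter.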

Why the C locale does not use i18n, I don't know. Most likely,
the intention is that the "C" locale works without any
additional data files - you should ask the glibc developers.
OTOH, there is a definition file POSIX for what appears
to be the POSIX locale.

I'd like to point out that this implementation is potentially
in violation of ISO 14652; annex A.2.2 says that the notion
of a POSIX locale is replaced with the i18n FDCC-set. So
accordingly, I would expect that i18n is used in POSIX as
well - see for yourself that it isn't in glibc 2.3.2.

Again, I suggest asking the glibc developers why
this is so.

Regards,
Martin
 
