sre is broken in SuSE 9.2

Discussion in 'Python' started by Denis S. Otkidach, Feb 10, 2005.

  1. On all platforms, \w matches all Unicode letters when used with the
    re.UNICODE flag, but this doesn't work on SuSE 9.2:

    Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
    [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    >>>


    BTW, unicodedata correctly recognizes this character as a lowercase letter:
    >>> import unicodedata
    >>> unicodedata.category(u'\xe4')
    'Ll'

    I've looked through all the SuSE patches applied, but found nothing related.
    What is the reason for the broken behavior? Incorrect configure options?

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
    Denis S. Otkidach, Feb 10, 2005
    #1

  2. Denis S. Otkidach wrote:

    > On all platforms, \w matches all Unicode letters when used with the
    > re.UNICODE flag, but this doesn't work on SuSE 9.2:
    >
    > Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
    > [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import re
    > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    > >>>
    >
    > BTW, unicodedata correctly recognizes this character as a lowercase letter:
    > >>> import unicodedata
    > >>> unicodedata.category(u'\xe4')
    > 'Ll'
    >
    > I've looked through all the SuSE patches applied, but found nothing
    > related. What is the reason for the broken behavior? Incorrect
    > configure options?


    I can get the same results on RedHat's Python 2.2.3 if I pass the re.L
    option; it looks like this option is implicitly set on SuSE.
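
    To make the difference between the two flags concrete, here is a minimal
    sketch for a current Python 3 interpreter (an assumption; the sessions in
    this thread are Python 2.3, where ur'' literals exist and re.L can be
    combined with unicode patterns). The function names are illustrative only.
    In Python 3, re.LOCALE is accepted only for bytes patterns and makes \w
    depend on the current C locale, while str patterns are classified with
    Python's own Unicode tables.

```python
import re

def matches_unicode_word(text):
    # str pattern: \w is classified by Python's own Unicode tables.
    return re.match(r'\w+', text, re.UNICODE) is not None

def matches_ascii_word(data):
    # bytes pattern without flags: \w is limited to [a-zA-Z0-9_].
    return re.match(rb'\w+', data) is not None

def matches_locale_word(data):
    # bytes pattern with re.LOCALE: \w follows the current C locale,
    # so the result for non-ASCII bytes depends on the environment.
    return re.match(rb'\w+', data, re.LOCALE) is not None

print(matches_unicode_word('\xe4'))
print(matches_ascii_word(b'\xe4'))
print(matches_locale_word(b'\xe4'))
```

    Only the third call can change with the locale, which is the kind of
    behavior reported for SuSE above.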

    Serge
    Serge Orlov, Feb 10, 2005
    #2

  3. On 10 Feb 2005 03:59:51 -0800
    "Serge Orlov" <> wrote:

    > > On all platforms, \w matches all Unicode letters when used with the
    > > re.UNICODE flag, but this doesn't work on SuSE 9.2:

    [...]
    > I can get the same results on RedHat's python 2.2.3 if I pass re.L
    > option, it looks like this option is implicitly set in Suse.


    Looks like you are right:

    >>> import re
    >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    >>> from locale import *
    >>> setlocale(LC_ALL, 'de_DE')
    'de_DE'
    >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    <_sre.SRE_Match object at 0x40375560>

    But I see nothing related to an implicit re.L option in their patches, and
    the sources themselves are the same as on other platforms. I'd prefer
    to find the source of the problem.

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
    Denis S. Otkidach, Feb 10, 2005
    #3
  4. Denis S. Otkidach wrote:

    > On all platforms, \w matches all Unicode letters when used with the
    > re.UNICODE flag, but this doesn't work on SuSE 9.2:

    I think Python on SuSE 9.2 uses UCS4 for Unicode strings (as does
    RedHat); check sys.maxunicode.

    This is not an explanation, but perhaps a hint where to look.
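
    A minimal sketch of that check follows. The helper name unicode_width is
    made up for illustration. On Python 2 builds the value distinguishes
    narrow (UCS2) from wide (UCS4) builds; since Python 3.3 it is always
    0x10FFFF, so the check is only informative on older interpreters.

```python
import sys

def unicode_width(maxunicode=sys.maxunicode):
    # Narrow (UCS2) builds report 0xFFFF; wide (UCS4) builds report 0x10FFFF.
    return 'UCS4' if maxunicode > 0xFFFF else 'UCS2'

print(hex(sys.maxunicode), unicode_width())
```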

    Daniel
    Daniel Dittmar, Feb 10, 2005
    #4
  5. On Thu, 10 Feb 2005 16:23:09 +0100
    Daniel Dittmar <> wrote:

    > Denis S. Otkidach wrote:
    >
    > > On all platforms, \w matches all Unicode letters when used with the
    > > re.UNICODE flag, but this doesn't work on SuSE 9.2:
    >
    > I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
    > RedHat), check sys.maxunicode.
    >
    > This is not an explanation, but perhaps a hint where to look.


    Yes, it uses UCS4. But the Debian build with UCS4 works fine, so this is
    not the problem. Could the --with-wctype-functions configure option be
    the source of the problem?

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
    Denis S. Otkidach, Feb 10, 2005
    #5
  6. Denis S. Otkidach wrote:

    >> > On all platforms, \w matches all Unicode letters when used with the
    >> > re.UNICODE flag, but this doesn't work on SuSE 9.2:
    >>
    >> I think Python on SuSE 9.2 uses UCS4 for unicode strings (as does
    >> RedHat), check sys.maxunicode.
    >>
    >> This is not an explanation, but perhaps a hint where to look.
    >
    > Yes, it uses UCS4. But debian build with UCS4 works fine, so this is
    > not a problem. Can --with-wctype-functions configure option be the
    > source of problem?


    yes.

    that option disables Python's own Unicode database, and relies on the C library's
    wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
    for all environments.
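
    One way to see whether a given interpreter is affected is to compare what
    the regex engine accepts as \w with what the unicodedata module reports.
    The sketch below is written for Python 3 (an assumption; the thread uses
    Python 2.3), and the name unmatched_letters is invented for this example.
    On an interpreter that classifies characters with its own Unicode
    database, the returned list is expected to be empty.

```python
import re
import unicodedata

WORD = re.compile(r'\w', re.UNICODE)

def unmatched_letters(codepoints):
    """Return code points that unicodedata calls letters but \\w rejects."""
    missed = []
    for cp in codepoints:
        ch = chr(cp)
        is_letter = unicodedata.category(ch).startswith('L')
        if is_letter and WORD.match(ch) is None:
            missed.append(cp)
    return missed

# Latin-1 supplement, where the character from this thread (U+00E4) lives.
print(unmatched_letters(range(0x80, 0x100)))
```

    A non-empty result would point at the same disagreement between \w and
    unicodedata.category() that the first post shows.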

    is this an official SuSE release? do they often release stuff that hasn't been tested
    at all?

    </F>
    Fredrik Lundh, Feb 10, 2005
    #6
  7. Denis S. Otkidach wrote:

    > On 10 Feb 2005 03:59:51 -0800
    > "Serge Orlov" <> wrote:
    >
    > > > On all platforms, \w matches all Unicode letters when used with
    > > > the re.UNICODE flag, but this doesn't work on SuSE 9.2:
    > [...]
    > > I can get the same results on RedHat's python 2.2.3 if I pass re.L
    > > option, it looks like this option is implicitly set in Suse.
    >
    > Looks like you are right:
    >
    > >>> import re
    > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    > >>> from locale import *
    > >>> setlocale(LC_ALL, 'de_DE')
    > 'de_DE'
    > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    > <_sre.SRE_Match object at 0x40375560>
    >
    > But I see nothing related to implicit re.L option in their patches
    > and the sources themselves are the same as on other platforms. I'd
    > prefer to find the source of problem.


    I found that

    print u'\xc4'.isalpha()
    import locale
    print locale.getlocale()

    produces different results on SuSE (Python 2.3.3):

    False
    (None, None)

    and on RedHat (Python 2.2.3):

    1
    (None, None)

    Serge.
    Serge Orlov, Feb 10, 2005
    #7
  8. On Thu, 10 Feb 2005 17:46:06 +0100
    "Fredrik Lundh" <> wrote:

    > > Can --with-wctype-functions configure option be the
    > > source of problem?

    >
    > yes.
    >
    > that option disables Python's own Unicode database, and relies on the C library's
    > wctype.h (iswalpha, etc) to behave properly for Unicode characters. this isn't true
    > for all environments.
    >
    > is this an official SuSE release? do they often release stuff that hasn't been tested
    > at all?


    Yes, it's an official release:
    # rpm -qi python
    Name        : python                         Relocations: (not relocatable)
    Version     : 2.3.4                          Vendor: SUSE LINUX AG, Nuernberg, Germany
    Release     : 3                              Build Date: Tue Oct 5 02:28:25 2004
    Install date: Fri Jan 28 13:53:49 2005       Build Host: gambey.suse.de
    Group       : Development/Languages/Python   Source RPM: python-2.3.4-3.src.rpm
    Size        : 15108594                       License: Artistic License, Other License(s), see package
    Signature   : DSA/SHA1, Tue Oct 5 02:42:38 2004, Key ID a84edae89c800aca
    Packager    : http://www.suse.de/feedback
    URL         : http://www.python.org/
    Summary     : Python Interpreter
    <snip>

    BTW, where did they find something under the Artistic License in Python?

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
    Denis S. Otkidach, Feb 10, 2005
    #8
  9. Denis S. Otkidach wrote:

    > On all platforms, \w matches all Unicode letters when used with the
    > re.UNICODE flag, but this doesn't work on SuSE 9.2:
    >
    > Python 2.3.4 (#1, Dec 17 2004, 19:56:48)
    > [GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import re
    > >>> re.compile(ur'\w+', re.U).match(u'\xe4')
    > >>>
    >
    > BTW, unicodedata correctly recognizes this character as a lowercase letter:
    > >>> import unicodedata
    > >>> unicodedata.category(u'\xe4')
    > 'Ll'
    >
    > I've looked through all the SuSE patches applied, but found nothing
    > related. What is the reason for the broken behavior? Incorrect
    > configure options?


    To summarize the discussion: either it's a bug in glibc, or there is an
    option to specify a modern POSIX locale. The POSIX locale consists of
    characters from the portable character set, and Unicode is certainly
    portable.

    Serge.
    Serge Orlov, Feb 10, 2005
    #9
  10. Peter Maas wrote:

    >> To summarize the discussion: either it's a bug in glibc, or there is an
    >> option to specify a modern POSIX locale. The POSIX locale consists of
    >> characters from the portable character set, and Unicode is certainly
    >> portable.
    >
    > What about the environment variable LANG? I have SuSE 9.1 and
    > LANG = de_DE.UTF-8. Your example is running well on my computer.


    Python's Unicode subsystem shouldn't depend on the system's LANG
    setting.

    </F>
    Fredrik Lundh, Feb 10, 2005
    #10
  11. Peter Maas wrote:

    > Serge Orlov wrote:
    > > To summarize the discussion: either it's a bug in glibc, or there is
    > > an option to specify a modern POSIX locale. The POSIX locale consists
    > > of characters from the portable character set, and Unicode is
    > > certainly portable.
    >
    > What about the environment variable LANG? I have SuSE 9.1 and
    > LANG = de_DE.UTF-8. Your example is running well on my computer.


    This thread is about problems only with LANG=C or LANG=POSIX; it's not
    about other locales. Other locales are working as expected.

    Serge.
    Serge Orlov, Feb 10, 2005
    #11
  12. Serge Orlov wrote:

    > To summarize the discussion: either it's a bug in glibc, or there is an
    > option to specify a modern POSIX locale. The POSIX locale consists of
    > characters from the portable character set, and Unicode is certainly
    > portable.


    What about the environment variable LANG? I have SuSE 9.1 and
    LANG = de_DE.UTF-8. Your example is running well on my computer.

    --
    -------------------------------------------------------------------
    Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
    E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
    -------------------------------------------------------------------
    Peter Maas, Feb 10, 2005
    #12
  13. On 10 Feb 2005 11:49:33 -0800
    "Serge Orlov" <> wrote:

    > This thread is about problems only with LANG=C or LANG=POSIX, it's not
    > about other locales. Other locales are working as expected.


    You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
    doesn't pass. $LANG doesn't matter if I don't call setlocale.
    Fortunately, setting any non-C locale solves the problem for (I
    believe) all Unicode characters:

    >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    [u'\xb5\xba\xe4\u0430']
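
    The workaround described above can be sketched as a small startup step.
    This is written for Python 3 (an assumption; there the result of \w on
    str no longer depends on the locale, so the call only matters for code
    that really uses C-library classification). The helper names
    enable_environment_locale and unicode_words are invented for this
    example.

```python
import locale
import re

def enable_environment_locale():
    # Adopt the locale named by LANG/LC_* instead of the default "C" locale.
    try:
        return locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # The environment names a locale that is not installed; stay in "C".
        return None

def unicode_words(text):
    return re.findall(r'\w+', text, re.UNICODE)

enable_environment_locale()
# ascii() keeps the output printable even on an ASCII-only terminal.
print(ascii(unicode_words('\xb5\xba\xe4\u0430')))
```

    Calling setlocale with an empty string picks up whatever the environment
    specifies, which matches the observation that $LANG has no effect until
    setlocale is called.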

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
    Denis S. Otkidach, Feb 11, 2005
    #13
  14. Denis S. Otkidach wrote:

    > On 10 Feb 2005 11:49:33 -0800
    > "Serge Orlov" <> wrote:
    >
    > > This thread is about problems only with LANG=C or LANG=POSIX, it's
    > > not about other locales. Other locales are working as expected.
    >
    > You are not right. I have LANG=de_DE.UTF-8, and the Python test_re.py
    > doesn't pass.


    I meant "only with C or POSIX locales" when I wrote "only with LANG=C
    or LANG=POSIX". My bad.

    > $LANG doesn't matter if I don't call setlocale.


    Sure.

    > Fortunately setting any non-C locale solves the problem for all (I
    > believe) unicode character:
    >
    > >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    > [u'\xb5\xba\xe4\u0430']


    I can't find the strict definition of isalpha, but I believe an average
    C program shouldn't care about the current locale's alphabet, so isalpha
    should be true for the union of all supported characters in all alphabets.

    Serge.
    Serge Orlov, Feb 11, 2005
    #14
  15. Serge Orlov wrote:

    >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    >> [u'\xb5\xba\xe4\u0430']
    >
    > I can't find the strict definition of isalpha, but I believe average
    > C program shouldn't care about the current locale alphabet, so isalpha
    > is a union of all supported characters in all alphabets


    nope. isalpha() depends on the locale, as do all other ctype functions
    (this also applies to wctype, on some platforms).

    </F>
    Fredrik Lundh, Feb 11, 2005
    #15
  16. Serge Orlov wrote:

    >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    >> [u'\xb5\xba\xe4\u0430']
    >
    > I can't find the strict definition of isalpha, but I believe average
    > C program shouldn't care about the current locale alphabet, so isalpha
    > is a union of all supported characters in all alphabets


    btw, what does isalpha have to do with this example?

    </F>
    Fredrik Lundh, Feb 11, 2005
    #16
  17. Serge Orlov wrote:
    > To summarize the discussion: either it's a bug in glibc or there is an
    > option to specify modern POSIX locale. POSIX locale consist of
    > characters from the portable character set, unicode is certainly
    > portable.


    Yes, but U+00E4 is not in the portable character set. The portable
    character set is defined here:

    http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html
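
    The membership claim can be checked mechanically. The sketch below is for
    Python 3, and the set it builds is an assumption transcribed from the
    table on the page linked above (8 control characters, space, and the 94
    printable ASCII characters); the names are invented for this example.

```python
# Assumption: contents transcribed from the portable character set table
# (NUL, alert, backspace, tab, newline, vertical-tab, form-feed,
# carriage-return, space, and the 94 printable ASCII characters).
PORTABLE_CHARACTER_SET = frozenset('\0\a\b\t\n\v\f\r') | frozenset(
    chr(cp) for cp in range(0x20, 0x7F)
)

def in_portable_character_set(ch):
    return ch in PORTABLE_CHARACTER_SET

print(in_portable_character_set('a'))
print(in_portable_character_set('\xe4'))
```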

    Regards,
    Martin
    "Martin v. Löwis", Feb 12, 2005
    #17
  18. "Martin v. Löwis" wrote:

    > Serge Orlov wrote:
    > > To summarize the discussion: either it's a bug in glibc, or there is
    > > an option to specify a modern POSIX locale. The POSIX locale consists
    > > of characters from the portable character set, and Unicode is
    > > certainly portable.
    >
    > Yes, but U+00E4 is not in the portable character set. The portable
    > character set is defined here:
    >
    > http://www.opengroup.org/onlinepubs/007908799/xbd/charset.html


    Thanks for the link. They write (in 1997 or earlier?):

        The wide-character value for each member of the Portable
        Character Set will equal its value when used as the lone character
        in an integer character constant. Wide-character codes for other
        characters are locale- and *implementation-dependent*.

    Emphasis is mine. So how many libc implementations with
    non-Unicode wide-character codes do we have in 2005?
    I'm really interested to know.

    Serge.
    Serge Orlov, Feb 12, 2005
    #18
  19. Fredrik Lundh wrote:

    > Serge Orlov wrote:
    >
    >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    >> [u'\xb5\xba\xe4\u0430']
    >>
    >> I can't find the strict definition of isalpha, but I believe average
    >> C program shouldn't care about the current locale alphabet, so
    >> isalpha is a union of all supported characters in all alphabets
    >
    > nope. isalpha() depends on the locale, as does all other ctype
    > functions (this also applies to wctype, on some platforms).


    I mean "all supported characters in all alphabets [in the current
    locale]". For example, in ru_RU.koi8-r isalpha should return true for
    characters in the English and Russian alphabets; in ru_RU.koi8-u, for
    characters in the English, Russian and Ukrainian alphabets; and in
    ru_RU.utf-8, for all alphabetic characters in Unicode that the
    implementation supports. IMHO iswalpha in the POSIX locale could return
    true for all alphabetic characters in Unicode instead of being limited
    to the English alphabet.

    Serge.
    Serge Orlov, Feb 12, 2005
    #19
  20. Fredrik Lundh wrote:

    > Serge Orlov wrote:
    >
    >> >>> re.compile(ur'\w+', re.U).findall(u'\xb5\xba\xe4\u0430')
    >> [u'\xb5\xba\xe4\u0430']
    >>
    >> I can't find the strict definition of isalpha, but I believe average
    >> C program shouldn't care about the current locale alphabet, so
    >> isalpha is a union of all supported characters in all alphabets
    >
    > btw, what does isalpha have to do with this example?


    It has to do with this thread. u'\xe4'.isalpha() returns False on
    SuSE. It's in the same boat as \w.

    Serge.
    Serge Orlov, Feb 12, 2005
    #20
