Correct handling of case in unicode and regexps

Discussion in 'Python' started by Devin Jeanpierre, Feb 23, 2013.

  1. Hi folks,

    I'm pretty unsure of myself when it comes to unicode. As I understand
    it, you're generally supposed to compare things in a case insensitive
    manner by case folding, right? So instead of a.lower() == b.lower()
    (the ASCII way), you do a.casefold() == b.casefold()

    However, I'm struggling to figure out how regular expressions should
    treat case. Python's re module doesn't "work properly" to my
    understanding, because:

    >>> import re
    >>> a = 'ss'
    >>> b = 'ß'
    >>> a.casefold() == b.casefold()
    True
    >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
    >>> # oh dear!

    In addition, it seems improbable that this ever _could_ work. Because
    if it did work like that, then what would the value be of
    re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
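
One way to see why the `.end()` question is awkward: full case folding can lengthen a string, so offsets into the folded text do not line up with offsets in the original. The sketch below is not how `re` behaves. It folds one character at a time and remembers which original index each folded character came from. `fold_with_offsets` and `search_folded` are hypothetical helper names, and rounding a partial match outward to the whole original character is only one possible convention.

```python
import re


def fold_with_offsets(text):
    """Casefold text and record the original index of every folded character."""
    folded_chars = []
    origins = []
    for index, char in enumerate(text):
        for folded_char in char.casefold():  # 'ß' yields two characters: 's', 's'
            folded_chars.append(folded_char)
            origins.append(index)
    return "".join(folded_chars), origins


def search_folded(literal, text):
    """Find a literal caselessly; return a (start, end) span in the original text."""
    folded, origins = fold_with_offsets(text)
    match = re.search(re.escape(literal.casefold()), folded)
    if match is None or match.end() == match.start():
        return None
    # Map folded offsets back, rounding outward to whole original characters.
    return origins[match.start()], origins[match.end() - 1] + 1


print(search_folded("SS", "Straße"))  # (4, 5): the span covering 'ß'
print(search_folded("s", "ß"))        # (0, 1): half of the folded 'ss' still claims all of 'ß'
```

Under this convention there is never a fractional end position, but a pattern can "match" a character it only half covers, which may or may not be what a caller wants.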

    I'd really like to hear the thoughts of people more experienced with
    unicode. What is the ideal correct behavior here? Or do I
    misunderstand things?

    -- Devin
    Devin Jeanpierre, Feb 23, 2013

  2. jmfauth Guest

    On Feb 23, 15:26, Devin Jeanpierre wrote:
    > Hi folks,
    >
    > I'm pretty unsure of myself when it comes to unicode. As I understand
    > it, you're generally supposed to compare things in a case insensitive
    > manner by case folding, right? So instead of a.lower() == b.lower()
    > (the ASCII way), you do a.casefold() == b.casefold()
    >
    > However, I'm struggling to figure out how regular expressions should
    > treat case. Python's re module doesn't "work properly" to my
    > understanding, because:
    >
    >     >>> a = 'ss'
    >     >>> b = 'ß'
    >     >>> a.casefold() == b.casefold()
    >     True
    >     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
    >     >>> # oh dear!
    >
    > In addition, it seems improbable that this ever _could_ work. Because
    > if it did work like that, then what would the value be of
    > re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
    >
    > I'd really like to hear the thoughts of people more experienced with
    > unicode. What is the ideal correct behavior here? Or do I
    > misunderstand things?


    -----

    I wonder whether there is a real issue here. After all,
    this is only a question of conventions. Unicode has some
    conventions, and a regex module has to adopt some conventions too.

    It seems to me that the safest way is to preprocess the text
    that has to be examined.
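
A minimal sketch of such preprocessing, assuming that canonical caseless matching is the convention wanted: normalize, casefold, normalize again, then compare. It follows the recipe given in Python's Unicode HOWTO; `caseless_key` is a name chosen here for illustration.

```python
import unicodedata


def caseless_key(text):
    """Return a key that is equal for strings that match caselessly."""
    nfd = unicodedata.normalize("NFD", text)
    # Folding can undo normalization for a few characters, so normalize again.
    return unicodedata.normalize("NFD", nfd.casefold())


print(caseless_key("Straße") == caseless_key("STRASSE"))  # True
print(caseless_key("\u00ca") == caseless_key("e\u0302"))  # True: 'Ê' vs 'e' + combining circumflex
print("Straße".lower() == "STRASSE".lower())              # False
```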

    Proposed case study:
    How should ss/ß/SS/ẞ be interpreted?

    'Richard-Strauss-Straße'
    'Richard-Strauss-Strasse'
    'RICHARD-STRAUSS-STRASSE'
    'RICHARD-STRAUSS-STRAẞE'
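
For what it is worth, Python's own convention (`str.casefold()`, available since Python 3.3) treats all four spellings alike. A quick check, which also shows a literal pattern matching once both the pattern and the text are folded first:

```python
import re

spellings = [
    "Richard-Strauss-Straße",
    "Richard-Strauss-Strasse",
    "RICHARD-STRAUSS-STRASSE",
    "RICHARD-STRAUSS-STRAẞE",
]

# Full case folding maps 'ß' and 'ẞ' to 'ss', so every spelling folds alike.
folded = {s.casefold() for s in spellings}
print(folded)  # {'richard-strauss-strasse'}

# A literal pattern matches every spelling once pattern and text are both folded.
pattern = re.compile(re.escape("strasse".casefold()))
print(all(pattern.search(s.casefold()) for s in spellings))  # True
```

Whether the four spellings should be treated as equal is still a policy decision; folding only makes one particular answer easy to implement.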


    The situation is more or less the same with sorting.
    Unicode cannot do everything, and it may be necessary to
    preprocess the "input".

    E.g., here is a function I once wrote for fun. It sorts French
    words (without unicodedata or locale).

    >>> import libfrancais
    >>> z = ['oeuf', 'œuf', 'od', 'of']
    >>> zo = libfrancais.sortedfr(z)
    >>> zo
    ['od', 'oeuf', 'œuf', 'of']
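
The source of `libfrancais` is not shown in the thread, so the following is only a guess at the kind of preprocessing involved, not the author's implementation. It assumes a sort key that expands ligatures and casefolds; `LIGATURES` and `french_sort_key` are hypothetical names, and accented letters are deliberately left out of this sketch.

```python
# Hypothetical stand-in for libfrancais.sortedfr; only ligatures and case are handled.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def french_sort_key(word):
    """Expand ligatures, then casefold, so 'œuf' sorts next to 'oeuf'."""
    return word.translate(LIGATURES).casefold()


words = ["oeuf", "œuf", "od", "of"]
print(sorted(words, key=french_sort_key))  # ['od', 'oeuf', 'œuf', 'of']
print(sorted(words))                       # ['od', 'oeuf', 'of', 'œuf']
```

Because Python's sort is stable, 'oeuf' and 'œuf' share a key and keep their input order; the plain code-point sort pushes 'œuf' to the end.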

    jmf
    jmfauth, Feb 24, 2013