Re: Correct handling of case in unicode and regexps

Discussion in 'Python' started by MRAB, Feb 23, 2013.

  1. MRAB

    MRAB Guest

    On 2013-02-23 18:57, Devin Jeanpierre wrote:
    > On Sat, Feb 23, 2013 at 1:12 PM, MRAB <>
    > wrote:
    >> The basic rule is that a series of characters in the regex must
    >> match a series of characters in the text, with no partial matches
    >> in either.
    >>
    >> For example, 'ss' can match 'ß', but 's' can't match 'ß' because
    >> that would be matching part of 'ß'.
    >>
    >> In a regex like 's+', you're asking it to match one or more
    >> repetitions of 's', but that would mean that 's' would have to
    >> match part of 'ß' in the first iteration and the remainder of 'ß'
    >> in the second iteration.

    >
    > That makes sense. I'll have to think about this and run some tests
    > through regex, as well.
    >
    >> Although it's theoretically possible to do that, the code is
    >> already difficult enough. The cost outweighs the potential
    >> benefit.
    >>
    >> If you'd like to have a go at implementing it, the code _is_ open
    >> source. :)

    >
    > Actually, the reason it's relevant to me is that I'm reimplementing
    > the re module using a more automata theoretic approach (it's my
    > second attack at the problem). Also, I've read the _sre source code
    > and it's unpleasant. Is regex much better?
    >

    I like to think so! ;-)

    Part of the problem may be that it also supports fuzzy (approximate)
    matching.

    > At least the way I'm planning on going about it, supporting this is
    > easier, as long as one can figure out what it means to match halfway
    > inside a ß. Since case folding is a homomorphism*, I can case fold
    > the regex** and case fold the input and then I'm done. Case folding
    > of the input can be done character by character, and to emulate the
    > regex module behavior I'd need to check at certain places whether or
    > not I'm in the middle of a casefolding expansion, and fail if so. On
    > the other hand, if I don't emulate the regex module's behavior in at
    > least some cases, I'd need to figure out what the value of a match
    > of 's' against 'ß' would be.
    >

    It seems like a reasonable restriction to me that start and end
    positions should be integers, i.e. forbid partial matches within a
    character.

    This means that matching '(ss)' against 'ß' would be OK, as would '(s+)'
    against 'ß', but '(s)' or '(s)(s)' against 'ß' would not, otherwise you
    could get:

    >>> match('(s)', 'ß').span(0)

    (0, 0.5)
    >>> match('(s)(s)', 'ß').span(0, 1, 2)

    ((0, 1), (0, 0.5), (0.5, 1)).

    > [*] i.e. it can be done character by character (see Unicode 3.13
    > Default Case Algorithms) [**] Not as trivial as it sounds, but still
    > easy. [ßa-z] goes to e.g. [a-z]|ss (not [ssa-z]).
    >
    MRAB, Feb 23, 2013
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Devin Jeanpierre
    Replies:
    1
    Views:
    91
    jmfauth
    Feb 24, 2013
  2. Vlastimil Brom
    Replies:
    0
    Views:
    65
    Vlastimil Brom
    Feb 23, 2013
  3. Devin Jeanpierre
    Replies:
    0
    Views:
    75
    Devin Jeanpierre
    Feb 23, 2013
  4. Devin Jeanpierre
    Replies:
    0
    Views:
    61
    Devin Jeanpierre
    Feb 23, 2013
  5. MRAB
    Replies:
    0
    Views:
    57
Loading...

Share This Page