regex help: splitting string gets weird groups

Discussion in 'Python' started by gry, Apr 8, 2010.

  1. gry

    gry Guest

    [ python3.1.1, re.__version__='2.2.1' ]
    I'm trying to use re to split a string into (any number of) pieces of
    these kinds:
    1) contiguous runs of letters
    2) contiguous runs of digits
    3) single other characters

    e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
    '.', 'in', '#', '=', 1234]
    I tried:
    >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()

    ('1234', 'in', '1234', '=')

    Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
    group? Is my regexp illegal somehow and confusing the engine?

    I *would* like to understand what's wrong with this regex, though if
    someone has a neat other way to do the above task, I'm also interested
    in suggestions.
     
    gry, Apr 8, 2010
    #1

  2. gry

    MRAB Guest

    gry wrote:
    > [ python3.1.1, re.__version__='2.2.1' ]
    > I'm trying to use re to split a string into (any number of) pieces of
    > these kinds:
    > 1) contiguous runs of letters
    > 2) contiguous runs of digits
    > 3) single other characters
    >
    > e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
    > '.', 'in', '#', '=', 1234]
    > I tried:
    > >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()

    > ('1234', 'in', '1234', '=')
    >
    > Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
    > group? Is my regexp illegal somehow and confusing the engine?
    >
    > I *would* like to understand what's wrong with this regex, though if
    > someone has a neat other way to do the above task, I'm also interested
    > in suggestions.


    If the regex was illegal then it would raise an exception. It's doing
    exactly what you're asking it to do!

    First of all, there are 4 groups, with group 1 containing groups 2..4 as
    alternatives, so group 1 will match whatever groups 2..4 match:

    Group 1: (([A-Za-z]+)|([0-9]+)|([-.#=]))
    Group 2: ([A-Za-z]+)
    Group 3: ([0-9]+)
    Group 4: ([-.#=])

    It matches like this:

    Group 1 and group 3 match '555'.
    Group 1 and group 2 match 'tHe'.
    Group 1 and group 4 match '-'.
    Group 1 and group 2 match 'rain'.
    Group 1 and group 4 match '.'.
    Group 1 and group 2 match 'in'.
    Group 1 and group 4 match '#'.
    Group 1 and group 4 match '='.
    Group 1 and group 3 match '1234'.

    If a group matches then any earlier match of that group is discarded,
    so:

    Group 1 finishes with '1234'.
    Group 2 finishes with 'in'.
    Group 3 finishes with '1234'.
    Group 4 finishes with '='.
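
    To watch that discarding happen, here is a rough sketch (not part of
    the session above) that applies the inner alternation one piece at a
    time with re.finditer() and remembers the last text each group
    captured. The names step and last_seen are made up for illustration.

```python
import re

s = '555tHe-rain.in#=1234'

# The same four groups as above, but without the outer "+" repetition,
# so finditer() hands back one match object per piece.
step = re.compile(r'(([A-Za-z]+)|([0-9]+)|([-.#=]))')

last_seen = {}
for m in step.finditer(s):
    for number in range(1, 5):
        if m.group(number) is not None:
            # A later capture overwrites the earlier one for this group.
            last_seen[number] = m.group(number)

final = tuple(last_seen[n] for n in range(1, 5))

# The original repeated pattern, matched in one go.
whole = re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', s).groups()

print(final)
print(final == whole)
```

    The tuple built by hand should line up with what .groups() reported
    for the full repeated pattern.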

    A solution is:

    >>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', '555tHe-rain.in#=1234')

    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    Note: re.findall() returns a list of matches. If the regex doesn't
    contain any groups, each item is the whole matched substring; if it
    contains a group, each item is what that group captured. Compare:

    >>> re.findall("a(.)", "ax ay")

    ['x', 'y']
    >>> re.findall("a.", "ax ay")

    ['ax', 'ay']
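
    Two related cases, as a small sketch: with more than one group,
    findall() returns tuples, and a non-capturing group (?:...) lets you
    group for alternation without changing what findall() returns. The
    variable names are only for illustration.

```python
import re

text = "ax ay"

# One group: each item is what the group captured.
one_group = re.findall(r"a(.)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(a)(.)", text)

# Non-capturing group: grouping only, so whole matches come back.
non_capturing = re.findall(r"a(?:x|y)", text)

print(one_group)      # ['x', 'y']
print(two_groups)     # [('a', 'x'), ('a', 'y')]
print(non_capturing)  # ['ax', 'ay']
```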
     
    MRAB, Apr 8, 2010
    #2

  3. gry

    Jon Clements Guest

    On 8 Apr, 19:49, gry <> wrote:
    > [ python3.1.1, re.__version__='2.2.1' ]
    > I'm trying to use re to split a string into (any number of) pieces of
    > these kinds:
    > 1) contiguous runs of letters
    > 2) contiguous runs of digits
    > 3) single other characters
    >
    > e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
    > '.', 'in', '#', '=', 1234]
    > I tried:
    > >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
    >
    > ('1234', 'in', '1234', '=')
    >
    > Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
    > group?  Is my regexp illegal somehow and confusing the engine?
    >
    > I *would* like to understand what's wrong with this regex, though if
    > someone has a neat other way to do the above task, I'm also interested
    > in suggestions.


    I would avoid .match and use .findall
    (if you walk through them both together, it'll make sense what's
    happening
    with your match string).

    >>> s = """555tHe-rain.in#=1234"""
    >>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', s)

    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    hth,

    Jon.
     
    Jon Clements, Apr 8, 2010
    #3
  4. On Apr 8, 1:49 pm, gry <> wrote:
    > [ python3.1.1, re.__version__='2.2.1' ]
    > I'm trying to use re to split a string into (any number of) pieces of
    > these kinds:
    > 1) contiguous runs of letters
    > 2) contiguous runs of digits
    > 3) single other characters
    >
    > e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
    > '.', 'in', '#', '=', 1234]
    > I tried:
    > >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
    >
    > ('1234', 'in', '1234', '=')
    >
    > Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
    > group?  Is my regexp illegal somehow and confusing the engine?
    >
    > I *would* like to understand what's wrong with this regex, though if
    > someone has a neat other way to do the above task, I'm also interested
    > in suggestions.


    IMO, for most purposes, for people who don't want to become re
    experts, the easiest, fastest, best, most predictable way to use re is
    re.split. You can either call re.split directly, or, if you are going
    to be splitting on the same pattern over and over, compile the pattern
    and grab its split method. Use a *single* capture group in the
    pattern, that covers the *whole* pattern. In the case of your example
    data:

    >>> import re
    >>> splitter=re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
    >>> s='555tHe-rain.in#=1234'
    >>> [x for x in splitter(s) if x]

    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    The reason for the list comprehension is that re.split will always
    return a non-matching string between matches. Sometimes this is
    useful even when it is a null string (see recent discussion in the
    group about splitting digits out of a string), but if you don't care
    to see null (empty) strings, this comprehension will remove them.

    The reason for a single capture group that covers the whole pattern is
    that it is much easier to reason about the output. The split will
    give you all your data, in order, e.g.

    >>> ''.join(splitter(s)) == s

    True
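
    For anyone curious what the splitter returns before the empty strings
    are filtered out, here is a minimal sketch using the same pattern.
    With a single capture group, the even positions hold the text between
    matches and the odd positions hold the matches themselves. The names
    parts, gaps, tokens and spaced are just for illustration.

```python
import re

splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split

s = '555tHe-rain.in#=1234'
parts = splitter(s)

gaps = parts[::2]     # text between (and around) the matches
tokens = parts[1::2]  # the matches themselves

print(gaps)    # all empty here, because every character was matched
print(tokens)

# With characters the pattern does not cover, the gap is no longer empty.
spaced = splitter('ab  12')
print(spaced)  # ['', 'ab', '  ', '12', '']
```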

    HTH,
    Pat
     
    Patrick Maupin, Apr 8, 2010
    #4
  5. gry

    Tim Chase Guest

    gry wrote:
    > [ python3.1.1, re.__version__='2.2.1' ]
    > I'm trying to use re to split a string into (any number of) pieces of
    > these kinds:
    > 1) contiguous runs of letters
    > 2) contiguous runs of digits
    > 3) single other characters
    >
    > e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
    > '.', 'in', '#', '=', 1234]
    > I tried:
    > >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()

    > ('1234', 'in', '1234', '=')
    >
    > Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
    > group? Is my regexp illegal somehow and confusing the engine?


    Well, I'm not sure what it thinks it's finding, but nested capture-groups
    always produce somewhat weird results for me (I suspect that's what's
    triggering the duplication). Additionally, you're only searching for
    one match (.match() returns a single match-object or None, not all
    possible matches within the repeated super-group).

    > I *would* like to understand what's wrong with this regex, though if
    > someone has a neat other way to do the above task, I'm also interested
    > in suggestions.


    Tweaking your original, I used

    >>> s='555tHe-rain.in#=1234'
    >>> import re
    >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
    >>> r.findall(s)

    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    The only difference between my results and your results is that the 555
    and 1234 come back as strings, not ints.
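
    If the ints matter, one possible follow-up step (a sketch, not
    something from the original request beyond the desired output) is to
    convert the all-digit tokens after the fact. The names tokens and
    converted are made up here.

```python
import re

s = '555tHe-rain.in#=1234'
tokens = re.findall(r'[a-zA-Z]+|\d+|.', s)

# Turn the all-digit tokens into ints and leave everything else alone.
converted = [int(t) if t.isdecimal() else t for t in tokens]

print(converted)  # [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
```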

    -tkc
     
    Tim Chase, Apr 8, 2010
    #5
  6. gry

    gry Guest

    On Apr 8, 3:40 pm, MRAB <> wrote:

    ....
    > Group 1 and group 4 match '='.
    > Group 1 and group 3 match '1234'.
    >
    > If a group matches then any earlier match of that group is discarded,

    Wow, that makes this much clearer! I wonder if this behaviour
    shouldn't be mentioned in some form in the python docs?
    Thanks much!

    > so:
    >
    > Group 1 finishes with '1234'.
    > Group 2 finishes with 'in'.
    > Group 3 finishes with '1234'.
     
    gry, Apr 8, 2010
    #6
  7. gry

    Jon Clements Guest

    On 8 Apr, 19:49, gry <> wrote:
    > [ python3.1.1, re.__version__='2.2.1' ]
    > I'm trying to use re to split a string into (any number of) pieces of
    > these kinds:
    > 1) contiguous runs of letters
    > 2) contiguous runs of digits
    > 3) single other characters
    >
    > e.g.   555tHe-rain.in#=1234   should give:   [555, 'tHe', '-', 'rain',
    > '.', 'in', '#', '=', 1234]
    > I tried:
    > >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
    >
    > ('1234', 'in', '1234', '=')
    >
    > Why is 1234 repeated in two groups?  and why doesn't "tHe" appear as a
    > group?  Is my regexp illegal somehow and confusing the engine?
    >
    > I *would* like to understand what's wrong with this regex, though if
    > someone has a neat other way to do the above task, I'm also interested
    > in suggestions.


    Avoiding re's (for a bit of fun):
    (no good for unicode obviously)

    import string
    from itertools import groupby, chain, repeat, count

    s = """555tHe-rain.in#=1234"""

    # Letters map to 'L' and digits to 'D'; each distinct punctuation
    # character gets its own integer key, so different punctuation
    # characters never merge into one group.
    unique_group = count()
    lookup = dict(
        chain(
            zip(string.ascii_letters, repeat('L')),
            zip(string.digits, repeat('D')),
            zip(string.punctuation, unique_group)
        )
    )
    parse = dict(D=int, L=str.capitalize)

    # Written for Python 3, as in the original question.
    print([parse.get(key, lambda L: L)(''.join(items))
           for key, items in groupby(s, lambda L: lookup[L])])
    [555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]
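
    On the unicode caveat: one way around the fixed lookup table, sketched
    here under the assumption that "letter" and "digit" may be decided by
    str.isalpha() and str.isdecimal(), is to compute the groupby key per
    character. The helper names kind and tokenize are hypothetical.

```python
from itertools import groupby

def kind(pair):
    """Return the grouping key for an (index, character) pair."""
    index, ch = pair
    if ch.isalpha():
        return 'L'
    if ch.isdecimal():
        return 'D'
    # Any other character uses its own position as the key, so it
    # always ends up in a group by itself.
    return index

def tokenize(text):
    return [''.join(ch for _, ch in run)
            for _, run in groupby(enumerate(text), kind)]

print(tokenize('555tHe-rain.in#=1234'))
print(tokenize('na\u00efve42'))
```

    Unlike the lookup-table version, this keeps every "other" character
    separate even when the same character repeats.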

    Jon.
     
    Jon Clements, Apr 8, 2010
    #7
  8. gry

    gry Guest

    >    >>> s='555tHe-rain.in#=1234'
    >    >>> import re
    >    >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
    >    >>> r.findall(s)
    >    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    This is nice and simple and has the invertible property that Patrick
    mentioned above. Thanks much!
     
    gry, Apr 8, 2010
    #8
  9. On Apr 8, 3:40 pm, gry <> wrote:
    > >    >>> s='555tHe-rain.in#=1234'
    > >    >>> import re
    > >    >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
    > >    >>> r.findall(s)
    > >    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

    >
    > This is nice and simple and has the invertible property that Patrick
    > mentioned above.  Thanks much!


    Yes, like using split(), this is invertible. But you will see a
    difference (and for a given task, you might prefer one way or the
    other) if, for example, you put a few consecutive spaces in the middle
    of your string: this pattern with findall() will return each space
    individually, while split() will return them all together as one
    string.

    You *can* fix up the pattern for findall() where it will have the same
    properties as the split(), but it will almost always be a more
    complicated pattern than for the equivalent split().
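
    As a rough illustration of that difference, here is one possible
    fixed-up findall() pattern next to the other two approaches. The extra
    negated character class is my own guess at such a fix-up, and the
    variable names are made up.

```python
import re

s = 'ab  12'

# The '.' alternative matches one character at a time.
per_char = re.findall(r'[a-zA-Z]+|\d+|.', s)

# Extra alternative: a run of anything the first three do not cover.
grouped = re.findall(r'[A-Za-z]+|[0-9]+|[-.#=]|[^A-Za-z0-9.#=-]+', s)

# split() with one capture group, empty strings filtered out.
via_split = [x for x in re.split('([A-Za-z]+|[0-9]+|[-.#=])', s) if x]

print(per_char)   # ['ab', ' ', ' ', '12']
print(grouped)    # ['ab', '  ', '12']
print(via_split)  # ['ab', '  ', '12']
```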

    Another thing you can do with split(): if you *think* you have a
    pattern that fully covers every string you expect to throw at it, but
    would like to verify this, you can make use of the fact that split()
    returns a string between each match (and before the first match and
    after the last match). So if you expect that every character in your
    entire string should be a part of a match, you can do something like:

    strings = splitter(s)
    tokens = strings[1::2]
    assert not ''.join(strings[::2])
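
    The same check can be wrapped in a small function that raises instead
    of asserting, which also survives running Python with -O. This is only
    a sketch; the name tokenize_strict and the error message are made up.

```python
import re

splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split

def tokenize_strict(s):
    """Return the tokens of s, or raise if any text was left unmatched."""
    strings = splitter(s)
    # Even positions hold the text between matches; all should be empty.
    leftovers = [gap for gap in strings[::2] if gap]
    if leftovers:
        raise ValueError('unmatched text: %r' % leftovers)
    return strings[1::2]

print(tokenize_strict('555tHe-rain.in#=1234'))

try:
    tokenize_strict('ab 12')
except ValueError as exc:
    print('rejected:', exc)
```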

    Regards,
    Pat
     
    Patrick Maupin, Apr 8, 2010
    #9