re.split() not keeping matched text

Discussion in 'Python' started by Robert Oschler, Jul 25, 2004.

  1. Hello,

    Given the following program:

    --------------

    import re

    x = "The dog ran. The cat eats! The bird flies? Done."
    l = re.split("[.?!]", x)

    for s in l:
        print(s.strip())
    # for
    ---------------

    I am getting the following output:

    The dog ran
    The cat eats
    The bird flies
    Done

    As you can see the end of sentence punctuation marks are being removed. Yet
    the docs for re.split() say that the matched text is supposed to be
    returned. I want to keep the punctuation marks.

    Where am I going wrong here?

    Thanks,
    --
    Robert
    Robert Oschler, Jul 25, 2004
    #1

  2. Test Guest

    Hi Robert,

    Robert Oschler wrote:

    > l = re.split("[.?!]", x)


    > I want to keep the punctuation marks.


    The docs say: If _capturing parentheses_ are used in pattern, then the text
    of all groups in the pattern are also returned as part of the resulting
    list.

    So:

    l = re.split("([.?!])", x)

    will work as wanted.
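
For illustration, here is a minimal sketch of what the capturing version returns for the example string, written in Python 3 syntax (the name `parts` is only for this example). Each punctuation mark comes back as its own list entry, and the list ends with an empty string because the input ends with a delimiter.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# The parentheses make the delimiter a capturing group, so re.split()
# returns the matched punctuation alongside the text between matches.
parts = re.split(r"([.?!])", x)

print(parts)
# ['The dog ran', '.', ' The cat eats', '!', ' The bird flies', '?', ' Done', '.', '']
```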

    Bye,
    Kai
    Test, Jul 25, 2004
    #2

  3. "Test" <> wrote in message
    news:ce12hh$vc9$06$-online.com...
    > Hi Robert,
    >
    > The docs say: If _capturing parentheses_ are used in pattern, then the text
    > of all groups in the pattern are also returned as part of the resulting
    > list.
    >
    > So:
    >
    > l = re.split("([.?!])", x)
    >
    > will work as wanted.
    >
    > Bye,
    > Kai


    Kai,

    That works. Unfortunately the punctuation marks (matched text) are returned
    as separate list entries. Is there any way to avoid having to walk the list
    by steps of 2, and rejoin the "n" and "n+1" elements, to get back the
    original sentence(s)? I'm trying to save some processing time if possible.
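
One way to avoid an explicit step-by-2 loop, as a sketch only: slice the list into its even and odd entries and zip them back together. The names `texts`, `marks` and `sentences` are made up for this example, and whether this actually saves processing time has not been measured here.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

parts = re.split(r"([.?!])", x)

# With one capturing group, the result alternates:
# text, mark, text, mark, ..., trailing text
texts = parts[0::2]   # entries 0, 2, 4, ... (text between marks)
marks = parts[1::2]   # entries 1, 3, 5, ... (the punctuation)

# zip() stops at the shorter sequence, which drops the trailing entry
# (here an empty string) that has no punctuation mark of its own.
sentences = [(t + m).strip() for t, m in zip(texts, marks)]

print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']
```

Any trailing text with no closing punctuation is also dropped by zip(), since it has no matching mark; that may or may not be what is wanted.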

    Thanks,
    --
    Robert
    Robert Oschler, Jul 25, 2004
    #3
  4. On Sun, 25 Jul 2004, Robert Oschler wrote:

    > Given the following program:
    >
    > --------------
    >
    > import re
    >
    > x = "The dog ran. The cat eats! The bird flies? Done."
    > l = re.split("[.?!]", x)
    >
    > for s in l:
    >     print(s.strip())
    > # for
    > ---------------


    > I want to keep the punctuation marks.
    >
    > Where am I going wrong here?


    What you need is some magic with the (?<=...), or 'look-behind assertion'
    operator:

    re.split(r'(?<=[.?!])\s*', x)

    What this regex is saying is "match a string of spaces that follows one of
    [.?!]". This way, it will not consume the punctuation, but will consume
    the spaces (thus killing two birds with one stone by obviating the need
    for the subsequent s.strip()).

    Unfortunately, there is a slight bug: if the punctuation is not followed
    by whitespace, re.split won't split there, because the regex matches a
    zero-length string and re.split ignores empty matches. There is a patch to
    fix this (SF #988761, see the end of the message for a link), but until
    then, you can avoid the problem by using:

    re.split(r'(?<=[.?!])\s+', x)

    This won't split at end-of-sentence marks that are not followed by
    whitespace, but that may be preferable behaviour anyway (e.g. if you're
    parsing Python documentation).
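
As a quick check of the `\s+` form, here is a small sketch in Python 3 syntax; the names `sentences` and `unsplit` and the "Hi!Bye." test string are invented for this example. On Python 3.7 and later, re.split does split on empty matches, so the `\s*` form behaves differently there than described in this post.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Split on whitespace that directly follows '.', '?' or '!'.
# The look-behind only checks for the punctuation; it does not consume
# it, so each mark stays attached to its sentence.
sentences = re.split(r"(?<=[.?!])\s+", x)

for s in sentences:
    print(s)

# Punctuation that is not followed by whitespace produces no split.
unsplit = re.split(r"(?<=[.?!])\s+", "Hi!Bye.")
print(unsplit)
```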

    Hope this helps.

    Patch #988761:
    http://sourceforge.net/tracker/index.php?func=detail&aid=988761&group_id=5470&atid=305470
    Christopher T King, Jul 25, 2004
    #4
  5. Guest

    I don't know if this will save you any processing time, but you can just
    replace the split with a findall like this:
    l = re.findall("[^.?!]+[?!.]+", x)

    This should handle your example, plus it handles multiple occurrences of
    the punctuation at the end of the sentence.

    Robert Oschler <no_replies@fake_email_address.invalid> wrote:
    > Hello,
    >
    > Given the following program:
    >
    > --------------
    >
    > import re
    >
    > x = "The dog ran. The cat eats! The bird flies? Done."
    > l = re.split("[.?!]", x)
    >
    > for s in l:
    >     print(s.strip())
    > # for
    > ---------------
    >
    > I am getting the following output:
    >
    > The dog ran
    > The cat eats
    > The bird flies
    > Done
    >
    > As you can see the end of sentence punctuation marks are being removed. Yet
    > the docs for re.split() say that the matched text is supposed to be
    > returned. I want to keep the punctuation marks.
    >
    > Where am I going wrong here?
    >
    > Thanks,
    , Jul 26, 2004
    #5
  6. Peter Otten Guest

    wrote:

    > I don't know if this will save you any processing time, but you can just
    > replace the split with a findall like this:
    > l = re.findall("[^.?!]+[?!.]+", x)
    >
    > This should handle your example, plus it handles multiple occurrences of
    > the punctuation at the end of the sentence.


    One caveat: the invariant

    "".join(re.findall("[^?!.]+[?!.]+", s)) == s

    will no longer hold as you will lose leading punctuation and trailing
    non-punctuation:

    >>> re.findall("[^?!.]+[?!.]+", "!so what! you're done? yes done")

    ['so what!', " you're done?"]
    >>>
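
If the invariant matters, one possible variant (a sketch, not something proposed in this thread) is to let the text part be empty and to add a second alternative for trailing text without punctuation. The names `lossy` and `lossless` are made up for this example.

```python
import re

s = "!so what! you're done? yes done"

# Pattern from the thread: needs at least one non-punctuation character
# followed by punctuation, so the leading "!" and the trailing
# " yes done" are skipped.
lossy = re.findall(r"[^?!.]+[?!.]+", s)

# Hypothetical variant: the text before the punctuation may be empty,
# and a second alternative picks up trailing text with no punctuation.
lossless = re.findall(r"[^?!.]*[?!.]+|[^?!.]+", s)

print(lossy)     # ['so what!', " you're done?"]
print(lossless)  # ['!', 'so what!', " you're done?", ' yes done']
```

With the variant, every character of the input lands in exactly one piece, so joining the pieces gives back the original string; the leading "!" simply becomes a piece of its own.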


    Peter
    Peter Otten, Jul 26, 2004
    #6