re.split() not keeping matched text

Robert Oschler

Hello,

Given the following program:

--------------

import re

x = "The dog ran. The cat eats! The bird flies? Done."
l = re.split("[.?!]", x)

for s in l:
    print(s.strip())
# for
---------------

I am getting the following output:

The dog ran
The cat eats
The bird flies
Done

As you can see, the end-of-sentence punctuation marks are being removed. Yet
the docs for re.split() say that the matched text is supposed to be
returned. I want to keep the punctuation marks.

Where am I going wrong here?

Thanks,
 
Test

Hi Robert,

Robert said:
l = re.split("[.?!]", x)
I want to keep the punctuation marks.

The docs say: If _capturing parentheses_ are used in pattern, then the text
of all groups in the pattern are also returned as part of the resulting
list.

So:

l = re.split("([.?!])", x)

will work as wanted.
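
For anyone following along, here is a small sketch of what the capturing-group version returns for the sample string from the first post. The variable name `parts` is only illustrative.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Capturing parentheses make re.split() return the delimiters as well,
# each one as its own list entry between the pieces of text.
parts = re.split("([.?!])", x)
print(parts)
# ['The dog ran', '.', ' The cat eats', '!', ' The bird flies', '?', ' Done', '.', '']
```

Note the empty string at the end: the input finishes with a delimiter, so there is an empty piece of text after it.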

Bye,
Kai
 
Robert Oschler

Kai,

That works. Unfortunately the punctuation marks (matched text) are returned
as separate list entries. Is there any way to avoid having to walk the list
by steps of 2, and rejoin the "n" and "n+1" elements, to get back the
original sentence(s)? I'm trying to save some processing time if possible.
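
For reference, the pairwise rejoin described above could look roughly like this. It is only a sketch of the step-by-2 approach, assuming the list came from the capturing-group split; the names `parts` and `sentences` are made up for the example.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."
parts = re.split("([.?!])", x)

# Even indexes hold text, odd indexes hold the punctuation that followed it.
# zip() pairs them up; the trailing empty string has no partner and is dropped.
sentences = [(text + mark).strip()
             for text, mark in zip(parts[0::2], parts[1::2])]
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']
```

One thing to watch: any text after the last punctuation mark has no partner either, so this version silently drops it.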

Thanks,
 
Christopher T King

What you need is some magic with the (?<=...), or 'look-behind assertion'
operator:

re.split(r'(?<=[.?!])\s*', x)

What this regex is saying is "match a string of spaces that follows one of
[.?!]". This way, it will not consume the punctuation, but will consume
the spaces (thus killing two birds with one stone by obviating the need
for the subsequent s.strip()).

Unfortunately, there is a slight bug: if the punctuation is not followed
by whitespace, re.split won't split there, because the regex matches a
zero-length string. There is a patch to fix this (SF #988761, see the end
of the message for a link), but until then, you can avoid the problem by
using:

re.split(r'(?<=[.?!])\s+', x)

This won't split at end-of-sentence marks that are not followed by
whitespace, but that may be preferable behaviour anyway (e.g. if you're
parsing Python documentation).
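
A quick sketch of the `\s+` version applied to the sample string from the original post, plus a second, made-up input (`tight`) to show what happens when a mark is not followed by whitespace:

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Split on whitespace that is preceded by ., ? or !
# The look-behind checks for the punctuation without consuming it.
sentences = re.split(r"(?<=[.?!])\s+", x)
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# No whitespace after the first mark, so no split happens there.
tight = re.split(r"(?<=[.?!])\s+", "Wait.What? Yes.")
print(tight)
# ['Wait.What?', 'Yes.']
```

As far as I know, newer Python releases (3.7 and later, according to the re documentation) do split on empty matches, so the `\s*` form behaves differently there. The `\s+` form gives the same result either way.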

Hope this helps.

Patch #988761:
http://sourceforge.net/tracker/index.php?func=detail&aid=988761&group_id=5470&atid=305470
 
mark

I don't know if this will save you any processing time, but you can just
replace the split with a findall like this:
l = re.findall("[^.?!]+[?!.]+", x)

This should handle your example, plus it handles multiple occurrences of
the punctuation at the end of the sentence.
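
A short sketch of the findall approach on the sample string, with a second, made-up input (`repeated`) to show the repeated-punctuation case. The strip() call is an addition here, since findall keeps the space that leads each later sentence.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Each match is a run of non-punctuation followed by one or more marks.
raw = re.findall("[^.?!]+[?!.]+", x)
sentences = [s.strip() for s in raw]
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# Several marks in a row stay attached to their sentence.
repeated = re.findall("[^.?!]+[?!.]+", "Really?! Yes...")
print(repeated)
# ['Really?!', ' Yes...']
```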
 
Peter Otten

One caveat: the invariant

"".join(re.findall("[^?!.]+[?!.]+", s)) == s

will no longer hold as you will lose leading punctuation and trailing
non-punctuation:
>>> re.findall("[^?!.]+[?!.]+", "!so what! you're done? yes done")
['so what!', " you're done?"]
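
To make the caveat concrete, here is a small sketch that tests the invariant for two inputs. The helper name `roundtrips` is invented for this example and is not part of the re module.

```python
import re

PATTERN = "[^?!.]+[?!.]+"

def roundtrips(s):
    """Return True if joining the findall() pieces gives back s unchanged."""
    return "".join(re.findall(PATTERN, s)) == s

# Every character belongs to some "text + punctuation" match: invariant holds.
print(roundtrips("The dog ran. The cat eats!"))        # True

# Leading "!" and trailing " yes done" fall outside every match: it fails.
print(roundtrips("!so what! you're done? yes done"))   # False
```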

Peter
 
