Splitting a sequence into pieces with identical elements

candide

Suppose you have a sequence s, say a string, for instance this one:

spppammmmegggssss

We want to split s into the following parts:

['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

i.e. each part is a run of a single repeated character.

What is the pythonic way to answer this question?

A naive solution would be the following:


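# -------------------------------
z = 'spppammmmegggssss'

zz = []
while z:
    # grow k until the first k characters stop being k copies of z[0]
    k = 1
    while z[:k] == k * z[0]:
        k += 1
    zz += [z[:k-1]]   # the run is the first k-1 characters
    z = z[k-1:]       # carry on with the rest of the string

print(zz)
# -------------------------------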


but I guess this code is not very idiomatic :(
 
Chris Rebert

If you're doing an operation on an iterable, always leaf through itertools first:
http://docs.python.org/library/itertools.html

from itertools import groupby

def split_into_runs(seq):
    return ["".join(run) for letter, run in groupby(seq)]
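The "".join(...) call assumes the items are strings. For other kinds of sequences, a minimal sketch (the name split_into_runs_generic is made up here for illustration) could collect each run into a list instead:

```python
from itertools import groupby

def split_into_runs_generic(seq):
    """Split any iterable into lists of consecutive equal items."""
    # groupby yields (key, group-iterator) pairs; materialize each group as a list
    return [list(run) for _, run in groupby(seq)]

print(split_into_runs_generic([1, 1, 2, 3, 3, 3]))
print(split_into_runs_generic('spppammmmegggssss'))
```

For a string input this gives lists of single characters rather than substrings, so the "".join version is still the one to use for the original question.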


If itertools didn't exist:

def split_into_runs(seq):
    if not seq: return []

    iterator = iter(seq)
    letter = next(iterator)
    count = 1
    words = []
    for c in iterator:
        if c == letter:
            count += 1
        else:
            # the run just ended: record it and start a new one
            word = letter * count
            words.append(word)
            letter = c
            count = 1
    words.append(letter*count)  # don't forget the final run
    return words

Cheers,
Chris
 
Tim Chase

While I'm not sure it's idiomatic, the overabuse of regexps in
Python certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')
s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

Additionally, you have all the properties of the match object
(which include the start/end) available too if you need them.
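As a rough sketch of what the match objects give you, here is one way to collect each run together with its position. It assumes the single-group form of the pattern, (.)\1*, so that the backreference points at a closed group; the names run_re and spans are just for illustration.

```python
import re

# Single-group form of the pattern: one character, then repeats of it.
run_re = re.compile(r'(.)\1*')
s = 'spppammmmegggssss'

# Each match object carries the run text plus its position in s.
spans = [(m.group(0), m.start(), m.end()) for m in run_re.finditer(s)]
for text, start, end in spans:
    print(text, start, end)
```

Each (start, end) pair is a half-open slice, so s[start:end] gives back the run text.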

You don't specify what you want to have happen with non-letters
(whitespace, punctuation, etc). The above just treats them like
any other character, finding repeats. If you just want "word"
characters, you can use the 2nd ("\w") version, or adjust
accordingly.
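To make the difference concrete, here is a small sketch comparing the two character classes on a made-up input with spaces and punctuation. It uses the single-group form of each pattern, and the names any_char and word_only are only for illustration.

```python
import re

s = 'aa  bb!!c'
any_char = re.compile(r'(.)\1*')    # every character counts, including spaces
word_only = re.compile(r'(\w)\1*')  # only runs of "word" characters are matched

all_runs = [m.group(0) for m in any_char.finditer(s)]
word_runs = [m.group(0) for m in word_only.finditer(s)]
print(all_runs)   # ['aa', '  ', 'bb', '!!', 'c']
print(word_runs)  # ['aa', 'bb', 'c']
```

Note that "." does not match a newline unless re.DOTALL is passed, so newlines in the input are skipped by the first pattern too.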

-tkc
 
MRAB

Tim said:
As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')

That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')
#r = re.compile(r'(\w)\1*')
s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

 
Tim Chase

MRAB said:
That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination
of them into one malfunctioning regexp. My original trouble with
the 2nd one was that r.findall() (not .finditer) was only
returning the first letter of each because that's what was
matched. Wrapping it in the extra set of parens and using "\2"
returned the actual data in sub-tuples:
>>> s = 'spppammmmegggssss'
>>> import re
>>> r = re.compile(r'(.)\1*')
>>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>> r = re.compile(r'((.)\2*)')
>>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'),
('ggg', 'g'), ('ssss', 's')]
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

By then changing to .finditer() it made them both work the way I
wanted.
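If you would rather stay with .findall(), one possible sketch is to use the two-group pattern and keep only the first item of each tuple (the variable names here are illustrative):

```python
import re

s = 'spppammmmegggssss'
r = re.compile(r'((.)\2*)')

# With two groups, findall returns (whole_run, letter) tuples; keep the first item.
runs = [whole for whole, letter in r.findall(s)]
print(runs)
```

This gives the same list as the .finditer() version, at the cost of building the intermediate tuples.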

Thanks for catching my mistranscription.

-tkc
 
