Splitting a sequence into pieces with identical elements

candide · Aug 10, 2010

Suppose you have a sequence s, say a string, for instance this one:

spppammmmegggssss

We want to split s into the following parts:

['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

i.e. each part is a word made of a single repeated character.

What is the Pythonic way to answer this question?

A naive solution would be the following:

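# -------------------------------
z = 'spppammmmegggssss'

zz = []
while z:
    # grow k until z[:k] is no longer k copies of the first character
    k = 1
    while z[:k] == k*z[0]:
        k += 1
    zz += [z[:k-1]]   # the run is the first k-1 characters
    z = z[k-1:]       # drop the run and continue with the rest

print(zz)
# -------------------------------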

but I guess this code is not very idiomatic

Chris Rebert · Aug 10, 2010

If you're doing an operation on an iterable, always leaf thru itertools first:
http://docs.python.org/library/itertools.html

from itertools import groupby
def split_into_runs(seq):
    return ["".join(run) for letter, run in groupby(seq)]
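
Since the question talks about sequences in general, here is a small sketch of the same groupby idea for non-string input. The name `runs` is a hypothetical helper, not something from the thread; it collects each run into a list instead of joining characters.

```python
from itertools import groupby

def runs(iterable):
    """Return the runs of consecutive equal elements as lists."""
    # groupby yields (key, group-iterator) pairs; consecutive equal
    # elements share a key, so each group is exactly one run.
    return [list(group) for _key, group in groupby(iterable)]

print(runs([1, 1, 2, 3, 3, 3]))   # [[1, 1], [2], [3, 3, 3]]
print(runs("spppammmmegggssss"))  # one list of characters per run
```

For strings, joining each list with "".join gives back the word form asked for in the question.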

If itertools didn't exist:

def split_into_runs(seq):
    if not seq: return []

    iterator = iter(seq)
    letter = next(iterator)
    count = 1
    words = []
    for c in iterator:
        if c == letter:
            count += 1
        else:
            word = letter * count
            words.append(word)
            letter = c
            count = 1
    words.append(letter*count)
    return words

Cheers,
Chris

Tim Chase · Aug 10, 2010

While I'm not sure it's idiomatic, the overabuse of regexps in
Python certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')
s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

Additionally, you have all the properties of the match object
(including the start/end) available too if you need them.
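
To illustrate the point about match objects, here is a rough sketch that records where each run starts and ends. It assumes the single-group pattern `(.)\1*` (my choice for the example; a backreference has to point at a group that is already closed), and `run_re` and `spans` are just illustrative names.

```python
import re

# One character, then zero or more copies of that same character.
run_re = re.compile(r'(.)\1*')

s = 'spppammmmegggssss'
# Each match object knows the text of its run and its position in s.
spans = [(m.group(0), m.start(), m.end()) for m in run_re.finditer(s)]
print(spans[:3])  # [('s', 0, 1), ('ppp', 1, 4), ('a', 4, 5)]
```

The end index is exclusive, so s[start:end] gives back the run itself.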

You don't specify what you want to have happen with non-letters
(whitespace, punctuation, etc). The above just treats them like
any other character, finding repeats. If you just want "word"
characters, you can use the 2nd ("\w") version, or adjust
accordingly.
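
As a quick illustration of that difference, the sketch below runs both character classes over a made-up input containing spaces and punctuation. The patterns are written in the single-group form `(.)\1*` and `(\w)\1*`, which is an assumption of this example, and the variable names are illustrative only.

```python
import re

any_char = re.compile(r'(.)\1*')    # runs of any character (except newline)
word_char = re.compile(r'(\w)\1*')  # runs of "word" characters only

s = 'aa  bb!!c'
any_runs = [m.group(0) for m in any_char.finditer(s)]
word_runs = [m.group(0) for m in word_char.finditer(s)]
print(any_runs)   # ['aa', '  ', 'bb', '!!', 'c']
print(word_runs)  # ['aa', 'bb', 'c']
```

With the "\w" form the spaces and exclamation marks are simply skipped, since no match can start on them.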

-tkc

MRAB · Aug 10, 2010

Tim said:
While I'm not sure it's idiomatic, the overabuse of regexps in Python
certainly seems prevalent enough to be idiomatic ;-)

As such, you can use:

import re
r = re.compile(r'((.)\1*)')
#r = re.compile(r'((\w)\1*)')

That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')
#r = re.compile(r'(\w)\1*')

s = 'spppammmmegggssss'
results = [m.group(0) for m in r.finditer(s)]

Tim Chase · Aug 10, 2010

MRAB said:
That should be \2, not \1.

Alternatively:

r = re.compile(r'(.)\1*')

Doh, I had played with both and mis-transcribed the combination
of them into one malfunctioning regexp. My original trouble with
the 2nd one was that r.findall() (not .finditer) was only
returning the first letter of each because that's what was
matched. Wrapping it in the extra set of parens and using "\2"
returned the actual data in sub-tuples:

>>> s = 'spppammmmegggssss'
>>> import re
>>> r = re.compile(r'(.)\1*')
>>> r.findall(s) # no repeated text, just the initial letter
['s', 'p', 'a', 'm', 'e', 'g', 's']
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']
>>> r = re.compile(r'((.)\2*)')
>>> r.findall(s)
[('s', 's'), ('ppp', 'p'), ('a', 'a'), ('mmmm', 'm'), ('e', 'e'),
('ggg', 'g'), ('ssss', 's')]
>>> [m.group(0) for m in r.finditer(s)]
['s', 'ppp', 'a', 'mmmm', 'e', 'ggg', 'ssss']

By then changing to .finditer() it made them both work the way I
wanted.

Thanks for catching my mistranscription.

-tkc
