# Extracting subsequences composed of the same character

Discussion in 'Python' started by candide, Apr 1, 2011.

1. ### candideGuest

Suppose you have a string, for instance

"pyyythhooonnn ---> ++++"

and you search for the subquences composed of the same character, here
you get :

'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

It's not difficult to write a Python code that solves the problem, for
instance :

def f(text):
ch=text
r=[]
if not text:
return r
else:
x=ch[0]
i=0
for c in ch:
if c!=x:
if i>1:
r+=[x*i]
x=c
i=1
else:
i+=1
return r+(i>1)*[i*x]

print f("pyyythhooonnn ---> ++++")

I should confess that this code is rather cumbersome so I was looking
for an alternative. I imagine that a regular expressions approach could
provide a better method. Does a such code exist ? Note that the string
is not restricted to the ascii charset.

candide, Apr 1, 2011

2. ### MRABGuest

On 01/04/2011 01:43, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
>
> It's not difficult to write a Python code that solves the problem, for
> instance :
>

[snip]
>
> I should confess that this code is rather cumbersome so I was looking
> for an alternative. I imagine that a regular expressions approach could
> provide a better method. Does a such code exist ? Note that the string
> is not restricted to the ascii charset.

>>> import re
>>> re.findall(r"((.)\2+)", s)

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'),
('++++', '+')]
>>> [m[0] for m in re.findall(r"((.)\2+)", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

MRAB, Apr 1, 2011

3. ### Roy SmithGuest

In article <4d952008\$0\$3943\$>,
candide <> wrote:

> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

I got the following. It's O(n) (with the minor exception that the string
addition isn't, but that's trivial to fix, and in practice, the bunches
are short enough it hardly matters).

#!/usr/bin/env python

s = "pyyythhooonnn ---> ++++"
answer = ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

last = None
bunches = []
bunch = ''
for c in s:
if c == last:
bunch += c
else:
if bunch:
bunches.append(bunch)
bunch = c
last = c
bunches.append(bunch)

multiples = [bunch for bunch in bunches if len(bunch) > 1]
print multiples

[eagerly awaiting a PEP for collections.bunch and
collections.frozenbunch]

Roy Smith, Apr 1, 2011
4. ### Tim ChaseGuest

On 03/31/2011 07:43 PM, candide wrote:
> Suppose you have a string, for instance
>
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

>>> import re
>>> s = "pyyythhooonnn ---> ++++"
>>> [m.group(0) for m in re.finditer(r"(.)\1+", s)]

['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
>>> [(m.group(0),m.group(1)) for m in re.finditer(r"(.)\1+", s)]

[('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---',
'-'), ('++++', '+')]

-tkc

Tim Chase, Apr 1, 2011
5. ### Tim ChaseGuest

On 03/31/2011 07:43 PM, candide wrote:
> "pyyythhooonnn ---> ++++"
>
> and you search for the subquences composed of the same character, here
> you get :
>
> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

Or, if you want to do it with itertools instead of the "re" module:

>>> s = "pyyythhooonnn ---> ++++"
>>> from itertools import groupby
>>> [c*length for c, length in ((k, len(list(g))) for k, g in

groupby(s)) if length > 1]
['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

-tkc

Tim Chase, Apr 1, 2011
6. ### Terry ReedyGuest

On 3/31/2011 10:20 PM, Tim Chase wrote:
> On 03/31/2011 07:43 PM, candide wrote:
>> "pyyythhooonnn ---> ++++"
>>
>> and you search for the subquences composed of the same character, here
>> you get :
>>
>> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

>
> Or, if you want to do it with itertools instead of the "re" module:
>
> >>> s = "pyyythhooonnn ---> ++++"
> >>> from itertools import groupby
> >>> [c*length for c, length in ((k, len(list(g))) for k, g in

> groupby(s)) if length > 1]
> ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

Slightly shorter:
[r for r in (''.join(g) for k, g in groupby(s)) if len(r) > 1]

--
Terry Jan Reedy

Terry Reedy, Apr 1, 2011
7. ### candideGuest

Thanks, yours responses gave me the opportunity to understand the
"backreference" feature, it was not clear in spite of my intensive study
of the well known RE howto manual.

candide, Apr 1, 2011