Extracting subsequences composed of the same character

Discussion in 'Python' started by candide, Apr 1, 2011.

  1. candide

    candide Guest

    Suppose you have a string, for instance

    "pyyythhooonnn ---> ++++"

    and you search for the subquences composed of the same character, here
    you get :

    'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

    It's not difficult to write a Python code that solves the problem, for
    instance :

    def f(text):
    ch=text
    r=[]
    if not text:
    return r
    else:
    x=ch[0]
    i=0
    for c in ch:
    if c!=x:
    if i>1:
    r+=[x*i]
    x=c
    i=1
    else:
    i+=1
    return r+(i>1)*[i*x]

    print f("pyyythhooonnn ---> ++++")


    I should confess that this code is rather cumbersome so I was looking
    for an alternative. I imagine that a regular expressions approach could
    provide a better method. Does a such code exist ? Note that the string
    is not restricted to the ascii charset.
     
    candide, Apr 1, 2011
    #1
    1. Advertising

  2. candide

    MRAB Guest

    On 01/04/2011 01:43, candide wrote:
    > Suppose you have a string, for instance
    >
    > "pyyythhooonnn ---> ++++"
    >
    > and you search for the subquences composed of the same character, here
    > you get :
    >
    > 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'
    >
    > It's not difficult to write a Python code that solves the problem, for
    > instance :
    >

    [snip]
    >
    > I should confess that this code is rather cumbersome so I was looking
    > for an alternative. I imagine that a regular expressions approach could
    > provide a better method. Does a such code exist ? Note that the string
    > is not restricted to the ascii charset.


    >>> import re
    >>> re.findall(r"((.)\2+)", s)

    [('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---', '-'),
    ('++++', '+')]
    >>> [m[0] for m in re.findall(r"((.)\2+)", s)]

    ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
     
    MRAB, Apr 1, 2011
    #2
    1. Advertising

  3. candide

    Roy Smith Guest

    In article <4d952008$0$3943$>,
    candide <> wrote:

    > Suppose you have a string, for instance
    >
    > "pyyythhooonnn ---> ++++"
    >
    > and you search for the subquences composed of the same character, here
    > you get :
    >
    > 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


    I got the following. It's O(n) (with the minor exception that the string
    addition isn't, but that's trivial to fix, and in practice, the bunches
    are short enough it hardly matters).

    #!/usr/bin/env python

    s = "pyyythhooonnn ---> ++++"
    answer = ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']

    last = None
    bunches = []
    bunch = ''
    for c in s:
    if c == last:
    bunch += c
    else:
    if bunch:
    bunches.append(bunch)
    bunch = c
    last = c
    bunches.append(bunch)

    multiples = [bunch for bunch in bunches if len(bunch) > 1]
    print multiples
    assert(multiples == answer)


    [eagerly awaiting a PEP for collections.bunch and
    collections.frozenbunch]
     
    Roy Smith, Apr 1, 2011
    #3
  4. candide

    Tim Chase Guest

    On 03/31/2011 07:43 PM, candide wrote:
    > Suppose you have a string, for instance
    >
    > "pyyythhooonnn ---> ++++"
    >
    > and you search for the subquences composed of the same character, here
    > you get :
    >
    > 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


    >>> import re
    >>> s = "pyyythhooonnn ---> ++++"
    >>> [m.group(0) for m in re.finditer(r"(.)\1+", s)]

    ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']
    >>> [(m.group(0),m.group(1)) for m in re.finditer(r"(.)\1+", s)]

    [('yyy', 'y'), ('hh', 'h'), ('ooo', 'o'), ('nnn', 'n'), ('---',
    '-'), ('++++', '+')]

    -tkc
     
    Tim Chase, Apr 1, 2011
    #4
  5. candide

    Tim Chase Guest

    On 03/31/2011 07:43 PM, candide wrote:
    > "pyyythhooonnn ---> ++++"
    >
    > and you search for the subquences composed of the same character, here
    > you get :
    >
    > 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'


    Or, if you want to do it with itertools instead of the "re" module:

    >>> s = "pyyythhooonnn ---> ++++"
    >>> from itertools import groupby
    >>> [c*length for c, length in ((k, len(list(g))) for k, g in

    groupby(s)) if length > 1]
    ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']


    -tkc
     
    Tim Chase, Apr 1, 2011
    #5
  6. candide

    Terry Reedy Guest

    On 3/31/2011 10:20 PM, Tim Chase wrote:
    > On 03/31/2011 07:43 PM, candide wrote:
    >> "pyyythhooonnn ---> ++++"
    >>
    >> and you search for the subquences composed of the same character, here
    >> you get :
    >>
    >> 'yyy', 'hh', 'ooo', 'nnn', '---', '++++'

    >
    > Or, if you want to do it with itertools instead of the "re" module:
    >
    > >>> s = "pyyythhooonnn ---> ++++"
    > >>> from itertools import groupby
    > >>> [c*length for c, length in ((k, len(list(g))) for k, g in

    > groupby(s)) if length > 1]
    > ['yyy', 'hh', 'ooo', 'nnn', '---', '++++']


    Slightly shorter:
    [r for r in (''.join(g) for k, g in groupby(s)) if len(r) > 1]

    --
    Terry Jan Reedy
     
    Terry Reedy, Apr 1, 2011
    #6
  7. candide

    candide Guest

    Thanks, yours responses gave me the opportunity to understand the
    "backreference" feature, it was not clear in spite of my intensive study
    of the well known RE howto manual.
     
    candide, Apr 1, 2011
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Steven Bethard

    grouping subsequences with BIO tags

    Steven Bethard, Apr 21, 2005, in forum: Python
    Replies:
    6
    Views:
    327
    Bengt Richter
    Apr 23, 2005
  2. Matthew Wilson
    Replies:
    4
    Views:
    395
    Tim Chase
    Sep 29, 2006
  3. Sathyaish
    Replies:
    11
    Views:
    1,682
    Daniel Pitts
    Apr 4, 2007
  4. Replies:
    5
    Views:
    383
  5. Eli Bendersky

    subsequences

    Eli Bendersky, Apr 17, 2006, in forum: Ruby
    Replies:
    14
    Views:
    240
    Daniel Schierbeck
    Apr 19, 2006
Loading...

Share This Page