String Splitter Brain Teaser

James Stroud · Mar 27, 2005

Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

I have written a very ugly function to do this (listed below for the curious),
but intuitively I think this should only take a couple of lines for one
skilled in regex and/or listcomp. Any takers?

James

p.s. Here is the ugly function I wrote:

def build_consensus(astr):

consensus = [] # the lol that will be returned
possibilities = [] # one element of consensus
consecutives = 0 # keeps track of how many in a row

for achar in astr:
if (achar == "/"):
consecutives = 0
continue
else:
consecutives += 1
if (consecutives > 1):
consensus.append(possibilities)
possibilities = [achar]
else:
possibilities.append(achar)
if possibilities:
consensus.append(possibilities)
return consensus

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Paul McGuire · Mar 28, 2005

Using a parser may sound like overkill, but why not when it's this
easy? Get the latest pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import oneOf, Group, OneOrMore, Literal

testdata = "ATT/GATA/G"

marker = oneOf( "A T G C")
SLASH = Literal("/").suppress()
genDegenList = OneOrMore( Group( marker + SLASH + marker ) | Group(
marker ) )

print genDegenList.parseString( testdata )

(prints:
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Doug Schwarz · Mar 28, 2005

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

Ron_Adam · Mar 28, 2005

Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

Here's two ways without using regular expression. Both about the
same.

s = list("ATT/GATA/G")
result = []
while len(s)>0:
a = [s.pop(0)]
if s[0] == '/':
b = s.pop(0)
a.append(s.pop(0))
result.append(a)
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

s = "ATT/GATA/G"
result = []
while len(s)>0:
if s[1:2] == '/':
result.append([s[0],s[2]])
s = s[3:]
else:
result.append([s[0]])
s = s[1:]
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

runsun pan · Mar 28, 2005

For the fans of funtional programming:

s='ATT/GATA/G'
[y.split('/') for y in (' '.join([x for x in s]).replace(' / ',

Click to expand...

Click to expand...

'/')).split()]
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

bearophileHUGS · Mar 28, 2005

This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

Hugs,
Bearophile

Bill Mill · Mar 28, 2005

This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]
....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com

Michael Spencer · Mar 28, 2005

Bill said:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com

works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post : [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

Michael

Bill Mill · Mar 28, 2005

Bill said:
Bill said:

for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

for g in xgen('ATT/GATA/G'): print g

Click to expand...

...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com

Click to expand...

works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

True. However, if it is a really long genome, he's not going to want
to have both a string of the genome and a list of the genome in
memory. Instead, I thought it might be useful to iterate through the
genome so that it doesn't have to be stored in memory. Since he didn't
specify what he wants the list for, it's possible that he just needs
to iterate through the genome, grouping degeneracies as he goes.

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :[['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']

Peace
Bill Mill
bill.mill at gmail.com

Michael Spencer · Mar 28, 2005

Bill said:
> [long genomes might justify a generator approach]

That's a good point. I should have said: *If* you are going to put the items
into a list anyway, then there is no point generating the list items individually.

Michael said:
>>[Bill's solution didn't work for multiple-degeneracies]

Click to expand...

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item

Cheers
Michael

Steven Bethard · Mar 28, 2005

Michael said:
def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item

Note that the generator-based solution doesn't generate an error on some
invalid data (e.g. where there is a final '/'), where the previous
list-based solution did:

py> group("AGC/C/TGA/T")
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> group("AGC/C/TGA/T/")
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 6, in group
StopIteration
py> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> list(xgen("AGC/C/TGA/T/"))
[['A'], ['G'], ['C', 'C', 'T'], ['G']]

Not sure which is the desired behavior, but I figured the OP should be
aware of this in case it's possible to have strings in an invalid
format. If this needs to be fixed, you can just wrap the srciter.next()
call in an appropriate try/except.

STeVe

Bill Mill · Mar 28, 2005

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

Good point.

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item

For some reason, keeping the != first feels a lot more logical to me,
but I think that's just a reflection of my particular mental model of
the problem. Also, item is a much clearer name than stack; I chose
stack just to point out how similar the solution to objection 2 was to
yours.

Peace
Bill Mill
bill.mill at gmail.com

Doug Schwarz · Mar 28, 2005

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

Doug Schwarz · Mar 28, 2005

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]

Array of Chars to String	15	Apr 19, 2005
How to pass a reference to the current module	22	Aug 4, 2007
Why no list heritable type?	3	Dec 16, 2004
Big time WTF with generators - bug?	16	Feb 13, 2008
ANN: 'rex', a module for easy creation and use of regular expressions	0	Jun 10, 2004
[SUMMARY] 1-800-THE-QUIZ (#20)	7	Feb 24, 2005
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

String Splitter Brain Teaser

James Stroud

Paul McGuire

Doug Schwarz

Ron_Adam

runsun pan

bearophileHUGS

Bill Mill

Michael Spencer

Bill Mill

Michael Spencer

Steven Bethard

Bill Mill

Doug Schwarz

Doug Schwarz

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads