String Splitter Brain Teaser

J

James Stroud

Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]

I have written a very ugly function to do this (listed below for the curious),
but intuitively I think this should only take a couple of lines for one
skilled in regex and/or listcomp. Any takers?

James

p.s. Here is the ugly function I wrote:

def build_consensus(astr):

consensus = [] # the lol that will be returned
possibilities = [] # one element of consensus
consecutives = 0 # keeps track of how many in a row

for achar in astr:
if (achar == "/"):
consecutives = 0
continue
else:
consecutives += 1
if (consecutives > 1):
consensus.append(possibilities)
possibilities = [achar]
else:
possibilities.append(achar)
if possibilities:
consensus.append(possibilities)
return consensus

--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
P

Paul McGuire

Using a parser may sound like overkill, but why not when it's this
easy? Get the latest pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import oneOf, Group, OneOrMore, Literal

testdata = "ATT/GATA/G"

marker = oneOf( "A T G C")
SLASH = Literal("/").suppress()
genDegenList = OneOrMore( Group( marker + SLASH + marker ) | Group(
marker ) )

print genDegenList.parseString( testdata )

(prints:
[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
 
D

Doug Schwarz

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]


How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]
 
R

Ron_Adam

Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]


Here's two ways without using regular expression. Both about the
same.

s = list("ATT/GATA/G")
result = []
while len(s)>0:
a = [s.pop(0)]
if s[0] == '/':
b = s.pop(0)
a.append(s.pop(0))
result.append(a)
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]


s = "ATT/GATA/G"
result = []
while len(s)>0:
if s[1:2] == '/':
result.append([s[0],s[2]])
s = s[3:]
else:
result.append([s[0]])
s = s[1:]
print result

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]
 
B

bearophileHUGS

This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

Hugs,
Bearophile
 
B

Bill Mill

This is shorter:
map(list,' '.join(s).replace(' / ','').split())

but for very long genomes Michael Spencer's nice version can be faster.

for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]
....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com
 
M

Michael Spencer

Bill said:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]


...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com

works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post : [['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

Michael
 
B

Bill Mill

Bill said:
for very long genomes he might want a generator:

def xgen(s):
l = len(s) - 1
e = enumerate(s)
for i,c in e:
if i < l and s[i+1] == '/':
e.next()
i2, c2 = e.next()
yield [c, c2]
else:
yield [c]

for g in xgen('ATT/GATA/G'): print g

...
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G']

Peace
Bill Mill
bill.mill at gmail.com

works according to the original spec, but there are a couple of issues:

1. the output is specified to be a list, so delaying the creation of the list
isn't a win

True. However, if it is a really long genome, he's not going to want
to have both a string of the genome and a list of the genome in
memory. Instead, I thought it might be useful to iterate through the
genome so that it doesn't have to be stored in memory. Since he didn't
specify what he wants the list for, it's possible that he just needs
to iterate through the genome, grouping degeneracies as he goes.
2. this version fails down in the presence of "double degeneracies" (if that's
what they should be called) - which were not in the OP spec, but which cropped
up in a later post :[['A'], ['G'], ['C', 'C'], ['/'], ['T'], ['G'], ['A', 'T']]

This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack

....
['A']
['T']
['T', 'G']
['A']
['T']
['A', 'G', 'A']
['T']

Peace
Bill Mill
bill.mill at gmail.com
 
M

Michael Spencer

Bill said:
> [long genomes might justify a generator approach]
That's a good point. I should have said: *If* you are going to put the items
into a list anyway, then there is no point generating the list items individually.

Michael said:
>>[Bill's solution didn't work for multiple-degeneracies]
This is simple enough to fix, in basically the same way your function
works. I think it actually makes the function simpler:

def xgen(s):
e = enumerate(s)
stack = [e.next()[1]] #push the first char into the stack
for i,c in e:
if c != '/':
yield stack
stack = [c]
else:
stack.append(e.next()[1])
yield stack
That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item

Cheers
Michael
 
S

Steven Bethard

Michael said:
def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item


Note that the generator-based solution doesn't generate an error on some
invalid data (e.g. where there is a final '/'), where the previous
list-based solution did:

py> group("AGC/C/TGA/T")
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> group("AGC/C/TGA/T/")
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "<interactive input>", line 6, in group
StopIteration
py> list(xgen("AGC/C/TGA/T"))
[['A'], ['G'], ['C', 'C', 'T'], ['G'], ['A', 'T']]
py> list(xgen("AGC/C/TGA/T/"))
[['A'], ['G'], ['C', 'C', 'T'], ['G']]

Not sure which is the desired behavior, but I figured the OP should be
aware of this in case it's possible to have strings in an invalid
format. If this needs to be fixed, you can just wrap the srciter.next()
call in an appropriate try/except.

STeVe
 
B

Bill Mill

That is clearer. At this point, though, you don't need the enumerator any more
(so you can avoid indexing each item):

Good point.
def xgen(s):
srciter = iter(s)
item = [srciter.next()]
for i in srciter:
if i == '/':
item.append(srciter.next())
else:
yield item
item =
yield item


For some reason, keeping the != first feels a lot more logical to me,
but I think that's just a reflection of my particular mental model of
the problem. Also, item is a much clearer name than stack; I chose
stack just to point out how similar the solution to objection 2 was to
yours.

Peace
Bill Mill
bill.mill at gmail.com
 
D

Doug Schwarz

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]


How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]
 
D

Doug Schwarz

James Stroud said:
Hello,

I have strings represented as a combination of an alphabet (AGCT) and a an
operator "/", that signifies degeneracy. I want to split these strings into
lists of lists, where the degeneracies are members of the same list and
non-degenerates are members of single item lists. An example will clarify
this:

"ATT/GATA/G"

gets split to

[['A'], ['T'], ['T', 'G'], ['A'], ['T'], ['A', 'G']]


How about this?

import re
s = "ATT/GATA/G"
result1 = re.findall(r"./.|.", s)
consensus = [c.split("/") for c in result1]
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,050
Latest member
AngelS122

Latest Threads

Top