grouping subsequences with BIO tags

S

Steven Bethard

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

This is the best I've come up with so far:

py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

STeVe
 
M

Michael Spencer

Steven said:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']


I'd have done it the same way as you, but here's 'another' way:
... stack = []
... for label in lst:
... prefix = label[0]
... if prefix == 'B':
... group = [label]
... stack.append(group)
... elif prefix == 'I':
... if group[0][2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (group[0], label))
... group.append(label)
... elif prefix == 'O':
... group = [label]
... return stack
...
>>>
>>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>>>
>>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'O', 'I_X', 'B_X'])
Traceback (most recent call last):
File "<input>", line 1, in ?
File "\\CC1040907-A\MichaelDocuments\PyDev\Junk\BIO.py", line 32, in grp
raise ValueError('%s followed by %s' %
ValueError: O followed by I_X

Michael
 
B

Bengt Richter

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
>>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>> def get_runs(seq):
... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

But note that I allowed multiple I_X, did you want to do that? [['B_X', 'I_X', 'I_X']]

Did you want all these "errors" caught? Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)
This is the best I've come up with so far:

py> class K(object):
... def __init__(self):
... self.last_result = False
... self.last_label = 'O'
... def __call__(self, label):
... if label[:2] in ('O', 'B_'):
... self.last_result = not self.last_result
Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)
... elif self.last_label[2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (self.last_label, label))
... self.last_label = label
... return self.last_result
...
py> def get_runs(lst):
... for _, item in itertools.groupby(lst, K()):
... result = list(item)
... if result != ['O']:
... yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?
Well, there's no end to "another way", but above
is my .02USD ;-)

Regards,
Bengt Richter
 
S

Steven Bethard

Bengt said:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):
L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
def get_runs(seq):
... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
...[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

Yeah, I started this route, and got confused by it. Of course it makes
perfect sense when someone writes you a working version. ;) Thanks!
But note that I allowed multiple I_X, did you want to do that?[['B_X', 'I_X', 'I_X']]

Yeah, that's right. Multiple 'I_...'s should be grouped together.
Did you want all these "errors" caught?Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

Yeah, those are the right errors. I'll have to think about whether I
should be trying to catch the [^BI]_ error. It doesn't appear in my
data now, but that doesn't mean it might not in the future. Thanks!

STeVe
 
M

Michael Spencer

Steven said:
Bengt said:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call


Michael
 
S

Steven Bethard

Michael said:
Steven said:
Bengt said:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

I'm curious why you (Bengt or Steve) think the generator is an advantage
here. As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit
faster.

Aren't they basically just the same solution, with your stack.append
replaced by a yield (and with a little additional error checking)? As
far as I'm concerned, either solution is great and writes the code that
I couldn't. ;)

If you're still interested, in the real problem, the data doesn't exist
as a list of strings; it exists as a list of objects for which there is
a Python wrapper to a C API that retrieves the string. I don't know
exactly what happens in the wrapping, but it's possible that I can
conserve some memory by using the generator function. But I'd have to
profile it to know for sure.

STeVe
 
B

Bengt Richter

Steven said:
Bengt said:
On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
[snip]
With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

I hadn't seen your post[1], which I think is a nice crisp and clever solution ;-)

I just wrote what I thought was a straightforward solution, anticipating that
the imput list might be some huge bioinfo thing, and you might want to iterate
through the sublists one at a time and not want to build the whole list of
lists as represented by your stack.

[1] I don't know why stuff arrives almost instantly sometimes, and sometimes quite
delayed and out of order, but it is a bit embarrassing to post a me-too without
relevant comment, or being able to decide whether to play open source leapfrog.
In this case, I don't see a lily pad on the other side of your code, other than
the memory aspect ;-)
The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call


Michael

Regards,
Bengt Richter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top