grouping subsequences with BIO tags

Steven Bethard · Apr 21, 2005

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

This is the best I've come up with so far:

py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

STeVe

Michael Spencer · Apr 21, 2005

Steven said:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

I'd have done it the same way as you, but here's 'another' way:
... stack = []
... for label in lst:
... prefix = label[0]
... if prefix == 'B':
... group = [label]
... stack.append(group)
... elif prefix == 'I':
... if group[0][2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (group[0], label))
... group.append(label)
... elif prefix == 'O':
... group = [label]
... return stack
...

>>>
>>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']) [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
>>>
>>> grp(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'O', 'I_X', 'B_X'])

Click to expand...

Click to expand...

Traceback (most recent call last):
File "<input>", line 1, in ?
File "\\CC1040907-A\MichaelDocuments\PyDev\Junk\BIO.py", line 32, in grp
raise ValueError('%s followed by %s' %
ValueError: O followed by I_X

Michael

Bengt Richter · Apr 22, 2005

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):

>>> L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
>>> def get_runs(seq):

Click to expand...

Click to expand...

... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
... [['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

But note that I allowed multiple I_X, did you want to do that? [['B_X', 'I_X', 'I_X']]

Did you want all these "errors" caught? Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

This is the best I've come up with so far:

py> class K(object):
... def __init__(self):
... self.last_result = False
... self.last_label = 'O'
... def __call__(self, label):
... if label[:2] in ('O', 'B_'):
... self.last_result = not self.last_result

Thanks for nice reminder that groupby doesn't need distinct keys
for all groups, just different from last ;-)

... elif self.last_label[2:] != label[2:]:
... raise ValueError('%s followed by %s' %
... (self.last_label, label))
... self.last_label = label
... return self.last_result
...
py> def get_runs(lst):
... for _, item in itertools.groupby(lst, K()):
... result = list(item)
... if result != ['O']:
... yield result
...
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y

Can anyone see another way to do this?

Well, there's no end to "another way", but above
is my .02USD ;-)

Regards,
Bengt Richter

Steven Bethard · Apr 22, 2005

Bengt said:
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.

Click to expand...

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,
doing my own groupby as I went.
E.g., see if this does what you want
(slightly different error checking):

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
def get_runs(seq):

Click to expand...

Click to expand...

... subseq =[]
... curr = '<NO PRIOR ELEMENTS>'
... for latest in seq:
... curr, last = latest, curr
... if curr.startswith('B_'):
... if subseq: yield subseq
... subseq = [curr]
... elif curr.startswith('I_'):
... if (last[:2] not in ('B_', 'I_') or
... last[2:] != curr[2:]
... ): raise ValueError, '%r followed by %r'%(last, curr)
... subseq.append(curr)
... elif curr!='O':
... raise ValueError, 'Unrecognized element: %r' % curr
... if subseq: yield subseq
...[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]

Yeah, I started this route, and got confused by it. Of course it makes
perfect sense when someone writes you a working version.

Thanks!

But note that I allowed multiple I_X, did you want to do that?[['B_X', 'I_X', 'I_X']]

Yeah, that's right. Multiple 'I_...'s should be grouped together.

Did you want all these "errors" caught?Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 12, in get_runs
ValueError: 'B_X' followed by 'I_Y'

Does that do what you want? (BTW, I added an error check against ['B_X', '*_X'] and such)

Yeah, those are the right errors. I'll have to think about whether I
should be trying to catch the [^BI]_ error. It doesn't appear in my
data now, but that doesn't mean it might not in the future. Thanks!

STeVe

Michael Spencer · Apr 23, 2005

Steven said:
Bengt said:

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

Click to expand...

Click to expand...

[snip]

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

Click to expand...

I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call

Michael

Steven Bethard · Apr 23, 2005

Michael said:
Steven said:

Bengt said:

On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

Click to expand...

[snip]

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

Click to expand...

Click to expand...

I'm curious why you (Bengt or Steve) think the generator is an advantage
here. As Steve stated, the data already exists in lists of strings.

The direct list-building solution I posted is simpler, and quite a bit
faster.

Aren't they basically just the same solution, with your stack.append
replaced by a yield (and with a little additional error checking)? As
far as I'm concerned, either solution is great and writes the code that
I couldn't.

If you're still interested, in the real problem, the data doesn't exist
as a list of strings; it exists as a list of objects for which there is
a Python wrapper to a C API that retrieves the string. I don't know
exactly what happens in the wrapping, but it's possible that I can
conserve some memory by using the generator function. But I'd have to
profile it to know for sure.

STeVe

Bengt Richter · Apr 23, 2005

Steven said:
Steven said:

Bengt said:

On Thu, 21 Apr 2005 15:37:03 -0600, Steven Bethard

I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

Click to expand...

[snip]

With error checks on predecessor relationship,
I think I'd do the whole thing in a generator,

Click to expand...

Click to expand...

I'm curious why you (Bengt or Steve) think the generator is an advantage here.
As Steve stated, the data already exists in lists of strings.

I hadn't seen your post[1], which I think is a nice crisp and clever solution ;-)

I just wrote what I thought was a straightforward solution, anticipating that
the imput list might be some huge bioinfo thing, and you might want to iterate
through the sublists one at a time and not want to build the whole list of
lists as represented by your stack.

[1] I don't know why stuff arrives almost instantly sometimes, and sometimes quite
delayed and out of order, but it is a bit embarrassing to post a me-too without
relevant comment, or being able to decide whether to play open source leapfrog.
In this case, I don't see a lily pad on the other side of your code, other than
the memory aspect ;-)

The direct list-building solution I posted is simpler, and quite a bit faster.

L = ['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']

def timethem(lst, funcs = (get_runsSB, get_runsMS, get_runsBR)):
for func in funcs:
print shell.timefunc(func, lst)
get_runsSB(...) 7877 iterations, 63.48usec per call
get_runsMS(...) 31081 iterations, 16.09usec per call
get_runsBR(...) 16114 iterations, 31.03usec per call

Michael

Regards,
Bengt Richter

Grouping items by a key?	1	Mar 22, 2013
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
Need help with this Python code.	2	Jun 13, 2023
Trouble with code	2	Mar 5, 2023
Python hangs: Problem with wxPython, threading, pySerial, or events?	0	Oct 15, 2011
Exception running GNU module "op25_grc.py" : AttributeError:'gr_hier_block2_sptr' object has no	3	Feb 15, 2013
Trouble with UnicodeEncodeError and email	0	Jan 8, 2014
AttributeError: 'gr_hier_block2_sptr' object has no attribute'set_callback'	0	Feb 14, 2013

grouping subsequences with BIO tags

Steven Bethard

Michael Spencer

Bengt Richter

Steven Bethard

Michael Spencer

Steven Bethard

Bengt Richter

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads