S
Steven Bethard
I have a list of strings that looks something like:
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
This is the best I've come up with so far:
py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y
Can anyone see another way to do this?
STeVe
['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']
I need to group the strings into runs (lists) using the following rules
based on the string prefix:
'O' is discarded
'B_...' starts a new run
'I_...' continues a run started by a 'B_...'
So, the example above should look like:
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
At the same time that I'm extracting the runs, it's important that I
check for errors as well. 'I_...' must always follow 'B_...', so errors
look like:
['O', 'I_...']
['B_xxx', 'I_yyy']
where 'I_...' either follows an 'O' or a 'B_...' where the suffix of the
'B_...' is different from the suffix of the 'I_...'.
This is the best I've come up with so far:
py> class K(object):
.... def __init__(self):
.... self.last_result = False
.... self.last_label = 'O'
.... def __call__(self, label):
.... if label[:2] in ('O', 'B_'):
.... self.last_result = not self.last_result
.... elif self.last_label[2:] != label[2:]:
.... raise ValueError('%s followed by %s' %
.... (self.last_label, label))
.... self.last_label = label
.... return self.last_result
....
py> def get_runs(lst):
.... for _, item in itertools.groupby(lst, K()):
.... result = list(item)
.... if result != ['O']:
.... yield result
....
py> list(get_runs(['O', 'B_X', 'B_Y', 'I_Y', 'O', 'B_X', 'I_X', 'B_X']))
[['B_X'], ['B_Y', 'I_Y'], ['B_X', 'I_X'], ['B_X']]
py> list(get_runs(['O', 'I_Y']))
Traceback (most recent call last):
...
ValueError: O followed by I_Y
py> list(get_runs(['B_X', 'I_Y']))
Traceback (most recent call last):
...
ValueError: B_X followed by I_Y
Can anyone see another way to do this?
STeVe