Steven Bethard
So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:
py> data = [[('a', 0),
...          ('b', 1),
...          ('c', 2)],
...
...         [('d', 2),
...          ('e', 0)],
...
...         [('f', 0),
...          ('g', 2),
...          ('h', 1),
...          ('i', 0),
...          ('j', 0)]]
Now, I'd like to sample down the number of items in each sublist in the
following manner. I need to count the occurrences of each 'label' (the
second item in each tuple) in all the items of all the sublists, and
randomly remove some items until the number of occurrences of each
'label' is equal. So, given the data above, one possible resampling
would be:
[[('b', 1),
  ('c', 2)],
 [('e', 0)],
 [('g', 2),
  ('h', 1),
  ('i', 0)]]
Note that there are now only 2 examples of each label. I have code that
does this, but it's a little complicated:
py> import random
py> def resample(data):
...     # determine which indices are associated with each label
...     label_indices = {}
...     for i, group in enumerate(data):
...         for j, (item, label) in enumerate(group):
...             label_indices.setdefault(label, []).append((i, j))
...     # sample each set of indices down to the size of the rarest label
...     min_count = min(len(indices)
...                     for indices in label_indices.values())
...     for label, indices in label_indices.items():
...         label_indices[label] = random.sample(indices, min_count)
...     # return the resampled data, keeping the original order
...     return [[(item, label)
...              for j, (item, label) in enumerate(group)
...              if (i, j) in label_indices[label]]
...             for i, group in enumerate(data)]
...
py>
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]
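For comparison, here is a minimal sketch of the same idea written against a newer Python, assuming `collections.defaultdict` and `collections.Counter` are available. The names `resample_balanced`, `label_counts` and the `rng` parameter are made up for illustration and are not part of the code above. It collects every kept position into one set, so the final filter is a single membership test, and `label_counts` gives a quick way to check that the result really is balanced.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(data, rng=random):
    """Hypothetical variant of resample(): keep equally many items per label."""
    # Group the (sublist index, item index) positions by label.
    positions = defaultdict(list)
    for i, group in enumerate(data):
        for j, (_item, label) in enumerate(group):
            positions[label].append((i, j))
    if not positions:
        # No items at all: return one empty sublist per input sublist.
        return [[] for _ in data]
    # Every label is sampled down to the size of the rarest label.
    min_count = min(len(p) for p in positions.values())
    keep = set()
    for p in positions.values():
        keep.update(rng.sample(p, min_count))
    # Rebuild the sublists, preserving the original order of the items.
    return [[pair for j, pair in enumerate(group) if (i, j) in keep]
            for i, group in enumerate(data)]


def label_counts(data):
    """Count how often each label occurs across all sublists."""
    return Counter(label for group in data for _item, label in group)


data = [[('a', 0), ('b', 1), ('c', 2)],
        [('d', 2), ('e', 0)],
        [('f', 0), ('g', 2), ('h', 1), ('i', 0), ('j', 0)]]
```

With the data above, `label_counts(data)` reports 5 items for label 0, 2 for label 1 and 3 for label 2, while `label_counts(resample_balanced(data))` reports 2 for each label. Passing a seeded `random.Random` instance as `rng` makes a run repeatable.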
Can anyone see a simpler way of doing this?
Steve