sampling items from a nested list

Steven Bethard · Feb 16, 2005

So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:

py> data = [[('a', 0),
.... ('b', 1),
.... ('c', 2)],
....
.... [('d', 2),
.... ('e', 0)],
....
.... [('f', 0),
.... ('g', 2),
.... ('h', 1),
.... ('i', 0),
.... ('j', 0)]]

Now, I'd like to sample down the number of items in each sublist in the
following manner. I need to count the occurrences of each 'label' (the
second item in each tuple) in all the items of all the sublists, and
randomly remove some items until the number of occurrences of each
'label' is equal. So, given the data above, one possible resampling
would be:

[[('b', 1),
('c', 2)],

[('e', 0)],

[('g', 2),
('h', 1),
('i', 0)]]

Note that there are now only 2 examples of each label. I have code that
does this, but it's a little complicated:

py> import random
py> def resample(data):
.... # determine which indices are associated with each label
.... label_indices = {}
.... for i, group in enumerate(data):
.... for j, (item, label) in enumerate(group):
.... label_indices.setdefault(label, []).append((i, j))
.... # sample each set of indices down
.... min_count = min(len(indices)
.... for indices in label_indices.itervalues())
.... for label, indices in label_indices.iteritems():
.... label_indices[label] = random.sample(indices, min_count)
.... # return the resampled data
.... return [[(item, label)
.... for j, (item, label) in enumerate(group)
.... if (i, j) in label_indices[label]]
.... for i, group in enumerate(data)]
....
py>
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2), ('e', 0)], [('h', 1), ('i', 0)]]
py> resample(data)
[[('b', 1), ('c', 2)], [('d', 2)], [('f', 0), ('h', 1), ('j', 0)]]

Can anyone see a simpler way of doing this?

Steve

Michael Spencer · Feb 17, 2005

Steven said:
So, I have a list of lists, where the items in each sublist are of
basically the same form. It looks something like:
....

Can anyone see a simpler way of doing this?

Steve

You just make these up to keep us amused, don't you? ;-)

If you don't need to preserve the ordering, would the following work?:

>>> data = [[('a', 0),

Click to expand...

Click to expand...

... ('b', 1),
... ('c', 2)],
...
... [('d', 2),
... ('e', 0)],
...
... [('f', 0),
... ('g', 2),
... ('h', 1),
... ('i', 0),
... ('j', 0)]]
... ... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not random.shuffle(group)]
...

>>> resample2(data) [[('a', 0), ('c', 2), ('b', 1)], [('h', 1), ('g', 2), ('i', 0)], []]
>>> resample2(data) [[('h', 1), ('f', 0), ('j', 0), ('g', 2)], [('b', 1), ('c', 2)], []]
>>> resample2(data) [[('e', 0), ('d', 2)], [('i', 0), ('h', 1), ('g', 2)], [('b', 1)]]
>>>

Click to expand...

Click to expand...

Michael

Steven Bethard · Feb 17, 2005

Michael said:
You just make these up to keep us amused, don't you? ;-)

Heh heh. I wish. It's actually about resampling data read in the
Yamcha data format:

http://chasen.org/~taku/software/yamcha/

So each sublist is a "sentence" and each tuple is the feature vector for
a "word". The point is to even out the number of positive and negative
examples because support vector machines typically work better with
balanced data sets.

If you don't need to preserve the ordering, would the following work?:
[snip]... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not
random.shuffle(group)]

It would be preferable to preserve ordering, but it's not absolutely
crucial. Thanks for the suggestion!

STeVe

Michael Spencer · Feb 17, 2005

Michael said:
... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not

....which failed to calculate the minimum count of labels, try this instead
(while I was at it, I removed the insance LC)
... bag = {}
... sample = []
... labels = [label for group in data for item, label in group]
... min_count = min(labels.count(label) for label in set(labels))
... random.shuffle(data)
... for subgroup in data:
... random.shuffle(subgroup)
... subgroupsample = []
... for item, label in subgroup:
... bag.setdefault(label,[]).append(item)
... if len(bag[label]) <= min_count:
... subgroupsample.append((item,label))
... sample.append(subgroupsample)
... return sample
...
Cheers

Michael

Michael Spencer · Feb 17, 2005

Steven said:
Michael said:

You just make these up to keep us amused, don't you? ;-)

Click to expand...

Heh heh. I wish. It's actually about resampling data read in the
Yamcha data format:

http://chasen.org/~taku/software/yamcha/

So each sublist is a "sentence" and each tuple is the feature vector for
a "word". The point is to even out the number of positive and negative
examples because support vector machines typically work better with
balanced data sets.

If you don't need to preserve the ordering, would the following work?:
[snip]

def resample2(data):

Click to expand...

... bag = {}
... random.shuffle(data)
... return [[(item, label)
... for item, label in group
... if bag.setdefault(label,[]).append(item)
... or len(bag[label]) < 3]
... for group in data if not
random.shuffle(group)]

Click to expand...

It would be preferable to preserve ordering, but it's not absolutely
crucial. Thanks for the suggestion!

STeVe

Maybe combine this with a DSU pattern? Not sure whether the result would be
better than what you started with

Michael

Felix Wiemann · Feb 17, 2005

Steven said:
py> data = [[('a', 0),
... ('b', 1),
... ('c', 2)],
...
... [('d', 2),
... ('e', 0)],
...
... [('f', 0),
... ('g', 2),
... ('h', 1),
... ('i', 0),
... ('j', 0)]]

I need to count the occurrences of each 'label' (the second item in
each tuple) in all the items of all the sublists, and randomly remove
some items until the number of occurrences of each 'label' is equal.

If the tuples are "heavier" than this, you can avoid comparing them
using the following algorithm (which probably still leaves some room for
optimization, e.g. simpler return_list building [or returning a
generator instead of a list], or directly building the sample set
instead of converting a random.sample to a set):

def resample(data):
counts = {}
for i in data:
for j in i:
counts[j[1]] = counts.setdefault(j[1], 0) + 1

min_count = min(counts.itervalues())

# Same keys, so we can reuse the counts dictionary.
indices = counts
for label, count in counts.iteritems():
indices[label] = set(random.sample(xrange(count), min_count))

# Same thing with a generator expression, building a new dict (dunno
# what's faster).
#indices = dict(((label, set(random.sample(xrange(count), min_count)))
# for label, count in counts.iteritems()))

# "done" maps labels to the number of tuples (with that label) which
# have been added to return_list.
done = {}
return_list = []
for i in data:
return_list.append([])
for j in i:
if done.setdefault(j[1], 0) in indices[j[1]]:
return_list[-1].append(j)
done[j[1]] += 1
return return_list

I dont get this. Please help me!!	2	Jan 24, 2023
Removing items from a list	5	Feb 10, 2012
Survey details won't go through using php, ajax, Mysql	0	Oct 26, 2023
A Friday Python Programming Pearl: random sampling	2	May 28, 2010
Range / empty list issues??	1	Dec 11, 2023
Translater + module + tkinter	1	Feb 16, 2023
Nested Loop Insert Page Break	1	Nov 5, 2021
All CRUD operations work except POST. Why?	2	May 28, 2023

sampling items from a nested list

Steven Bethard

Michael Spencer

Steven Bethard

Michael Spencer

Michael Spencer

Felix Wiemann

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads