Trying to use sets for random selection, but the pop() method returns items in order

Mario Garcia

I'm trying to use sets for doing statistics on a data set.
I want to select 70% of the records from a list at random. I thought sets
were a good idea, so I tested this way:

c = set(range(1000))
for d in range(1000):
    print c.pop()

I was hoping to see a printout of randomly selected numbers from 0 to
999, but I got an ordered count from 0 to 999.
I also tried using a dictionary, with keys from 1 to 10, and also got
the keys in order.

I'm using:
Python 2.5.2 |EPD 2.5.2001| (r252:60911, Aug 4 2008, 13:45:20)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin

Examples in the documentation seem to work, but I can't make it work.
Can someone give me a hint on what's going on?
 
Carl Banks

The keys in a dict or set are not in random order, but (more or less)
they are in hash key order modulo the size of the hash. This neglects
the effect of hash collisions. The hash code of an integer happens to
be the integer itself, so oftentimes a dict or set storing a sequence of
integers will end up with keys in order, although it's not guaranteed
to be so.
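
A quick way to see this for yourself is sketched below. It is written for Python 3 rather than the 2.5 interpreter used in the question, and the behaviour it shows is a CPython implementation detail, so treat it as an illustration and not as something to rely on.

```python
# In CPython, the hash of a small non-negative int is the int itself.
hashes_match = all(hash(i) == i for i in range(1000))
print(hashes_match)  # True on CPython

# Because of that, popping from a set of small ints tends to look ordered.
# "Tends to" is the key phrase: the language does not promise any order.
c = set(range(10))
popped = [c.pop() for _ in range(10)]
print(popped)
```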

Point is, it's unsafe to rely on *any* ordering behavior in a dict or
set, even relatively random order.

Instead, call random.shuffle() on the list, and iterate through that
to get the elements in random order.
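
A minimal sketch of that suggestion, in Python 3 syntax. The name records is only a stand-in for the real data, and the copy is an extra step assumed here so that the original order survives; random.shuffle() itself works in place.

```python
import random

records = list(range(10))   # stand-in for the real data set
shuffled = records[:]       # copy first, because shuffle() reorders in place
random.shuffle(shuffled)

# Every element shows up exactly once, in random order.
for record in shuffled:
    print(record)
```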


Carl Banks
 
Mensanator

Sets don't help. Try this:

>>> import random
>>> c = range(100)
>>> c
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
>>> random.shuffle(c)
>>> c
[78, 62, 38, 54, 12, 48, 24, 20, 8, 14, 1, 69, 49, 92, 41, 64, 17, 35,
88, 40, 73, 45, 5, 84, 96, 90, 98, 57, 51, 75, 99, 13, 29, 4, 97, 77,
74, 56, 91, 95, 59, 79, 89, 19, 9, 42, 31, 85, 86, 23, 27, 50, 6, 21,
15, 80, 3, 30, 87, 82, 16, 63, 2, 55, 37, 33, 10, 61, 93, 72, 60, 67,
44, 65, 11, 70, 52, 58, 47, 18, 36, 66, 94, 28, 22, 68, 32, 76, 53,
25, 83, 34, 26, 71, 39, 43, 7, 46, 0, 81]
>>> for d in c:
...     print c.pop(),
...
81 0 46 7 43 39 71 26 34 83 25 53 76 32 68 22 28 94 66 36 18 47 58 52
70 11 65 44 67 60 72 93 61 10 33 37 55 2 63 16 82 87 30 3 80 15 21 6
50 27

Notice that the numbers are coming out in the exact reverse order
of the list, because you used .pop() without an index number.
But it's ok in _THIS_ example, because the list was randomly
shuffled before popping.
 
Mensanator

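Oops! You don't want to iterate through c while popping it: the loop
stops halfway, which is why only 50 of the 100 numbers were printed.
Your original for statement, looping over a range, is the one to use.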
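
Here is a small sketch of that pitfall, in Python 3 syntax. The variable names are made up for the illustration; the point is only that a for loop over a list walks forward while pop() shortens the list from the end, so the two meet in the middle.

```python
import random

c = list(range(100))
random.shuffle(c)

# Buggy: iterating over the same list that is being popped.
buggy = c[:]
seen_buggy = []
for d in buggy:
    seen_buggy.append(buggy.pop())

# Fixed: loop a fixed number of times, as in the original question.
fixed = c[:]
seen_fixed = [fixed.pop() for _ in range(len(fixed))]

print(len(seen_buggy), len(seen_fixed))  # 50 100
```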
 
Paul Rubin

Mario Garcia said:
I'm trying to use sets for doing statistics on a data set.
I want to select 70% of the records from a list at random. I thought sets
were a good idea so I

No, that's not a good idea. When the set/dict documentation says you
get the keys in an undetermined order, it doesn't mean the ordering is
random. It means there is a fixed ordering controlled by the
implementation and not by you.

If you want a random sample, use random.sample(). See the docs for
the random module.
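
For the 70% case in the question, a sketch could look like the following (Python 3 syntax). The records list and the integer arithmetic for the sample size are assumptions made for the example.

```python
import random

records = ["rec%d" % i for i in range(10)]   # stand-in data
k = len(records) * 70 // 100                 # 70% of the records, rounded down

# sample() picks k distinct positions and leaves the input list untouched.
chosen = random.sample(records, k)
print(chosen)
```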
 
Paul Rubin

Carl Banks said:
Instead, call random.shuffle() on the list, and iterate through that
to get the elements in random order.

It's better to use random.sample() than random.shuffle().
 
Carl Banks

Paul Rubin said:
It's better to use random.sample() than random.shuffle().

If you're iterating through the whole list and don't need to preserve
the original order (as was the case here) random.shuffle() is better.


Aren't-absolutist-opinions-cool?-ly yr's,

Carl Banks
 
Paul Rubin

Carl Banks said:
If you're iterating through the whole list and don't need to preserve
the original order (as was the case here) random.shuffle() is better.

1. random.sample avoids iterating through the whole list when it can.

2. Say you want to choose 10 random numbers between 1 and 1000000.

random.sample(xrange(1000000), 10)

works nicely. Doing the same with shuffle is painful.
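
The Python 3 spelling of the same idea is sketched below, where range takes the place of xrange. Note that range(1000000) covers 0 through 999999; the million-element list is never built.

```python
import random

# range() is a lazy sequence, so sample() only touches the 10 picked positions.
picks = random.sample(range(1000000), 10)
print(picks)
```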
 
Carl Banks

Paul Rubin said:
1. random.sample avoids iterating through the whole list when it can.

2. Say you want to choose 10 random numbers between 1 and 1000000.

   random.sample(xrange(1000000), 10)

works nicely.  Doing the same with shuffle is painful.

random.shuffle() is still better when you're iterating through the
whole list as the OP was doing.

Carl Banks
 
Paul Rubin

Carl Banks said:
random.shuffle() is still better when you're iterating through the
whole list as the OP was doing.

The OP wrote:

I want to select 70% of the records from a list at random. I thought sets
were a good idea so I tested this way: ...

That sounds like 70% of the list, not the whole list. Note that
in addition to using time proportional to the size of the entire
list rather than the size of the sample, shuffle() also messes up
the order of the list.

Why would shuffle ever be better? It is designed for a different
purpose. Using a function called "sample" when you want to sample
should be a no-brainer, and if using shuffle instead is ever
preferable, that should be treated as a misfeature in the "sample"
implementation, and fixed.
 
Carl Banks

Paul Rubin said:
The OP wrote:

    I want to select 70% of the records from a list at random. I thought sets
    were a good idea so I tested this way: ...

I was going by his example which went through all the items in the
list.

Paul Rubin said:
That sounds like 70% of the list, not the whole list.  Note that
in addition to using time proportional to the size of the entire
list rather than the size of the sample, shuffle() also messes up
the order of the list.

Why would shuffle ever be better?

Because we were talking about different things.


Carl Banks
 
Mario Garcia

Thank you all for your input. Yes, random is what I need!
I checked the docs and followed your comments; I will use
random.sample, random.shuffle, random.choice, etc.

For this particular problem I think random.shuffle() is what I need.

I was not very clear or complete in my explanation, and I didn't expect
such a discussion of my needs. Now I'm embarrassed :)

The list is a data set for training a machine learning algorithm. I want
to use 70% of the records, chosen at random, for training; the remaining
30% are used for validation. This is repeated a few times: choose 70% at
random again for training and use the rest for validation.

With random.shuffle() I just iterate over the first 70% of the records
for training and continue with the remaining 30% for validation. We don't
care whether this last 30% is in random order, so shuffle is doing some
extra work, as was pointed out.
Conceptually, something like this:

import random

population = range(100)
# pick 70 distinct records for training; everything else is validation
training = random.sample(population, 70)
validation = set(population).difference(set(training))

But I think this is more costly.
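
The shuffle-and-slice version described above might be sketched like this in Python 3. The helper name split_records and its train_percent parameter are made up for the illustration; they are not from any library.

```python
import random

def split_records(records, train_percent=70):
    """Hypothetical helper: shuffle a copy, then cut it at train_percent."""
    shuffled = list(records)        # copy, so the caller's list keeps its order
    random.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

population = list(range(100))

# Repeat the split a few times, each with a fresh random 70/30 partition.
for trial in range(3):
    training, validation = split_records(population)
    print(trial, len(training), len(validation))
```

Each call costs one pass over the data, and the two slices never overlap, so no set difference is needed afterwards.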

Thanks Again
Mario
 
Mario Garcia

This could be better:
[0, 1, 2, 3, 4, 5, 6, 8, 9]

That was my idea with the earlier pop(): remove a certain number of
elements from the population at random.
In the docs pop is defined as:
Remove and return an arbitrary element from the set.
My mistake: arbitrary is not the same as random :(
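
One way to get a pop() that really is random, sketched in Python 3 with a list instead of a set. The helper name pop_random is hypothetical and this is only one possible approach, not necessarily the code that produced the output above.

```python
import random

def pop_random(items):
    """Hypothetical helper: remove and return one element chosen at random."""
    index = random.randrange(len(items))
    return items.pop(index)

population = list(range(10))
removed = pop_random(population)
print(removed, population)
```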

Mario
 
