Extract sample (w/o replacement)

gamo · Dec 6, 2013

Is there a better way to do this?

my @matrix = 1..10_000_000;

sub extract {
    my $ind = int rand ($#matrix+1);
    ($matrix[$ind], $matrix[-1]) = ($matrix[-1], $matrix[$ind]);
    return pop @matrix;
}

The goal is to extract random elements without using shuffle, always
removing them from the original list.

TIA

Peter Makholm · Dec 6, 2013

gamo said:
my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it is faster. Benchmarking is left to the reader if it
is important.

//Makholm

gamo · Dec 6, 2013

On 06/12/13 13:48, Peter Makholm wrote:

gamo said:

my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it is faster. Benchmarking is left to the reader if it
is important.

//Makholm

Something is wrong, because splice is a lot slower.

Thanks

Peter J. Holzer · Dec 6, 2013

gamo said:
On 06/12/13 13:48, Peter Makholm wrote:

gamo said:

my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it is faster. Benchmarking is left to the reader if it
is important.

//Makholm

Something is wrong, because splice is a lot slower.

That was to be expected. Think for a moment what splice does.

hp

gamo · Dec 6, 2013

On 06/12/13 18:53, Peter J. Holzer wrote:

That was to be expected. Think for a moment what splice does.

hp

It shrinks the list and re-indexes it, or does something
worse. Anyway, I thought that there could be
a more efficient method than mine.

Thanks

$Bill · Dec 6, 2013

Is there a better way to do this?

my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

The goal is to extract random elements without using shuffle, always
removing them from the original list.

What's the swapping for?
Wouldn't this work?

sub extract {
return $matrix[int rand ($#matrix+1)];
}

gamo · Dec 7, 2013

On 07/12/13 00:46, $Bill wrote:

The goal is to extract random elements without using shuffle, always
removing them from the original list.

What's the swapping for?
Wouldn't this work?

sub extract {
return $matrix[int rand ($#matrix+1)];
}

No, this doesn't work because you could extract the same list element
many times, and this is not permitted.

Thanks

$Bill · Dec 7, 2013

On 07/12/13 00:46, $Bill wrote:

The goal is to extract random elements without using shuffle, always
removing them from the original list.

What's the swapping for?
Wouldn't this work?

sub extract {
return $matrix[int rand ($#matrix+1)];
}

No, this doesn't work because you could extract the same list element
many times, and this is not permitted.

A little unmentioned caveat.

gamo · Dec 7, 2013

On 07/12/13 04:10, Ben Morrow wrote:

gamo said:

Is there a better way to do this?

my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);

Peter has already shown that this is better written as

int rand @matrix

($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;

Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and 14% faster.

Thanks a lot

gamo · Dec 7, 2013

On 07/12/13 09:00, gamo wrote:

On 07/12/13 04:10, Ben Morrow wrote:

Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and 14% faster.

But what happens if $ind == $#matrix?

The line $matrix[$ind] = pop @matrix could cause that element to be repeated.

I'm afraid that's not a valid alternative.

Best regards

Rainer Weikusat · Dec 7, 2013

gamo said:
On 07/12/13 09:00, gamo wrote:

On 07/12/13 04:10, Ben Morrow wrote:

Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and 14% faster.

But what happens if $ind == $#matrix?

The line $matrix[$ind] = pop @matrix could cause that element to be
repeated.

That's not a problem for $ind == $#matrix but for @matrix == 1: in this
case, $matrix[$ind] = pop(@matrix) will reinsert the single remaining
element into the array whenever it is called; in other words, the size will stay 1.

It is possible to express this with slice notation as

@matrix[$ndx, -1] = @matrix[-1, $ndx]

At least on the computer where I tested this, this also seems to be
slightly faster, with the difference being more marked as the input array
gets longer.

Rainer Weikusat · Dec 7, 2013

[...]

Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Considering that this is buggy and at least three people (me included, I
found the issue by testing) didn't realize this, the answer would be:
Decidedly no.
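
For reference, a minimal sketch of a variant that keeps Ben's idea of filling the hole with the last element, but pops first and treats "the chosen index was the last one" as its own case. It assumes, as perl 5 does in practice, that in $matrix[$ind] = pop @matrix the right-hand side runs before the element on the left is looked up, so assigning to the old last index makes the array grow again. The 1000-element array and the behaviour on an empty array (returning nothing) are choices made for this illustration only.

```perl
use strict;
use warnings;

my @matrix = 1 .. 1000;    # small array, for illustration only

# Remove and return one random element of @matrix.
# Returns nothing (undef in scalar context) once the array is empty.
sub extract {
    return unless @matrix;
    my $ind  = int rand @matrix;        # index chosen before the array shrinks
    my $last = pop @matrix;             # array is now one element shorter
    return $last if $ind == @matrix;    # the last element itself was chosen
    my $rv = $matrix[$ind];
    $matrix[$ind] = $last;              # fill the hole with the former last element
    return $rv;
}
```

Calling extract() repeatedly should hand out every element once and leave @matrix empty, which is the property the unguarded version loses.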

gamo · Dec 7, 2013

On 07/12/13 18:20, Rainer Weikusat wrote:

It is possible to express this with slice notation as

@matrix[$ndx, -1] = @matrix[-1, $ndx]

At least on the computer where I tested this, this also seems to be
slightly faster, with the difference being more marked as the input array
gets longer.

I get the opposite result: a swap of values is slightly faster.

Thanks
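
Since the two measurements above disagree, here is a hedged sketch of how the variants discussed so far could be compared with the core Benchmark module. The sub names, the array size and the number of draws are made up for this example, and the array is passed by reference rather than being a file-level @matrix; results will depend on the perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Each variant removes and returns one random element of @$list.
sub extract_list_swap {
    my ($list) = @_;
    my $i = int rand @$list;
    ($list->[$i], $list->[-1]) = ($list->[-1], $list->[$i]);
    return pop @$list;
}

sub extract_slice_swap {
    my ($list) = @_;
    my $i = int rand @$list;
    @$list[$i, -1] = @$list[-1, $i];
    return pop @$list;
}

sub extract_splice {
    my ($list) = @_;
    return splice @$list, int(rand @$list), 1;
}

my $size  = 50_000;    # illustrative sizes, not taken from the thread
my $draws = 10_000;

# Building the array is part of every timed run, equally for all variants.
cmpthese(10, {
    list_swap  => sub { my @m = 1 .. $size; extract_list_swap(\@m)  for 1 .. $draws },
    slice_swap => sub { my @m = 1 .. $size; extract_slice_swap(\@m) for 1 .. $draws },
    splice     => sub { my @m = 1 .. $size; extract_splice(\@m)     for 1 .. $draws },
});
```

With so few iterations Benchmark may warn that the count is unreliable; raising the first argument of cmpthese, or passing a negative number of CPU seconds, gives steadier figures.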

gamo · Dec 11, 2013

On 11/12/13 12:07, bugbear wrote:

a data structure more sophisticated than a plain (huge)
array might be indicated.

BugBear

Interesting. Can you be more specific?

Thanks

Rainer Weikusat · Dec 11, 2013

bugbear said:
gamo said:

Is there a better way to do this?

my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

The goal is to extract random elements without using shuffle, always
removing them from the original list.

Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...

Rainer Weikusat · Dec 16, 2013

bugbear said:
Rainer said:

Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...

I was assuming that a plain array in perl really IS a plain array,
and that (therefore) inserting/removing a single element
would have a cost linearly proportional to size.

The algorithm the OP posted runs in constant time (swap element to be
removed and last element, pop).

Peter J. Holzer · Dec 16, 2013

Rainer said:

bugbear said:

Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...

I was assuming that a plain array in perl really IS a plain array,
and that (therefore) inserting/removing a single element
would have a cost linearly proportional to size.

Not quite. The cost is proportional to the size of the part of the array
which has to be moved. Removing a single element at the start or end of
the array is very cheap, removing one from the middle of a large array
rather costly.

Also, the cost is not symmetrical: Removing an element from the second
half is cheaper than removing one from the first half: On my desktop,
using perl 5.14, removing a single element from just before the middle
of a 100000 element array takes a bit over 50 µs, removing one just
after the middle a bit under 20 µs. Both halves are linear, so an
obvious (but probably very minor) optimization would be to change the
method after the first third instead of after half.
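
A rough way to repeat this kind of measurement, as a sketch only: the helper name time_removals, the offsets of 10 elements either side of the middle and the count of 1000 removals are assumptions of this example, and the figures will differ between perl versions and machines.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Remove $count elements, each at a fixed offset from the current middle
# of an array that starts with $size elements.
# Returns (average seconds per removal, number of elements left).
sub time_removals {
    my ($size, $offset, $count) = @_;
    my @array = 1 .. $size;
    my $start = time();
    for (1 .. $count) {
        splice @array, int(@array / 2) + $offset, 1;
    }
    my $elapsed = time() - $start;
    return ($elapsed / $count, scalar @array);
}

for my $offset (-10, 10) {
    my ($avg) = time_removals(100_000, $offset, 1_000);
    printf "offset %+d from middle: %.2f us per removal\n", $offset, $avg * 1e6;
}
```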

The data structure I had in mind was (actually) rather simple.
A linked-list of arrays, (say 10000 elements each).

This is sometimes worthwhile but, as Rainer already pointed out, not in
this case.
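
For readers wondering what such a structure might look like, here is a small sketch under stated assumptions: it uses a Perl array of array references instead of a true linked list, and the names make_chunks and remove_at are invented for this example. Unlike the swap-and-pop approach it keeps the remaining elements in order, at the price of walking the chunk list on every removal.

```perl
use strict;
use warnings;

# Split a list into chunks of at most $chunk_size elements each.
sub make_chunks {
    my ($chunk_size, @items) = @_;
    my @chunks;
    push @chunks, [ splice @items, 0, $chunk_size ] while @items;
    return \@chunks;
}

# Remove and return the element at overall position $index (0-based).
# Only the chunk holding the element has to shift its contents.
sub remove_at {
    my ($chunks, $index) = @_;
    for my $i (0 .. $#$chunks) {
        my $chunk = $chunks->[$i];
        if ($index < @$chunk) {
            my $item = splice @$chunk, $index, 1;
            splice @$chunks, $i, 1 unless @$chunk;    # drop a chunk that became empty
            return $item;
        }
        $index -= @$chunk;    # skip this chunk and keep looking
    }
    return;    # index out of range
}

my $chunks = make_chunks(10_000, 1 .. 100_000);
my $total  = 100_000;
my $picked = remove_at($chunks, int rand $total--);
print "picked $picked\n";
```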

hp
