Extract sample (w/o replacement)

G

gamo

It's a better way to do this?

my @matrix = 1..10_000_000;


sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

The goal is to extact random elements without using shuffle, and
ever removing them from the original list.

TIA
 
P

Peter Makholm

gamo said:
my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it faster. Benchmarking is left to the reader if it
is important.

//Makholm
 
G

gamo

El 06/12/13 13:48, Peter Makholm escribió:
gamo said:
my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it faster. Benchmarking is left to the reader if it
is important.

//Makholm

Something is wrong, because splice it's a lot slower.

Thanks
 
P

Peter J. Holzer

El 06/12/13 13:48, Peter Makholm escribió:
gamo said:
my @matrix = 1..10_000_000;

sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

I would use splice instead of reordering the array

sub extract {
my $index = int rand @matrix;

return splice @matrix, $index, 1;
}

but I don't know if it faster. Benchmarking is left to the reader if it
is important.

//Makholm

Something is wrong, because splice it's a lot slower.

That was to be expected. Think for a moment what splice does.

hp
 
G

gamo

El 06/12/13 18:53, Peter J. Holzer escribió:
That was to be expected. Think for a moment what splice does.

hp

When it srinks the size of the list, re-indexing, or
something worst. Anyway, I thought that there could be
a more efficient method than mine.

Thanks
 
$

$Bill

It's a better way to do this?

my @matrix = 1..10_000_000;


sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

The goal is to extact random elements without using shuffle, and
ever removing them from the original list.

What's the swapping for ?
Wouldn't this work?:

sub extract {
return $matrix[int rand ($#matrix+1)];
}
 
G

gamo

El 07/12/13 00:46, $Bill escribió:
The goal is to extact random elements without using shuffle, and
ever removing them from the original list.

What's the swapping for ?
Wouldn't this work?:

sub extract {
return $matrix[int rand ($#matrix+1)];
}

No, this doesn't work because you could extract the same list element
many times, and this is not permitted.

Thanks
 
$

$Bill

El 07/12/13 00:46, $Bill escribió:
The goal is to extact random elements without using shuffle, and
ever removing them from the original list.

What's the swapping for ?
Wouldn't this work?:

sub extract {
return $matrix[int rand ($#matrix+1)];
}

No, this doesn't work because you could extract the same list element
many times, and this is not permitted.

A little unmentioned caveat. :)
 
G

gamo

El 07/12/13 04:10, Ben Morrow escribió:
Quoth gamo said:
It's a better way to do this?

my @matrix = 1..10_000_000;


sub extract{
my $ind = int rand ($#matrix+1);

Peter has already shown you can better write this

int rand @matrix
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;

Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and a 14% faster.

Thanks a lot
 
G

gamo

El 07/12/13 09:00, gamo escribió:
El 07/12/13 04:10, Ben Morrow escribió:
Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and a 14% faster.

But, what happens if $ind = $#matrix ?

The line $matrix[$ind] = pop @matrix could cause that element be repeated.

I'm afraid that's not a valid altenative.

Best regards
 
R

Rainer Weikusat

gamo said:
El 07/12/13 09:00, gamo escribió:
El 07/12/13 04:10, Ben Morrow escribió:
Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Ben

It's clearer and a 14% faster.

But, what happens if $ind = $#matrix ?

The line $matrix[$ind] = pop @matrix could cause that element be
repeated.

That's not a problem for $ind == $#matrix but for @matrix == 1: In this
case, the $matrix[$ind] = pop(@matrix) will reinsert the single remaning
element into the array whenever it is called, IOW, the size will stay 1.

It is possible to express this with slice notation as

@matrix[$ndx, -1] = @matrix[-1, $ndx]

At least one the computer where I tested this, this also seems to be
slightly faster with the difference being more marked as the input array
gets longer.
 
R

Rainer Weikusat

[...]
Perhaps

my $rv = $matrix[$ind];
$matrix[$ind] = pop @matrix;
return $rv;

would be clearer?

Considering that this is buggy and at least three people (me included, I
found the issue by testing) didn't realize this, the answer would be:
Decidedly no.
 
G

gamo

El 07/12/13 18:20, Rainer Weikusat escribió:
It is possible to express this with slice notation as

@matrix[$ndx, -1] = @matrix[-1, $ndx]

At least one the computer where I tested this, this also seems to be
slightly faster with the difference being more marked as the input array
gets longer.

I get the inverse result: a swap of values is sightly faster

Thanks
 
G

gamo

El 11/12/13 12:07, bugbear escribió:
a data structure more sophisticated than a plain (huge)
array might be indicated.

BugBear

Interesting. Can you be more specific?

Thanks
 
R

Rainer Weikusat

bugbear said:
gamo said:
It's a better way to do this?

my @matrix = 1..10_000_000;


sub extract{
my $ind = int rand ($#matrix+1);
($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
return pop @matrix;
}

The goal is to extact random elements without using shuffle, and
ever removing them from the original list.

Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...
 
R

Rainer Weikusat

bugbear said:
Rainer said:
Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...

I was assuming that a plain array in perl really IS a plain array,
and that (therefore) inserting/removing a single element
would have a cost linearly proportional to size.

The algorithm the OP posted runs in constant time (swap element to be
removed and last element, pop).
 
P

Peter J. Holzer

Rainer said:
bugbear said:
Depending on how important the performance of this is,
and what other operations are performed on the array,
a data structure more sophisticated than a plain (huge)
array might be indicated.

Certainly not for this particular task. There's also the problem that
algorithms implemented in Perl tend to be a lot slower ('have bigger
constants', [can I credit this to Robert Pike in English in some sane
way?]) than algorithms implemented in C which implies that Perl arrays
are more often a good choice for Perl code than C arrays would be for
C[*].

[*] No, that's not because memory management in C is soo complicated ...

I was assuming that a plain array in perl really IS a plain array,
and that (therefore) inserting/removing a single element
would have a cost linearly proportional to size.

Not quite. The cost is proportional to the size of the part of the array
which has to be moved. Removing a single element at the start or end of
the array is very cheap, removing one from the middle of a large array
rather costly.

Also, the cost is not symmetrical: Removing an element from the second
half is cheaper than removing one from the first half: On my desktop,
using perl 5.14, removing a single element from just before the middle
of a 100000 element array takes a bit over 50 µs, removing one just
after the middle a bit under 20 µs. Both halves are linear, so an
obvious (but probably very minor) optimization would be to change the
method after the first third instead of after half.
The data structure I had in mind was (actually) rather simple.
A linked-list of arrays, (say 10000 elements each).

This is sometimes worthwhile but, as Rainer already pointed out, not in
this case.

hp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top