Extract sample (w/o replacement)

Discussion in 'Perl Misc' started by gamo, Dec 6, 2013.

  1. gamo

    gamo Guest

    It's a better way to do this?

    my @matrix = 1..10_000_000;

    sub extract{
    my $ind = int rand ($#matrix+1);
    ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    return pop @matrix;

    The goal is to extact random elements without using shuffle, and
    ever removing them from the original list.

    gamo, Dec 6, 2013
    1. Advertisements

  2. I would use splice instead of reordering the array

    sub extract {
    my $index = int rand @matrix;

    return splice @matrix, $index, 1;

    but I don't know if it faster. Benchmarking is left to the reader if it
    is important.

    Peter Makholm, Dec 6, 2013
    1. Advertisements

  3. gamo

    gamo Guest

    El 06/12/13 13:48, Peter Makholm escribió:
    Something is wrong, because splice it's a lot slower.

    gamo, Dec 6, 2013
  4. That was to be expected. Think for a moment what splice does.

    Peter J. Holzer, Dec 6, 2013
  5. gamo

    gamo Guest

    El 06/12/13 18:53, Peter J. Holzer escribió:
    When it srinks the size of the list, re-indexing, or
    something worst. Anyway, I thought that there could be
    a more efficient method than mine.

    gamo, Dec 6, 2013
  6. gamo

    $Bill Guest

    What's the swapping for ?
    Wouldn't this work?:

    sub extract {
    return $matrix[int rand ($#matrix+1)];
    $Bill, Dec 6, 2013
  7. gamo

    gamo Guest

    El 07/12/13 00:46, $Bill escribió:
    No, this doesn't work because you could extract the same list element
    many times, and this is not permitted.

    gamo, Dec 7, 2013
  8. gamo

    $Bill Guest

    A little unmentioned caveat. :)
    $Bill, Dec 7, 2013
  9. gamo

    gamo Guest

    El 07/12/13 04:10, Ben Morrow escribió:
    It's clearer and a 14% faster.

    Thanks a lot
    gamo, Dec 7, 2013
  10. gamo

    gamo Guest

    El 07/12/13 09:00, gamo escribió:
    But, what happens if $ind = $#matrix ?

    The line $matrix[$ind] = pop @matrix could cause that element be repeated.

    I'm afraid that's not a valid altenative.

    Best regards
    gamo, Dec 7, 2013
  11. That's not a problem for $ind == $#matrix but for @matrix == 1: In this
    case, the $matrix[$ind] = pop(@matrix) will reinsert the single remaning
    element into the array whenever it is called, IOW, the size will stay 1.

    It is possible to express this with slice notation as

    @matrix[$ndx, -1] = @matrix[-1, $ndx]

    At least one the computer where I tested this, this also seems to be
    slightly faster with the difference being more marked as the input array
    gets longer.
    Rainer Weikusat, Dec 7, 2013
  12. [...]
    Considering that this is buggy and at least three people (me included, I
    found the issue by testing) didn't realize this, the answer would be:
    Decidedly no.
    Rainer Weikusat, Dec 7, 2013
  13. gamo

    gamo Guest

    El 07/12/13 18:20, Rainer Weikusat escribió:
    I get the inverse result: a swap of values is sightly faster

    gamo, Dec 7, 2013
  14. gamo

    gamo Guest

    El 11/12/13 12:07, bugbear escribió:
    Interesting. Can you be more specific?

    gamo, Dec 11, 2013
  15. Certainly not for this particular task. There's also the problem that
    algorithms implemented in Perl tend to be a lot slower ('have bigger
    constants', [can I credit this to Robert Pike in English in some sane
    way?]) than algorithms implemented in C which implies that Perl arrays
    are more often a good choice for Perl code than C arrays would be for

    [*] No, that's not because memory management in C is soo complicated ...
    Rainer Weikusat, Dec 11, 2013
  16. The algorithm the OP posted runs in constant time (swap element to be
    removed and last element, pop).
    Rainer Weikusat, Dec 16, 2013
  17. Not quite. The cost is proportional to the size of the part of the array
    which has to be moved. Removing a single element at the start or end of
    the array is very cheap, removing one from the middle of a large array
    rather costly.

    Also, the cost is not symmetrical: Removing an element from the second
    half is cheaper than removing one from the first half: On my desktop,
    using perl 5.14, removing a single element from just before the middle
    of a 100000 element array takes a bit over 50 µs, removing one just
    after the middle a bit under 20 µs. Both halves are linear, so an
    obvious (but probably very minor) optimization would be to change the
    method after the first third instead of after half.
    This is sometimes worthwhile but, as Rainer already pointed out, not in
    this case.

    Peter J. Holzer, Dec 16, 2013
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.