Extract sample (w/o replacement)

Discussion in 'Perl Misc' started by gamo, Dec 6, 2013.

  1. gamo

    gamo Guest

    It's a better way to do this?

    my @matrix = 1..10_000_000;


    sub extract{
    my $ind = int rand ($#matrix+1);
    ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    return pop @matrix;
    }

    The goal is to extact random elements without using shuffle, and
    ever removing them from the original list.

    TIA
    gamo, Dec 6, 2013
    #1
    1. Advertising

  2. gamo <> writes:

    > my @matrix = 1..10_000_000;
    >
    > sub extract{
    > my $ind = int rand ($#matrix+1);
    > ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    > return pop @matrix;
    > }


    I would use splice instead of reordering the array

    sub extract {
    my $index = int rand @matrix;

    return splice @matrix, $index, 1;
    }

    but I don't know if it faster. Benchmarking is left to the reader if it
    is important.

    //Makholm
    Peter Makholm, Dec 6, 2013
    #2
    1. Advertising

  3. gamo

    gamo Guest

    El 06/12/13 13:48, Peter Makholm escribió:
    > gamo <> writes:
    >
    >> my @matrix = 1..10_000_000;
    >>
    >> sub extract{
    >> my $ind = int rand ($#matrix+1);
    >> ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    >> return pop @matrix;
    >> }

    >
    > I would use splice instead of reordering the array
    >
    > sub extract {
    > my $index = int rand @matrix;
    >
    > return splice @matrix, $index, 1;
    > }
    >
    > but I don't know if it faster. Benchmarking is left to the reader if it
    > is important.
    >
    > //Makholm
    >


    Something is wrong, because splice it's a lot slower.

    Thanks
    gamo, Dec 6, 2013
    #3
  4. On 2013-12-06 13:10, gamo <> wrote:
    > El 06/12/13 13:48, Peter Makholm escribió:
    >> gamo <> writes:
    >>> my @matrix = 1..10_000_000;
    >>>
    >>> sub extract{
    >>> my $ind = int rand ($#matrix+1);
    >>> ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    >>> return pop @matrix;
    >>> }

    >>
    >> I would use splice instead of reordering the array
    >>
    >> sub extract {
    >> my $index = int rand @matrix;
    >>
    >> return splice @matrix, $index, 1;
    >> }
    >>
    >> but I don't know if it faster. Benchmarking is left to the reader if it
    >> is important.
    >>
    >> //Makholm
    >>

    >
    > Something is wrong, because splice it's a lot slower.


    That was to be expected. Think for a moment what splice does.

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Dec 6, 2013
    #4
  5. gamo

    gamo Guest

    El 06/12/13 18:53, Peter J. Holzer escribió:
    >> Something is wrong, because splice it's a lot slower.


    > That was to be expected. Think for a moment what splice does.
    >
    > hp


    When it srinks the size of the list, re-indexing, or
    something worst. Anyway, I thought that there could be
    a more efficient method than mine.

    Thanks
    gamo, Dec 6, 2013
    #5
  6. gamo

    $Bill Guest

    On 12/6/2013 03:34, gamo wrote:
    >
    > It's a better way to do this?
    >
    > my @matrix = 1..10_000_000;
    >
    >
    > sub extract{
    > my $ind = int rand ($#matrix+1);
    > ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    > return pop @matrix;
    > }
    >
    > The goal is to extact random elements without using shuffle, and
    > ever removing them from the original list.


    What's the swapping for ?
    Wouldn't this work?:

    sub extract {
    return $matrix[int rand ($#matrix+1)];
    }
    $Bill, Dec 6, 2013
    #6
  7. gamo

    gamo Guest

    El 07/12/13 00:46, $Bill escribió:
    >> The goal is to extact random elements without using shuffle, and
    >> ever removing them from the original list.

    >
    > What's the swapping for ?
    > Wouldn't this work?:
    >
    > sub extract {
    > return $matrix[int rand ($#matrix+1)];
    > }


    No, this doesn't work because you could extract the same list element
    many times, and this is not permitted.

    Thanks
    gamo, Dec 7, 2013
    #7
  8. gamo

    $Bill Guest

    On 12/6/2013 16:17, gamo wrote:
    > El 07/12/13 00:46, $Bill escribió:
    >>> The goal is to extact random elements without using shuffle, and
    >>> ever removing them from the original list.

    >>
    >> What's the swapping for ?
    >> Wouldn't this work?:
    >>
    >> sub extract {
    >> return $matrix[int rand ($#matrix+1)];
    >> }

    >
    > No, this doesn't work because you could extract the same list element
    > many times, and this is not permitted.


    A little unmentioned caveat. :)
    $Bill, Dec 7, 2013
    #8
  9. gamo

    gamo Guest

    El 07/12/13 04:10, Ben Morrow escribió:
    >
    > Quoth gamo <>:
    >>
    >> It's a better way to do this?
    >>
    >> my @matrix = 1..10_000_000;
    >>
    >>
    >> sub extract{
    >> my $ind = int rand ($#matrix+1);

    >
    > Peter has already shown you can better write this
    >
    > int rand @matrix
    >
    >> ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    >> return pop @matrix;

    >
    > Perhaps
    >
    > my $rv = $matrix[$ind];
    > $matrix[$ind] = pop @matrix;
    > return $rv;
    >
    > would be clearer?
    >
    > Ben
    >


    It's clearer and a 14% faster.

    Thanks a lot
    gamo, Dec 7, 2013
    #9
  10. gamo

    gamo Guest

    El 07/12/13 09:00, gamo escribió:
    > El 07/12/13 04:10, Ben Morrow escribió:
    >> Perhaps
    >>
    >> my $rv = $matrix[$ind];
    >> $matrix[$ind] = pop @matrix;
    >> return $rv;
    >>
    >> would be clearer?
    >>
    >> Ben
    >>

    >
    > It's clearer and a 14% faster.
    >


    But, what happens if $ind = $#matrix ?

    The line $matrix[$ind] = pop @matrix could cause that element be repeated.

    I'm afraid that's not a valid altenative.

    Best regards
    gamo, Dec 7, 2013
    #10
  11. gamo <> writes:
    > El 07/12/13 09:00, gamo escribió:
    >> El 07/12/13 04:10, Ben Morrow escribió:
    >>> Perhaps
    >>>
    >>> my $rv = $matrix[$ind];
    >>> $matrix[$ind] = pop @matrix;
    >>> return $rv;
    >>>
    >>> would be clearer?
    >>>
    >>> Ben
    >>>

    >>
    >> It's clearer and a 14% faster.
    >>

    >
    > But, what happens if $ind = $#matrix ?
    >
    > The line $matrix[$ind] = pop @matrix could cause that element be
    > repeated.


    That's not a problem for $ind == $#matrix but for @matrix == 1: In this
    case, the $matrix[$ind] = pop(@matrix) will reinsert the single remaning
    element into the array whenever it is called, IOW, the size will stay 1.

    It is possible to express this with slice notation as

    @matrix[$ndx, -1] = @matrix[-1, $ndx]

    At least one the computer where I tested this, this also seems to be
    slightly faster with the difference being more marked as the input array
    gets longer.
    Rainer Weikusat, Dec 7, 2013
    #11
  12. Ben Morrow <> writes:

    [...]

    > Perhaps
    >
    > my $rv = $matrix[$ind];
    > $matrix[$ind] = pop @matrix;
    > return $rv;
    >
    > would be clearer?


    Considering that this is buggy and at least three people (me included, I
    found the issue by testing) didn't realize this, the answer would be:
    Decidedly no.
    Rainer Weikusat, Dec 7, 2013
    #12
  13. gamo

    gamo Guest

    El 07/12/13 18:20, Rainer Weikusat escribió:
    > It is possible to express this with slice notation as
    >
    > @matrix[$ndx, -1] = @matrix[-1, $ndx]
    >
    > At least one the computer where I tested this, this also seems to be
    > slightly faster with the difference being more marked as the input array
    > gets longer.


    I get the inverse result: a swap of values is sightly faster

    Thanks
    gamo, Dec 7, 2013
    #13
  14. gamo

    gamo Guest

    El 11/12/13 12:07, bugbear escribió:
    > a data structure more sophisticated than a plain (huge)
    > array might be indicated.
    >
    > BugBear


    Interesting. Can you be more specific?

    Thanks
    gamo, Dec 11, 2013
    #14
  15. bugbear <bugbear@trim_papermule.co.uk_trim> writes:
    > gamo wrote:
    >>
    >> It's a better way to do this?
    >>
    >> my @matrix = 1..10_000_000;
    >>
    >>
    >> sub extract{
    >> my $ind = int rand ($#matrix+1);
    >> ($matrix[$ind],$matrix[-1]) = ($matrix[-1],$matrix[$ind]);
    >> return pop @matrix;
    >> }
    >>
    >> The goal is to extact random elements without using shuffle, and
    >> ever removing them from the original list.

    >
    > Depending on how important the performance of this is,
    > and what other operations are performed on the array,
    > a data structure more sophisticated than a plain (huge)
    > array might be indicated.


    Certainly not for this particular task. There's also the problem that
    algorithms implemented in Perl tend to be a lot slower ('have bigger
    constants', [can I credit this to Robert Pike in English in some sane
    way?]) than algorithms implemented in C which implies that Perl arrays
    are more often a good choice for Perl code than C arrays would be for
    C[*].

    [*] No, that's not because memory management in C is soo complicated ...
    Rainer Weikusat, Dec 11, 2013
    #15
  16. bugbear <bugbear@trim_papermule.co.uk_trim> writes:
    > Rainer Weikusat wrote:
    >> bugbear <bugbear@trim_papermule.co.uk_trim> writes:

    >
    >>>
    >>> Depending on how important the performance of this is,
    >>> and what other operations are performed on the array,
    >>> a data structure more sophisticated than a plain (huge)
    >>> array might be indicated.

    >>
    >> Certainly not for this particular task. There's also the problem that
    >> algorithms implemented in Perl tend to be a lot slower ('have bigger
    >> constants', [can I credit this to Robert Pike in English in some sane
    >> way?]) than algorithms implemented in C which implies that Perl arrays
    >> are more often a good choice for Perl code than C arrays would be for
    >> C[*].
    >>
    >> [*] No, that's not because memory management in C is soo complicated ...

    >
    > I was assuming that a plain array in perl really IS a plain array,
    > and that (therefore) inserting/removing a single element
    > would have a cost linearly proportional to size.


    The algorithm the OP posted runs in constant time (swap element to be
    removed and last element, pop).
    Rainer Weikusat, Dec 16, 2013
    #16
  17. On 2013-12-16 12:52, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Rainer Weikusat wrote:
    >> bugbear <bugbear@trim_papermule.co.uk_trim> writes:
    >>> Depending on how important the performance of this is,
    >>> and what other operations are performed on the array,
    >>> a data structure more sophisticated than a plain (huge)
    >>> array might be indicated.

    >>
    >> Certainly not for this particular task. There's also the problem that
    >> algorithms implemented in Perl tend to be a lot slower ('have bigger
    >> constants', [can I credit this to Robert Pike in English in some sane
    >> way?]) than algorithms implemented in C which implies that Perl arrays
    >> are more often a good choice for Perl code than C arrays would be for
    >> C[*].
    >>
    >> [*] No, that's not because memory management in C is soo complicated ...

    >
    > I was assuming that a plain array in perl really IS a plain array,
    > and that (therefore) inserting/removing a single element
    > would have a cost linearly proportional to size.


    Not quite. The cost is proportional to the size of the part of the array
    which has to be moved. Removing a single element at the start or end of
    the array is very cheap, removing one from the middle of a large array
    rather costly.

    Also, the cost is not symmetrical: Removing an element from the second
    half is cheaper than removing one from the first half: On my desktop,
    using perl 5.14, removing a single element from just before the middle
    of a 100000 element array takes a bit over 50 µs, removing one just
    after the middle a bit under 20 µs. Both halves are linear, so an
    obvious (but probably very minor) optimization would be to change the
    method after the first third instead of after half.

    > The data structure I had in mind was (actually) rather simple.
    > A linked-list of arrays, (say 10000 elements each).


    This is sometimes worthwhile but, as Rainer already pointed out, not in
    this case.

    hp

    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Dec 16, 2013
    #17
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Jake
    Replies:
    0
    Views:
    547
  2. Francogrex
    Replies:
    3
    Views:
    437
    Lionel B
    May 7, 2008
  3. Tim Chase
    Replies:
    0
    Views:
    70
    Tim Chase
    Feb 16, 2014
  4. Terry Reedy
    Replies:
    0
    Views:
    78
    Terry Reedy
    Feb 16, 2014
  5. Tim Chase
    Replies:
    0
    Views:
    83
    Tim Chase
    Feb 16, 2014
Loading...

Share This Page