sorting just the largest values

Discussion in 'Perl Misc' started by Alex Hart, Jan 27, 2005.

  1. Alex Hart

    Alex Hart Guest

    I have a very large list of data (up to 20,000 lines), and I just want
    to print out the largest 50 values. What is the quickest way to sort
    this using perl?

    Thanks in advance.

    - Alex Hart
     
    Alex Hart, Jan 27, 2005
    #1
    1. Advertising

  2. "Alex Hart" <> writes:
    > I have a very large list of data (up to 20,000 lines), and I just want
    > to print out the largest 50 values. What is the quickest way to sort
    > this using perl?


    For very large lists of data: while reading in the lines, keep a
    sorted array with the up to 50 largest seen values.

    But 20000 is not very large, or even large. Read the whole file and
    sort it.

    If you say "sort" to search.cpan.org, it will show you some modules
    that are likely to be useful.
     
    Arndt Jonasson, Jan 27, 2005
    #2
    1. Advertising

  3. Alex Hart wrote:

    > I have a very large list of data (up to 20,000 lines), and I just want
    > to print out the largest 50 values. What is the quickest way to sort
    > this using perl?
    >
    > Thanks in advance.
    >
    > - Alex Hart


    Selection sort - just do the outer loop 50 times.
    http://en.wikipedia.org/wiki/Selection_sort

    The time complexity is O(n) (sorting top k items ;k<<n)
    compared to O(n log n) for sorting the whole list.

    gtoomey
     
    Gregory Toomey, Jan 27, 2005
    #3
  4. Alex Hart wrote:
    > I have a very large list of data (up to 20,000 lines), and I just want
    > to print out the largest 50 values. What is the quickest way to sort
    > this using perl?


    my @largest = `sort -n yourfile | tail -50`;



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Jan 27, 2005
    #4
  5. Alex Hart

    Eric Bohlman Guest

    "Alex Hart" <> wrote in
    news::

    > I have a very large list of data (up to 20,000 lines), and I just want
    > to print out the largest 50 values. What is the quickest way to sort
    > this using perl?


    Most people wouldn't consider 20,000 items to be a "very large list."
    I'd suggest first just trying the brute-force approach of reading them
    into an array, sorting them, and taking the last 50 results. Only if
    this turns out to be unacceptably slow or memory-consuming for your
    application should you consider something different. Remember that
    perl's sort routines are implemented in optimized C code, so even if you
    implement a lower-time-complexity algorithm yourself, the value of N for
    which it becomes faster may be quite large.

    If the brute-force approach turns out to be unacceptable (again, based on
    actual measurement, not guesswork), here's a linear-time (O(N)) selection
    algorithm:

    1) Read the first 50 values into the top 50 list.

    2) For each subsequent value, find the smallest value in the top 50 list
    that's less than it. If there isn't one, do nothing. If there is, kick
    it out of the list and add the new value to the list.

    Step 2 will be simpler to program if you sort the list once after step 1.
    You'll find shift() and splice() particularly helpful.

    [UNTESTED]

    my @top;
    while (<>) {
    push @top,$_ if $.<=50;
    @top=sort @top if $.=50;
    if ($.>50) {
    for (my $i=49;$i>=0;--$i) {
    if ($top[$i] lt $_) {
    # this is where the new value belongs
    splice @top,$i,0,$_; # insert the new value
    shift @top; # remove the smallest value
    last;
    }
    }
    }
    }
     
    Eric Bohlman, Jan 27, 2005
    #5
  6. Alex Hart

    Alex Hart Guest

    > Most people wouldn't consider 20,000 items to be a "very large list."
    > I'd suggest first just trying the brute-force approach of reading

    them
    > into an array, sorting them, and taking the last 50 results. Only if


    I'm developing a real-time application, and a particular sort was
    taking several minutes when the list got up to 20,000. It might be a
    problem with the data being too ordered to start with (or does per
    shuffle before it sorts?). I already extract the sort key and just
    sort on that. I can try the packed string sort-key, but I think even a
    slow method of finding the top 50 must be faster than sorting the whole
    list.

    I'll try some of the suggestions here, and I'll be sure to benchmark
    along the way.
     
    Alex Hart, Jan 27, 2005
    #6
  7. Alex Hart

    Guest

    "Alex Hart" <> wrote:
    > > Most people wouldn't consider 20,000 items to be a "very large list."
    > > I'd suggest first just trying the brute-force approach of reading

    > them
    > > into an array, sorting them, and taking the last 50 results. Only if

    >
    > I'm developing a real-time application, and a particular sort was
    > taking several minutes when the list got up to 20,000.


    That is definitely pathological.

    > It might be a
    > problem with the data being too ordered to start with (or does per
    > shuffle before it sorts?).


    Depends on the version. Which version are you using?

    > I'll try some of the suggestions here, and I'll be sure to benchmark
    > along the way.


    Unfortunately, benchmarking is extremely difficult in cases where there are
    unknown and unpredictable pathological patterns in the data. Be careful.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jan 27, 2005
    #7
  8. Alex Hart

    Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > "Alex Hart" <> wrote:
    > > > Most people wouldn't consider 20,000 items to be a "very large list."
    > > > I'd suggest first just trying the brute-force approach of reading

    > > them
    > > > into an array, sorting them, and taking the last 50 results. Only if

    > >
    > > I'm developing a real-time application, and a particular sort was
    > > taking several minutes when the list got up to 20,000.

    >
    > That is definitely pathological.


    I agree. There's some gross inefficiency going on. We should see the
    sort, it's certainly speed-up-able.

    In general, the solution to the "bottom-n" problem is a heap. In pseudo-
    code (using a mythical module Heap, but there are real ones on CPAN):

    use Heap;
    my $h = Heap->new();
    my $n = 50;
    for ( @data ) {
    $h->insert( $_);
    $h->extract_maximum if $h->size > $n;
    }
    my @top_n = map $h->extract_maximum, 1 .. $h->size;

    If the heap implementation doesn't have a ->size method, it is easy to
    use an external counter instead.

    If $n is larger than the data size, the algorithm does a heap sort on
    the data. If $n is smaller, it is faster than the equivalent sort.

    Anno
     
    Anno Siegel, Jan 27, 2005
    #8
  9. Alex Hart

    David Combs Guest

    In article <ctb88g$ng0$-Berlin.DE>,
    Anno Siegel <-berlin.de> wrote:
    >...
    >In general, the solution to the "bottom-n" problem is a heap. In pseudo-
    >code (using a mythical module Heap, but there are real ones on CPAN):
    >
    > use Heap;
    > my $h = Heap->new();
    > my $n = 50;
    > for ( @data ) {
    > $h->insert( $_);
    > $h->extract_maximum if $h->size > $n;
    > }
    > my @top_n = map $h->extract_maximum, 1 .. $h->size;
    >
    >If the heap implementation doesn't have a ->size method, it is easy to
    >use an external counter instead.
    >
    >If $n is larger than the data size, the algorithm does a heap sort on
    >the data. If $n is smaller, it is faster than the equivalent sort.
    >
    >Anno


    Perhaps (not sure about this) it would be faster (fewer steps?)
    to use one of the *newer* (than heapsort) priority-queue methods,
    eg splay trees and the like?

    (Or maybe no real difference?)

    David
     
    David Combs, Feb 20, 2005
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Sanitarium
    Replies:
    2
    Views:
    565
    Whitecrest
    Dec 4, 2003
  2. Keke922
    Replies:
    2
    Views:
    3,119
    Shane Beasley
    Nov 2, 2003
  3. Replies:
    9
    Views:
    461
  4. Davy
    Replies:
    7
    Views:
    234
    DouhetSukd
    Nov 13, 2007
  5. Roy Smith
    Replies:
    11
    Views:
    534
    Dan Stromberg
    Dec 31, 2010
Loading...

Share This Page