Hpw make lists that are easy to sort.

Discussion in 'Python' started by Anton Vredegoor, Mar 28, 2007.

  1. Python's sorting algorithm takes advantage of preexisting order in a
    sequence:

    #sort_test.py
    import random
    import time

    def test():
    n = 1000
    k = 2**28

    L = random.sample(xrange(-k,k),n)
    R = random.sample(xrange(-k,k),n)

    t = time.time()
    LR = [(i+j) for i in L for j in R]
    print time.time()-t
    LR.sort()
    print time.time()-t

    print

    t = time.time()
    #L.sort()
    R.sort()
    presorted_LR = [(i+j) for i in L for j in R]
    print time.time()-t
    presorted_LR.sort()
    print time.time()-t

    if __name__=='__main__':
    test()

    On this -very slow- computer this prints:

    >d:\python25\pythonw -u "sort_test.py"

    1.10000014305
    8.96000003815

    1.10000014305
    5.49000000954
    >Exit code: 0


    Presorting the second sequence gains us more than three seconds. I
    wonder if there is a way to generate the combined items in such a way
    that sorting them is even faster? Is there some other sorting algorithm
    that can specifically take advantage of this way -or another way- of
    generating this list?

    The final sequence is len(L)*len(R) long but it is produced from only
    len(L)+len(R) different items, is it possible to exploit this fact? I'd
    also be interested in a more general solution that would work for
    summing the items of more than two lists in this way.

    A.
    Anton Vredegoor, Mar 28, 2007
    #1
    1. Advertising

  2. Anton Vredegoor

    Guest

    On Mar 28, 12:12 pm, Anton Vredegoor <>
    wrote:
    > Python's sorting algorithm takes advantage of preexisting order in a
    > sequence:
    >
    > #sort_test.py
    > import random
    > import time
    >
    > def test():
    > n = 1000
    > k = 2**28
    >
    > L = random.sample(xrange(-k,k),n)
    > R = random.sample(xrange(-k,k),n)
    >
    > t = time.time()
    > LR = [(i+j) for i in L for j in R]
    > print time.time()-t
    > LR.sort()
    > print time.time()-t
    >
    > print
    >
    > t = time.time()
    > #L.sort()
    > R.sort()
    > presorted_LR = [(i+j) for i in L for j in R]
    > print time.time()-t
    > presorted_LR.sort()
    > print time.time()-t
    >
    > if __name__=='__main__':
    > test()
    >
    > On this -very slow- computer this prints:
    >
    > >d:\python25\pythonw -u "sort_test.py"

    > 1.10000014305
    > 8.96000003815
    >
    > 1.10000014305
    > 5.49000000954
    > >Exit code: 0

    >
    > Presorting the second sequence gains us more than three seconds. I
    > wonder if there is a way to generate the combined items in such a way
    > that sorting them is even faster? Is there some other sorting algorithm
    > that can specifically take advantage of this way -or another way- of
    > generating this list?
    >
    > The final sequence is len(L)*len(R) long but it is produced from only
    > len(L)+len(R) different items, is it possible to exploit this fact? I'd
    > also be interested in a more general solution that would work for
    > summing the items of more than two lists in this way.
    >
    > A.


    I found a website that hopefully will point you in the right
    direction:

    http://wiki.python.org/moin/HowTo/Sorting

    And this one has an interesting profile of various sort methods with
    Python:

    http://www.biais.org/blog/index.php/2007/01/28/23-python-sorting-efficiency

    Enjoy,

    Mike
    , Mar 28, 2007
    #2
    1. Advertising

  3. Anton Vredegoor

    Paul Rubin Guest

    Anton Vredegoor <> writes:
    > Presorting the second sequence gains us more than three seconds. I
    > wonder if there is a way to generate the combined items in such a way
    > that sorting them is even faster? Is there some other sorting
    > algorithm that can specifically take advantage of this way -or another
    > way- of generating this list?


    Well there are various hacks one can think of, but is there an actual
    application you have in mind?

    > The final sequence is len(L)*len(R) long but it is produced from only
    > len(L)+len(R) different items, is it possible to exploit this fact?
    > I'd also be interested in a more general solution that would work for
    > summing the items of more than two lists in this way.


    If you really want the sum of several probability distriutions (in
    this case it's the sum of several copies of the uniform distribution),
    it's the convolution of the distributions being summed. You can do
    that with the fast fourier transform much more efficiently than
    grinding out that cartesian product. But I don't know if that's
    anything like what you're trying to do.
    Paul Rubin, Mar 28, 2007
    #3
  4. Paul Rubin wrote:

    > Well there are various hacks one can think of, but is there an actual
    > application you have in mind?


    Suppose both input lists are sorted. Then the product list is still not
    sorted but it's also not completely unsorted. How can I sort the
    product? I want to know if it is necessary to compute the complete
    product list first in order to sort it. Is it possible to generate the
    items in sorted order using only a small stack?

    Also, I have a sumfour script that is slow because of sorting. It would
    become competitive to the hashing solution if the sorting would be about
    ten times faster. If the items could be generated directly in order the
    script would also have only a very small memory footprint.

    > If you really want the sum of several probability distriutions (in
    > this case it's the sum of several copies of the uniform distribution),
    > it's the convolution of the distributions being summed. You can do
    > that with the fast fourier transform much more efficiently than
    > grinding out that cartesian product. But I don't know if that's
    > anything like what you're trying to do.


    I want the product, but sorted in less time. If Fourier transforms can
    help, I want them :)

    A.
    Anton Vredegoor, Mar 28, 2007
    #4
  5. Anton Vredegoor

    Terry Reedy Guest

    "Anton Vredegoor" <> wrote in message
    news:euenlj$oll$1.ov.home.nl...
    | Paul Rubin wrote:
    |
    | > Well there are various hacks one can think of, but is there an actual
    | > application you have in mind?
    |
    | Suppose both input lists are sorted. Then the product list is still not
    | sorted but it's also not completely unsorted. How can I sort the
    | product? I want to know if it is necessary to compute the complete
    | product list first in order to sort it. Is it possible to generate the
    | items in sorted order using only a small stack?

    If you have lists A and B of lengths m and n, m < n, and catenate the m
    product lists A[0]*B, A[1]*B, ..., A[m-1]*B, then list.sort will definitely
    take advantage of the initial order in each of the m sublists and will be
    faster than sorting m*n scrambled items (which latter is O(m*n*log(m*n))).

    One could generate the items in order in less space by doing, for instance,
    an m-way merge, in which only the lowest member of each of the m sublists
    is present at any one time. But I don't know if this (which is
    O(m*n*log(m))) would be any faster (in some Python implementation) for any
    particular values of m and m.

    Terry Jan Reedy
    Terry Reedy, Mar 28, 2007
    #5
  6. Terry Reedy wrote:

    > One could generate the items in order in less space by doing, for instance,
    > an m-way merge, in which only the lowest member of each of the m sublists
    > is present at any one time. But I don't know if this (which is
    > O(m*n*log(m))) would be any faster (in some Python implementation) for any
    > particular values of m and m.


    If hashing is O(n+m), it would mean that it would be faster.

    I'm not sure if I can agree with your analysis. All information to
    generate the product is already inside the two lists we begin with.
    Doesn't that make the product less complex than a random n*m matrix? Or
    is that what you are saying with O(m*n*log(m)) ?

    A.
    Anton Vredegoor, Mar 29, 2007
    #6
  7. Anton Vredegoor

    Terry Reedy Guest

    "Anton Vredegoor" <> wrote in message
    news:eueu79$mkd$1.ov.home.nl...
    | Terry Reedy wrote:
    |
    | > One could generate the items in order in less space by doing, for
    instance,
    | > an m-way merge, in which only the lowest member of each of the m
    sublists
    | > is present at any one time. But I don't know if this (which is
    | > O(m*n*log(m))) would be any faster (in some Python implementation) for
    any
    | > particular values of m and m.
    |
    | If hashing is O(n+m), it would mean that it would be faster.
    |
    | I'm not sure if I can agree with your analysis. All information to
    | generate the product is already inside the two lists we begin with.
    | Doesn't that make the product less complex than a random n*m matrix? Or
    | is that what you are saying with O(m*n*log(m)) ?

    If I understand correctly, you want to multiiply each of m numbers by each
    of n numbers, giving m*n products. That is O(m*n) work. Inserting (and
    extracting) each of these is a constant size m priority cue takes, I
    believe, O(log(m)) work, for a total of m*n*log(m). That is faster than
    O(m*n*log(m*n)) for sorting m*n random numbers.

    I don't know how you would sort by hashing.

    Terry Jan Reedy
    Terry Reedy, Mar 29, 2007
    #7
  8. Terry Reedy wrote:

    > If I understand correctly, you want to multiiply each of m numbers by each
    > of n numbers, giving m*n products. That is O(m*n) work. Inserting (and
    > extracting) each of these is a constant size m priority cue takes, I
    > believe, O(log(m)) work, for a total of m*n*log(m). That is faster than
    > O(m*n*log(m*n)) for sorting m*n random numbers.


    Probably, I'm not very good with big O computations. Please forget my
    earlier post and please also ignore the unfortunate subject line. I want
    the cartesian product of the lists but I want the sums of the items.
    Suppose the function to combine the items is

    def f(i,j):
    return i+j

    And the list we want to sort would be made like this:

    LR = [f(i,j) for i in L for j in R]

    That would be like in my original post. However if the function would
    have been:

    def g(i,j):
    return n*i+j

    The resulting cartesian product of the list would be sorted a lot
    quicker, especially if the two lists -L and R- we start with are sorted.
    (n is the length of both lists here)

    So if changing the way the items are combined influences sorting time,
    is there also a way to optimize the order of generating the items for
    later sorting.

    I mean optimized for a specific combining function, in this case
    function f.

    > I don't know how you would sort by hashing.


    Me too. I also learned that hashing is O(1) for non-mathematicians.

    Probably I'm suffering from a mild outbreak of spring. I'm currently
    trying to stop myself from starting to smoke again or writing critical
    posts about PyPy, if it explains anything.

    A.
    Anton Vredegoor, Mar 29, 2007
    #8
  9. Terry Reedy wrote:

    > If I understand correctly, you want to multiiply each of m numbers by each
    > of n numbers, giving m*n products. That is O(m*n) work. Inserting (and
    > extracting) each of these is a constant size m priority cue takes, I
    > believe, O(log(m)) work, for a total of m*n*log(m). That is faster than
    > O(m*n*log(m*n)) for sorting m*n random numbers.


    According to this page:

    http://maven.smith.edu/~orourke/TOPP/P41.html

    You are very close. The only thing is whether the logarithmic factor can
    be removed.

    But there's more:

    </quote>
    If the input consists of n integers between - M and M, an algorithm of
    Seidel based on fast Fourier transforms runs in O(n + M log M) time
    [Eri99a]. The $ \Omega$(n2) lower bounds require exponentially large
    integers.
    <quote>

    So maybe there is something at least for this specific case. I hope I'm
    not irritating someone by posting my thought processes here, since
    posting things sometimes seems to be the only way to find the links. I
    wonder if it's the selective attention that makes them turn up after
    posting or whether your talk about big O's has put me on the right track.

    Thanks anyway. The problem is still open in general, but some hacks are
    possible, as Paul Rubin said.

    A.
    Anton Vredegoor, Mar 30, 2007
    #9
  10. Anton Vredegoor

    Paul Rubin Guest

    Anton Vredegoor <> writes:
    > I want the product, but sorted in less time. If Fourier transforms can
    > help, I want them :)


    Oh, I see what you mean. I don't see an obvious faster way to do it
    and I don't have the feeling that one necessarily exists. As someone
    mentioned, you could do an n-way merge, which at least avoids using
    quadratic memory. Here's a version using Frederik Lundh's trick of
    representing a lazy list as its head plus the generator for the tail:

    from heapq import heapify, heappop, heappush
    import random

    def sortedsums(L,R):
    # yield elements of [(i+j) for i in L for j in R] in sorted order
    # assumes L and R are themselves sorted
    def lundh(x):
    g = ((a+x) for a in L)
    return (g.next(), g)
    heap = [lundh(x) for x in R]

    heapify (heap) # not sure this is needed

    while heap:
    z,zn = heappop(heap)
    try: heappush(heap, (zn.next(), zn))
    except StopIteration: pass
    yield z

    def test():
    L = sorted(random.sample(xrange(2000),150))
    R = sorted(random.sample(xrange(2000),150))

    t = sortedsums(L,R)
    t2 = sorted([(i+j) for i in L for j in R])
    assert list(t) == t2

    test()
    Paul Rubin, Mar 31, 2007
    #10
  11. Paul Rubin wrote:

    > Oh, I see what you mean. I don't see an obvious faster way to do it
    > and I don't have the feeling that one necessarily exists. As someone
    > mentioned, you could do an n-way merge, which at least avoids using
    > quadratic memory. Here's a version using Frederik Lundh's trick of
    > representing a lazy list as its head plus the generator for the tail:


    That's a beautiful trick! I was just exploring some idea about
    traversing the matrix starting from the upper left and ending at the
    lower right by forming some kind of wave like front line. It's currently
    very slow but I hope it can be polished a bit.

    Also I was trying to figure out if it could have any advantage over the
    straight row by row merge, but now that these lazy rows have appeared
    the field has changed a lot :)

    def typewriter(L,R):
    Z = [0] * len(R)
    M = [(L[0]+R[0],0)]
    while M:
    val,k = min(M)
    yield val
    Z[k] += 1
    M = []
    for k,x in enumerate(Z):
    if x < len(R):
    M.append((L[k]+R[x],k))
    if not x:
    break

    A.
    Anton Vredegoor, Mar 31, 2007
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Hazzard
    Replies:
    2
    Views:
    635
    Hazzard
    Apr 6, 2004
  2. =?UTF-8?B?w4FuZ2VsIEd1dGnDqXJyZXogUm9kcsOtZ3Vleg==

    List of lists of lists of lists...

    =?UTF-8?B?w4FuZ2VsIEd1dGnDqXJyZXogUm9kcsOtZ3Vleg==, May 8, 2006, in forum: Python
    Replies:
    5
    Views:
    400
    =?UTF-8?B?w4FuZ2VsIEd1dGnDqXJyZXogUm9kcsOtZ3Vleg==
    May 15, 2006
  3. Bruno Desthuilliers
    Replies:
    5
    Views:
    381
    Bruno Desthuilliers
    Aug 29, 2007
  4. Stef Mientki

    hpw to convert a linux python script ?

    Stef Mientki, Feb 11, 2009, in forum: Python
    Replies:
    7
    Views:
    749
    Steve Holden
    Feb 15, 2009
  5. Navin
    Replies:
    1
    Views:
    685
    Ken Schaefer
    Sep 9, 2003
Loading...

Share This Page