finding most common elements between thousands of multiple arrays.

Discussion in 'Python' started by mclovin, Jul 4, 2009.

  1. mclovin

    mclovin Guest

    Currently I need to find the most common elements in thousands of
    arrays within one large array (around 2 million instances with ~70k
    unique elements).

    So I set up a dictionary to handle the counting, so when I am
    iterating I up the count on the corresponding dictionary element. I
    then iterate through the dictionary and find the 25 most common
    elements.

    The elements are initially held in an array within an array, so I am
    just trying to find the most common elements between all the arrays
    contained in one large array.
    My current code looks something like this:

    d = {}
    for arr in my_array:
        for i in arr:
            # elements are numpy integers, so convert them to plain ints
            # for use as dictionary keys
            d[int(i)] = d.get(int(i), 0) + 1

    Then I filter things down, but with my algorithm that only takes about
    1 sec, so I don't need to show it here since that isn't the problem.


    But there has to be something better. I have to do this many, many
    times, and it seems silly to iterate through 2 million things just to
    get 25. The element IDs are integers and are currently being held in
    numpy arrays in a larger array. This ID is what makes up the key to
    the dictionary.

    It currently takes about 5 seconds to accomplish this with my current
    algorithm.

    So does anyone know the best solution or algorithm? I think the trick
    lies in matrix intersections but I do not know.
    mclovin, Jul 4, 2009
    #1

  2. Chris Rebert

    Chris Rebert Guest

    Re: finding most common elements between thousands of multiple arrays.

    On Sat, Jul 4, 2009 at 12:33 AM, mclovin<> wrote:
    > Currently I need to find the most common elements in thousands of
    > arrays within one large array (around 2 million instances with ~70k
    > unique elements)
    > [snip]


    Using a defaultdict
    (http://docs.python.org/library/collections.html#collections.defaultdict)
    would speed it up (but only slightly) and make it clearer to read.
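    Something like this (untested, reusing the names from your post):

    from collections import defaultdict

    d = defaultdict(int)        # missing keys start at 0
    for arr in my_array:
        for i in arr:
            d[int(i)] += 1      # no more d.get(int(i), 0) needed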

    Cheers,
    Chris
    --
    http://blog.rebertia.com
    Chris Rebert, Jul 4, 2009
    #2

  3. Andre Engels

    Andre Engels Guest

    Re: finding most common elements between thousands of multiple arrays.

    On Sat, Jul 4, 2009 at 9:33 AM, mclovin<> wrote:
    > Currently I need to find the most common elements in thousands of
    > arrays within one large array (around 2 million instances with ~70k
    > unique elements)
    > [snip]


    There's no better algorithm for the general case. No method of
    checking the matrices using less than 2000000-x look-ups will ensure
    you that there's not a new value with x occurrences lurking somewhere.

    However, if you need this information more often, and if the number of
    values that changes in between is small compared to the total size of
    the matrices, you could make a gain in subsequent calculations by
    remembering partial results. Depending on the situation you could, for
    example, keep the dictionary with the counts and update it each time,
    or keep such a dictionary for each "big" matrix and set a flag when
    that dictionary is no longer up to date.
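    A rough, untested sketch of that bookkeeping (the names are made up;
    it assumes each matrix is a flat sequence of values): keep one count
    dictionary per matrix plus a combined total, and call refresh() only
    for the matrices you have flagged as modified.

    totals = {}        # value -> count over all matrices
    per_matrix = {}    # matrix id -> {value: count} for that matrix

    def refresh(matrix_id, matrix):
        # subtract this matrix's stale counts, then count it afresh
        for value, n in per_matrix.get(matrix_id, {}).iteritems():
            totals[value] -= n
        fresh = {}
        for i in matrix:
            fresh[int(i)] = fresh.get(int(i), 0) + 1
        for value, n in fresh.iteritems():
            totals[value] = totals.get(value, 0) + n
        per_matrix[matrix_id] = fresh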

    --
    André Engels,
    Andre Engels, Jul 4, 2009
    #3
  4. Vilya Harvey

    Vilya Harvey Guest

    Re: finding most common elements between thousands of multiple arrays.

    2009/7/4 Andre Engels <>:
    > On Sat, Jul 4, 2009 at 9:33 AM, mclovin<> wrote:
    >> Currently I need to find the most common elements in thousands of
    >> arrays within one large array (around 2 million instances with ~70k
    >> unique elements)
    >> [snip]

    >
    > There's no better algorithm for the general case. No method of
    > checking the matrices using less than 2000000-x look-ups will ensure
    > you that there's not a new value with x occurrences lurking somewhere.


    Try flattening the arrays into a single large array & sorting it. Then
    you can just iterate over the large array counting as you go; you only
    ever have to insert into the dict once for each value and there are no
    lookups in the dict. I don't know numpy, so there's probably a more
    efficient way to write this, but this should show what I'm talking
    about:

    big_arr = sorted(reduce(list.__add__, my_array, []))
    counts = {}
    last_val = big_arr[0]
    count = 0
    for val in big_arr:
        if val == last_val:
            count += 1
        else:
            counts[last_val] = count
            count = 1           # the current val is the first of a new run
            last_val = val
    counts[last_val] = count    # to get the count for the last value.

    If flattening the arrays isn't practical, you may still get some
    improvements by sorting them individually and applying the same
    principle to each of them:

    counts = {}
    for arr in my_array:
        sorted_arr = sorted(arr)
        last_val = sorted_arr[0]
        count = 0
        for val in sorted_arr:
            if val == last_val:
                count += 1
            else:
                counts[last_val] = counts.get(last_val, 0) + count
                count = 1       # the current val is the first of a new run
                last_val = val
        counts[last_val] = counts.get(last_val, 0) + count

    Hope that helps...

    Vil.
    Vilya Harvey, Jul 4, 2009
    #4
  5. You can join all your arrays into a single big array with concatenate.

    >>> import numpy as np
    >>> a = np.concatenate(array_of_arrays)


    Then count the number of occurrences of each unique element using this trick
    with searchsorted. This should be pretty fast.

    >>> a.sort()
    >>> unique_a = np.unique(a)
    >>> count = []
    >>> for val in unique_a:
    ...     count.append(a.searchsorted(val, side='right') -
    ...                  a.searchsorted(val, side='left'))
    ...
    >>> mostcommonvals = unique_a[np.argsort(count)[-25:]]
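    searchsorted also accepts an array of values, so the per-value loop can
    be done in one shot (untested sketch, same names as above):

    >>> left = a.searchsorted(unique_a, side='left')
    >>> right = a.searchsorted(unique_a, side='right')
    >>> count = right - left
    >>> mostcommonvals = unique_a[np.argsort(count)[-25:]]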



    Neil
    Neil Crighton, Jul 4, 2009
    #5
  6. Re: finding most common elements between thousands of multiple arrays.

    On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:

    > 2009/7/4 Andre Engels <>:
    > [snip]

    >
    > Try flattening the arrays into a single large array & sorting it. Then
    > you can just iterate over the large array counting as you go; you only
    > ever have to insert into the dict once for each value and there's no
    > lookups in the dict.


    You're suggesting to do a whole bunch of work copying 2,000,000 pointers
    into a single array, then a whole bunch of more work sorting that second
    array (which is O(N*log N) on average), and then finally iterate over the
    second array. Sure, that last step will on average involve fewer than
    O(N) steps, but to get to that point you've already done more work than
    just iterating over the array-of-arrays in the first place.

    Now, if you're really lucky, your strategy can be done in fast C code
    instead of slow Python code, and you might see a speed-up for values of N
    which aren't too big. But I wouldn't put money on it.

    Another strategy might be to pre-count elements in each array, as you
    build or modify them. This will marginally slow down each modification
    you make to the array, but the payback will be that finding the frequency
    of any element will be almost instantaneous.
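    For example (untested, and assuming you build the arrays up element by
    element as Python lists; the helper names are just for illustration):

    counts = {}   # value -> frequency, maintained alongside the arrays

    def add_value(arr, value):
        arr.append(value)
        counts[value] = counts.get(value, 0) + 1

    def remove_value(arr, index):
        value = arr.pop(index)
        counts[value] -= 1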


    --
    Steven
    Steven D'Aprano, Jul 4, 2009
    #6
  7. Re: finding most common elements between thousands of multiple arrays.

    On Sat, 04 Jul 2009 13:42:06 +0000, Steven D'Aprano wrote:

    > [snip]

    >
    > You're suggesting to do a whole bunch of work copying 2,000,000 pointers
    > into a single array, then a whole bunch of more work sorting that second
    > array (which is O(N*log N) on average), and then finally iterate over
    > the second array. Sure, that last step will on average involve fewer
    > than O(N) steps,


    Er what?

    Ignore that last comment -- I don't know what I was thinking. You still
    have to iterate over all N elements, sorted or not.

    > but to get to that point you've already done more work
    > than just iterating over the array-of-arrays in the first place.


    What it does buy you though, as you pointed out, is reducing the number
    of explicit dict lookups and writes. However, dict lookups and writes are
    very fast, fast enough that they're used throughout Python. A line like:

    count += 1

    actually is a dict lookup and write.



    --
    Steven
    Steven D'Aprano, Jul 4, 2009
    #7
  8. mclovin

    mclovin Guest

    Re: finding most common elements between thousands of multiple arrays.

    OK then. I will try some of the strategies here, but I guess things
    aren't looking too good. I need to run this over a dataset that someone
    pickled. I need to run this 480,000 times, so you can see my
    frustration. So it doesn't need to be "real time", but it would be nice
    if it finished sorting this month.

    Is there a "best guess" strategy where it is not 100% accurate but much
    faster?

    On Jul 4, 7:38 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > [snip]
    mclovin, Jul 4, 2009
    #8
  9. Vilya Harvey

    Vilya Harvey Guest

    Re: finding most common elements between thousands of multiple arrays.

    2009/7/4 Steven D'Aprano <>:
    > [snip]

    >
    > What it does buy you though, as you pointed out, is reducing the number
    > of explicit dict lookups and writes. However, dict lookups and writes are
    > very fast, fast enough that they're used throughout Python. A line like:
    >
    > count += 1
    >
    > actually is a dict lookup and write.


    I did some tests, just to be sure, and you're absolutely right: just
    creating the flattened list took several hundred (!) times as long as
    iterating through all the lists in place. Live and learn...

    Vil.
    Vilya Harvey, Jul 4, 2009
    #9
  10. Lie Ryan

    Lie Ryan Guest

    Re: finding most common elements between thousands of multiple arrays.

    mclovin wrote:
    > OK then. I will try some of the strategies here, but I guess things
    > aren't looking too good. I need to run this over a dataset that someone
    > pickled. I need to run this 480,000 times, so you can see my
    > frustration. So it doesn't need to be "real time", but it would be nice
    > if it finished sorting this month.
    >
    > Is there a "best guess" strategy where it is not 100% accurate but much
    > faster?


    Heuristics?

    If you don't need 100% accuracy, you can simply sample 10,000 or so
    elements and find the most common elements in this small sample space. It
    should be much faster, though you'll probably need to determine the best
    cutoff number (too small and you risk bias, too large and it
    will be slower). random.sample() might be useful here.
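    Something like this, perhaps (untested; the function name and
    sample_size are made up, and note that picking sub-arrays uniformly
    biases the sample toward elements of short arrays):

    import random

    def approx_most_common(my_array, sample_size=10000, n=25):
        counts = {}
        for _ in xrange(sample_size):
            arr = random.choice(my_array)      # a random sub-array...
            val = int(random.choice(arr))      # ...and a random element of it
            counts[val] = counts.get(val, 0) + 1
        # rank the sampled values by how often they turned up
        return sorted(counts, key=counts.get, reverse=True)[:n]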
    Lie Ryan, Jul 4, 2009
    #10
  11. mclovin

    mclovin Guest

    Re: finding most common elements between thousands of multiple arrays.

    On Jul 4, 12:51 pm, Scott David Daniels <> wrote:
    > mclovin wrote:
    > > [snip]
    > > Is there a "best guess" strategy where it is not 100% accurate but much
    > > faster?

    >
    > Well, I timed a run of a version of mine, and the scan is approx 5X
    > longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
    > see how quickly the copy and sort happens. So you could try a variant
    > exploiting the following property:
    >      If you know the minimum length of a run that will be in the top 25,
    > then the value for each of the most-frequent run entries must show up at
    > positions n * stride and (n + 1) * stride (for some n).  That should
    > drastically reduce the scan cost, as long as stride is reasonably large.
    >
    > For my uniformly distributed 0..1024 values in a 5K x 5K array:
    > About 2.5 sec to flatten and sort.
    > About 15 sec to run one of my heapish thingies.
    > The least frequency encountered: 24716
    > So, with the stride set to that value,
    >
    > sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
    >
    > So there are only 1000 points to investigate.
    > With any distribution other than uniform, that should go _way_ down.
    > So just pull out those points, use bisect to get their frequencies, and
    > feed those results into the heap accumulation.
    >
    > --Scott David Daniels


    I don't quite understand what you are saying, but I know this: the number
    of times the most common element appears varies greatly. Sometimes it
    appears over 1000 times, and sometimes it appears fewer than 50 times. It
    all depends on the density of the arrays I am analyzing.

    Like I said, I need to do this 480,000 times, so to get this done
    realistically I need to analyse about 5 a second. It appears that the
    average matrix size contains about 15 million elements.

    I threaded my program using your code and I did about 1,000 in an hour,
    so it is still much too slow.

    When I selected 1 million random elements to count, 8 of that sample's
    top 10 were in the precise top 25, and 18 of its top 25 were in the
    precise top 25, so I suppose that could be an option.
    mclovin, Jul 4, 2009
    #11
  12. MRAB

    MRAB Guest

    mclovin wrote:
    [snip]
    > Like I said, I need to do this 480,000 times, so to get this done
    > realistically I need to analyse about 5 a second. It appears that the
    > average matrix size contains about 15 million elements.
    >
    > I threaded my program using your code and I did about 1,000 in an hour,
    > so it is still much too slow.
    >
    > When I selected 1 million random elements to count, 8 of that sample's
    > top 10 were in the precise top 25, and 18 of its top 25 were in the
    > precise top 25, so I suppose that could be an option.


    The values are integers, aren't they? What is the range of values?
    MRAB, Jul 4, 2009
    #12
  13. mclovin

    mclovin Guest

    Re: finding most common elements between thousands of multiple arrays.

    On Jul 4, 3:29 pm, MRAB <> wrote:
    > mclovin wrote:
    > [snip]

    >
    > The values are integers, aren't they? What is the range of values?


    There are approx 550k unique values with a range of 0-2 million, with
    gaps.
    mclovin, Jul 4, 2009
    #13
  14. MRAB

    MRAB Guest

    mclovin wrote:
    > On Jul 4, 3:29 pm, MRAB <> wrote:
    >> mclovin wrote:
    >> [snip]

    >> The values are integers, aren't they? What is the range of values?

    >
    > There are approx 550k unique values with a range of 0-2 million, with
    > gaps.


    I've done a little experimentation with lists (no numpy involved) and
    found that I got a 2x speed increase if I did the counting using a list,
    something like this:

    counts = [0] * 2000000
    for x in values:
        counts[x] += 1
    counts = dict(e for e in enumerate(counts) if e[1] != 0)
    MRAB, Jul 5, 2009
    #14
  15. On 7/4/2009 12:33 AM mclovin said...
    > Currently I need to find the most common elements in thousands of
    > arrays within one large array (around 2 million instances with ~70k
    > unique elements)
    >
    > so I set up a dictionary to handle the counting so when I am
    > iterating I


    ** up the count on the corresponding dictionary element **

    Right at this point, instead of or in addition to counting, why not save
    the large-array index in a list? Then when you've identified the 25
    most common elements, you'll already have a list of pointers to the
    instances to work from.
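    Untested, but roughly (the "where" dict is just an illustrative name,
    using the same loop as the original post):

    d = {}        # value -> count
    where = {}    # value -> list of (array index, position) pairs

    for arr_index, arr in enumerate(my_array):
        for position, i in enumerate(arr):
            key = int(i)
            d[key] = d.get(key, 0) + 1
            where.setdefault(key, []).append((arr_index, position))

    Once you know the 25 most common keys, where[key] already lists every
    instance.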

    Emile
    Emile van Sebille, Jul 5, 2009
    #15
  16. Re: finding most common elements between thousands of multiple arrays.

    On Sat, 04 Jul 2009 15:06:29 -0700, mclovin wrote:

    > Like I said, I need to do this 480,000 times, so to get this done
    > realistically I need to analyse about 5 a second. It appears that the
    > average matrix size contains about 15 million elements.


    Have you considered recording the element counts as you construct the
    arrays? This will marginally increase the time it takes to build the
    array, but turn finding the most frequent elements into a very quick
    operation.



    --
    Steven
    Steven D'Aprano, Jul 5, 2009
    #16
  17. Re: finding most common elements between thousands of multiple arrays.

    On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote:

    > Actually the next step is to maintain a min-heap as you run down the
    > sorted array. Something like:


    Not bad.

    I did some tests on it, using the following sample data:

    arr = np.array([xrange(i, i+7000) for i in xrange(143)] +
                   [[750]*7000] +
                   [xrange(3*i, 3*i+7000) for i in xrange(142)])


    and compared your code against the following simple function:


    def count(arr, N):
        D = {}
        for v in arr:
            for x in v:
                D[x] = D.get(x, 0) + 1
        freq = []
        for el, f in D.iteritems():
            freq.append((f, el))
        return sorted(freq, reverse=True)[:N]


    As a rough figure, your min-heap code is approximately twice as fast as
    mine.
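    (For what it's worth, the final selection over the counts dict can also
    use a bounded heap rather than a full sort. Just an untested sketch of
    that step, not Scott's code; the function name is made up:)

    import heapq

    def top_counts(D, N=25):
        # D maps element -> count; nlargest keeps only the N largest
        # counts on a small heap instead of sorting all the items.
        return heapq.nlargest(N, D.iteritems(), key=lambda item: item[1])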

    To the OP: I think you should start profiling your code and find out
    exactly *where* it is slow and concentrate on that. I think that trying a
    heuristic to estimate the most frequent elements by taking a random
    sample is likely to be a mistake -- whatever you're trying to accomplish
    with the frequency counts, the use of such a heuristic will mean that
    you're only approximately accomplishing it.



    --
    Steven
    Steven D'Aprano, Jul 5, 2009
    #17
  18. Re: finding most common elements between thousands of multiple arrays.

    On Sun, 05 Jul 2009 17:30:58 -0700, Scott David Daniels wrote:

    > Summary: when dealing with numpy, (or any bulk <-> individual values
    > transitions), try several ways that you think are equivalent and
    > _measure_.


    This advice is *much* more general than numpy -- it applies to any
    optimization exercise. People's intuitions about what's fast and what's
    slow are often very wrong.


    --
    Steven
    Steven D'Aprano, Jul 6, 2009
    #18
  19. Peter Otten

    Peter Otten Guest

    Re: finding most common elements between thousands of multiple arrays.

    Scott David Daniels wrote:

    > Scott David Daniels wrote:
    >
    > t = timeit.Timer('sum(part[:-1]==part[1:])',
    >                  'from __main__ import part')

    What happens if you calculate the sum in numpy? Try

    t = timeit.Timer('(part[:-1]==part[1:]).sum()',
                     'from __main__ import part')


    Peter
    Peter Otten, Jul 6, 2009
    #19
  20. "mclovin" <> wrote in message
    news:...
    > Currently I need to find the most common elements in thousands of
    > arrays within one large array (around 2 million instances with ~70k
    > unique elements)
    > [snip]


    Would the following work for you, or am I missing something? For a 5Kx5K
    array, this takes about a tenth of a second on my machine. This code
    doesn't deal with the sub-array issue.

    #####################
    import numpy
    import time

    LOWER = 0
    UPPER = 1024
    SIZE = 5000
    NUM_BEST = 4

    # sample data
    data = numpy.random.randint(LOWER, UPPER, (SIZE,SIZE)).astype(int)

    time.clock()
    count = numpy.bincount(data.flat)
    best = sorted(zip(count, range(len(count))))[-NUM_BEST:]
    print 'time=', time.clock()
    print best
    Andrew Henshaw, Jul 7, 2009
    #20
