efficient intersection of lists with rounding

Gordon Williams · Dec 2, 2004

Hi,

I have two lists and need to find the numbers they have in common (with the
second element rounded to the nearest integer), and I am wondering if there
is a more efficient way of doing it.

a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]
[(123, 1.0), (123, 2.0), (123, 8.0)]

This works, but a and b can be on the order of 30K long.

A couple of other bits of info.
- a and b are ordered smallest to largest (could the bisect module be used?)
- in the future I will want to round the second number to the closest 0.25
rather than to a whole number.

Would the sets module be more efficient?

I'm using python 2.3.

Thanks for any ideas.

Regards,

Gordon Williams

Diez B. Roggisch · Dec 2, 2004

Gordon said:
A couple of other bits of info.

- a and b are ordered smallest to largest (could the bisect module be used?)
- in the future I will want to round the second number to the closest 0.25
rather than to a whole number.

Would the sets module be more efficient?

I'm using python 2.3.

I'd go for something that uses the rounded versions of the lists and then
iterates over the first list and lets the second "catch up". Sorry, I'm too
lazy to describe it better, so here is the code:

a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
a = [ (i,round(j)) for i,j in a]
b = [ (i,round(j)) for i,j in b]

res = []
pos_b = 0

try:
    for i, pivot in a:
        while b[pos_b][1] < pivot:
            pos_b += 1
        while b[pos_b][1] == pivot:
            res.append(b[pos_b])
            pos_b += 1
except IndexError:
    # If b gets exhausted somewhere
    pass
print res

While it looks more complicated, it is certainly faster: its complexity is
O(max(len(a), len(b))), where your code was O(len(a) * len(b)), so usually
more or less quadratic.

The speed gain comes, of course, from the elements being ordered. You could
also fold the rounding _into_ the loops, but that's uglier.
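
A minimal sketch of that "catch up" idea with the rounding folded into the loop. It is written for a current Python 3 rather than the 2.3 discussed in this thread (an assumption), it compares whole (id, rounded value) pairs instead of only the second element, and it reports each common pair once. The names merge_intersect and key are illustrative, not from the thread.

```python
def merge_intersect(a, b, key=round):
    """Intersect two sorted lists of (id, value) pairs after rounding.

    Assumes both inputs are sorted ascending; rounding is monotonic,
    so the rounded pairs are still in non-decreasing order.
    """
    result = []
    pos_b = 0
    for i, j in a:
        pivot = (i, key(j))
        # Let b "catch up" with the current pivot from a.
        while pos_b < len(b) and (b[pos_b][0], key(b[pos_b][1])) < pivot:
            pos_b += 1
        if pos_b == len(b):
            break  # b is exhausted, nothing more can match
        if (b[pos_b][0], key(b[pos_b][1])) == pivot:
            # Record each common rounded pair only once.
            if not result or result[-1] != pivot:
                result.append(pivot)
    return result


a = [(123, 1.3), (123, 2.4), (123, 7.8), (123, 10.2)]
b = [(123, 0.9), (123, 1.9), (123, 8.0)]
print(merge_intersect(a, b))
```

Each list is walked at most once, so the work grows with len(a) + len(b) rather than with their product. Passing a different key function is one way to switch the rounding rule later.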

Steven Bethard · Dec 3, 2004

Gordon said:
I have two lists and need to find the numbers they have in common (with the
second element rounded to the nearest integer), and I am wondering if there
is a more efficient way of doing it.

a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]
[(123, 1.0), (123, 2.0), (123, 8.0)]
[snip]
Would the sets module be more efficient?

Well, in Python 2.3, I believe sets are implemented in Python while
they're implemented in C in Python 2.4. So probably not, unless you
upgrade. A 2.4 solution with sets:

>>> a = [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
>>> b = [(123, 0.9), (123, 1.9), (123, 8.0)]
>>> def roundedj(pairs_iterable):
...     return ((i, round(j)) for i, j in pairs_iterable)
...
>>> set(roundedj(a)).intersection(set(roundedj(b)))
set([(123, 8.0), (123, 2.0), (123, 1.0)])

Steve

Michael Hoffman · Dec 3, 2004

Steven said:
Well, in Python 2.3, I believe sets are implemented in Python while
they're implemented in C in Python 2.4.

I think the Python 2.3 Sets implementation is likely to be quicker than
whatever list-manipulation answer you come up with instead. But there's
only one way to find out.

Greg Ewing · Dec 3, 2004

Gordon said:
a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]

d = {}
for (l, m) in b:
    d[l, round(m)] = 1

result = []
for (i, j) in a:
    t = (i, round(j))
    if t in d:
        result.append(t)

- in the future I will want to round the second number of closest 0.25
rather than whole number.

I would do that by multiplying by 4 and rounding to
an integer to derive the dictionary key. That will
avoid any float-representation problems you might have
by trying to round to a fraction.
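
A small sketch of that multiply-by-4 key, assuming a current Python 3 (not the 2.3 used in this thread) and a set in place of the dictionary. The helper names quarter_key and intersect_quarters and the sample values in b are made up for illustration.

```python
def quarter_key(pair):
    """Map (id, value) to (id, number of quarters) using an integer key."""
    i, j = pair
    return (i, int(round(j * 4)))


def intersect_quarters(a, b):
    """Return pairs from a whose value, rounded to 0.25, also occurs in b."""
    seen = {quarter_key(p) for p in b}
    result = []
    for p in a:
        k = quarter_key(p)
        if k in seen:
            # Convert the integer key back to a multiple of 0.25 for display.
            result.append((k[0], k[1] / 4.0))
    return result


a = [(123, 1.3), (123, 2.4), (123, 7.8), (123, 10.2)]
b = [(123, 1.2), (123, 2.6), (123, 7.75)]  # hypothetical sample data
print(intersect_quarters(a, b))
```

Matching is done on integers, so no two float results have to compare equal; the division by 4.0 happens only when building the output.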

Would the sets module be more efficient?

As another poster said, sets are implemented as dicts
in 2.3, so it comes down to much the same thing. Using
sets might be a bit faster than the above code in 2.4,
but probably not greatly so. By far the biggest
improvement will come from using an O(n) algorithm
instead of an O(n**2) one.

Adam DePrince · Dec 3, 2004

Greg said:
Gordon said:
a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]

d = {}
for (l, m) in b:
    d[l, round(m)] = 1

result = []
for (i, j) in a:
    t = (i, round(j))
    if t in d:
        result.append(t)

- in the future I will want to round the second number of closest 0.25
rather than whole number.

I would do that by multiplying by 4 and rounding to
an integer to derive the dictionary key. That will
avoid any float-representation problems you might have
by trying to round to a fraction.

Would the sets module be more efficient?

As another poster said, sets are implemented as dicts
in 2.3, so it comes down to much the same thing. Using
sets might be a bit faster than the above code in 2.4,
but probably not greatly so. By far the biggest
improvement will come from using an O(n) algorithm
instead of an O(n**2) one.

Of course a low big-O is important; however, you should avoid confusing
the statement of what you want to do with the statement of how you want
to do it. One of the benefits of an HLL like Python is that you can
merely state *what* you want without worrying about how to compute it.

In the original example above you are computing a set intersection, and
Python's set object has an intersection method. Use it. Not only is it
faster than your O(n**2) solution, but it is a good deal clearer.

>>> from sets import Set
>>> set_a = Set( [(i,round(j)) for i,j in a] )
>>> set_b = Set( [(i,round(j)) for i,j in b] )
>>> set_a.intersection( set_b )
Set([(123, 2.0), (123, 1.0), (123, 8.0)])

Or you could say ...

>>> set_a, set_b = [Set([(i,round(j)) for i,j in s]) for s in (a,b)]
Adam DePrince

Steven Bethard · Dec 3, 2004

Michael said:
I think the Python 2.3 Sets implementation is likely to be quicker than
whatever list-manipulation answer you come up with instead. But there's
only one way to find out

Yeah, almost certainly, since he's looking at lists 30K long. If they
were small, you never know, since the list comprehension gets the C-code
speedup, while sets.Set is Python code:

> python -m timeit -s "a = [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]; b = [(123, 0.9), (123, 1.9), (123, 8.0)]" "[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]"
10000 loops, best of 3: 27.5 usec per loop

> python -m timeit -s "import sets; a = [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]; b = [(123, 0.9), (123, 1.9), (123, 8.0)]" "sets.Set([(i,round(j)) for i,j in a]).intersection(sets.Set([(i, round(j)) for i, j in b]))"
10000 loops, best of 3: 47.7 usec per loop

In the case given, the O(n**2) list comprehension is faster than the
O(n) set intersection. Of course, this is not likely to be true with
any reasonably sized data. But it's something worth keeping in mind.
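
One way to check where the crossover lies is to time both approaches at more than one size. This is only a sketch under assumptions: it targets a current Python 3 with built-in sets rather than sets.Set from 2.3, the generated data and the sizes are arbitrary, and the helper names make_lists, nested and with_sets are illustrative. No timings are quoted here because they depend on the machine.

```python
import timeit


def make_lists(size):
    """Build two sorted lists whose rounded values overlap in size - 1 places."""
    a = [(123, i + 0.2) for i in range(size)]  # rounds to i
    b = [(123, i + 0.9) for i in range(size)]  # rounds to i + 1
    return a, b


def nested(a, b):
    """The O(len(a) * len(b)) list comprehension from the original post."""
    return [(i, round(j)) for i, j in a for l, m in b
            if (i, round(j)) == (l, round(m))]


def with_sets(a, b):
    """Set intersection of the rounded pairs."""
    return {(i, round(j)) for i, j in a} & {(i, round(j)) for i, j in b}


for size in (4, 200):
    a, b = make_lists(size)
    # Both approaches should agree on the set of common pairs.
    assert set(nested(a, b)) == with_sets(a, b)
    t_nested = timeit.timeit(lambda: nested(a, b), number=5)
    t_sets = timeit.timeit(lambda: with_sets(a, b), number=5)
    print(size, "nested: %.6f s" % t_nested, "sets: %.6f s" % t_sets)
```

Raising the larger size toward the 30K mentioned in the question makes the quadratic version very slow, so it is worth increasing it gradually.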

Steve

Raymond Hettinger · Dec 3, 2004

Gordon Williams said:
Hi,

I have two lists and need to find the numbers they have in common (with the
second element rounded to the nearest integer), and I am wondering if there
is a more efficient way of doing it.

a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]
[(123, 1.0), (123, 2.0), (123, 8.0)]

This works, but a and b can be on the order of 30K long.

A couple of other bits of info.
- a and b are ordered smallest to largest (could the bisect module be used?)
- in the future I will want to round the second number to the closest 0.25
rather than to a whole number.

Would the sets module be more efficient?

Yes:

>>> set([(x,round(y)) for x,y in a]) & set([(x,round(y)) for x,y in b])
set([(123, 8.0), (123, 2.0), (123, 1.0)])

Gordon Williams said:
I'm using python 2.3.

>>> from sets import Set as set
>>> set([(x,round(y)) for x,y in a]) & set([(x,round(y)) for x,y in b])
set([(123, 8.0), (123, 2.0), (123, 1.0)])

Raymond Hettinger

Michael Hoffman · Dec 3, 2004

Steven said:
Yeah, almost certainly, since he's looking at lists 30K long. If they
were small, you never know, since the list comprehension gets the C-code
speedup, while sets.Set is Python code:

> [list comprehension]
10000 loops, best of 3: 27.5 usec per loop

> [Python 2.3 Set]
10000 loops, best of 3: 47.7 usec per loop

In the case given, the O(n**2) list comprehension is faster than the
O(n) set intersection. Of course, this is not likely to be true with
any reasonable sized data. But it's something worth keeping in mind.

Of course if you're working with a dataset that small, it probably
doesn't really matter which of these implementations you use.

The exception would be if this were in an inner loop in the actual
program and *were* being run 10000 times or more.

Gordon Williams · Dec 6, 2004

Hi,

I have two lists and need to find the numbers they have in common (with the
second element rounded to the nearest integer), and I am wondering if there
is a more efficient way of doing it.

a= [(123,1.3),(123,2.4),(123,7.8),(123,10.2)]
b= [(123, 0.9), (123, 1.9), (123, 8.0)]
[(i,round(j)) for i,j in a for l,m in b if (i,round(j)) == (l,round(m))]
[(123, 1.0), (123, 2.0), (123, 8.0)]

Thanks for all your suggestions. I've tried each one with lists 1K, 10K
and 30K long and tabulated the results below. Run with profile on W2K,
Python 2.3.2, 1GHz Athlon.

Seconds per call, by list length:

        1K      10K     30K
t1      0.009   0.148   0.563
t2      0.015   0.217   0.777
t3      0.008   0.108   0.487
t4      0.016   0.190   0.749
t5      0.015   0.224   0.773

The non-set algorithms (t1, t3) came out the winners (maybe due to the
conversion of the set to a sorted list in the others; I didn't look into
it any further).

Regards,

Gordon Williams

--------

from sets import Set
import random

size = 1000

a = [(123,i+random.choice([-.2,-.1,.1,.2])) for i in range(size)]
b = [(123, 1+i+random.choice([-.2,-.1,.1,.2])) for i in range(size)]

def t1():
    # Diez B. Roggisch
    ra = [ (i,round(j)) for i,j in a]
    rb = [ (i,round(j)) for i,j in b]

    res = []
    pos_b = 0

    try:
        for i, pivot in ra:
            while rb[pos_b][1] < pivot:
                pos_b += 1
            while rb[pos_b][1] == pivot:
                res.append(rb[pos_b])
                pos_b += 1
    except IndexError:
        # If b gets exhausted somewhere
        pass
    return res

def t2():
    # Steven Bethard
    def roundedj(pairs_iterable):
        return [(i, round(j)) for i, j in pairs_iterable]

    l=list(Set(roundedj(a)).intersection(Set(roundedj(b))))
    l.sort()
    return l

def t3():
    # Greg Ewing
    d = {}
    for (l, m) in b:
        d[l, round(m)] = 1

    result = []
    for (i, j) in a:
        t = (i, round(j))
        if t in d:
            result.append(t)
    return result

def t4():
    # Adam DePrince
    set_a = Set( [(i,round(j)) for i,j in a] )
    set_b = Set( [(i,round(j)) for i,j in b] )
    l= list(set_a.intersection( set_b ))
    l.sort()
    return l

def t5():
    # Raymond Hettinger
    l= list(Set([(x,round(y)) for x,y in a]) & Set([(x,round(y)) for x,y in b]))
    l.sort()
    return l

def test():
    r1= t1()
    r2= t2()
    r3= t3()
    r4= t4()
    r5= t5()
