Match items in large list

Fisherking

Hi,

I hope you can help me with this optimization problem!
I have a large list of dictionaries (about 250,000 of them). Two or
more of these dictionaries contain the same data (no more than five
or six of them per item), e.g. [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6},
{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, ...].
(The dictionaries really contain phone numbers, call durations and
call times.)

What is the best way of searching through the list and extracting the
items that are the same? In the first run I want to find every item
that is the same as {'a':1,'b':'2','c':3}, in the second
{'d':4,'e':'5','f':6}, etc. The list is not read-only and I can pop
items out of it when I have found them!

(How can I find items that are almost the same? e.g.
{'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

In the second part of this program I need to take each of these items
(similar items are treated as one) and find them in a database that
contains 2.5M posts!

The Python version I am bound to is Python 2.3.

thanks for your help!
 

Paul Rubin

Fisherking said:
What is the best way of searching through the list and extracting the
items that are the same?

Sort the list, then scan through looking for duplicates, which will be
adjacent to each other.
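In modern Python, the sort-and-scan idea might be sketched as follows. Dicts themselves aren't orderable, so this sketch pairs each record with a canonical tuple of its sorted items and sorts on that; the function name `find_duplicate_runs` is invented for illustration:

```python
def find_duplicate_runs(records):
    # Dicts can't be compared directly, so pair each record with a
    # canonical (key, value) tuple and sort on that.
    keyed = [(tuple(sorted(d.items())), d) for d in records]
    keyed.sort(key=lambda pair: pair[0])
    runs, i = [], 0
    while i < len(keyed):
        j = i + 1
        # Identical records now sit adjacent; extend j over the run.
        while j < len(keyed) and keyed[j][0] == keyed[i][0]:
            j += 1
        if j - i > 1:  # two or more identical records
            runs.append([rec for _, rec in keyed[i:j]])
        i = j
    return runs

records = [{'a': 1, 'b': '2', 'c': 3}, {'d': 4, 'e': '5', 'f': 6},
           {'a': 1, 'b': '2', 'c': 3}]
runs = find_duplicate_runs(records)
```

Sorting costs O(n log n), after which the duplicate scan is a single linear pass, which is why this beats repeated list searches for 250,000 records.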
 

Peter Otten

Fisherking said:
I hope you can help me with this optimization problem!
I have a large list of dictionaries (about 250,000 of them). Two or
more of these dictionaries contain the same data (no more than five
or six of them per item), e.g. [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6},
{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, ...].
(The dictionaries really contain phone numbers, call durations and
call times.)

I'd assume the dictionaries to look like

{"phone": "0123...", "duration": 5.67, "calltime": "10:42"}

What do the different keys ("a", "b", "c") versus ("d", "e", "f") mean?
What is the best way of searching through the list and extracting the
items that are the same? In the first run I want to find every item
that is the same as {'a':1,'b':'2','c':3}, in the second
{'d':4,'e':'5','f':6}, etc. The list is not read-only and I can pop
items out of it when I have found them!

items = [dict(a=1, b=2), dict(a=1, b=2), dict(a=3, c=2), dict(a=3, c=2.2),
         dict(a=3, c=2.6)]

# 2.5
seen = set()
for row in items:
    rowkey = frozenset(row.iteritems())
    if rowkey in seen:
        print "dupe", row
    else:
        seen.add(rowkey)

# 2.3 (and older); not tested
seen = {}
for row in items:
    rowkey = row.items()
    rowkey.sort()
    rowkey = tuple(rowkey)
    if rowkey in seen:
        print "dupe", row
    else:
        seen[rowkey] = 1
(How can I find items that are almost the same? e.g.
{'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

If you are content with detecting similarities on a single "axis":

def window(items):
    items = iter(items)
    prev = items.next()
    for cur in items:
        yield cur, prev
        prev = cur

def todict(k, v, rest):
    d = dict(rest)
    d[k] = v
    return d

axis = "c"
d = {}
eps = 0.5
for row in items:
    row = dict(row)
    try:
        value = row.pop(axis)
    except KeyError:
        pass
    else:
        key = frozenset(row.iteritems())
        d.setdefault(key, []).append(value)

for key, values in d.iteritems():
    if len(values) > 1:
        for cur, prev in window(sorted(values)):
            if abs(cur - prev) < eps:
                print "similar rows", todict(axis, cur, key), todict(axis, prev, key)

Otherwise you have to define a distance function and use a clustering
algorithm. Segaran's "Programming Collective Intelligence" has some
examples in Python, though the code can hardly be called idiomatic.
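A minimal sketch of such a distance function, assuming records may differ only in one numeric field (the key name 'c' and the function name are illustrative, not from the thread):

```python
def record_distance(r1, r2, numeric_key='c'):
    # Records are "close" when every non-numeric field matches exactly;
    # the distance is then the gap in the numeric field, else infinity.
    rest1 = dict((k, v) for k, v in r1.items() if k != numeric_key)
    rest2 = dict((k, v) for k, v in r2.items() if k != numeric_key)
    if rest1 != rest2:
        return float('inf')
    return abs(r1.get(numeric_key, 0) - r2.get(numeric_key, 0))

a = {'a': 1, 'b': '2', 'c': 3.1}
b = {'a': 1, 'b': '2', 'c': 3.2}
close = record_distance(a, b) < 0.5  # small gap -> "almost the same"
```

A clustering algorithm would then group records whose pairwise distance falls under a chosen threshold.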

Peter
 

MRAB

Fisherking said:
Hi,

I hope you can help me with this optimization problem!
I have a large list of dictionaries (about 250,000 of them). Two or
more of these dictionaries contain the same data (no more than five
or six of them per item), e.g. [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6},
{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, ...].
(The dictionaries really contain phone numbers, call durations and
call times.)

What is the best way of searching through the list and extracting the
items that are the same? In the first run I want to find every item
that is the same as {'a':1,'b':'2','c':3}, in the second
{'d':4,'e':'5','f':6}, etc. The list is not read-only and I can pop
items out of it when I have found them!

(How can I find items that are almost the same? e.g.
{'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

In the second part of this program I need to take each of these items
(similar items are treated as one) and find them in a database that
contains 2.5M posts!

The Python version I am bound to is Python 2.3.

The best way of looking for duplicates is to make a dict or set of the
items. Unfortunately a dict isn't hashable, so it can't be a key.
Therefore I suggest you convert the dicts into tuples, which are
hashable:

counts = {}
for d in my_list:
    d_as_list = d.items()
    # Standardise the order so that identical dicts become identical tuples.
    d_as_list.sort()
    key = tuple(d_as_list)
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1
 

Terry Reedy

Fisherking said:
Hi,

I hope you can help me with this optimization problem!
I have a large list of dictionaries (about 250,000 of them). Two or
more of these dictionaries contain the same data (no more than five
or six of them per item), e.g. [{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6},
{'a':1,'b':'2','c':3}, {'d':4,'e':'5','f':6}, ...].
(The dictionaries really contain phone numbers, call durations and
call times.)

What is the best way of searching through the list and extracting the
items that are the same? In the first run I want to find every item
that is the same as {'a':1,'b':'2','c':3}, in the second
{'d':4,'e':'5','f':6}, etc. The list is not read-only and I can pop
items out of it when I have found them!

Popping items out of the middle of a list is a slow O(n) operation.
MRAB's set of tuples is what I would have suggested.
(How can I find items that are almost the same? e.g.
{'a':1,'b':'2','c':3.1} and {'a':1,'b':'2','c':3.2})

In the second part of this program I need to take each of these items
(similar items are treated as one) and find them in a database that
contains 2.5M posts!

The Python version I am bound to is Python 2.3.

For that, a dict of tuples (with a fake value of None) might be faster
than the add-on sets module.
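The dict-as-set pattern might look roughly like this (written here with `sorted`, which arrived in 2.4, so on 2.3 itself you would call `.sort()` on the items list instead; on modern Python you would simply use the built-in set):

```python
# Use a plain dict as a set: keys are canonical record tuples,
# values are a throwaway placeholder (None).
records = [{'a': 1, 'b': '2', 'c': 3}, {'d': 4, 'e': '5', 'f': 6},
           {'a': 1, 'b': '2', 'c': 3}]
seen = {}
unique = []
for rec in records:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen[key] = None     # membership is all that matters
        unique.append(rec)   # first occurrence kept, duplicates dropped
```

Membership tests on dict keys are O(1) on average, so one pass over the 250,000 records suffices.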
 

Fisherking

Thank you all for your answers! I must say I'm really impressed by the
willingness to help and the quick responses in this group (since this
is my first post)!

I solved the problem using SQL queries. I added an id, the same for
each item in a chain (two or more similar posts), and just updated the
database for each unique post. I had to do this anyway and I think it
is the simplest and quickest way!

thanks again!
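The poster's table layout isn't shown, but the chain-id idea can be sketched against a hypothetical schema (table and column names here are invented for illustration), using sqlite3:

```python
import sqlite3

# Hypothetical call-record table; the real schema isn't given in the thread.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE calls (phone TEXT, duration REAL, calltime TEXT)')
rows = [('0701234567', 5.0, '10:42'),
        ('0701234567', 5.0, '10:42'),
        ('0709876543', 2.5, '11:03')]
conn.executemany('INSERT INTO calls VALUES (?, ?, ?)', rows)

# Give every member of a duplicate chain the same chain_id by grouping
# on all columns, in the spirit of the poster's approach.
conn.execute('ALTER TABLE calls ADD COLUMN chain_id INTEGER')
groups = conn.execute(
    'SELECT phone, duration, calltime FROM calls '
    'GROUP BY phone, duration, calltime').fetchall()
for chain_id, (phone, duration, calltime) in enumerate(groups):
    conn.execute(
        'UPDATE calls SET chain_id = ? '
        'WHERE phone = ? AND duration = ? AND calltime = ?',
        (chain_id, phone, duration, calltime))

n_unique = conn.execute(
    'SELECT COUNT(DISTINCT chain_id) FROM calls').fetchone()[0]
```

Pushing the deduplication into the database sidesteps the in-memory problem entirely, which also fits the second part of the task (matching against the 2.5M-row table).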
 
