Best way to handle large lists?

Chaz Ginger

I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
 
Duncan Booth

Chaz Ginger said:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

How about:

smallSet = set(smallList)
something = [ i for i in bigList if i not in smallSet ]

Use timeit.py on some representative data to see what difference that
makes.
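
For instance, a rough timing harness along these lines (bigList and
smallList here are made-up stand-ins for your data):

import timeit

setup = """
import random
bigList = [str(i) for i in xrange(10000)]
smallList = [str(random.randrange(10000)) for i in xrange(100)]
smallSet = set(smallList)
"""

# list membership scans smallList once per element of bigList
print timeit.Timer("[i for i in bigList if i not in smallList]",
                   setup).timeit(number=10)
# set membership is a single hash lookup per element
print timeit.Timer("[i for i in bigList if i not in smallSet]",
                   setup).timeit(number=10)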
 
Paul Rubin

Chaz Ginger said:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

diff = list(set(bigList) - set(smallList))

Note that doesn't get you the elements in any particular order.
 
durumdara

Chaz Ginger wrote:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
Hi!

If you have big lists, you can use dbm-like databases. They are very
quick: BSDDB, flashdb, etc. See Sleepycat, or see the Python help.

"in" is very slow on large datasets, but bsddb uses hash values, so it
is very quick. The Sleepycat database has many extras; you can set the
cache size and many other parameters.

Or if you don't like dbm-style databases, you can use SQLite. It is also
quick, and you can use SQL commands. It is a little slower than bsddb,
but it works like an SQL server, and you can improve the speed with
special parameters.
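
A minimal sketch of the SQLite route, using the sqlite3 module bundled
with Python 2.5 (the lists are invented stand-ins for real data):

import sqlite3

bigList = ['host%d.example.com' % i for i in range(10000)]
smallList = ['host%d.example.com' % i for i in range(100)]

con = sqlite3.connect(':memory:')   # use a filename to keep data on disk
con.execute('CREATE TABLE big (item TEXT)')
con.execute('CREATE TABLE small (item TEXT PRIMARY KEY)')
con.executemany('INSERT INTO big VALUES (?)', [(s,) for s in bigList])
con.executemany('INSERT INTO small VALUES (?)', [(s,) for s in smallList])

# anti-join: rows of big with no match in small
diff = [row[0] for row in con.execute(
    'SELECT item FROM big WHERE item NOT IN (SELECT item FROM small)')]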

dd
 
Bill Williams

I don't know enough about Python internals, but the suggested solutions
all seem to involve scanning bigList. Can this presumably linear
operation be avoided by using a dict or similar to find all occurrences
of smallList items in bigList and then deleting those occurrences?
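
Something like this sketch, I imagine (the lists are stand-ins; note it
still walks bigList once to build the index):

# index every position of each value, then delete matching positions
bigList = ['a', 'b', 'a', 'c', 'd']     # stand-ins for the real data
smallList = ['a', 'd']

positions = {}
for index, item in enumerate(bigList):
    positions.setdefault(item, []).append(index)

result = list(bigList)          # keep the original list intact
doomed = []
for item in smallList:
    doomed.extend(positions.get(item, []))
for index in sorted(doomed, reverse=True):
    del result[index]           # delete from the end so indexes stay valid

print result                    # -> ['b', 'c']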

Bill Williams
 
Hari Sekhon

I don't know much about the python internals either, so this may be the
blind leading the blind, but aren't dicts much slower to work with than
lists and therefore wouldn't your suggestion to use dicts be much
slower? I think it's something to do with the comparative overhead of
using keys in dicts rather than using positional indexes in lists/arrays...

At least that is what I thought.

Can anyone confirm this?

-h

Hari Sekhon



Bill said:
I don't know enough about Python internals, but the suggested solutions
all seem to involve scanning bigList. Can this presumably linear
operation be avoided by using a dict or similar to find all occurrences
of smallList items in bigList and then deleting those occurrences?

Bill Williams



Chaz Ginger said:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
 
Sybren Stuvel

Bill Williams enlightened us with:
I don't know enough about Python internals, but the suggested
solutions all seem to involve scanning bigList. Can this presumably
linear operation be avoided by using a dict or similar to find all
occurrences of smallList items in bigList and then deleting those
occurrences?

And how would that beat O(n)? Every element of bigList has to be
scanned at some point, either to compare it to every earlier element in
bigList and eliminate it, or to compare it to every element in
smallList.

Run benchmarks on the suggestions, and see which is fastest for
yourself.

Sybren
 
Chaz Ginger

I've done that and decided that Python's list comprehension isn't the
way to go. I was hoping that perhaps someone had experience with a C or
C++ library with a Python interface that would make a difference.

Chaz
 
Paul Rubin

Sybren Stuvel said:
And how would that beat O(n)? Every element of bigList has to be
scanned at one point, either to compare it to every earlier element in
bigList and eliminate it, or to compare it to every element in
smallList.

Maybe the application should use sets instead of lists for these
collections.
 
Larry Bates

Chaz said:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz


IMHO the only way to speed things up is to know more about the
actual data in the lists (e.g. are the elements unique, can they
be sorted, etc.) and take advantage of all that information to
come up with a "faster" algorithm. If they are unique, sets
might be a good choice. If they are sorted, the bisect module
might help. The specifics of the list(s) may yield a faster
method.
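
For example, a minimal sketch of the bisect idea, assuming you can keep
both lists sorted (the names and data here are invented):

import bisect

def diff_sorted(big, small):
    # return items of big that are not in small; both must be sorted
    result = []
    for item in big:
        j = bisect.bisect_left(small, item)
        if j == len(small) or small[j] != item:
            result.append(item)
    return result

big = sorted(['foo.bar.com', 'baz.bar.com', 'www.example.com'])
small = sorted(['baz.bar.com'])
print diff_sorted(big, small)   # -> ['foo.bar.com', 'www.example.com']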

-Larry
 
Jeremy Sanders

Chaz said:
What would sets do for me over lists?

It's faster to tell whether something is in a set or dict than in a list
(above some minimum size).

Jeremy
 
Chaz Ginger

Larry said:
Chaz said:
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz


IMHO the only way to speed things up is to know more about the
actual data in the lists (e.g. are the elements unique, can they
be sorted, etc.) and take advantage of all that information to
come up with a "faster" algorithm. If they are unique, sets
might be a good choice. If they are sorted, the bisect module
might help. The specifics of the list(s) may yield a faster
method.

-Larry
Each item in the list is a fully qualified domain name, e.g.
foo.bar.com. The order in the list has no importance. That is about all
there is to the lists, other than to say the number of items in a list
can top out at about 10,000.

Chaz
 
Richard Brodie

Each item in the list is a fully qualified domain name, e.g.
foo.bar.com. The order in the list has no importance.

So you don't actually need to use lists at all, then.
You can just use sets and write:

newSet = bigSet - littleSet
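
If the rest of the code still hands you lists, converting at the edges
is cheap. A toy example with made-up names:

bigSet = set(['foo.bar.com', 'baz.bar.com', 'www.example.com'])
littleSet = set(['baz.bar.com'])

newSet = bigSet - littleSet     # neither input set is modified
print sorted(newSet)            # ['foo.bar.com', 'www.example.com']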
 
Jeremy Sanders

Jeremy said:
It's faster to tell whether something is in a set or dict than in a list
(above some minimum size).

As a footnote, this program

import random
num = 100000

a = set( range(num) )
for i in range(100000):
    x = random.randint(0, num-1) in a

completes in less than a second, whereas

import random
num = 100000

a = range(num)
for i in range(100000):
    x = random.randint(0, num-1) in a

takes a long time on my computer.

Jeremy
 
Chaz Ginger

Jeremy said:
As a footnote, this program

import random
num = 100000

a = set( range(num) )
for i in range(100000):
    x = random.randint(0, num-1) in a

completes in less than a second, whereas

import random
num = 100000

a = range(num)
for i in range(100000):
    x = random.randint(0, num-1) in a

takes a long time on my computer.

Jeremy
Thanks Jeremy. I am in the process of converting my stuff to use sets! I
wouldn't have thought it would have made that big a deal! I guess it is
live and learn.

Peace,
Chaz
 
Fredrik Lundh

Hari said:
That is surprising since I read on this list recently that lists were
faster than dicts. It was one reason that was cited as to why local
vars are better than global vars.

depends on what you're doing with them, of course.

L[int] is indeed a bit faster than D[string] (but not much), but that
doesn't mean that you can loop over *all* items in a list faster than
you can look up a single key in a dictionary.
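
A rough illustration of the operations being compared (a sketch; time
them yourself):

L = range(100000)
D = dict((str(i), i) for i in range(100000))

x = L[50000]      # indexed list access: a single fetch
y = D['50000']    # dict lookup: hash the key, probe the table
z = 99999 in L    # membership scan: up to 100000 comparisons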

</F>
 
durumdara

Hi!

Chaz Ginger said:
Thanks Jeremy. I am in the process of converting my stuff to use sets! I
wouldn't have thought it would have made that big a deal! I guess it is
live and learn.

If you have simple records with a big amount of data, you can try
dbhash. With it you don't run out of memory...

dd


import dbhash
import time
import random
import gc

itemcount = 250000

# fill an on-disk hash database with itemcount string keys
db = dbhash.open('test.dbh', 'w')
for i in range(itemcount):
    db[str(i)] = str(i)

# pick 1000 distinct random keys to look up
littlelist = []
littleset = set()

while len(littlelist) < 1000:
    x = str(random.randint(0, itemcount - 1))
    if x not in littlelist:
        littlelist.append(x)
        littleset.add(x)

def DBHash():
    # membership tests against the on-disk hash database
    gc.collect()
    hk = db.has_key
    st = time.time()
    newlist = []
    for val in littlelist:
        if hk(val):
            newlist.append(val)
    et = time.time()
    print "Size", len(newlist)
    newlist.sort()
    print "Hash", hash(str(newlist))
    print "Time", "%.4f" % (et - st)
    print

def Set():
    # membership tests via an in-memory set intersection
    gc.collect()
    largeset = set()
    for i in range(itemcount):
        largeset.add(str(i))

    st = time.time()
    newset = largeset.intersection(littleset)
    newsetlist = []
    while newset:
        newsetlist.append(newset.pop())
    et = time.time()
    print "Size", len(newsetlist)
    newsetlist.sort()
    print "Hash", hash(str(newsetlist))
    print "Time", "%.4f" % (et - st)

DBHash()
Set()
 
GHUM

Chaz said:
What would sets do for me over lists?

searching for an element in a list is O(n)
searching for an element in a set is O(1) (for reasonably distributed
elements)

Harald
 
Fredrik Lundh

Hari said:
So are you saying that using a dict means a faster search since you only
need to look up one value?

I would think that you would have to look through the keys and stop at
the first key that matches, since each key has to be unique, so perhaps
if it is nearer the front of the set of keys it would be a quicker
lookup?

http://en.wikipedia.org/wiki/Hash_table

</F>
 
