sparse sets of integers?

D

Dan Stromberg

Does anyone have a python implementation (or python C/C+ extension) that
would allow me to perform set operations (union, intersection, difference,
number of elements, &c) over very large collections of integers?

Some of the sets may have over 10**11 (probably less than 10**13
though) integers in them, but there will tend to be runs of integers being
included or not included, so there might be 10**5 consecutive integers
included, then 10**4 that are not included, and then another 10**6 that
are.

Is there such code already available?

I wrote something like this in C ages ago, but:

1) I no longer have the code
2) Who wants to work in C if you can do it fast enough in python anyway? :)

Thanks!
 
R

runes

You have sets in Python 2.3x and 2.4x. I don't know if they can handle
your amounts of data, but i guess so.
 
?

=?iso-8859-1?Q?Fran=E7ois?= Pinard

[runes]
You have sets in Python 2.3x and 2.4x. I don't know if they can handle
your amounts of data, but i guess so.

The original poster (OP) spoke about 10**13 numbers, if I remember
correctly (I did not keep the original message). It all depends on the
real sparsity of the spare sets! :) If not sparse enough, this might be
a bit inordinate for standard Python sets on most machines, also given
that each Python integer uses a lot more than 4 bytes...

On the other hand, the OP also wrote that the numbers he handles are
grouped in big and compact intervals, separated by big unpopulated
holes. So maybe his data could be represented in Python as a
user-written type hiding a sorted list of intervals, for which it should
be relatively easy to write set functions. That might make the problem
tractable enough, with reasonably speedy processing.

As last resort, the wanted sets could be represented as sorted files of
numbers. Slower, but set operations would also be easy to write.
 
T

Terry Reedy

Dan Stromberg said:
Does anyone have a python implementation (or python C/C+ extension) that
would allow me to perform set operations (union, intersection,
difference,
number of elements, &c) over very large collections of integers?

Some of the sets may have over 10**11 (probably less than 10**13
though) integers in them, but there will tend to be runs of integers
being
included or not included, so there might be 10**5 consecutive integers
included, then 10**4 that are not included, and then another 10**6 that
are.

I would call this chunky rather than sparse. Or a set (ordered list) of
integer ranges, which I am sure is how you implemented it before. I did
not find anything Googling 'Python integer range set' but would look harder
before coding.

Terry J. Reedy
 
T

Terry Reedy

Dan Stromberg said:
Does anyone have a python implementation (or python C/C+ extension) that
would allow me to perform set operations (union, intersection,
difference,
number of elements, &c) over very large collections of integers?

Some of the sets may have over 10**11 (probably less than 10**13
though) integers in them, but there will tend to be runs of integers
being
included or not included, so there might be 10**5 consecutive integers
included, then 10**4 that are not included, and then another 10**6 that
are.

I would call this chunky rather than sparse. Or a set (ordered list) of
integer ranges, which I am sure is how you implemented it before. I did
not find anything Googling 'Python integer range set' but would look harder
before coding.

Terry J. Reedy
 
T

Terry Reedy

Dan Stromberg said:
Does anyone have a python implementation (or python C/C+ extension) that
would allow me to perform set operations (union, intersection,
difference,
number of elements, &c) over very large collections of integers?

Some of the sets may have over 10**11 (probably less than 10**13
though) integers in them, but there will tend to be runs of integers
being
included or not included, so there might be 10**5 consecutive integers
included, then 10**4 that are not included, and then another 10**6 that
are.

I would call this chunky rather than sparse. Or a set (ordered list) of
integer ranges, which I am sure is how you implemented it before. I did
not find anything Googling 'Python integer range set' but would look harder
before coding.

Terry J. Reedy
 
R

Raymond Hettinger

[Dan Stromberg]
Does anyone have a python implementation (or python C/C+ extension) that
would allow me to perform set operations (union, intersection, difference,
number of elements, &c) over very large collections of integers?

Some of the sets may have over 10**11 (probably less than 10**13
though) integers in them, but there will tend to be runs of integers being
included or not included, so there might be 10**5 consecutive integers
included, then 10**4 that are not included, and then another 10**6 that
are.

This may help:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/230113


Raymond Hettinger
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top