searching a value of a dict (each value is a list)

B

bearophileHUGS

Seongsu Lee:
I have a dictionary with million keys. Each value in the dictionary has a list with up to thousand integers.<

Let's say each integer can be represented with 32 bits (if there are
less numbers then a 3-byte representation may suffice, but this makes
things more complex), that is 2^2 bytes. Let's say there are 2^20 keys
each one associated to 2^10 values. So to represent the values you
need 2^32 bytes. It means 4 GB, so I don't think Python suffices to
store them in RAM, because a Python int object requires quite more
than 4 bytes (only represented inside an array.array it may need just
4 bytes).

So if you can use 128 MB RAM to store such data structure you need to
store data on HD too. You probably can use a lower-level language. On
disk you can keep the reverse index, represented as an array of
records/structs, each of such structures keep two 32-bit numbers (so
such array is 8 GB). Such index is sorted according to the first
element of the struct. The first number is the value of the original
dictionary and the second nuber is its key. Inside the RAM you can
keep another sorted array that "summarizes" your whole data. When you
need a number you can do a binary search on the array in RAM, such
array gives you the position where you can read (with a seek) a little
part of the file (512 bytes may suffice), to perform a little binary
search (if the block is very little a linear scan suffices) on it too
to find the number you need. Note that the summarizing data structure
in RAM may be represented with just a Python dict too, so in the end
you can use Python to solve this problem. You may need a lower-level
language to create the 8 GB file on disk (or create it with Python,
but it may take lot of time. You may sort it with the sort unix
command).

This isn't a complete solution, but I think it may work.

Bye,
bearophile
 
M

MonkeeSage

Seongsu Lee:


Let's say each integer can be represented with 32 bits (if there are
less numbers then a 3-byte representation may suffice, but this makes
things more complex), that is 2^2 bytes. Let's say there are 2^20 keys
each one associated to 2^10 values. So to represent the values you
need 2^32 bytes. It means 4 GB, so I don't think Python suffices to
store them in RAM, because a Python int object requires quite more
than 4 bytes (only represented inside an array.array it may need just
4 bytes).

So if you can use 128 MB RAM to store such data structure you need to
store data on HD too. You probably can use a lower-level language. On
disk you can keep the reverse index, represented as an array of
records/structs, each of such structures keep two 32-bit numbers (so
such array is 8 GB). Such index is sorted according to the first
element of the struct. The first number is the value of the original
dictionary and the second nuber is its key. Inside the RAM you can
keep another sorted array that "summarizes" your whole data. When you
need a number you can do a binary search on the array in RAM, such
array gives you the position where you can read (with a seek) a little
part of the file (512 bytes may suffice), to perform a little binary
search (if the block is very little a linear scan suffices) on it too
to find the number you need. Note that the summarizing data structure
in RAM may be represented with just a Python dict too, so in the end
you can use Python to solve this problem. You may need a lower-level
language to create the 8 GB file on disk (or create it with Python,
but it may take lot of time. You may sort it with the sort unix
command).

This isn't a complete solution, but I think it may work.

Bye,
bearophile

Nice. :)

Regards,
Jordan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,774
Messages
2,569,596
Members
45,133
Latest member
MDACVReview
Top