Scalable python dict {'key_is_a_string': [count, some_val]}


krishna

I have to manage a couple of dicts holding a huge dataset (larger than
fits in memory on my system). Each dict has a key that is a string
(actually a tuple converted to a string) and a two-item list as its
value, where one element of the list is a count related to the key.
At the end I have to sort the dictionary by the count.

The platform is Linux. I am planning to implement it by setting a
threshold beyond which I write the data out to files (3 columns: 'key
count some_val') and later merging those files. I plan to sort the
individual files by the key column, walk through them with one pointer
per file, and add up the counts when entries from two files match by
key, with all sorting done by the 'sort' command. Thus the bottleneck
is the 'sort' command.
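
For concreteness, the spill-and-merge plan above might be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names (spill, read_run, merge_runs) are made up, the columns are tab-separated, keys and values contain no tabs or newlines, and when a key appears in several files the first some_val seen is kept.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def spill(counts, path):
    """Write the in-memory dict to a run file in key order, then empty it."""
    with open(path, "w", encoding="utf-8") as fh:
        for key in sorted(counts):
            count, some_val = counts[key]
            fh.write(f"{key}\t{count}\t{some_val}\n")
    counts.clear()


def read_run(path):
    """Yield (key, count, some_val) tuples from one key-sorted run file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            key, count, some_val = line.rstrip("\n").split("\t")
            yield key, int(count), some_val


def merge_runs(paths):
    """Walk all run files in key order, adding up counts of equal keys."""
    merged = heapq.merge(*(read_run(p) for p in paths), key=itemgetter(0))
    for key, rows in groupby(merged, key=itemgetter(0)):
        rows = list(rows)
        # Assumption: keep the some_val from the first row seen for a key.
        yield key, sum(r[1] for r in rows), rows[0][2]
```

heapq.merge keeps only one pending row per file in memory, which is the "one pointer per file" walk. It relies on every run being sorted in the same order Python uses to compare strings, which holds here because spill() does the sorting itself.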

Any suggestions, comments?

By the way, is there a Linux command that does the merging part?

Thanks,
Krishna
 
Paul Rubin

krishna said:
entries from two files match by key) and sorting using the 'sort'
command. Thus the bottleneck is the 'sort' command.

That is a good approach. The sort command is highly optimized and will
beat any Python program that does something comparable. Set LC_ALL=C if
the file is all ASCII, since that will bypass a lot of slow Unicode
conversion and make sorting go even faster.
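
If the sort is driven from Python, the LC_ALL=C setting can be passed through the environment of the child process. A minimal sketch, assuming a tab-separated file with the key in the first column; sort_by_key is a hypothetical helper name, while the -t, -k and -o options are standard sort options:

```python
import os
import subprocess


def sort_by_key(src, dst):
    """Sort a tab-separated file on its first column with the system sort."""
    # LC_ALL=C makes sort compare raw bytes instead of applying locale rules.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        ["sort", "-t", "\t", "-k", "1,1", "-o", dst, src],
        env=env,
        check=True,  # raise CalledProcessError if sort exits non-zero
    )
```

Note that in the C locale uppercase ASCII letters sort before lowercase ones, so the resulting order can differ from what a locale-aware sort would give.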

krishna said:
By the way, is there a linux command that does the merging part?

sort -m

Note that the sort command already does external sorting, so if you
can just write out one large file and sort it, instead of sorting and
then merging a bunch of smaller files, that may simplify your task.
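
As a rough sketch of how these pieces could fit together from Python: sort -m merges files that are already sorted, a single pass then adds up adjacent lines that share a key, and a last sort orders the result by count. The helper names are hypothetical, and the code assumes tab-separated columns and keeps the first some_val seen for each key. If everything is written to one large file instead, a plain sort on the key column would replace the sort -m step.

```python
import os
import subprocess
from itertools import groupby

C_ENV = dict(os.environ, LC_ALL="C")  # byte-wise comparison for every sort call


def merge_sorted_runs(run_paths, merged_path):
    """Merge files already sorted on column 1; sort -m does not re-sort them."""
    subprocess.run(
        ["sort", "-m", "-t", "\t", "-k", "1,1", "-o", merged_path, *run_paths],
        env=C_ENV, check=True,
    )


def sum_adjacent(sorted_path, out_path):
    """Collapse adjacent lines with the same key, adding up their counts."""
    with open(sorted_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        rows = (line.rstrip("\n").split("\t") for line in src)
        for key, group in groupby(rows, key=lambda r: r[0]):
            group = list(group)
            total = sum(int(r[1]) for r in group)
            # Assumption: keep the some_val from the first line of the group.
            dst.write(f"{key}\t{total}\t{group[0][2]}\n")


def sort_by_count(src, dst):
    """Order the aggregated file by the count column, largest first."""
    subprocess.run(
        ["sort", "-t", "\t", "-k", "2,2nr", "-o", dst, src],
        env=C_ENV, check=True,
    )
```

Only one group of equal keys is held in memory at a time, so the aggregation pass does not depend on the size of the file.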
 
Jonathan Gardner

krishna said:
Any suggestions, comments?

You should be using BDBs or even something like PostgreSQL. The
indexes there will give you the scalability you need. I doubt you will
be able to write anything that will select, update, insert or delete
data better than what BDBs and PostgreSQL can give you.
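
To illustrate the database route, here is a minimal sketch that uses Python's bundled sqlite3 module as a stand-in for PostgreSQL or BDB (a substitution made only so the example is self-contained). The table name, column names and helper functions are all made up for the example.

```python
import sqlite3


def open_store(path):
    """Open (or create) the store; the primary key gives an index on item_key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "item_key TEXT PRIMARY KEY, n INTEGER NOT NULL, some_val TEXT)"
    )
    return conn


def add(conn, key, increment, some_val):
    """Add to the count for key, creating the row the first time it is seen."""
    cur = conn.execute(
        "UPDATE counts SET n = n + ? WHERE item_key = ?", (increment, key)
    )
    if cur.rowcount == 0:  # no existing row was updated, so insert one
        conn.execute(
            "INSERT INTO counts (item_key, n, some_val) VALUES (?, ?, ?)",
            (key, increment, some_val),
        )


def by_count(conn):
    """Iterate over rows ordered by count, largest first."""
    return conn.execute(
        "SELECT item_key, n, some_val FROM counts ORDER BY n DESC"
    )
```

The database does the final sort by count itself. For bulk loading it usually helps to commit in large batches rather than after every row.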
 
Arnaud Delobelle

krishna said:

Any suggestions, comments?

By the way, is there a linux command that does the merging part?

Thanks,
Krishna

Have you looked here? http://docs.python.org/library/persistence.html
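
One of the modules covered there is shelve, which behaves like a dict stored on disk. A small sketch of how it might be used here; the helper names are hypothetical, and since shelve sits on top of a dbm-style file it may run into the same scaling limits as other on-disk hash stores.

```python
import heapq
import shelve


def add(db, key, increment, some_val):
    """Add to the count stored under key, creating the entry if needed."""
    count, _ = db.get(key, (0, some_val))
    # Reassign the whole value: without writeback, in-place changes are not saved.
    db[key] = [count + increment, some_val]


def top_n(db, n):
    """Return the n entries with the largest counts without loading everything."""
    return heapq.nlargest(n, db.items(), key=lambda kv: kv[1][0])
```

A store would be opened with something like `with shelve.open("counts.db") as db:` (the filename is just an example). top_n streams over the entries and keeps only n of them in memory; a full sort by count of a dataset larger than memory would still need an external sort.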
 
geremy condra

krishna said:
Thank you. I tried BDB, but it seems to get very slow as the data scales.

Thank you,
Krishna

Have you tried any of the big key-value store systems, such as CouchDB?

Geremy Condra
 
