Cache a large list to disk


Chris

I have a set of routines, the first of which reads lots and lots of
data from disparate regions of disk. This read routine takes 40
minutes on a P3-866 (with IDE drives). This routine populates an
array with a number of dictionaries, e.g.,

[{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0},
{'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51},
{'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119},
{'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]
(not actually the data I'm reading)

This information is acted upon by subsequent routines. These routines
change very often, but the data changes very infrequently (the
opposite pattern of what I'm used to). This data changes once per
week, so I can safely cache this data to a big file on disk, and read
out of this big file -- rather than having to read about 10,000 files
-- when the program is loaded.

Now, if this were C I'd know how to do this in a pretty
straightforward manner. But being new to Python, I don't know how I
can (hopefully easily) write this data to a file, and then read it out
into memory on subsequent launches.

If anyone can provide some pointers, or even some sample code on how
to accomplish this, it would be greatly appreciated.

Thanks in advance for any help.
-cjl
 

Peter Otten

Chris said:
week, so I can safely cache this data to a big file on disk, and read
out of this big file -- rather than having to read about 10,000 files
-- when the program is loaded.

Now, if this were C I'd know how to do this in a pretty
straightforward manner. But being new to Python, I don't know how I
can (hopefully easily) write this data to a file, and then read it out
into memory on subsequent launches.

Have a look at pickle:
>>> data = [{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0},
... {'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51},
... {'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119},
... {'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]
>>> import cPickle
>>> cPickle.dump(data, file("data.pickle", "wb"))
>>> cPickle.load(file("data.pickle", "rb"))
[{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0}, {'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51}, {'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119}, {'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]
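For the write-once / read-many pattern described in the question, the whole cache can live behind one helper. Here is a minimal sketch in current Python 3, where the plain pickle module already uses the fast C implementation; the cache path and the build function are hypothetical:

```python
import os
import pickle
import tempfile

# Hypothetical cache location; a real program would pick one fixed path.
CACHE = os.path.join(tempfile.mkdtemp(), "scan_cache.pickle")

def load_or_build(build_func):
    """Return the cached list of dicts, rebuilding only when no cache exists."""
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = build_func()                      # the slow 40-minute scan goes here
    with open(CACHE, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data

rows = load_or_build(lambda: [{'el1': 9, 'el2': 15}])
```

After the weekly data refresh, deleting the cache file is enough to force a rebuild on the next launch.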

Peter
 

Svein Ove Aas

Peter said:
Chris said:
week, so I can safely cache this data to a big file on disk, and read
out of this big file -- rather than having to read about 10,000 files
-- when the program is loaded.

Now, if this were C I'd know how to do this in a pretty
straightforward manner. But being new to Python, I don't know how I
can (hopefully easily) write this data to a file, and then read it out
into memory on subsequent launches.

Have a look at pickle:
data = [{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0},
... {'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51},
... {'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119},
... {'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]

And, yes, cPickle is faster. A lot faster.

There are switches you can throw (the pickle protocol argument) to have it
use binary instead of sticking to readable characters, for some savings, too.
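Those switches are the pickle protocol argument. A small sketch comparing the default readable protocol 0 against the highest binary protocol (the sample rows are made up):

```python
import pickle

# Made-up rows shaped like the question's data.
rows = [{'el1': i, 'el2': 3 * i, 'el3': 7 * i} for i in range(1000)]

text_form = pickle.dumps(rows, protocol=0)                # ASCII, human-readable
binary_form = pickle.dumps(rows, pickle.HIGHEST_PROTOCOL) # compact binary

# Both round-trip to the same objects; the binary form is noticeably smaller.
print(len(text_form), len(binary_form))
```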
 

Paul Rubin

I have a set of routines, the first of which reads lots and lots of
data from disparate regions of disk. This read routine takes 40
minutes on a P3-866 (with IDE drives). This routine populates an
array with a number of dictionaries, e.g.,

[{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0},
{'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51},
{'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119},
{'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]
(not actually the data I'm reading)

Are the dict keys the same for each list item?
Now, if this were C I'd know how to do this in a pretty
straightforward manner. But being new to Python, I don't know how I
can (hopefully easily) write this data to a file, and then read it out
into memory on subsequent launches.

If anyone can provide some pointers, or even some sample code on how
to accomplish this, it would be greatly appreciated.

I dunno what the question is. You can open files, seek on them, etc.
in Python just like in C. You can use the mmap module to map a file
into memory. If you want to lose some efficiency, you can write out
the Python objects (dicts, lists, etc.) with the pickle or cPickle modules.
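The mmap suggestion can be sketched as follows, assuming fixed-size binary records; the record layout and file name are made up for illustration:

```python
import mmap
import os
import struct
import tempfile

# Write three fixed-size records (3 little-endian ints each) to a scratch file.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
REC = struct.Struct("<3i")                   # 12 bytes per record
with open(path, "wb") as f:
    for rec in [(0, 0, 0), (9, 15, 21), (21, 35, 49)]:
        f.write(REC.pack(*rec))

# Map the file and jump straight to record 2, without reading the whole file.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    third = REC.unpack_from(mm, 2 * REC.size)
    mm.close()
```

This gives C-style random access: the offset arithmetic is explicit, and the OS pages in only what is touched.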
 

Radovan Garabik

Chris said:
I have a set of routines, the first of which reads lots and lots of
data from disparate regions of disk. This read routine takes 40
minutes on a P3-866 (with IDE drives). This routine populates an
array with a number of dictionaries, e.g.,

[{'el2': 0, 'el3': 0, 'el1': 0, 'el4': 0, 'el5': 0},
{'el2': 15, 'el3': 21, 'el1': 9, 'el4': 33, 'el5': 51},
{'el2': 35, 'el3': 49, 'el1': 21, 'el4': 77, 'el5': 119},
{'el2': 45, 'el3': 63, 'el1': 27, 'el4': 99, 'el5': 153}]
(not actually the data I'm reading)

This information is acted upon by subsequent routines. These routines
change very often, but the data changes very infrequently (the
opposite pattern of what I'm used to). This data changes once per
week, so I can safely cache this data to a big file on disk, and read
out of this big file -- rather than having to read about 10,000 files
-- when the program is loaded.

Now, if this were C I'd know how to do this in a pretty
straightforward manner. But being new to Python, I don't know how I
can (hopefully easily) write this data to a file, and then read it out
into memory on subsequent launches.

If anyone can provide some pointers, or even some sample code on how
to accomplish this, it would be greatly appreciated.

As already mentioned, use cPickle or shelve.
However, depending on how big and how many your dictionaries are,
you could use *dbm databases instead of dictionaries, with the numbers
packed using the struct module (I found it is sometimes much more
efficient than using shelve).
Looking at your sample, you could even reorganize the data as:
{'el2': [0, 15, 35, 45],
'el3': [0, 21, 49, 63],
...
}
and use one big dbm database, with lists represented as array objects --
that is going to give you a major memory-efficiency boost.
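A minimal sketch of that column-per-key layout, using the dbm and array modules of current Python 3; the column name and database path are hypothetical:

```python
import array
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "columns")

# Store each column ('el1', 'el2', ...) as one packed binary value.
db = dbm.open(path, "c")
db["el2"] = array.array("l", [0, 15, 35, 45]).tobytes()
db.close()

# Read it back into an array object.
db = dbm.open(path, "r")
col = array.array("l")        # 'l' item size is platform-dependent,
col.frombytes(db["el2"])      # but consistent within one machine
db.close()
```

The database only pages in the columns you actually ask for, instead of materializing every dictionary up front.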

If the arrays are going to be big (like really BIG, tens of
megabytes), you can store them one per file and use mmap to
access them -- I am doing something similar now.
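The one-array-per-file idea can be sketched with the array module's bulk I/O; for truly huge files the read could be replaced with an mmap. The column file name is made up:

```python
import array
import os
import tempfile

# One big column per file: tofile/fromfile move raw bytes in bulk,
# far faster than parsing thousands of small text files.
nums = array.array("d", (float(i) for i in range(100000)))

path = os.path.join(tempfile.mkdtemp(), "el2.arr")
with open(path, "wb") as f:
    nums.tofile(f)

loaded = array.array("d")
with open(path, "rb") as f:
    loaded.fromfile(f, len(nums))
```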


--
-----------------------------------------------------------
| Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
