How does pickle help in reading huge files?


Harsh Jha

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.
 

Mark Lawrence

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.

What's your definition of huge? Maybe it would be effective to pickle
and unpickle, but until you try it, perhaps with a relatively small data
sample, how can you know? Why can't you leave the file open and keep
iterating over the contents?

--
Roses are red,
Violets are blue,
Most poems rhyme,
But this one doesn't.

Mark Lawrence
 

rusi

Keep it in memory

That's a strange answer given that the OP says his file is huge.
Of course 'huge' may not really be huge -- that really depends on the hardware he's using.
 

Chris Angelico

That's a strange answer given that the OP says his file is huge.
Of course 'huge' may not really be huge -- that really depends on the hardware he's using.

Most people's idea of a big file is one that has a few thousand lines
in it. That may be pretty huge in terms of manual work, but it'd fit
inside memory easily enough. And even if it really is bigger than
memory, chances are you can use your page file and still keep it in
"memory" - and that's generally the easiest, if perhaps not the most
efficient, solution.

ChrisA
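
A minimal sketch of the keep-it-in-memory approach; the file name 'data.csv' is a placeholder:

    import csv

    # Load the whole file into memory once.
    with open('data.csv') as f:
        rows = list(csv.reader(f))

    # Later passes iterate over the in-memory list instead of reopening the file.
    for row in rows:
        pass  # process each row here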
 

Roy Smith

Harsh Jha said:
I have a huge CSV file and I want to read stuff from it again and again.
Is it useful to pickle it and keep it, and then unpickle it whenever I
need to use that data? Is it faster than accessing that file simply by
opening it again and again? Please explain why.

Thank you.

It can be. I did a project a bunch of years ago which involved reading
(and parsing) SNMP MIBs before you could do any work. Startup took
something like 10-20 seconds. If I pre-parsed the MIBs and wrote out
the data structures as pickles, I could cut startup time to a couple of
seconds.

But, that's because the parsing I was doing was pretty complicated.
Parsing a CSV file is much easier, so I wouldn't expect you to see much
improvement reading a pickle file vs. reading the original CSV.

The bottom line is, you should try it. Pickling a data structure is
about one line of code (not counting the 'import cPickle'). Try it and
see what happens. Time how long it takes to read the original file, and
how long it takes to read the pickle. Let us know your results.

Also, let us know what "huge" means. 1000 rows? A million? 100
million?
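
A minimal sketch of that timing comparison, assuming Python 3's pickle module (cPickle on Python 2) and placeholder file names:

    import csv
    import pickle  # cPickle on Python 2
    import time

    # One-time cost: parse the CSV and save the parsed rows as a pickle.
    with open('data.csv') as f:
        rows = list(csv.reader(f))
    with open('data.pkl', 'wb') as f:
        pickle.dump(rows, f, pickle.HIGHEST_PROTOCOL)

    # Time re-reading the original CSV...
    start = time.time()
    with open('data.csv') as f:
        rows = list(csv.reader(f))
    print('csv:    %.3f s' % (time.time() - start))

    # ...versus loading the pickle of the already-parsed rows.
    start = time.time()
    with open('data.pkl', 'rb') as f:
        rows = pickle.load(f)
    print('pickle: %.3f s' % (time.time() - start))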
 

Dennis Lee Bieber

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

As others have mentioned, what is "huge"?

Does it get updated often? How extensive are updates?

I suspect I'd use the CSV module to parse it into an SQLite3 database,
then use the database for the repetitive access. NOTE: I've never used
pickle -- but for stuff that is coming in as simple CSV, I'd suspect the
parsing (even including the various int()/float() wrapping of numeric
fields) can't be much slower than the object creation/unwrapping used by
pickle; SQLite3 should let you leave the data in numeric formats without
the translation penalty on each use.
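
A minimal sketch of that CSV-to-SQLite3 conversion; the table layout, column names, and two-column file are invented for illustration:

    import csv
    import sqlite3

    # One-time conversion; reuse data.db for all later runs.
    conn = sqlite3.connect('data.db')
    conn.execute('CREATE TABLE IF NOT EXISTS rows (name TEXT, value REAL)')
    with open('data.csv') as f:
        # Values stored as REAL stay numeric, so there is no int()/float()
        # translation penalty on each later read.
        conn.executemany('INSERT INTO rows VALUES (?, ?)',
                         ((name, float(value)) for name, value in csv.reader(f)))
    conn.commit()

    # Repetitive access now queries the database instead of re-parsing the file.
    for name, value in conn.execute('SELECT name, value FROM rows WHERE value > ?',
                                    (10.0,)):
        print(name, value)
    conn.close()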
 

Peter Cacioppi

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.

Surprised no one else mentioned a fairly typical pattern for this sort of situation: the compromise between "read from disk" and "read from memory" is "implement a cache".

I've had lots of good experiences hand-rolling simple caches, especially if there is an application-specific access pattern.

Python has nice implementations of things like tuples and dictionaries, which make caching fairly easy compared to other languages.
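
A minimal sketch of such a hand-rolled cache; the key column, file name, and helper name are made up for illustration:

    import csv

    _cache = {}  # maps the key column to its parsed row

    def get_row(key, path='data.csv'):
        # Scan the file only on the first miss; every later lookup is a dict hit.
        if not _cache:
            with open(path) as f:
                for row in csv.reader(f):
                    _cache[row[0]] = row  # first column assumed to be the key
        return _cache.get(key)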
 

Irmen de Jong

Surprised no one else mentioned a fairly typical pattern for this sort of situation:
the compromise between "read from disk" and "read from memory" is "implement a
cache".

...or use memory-mapped I/O. Just let the OS deal with the 'caching' of memory pages.

Irmen
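
A minimal sketch using the standard mmap module; the file name and search string are placeholders:

    import mmap

    with open('data.csv', 'rb') as f:
        # Map the file into memory; the OS pages data in and out on demand,
        # so repeated scans work even when the file is larger than RAM.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(mm.find(b'some,text'))  # byte offset of a substring, or -1
        mm.close()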
 
