How does pickle help in reading huge files?


Harsh Jha

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.
 

Mark Lawrence

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.

What's your definition of huge? Maybe it would be effective to pickle
and unpickle, but until you try it, perhaps with a relatively small data
sample, how can you know? Why can't you leave the file open and keep
iterating over the contents?

--
Roses are red,
Violets are blue,
Most poems rhyme,
But this one doesn't.

Mark Lawrence
 

rusi

Keep it in memory

That's a strange answer given that the OP says his file is huge.
Of course 'huge' may not really be huge -- that really depends on the hardware he's using.
 

Chris Angelico

That's a strange answer given that the OP says his file is huge.
Of course 'huge' may not really be huge -- that really depends on the hardware he's using.

Most people's idea of a big file is one that has a few thousand lines
in it. That may be pretty huge in terms of manual work, but it'd fit
inside memory easily enough. And even if it really is bigger than
memory, chances are you can use your page file and still keep it in
"memory" - and that's generally the easiest, if perhaps not the most
efficient, solution.

ChrisA
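
A minimal sketch of the keep-it-in-memory approach; the file name 'data.csv' is a placeholder:

    import csv

    # Load the whole file into memory once.
    with open('data.csv') as f:
        rows = list(csv.reader(f))

    # Later passes iterate over the in-memory list instead of reopening the file.
    for row in rows:
        pass  # process each row here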
 

Roy Smith

Harsh Jha said:
I have a huge CSV file and I want to read stuff from it again and again.
Is it useful to pickle it and keep it, and then unpickle it whenever I
need to use that data? Is it faster than accessing that file simply by
opening it again and again? Please explain why.

Thank you.

It can be. I did a project a bunch of years ago which involved reading
(and parsing) SNMP MIBs before you could do any work. Startup took
something like 10-20 seconds. If I pre-parsed the MIBs and wrote out
the data structures as pickles, I could cut startup time to a couple of
seconds.

But, that's because the parsing I was doing was pretty complicated.
Parsing a CSV file is much easier, so I wouldn't expect you to see much
improvement reading a pickle file vs. reading the original CSV.

The bottom line is, you should try it. Pickling a data structure is
about one line of code (not counting the 'import cPickle'). Try it and
see what happens. Time how long it takes to read the original file, and
how long it takes to read the pickle. Let us know your results.

Also, let us know what "huge" means. 1000 rows? A million? 100
million?
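
A minimal sketch of that timing comparison, assuming Python 3's pickle module (cPickle on Python 2) and placeholder file names:

    import csv
    import pickle  # cPickle on Python 2
    import time

    # One-time cost: parse the CSV and save the parsed rows as a pickle.
    with open('data.csv') as f:
        rows = list(csv.reader(f))
    with open('data.pkl', 'wb') as f:
        pickle.dump(rows, f, pickle.HIGHEST_PROTOCOL)

    # Time re-reading the original CSV...
    start = time.time()
    with open('data.csv') as f:
        rows = list(csv.reader(f))
    print('csv:    %.3f s' % (time.time() - start))

    # ...versus loading the pickle of the already-parsed rows.
    start = time.time()
    with open('data.pkl', 'rb') as f:
        rows = pickle.load(f)
    print('pickle: %.3f s' % (time.time() - start))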
 

Dennis Lee Bieber

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

As others have mentioned, what is "huge"?

Does it get updated often? How extensive are updates?

I suspect I'd use the CSV module to parse it into an SQLite3 database,
then use the database for the repetitive access. NOTE: I've never used
pickle -- but for stuff that is coming in as simple CSV, I'd suspect the
parsing (even including the various int()/float() wrapping of numeric
fields) can't be much slower than the object creation/unwrapping used by
pickle; SQLite3 should let you leave the data in numeric formats without
the translation penalty on each use.
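
A minimal sketch of that CSV-to-SQLite3 conversion; the table layout, column names, and two-column file are invented for illustration:

    import csv
    import sqlite3

    # One-time conversion; reuse data.db for all later runs.
    conn = sqlite3.connect('data.db')
    conn.execute('CREATE TABLE IF NOT EXISTS rows (name TEXT, value REAL)')
    with open('data.csv') as f:
        # Values stored as REAL stay numeric, so there is no int()/float()
        # translation penalty on each later read.
        conn.executemany('INSERT INTO rows VALUES (?, ?)',
                         ((name, float(value)) for name, value in csv.reader(f)))
    conn.commit()

    # Repetitive access now queries the database instead of re-parsing the file.
    for name, value in conn.execute('SELECT name, value FROM rows WHERE value > ?',
                                    (10.0,)):
        print(name, value)
    conn.close()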
 

Peter Cacioppi

I have a huge CSV file and I want to read stuff from it again and again. Is it useful to pickle it and keep it, and then unpickle it whenever I need to use that data? Is it faster than accessing that file simply by opening it again and again? Please explain why.

Thank you.

Surprised no one else mentioned a fairly typical pattern for this sort of situation: the compromise between "read from disk" and "read from memory" is "implement a cache".

I've had lots of good experiences hand-rolling simple caches, especially if there is an application-specific access pattern.

Python has nice implementations of things like tuples and dictionaries, which make caching fairly easy compared to other languages.
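
A minimal sketch of such a hand-rolled cache; the key column, file name, and helper name are made up for illustration:

    import csv

    _cache = {}  # maps the key column to its parsed row

    def get_row(key, path='data.csv'):
        # Scan the file only on the first miss; every later lookup is a dict hit.
        if not _cache:
            with open(path) as f:
                for row in csv.reader(f):
                    _cache[row[0]] = row  # first column assumed to be the key
        return _cache.get(key)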
 

Irmen de Jong

Surprised no one else mentioned a fairly typical pattern for this sort of situation:
the compromise between "read from disk" and "read from memory" is "implement a
cache".

...or use memory-mapped I/O. Just let the OS deal with the 'caching' of memory pages.

Irmen
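
A minimal sketch using the standard mmap module; the file name and search string are placeholders:

    import mmap

    with open('data.csv', 'rb') as f:
        # Map the file into memory; the OS pages data in and out on demand,
        # so repeated scans work even when the file is larger than RAM.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        print(mm.find(b'some,text'))  # byte offset of a substring, or -1
        mm.close()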
 
