Fast forward-backward (write-read)


Virgil Stokes

I am working with some rather large data files (>100 GB) that contain time-series
data. The data (t_k, y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
various types of processing on these data (e.g. moving median, moving average,
Kalman filter, Kalman smoother) in a sequential manner, and only a small number
of these data need to be held in RAM while being processed. When performing
Kalman filtering (the forward-in-time pass, k = 0,1,...,N) I need to save several
variables (e.g. 11*32 bytes) to an external file for each (t_k, y(t_k)). These
are the inputs to the Kalman smoother (the backward-in-time pass, k = N,N-1,...,0).
Thus, I will need to read the variables saved during the forward pass back in from
the external file in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an
external file and then read them in backwards?
 

Paul Rubin

Virgil Stokes said:
Finally, to my question --- What is a fast way to write these
variables to an external file and then read them in backwards?

Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.
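
For what it's worth, here is a minimal sketch of the simple
write-forward/read-backward scheme, assuming each step is saved as a
fixed-size record of 11 doubles (the format string and names are just
placeholders):

    import struct

    REC_FMT = "<11d"                      # hypothetical: 11 doubles per time step
    REC_SIZE = struct.calcsize(REC_FMT)   # 88 bytes

    def write_forward(path, records):
        """Write fixed-size binary records in forward (k = 0..N) order."""
        with open(path, "wb") as f:
            for rec in records:
                f.write(struct.pack(REC_FMT, *rec))

    def read_backward(path):
        """Yield records from last written back to first written."""
        with open(path, "rb") as f:
            f.seek(0, 2)                  # 2 = os.SEEK_END
            pos = f.tell()
            while pos >= REC_SIZE:
                pos -= REC_SIZE
                f.seek(pos)
                yield struct.unpack(REC_FMT, f.read(REC_SIZE))

The chunked variant would buffer, say, 1 MB of packed records and reverse
them within each chunk before writing, so the backward pass can pull in
each chunk with one sequential read.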
 

Paul Rubin

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach.

Oh yes, I should have mentioned, it may be simpler and perhaps a little
bit faster to use mmap rather than seeking.
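
A rough sketch of the mmap variant, under the same assumption of
fixed-size records of 11 doubles (a 64-bit Python is needed to map files
of this size):

    import mmap
    import struct

    REC_FMT = "<11d"                      # hypothetical record layout
    REC_SIZE = struct.calcsize(REC_FMT)

    def read_backward_mmap(path):
        """Yield records last-to-first by slicing a memory-mapped file."""
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                for pos in range(len(m) - REC_SIZE, -1, -REC_SIZE):
                    yield struct.unpack(REC_FMT, m[pos:pos + REC_SIZE])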
 

Tim Chase

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

I agree with Paul here. It's been a while since I did it, and my
dataset was small enough (and passed through only once) that I just let
it run. Writing larger chunks is definitely a good way to go.
Paul Rubin said:
You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

I find that the I/O almost always overwhelms the actual processing.

Paul Rubin said:
Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.

Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

-tkc
 

Paul Rubin

Tim Chase said:
Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

I've found that CPU costs, both for processing and for conversion, are
significant. Also, using a binary format makes the file a lot smaller,
which decreases the I/O cost as well as eliminating the conversion cost.
And the conversion can introduce precision loss, another thing to be
avoided. The famous "butterfly effect" was serendipitously discovered
that way: Lorenz restarted a simulation from printed values carrying
fewer digits than the machine did, and the runs diverged.
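
A quick illustration of both points, with an arbitrary value: 8 bytes of
binary versus about 19 bytes of text for an exact round trip, and a lossy
round trip once fewer digits are kept.

    import struct

    x = 0.1 + 0.2                              # 0.30000000000000004

    binary = struct.pack("<d", x)              # always 8 bytes, exact
    full = ("%.17g" % x).encode()              # 19 bytes to round-trip exactly
    short = ("%.6f" % x).encode()              # 8 bytes, but digits are dropped

    assert struct.unpack("<d", binary)[0] == x # exact round trip
    assert float(full) == x                    # exact, but over twice the size
    assert float(short) != x                   # precision lost
    print(len(binary), len(full), len(short))  # 8 19 8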
 

Virgil Stokes

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

I am writing (forward) once and reading (backward) once.

Paul Rubin said:
You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

I am currently using SciPy/NumPy.

Paul Rubin said:
Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.

Yes, I am doing this (but thanks for "underlining" it!)

Thanks Paul :)
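
Since NumPy is already in the picture, the save and restore can also be
done directly with arrays; a sketch assuming 11 float64 values per step
(names and counts are placeholders):

    import numpy as np

    N_VARS = 11                                  # hypothetical: 11 doubles per step
    N_STEPS = 1000

    # Forward pass: append each step's variables to a raw binary file.
    with open("fwd.dat", "wb") as f:
        for k in range(N_STEPS):
            step = np.full(N_VARS, float(k))     # stand-in for the filter output
            step.tofile(f)

    # Backward pass: memory-map the file and walk the rows last-to-first.
    data = np.memmap("fwd.dat", dtype=np.float64, mode="r").reshape(-1, N_VARS)
    for row in data[::-1]:
        pass                                     # feed each row to the smoother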
 

rusi

Virgil Stokes said:
I am working with some rather large data files (>100 GB) that contain time-series
data. The data (t_k, y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
various types of processing on these data (e.g. moving median, moving average,
Kalman filter, Kalman smoother) in a sequential manner, and only a small number
of these data need to be held in RAM while being processed. When performing
Kalman filtering (the forward-in-time pass, k = 0,1,...,N) I need to save several
variables (e.g. 11*32 bytes) to an external file for each (t_k, y(t_k)). These
are the inputs to the Kalman smoother (the backward-in-time pass, k = N,N-1,...,0).
Thus, I will need to read the variables saved during the forward pass back in from
the external file in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an
external file and then read them in backwards?

Have you tried gdbm/bsddb? They are meant for this sort of thing (I believe).
They probably need to be installed for Windows; they work on Linux.
If I were you I'd try it out with the giant data on Linux, see if the
problem is solved, and then see how to install them for Windows.
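
A rough sketch of what that could look like with the standard-library dbm
module (which wraps gdbm where it is available), keying each record by its
step index; whether a key-value store actually beats a flat file for this
strictly sequential access pattern is worth measuring:

    import dbm
    import struct

    REC_FMT = "<11d"                       # hypothetical record layout
    N_STEPS = 1000

    # Forward pass: store each record under its zero-padded step index.
    db = dbm.open("fwd.db", "n")
    for k in range(N_STEPS):
        rec = (float(k),) * 11             # stand-in for the filter output
        db[b"%010d" % k] = struct.pack(REC_FMT, *rec)
    db.close()

    # Backward pass: fetch the records in reverse index order.
    db = dbm.open("fwd.db", "r")
    for k in range(N_STEPS - 1, -1, -1):
        rec = struct.unpack(REC_FMT, db[b"%010d" % k])
    db.close()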
 

Virgil Stokes

rusi said:
Have you tried gdbm/bsddb? They are meant for this sort of thing (I believe).
They probably need to be installed for Windows; they work on Linux.
If I were you I'd try it out with the giant data on Linux, see if the
problem is solved, and then see how to install them for Windows.
Thanks Rusi :)
 
