Fast forward-backward (write-read)


Virgil Stokes

I am working with some rather large data files (>100 GB) that contain time-series
data. The data (t_k, y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
various types of processing on these data (e.g. moving median, moving average,
Kalman filter, Kalman smoother) in a sequential manner, and only a small number
of these data need to be held in RAM while being processed. When performing
Kalman filtering (the forward-in-time pass, k = 0,1,...,N) I need to save several
variables (e.g. 11*32 bytes) to an external file for each (t_k, y(t_k)). These
are the inputs to the Kalman smoother (the backward-in-time pass, k = N,N-1,...,0).
Thus, I will need to read the variables saved during the forward pass back in from
the external file in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an
external file and then read them in backwards?
 

Paul Rubin

Virgil Stokes said:
Finally, to my question --- What is a fast way to write these
variables to an external file and then read them in backwards?

Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.
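
For what it's worth, here is a minimal sketch of the simple
write-forward/read-backward scheme, assuming each step is saved as a
fixed-size record of 11 doubles (the format string and names are just
placeholders):

    import struct

    REC_FMT = "<11d"                      # hypothetical: 11 doubles per time step
    REC_SIZE = struct.calcsize(REC_FMT)   # 88 bytes

    def write_forward(path, records):
        """Write fixed-size binary records in forward (k = 0..N) order."""
        with open(path, "wb") as f:
            for rec in records:
                f.write(struct.pack(REC_FMT, *rec))

    def read_backward(path):
        """Yield records from last written back to first written."""
        with open(path, "rb") as f:
            f.seek(0, 2)                  # 2 = os.SEEK_END
            pos = f.tell()
            while pos >= REC_SIZE:
                pos -= REC_SIZE
                f.seek(pos)
                yield struct.unpack(REC_FMT, f.read(REC_SIZE))

The chunked variant would buffer, say, 1 MB of packed records and reverse
them within each chunk before writing, so the backward pass can pull in
each chunk with one sequential read.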
 

Paul Rubin

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach.

Oh yes, I should have mentioned, it may be simpler and perhaps a little
bit faster to use mmap rather than seeking.
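
A rough sketch of the mmap variant, under the same assumption of
fixed-size records of 11 doubles (a 64-bit Python is needed to map files
of this size):

    import mmap
    import struct

    REC_FMT = "<11d"                      # hypothetical record layout
    REC_SIZE = struct.calcsize(REC_FMT)

    def read_backward_mmap(path):
        """Yield records last-to-first by slicing a memory-mapped file."""
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                for pos in range(len(m) - REC_SIZE, -1, -REC_SIZE):
                    yield struct.unpack(REC_FMT, m[pos:pos + REC_SIZE])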
 

Tim Chase

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

I agree with Paul here. It's been a while since I did it, and my
dataset was small enough (and passed through only once) that I just let
it run. Writing larger chunks is definitely a good way to go.
Paul Rubin said:
You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

I find that the I/O almost always overwhelms the actual processing.

Paul Rubin said:
Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.

Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

-tkc
 

Paul Rubin

Tim Chase said:
Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

I've found that CPU costs, both for processing and for conversion, are
significant. Also, using a binary format makes the file a lot smaller,
which decreases the I/O cost as well as eliminating the conversion cost.
And the conversion can introduce precision loss, another thing to be
avoided. The famous "butterfly effect" was serendipitously discovered
that way: Lorenz restarted a simulation from printed values carrying
fewer digits than the machine did, and the runs diverged.
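
A quick illustration of both points, with an arbitrary value: 8 bytes of
binary versus about 19 bytes of text for an exact round trip, and a lossy
round trip once fewer digits are kept.

    import struct

    x = 0.1 + 0.2                              # 0.30000000000000004

    binary = struct.pack("<d", x)              # always 8 bytes, exact
    full = ("%.17g" % x).encode()              # 19 bytes to round-trip exactly
    short = ("%.6f" % x).encode()              # 8 bytes, but digits are dropped

    assert struct.unpack("<d", binary)[0] == x # exact round trip
    assert float(full) == x                    # exact, but over twice the size
    assert float(short) != x                   # precision lost
    print(len(binary), len(full), len(short))  # 8 19 8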
 

Virgil Stokes

Paul Rubin said:
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to poor cache and prefetch behavior. If it's
something you're just running a few times, seeking backwards is the
simplest approach. If you're really trying to optimize the thing, you
might buffer up large chunks (like 1 MB) before writing. If you're
writing once and reading multiple times, you might reverse the order of
records within the chunks during the writing phase.

I am writing (forward) once and reading (backward) once.

Paul Rubin said:
You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

I am currently using SciPy/NumPy.

Paul Rubin said:
Of course (it should go without saying), you want to dump in a binary
format rather than converting to decimal.

Yes, I am doing this (but thanks for "underlining" it!)

Thanks Paul :)
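
Since NumPy is already in the picture, the save and restore can also be
done directly with arrays; a sketch assuming 11 float64 values per step
(names and counts are placeholders):

    import numpy as np

    N_VARS = 11                                  # hypothetical: 11 doubles per step
    N_STEPS = 1000

    # Forward pass: append each step's variables to a raw binary file.
    with open("fwd.dat", "wb") as f:
        for k in range(N_STEPS):
            step = np.full(N_VARS, float(k))     # stand-in for the filter output
            step.tofile(f)

    # Backward pass: memory-map the file and walk the rows last-to-first.
    data = np.memmap("fwd.dat", dtype=np.float64, mode="r").reshape(-1, N_VARS)
    for row in data[::-1]:
        pass                                     # feed each row to the smoother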
 

rusi

Virgil Stokes said:
I am working with some rather large data files (>100 GB) that contain time-series
data. The data (t_k, y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
various types of processing on these data (e.g. moving median, moving average,
Kalman filter, Kalman smoother) in a sequential manner, and only a small number
of these data need to be held in RAM while being processed. When performing
Kalman filtering (the forward-in-time pass, k = 0,1,...,N) I need to save several
variables (e.g. 11*32 bytes) to an external file for each (t_k, y(t_k)). These
are the inputs to the Kalman smoother (the backward-in-time pass, k = N,N-1,...,0).
Thus, I will need to read the variables saved during the forward pass back in from
the external file in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an
external file and then read them in backwards?

Have you tried gdbm/bsddb? They are meant for this sort of thing (I believe).
They probably need to be installed for Windows; they work on Linux.
If I were you I'd try it out with the giant data on Linux, see if the
problem is solved, and then see how to install them for Windows.
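
A rough sketch of what that could look like with the standard-library dbm
module (which wraps gdbm where it is available), keying each record by its
step index; whether a key-value store actually beats a flat file for this
strictly sequential access pattern is worth measuring:

    import dbm
    import struct

    REC_FMT = "<11d"                       # hypothetical record layout
    N_STEPS = 1000

    # Forward pass: store each record under its zero-padded step index.
    db = dbm.open("fwd.db", "n")
    for k in range(N_STEPS):
        rec = (float(k),) * 11             # stand-in for the filter output
        db[b"%010d" % k] = struct.pack(REC_FMT, *rec)
    db.close()

    # Backward pass: fetch the records in reverse index order.
    db = dbm.open("fwd.db", "r")
    for k in range(N_STEPS - 1, -1, -1):
        rec = struct.unpack(REC_FMT, db[b"%010d" % k])
    db.close()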
 

Virgil Stokes

rusi said:
Have you tried gdbm/bsddb? They are meant for this sort of thing (I believe).
They probably need to be installed for Windows; they work on Linux.
If I were you I'd try it out with the giant data on Linux, see if the
problem is solved, and then see how to install them for Windows.
Thanks Rusi :)
 
