Fast forward-backward (write-read)

Discussion in 'Python' started by Virgil Stokes, Oct 23, 2012.

  1. I am working with some rather large data files (>100GB) that contain time series
    data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
    various types of processing on these data (e.g. moving median, moving average,
    and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
    number of these data need be stored in RAM when being processed. When performing
    Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
    external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
    are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
    Thus, I will need to input these variables saved to an external file from the
    forward pass, in reverse order --- from last written to first written.

    Finally, to my question --- What is a fast way to write these variables to an
    external file and then read them in backwards?
     
    Virgil Stokes, Oct 23, 2012
    #1

  2. Virgil Stokes

    Paul Rubin Guest

    Virgil Stokes <> writes:
    > Finally, to my question --- What is a fast way to write these
    > variables to an external file and then read them in backwards?


    Seeking backwards in files works, but the performance hit is
    significant. There is also a performance hit to scanning pointers
    backwards in memory, due to cache misprediction. If it's something
    you're just running a few times, seeking backwards is the simplest
    approach. If you're really trying to optimize the thing, you might
    buffer up large chunks (like 1 MB) before writing. If you're writing
    once and reading multiple times, you might reverse the order of records
    within the chunks during the writing phase.

    You're of course taking a performance bath from writing the program in
    Python to begin with (unless using scipy/numpy or the like), enough that
    it might dominate any effects of how the files are written.

    Of course (it should go without saying), you want to dump in a
    binary format rather than converting to decimal.
     
    Paul Rubin, Oct 23, 2012
    #2
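A minimal sketch of the chunked approach described in the post above, assuming each time step saves 11 float64 values (the thread does not spell out the exact record layout) and that 4096 records per chunk is a reasonable starting point. The names `write_forward`, `read_backward`, `N_VARS`, and `CHUNK_RECS` are hypothetical, chosen only for illustration.

```python
import os
import numpy as np

N_VARS = 11                 # assumed: float64 values saved per time step
REC_BYTES = N_VARS * 8      # bytes per fixed-size binary record
CHUNK_RECS = 4096           # assumed tuning knob: records per I/O chunk


def write_forward(path, records):
    """Write records in time order, buffering CHUNK_RECS of them per write."""
    buf = np.empty((CHUNK_RECS, N_VARS), dtype=np.float64)
    n = 0
    with open(path, "wb") as f:
        for rec in records:
            buf[n] = rec
            n += 1
            if n == CHUNK_RECS:
                buf.tofile(f)       # one large raw binary write
                n = 0
        if n:
            buf[:n].tofile(f)       # flush the final partial chunk


def read_backward(path):
    """Yield records from last written to first, one chunk per seek/read."""
    total = os.path.getsize(path) // REC_BYTES
    with open(path, "rb") as f:
        end = total
        while end > 0:
            start = max(0, end - CHUNK_RECS)
            f.seek(start * REC_BYTES)
            chunk = np.fromfile(f, dtype=np.float64,
                                count=(end - start) * N_VARS)
            chunk = chunk.reshape(-1, N_VARS)
            for rec in chunk[::-1]:  # reverse inside the chunk, in RAM
                yield rec
            end = start
```

Each chunk is read front to back from disk and reversed only in memory, so the file is never read one record at a time. For a file that is written once and read backward once, reversing the chunk on the read side like this is probably enough.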

  3. Virgil Stokes

    Paul Rubin Guest

    Paul Rubin <> writes:
    > Seeking backwards in files works, but the performance hit is
    > significant. There is also a performance hit to scanning pointers
    > backwards in memory, due to cache misprediction. If it's something
    > you're just running a few times, seeking backwards is the simplest
    > approach.


    Oh yes, I should have mentioned, it may be simpler and perhaps a little
    bit faster to use mmap rather than seeking.
     
    Paul Rubin, Oct 23, 2012
    #3
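The mmap suggestion could look roughly like the sketch below, which uses `numpy.memmap` rather than the raw `mmap` module since NumPy is already in use. It assumes a 64-bit Python (so a file over 100 GB can be mapped), a non-empty file of raw float64 values, and 11 values per record. The function name `iter_backward_mmap` and its parameters are hypothetical.

```python
import numpy as np

N_VARS = 11  # assumed: float64 values per record


def iter_backward_mmap(path, chunk_recs=4096):
    """Yield records last-to-first from a memory-mapped binary file."""
    # Map the whole file read-only; pages are loaded by the OS on demand.
    data = np.memmap(path, dtype=np.float64, mode="r").reshape(-1, N_VARS)
    for end in range(len(data), 0, -chunk_recs):
        start = max(0, end - chunk_recs)
        block = np.array(data[start:end])   # copy one chunk into RAM
        for rec in block[::-1]:
            yield rec
```

There is no explicit seeking here: slicing the mapped array replaces the seek/read pair, and the operating system decides which pages to keep in memory. Whether this is actually faster than explicit chunked reads would need to be measured on the real data.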
  4. Virgil Stokes

    Tim Chase Guest

    On 10/23/12 11:17, Paul Rubin wrote:
    > Virgil Stokes <> writes:
    >> Finally, to my question --- What is a fast way to write these
    >> variables to an external file and then read them in backwards?

    >
    > Seeking backwards in files works, but the performance hit is
    > significant. There is also a performance hit to scanning pointers
    > backwards in memory, due to cache misprediction. If it's something
    > you're just running a few times, seeking backwards is the simplest
    > approach. If you're really trying to optimize the thing, you might
    > buffer up large chunks (like 1 MB) before writing. If you're writing
    > once and reading multiple times, you might reverse the order of records
    > within the chunks during the writing phase.


    I agree with Paul here, it's been a while since I did it, and my
    dataset was small enough (and passed through once) so I just let it
    run. Writing larger chunks is definitely a good way to go.

    > You're of course taking a performance bath from writing the program in
    > Python to begin with (unless using scipy/numpy or the like), enough that
    > it might dominate any effects of how the files are written.


    I usually find that the I/O almost always overwhelms the actual
    processing.

    > Of course (it should go without saying), you want to dump in a
    > binary format rather than converting to decimal.


    Again, the conversion to/from decimal hasn't been a great cost in my
    experience, as it's overwhelmed by the I/O cost of shoveling the
    data to/from disk.

    -tkc
     
    Tim Chase, Oct 23, 2012
    #4
  5. Virgil Stokes

    Paul Rubin Guest

    Tim Chase <> writes:
    > Again, the conversion to/from decimal hasn't been a great cost in my
    > experience, as it's overwhelmed by the I/O cost of shoveling the
    > data to/from disk.


    I've found that CPU costs for both processing and conversion are
    significant. Also, using a binary format makes the file a lot smaller,
    which decreases the I/O cost as well as eliminating the conversion cost.
    And, the conversion can introduce precision loss, another thing to be
    avoided. The famous "butterfly effect" was serendipitously discovered
    that way.
     
    Paul Rubin, Oct 23, 2012
    #5
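The precision point in the post above can be checked with a small round-trip experiment. This is only an illustration: the `%.6f` format is an assumed example of a "short" decimal dump, and the helper names are hypothetical. Note that a decimal format with 17 significant digits does round-trip an IEEE 754 double exactly, so the loss comes from writing too few digits rather than from decimal text as such.

```python
import struct


def roundtrip_text(x, fmt="%.6f"):
    """Format a float as decimal text with fmt, then parse it back."""
    return float(fmt % x)


def roundtrip_binary(x):
    """Pack a float as 8 raw little-endian bytes, then unpack it."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


x = 1.0 / 3.0
print("6 decimals exact: ", roundtrip_text(x) == x)
print("17 digits exact:  ", roundtrip_text(x, "%.17g") == x)
print("binary exact:     ", roundtrip_binary(x) == x)
print("text bytes:", len("%.17g" % x), "binary bytes:", len(struct.pack("<d", x)))
```

The last line also shows the size argument: the lossless text form needs more than twice the bytes of the binary form, before counting any separators.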
  6. On 23-Oct-2012 18:17, Paul Rubin wrote:
    > Virgil Stokes <> writes:
    >> Finally, to my question --- What is a fast way to write these
    >> variables to an external file and then read them in backwards?

    > Seeking backwards in files works, but the performance hit is
    > significant. There is also a performance hit to scanning pointers
    > backwards in memory, due to cache misprediction. If it's something
    > you're just running a few times, seeking backwards is the simplest
    > approach. If you're really trying to optimize the thing, you might
    > buffer up large chunks (like 1 MB) before writing. If you're writing
    > once and reading multiple times, you might reverse the order of records
    > within the chunks during the writing phase.

    I am writing (forward) once and reading (backward) once.
    >
    > You're of course taking a performance bath from writing the program in
    > Python to begin with (unless using scipy/numpy or the like), enough that
    > it might dominate any effects of how the files are written.

    I am currently using SciPy/NumPy
    >
    > Of course (it should go without saying), you want to dump in a
    > binary format rather than converting to decimal.

    Yes, I am doing this (but thanks for "underlining" it!)

    Thanks Paul :)
     
    Virgil Stokes, Oct 23, 2012
    #6
  7. Virgil Stokes

    rusi Guest

    On Oct 23, 7:52 pm, Virgil Stokes <> wrote:
    > I am working with some rather large data files (>100GB) that contain time series
    > data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
    > various types of processing on these data (e.g. moving median, moving average,
    > and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
    > number of these data need be stored in RAM when being processed. When performing
    > Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
    > external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
    > are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
    > Thus, I will need to input these variables saved to an external file from the
    > forward pass, in reverse order --- from last written to first written.
    >
    > Finally, to my question --- What is a fast way to write these variables to an
    > external file and then read them in backwards?


    Have you tried gdbm/bsddb? They are meant for this kind of thing (I believe).
    They probably need to be installed separately on Windows; they work on Linux.
    If I were you, I'd try them out with the giant data on Linux and see if the
    problem is solved, then see how to install them on Windows.
     
    rusi, Oct 24, 2012
    #7
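For comparison, a keyed-store version of the same idea might look like the sketch below. It uses the standard-library `dbm` package, which is where Python 3 keeps the gdbm-style backends (`bsddb` is no longer in the Python 3 standard library), and it assumes 11 float64 values per record keyed by the time index k. The names `store_forward` and `load_backward` are hypothetical.

```python
import dbm
import struct

N_VARS = 11                              # assumed: float64 values per record
REC = struct.Struct("<%dd" % N_VARS)     # fixed-size binary record
KEY = struct.Struct(">Q")                # time index k as an 8-byte key


def store_forward(path, records):
    """Store each record under its time index; return the record count."""
    count = 0
    with dbm.open(path, "n") as db:      # "n": always create a new database
        for rec in records:
            db[KEY.pack(count)] = REC.pack(*rec)
            count += 1
    return count


def load_backward(path, count):
    """Yield records for k = count-1, ..., 0 by key lookup."""
    with dbm.open(path, "r") as db:
        for k in range(count - 1, -1, -1):
            yield REC.unpack(db[KEY.pack(k)])
```

A key-value store does one lookup per record and adds its own on-disk overhead, so for strictly sequential forward-then-backward access a flat binary file is likely simpler; this would be worth timing before committing to either.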
  8. On 24-Oct-2012 17:11, rusi wrote:
    > On Oct 23, 7:52 pm, Virgil Stokes <> wrote:
    >> I am working with some rather large data files (>100GB) that contain time series
    >> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
    >> various types of processing on these data (e.g. moving median, moving average,
    >> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
    >> number of these data need be stored in RAM when being processed. When performing
    >> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
    >> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
    >> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
    >> Thus, I will need to input these variables saved to an external file from the
    >> forward pass, in reverse order --- from last written to first written.
    >>
    >> Finally, to my question --- What is a fast way to write these variables to an
    >> external file and then read them in backwards?

    > Have you tried gdbm/bsddb? They are meant for this kind of thing (I believe).
    > They probably need to be installed separately on Windows; they work on Linux.
    > If I were you, I'd try them out with the giant data on Linux and see if the
    > problem is solved, then see how to install them on Windows.

    Thanks Rusi :)
     
    Virgil Stokes, Oct 25, 2012
    #8
