save tuple of simple data types to disk (low memory footprint)

Discussion in 'Python' started by Gelonida N, Oct 28, 2011.

  1. Gelonida N

    Gelonida N Guest

    Hi,

    I would like to save many dicts with a fixed amount of keys
    tuples to a file in a memory efficient manner (no random, but only
    sequential access is required)

    As the keys are the same for each entry I considered converting them to
    tuples.

    The tuples contain only strings, ints (long ints) and floats (double)
    and the data types for each position within the tuple are fixed.

    The fastest and simplest way is to pickle the data or to use json.
    Both formats however are not that optimal.


    I could store ints and floats with pack. As strings have variable
    length, I'm not sure how to save them efficiently
    (except by writing a length prefix first and then the string).
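
    To illustrate, a rough sketch of the length-prefix idea (the record
    layout here is just made up: one int, one double, one string):

    import struct

    def pack_record(i, x, s):
        # fixed-width int and double, then a 4-byte length prefix
        # followed by the raw UTF-8 bytes of the string
        data = s.encode('utf-8')
        return struct.pack('!id', i, x) + struct.pack('!I', len(data)) + data

    def unpack_record(buf, offset=0):
        # inverse of pack_record; returns the tuple and the next offset
        i, x = struct.unpack_from('!id', buf, offset)
        offset += struct.calcsize('!id')
        (n,) = struct.unpack_from('!I', buf, offset)
        offset += 4
        s = buf[offset:offset + n].decode('utf-8')
        return (i, x, s), offset + n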

    Is there already some 'standard' way or standard library to store
    such data efficiently?

    Thanks in advance for any suggestion.
     
    Gelonida N, Oct 28, 2011
    #1

  2. Roy Smith

    Roy Smith Guest

    In article <>,
    Gelonida N <> wrote:

    > I would like to save many dicts with a fixed amount of keys
    > tuples to a file in a memory efficient manner (no random, but only
    > sequential access is required)


    There are two possible scenarios here. One, which you seem to be
    exploring, is to carefully study your data and figure out the best way
    to externalize it so as to reduce volume.

    The other is to just write it out in whatever form is most convenient
    (JSON is a reasonable thing to try first), and compress the output. Let
    the compression algorithms worry about extracting the entropy. You may
    be surprised at how well it works. It's also an easy experiment to try,
    so if it doesn't work well, at least it didn't cost you much to find out.
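
    Something like this, perhaps (file name and sample data are made up;
    one JSON object per line keeps the access sequential):

    import gzip
    import json

    records = [{'a': 1, 'x': 3.14, 'msg': 'hello'}] * 1000  # stand-in data

    f = gzip.open('records.json.gz', 'wb')
    try:
        for rec in records:
            # one JSON document per line; gzip squeezes out the
            # repeated keys and other redundancy
            f.write(json.dumps(rec).encode('utf-8') + b'\n')
    finally:
        f.close()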
     
    Roy Smith, Oct 29, 2011
    #2

  3. Steven D'Aprano

    Steven D'Aprano Guest

    On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:

    > Hi,
    >
    > I would like to save many dicts with a fixed amount of keys tuples to a
    > file in a memory efficient manner (no random, but only sequential
    > access is required)


    What do you call "many"? Fifty? A thousand? A thousand million? How many
    items in each dict? Ten? A million?

    What do you mean "keys tuples"?


    > As the keys are the same for each entry I considered converting them to
    > tuples.


    I don't even understand what that means. You're going to convert the keys
    to tuples? What will that accomplish?


    > The tuples contain only strings, ints (long ints) and floats (double)
    > and the data types for each position within the tuple are fixed.
    >
    > The fastest and simplest way is to pickle the data or to use json. Both
    > formats however are not that optimal.


    How big are your JSON files? 10KB? 10MB? 10GB?

    Have you tried using pickle's space-efficient binary format instead of
    text format? Try using protocol=2 when you call pickle.Pickler.

    Or have you considered simply compressing the files?
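
    For example (file name and data invented purely for demonstration),
    writing a binary pickle straight into a gzip stream:

    import gzip
    import pickle

    rows = [(12, 3.14159, 42, 'spam', 'x' * 1999)] * 1000  # stand-in data

    f = gzip.open('rows.pkl.gz', 'wb')
    try:
        # protocol=2 is the compact binary format; gzip then compresses
        # whatever redundancy is left
        pickle.dump(rows, f, protocol=2)
    finally:
        f.close()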


    > I could store ints and floats with pack. As strings have variable
    > length, I'm not sure how to save them efficiently (except by writing
    > a length prefix first and then the string).


    This isn't 1980 and you're very unlikely to be using 720KB floppies.
    Premature optimization is the root of all evil. Keep in mind that when
    you save a file to disk, even if it contains only a single bit of data,
    the actual space used will be an entire block, which on modern hard
    drives is very likely to be 4KB. Trying to compress files smaller than a
    single block doesn't actually save you any space.


    > Is there already some 'standard' way or standard library to store such
    > data efficiently?


    Yes. Pickle and JSON plus zip or gzip.


    --
    Steven
     
    Steven D'Aprano, Oct 29, 2011
    #3
  4. Gelonida N

    Gelonida N Guest

    On 10/29/2011 03:00 AM, Steven D'Aprano wrote:
    > On Fri, 28 Oct 2011 22:47:42 +0200, Gelonida N wrote:
    >
    >> Hi,
    >>
    >> I would like to save many dicts with a fixed amount of keys tuples to a
    >> file in a memory efficient manner (no random, but only sequential
    >> access is required)


    >
    > What do you mean "keys tuples"?

    Corrected phrase:
    I would like to save many dicts with a fixed (and known) number of keys
    in a memory-efficient manner (no random, but only sequential access is
    required) to a file (which can later be sent over a slow, expensive
    network to other machines).

    Example:
    Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
    'message1', 'message2':
    'timestamp' is an integer
    'floatvalue' is a float
    'intvalue' is an int
    'message1' is a string with a maximum length of 2000 characters, but
    can often be very short
    'message2' is the same as message1

    so a typical dict will look like
    { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
    'message1' : '', 'message2' : '=' * 1999 }


    >
    > What do you call "many"? Fifty? A thousand? A thousand million? How many
    > items in each dict? Ten? A million?


    File sizes can range from about 100 KB to over 100 MB per file. Files
    will be accumulated over months.

    I just want to use the smallest possible space, as the data is
    collected over a certain time (days / months) and will be transferred
    via a UMTS / EDGE / GSM network, where the transfer already takes
    several minutes even for quite small data sets.

    I want to reduce the transfer time when requesting files on demand
    (and the amount of data, in order not to exceed the monthly quota).



    >> As the keys are the same for each entry I considered converting them to
    >> tuples.

    >
    > I don't even understand what that means. You're going to convert the keys
    > to tuples? What will that accomplish?


    >> As the keys are the same for each entry I considered converting them

    (the aforementioned dicts) to tuples.

    So the dict { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
    'message1' : '', 'message2' : '=' * 1999 }

    would become

    ( 12, 3.14159, 42, '', '=' * 1999 )
    >
    >
    >> The tuples contain only strings, ints (long ints) and floats (double)
    >> and the data types for each position within the tuple are fixed.
    >>
    >> The fastest and simplest way is to pickle the data or to use json. Both
    >> formats however are not that optimal.

    >
    > How big are your JSON files? 10KB? 10MB? 10GB?
    >
    > Have you tried using pickle's space-efficient binary format instead of
    > text format? Try using protocol=2 when you call pickle.Pickler.


    No. This is probably already a big step forward.

    As I know the data type of each element in the tuple, I would however
    prefer a representation which does not store the data types for each
    tuple over and over again (as they are the same for each dict / tuple).

    >
    > Or have you considered simply compressing the files?


    Compression makes sense, but the initial file format should already be
    rather 'compact'.

    >
    >> I could store ints and floats with pack. As strings have variable
    >> length, I'm not sure how to save them efficiently (except by writing
    >> a length prefix first and then the string).

    >
    > This isn't 1980 and you're very unlikely to be using 720KB floppies.
    > Premature optimization is the root of all evil. Keep in mind that when
    > you save a file to disk, even if it contains only a single bit of data,
    > the actual space used will be an entire block, which on modern hard
    > drives is very likely to be 4KB. Trying to compress files smaller than a
    > single block doesn't actually save you any space.


    >
    >
    >> Is there already some 'standard' way or standard library to store such
    >> data efficiently?

    >
    > Yes. Pickle and JSON plus zip or gzip.
    >


    pickle protocol-2 + gzip of the tuples derived from the dicts might be
    good enough for a start.

    I have to create a little more typical data in order to see what
    percentage of my payload would consist of repeating the data types for
    each tuple.
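
    Something along these lines is what I'll try first (function and file
    names here are only a sketch):

    import gzip
    import pickle

    KEYS = ('timestamp', 'floatvalue', 'intvalue', 'message1', 'message2')

    def dump_records(dicts, filename):
        # one pickled tuple per dict, in fixed key order, protocol 2
        f = gzip.open(filename, 'wb')
        try:
            p = pickle.Pickler(f, protocol=2)
            for d in dicts:
                p.dump(tuple(d[k] for k in KEYS))
        finally:
            f.close()

    def load_records(filename):
        # sequential read-back: yields the original dicts one at a time
        f = gzip.open(filename, 'rb')
        try:
            u = pickle.Unpickler(f)
            while True:
                try:
                    t = u.load()
                except EOFError:
                    break
                yield dict(zip(KEYS, t))
        finally:
            f.close()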
     
    Gelonida N, Oct 29, 2011
    #4
  5. Gelonida N

    Gelonida N Guest

    On 10/29/2011 01:08 AM, Roy Smith wrote:
    > In article <>,
    > Gelonida N <> wrote:
    >
    >> I would like to save many dicts with a fixed amount of keys
    >> tuples to a file in a memory efficient manner (no random, but only
    >> sequential access is required)

    >
    > There are two possible scenarios here. One, which you seem to be
    > exploring, is to carefully study your data and figure out the best way
    > to externalize it so as to reduce volume.
    >
    > The other is to just write it out in whatever form is most convenient
    > (JSON is a reasonable thing to try first), and compress the output. Let
    > the compression algorithms worry about extracting the entropy. You may
    > be surprised at how well it works. It's also an easy experiment to try,
    > so if it doesn't work well, at least it didn't cost you much to find out.



    Yes, I have to run some more tests to see the difference between just
    compressing a plain format (JSON / pickle) and compressing the
    'optimized' representation.
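
    A quick way to get rough numbers (stand-in data again):

    import json
    import pickle
    import zlib

    rows = [[12, 3.14159, 42, '', '=' * 1999]] * 1000  # stand-in data

    candidates = [
        ('json', json.dumps(rows).encode('utf-8')),
        ('pickle-2', pickle.dumps(rows, 2)),
    ]
    for label, payload in candidates:
        # raw size versus zlib-compressed size at maximum compression
        print(label, len(payload), len(zlib.compress(payload, 9)))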
     
    Gelonida N, Oct 29, 2011
    #5
  6. Tim Chase

    Tim Chase Guest

    On 10/29/11 11:44, Gelonida N wrote:
    > I would like to save many dicts with a fixed (and known) number of
    > keys in a memory-efficient manner (no random, but only sequential
    > access is required) to a file (which can later be sent over a slow,
    > expensive network to other machines).
    >
    > Example:
    > Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
    > 'message1', 'message2':
    > 'timestamp' is an integer
    > 'floatvalue' is a float
    > 'intvalue' is an int
    > 'message1' is a string with a maximum length of 2000 characters, but
    > can often be very short
    > 'message2' is the same as message1
    >
    > so a typical dict will look like
    > { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
    > 'message1' : '', 'message2' : '=' * 1999 }
    >
    >
    >>
    >> What do you call "many"? Fifty? A thousand? A thousand million? How many
    >> items in each dict? Ten? A million?

    >
    > File sizes can range from about 100 KB to over 100 MB per file. Files
    > will be accumulated over months.


    If Steven's pickle-protocol-2 solution doesn't quite do what you
    need, you can do something like the code below. Gzip is pretty
    good at addressing...

    >> Or have you considered simply compressing the files?

    > Compression makes sense but the inital file format should be
    > already rather 'compact'


    ...by compressing out a lot of the duplicate aspects. Which also
    mitigates some of the verbosity of CSV.

    It serializes the data to a gzipped CSV file, then deserializes it.
    Just point it at the appropriate data-source and adjust the
    column-names and data-types.

    -tkc

    # (Python 2: the csv module here writes bytes to the binary GzipFile)
    from gzip import GzipFile
    from csv import writer, reader

    data = [  # use your real data here
        {
            'timestamp': 12,
            'floatvalue': 3.14159,
            'intvalue': 42,
            'message1': 'hello world',
            'message2': '=' * 1999,
        },
    ] * 10000

    f = GzipFile('data.gz', 'wb')
    try:
        w = writer(f)
        for row in data:
            w.writerow([
                row[name] for name in (
                    # use your real col-names here
                    'timestamp',
                    'floatvalue',
                    'intvalue',
                    'message1',
                    'message2',
                )])
    finally:
        f.close()

    output = []
    for row in reader(GzipFile('data.gz')):
        d = dict(
            (name, conv(row[i]))
            for i, (conv, name) in enumerate((
                # adjust for your column-names/data-types
                (int, 'timestamp'),
                (float, 'floatvalue'),
                (int, 'intvalue'),
                (str, 'message1'),
                (str, 'message2'),
            )))
        output.append(d)

    # or, equivalently, as one list comprehension:

    output = [
        dict(
            (name, conv(row[i]))
            for i, (conv, name) in enumerate((
                # adjust for your column-names/data-types
                (int, 'timestamp'),
                (float, 'floatvalue'),
                (int, 'intvalue'),
                (str, 'message1'),
                (str, 'message2'),
            )))
        for row in reader(GzipFile('data.gz'))
    ]
     
    Tim Chase, Oct 29, 2011
    #6
