save tuples of simple data types to disk (low memory footprint)

Gelonida N

Hi,

I would like to save many dicts with a fixed amount of keys
tuples to a file in a memory efficient manner (no random, but only
sequential access is required)

As the keys are the same for each entry I considered converting them to
tuples.

The tuples contain only strings, ints (long ints) and floats (double)
and the data types for each position within the tuple are fixed.

The fastest and simplest way is to pickle the data or to use json.
Both formats however are not that optimal.


I could store the ints and floats with struct.pack. As the strings have
variable length, I'm not sure how to save them efficiently
(except by writing a length first and then the string).
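Roughly what I have in mind for the strings is something like this (just a
sketch; the helper names pack_string / unpack_string are made up):

import struct

def pack_string(s):
    # prefix the UTF-8 bytes with a 4-byte unsigned length
    data = s.encode('utf-8')
    return struct.pack('<I', len(data)) + data

def unpack_string(buf, offset):
    # read the 4-byte length, then that many bytes of UTF-8 text;
    # returns the decoded string and the offset just past it
    (length,) = struct.unpack_from('<I', buf, offset)
    start = offset + 4
    return buf[start:start + length].decode('utf-8'), start + length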

Is there already some 'standard' way or standard library to store
such data efficiently?

Thanks in advance for any suggestion.
 
Roy Smith

Gelonida N said:
I would like to save many dicts with a fixed amount of keys
tuples to a file in a memory efficient manner (no random, but only
sequential access is required)

There are two possible scenarios here. One, which you seem to be
exploring, is to carefully study your data and figure out the best way
to externalize it so as to reduce volume.

The other is to just write it out in whatever form is most convenient
(JSON is a reasonable thing to try first), and compress the output. Let
the compression algorithms worry about extracting the entropy. You may
be surprised at how well it works. It's also an easy experiment to try,
so if it doesn't work well, at least it didn't cost you much to find out.
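For example, something along these lines (an untested sketch, Python 3;
one JSON object per line into a gzipped file, and the file name is
arbitrary):

import gzip
import json

records = [{'timestamp': 12, 'floatvalue': 3.14159, 'intvalue': 42,
            'message1': '', 'message2': '=' * 1999}]

# write: one JSON object per line, gzip-compressed on the fly
with gzip.open('data.json.gz', 'wt', encoding='utf-8') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')

# read back sequentially
with gzip.open('data.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        rec = json.loads(line)

gzip will squeeze out most of the repeated key names for you.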
 
Steven D'Aprano

Hi,

I would like to save many dicts with a fixed amount of keys tuples to a
file in a memory efficient manner (no random, but only sequential
access is required)

What do you call "many"? Fifty? A thousand? A thousand million? How many
items in each dict? Ten? A million?

What do you mean "keys tuples"?

As the keys are the same for each entry I considered converting them to
tuples.

I don't even understand what that means. You're going to convert the keys
to tuples? What will that accomplish?

The tuples contain only strings, ints (long ints) and floats (double)
and the data types for each position within the tuple are fixed.

The fastest and simplest way is to pickle the data or to use json. Both
formats however are not that optimal.

How big are your JSON files? 10KB? 10MB? 10GB?

Have you tried using pickle's space-efficient binary format instead of
text format? Try using protocol=2 when you call pickle.Pickler.

Or have you considered simply compressing the files?

I could store the ints and floats with struct.pack. As the strings have
variable length, I'm not sure how to save them efficiently (except by
writing a length first and then the string).

This isn't 1980 and you're very unlikely to be using 720KB floppies.
Premature optimization is the root of all evil. Keep in mind that when
you save a file to disk, even if it contains only a single bit of data,
the actual space used will be an entire block, which on modern hard
drives is very likely to be 4KB. Trying to compress files smaller than a
single block doesn't actually save you any space.

Is there already some 'standard' way or standard library to store such
data efficiently?

Yes. Pickle and JSON plus zip or gzip.
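For example (a rough sketch; the file name is made up):

import gzip
import pickle

rows = [(12, 3.14159, 42, '', '=' * 1999)] * 1000

# write: binary pickle (protocol 2) straight into a gzip stream
with gzip.open('rows.pkl.gz', 'wb') as f:
    pickle.dump(rows, f, protocol=2)

# read back
with gzip.open('rows.pkl.gz', 'rb') as f:
    rows_back = pickle.load(f)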
 
Gelonida N

What do you mean "keys tuples"?
Corrected phrase:
I would like to save many dicts with a fixed (and known) number of keys
in a memory efficient manner (no random, but only sequential access is
required) to a file (which can later be sent over a slow expensive
network to other machines)

Example:
Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
'message1', 'message2'
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a maximum length of 2000 characters, but can
often be very short
'message2' the same as message1

so a typical dict will look like
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
'message1' : '', 'message2' : '=' * 1999 }

What do you call "many"? Fifty? A thousand? A thousand million? How many
items in each dict? Ten? A million?

File sizes can range from 100 KB to over 100 MB per file. Files will be
accumulated over months.

I just want to use the smallest possible space, as the data is collected
over a certain time (days / months) and will be transferred via a UMTS /
EDGE / GSM network, where the transfer already takes several minutes even
for quite small data sets.

I want to reduce the transfer time when requesting files on demand (and
the amount of data, in order not to exceed the monthly quota).


I don't even understand what that means. You're going to convert the keys
to tuples? What will that accomplish?
No, I meant converting the values of the before-mentioned dicts to tuples.

so the dict { 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
'message1' : '', 'message2' : '=' * 1999 }

would become
[ 12, 3.14159, 42, '', '=' * 1999 ]
How big are your JSON files? 10KB? 10MB? 10GB?

Have you tried using pickle's space-efficient binary format instead of
text format? Try using protocol=2 when you call pickle.Pickler.

No. This is probably already a big step forward.

As I know the data type of each element in the tuple, I would however
prefer a representation which does not store the data types for each
tuple over and over again (as they are the same for each dict / tuple).
Or have you considered simply compressing the files?

Compression makes sense, but the initial file format should already be
rather 'compact'.
This isn't 1980 and you're very unlikely to be using 720KB floppies.
Premature optimization is the root of all evil. Keep in mind that when
you save a file to disk, even if it contains only a single bit of data,
the actual space used will be an entire block, which on modern hard
drives is very likely to be 4KB. Trying to compress files smaller than a
single block doesn't actually save you any space.

Yes. Pickle and JSON plus zip or gzip.

Pickle protocol-2 + gzip of the tuples derived from the dicts might be
good enough for a start.

I have to create a little more typical data in order to see what
percentage of my payload would consist of repeating the data types for
each tuple.
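Something like this quick comparison is what I have in mind (an untested
sketch, Python 3; the struct layout is just an assumption of mine, not any
standard format):

import gzip
import json
import pickle
import struct

rows = [(12, 3.14159, 42, 'short message', '=' * 1999)] * 1000

def pack_row(ts, fv, iv, m1, m2):
    # fixed-size part (int64 timestamp, double, int64), then two
    # length-prefixed UTF-8 strings -- an assumed layout, nothing standard
    b1, b2 = m1.encode('utf-8'), m2.encode('utf-8')
    return (struct.pack('<qdq', ts, fv, iv)
            + struct.pack('<I', len(b1)) + b1
            + struct.pack('<I', len(b2)) + b2)

candidates = {
    'json lines': '\n'.join(json.dumps(r) for r in rows).encode('utf-8'),
    'pickle protocol 2': pickle.dumps(rows, protocol=2),
    'struct records': b''.join(pack_row(*r) for r in rows),
}

for name, blob in candidates.items():
    print(name, 'raw:', len(blob), 'gzipped:', len(gzip.compress(blob)))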
 
Gelonida N

There are two possible scenarios here. One, which you seem to be
exploring, is to carefully study your data and figure out the best way
to externalize it so as to reduce volume.

The other is to just write it out in whatever form is most convenient
(JSON is a reasonable thing to try first), and compress the output. Let
the compression algorithms worry about extracting the entropy. You may
be surprised at how well it works. It's also an easy experiment to try,
so if it doesn't work well, at least it didn't cost you much to find out.


Yes, I have to run some more tests to see the difference between
just compressing a plain format (JSON / pickle) and compressing the
'optimized' representation.
 
Tim Chase

I would like to save many dicts with a fixed (and known) number of keys
in a memory efficient manner (no random, but only sequential access is
required) to a file (which can later be sent over a slow expensive
network to other machines)

Example:
Every dict will have the keys 'timestamp', 'floatvalue', 'intvalue',
'message1', 'message2'
'timestamp' is an integer
'floatvalue' is a float
'intvalue' is an int
'message1' is a string with a maximum length of 2000 characters, but can
often be very short
'message2' the same as message1

so a typical dict will look like
{ 'timestamp' : 12, 'floatvalue': 3.14159, 'intvalue': 42,
'message1' : '', 'message2' : '=' * 1999 }



File sizes can range from 100 KB to over 100 MB per file. Files will be
accumulated over months.

If Steven's pickle protocol-2 solution doesn't quite do what you
need, you can do something like the code below. Gzip is pretty
good at addressing...
Compression makes sense, but the initial file format should already be
rather 'compact'.

...by compressing out a lot of the duplicate aspects, which also
mitigates some of the verbosity of CSV.

It serializes the data to a gzipped CSV file, then deserializes it.
Just point it at the appropriate data source and adjust the column
names and data types.

-tkc

from gzip import GzipFile
from csv import writer, reader

# (Python 2 style: the csv module here reads/writes plain bytes, so a
# binary-mode GzipFile is fine; under Python 3 open the gzip files in
# text mode instead, e.g. gzip.open('data.gz', 'wt', newline=''))

data = [  # use your real data here
    {
        'timestamp': 12,
        'floatvalue': 3.14159,
        'intvalue': 42,
        'message1': 'hello world',
        'message2': '=' * 1999,
    },
] * 10000

# write the rows out as gzip-compressed CSV
f = GzipFile('data.gz', 'wb')
try:
    w = writer(f)
    for row in data:
        w.writerow([
            row[name] for name in (
                # use your real col-names here
                'timestamp',
                'floatvalue',
                'intvalue',
                'message1',
                'message2',
            )])
finally:
    f.close()

# read it back, converting each column to its known data type
output = []
for row in reader(GzipFile('data.gz')):
    d = dict(
        (name, convert(row[i]))
        for i, (convert, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
        )))
    output.append(d)

# or, the same thing as a single list comprehension

output = [
    dict(
        (name, convert(row[i]))
        for i, (convert, name) in enumerate((
            # adjust for your column-names/data-types
            (int, 'timestamp'),
            (float, 'floatvalue'),
            (int, 'intvalue'),
            (str, 'message1'),
            (str, 'message2'),
        )))
    for row in reader(GzipFile('data.gz'))
]
 
