Memory efficient tuple storage


psaffrey

I'm reading in some rather large files (28 files each of 130MB). Each
file is a genome coordinate (chromosome (string) and position (int))
and a data point (float). I want to read these into a list of
coordinates (each a tuple of (chromosome, position)) and a list of
data points.

This has taught me that Python lists are not memory efficient: if I use lists,
the process chews through memory at about 100MB a second until it hits the
swap space, and I have 8GB of physical memory in this machine. I can use
Python arrays (the array module) or numpy arrays for the data points, which
is much more manageable.
However, I still need the coordinates. If I don't keep them in a list,
where can I keep them?

Peter
 

Kurt Smith

I'm reading in some rather large files (28 files each of 130MB). Each
file is a genome coordinate (chromosome (string) and position (int))
and a data point (float). I want to read these into a list of
coordinates (each a tuple of (chromosome, position)) and a list of
data points.

This has taught me that Python lists are not memory efficient: if I use lists,
the process chews through memory at about 100MB a second until it hits the
swap space, and I have 8GB of physical memory in this machine. I can use
Python arrays (the array module) or numpy arrays for the data points, which
is much more manageable.
However, I still need the coordinates. If I don't keep them in a list,
where can I keep them?

Assuming your data is in a plaintext file something like
'genomedata.txt' below, the following will load it into a numpy array
with a customized dtype. You can access the different fields by name
('chromo', 'position', and 'dpoint' -- change to your liking). I don't
know whether this will suit your problem, but you might give it a try.

===============================================

[186]$ cat genomedata.txt
gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020

[187]$ cat g2arr.py
import numpy as np

def g2arr(fname):
    # the 'S100' should be modified to be large enough for your string field.
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S100', np.int, np.float]})
    return np.loadtxt(fname, delimiter=' ', dtype=dt)

if __name__ == '__main__':
    arr = g2arr('genomedata.txt')
    print arr
    print arr['chromo']
    print arr['position']
    print arr['dpoint']

=================================================

Take a look at the np.loadtxt and np.dtype documentation.

Kurt
 

Tim Wintle

I'm reading in some rather large files (28 files each of 130MB). Each
file is a genome coordinate (chromosome (string) and position (int))
and a data point (float). I want to read these into a list of
coordinates (each a tuple of (chromosome, position)) and a list of
data points.

This has taught me that Python lists are not memory efficient: if I use lists,
the process chews through memory at about 100MB a second until it hits the
swap space, and I have 8GB of physical memory in this machine. I can use
Python arrays (the array module) or numpy arrays for the data points, which
is much more manageable.
However, I still need the coordinates. If I don't keep them in a list,
where can I keep them?

If you just have one list of objects then it's actually relatively
efficient; it's when you have lots of lists that it gets inefficient.

I can't be certain without seeing your code (and my biology isn't good
enough to know the answer to my question below):

How many unique chromosome strings do you have (by equivalence)?

If the same chromosome string is being used multiple times then you may
find it more efficient to reference the same string, so you don't need
to have multiple copies of the same string in memory. That may be what
is taking up the space.


i.e. something like (written verbosely)

reference_dict = {}
list_of_coordinates = []
for (chromosome, posn) in my_file:
    chromosome = reference_dict.setdefault(chromosome, chromosome)
    list_of_coordinates.append((chromosome, posn))

(or something like that)


Tim Wintle
 

Kurt Smith

Assuming your data is in a plaintext file something like
'genomedata.txt' below, the following will load it into a numpy array
with a customized dtype.  You can access the different fields by name
('chromo', 'position', and 'dpoint' -- change to your liking). I don't
know whether this will suit your problem, but you might give it a try.

To clarify -- I don't know if this will work for your particular
problem, but I do know that it will read in the array correctly and
cut down on memory usage in the final array size.

Specifically, if you use a dtype with 'S50', 'i4' and 'f8' (see the
numpy dtype docs) -- that's 50 bytes for your chromosome string, 4
bytes for the position and 8 bytes for the data point -- each entry
will use just 50 + 4 + 8 bytes, and the numpy array will have just
enough memory allocated for all of these records. The datatypes
stored in the array will be a char array for the string, a C int and a
C double; it won't use the corresponding Python objects, which carry a
lot of extra per-object overhead.
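
As a quick sanity check of that arithmetic (just an illustration, not
something from your code), you can ask the dtype itself how big each packed
record is:

import numpy as np

dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
               'formats': ['S50', 'i4', 'f8']})
print(dt.itemsize)   # 62 -- i.e. 50 + 4 + 8 bytes per record, no per-object overhead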

Hope this helps,

Kurt
 

Tim Chase

While Kurt gave some excellent ideas for using numpy, there were
some missing details in your original post that might help folks
come up with a "work smarter, not harder" solution.

Clearly, you're not loading it into memory just for giggles --
surely you're *doing* something with it once it's in memory.
With details of what you're trying to do with that data, some of
the smart-minds on the list may be able to provide a
solution/algorithm that doesn't require having everything in
memory concurrently.

Or you may try streaming through your data-sources, pumping it
into a sqlite/mysql/postgres database, allowing for more
efficient querying of the data. Both mysql & postgres offer the
ability to import data directly into the server (LOAD DATA INFILE and
COPY, respectively) without the need for (or overhead of) a bajillion
INSERT statements, which may also speed up the slurping-in process.
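
For sqlite, a rough sketch of the streaming approach might be (untested,
table and file names made up -- the point is that a generator feeds
executemany() so nothing is ever held in memory all at once):

import sqlite3

def load_into_sqlite(fname, dbname='genome.db'):
    conn = sqlite3.connect(dbname)
    conn.execute("CREATE TABLE IF NOT EXISTS points "
                 "(chromo TEXT, position INTEGER, dpoint REAL)")
    with open(fname) as fh:
        rows = ((c, int(p), float(d))
                for c, p, d in (line.split() for line in fh))
        conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

An index on (chromo, position) afterwards should make the per-bin range
queries cheap.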

-tkc
 

psaffrey

Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I
tried heapy, but couldn't make much sense of the output and it didn't
seem to change too much for different usages. Maybe I was just making
the h.heap() call in the wrong place. I also tried getrusage() in the
resource module. That seemed to give 0 for the shared and unshared
memory size no matter what I did. I was calling it after the function
call that filled up the lists. The memory figures I give in this
message come from top.
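
For reference, the getrusage() check was roughly this (the real script is
messier; fill_lists is just a stand-in name for the function that reads
the files):

import resource

fill_lists()   # stand-in for the loading function
usage = resource.getrusage(resource.RUSAGE_SELF)
# These three always come back 0 for me -- apparently Linux never fills
# in the shared/unshared rss fields.
print(usage.ru_ixrss)
print(usage.ru_idrss)
print(usage.ru_isrss)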

The numpy solution does work, but it uses more than 1GB of memory for
one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6',
'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines,
which by my arithmetic is around 100MB...?

My previous solution - using a python array for the numbers and a list
of tuples for the coordinates - uses about 900MB. The dictionary
solution suggested by Tim got this down to 650MB. If I just ignore the
coordinates, it comes down to less than 100MB. I feel sure the overhead
of the list of tuples for the coordinates is what is killing me here.

As to "work smarter", you could be right, but it's tricky. The 28
files are in 4 groups of 7, so given that each file is about 6 million
lines, each group of data points contains about 42 million points.
First, I need to divide every point by the median of its group. Then I
need to z-score the whole group of points.

After this preparation, I need to file each point, based on its
coordinates, into other data structures - the genome itself is divided
up into bins that cover a range of coordinates, and we file each point
into the appropriate bin for the coordinate region it overlaps. Then
there are operations that combine the values from various bins. The
relevant coordinates for these combinations come from more enormous
csv files. I've already done all this analysis on smaller datasets, so
I'm hoping I won't have to make huge changes just to fit the data into
memory. Yes, I'm also finding out how much it will cost to upgrade to
32GB of memory :)
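
(In numpy terms the normalisation step boils down to something like this,
give or take the details:)

import numpy as np

def normalise(group_points):
    # group_points: 1-D float array holding all ~42 million points of one group
    scaled = group_points / np.median(group_points)
    return (scaled - scaled.mean()) / scaled.std()

The binning itself should essentially be np.digitize(position, bin_edges)
per chromosome once the coordinates are in arrays.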

Sorry for the long message...

Peter
 

Gabriel Genellina

If the same chromosome string is being used multiple times then you may
find it more efficient to reference the same string, so you don't need
to have multiple copies of the same string in memory. That may be what
is taking up the space.


i.e. something like (written verbosely)

reference_dict = {}
for (chromosome, posn) in my_file:
    chromosome = reference_dict.setdefault(chromosome, chromosome)

Note that the intern() builtin does exactly that: chromosome =
intern(chromosome)
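
e.g. (just to illustrate the effect -- the slices only serve to build two
equal but distinct string objects):

a = 'chr1_suffix'[:4]
b = 'chr1_another'[:4]
print(a is b)                   # False: two separate copies of 'chr1'
print(intern(a) is intern(b))   # True: one shared copy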
 

Kurt Smith

Thanks for all the replies.
[snip]

The numpy solution does work, but it uses more than 1GB of memory for
one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6',
'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines,
which by my arithmetic is around 100MB...?

I made a mock up file with 5832443 lines, each line consisting of

abcdef 100 100.0

and ran the g2arr() function with 'S6' for the string. While running
(which took quite a while), the memory usage spiked on my computer to
around 800MB, but once g2arr() returned, the memory usage went to
around 200MB. The number of bytes consumed by the array is 105MB
(using arr.nbytes). From looking at the loadtxt routine in numpy, it
looks like there are a zillion objects created (string objects for
splitting each line, temporary ints, floats and strings for type
conversions, etc.) while in the routine, which are garbage collected
upon return. I'm not well versed in Python's internal memory
management system, but from what I understand, practically all that
memory is either returned to the OS or held onto by Python for future
use by other objects after the routine returns. But the only memory
in use by the array is the ~100MB for the raw data.

Making 5 copies of the array (using numpy.copy(arr)) bumps total
memory usage (from top) up to 700MB, which is 117MB per array or so.
The total memory reported by summing the arr.nbytes is 630MB (105MB /
array), so there isn't that much memory wasted. Basically, the numpy
solution will pack the data into an array of C structs with the fields
as indicated by the dtype parameter.

Perhaps a database solution, as mentioned in other posts, would suit you
better; if the temporary spike in memory usage is unacceptable, you
could try to roll your own loadtxt function that would be leaner and
meaner. I suggest the numpy solution for its ease and efficient use
of memory.
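
A leaner loader might look roughly like this (untested sketch -- it needs
the line count up front, say from a quick 'wc -l' pass, so that the only
big allocation is the array itself):

import numpy as np

def lean_load(fname, nlines, strlen=6):
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S%d' % strlen, 'i4', 'f8']})
    arr = np.empty(nlines, dtype=dt)
    with open(fname) as fh:
        for i, line in enumerate(fh):
            chromo, pos, val = line.split()
            arr[i] = (chromo, int(pos), float(val))
    return arr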

Kurt
 

Aaron Brady

Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I
tried heapy, but couldn't make much sense of the output and it didn't
seem to change too much for different usages. Maybe I was just making
the h.heap() call in the wrong place. I also tried getrusage() in the
resource module. That seemed to give 0 for the shared and unshared
memory size no matter what I did. I was calling it after the function
call that filled up the lists. The memory figures I give in this
message come from top.

The numpy solution does work, but it uses more than 1GB of memory for
one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6',
'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines,
which by my arithmetic is around 100MB...?
[snip]

Sorry, did not study your post. But can you use a ctypes.Structure?
Or, can you use a database or mmap to keep the data out of memory?
Or, how would you feel about a mini extension in C?
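
e.g. a ctypes.Structure for one record would be something along these
lines (just a sketch):

import ctypes

class GenomePoint(ctypes.Structure):
    _fields_ = [('chromo', ctypes.c_char * 6),
                ('position', ctypes.c_int),
                ('dpoint', ctypes.c_double)]

print(ctypes.sizeof(GenomePoint))        # 24 on a typical 64-bit box (with padding)
points = (GenomePoint * 5832443)()       # one flat C array, no per-element objects
points[0] = GenomePoint('chr1', 120189, 5.34849)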
 

psaffrey

In the end, I used a cStringIO object to store the chromosomes -
because there are only 23, I can use one character for each chromosome
and represent the whole lot with a giant string and a dictionary to
say what each character means. Then I used numpy arrays for the data
and coordinates. This squeezed each file into under 100MB.
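
Roughly (simplified, and the names are made up -- the real code is spread
over a few functions):

import numpy as np
from array import array
from cStringIO import StringIO

chrom_codes = {}            # e.g. {'chr1': 'A', 'chr2': 'B', ...}, built as we go
code_buffer = StringIO()    # one character per data point
positions = array('i')      # compact C ints, no per-element object overhead
values = array('d')         # compact C doubles

for line in open('genomedata.txt'):
    chromo, pos, val = line.split()
    code = chrom_codes.setdefault(chromo, chr(ord('A') + len(chrom_codes)))
    code_buffer.write(code)
    positions.append(int(pos))
    values.append(float(val))

chromosomes = code_buffer.getvalue()           # giant string, one byte per point
positions = np.array(positions, dtype='i4')
values = np.array(values, dtype='f8')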

Thanks again for the help!

Peter
 
