numpy.memmap advice?

Lionel · Feb 17, 2009

Hello all,

On a previous thread
(http://groups.google.com/group/comp.lang.python/browse_thread/thread/64da35b811e8f69d/67fa3185798ddd12?hl=en&lnk=gst&q=keene#67fa3185798ddd12)
I was asking about reading in binary data. Briefly, my data consists
of complex numbers, 32-bit floats for real and imaginary parts. The
data is stored as 4 bytes Real1, 4 bytes Imaginary1, 4 bytes Real2,
4 bytes Imaginary2, etc. in row-major format. I needed to read the
data in as two separate numpy arrays, one for real values and one for
imaginary values.

There were several very helpful performance tips offered, and one in
particular I've started looking into. The author suggested a
"numpy.memmap" object may be beneficial. It was suggested I use it as
follows:

from numpy import dtype, memmap, recarray

descriptor = dtype([("r", "<f4"), ("i", "<f4")])
data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
print("First 100 real values:", data.r[:100])

I have two questions:
1) What is "recarray"?
2) The documentation for numpy.memmap claims that it is meant to be
used in situations where it is beneficial to load only segments of a
file into memory, not the whole thing. This is definitely something
I'd like to be able to do as my files are frequently >1 GB. I don't
really see in the documentation how portions are loaded, however.
The examples seem to create small arrays and then assign the entire
array (i.e. file) to the memmap object. Let's assume I have a binary
data file of complex numbers in the format described above, and let's
assume that the size of the complex data array (that is, the entire
file) is 100x100 (rows x columns). Could someone please post a few
lines showing how to load the top-left 50 x 50 quadrant, and the
lower-right 50 x 50 quadrant into memmap objects? Thank you very much
in advance!

-L
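
For readers on a current Python 3 and NumPy, the layout described in this post can be tried out roughly as follows. This is only a sketch: the file name `demo_complex.bin` and the tiny 4 x 3 shape are made up for illustration, and the script writes its own test file so that it runs on its own.

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file: 4 x 3 complex values stored as interleaved
# little-endian float32 pairs (real, imaginary), row-major.
rows, cols = 4, 3
n = np.arange(rows * cols)
values = (n + 10j * n).astype("<c8")
path = os.path.join(tempfile.mkdtemp(), "demo_complex.bin")
values.tofile(path)

# One record per complex number: 4 bytes real, then 4 bytes imaginary.
descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
data = np.memmap(path, dtype=descriptor, mode="r",
                 shape=(rows, cols)).view(np.recarray)

real = data.r  # view of the real parts, shape (rows, cols)
imag = data.i  # view of the imaginary parts, shape (rows, cols)
print("First row, real parts:", real[0])
print("First row, imaginary parts:", imag[0])
```

Passing `shape` gives the mapped data its 2-D form directly, so no separate reshape step is needed.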

Robert Kern · Feb 17, 2009

I don't have time to answer your questions now, so you should ask on the numpy
mailing list where others can jump in.

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Carl Banks · Feb 18, 2009

1) What is "recarray"?

Let's look:

[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> help(numpy.recarray)
Help on class recarray in module numpy.core.records:

class recarray(numpy.ndarray)
 |  recarray(shape, dtype=None, buf=None, **kwds)
 |
 |  Subclass of ndarray that allows field access using attribute lookup.
 |
 |  Parameters
 |  ----------
 |  shape : tuple
 |      shape of record array
 |  dtype : data-type or None
 |      The desired data-type. If this is None, then the data-type is determine
 |      by the *formats*, *names*, *titles*, *aligned*, and *byteorder* keywords
 |  buf : [buffer] or None
 |      If this is None, then a new array is created of the given shape and data
 |      If this is an object exposing the buffer interface, then the array will
 |      use the memory from an existing buffer. In this case, the *offset* and
 |      *strides* keywords can also be used.
 ....

So there you have it. It's a subclass of ndarray that allows field
access using attribute lookup. (IOW, you're creating a view of the
memmap'ed data of type recarray, which is the type numpy uses to
access structures by name. You need to create the view because
regular numpy arrays, which numpy.memmap creates, can't access fields
by attribute.)

help() is a nice thing to use, and numpy is one of the better
libraries when it comes to docstrings, so learn to use it.

2) Could someone please post a few lines showing how to load the
top-left 50 x 50 quadrant, and the lower-right 50 x 50 quadrant into
memmap objects?

You would memmap the whole region in question (in this case the whole
file), then take a slice. Actually you could get away with memmapping
just the last 50 rows (bottom half). The offset into the file would
be 50*100*8, so:

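from numpy import memmap, recarray, reshape

# Skip the first 50 rows: 50 rows * 100 columns * 8 bytes per value.
data = memmap(filename, dtype=descriptor, mode='r',
              offset=50*100*8).view(recarray)
reshaped_data = reshape(data, (50, 100))
interesting_data = reshaped_data[:, 50:100]  # lower-right 50 x 50 quadrant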

A word of caution: Every instance of numpy.memmap creates its own mmap
of the whole file (even if it only creates an array from part of the
file). The implications of this are A) you can't use numpy.memmap's
offset parameter to get around file size limitations, and B) you
shouldn't create many numpy.memmaps of the same file. To work around
B, you should create a single memmap, and dole out views and slices.

Carl Banks
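
The "single memmap, then dole out views and slices" advice might look like the sketch below on a current NumPy. The file name `demo_100x100.bin` and its contents are invented for the demonstration; only the 100 x 100 layout of interleaved float32 pairs comes from the thread.

```python
import os
import tempfile

import numpy as np

# Hypothetical 100 x 100 demo file of interleaved float32 (real, imag) pairs.
rows = cols = 100
path = os.path.join(tempfile.mkdtemp(), "demo_100x100.bin")
np.arange(rows * cols).astype("<c8").tofile(path)

descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

# Map the file once, with its 2-D shape, and hand out slices of that one map.
full = np.memmap(path, dtype=descriptor, mode="r", shape=(rows, cols))
top_left = full[:50, :50]
lower_right = full[50:, 50:]

# Both quadrants are views onto the same mapping; nothing has been copied.
print(top_left.shape, lower_right.shape)
print(np.shares_memory(full, lower_right))
```

Field access works on the slices as well, for example `lower_right["r"]` for the real parts of that quadrant.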

Lionel · Feb 18, 2009

Thanks Carl, I like your solution. Am I correct in my understanding
that memory is allocated at the slicing step in your example i.e. when
"reshaped_data" is sliced using "interesting_data = reshaped_data[:,
50:100]"? In other words, given a huge (say 1 GB) file, a memmap object
is constructed that memmaps the entire file. Some relatively small
amount of memory is allocated for the memmap operation, but the bulk
memory allocation occurs when I generate my final numpy sub-array by
slicing, and this accounts for the memory efficiency of using memmap?

Carl Banks · Feb 18, 2009

No, what accounts for the memory efficiency is that there is no bulk
allocation at all. The ndarray you have points to the memory that's
in the mmap. There is no copying of data or separate array allocation.

Also, it's not any more memory efficient to use the offset parameter
with numpy.memmap than it is to memmap the whole file and take a
slice.

Carl Banks
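
The difference between a slice of the mapping and a real in-memory copy can be checked directly. A minimal sketch, assuming a current NumPy and a throwaway file named `demo_floats.bin` that the script creates itself:

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file of 1000 float32 values.
path = os.path.join(tempfile.mkdtemp(), "demo_floats.bin")
np.arange(1000, dtype="<f4").tofile(path)

mapped = np.memmap(path, dtype="<f4", mode="r")
window = mapped[100:200]    # a view: still backed by the mapped file
in_ram = np.array(window)   # an explicit copy into ordinary process memory

print(np.shares_memory(mapped, window))  # the slice shares the mapping
print(np.shares_memory(mapped, in_ram))  # the copy does not
```

Only the explicit `np.array(...)` call allocates a new buffer; slicing by itself does not.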

sturlamolden · Feb 18, 2009

1) What is "recarray"?

An ndarray of what C programmers know as a "struct", in which each
field is accessible by its name.

That is,

struct rgba {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char a;
};

struct rgba arr[480][640];

is similar to:

import numpy as np
rgba = np.dtype({'names': list('rgba'), 'formats': [np.uint8]*4})
arr = np.zeros((480, 640), dtype=rgba)  # allocate a 480 x 640 array of records

Now you can access the r, g, b and a fields directly using arr['r'],
arr['g'], arr['b'], and arr['a'].
Internally the data will be represented compactly, as with the C code
above. If you want to view the data as a 480 x 640 array of 32-bit
integers instead, it is as simple as arr.view(dtype=np.uint32).
Formatted binary data can of course be read from files using
np.fromfile with the specified dtype, and written to files by passing
a recarray as buffer to file.write. You can thus see NumPy's
recarrays as a more powerful alternative to Python's struct module.

I don't really see in the diocumentation how portions are loaded, however.

Prior to Python 2.6, the mmap object (which numpy.memmap uses
internally) does not take an offset parameter. But when NumPy is
ported to a newer version of Python this will be fixed. You should
then be able to memory map with an ndarray from a certain offset. To
make this work now, you must e.g. backport mmap from Python 2.6 and
use that with NumPy. Not difficult, but nobody has bothered to do it
(as far as I know).

Sturla Molden
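
The file round trip and the integer view described above can be sketched like this on a current NumPy. The 4 x 6 shape, the field values and the file name `demo_pixels.bin` are arbitrary choices for the demonstration.

```python
import os
import tempfile

import numpy as np

rgba = np.dtype({"names": list("rgba"), "formats": [np.uint8] * 4})

# A small image-like array; every field starts at zero.
arr = np.zeros((4, 6), dtype=rgba)
arr["r"] = 255         # set the red field of every pixel
arr["a"][0, :] = 128   # set alpha on the first row only

# Write the raw records and read them back with the same dtype.
path = os.path.join(tempfile.mkdtemp(), "demo_pixels.bin")
with open(path, "wb") as f:
    f.write(arr.tobytes())
restored = np.fromfile(path, dtype=rgba).reshape(arr.shape)

# Reinterpret each 4-byte record as one 32-bit integer.
packed = arr.view(np.uint32)
print(restored["r"][0, 0], restored["a"][0, 0], packed.shape)
```

The file holds no shape or dtype information, so the reader has to supply both, just as with the complex data discussed in this thread.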

Carl Banks · Feb 18, 2009

You can use an offset with numpy.memmap today; it'll mmap the whole
file, but start the array data at the given offset.

The offset parameter of mmap itself would be useful to map small
portions of gigabyte-sized files, and maybe numpy.memmap can take
advantage of that if the user passes an offset parameter. One thing
you can't do with mmap's offset, but you can do with numpy.memmap's,
is to set it to an arbitrary value: mmap's offset has to be a multiple
of some large number (something like 1 MB, depending on the OS).

Carl Banks
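
Python's mmap module exposes the required multiple as `mmap.ALLOCATIONGRANULARITY`, so the exact figure does not have to be guessed. The helper below, `split_offset`, is a hypothetical name used only to illustrate the arithmetic a wrapper would need in order to accept an arbitrary byte offset: map from the nearest aligned position at or before it, then skip the remainder.

```python
import mmap


def split_offset(offset, granularity=mmap.ALLOCATIONGRANULARITY):
    """Split a byte offset into an aligned mmap start and a remainder."""
    aligned_start = offset - (offset % granularity)
    return aligned_start, offset - aligned_start


offset = 50 * 100 * 8  # start of the lower half of the 100 x 100 example
start, remainder = split_offset(offset)
print(mmap.ALLOCATIONGRANULARITY, start, remainder)
```

The aligned start is what would be handed to mmap; the remainder is how far into that mapping the array data begins.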

Lionel · Feb 19, 2009

Does this mean that every time I iterate through an ndarray that is
sourced from a memmap, the data is read from the disk? The sliced
array is at no time wholly resident in memory? What are the
performance implications of this?

Carl Banks · Feb 19, 2009

Ok, sorry for the confusion. What I should have said is that there is
no bulk allocation *by numpy* at all. The call to mmap does allocate
a chunk of RAM to reflect file contents, but the numpy arrays don't
allocate any memory of their own: they use the same memory as was
allocated by the mmap call.

Carl Banks

sturlamolden · Feb 19, 2009

The offset parameter of mmap itself would be useful to map small
portions of gigabyte-sized files, and maybe numpy.memmap can take
advantage of that if the user passes an offset parameter.

NumPy's memmap is just a wrapper for Python 2.5's mmap. The offset
parameter does not affect the amount that is actually memory mapped.

S.M.

Carl Banks · Feb 19, 2009

NumPy's memmap is just a wrapper for Python 2.5's mmap. The offset
parameter does not affect the amount that is actually memory mapped.

Yes, that's what I said, but in future numpy.memmap could be updated
to take advantage of mmap's new offset parameter.

Carl Banks

Lionel · Feb 19, 2009

Thanks for the explanations Carl. I'm sorry, but it's me who's the
confused one here, not anyone else.

I hate to waste everyone's time again, but something is just not
"clicking" in that black-hole I call a brain. So..."numpy.memmap"
allocates a chunk off the heap to coincide with the file contents. If
I memmap the entire 1 GB file, a corresponding amount (approx. 1 GB)
is allocated? That seems to contradict what is stated in the numpy
documentation:

"class numpy.memmap
Create a memory-map to an array stored in a file on disk.

Memory-mapped files are used for accessing small segments of large
files on disk, without reading the entire file into memory."

In my previous example that we were working with (100x100 data file),
you used an offset to memmap the "lower-half" of the array. Does this
mean that in the process of memmapping that lower half, RAM was set
aside for 50x100 32-bit complex numbers? If so, and I decide to memmap
an entire file, there is no memory benefit in doing so.

At this point do you (or anyone else) recommend I just write a little
function for my class that takes the coords I intend to load and "roll
my own" function? Seems like the best way to keep memory to a minimum,
I'm just worried about performance. On the other hand, the most I'd be
loading would be around 1k x 1k worth of data.

Carl Banks · Feb 19, 2009

I hate to waste everyone's time again, but something is just not
"clicking" in that black-hole I call a brain. So..."numpy.memmap"
allocates a chunk off the heap to coincide with the file contents. If
I memmap the entire 1 GB file, a corresponding amount (approx. 1 GB)
is allocated? That seems to contradict what is stated in the numpy
documentation:

"class numpy.memmap
Create a memory-map to an array stored in a file on disk.

Memory-mapped files are used for accessing small segments of large
files on disk, without reading the entire file into memory."

Yes, it allocates room for the whole file in your process's LOGICAL
address space. However, it doesn't actually reserve any PHYSICAL
memory, or read in any data from the disk, until you actually access
the data. And then it only reads small chunks in, not the whole file.

So when you mmap your 1 GB file, the OS sets aside a 1 GB chunk of
address space to use for your memory map. That's all it does: it
doesn't read anything from disk, it doesn't reserve any physical RAM.
Later, when you access a byte in the mmap via a pointer, the OS notes
that it hasn't yet loaded the data at that address, so it grabs a
small chunk of physical RAM and reads in a small amount of data from
the disk containing the byte you are accessing. This all happens
automatically and transparently to you.

In my previous example that we were working with (100x100 data file),
you used an offset to memmap the "lower-half" of the array. Does this
mean that in the process of memmapping that lower half, RAM was set
aside for 50x100 32-bit complex numbers? If so, and I decide to memmap
an entire file, there is no memory benefit in doing so.

The mmap call sets aside room for all 100x100 32-bit complex numbers
in logical address space, regardless of whether you use the offset
parameter or not. However, it might only read part of the file in
from disk, and will only reserve physical RAM for the parts it reads
in.

At this point do you (or anyone else) recommend I just write a little
function for my class that takes the coords I intend to load and "roll
my own" function? Seems like the best way to keep memory to a minimum,
I'm just worried about performance. On the other hand, the most I'd be
loading would be around 1k x 1k worth of data.

No, if your file is not too large to mmap, just do it the way you've
been doing it. The documentation you've been reading is pretty much
correct, even if you approach it naively. It is both memory and I/O
efficient. You're overthinking things here; don't try to outsmart the
operating system. It'll take care of the performance issues
satisfactorily.

The only thing you have to worry about is if the file is too large to
fit into your process's logical address space, which on a typical 32-
bit system is 2-3 GB (depending on configuration) minus the space
occupied by Python and other heap objects, which is probably only a
few MB.

Carl Banks
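
Whether a process is 32-bit or 64-bit can be checked from Python itself. The sketch below is a rough heuristic only: `pointer_bits` and `probably_mappable` are hypothetical helper names, and the 2 GB budget for a 32-bit process is an assumption taken from the lower end of the range quoted above, not a measured limit.

```python
import struct


def pointer_bits():
    """Width of a native pointer in bits (32 or 64 on common platforms)."""
    return struct.calcsize("P") * 8


def probably_mappable(file_size, bits=None, usable_32bit=2 * 1024**3):
    """Rough guess at whether a file of file_size bytes can be mapped whole.

    usable_32bit is an assumed budget for a 32-bit process, not a real limit.
    """
    if bits is None:
        bits = pointer_bits()
    if bits >= 64:
        return True  # address space is not the practical constraint here
    return file_size < usable_32bit


print(pointer_bits(), probably_mappable(1 * 1024**3))
```

On a 64-bit interpreter the check always passes, which matches the point that only logical address space, not physical RAM, limits the size of the mapping.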

Lionel · Feb 19, 2009

I see. That was very well explained Carl, thank you.
