Large data arrays?


Ole Streicher

Hi,

for my application, I need to use quite large data arrays
(100,000 x 4000 values) of floating-point numbers, with fast
row-wise and column-wise access (main case: return a column with the sum
over a number of selected rows, and vice versa).

I would use a numpy array for that, but those seem to be
memory-resident. One of these arrays would use about 1.6 GB of
memory, which is far too much. So I was thinking about a memory-mapped
file instead. As far as I understand, numpy provides one.
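A hedged sketch of how numpy.memmap could back such an array on disk (the filename and the scaled-down shape are illustrative; the real case would be 100,000 x 4000):

```python
import numpy as np
import os
import tempfile

# Scaled-down sketch: a float64 array backed by a file on disk
# instead of held in RAM. Path and shape are illustrative.
path = os.path.join(tempfile.gettempdir(), "demo_memmap.bin")
shape = (1000, 400)
a = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
a[:] = 1.0  # fill with ones so the sums below are easy to check

# Row-wise access is cheap for the default C (row-major) layout:
row_block_sum = a[10:20].sum(axis=0)   # one value per column -> each is 10.0

# Column-wise access reads values scattered across the whole file:
col_sum = a[:, 5].sum()                # 1000.0
```

The memmap behaves like an ordinary ndarray (including sum(), slicing, etc.), but note that a column read touches one stride per row, which is exactly the access-pattern problem raised in question 2 below.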

For this, I have two questions:

1. Are "numpy.memmap" arrays unlimited in size (or limited only
by the maximal file size)? And do they count against the system's memory
limit (~3 GB for 32-bit systems)?

2. Since I need row-wise as well as column-wise access, a naive use
of a big array as a memory-mapped file will probably lead to very poor
performance, since one of the two access patterns would need to read
values scattered around the whole file. Are there any "plug and play"
solutions for that? If not: what would be the best way to solve this
problem? Probably one needs to use something like the "Morton layout"
for the data. Would one then build a subclass of memmap (or ndarray?)
that implements this specific layout? How would one do that? (Sorry, I
am still a beginner with respect to Python.)

Best regards

Ole
 

Ole Streicher

Hi Nick,

Nick Craig-Wood said:
mmaps come out of your application's memory space, so out of that 3 GB
limit. You don't need that much RAM, of course, but it does use up
address space.

Hmm. So I have no chance to use two or more of these arrays simultaneously?

Nick Craig-Wood said:
Sorry, I don't know very much about numpy, but it occurs to me that you
could have two copies of your mmapped array, one the transpose of the
other, which would then speed up the two access patterns enormously.
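Nick's two-copy scheme might look like this (a scaled-down sketch with hypothetical filenames; note that it doubles both the disk usage and the mapped address space, which is the objection raised below):

```python
import numpy as np
import os
import tempfile

# Scaled-down sketch (real case: 100,000 x 4000): two mmapped copies,
# the second holding the transposed data so column reads become row reads.
tmp = tempfile.gettempdir()
shape = (1000, 400)

rowwise = np.memmap(os.path.join(tmp, "rowwise.bin"), dtype=np.float64,
                    mode="w+", shape=shape)
rowwise[:] = np.arange(shape[0] * shape[1]).reshape(shape)

colwise = np.memmap(os.path.join(tmp, "colwise.bin"), dtype=np.float64,
                    mode="w+", shape=(shape[1], shape[0]))
colwise[:] = rowwise.T   # fill once, at twice the address-space cost

# A column sum now reads contiguous bytes from the transposed copy:
fast_col_sum = colwise[5].sum()
slow_col_sum = rowwise[:, 5].sum()   # same value, scattered reads
```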

That would be a solution, but it takes twice the amount of address
space (which already seems to be the limiting factor). In my case (1.6
GB per array), I could not even use a single one.

Also, I would need to fill two large files at program start: one for
each orientation (row-wise or column-wise). Depending on the input
data (which are also either row-wise or column-wise), filling the
array with the opposite orientation would take a lot of time because
of these inefficiencies.

Therefore, using both orientations would probably not be a good
solution. What I found is the "Morton layout", which uses a kind of
fractal interleaving and sounds not that complicated. But I have no
idea how to turn it into "numpy" style: can I just extend
numpy.ndarray (or numpy.memmap), and which functions/methods would
then need to be overridden? The best would be, of course, that someone
has already done this before, so that I could use it without falling
into all the pitfalls that occur when one implements a very generic
algorithm.

Best regards

Ole
 

Aahz

Ole Streicher said:
for my application, I need to use quite large data arrays
(100,000 x 4000 values) of floating-point numbers, with fast
row-wise and column-wise access (main case: return a column with the sum
over a number of selected rows, and vice versa).

I would use a numpy array for that, but those seem to be
memory-resident. One of these arrays would use about 1.6 GB of
memory, which is far too much. So I was thinking about a memory-mapped
file instead. As far as I understand, numpy provides one.

You probably want to ask on a NumPy or SciPy list:
http://scipy.org/Mailing_Lists
 

John Machin

Ole Streicher said:
Hi,

for my application, I need to use quite large data arrays
(100,000 x 4000 values) of floating-point numbers, with fast
row-wise and column-wise access (main case: return a column with the sum
over a number of selected rows, and vice versa).

I would use a numpy array for that, but those seem to be
memory-resident. One of these arrays would use about 1.6 GB of
memory, which is far too much. So I was thinking about a memory-mapped
file instead. As far as I understand, numpy provides one.

For this, I have two questions:

1. Are "numpy.memmap" arrays unlimited in size (or limited only
by the maximal file size)? And do they count against the system's memory
limit (~3 GB for 32-bit systems)?

2. Since I need row-wise as well as column-wise access, a naive use
of a big array as a memory-mapped file will probably lead to very poor
performance, since one of the two access patterns would need to read
values scattered around the whole file. Are there any "plug and play"
solutions for that? If not: what would be the best way to solve this
problem? Probably one needs to use something like the "Morton layout"
for the data. Would one then build a subclass of memmap (or ndarray?)
that implements this specific layout? How would one do that? (Sorry, I
am still a beginner with respect to Python.)

The Morton layout wastes space if the matrix is not square. Your 100K
x 4K is very non-square. Looks like you might want to use e.g. 25
Morton arrays, each 4K x 4K.

Cheers,
John
 

Ole Streicher

Hi John,

John Machin said:
The Morton layout wastes space if the matrix is not square. Your 100K
x 4K is very non-square. Looks like you might want to use e.g. 25
Morton arrays, each 4K x 4K.

What I found was that the Morton layout should be usable if the shape
is rectangular and both dimensions are powers of two. But all the
examples were done with equal dimensions, so I am a bit confused here.

From my access pattern, it would probably be better to combine 25 rows
into one slice and have one matrix where every cell contains 25 rows.

Are there any objections to that?

Best regards

Ole
 

Ole Streicher

Hi Nick,

Nick Craig-Wood said:
I'd start by writing a function which took (x, y) in array
co-ordinates and transformed that into (z) remapped in the Morton
layout.

This removes the possibility of using the sum() and similar methods of
numpy. Implementing them myself would probably be much worse than
using numpy's own.

Nick Craig-Wood said:
Alternatively you could install a 64bit OS on your machine and use
my scheme!

Well: I am just the developer. Of course I could simply raise the
requirements for using my software, but I think it is good style to
keep them as low as possible.

Best regards

Ole
 

John Machin

Ole Streicher said:
Hi John,

What I found was that the Morton layout should be usable if the shape
is rectangular and both dimensions are powers of two. But all the
examples were done with equal dimensions, so I am a bit confused here.

Yes, you are right, it can be done in one hit with a rectangular
array. How it is done: in your case you have a 2**17 x 2**12 array, so
the Morton index corresponding to (i, j) would have the top 5 bits of
i followed by the remaining 12 bits of i interleaved with the 12 bits
of j -- scarcely distinguishable from my original suggestion of 25
4K x 4K arrays, once you've ignored the trailing approx. 2**17 - 100,000
rows that you don't need to allocate space for ;-)
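The bit-interleaving described above might be sketched like this (names are illustrative, and which coordinate takes the even bit positions is a free convention):

```python
def interleave12(a, b):
    """Interleave the low 12 bits of a and b.

    Here a's bits take the odd positions and b's bits the even
    positions; the opposite convention would work just as well.
    """
    z = 0
    for k in range(12):
        z |= ((a >> k) & 1) << (2 * k + 1)
        z |= ((b >> k) & 1) << (2 * k)
    return z


def morton_index(i, j):
    """Map (i, j), 0 <= i < 2**17, 0 <= j < 2**12, to a linear index.

    The top 5 bits of i are kept as-is; the low 12 bits of i are
    bit-interleaved with the 12 bits of j, as described above.
    """
    top = i >> 12                       # top 5 bits of i
    return (top << 24) | interleave12(i & 0xFFF, j)
```

With this mapping, points that are close in either i or j land close together in the linear index, which is what makes both row-wise and column-wise scans of an mmapped file reasonably local.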
Ole Streicher said:
From my access pattern, it would probably be better to combine 25 rows
into one slice and have one matrix where every cell contains 25 rows.

Are there any objections to that?

Can't object, because I'm not sure what you mean ... how many elements
in a "cell"?

Cheers,
John
 

Ole Streicher

Hi John,

John Machin said:
Can't object, because I'm not sure what you mean ... how many elements
in a "cell"?

Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".

Regards

Ole
 

John Machin

Ole Streicher said:
Hi John,

Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".

Yes yes but you said "every cell contains 25 rows" ... what's in a
cell? 25 rows, with each row containing what?
 

Ole Streicher

Hi John,

John Machin said:
Yes yes but you said "every cell contains 25 rows" ... what's in a
cell? 25 rows, with each row containing what?

I mean: original cells.
I have 100,000 x 4096 entries:

(0,0) (0,1) ... (0,4095)
(1,0) (1,1) ... (1,4095)
....
(99999,0) (99999,1) ... (99999,4095)

This will be reorganized into a new matrix with 4096 columns (as
before) and 4000 rows. The leftmost cell (first row, first column) of
the new matrix then contains the array

(0,0)
(1,0)
....
(24,0)

The second column of the first row contains the array

(0,1)
(1,1)
....
(24,1)

and so on. The first column of the second row contains

(25,0)
....
(49,0)

That way, I get a new matrix where every cell contains an array of 25
"original" cells. A disadvantage (which I only see now that I write it
down) is that this is awkward for numpy, since it puts arrays instead
of numbers in the matrix positions.
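For what it's worth, this blocking can be had in numpy without per-cell arrays, by viewing the data as a 3-D array (a scaled-down sketch; the real case would be 100,000 x 4096 reshaped to 4000 x 25 x 4096):

```python
import numpy as np

# Scaled-down sketch of the blocking described above:
# here 100 x 8 is viewed as 4 blocks of 25 rows each.
rows, cols, block = 100, 8, 25
a = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)

# View the array as (n_blocks, block, cols) -- no copy for a
# C-contiguous array, so this is cheap even for a memmap:
blocked = a.reshape(rows // block, block, cols)

# The "cell" at (new row k, column j) is the length-25 vector
# blocked[k, :, j]; summing a whole block of 25 original rows
# is a single vectorized call:
block_sums = blocked.sum(axis=1)   # shape (4, 8): one row per block
```

Because reshape returns a view, the usual ndarray methods (sum(), slicing, etc.) keep working, which avoids the "arrays instead of numbers" objection.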

Best regards

Ole
 
