Large data arrays?

Discussion in 'Python' started by Ole Streicher, Apr 23, 2009.

  1. Hi,

    for my application, I need to use quite large data arrays
    (100,000 x 4,000 values) of floating point numbers, and I need fast
    row-wise and column-wise access (main case: return a column with the sum
    over a number of selected rows, and vice versa).

    I would use a numpy array for that, but they seem to be
    memory-resident. One of these arrays would use about 1.6 GB of
    memory, which is far too much. So I was thinking about a memory mapped
    file for that. As far as I understand, there is one in numpy.

    For this, I have two questions:

    1. Are "numpy.memmap" arrays unlimited in size (resp. only limited
    by the maximal file size)? And do they count against the system's memory
    limit (~3 GB for 32-bit systems)?

    2. Since I need row-wise as well as column-wise access, simply using
    a big array as a memory mapped file will probably lead to very poor
    performance, since one of the two access patterns would need to read
    values scattered around the whole file. Are there any "plug and play"
    solutions for that? If not: what would be the best way to solve this
    problem? Probably one needs to use something like the "Morton layout"
    for the data. Would one then build a subclass of memmap (or ndarray?)
    that implements this specific layout? How would one do that? (Sorry, I
    am still a beginner with respect to Python.)
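    (For illustration, a minimal sketch of how numpy.memmap could serve
    this access pattern; the file name and the deliberately small shape
    are stand-ins -- the real array would be 100,000 x 4,000 float32,
    about 1.6 GB on disk:)

```python
import os
import tempfile

import numpy as np

# Hypothetical file name; a deliberately small shape for illustration.
path = os.path.join(tempfile.gettempdir(), "data.dat")
a = np.memmap(path, dtype=np.float32, mode="w+", shape=(100, 40))
a[:] = 1.0   # writes go through the OS page cache, not resident RAM
a.flush()

rows = [3, 7, 11]                  # selected rows
col_sums = a[rows, :].sum(axis=0)  # the column of sums over those rows
```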

    Best regards

    Ole
     
    Ole Streicher, Apr 23, 2009
    #1

  2. Hi Nick,

    Nick Craig-Wood <> writes:
    > mmaps come out of your applications memory space, so out of that 3 GB
    > limit. You don't need that much RAM of course but it does use up
    > address space.


    Hmm. So I have no chance to use >= 2 of these arrays simultaneously?

    > Sorry don't know very much about numpy, but it occurs to me that you
    > could have two copies of your mmapped array, one the transpose of the
    > other which would then speed up the two access patterns enormously.
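    (A sketch of that two-copy scheme -- the file names, shapes, and the
    `put_row` helper are illustrative, not from the thread: each write goes
    into both layouts, so row reads are contiguous in one file and column
    reads are contiguous in the other:)

```python
import os
import tempfile

import numpy as np

shape = (100, 40)   # stand-in for the real (100000, 4000)
tmp = tempfile.gettempdir()
by_rows = np.memmap(os.path.join(tmp, "by_rows.dat"), dtype=np.float32,
                    mode="w+", shape=shape)
by_cols = np.memmap(os.path.join(tmp, "by_cols.dat"), dtype=np.float32,
                    mode="w+", shape=shape[::-1])

def put_row(i, values):
    # Write one row into both layouts: row i is contiguous in by_rows,
    # and each column is contiguous in by_cols.
    by_rows[i, :] = values
    by_cols[:, i] = values

put_row(5, np.arange(40, dtype=np.float32))
```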


    That would be a solution, but it takes twice the amount of address
    space (which already seems to be the limiting factor). In my case (1.6
    GB per array), I could not even use one array.

    Also, I would need to fill two large files at program start: one for
    each orientation (row-wise or column-wise). Depending on the input
    data (which is also either row-wise or column-wise), filling the
    array with the opposite orientation would take a lot of time because
    of these inefficiencies.

    So using both orientations would probably not be a good
    solution. What I found is the "Morton layout", which uses a kind of
    fractal interleaving and sounds not that complicated. But I have no
    idea how to turn it into "numpy" style: can I just extend
    numpy.ndarray (or numpy.memmap), and which functions/methods would
    then need to be overridden? The best would of course be if someone
    had already done this before, so that I could use it without falling
    into all the pitfalls that occur when one implements a very generic
    algorithm.

    Best regards

    Ole
     
    Ole Streicher, Apr 23, 2009
    #2

  3. Ole Streicher

    Aahz Guest

    In article <>,
    Ole Streicher <> wrote:
    >
    >for my application, I need to use quite large data arrays
    >(100,000 x 4,000 values) of floating point numbers, and I need fast
    >row-wise and column-wise access (main case: return a column with the sum
    >over a number of selected rows, and vice versa).
    >
    >I would use a numpy array for that, but they seem to be
    >memory-resident. One of these arrays would use about 1.6 GB of
    >memory, which is far too much. So I was thinking about a memory mapped
    >file for that. As far as I understand, there is one in numpy.


    You probably want to ask on a NumPy or SciPy list:
    http://scipy.org/Mailing_Lists
    --
    Aahz () <*> http://www.pythoncraft.com/

    "If you think it's expensive to hire a professional to do the job, wait
    until you hire an amateur." --Red Adair
     
    Aahz, Apr 23, 2009
    #3
  4. Ole Streicher

    John Machin Guest

    On Apr 23, 8:22 pm, Ole Streicher <> wrote:
    > Hi,
    >
    > for my application, I need to use quite large data arrays
    > (100,000 x 4,000 values) of floating point numbers, and I need fast
    > row-wise and column-wise access (main case: return a column with the sum
    > over a number of selected rows, and vice versa).
    >
    > I would use a numpy array for that, but they seem to be
    > memory-resident. One of these arrays would use about 1.6 GB of
    > memory, which is far too much. So I was thinking about a memory mapped
    > file for that. As far as I understand, there is one in numpy.
    >
    > For this, I have two questions:
    >
    > 1. Are "numpy.memmap" arrays unlimited in size (resp. only limited
    > by the maximal file size)? And do they count against the system's memory
    > limit (~3 GB for 32-bit systems)?
    >
    > 2. Since I need row-wise as well as column-wise access, simply using
    > a big array as a memory mapped file will probably lead to very poor
    > performance, since one of the two access patterns would need to read
    > values scattered around the whole file. Are there any "plug and play"
    > solutions for that? If not: what would be the best way to solve this
    > problem? Probably one needs to use something like the "Morton layout"
    > for the data. Would one then build a subclass of memmap (or ndarray?)
    > that implements this specific layout? How would one do that? (Sorry, I
    > am still a beginner with respect to Python.)


    The Morton layout wastes space if the matrix is not square. Your 100K
    x 4K is very non-square. Looks like you might want to use e.g. 25
    Morton arrays, each 4K x 4K.

    Cheers,
    John
     
    John Machin, Apr 24, 2009
    #4
  5. Hi John,

    John Machin <> writes:
    > The Morton layout wastes space if the matrix is not square. Your 100K
    > x 4K is very non-square. Looks like you might want to use e.g. 25
    > Morton arrays, each 4K x 4K.


    What I found was that the Morton layout should be usable if the shape
    is rectangular and both dimensions are powers of two. But all the
    examples were done with equal dimensions, so I am a bit confused here.

    From my access pattern, it would probably be better to combine 25 rows
    into one slice and have one matrix where every cell contains 25 rows.

    Are there any objections about that?

    Best regards

    Ole
     
    Ole Streicher, Apr 24, 2009
    #5
  6. Hi Nick,

    Nick Craig-Wood <> writes:
    > I'd start by writing a function which took (x, y) in array
    > co-ordinates and transformed that into (z) remapped in the Morton
    > layout.


    This removes the possibility of using the sum() and similar methods of
    numpy. Implementing them myself would probably be much worse than
    using numpy's own.

    > Alternatively you could install a 64bit OS on your machine and use
    > my scheme!


    Well: I am just the developer. Of course I could raise the
    requirements for using my software, but I think it is good style to
    keep them as low as possible.

    Best regards

    Ole
     
    Ole Streicher, Apr 24, 2009
    #6
  7. Ole Streicher

    John Machin Guest

    On Apr 24, 5:17 pm, Ole Streicher <> wrote:
    > Hi John,
    >
    > John Machin <> writes:
    > > The Morton layout wastes space if the matrix is not square. Your 100K
    > > x 4K is very non-square. Looks like you might want to use e.g. 25
    > > Morton arrays, each 4K x 4K.

    >
    > What I found was that the Morton layout should be usable if the shape
    > is rectangular and both dimensions are powers of two. But all the
    > examples were done with equal dimensions, so I am a bit confused here.


    Yes, you are right, it can be done in one hit with a rectangular
    array. How it is done: in your case you have a 2**17 x 2**12 array, so
    the Morton index corresponding to (i, j) would have the top 5 bits of
    i followed by the remaining 12 bits of i interleaved with the 12 bits
    of j -- scarcely distinguishable from my original suggestion of 25
    4Kx4K arrays, once you've ignored the trailing approx 2**17 - 100,000
    rows that you don't need to allocate space for ;-)
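    (That indexing scheme, sketched as a pair of functions -- the helper
    names are not from the thread:)

```python
def part_bits(v, n):
    # Spread the low n bits of v so that consecutive bits land two
    # positions apart (bit k of v moves to bit 2*k of the result).
    out = 0
    for k in range(n):
        out |= ((v >> k) & 1) << (2 * k)
    return out

def morton_index(i, j):
    # Morton index for a 2**17 x 2**12 array: the top 5 bits of i form a
    # plain prefix; the low 12 bits of i are interleaved with the 12 bits
    # of j (i bits in the odd positions, j bits in the even positions).
    hi, lo = i >> 12, i & 0xFFF
    return (hi << 24) | (part_bits(lo, 12) << 1) | part_bits(j, 12)
```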

    >
    > From my access pattern, it would be probably better to combine 25 rows
    > into one slice and have one matrix where every cell contains 25 rows.
    >
    > Are there any objections about that?


    Can't object, because I'm not sure what you mean ... how many elements
    in a "cell"?

    Cheers,
    John
     
    John Machin, Apr 24, 2009
    #7
  8. Hi John,

    John Machin <> writes:
    >> From my access pattern, it would be probably better to combine 25 rows
    >> into one slice and have one matrix where every cell contains 25 rows.
    >> Are there any objections about that?

    > Can't object, because I'm not sure what you mean ... how many elements
    > in a "cell"?


    Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".

    Regards

    Ole
     
    Ole Streicher, Apr 24, 2009
    #8
  9. Ole Streicher

    John Machin Guest

    On Apr 25, 1:14 am, Ole Streicher <> wrote:
    > Hi John,
    >
    > John Machin <> writes:
    > >> From my access pattern, it would be probably better to combine 25 rows
    > >> into one slice and have one matrix where every cell contains 25 rows.
    > >> Are there any objections about that?

    > > Can't object, because I'm not sure what you mean ... how many elements
    > > in a "cell"?

    >
    > Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".


    Yes yes but you said "every cell contains 25 rows" ... what's in a
    cell? 25 rows, with each row containing what?
     
    John Machin, Apr 24, 2009
    #9
  10. Hi John

    John Machin <> writes:
    > On Apr 25, 1:14 am, Ole Streicher <> wrote:
    >> John Machin <> writes:
    >> >> From my access pattern, it would be probably better to combine 25 rows
    >> >> into one slice and have one matrix where every cell contains 25 rows.
    >> >> Are there any objections about that?
    >> > Can't object, because I'm not sure what you mean ... how many elements
    >> > in a "cell"?

    >>
    >> Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".

    >
    > Yes yes but you said "every cell contains 25 rows" ... what's in a
    > cell? 25 rows, with each row containing what?


    I mean: original cells.
    I have 100,000 x 4096 entries:

    (0,0) (0,1) ... (0,4095)
    (1,0) (1,1) ... (1,4095)
    ...
    (99999,0) (99999,1) ... (99999,4095)

    This will be re-organized into a new matrix, containing 4096 columns
    (as before) and 4000 rows. The leftmost cell (first row, first column)
    in the new matrix then contains the array

    (0,0)
    (1,0)
    ....
    (24,0)

    The second column of the first row contains the array

    (0,1)
    (1,1)
    ....
    (24,1)

    and so on. The first column of the second row contains

    (25,0)
    ....
    (49,0)

    That way, I get a new matrix where every cell contains an array of 25
    "original" cells. The disadvantage (which I see now that I write it
    down) is that this is bad for numpy, since it deals with arrays
    instead of numbers at matrix positions.
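    (A reshape-based sketch of this layout with a small stand-in shape --
    the real one would be 4000 x 25 x 4096; numpy can express the "arrays
    in cells" idea as a third axis, though whether that helps with the
    memmap locality question is a separate matter:)

```python
import numpy as np

a = np.arange(100 * 8, dtype=np.float32).reshape(100, 8)  # stand-in for 100000 x 4096
blocks = a.reshape(4, 25, 8)   # blocks[r] holds original rows 25r .. 25r+24

# Column sums over block r = 2, i.e. over original rows 50..74:
s = blocks[2].sum(axis=0)
```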

    Best regards

    Ole
     
    Ole Streicher, Apr 24, 2009
    #10
