Which sparse matrix package?


Martin Manns

Hi:

I am writing a spreadsheet application in Python

http://pyspread.sf.net

which currently uses numpy.array for:

+ storing object references
(each array element corresponds to one grid cell)
+ slicing (read and write)
+ mapping from/to smaller numpy.array
+ searching and replacing
+ growing and shrinking
+ transpose operation.

While fast and stable, this is really memory-inefficient for large grids
(i.e. larger than 1E7 rows and columns), so I am looking into dicts
and sparse matrices.

The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.
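For illustration, a minimal sketch of that layout and of a slice done with
a list comprehension (the axis order row, column, table is an assumption
here); every such slice has to scan all keys, which is why it degrades as
the dict grows:

grid = {(1, 2, 3): "2323", (1, 2, 545): "2324234"}

# "slice": all cells of table 3 with row < 10
selection = [(key, value) for key, value in grid.items()
             if key[2] == 3 and key[0] < 10]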

Therefore, I looked into sparse matrices and found scipy.sparse and
pysparse. I tried out lil_matrix from the former and ll_mat from the
latter. (I wrote a wrapper that turns them into Python object arrays.)

scipy.sparse.lil_matrix allowed __getitem__ slicing only along one of the
dimensions and used a lot of memory when the number of columns was
increased above 1E7.

pysparse.spmatrix.ll_mat was faster, used less space and allowed slicing
along both dimensions. However, its methods are not well documented, and
I am not able to compile it on Debian testing due to some g77
dependencies. Even though the deb package works well, I am concerned
about depending on a problematic package.


Now my questions:

Is there a better-suited / better-maintained module for such sparse
matrices (or multi-dimensional arrays)?

Should I use another type of matrix in scipy.sparse? If yes which?

Does a different data-structure suit my above-stated needs better?

Best Regards

Martin
 

James Mills

Hi:
Hi,

I am writing a spreadsheet application in Python

What's wrong with pyspread?

[ ... snip ... ]
The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.

Let me get this straight.
It's taking 0.5 s to slice your matrix
of 1E7 (10,000,000) rows/columns.

Are you mad? That is TEN million, and you
require it to be faster than 0.5 s?

Am I missing something here?

cheers
James
 

Robert Kern

James said:
What's wrong with pyspread ?

pyspread *is* the spreadsheet application he is writing.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

MRAB

James said:
Hi:
Hi,

I am writing a spreadsheet application in Python

What's wrong with pyspread ?

[ ... snip ... ]
The dict that I tried out is of the type:

{(1,2,3): "2323", (1,2,545): "2324234", ... }

It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?
No, 1E6, or ONLY 1 million. :)
Am I missing something here ?
Would a matrix of matrices work better, i.e. a supermatrix where each
element is either a submatrix of cells or None if the submatrix is
empty? How much memory it would save would depend on how sparse the
matrix is.
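A minimal sketch of that idea, assuming square blocks of BLOCK x BLOCK
cells kept in a dict keyed by block coordinates; blocks that were never
written to simply have no entry, which is where the memory saving comes
from. The block size and the function names are illustrative only:

import numpy as np

BLOCK = 128
blocks = {}  # (block_row, block_col) -> BLOCK x BLOCK object array

def set_cell(row, col, value):
    key = (row // BLOCK, col // BLOCK)
    block = blocks.get(key)
    if block is None:
        block = np.empty((BLOCK, BLOCK), dtype=object)  # new submatrix
        blocks[key] = block
    block[row % BLOCK, col % BLOCK] = value

def get_cell(row, col):
    block = blocks.get((row // BLOCK, col // BLOCK))
    return None if block is None else block[row % BLOCK, col % BLOCK]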
 

eric

What's wrong with pyspread ?
[ ... snip ... ]
The dict that I tried out is of the type:
{(1,2,3): "2323", (1,2,545): "2324234", ... }
It is too slow for my application when it grows. One slicing operation
with list comprehensions takes about 1/2 s on my computer for 1E6
elements.
Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)
Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

No, 1E6, or ONLY 1 million. :)

What does Internet Explorer 6 & 7 have to do with that? ;-)
Would a matrix of matrices work better, ie a supermatrix where each
element is either a submatrix of cells or None if the submatrix is
empty? How much memory it would save would depend on how sparse the
matrix is.

Sparse matrix efficiency is all about guessing the probability for a
cell to be empty. My guess is that the probability for a cell (i,j) to
be empty grows with norm(i,j).

So you should store your data in one list and the indexes in another
"list", then use a B-tree-like algorithm, or keep your data ordered by
growing norm(i,j).

Something like:

from bisect import bisect_left

data = []    # cell values
kernel = []  # couples (i, j), kept sorted so lookups can use bisection

def add(i, j, value):
    k = bisect_left(kernel, (i, j))   # the "right place" that keeps it sorted
    kernel.insert(k, (i, j))
    data.insert(k, value)

def kernel_of(i, j):
    # fast dichotomy (binary search) over the sorted (i, j) couples
    k = bisect_left(kernel, (i, j))
    if k < len(kernel) and kernel[k] == (i, j):
        return k
    return None

def get(i, j):
    k = kernel_of(i, j)
    return data[k] if k is not None else None

def set(i, j, value):
    if is_probably_empty(i, j):   # the Bloom filter test described below
        add(i, j, value)
    else:
        k = kernel_of(i, j)       # maybe present: overwrite, or add on a false positive
        if k is not None:
            data[k] = value
        else:
            add(i, j, value)

where the kernel_of method can take advantage of the sorted order
(i*j or max(i,j) should be just as good) to do a fast dichotomy (binary
search).

The set method is about finding out whether the cell is empty or not,
and then performing either an index lookup or an add.
To know efficiently whether a cell is empty or not (the
is_probably_empty method), try using a Bloom filter with
i*j % m as the hash function (for instance).
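A minimal sketch of that is_probably_empty idea, assuming a single hash
(i*j % m, as suggested) over a plain bit array; a real Bloom filter would
use several independent hash functions, and the filter size m and the
_mark helper are made up for the example. add() would also have to set
the bit via _mark():

m = 1 << 20                # number of bits in the filter (assumed size)
bits = bytearray(m // 8)

def _mark(i, j):
    # set the bit for cell (i, j); call this from add()
    h = (i * j) % m
    bits[h // 8] |= 1 << (h % 8)

def is_probably_empty(i, j):
    # True means the cell is definitely empty; False means it is
    # *probably* occupied (false positives happen, so fall back to kernel_of)
    h = (i * j) % m
    return not (bits[h // 8] & (1 << (h % 8)))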

Here the arrays have the size of your actual data (maybe thousands of
cells for a huge spreadsheet), which is MUCH better than the size of
maxN*maxM.

If my memory is correct, there is room in numpy for plugging in your own
sparse matrix implementation, so you could give that a try.
 

Robert Kern

Martin said:
Should I use another type of matrix in scipy.sparse? If yes which?

If you have a benchmark, you might just want to try all of them. Should be just
a matter of a small script. Block Sparse Row (bsr_matrix) might be the most
appropriate in terms of data structure, but it appears that a bunch of stuff
important for your use case is unimplemented.
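A rough sketch of such a script; the sizes and the row-slice operation
are just placeholders for whatever your benchmark actually measures.
It builds one random pattern, converts it to each scipy.sparse format,
and times a row slice; formats that don't implement the operation will
simply raise, which is useful information in itself:

import time
import numpy as np
import scipy.sparse as sp

n, nnz = 10**4, 10**5
rows = np.random.randint(0, n, nnz)
cols = np.random.randint(0, n, nnz)
vals = np.random.rand(nnz)
base = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))

for fmt in ("lil", "dok", "csr", "csc", "bsr"):
    m = base.asformat(fmt)
    start = time.time()
    try:
        m[n // 2, :]                      # one row slice
        print(fmt, "row slice: %.4f s" % (time.time() - start))
    except Exception as e:                # format does not implement it
        print(fmt, "row slice failed:", e)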
Does a different data-structure suit my above-stated needs better?

Something similar to a quadtree might be more suitable. That should let you do
queries along both dimensions easily. You probably don't want exactly a
quadtree, though; it's good for scattered points, but probably not for largish
blocks of points on a precise grid, which you will probably see frequently in a
spreadsheet application.
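For what it's worth, a very small sketch of a point quadtree over
scattered cells, just to show how a query along both dimensions would
look; the node capacity, the half-open bounds, and the method names are
illustrative, not a worked-out design:

class QuadTree:
    CAPACITY = 8   # max points per leaf before splitting (arbitrary)

    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)  # half-open region [x0, x1) x [y0, y1)
        self.points = {}                # (x, y) -> value, while still a leaf
        self.children = None

    def insert(self, x, y, value):
        if self.children is not None:
            self._child(x, y).insert(x, y, value)
            return
        self.points[(x, y)] = value
        if len(self.points) > self.CAPACITY:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
        self.children = [QuadTree(x0, y0, xm, ym), QuadTree(xm, y0, x1, ym),
                         QuadTree(x0, ym, xm, y1), QuadTree(xm, ym, x1, y1)]
        points, self.points = self.points, {}
        for (x, y), value in points.items():
            self._child(x, y).insert(x, y, value)

    def _child(self, x, y):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
        return self.children[(1 if x >= xm else 0) + (2 if y >= ym else 0)]

    def query(self, qx0, qy0, qx1, qy1):
        # return all (key, value) pairs inside [qx0, qx1) x [qy0, qy1)
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []
        if self.children is None:
            return [(p, v) for p, v in self.points.items()
                    if qx0 <= p[0] < qx1 and qy0 <= p[1] < qy1]
        result = []
        for child in self.children:
            result.extend(child.query(qx0, qy0, qx1, qy1))
        return result

grid = QuadTree(0, 0, 10**7, 10**7)
grid.insert(3, 5, "2323")
print(grid.query(0, 0, 100, 100))   # -> [((3, 5), '2323')]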

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Guest

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

My benchmark is as follows:

1) Each of the numbers in the 3-tuple is in the range [0, 1E7).
2) There are 1 000 000 elements in the dict (randomly distributed).
3) The content string is a random number in the range [0, 1E10) that is
cast into a string.
4) Measure the time that retrieving all elements in a 10000x100x10 cube
requires (sketched below).
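A sketch of this benchmark, with some assumed details (e.g. that the cube
sits at the origin, i.e. rows [0, 10000) x columns [0, 100) x tables
[0, 10), and that "retrieving" means collecting the matching values with
a list comprehension):

import random
import time

grid = {}
while len(grid) < 10**6:
    key = (random.randrange(10**7), random.randrange(10**7),
           random.randrange(10**7))
    grid[key] = str(random.randrange(10**10))

start = time.time()
cube = [value for (row, col, tab), value in grid.items()
        if row < 10000 and col < 100 and tab < 10]
print("cube query: %.3f s" % (time.time() - start))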
Are you mad ? This is TEN Millions and you
required it faster than 0.5s ?

Think about how often a spreadsheet re-calculates cells.
Furthermore, people tend to copy&paste cells.

I had a small example (a Wagner-Whitin algorithm demo)
with hundreds of slicing operations on one screen. I believe that
no one wants to wait for 1/2 s times 500 == 250 s each time the
spreadsheet is updated.


Martin
 

Guest

If you have a benchmark, you might just want to try all of them.
Should be just a matter of a small script. Block Sparse Row
(bsr_matrix) might be the most appropriate in terms of data
structure, but it appears that a bunch of stuff important for your
use case is unimplemented.

This was my first impulse until I saw that the syntax and the actually
implemented methods vary considerably inside scipy.sparse (different
authors?).

Before I code around all of that, I would like to see which
functionality is really provided by which matrix class.

Do you know where to find an overview (some docs)? Currently, I stumble
upon unimplemented things in my unit tests and only figure out what is
going on after reading the sparse code :-(


Martin
 

James Mills

Let me get this straight.
It's taking 0.5s to slice your matrix
of 1E7 (10000000.0 rows/columns)

My benchmark is as follows:

1) Each of the numbers in the 3-tuple is in the range [0, 1E7).
2) There are 1 000 000 elements in the dict (randomly distributed).
3) The content string is a random number in the range [0, 1E10) that is
casted into a string.
4) Measure the time that retrieving all elements in a 10000x100x10 cube
requires.

Does a spreadsheet really get this complex?
Think about how often a spreadsheet re-calculates cells.
Furthermore, people tend to copy&paste cells.

I had some small example (Wagner-Whitin algorithm demo)
with hundreds of slicing operations on one screen. I believe that
no-one wants to wait for 1/2 s times 500 == 250 s each time the
spreadsheet is updated.

Hmmm
 
