Numpy Performance

timlash

Still fairly new to Python. I wrote a program that used a class
called RectangularArray as described here:

class RectangularArray:
    def __init__(self, rows, cols, value=0):
        self.arr = [None] * rows   # one slot per row; rows start out unallocated
        self.row = [value] * cols  # shared template row of default values
    def __getitem__(self, (i, j)):
        # fall back to the template if row i has never been written
        return (self.arr[i] or self.row)[j]
    def __setitem__(self, (i, j), value):
        if self.arr[i] is None:
            self.arr[i] = self.row[:]  # copy the template on first write to row i
        self.arr[i][j] = value
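
A quick usage sketch (dimensions made up) showing the lazy row allocation:

a = RectangularArray(1000, 50)   # 1000 rows x 50 cols, default value 0
print a[3, 7]                    # 0 -- row 3 still points at the shared template
a[3, 7] = 42                     # first write copies the template row
print a[3, 7]                    # 42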

This class was found in a 14-year-old post:
http://www.python.org/search/hypermail/python-recent/0106.html

This worked great and let me process a few hundred thousand data
points with relative ease. However, I soon wanted to start sorting
arbitrary portions of my arrays and to transpose others. I turned to
Numpy rather than reinventing the wheel with custom methods within the
serviceable RectangularArray class. However, once I refactored with
Numpy I was surprised to find that the execution time for my program
doubled! I expected a purpose-built array module to be more efficient,
not less.
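
(For concreteness, the kind of numpy calls meant here; the array contents are invented for illustration:)

import numpy as np

a = np.arange(12).reshape(3, 4)
a[1:3] = np.sort(a[1:3], axis=1)   # sort just a block of rows, in place
b = a.T                            # transpose is a cheap view, not a copy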

I'm not doing any linear algebra with my data. I'm working with
rectangular datasets, evaluating individual rows, grouping, sorting
and summarizing various subsets of rows.

Is a Numpy implementation overkill for my data handling uses? Should
I evaluate prior array modules such as Numeric or Numarray? Are there
any other modules suited to handling tabular data? Would I be best
off expanding the RectangularArray class for the few data
transformation methods I need?

Any guidance or suggestions would be greatly appreciated!

Cheers,

Tim
 

Peter Otten

timlash said:
Still fairly new to Python. I wrote a program that used a class
called RectangularArray as described here:

class RectangularArray:
    def __init__(self, rows, cols, value=0):
        self.arr = [None] * rows   # one slot per row; rows start out unallocated
        self.row = [value] * cols  # shared template row of default values
    def __getitem__(self, (i, j)):
        # fall back to the template if row i has never been written
        return (self.arr[i] or self.row)[j]
    def __setitem__(self, (i, j), value):
        if self.arr[i] is None:
            self.arr[i] = self.row[:]  # copy the template on first write to row i
        self.arr[i][j] = value

This class was found in a 14-year-old post:
http://www.python.org/search/hypermail/python-recent/0106.html

This worked great and let me process a few hundred thousand data
points with relative ease. However, I soon wanted to start sorting
arbitrary portions of my arrays and to transpose others. I turned to
Numpy rather than reinventing the wheel with custom methods within the
serviceable RectangularArray class. However, once I refactored with
Numpy I was surprised to find that the execution time for my program
doubled! I expected a purpose-built array module to be more efficient,
not less.

I'm not doing any linear algebra with my data. I'm working with
rectangular datasets, evaluating individual rows, grouping, sorting
and summarizing various subsets of rows.

Is a Numpy implementation overkill for my data handling uses? Should
I evaluate prior array modules such as Numeric or Numarray? Are there
any other modules suited to handling tabular data? Would I be best
off expanding the RectangularArray class for the few data
transformation methods I need?

Any guidance or suggestions would be greatly appreciated!


Do you have many rows with zeros? That might be the reason why your
self-made approach shows better performance.

Googling for "numpy sparse" finds:

http://www.scipy.org/SciPy_Tutorial

Maybe one of the sparse matrix implementations in scipy works for you.
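
For instance, a minimal sketch, assuming scipy is installed; lil_matrix is
one of the formats there that supports cheap element assignment:

from scipy import sparse

m = sparse.lil_matrix((1000, 1000))   # all-zero matrix; only nonzeros are stored
m[3, 7] = 42                          # lil_matrix allows element-wise assignment
print m.nnz                           # 1 -- a single stored element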

Peter
 

Robert Kern

timlash said:
Still fairly new to Python. I wrote a program that used a class
called RectangularArray as described here:

class RectangularArray:
    def __init__(self, rows, cols, value=0):
        self.arr = [None] * rows   # one slot per row; rows start out unallocated
        self.row = [value] * cols  # shared template row of default values
    def __getitem__(self, (i, j)):
        # fall back to the template if row i has never been written
        return (self.arr[i] or self.row)[j]
    def __setitem__(self, (i, j), value):
        if self.arr[i] is None:
            self.arr[i] = self.row[:]  # copy the template on first write to row i
        self.arr[i][j] = value

This class was found in a 14-year-old post:
http://www.python.org/search/hypermail/python-recent/0106.html

This worked great and let me process a few hundred thousand data
points with relative ease. However, I soon wanted to start sorting
arbitrary portions of my arrays and to transpose others. I turned to
Numpy rather than reinventing the wheel with custom methods within the
serviceable RectangularArray class. However, once I refactored with
Numpy I was surprised to find that the execution time for my program
doubled! I expected a purpose-built array module to be more efficient,
not less.


It depends on how much you refactored your code. numpy tries to optimize bulk
operations. If you are doing a lot of __getitem__s and __setitem__s on
individual elements, as you would with RectangularArray, numpy is going to do
a lot of extra work creating and deleting the scalar objects it hands back.
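
A contrived illustration of the difference (my example, not from the thread):
the loop below pays the scalar-boxing cost on every element, while the
one-liner does the same work in a single call that loops in C:

import numpy as np

a = np.arange(1000000, dtype=float)

# Element-wise: every a[i] creates and destroys a numpy scalar object
total = 0.0
for i in xrange(len(a)):
    total += a[i]

# Bulk: one call, the loop runs in C
total = a.sum()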

timlash said:
I'm not doing any linear algebra with my data. I'm working with
rectangular datasets, evaluating individual rows, grouping, sorting
and summarizing various subsets of rows.

Is a Numpy implementation overkill for my data handling uses? Should
I evaluate prior array modules such as Numeric or Numarray?

No.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

timlash

Thanks for your replies.

@Peter - My arrays are not sparse at all, but I'll take a quick look
at scipy. I also should have mentioned that my numpy arrays are of
object dtype, as each data point (row) carries one or more text labels
for categorization.
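
(Worth noting, with an invented example: object-dtype arrays store pointers
to Python objects, so even bulk operations on them run at Python speed rather
than C speed:)

import numpy as np

labels = np.array(['red', 'green', 'red', 'blue'], dtype=object)
scores = np.array([1.0, 2.5, 0.5, 3.0])   # native float64

print scores.sum()       # loops in C
print labels == 'red'    # compares element by element at Python speed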

@Robert - Thanks for the comments about how numpy was optimized for
bulk transactions. Most of the processing I'm doing is with
individual elements.

Essentially, I'm testing tens of thousands of scenarios on a
relatively small number of test cases. Each scenario requires all
elements of each test case to be scored, then summarized, then sorted
and grouped with some top scores captured for reporting.

It seems like I can either work toward indexed categorization, so that
my arrays are of integer type and each scenario can be handled in bulk
numpy fashion, or expand RectangularArray with the custom data handling
methods I need.
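
A rough sketch of the first option (the labels, scores, and grouping rule are
invented for illustration): np.unique can map the text labels to integer
codes, after which summarizing by group runs in bulk:

import numpy as np

labels = np.array(['red', 'green', 'red', 'blue', 'green', 'red'])
scores = np.array([1.0, 2.5, 0.5, 3.0, 1.5, 2.0])

# Map each text label to a small integer code
cats, codes = np.unique(labels, return_inverse=True)

# Per-category totals in one bulk call
sums = np.bincount(codes, weights=scores)
for cat, s in zip(cats, sums):
    print cat, s    # blue 3.0, green 4.0, red 3.5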

Any other recommended approaches to working with tabular data in
Python?

Cheers,

Tim
 

Robert Kern

timlash said:
Essentially, I'm testing tens of thousands of scenarios on a
relatively small number of test cases. Each scenario requires all
elements of each test case to be scored, then summarized, then sorted
and grouped with some top scores captured for reporting.

It seems like I can either work toward a procedure that features
indexed categorization so that my arrays are of integer type and a
design that will allow each scenario to be handled in bulk numpy
fashion, or expand RectangularArray with custom data handling methods.

If you posted a small, self-contained example of what you are doing to
numpy-discussion, the denizens there will probably be able to help you formulate
the right way to do this in numpy, if such a way exists.

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
