Looking for suggestions on improving numpy code

David Lees

I am starting to use numpy and have written a hack for reading in a
large data set that has 8 columns and millions of rows. I want to read
and process a single column. I have written the very ugly hack below,
but am sure there is a more efficient and pythonic way to do this. The
file is too big to read by brute force and select a column, so it is
read in chunks and the column selected. Things I don't like in the code:
1. Performing a transpose on a large array
2. Uncertainty about numpy append efficiency

Is there a way to directly read every n'th element from the file into an
array?

david


from numpy import *
from scipy.io.numpyio import fread

fd = open('testcase.bin', 'rb')
datatype = 'h'   # 16-bit signed integers
byteswap = 0
M = 1000000      # rows per chunk
N = 8            # columns per row
size = M*N
shape = (M,N)
colNum = 2
sf = 1.645278e-04*10
z = array([])
for i in xrange(50):
    # read one chunk of M rows, then pull out the wanted column
    data = fread(fd, size, datatype, datatype, byteswap)
    data = data.reshape(shape)
    data = data.transpose()
    z = append(z, data[colNum]*sf)

print z.mean()

fd.close()
 
Robert Kern

David said:
I am starting to use numpy and have written a hack for reading in a
large data set that has 8 columns and millions of rows. I want to read
and process a single column. I have written the very ugly hack below,
but am sure there is a more efficient and pythonic way to do this. The
file is too big to read by brute force and select a column, so it is
read in chunks and the column selected. Things I don't like in the code:
1. Performing a transpose on a large array

Transposition is trivially fast in numpy. It does not copy any memory.
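
A minimal sketch of the point above, using a small made-up array in place of one chunk of the real data: the transpose shares its buffer with the original and only swaps shape and strides. It also shows that slicing the column directly gives the same values without transposing at all.

```python
import numpy as np

# Small stand-in for one chunk: 5 rows, 8 columns of int16 (illustrative only).
data = np.arange(40, dtype=np.int16).reshape(5, 8)
t = data.transpose()

# The transpose is a view: it shares memory with the original array.
shares = np.shares_memory(data, t)

# Only the shape and strides differ between the two views.
shape_t, strides_t = t.shape, t.strides

# Slicing the column directly is equivalent to transposing and indexing a row.
same = np.array_equal(data[:, 2], t[2])
```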

David said:
2. Uncertainty about numpy append efficiency

Rest assured that it's slow. Appending to lists is fast since lists preallocate
memory according to a scheme such that the amortized cost of appending elements
is O(1). We don't quite have that luxury in numpy.
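
One common way around this, sketched here under assumptions rather than taken from the thread, is to collect the per-chunk columns in a Python list and call numpy.concatenate once at the end. numpy.append allocates a new array and copies the old contents on every call, while the list only stores references. The file name, chunk sizes and the use of numpy.fromfile in place of fread are illustrative choices.

```python
import os
import tempfile
import numpy as np

N = 8             # columns per row
ROWS = 1000       # rows per chunk (kept small for illustration)
CHUNKS = 5
COL = 2
SF = 1.645278e-04 * 10

# Write a small stand-in for testcase.bin (hypothetical file name).
rng = np.random.default_rng(0)
full = rng.integers(-1000, 1000, size=(ROWS * CHUNKS, N), dtype=np.int16)
path = os.path.join(tempfile.mkdtemp(), "testcase_small.bin")
full.tofile(path)

parts = []  # appending to a list is amortized O(1)
with open(path, "rb") as fd:
    for _ in range(CHUNKS):
        chunk = np.fromfile(fd, dtype=np.int16, count=ROWS * N)
        # Copy just the wanted column so the rest of the chunk can be freed.
        parts.append(chunk.reshape(-1, N)[:, COL].copy())

# One allocation and one copy at the end, instead of one per chunk.
z = np.concatenate(parts) * SF
```

If the total number of rows is known in advance, preallocating the result with numpy.empty and filling slices of it avoids even the final concatenation.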

David said:
Is there a way to directly read every n'th element from the file into an
array?

Since this is a regular binary file, you can memory map the file.


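import numpy

M = 1000000
N = 8
column = 2
sf = 1.645278e-04*10

# mode='r' maps the file read-only (the default mode, 'r+', needs write access).
m = numpy.memmap('testcase.bin', dtype=numpy.int16, mode='r', shape=(M,N))
# Select one column and scale it.
z = m[:,column] * sf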
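
A self-contained sketch of the same idea on a small made-up file, with a few assumptions that are not in the thread: the row count is derived from the file size so the whole file is mapped, the mean is accumulated block by block so only one block of the column is held in memory at a time, and a flat mapping with a strided slice is shown as a literal "every n'th element" read. The file name and block size are hypothetical.

```python
import os
import tempfile
import numpy as np

N = 8
COL = 2
SF = 1.645278e-04 * 10

# Small stand-in file (hypothetical name and size).
rows = 5000
full = (np.arange(rows * N) % 2000 - 1000).astype(np.int16).reshape(rows, N)
path = os.path.join(tempfile.mkdtemp(), "testcase_small.bin")
full.tofile(path)

# Derive the number of rows from the file size instead of hard-coding it.
n_rows = os.path.getsize(path) // (N * np.dtype(np.int16).itemsize)

# Read-only mapping of the whole file as a 2-D array.
m = np.memmap(path, dtype=np.int16, mode="r", shape=(n_rows, N))

# Accumulate the column sum block by block, in float64 to avoid int16 overflow.
BLOCK = 1024
total = 0.0
for start in range(0, n_rows, BLOCK):
    total += float(m[start:start + BLOCK, COL].sum(dtype=np.float64))
mean = total / n_rows * SF

# Equivalent view of the question: map the file flat and take every N'th element.
flat = np.memmap(path, dtype=np.int16, mode="r")
every_nth = flat[COL::N]
```

Scaling by sf after taking the mean gives the same result as scaling every element first, and avoids building a full float64 copy of the column.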


You may want to ask future numpy questions on the numpy mailing list.

http://www.scipy.org/Mailing_Lists

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco