Skipping bytes while reading a binary file?

L

Lionel

Hello,
I have data stored in binary files. Some of these files are
huge... upwards of 2 gigs or more. They consist of 32-bit float complex
numbers, where the first 32 bits of the file are the real component, the
second 32 bits are the imaginary component, the third 32 bits are the
real component of the second number, etc.

I'd like to be able to read in just the real components and load them
into a numpy.ndarray, then load the imaginary components into another
numpy.ndarray. I need the real and imaginary components stored in
separate arrays; they cannot be in a single array of complex numbers
except temporarily. I'm trying to avoid temporary storage, though,
because of the size of the files.

I'm currently reading the file scanline-by-scanline to extract rows of
complex numbers, which I then loop over and load into the real/imaginary
arrays as follows:


        self._realData      = numpy.empty((Rows, Columns), dtype=numpy.float32)
        self._imaginaryData = numpy.empty((Rows, Columns), dtype=numpy.float32)

        for CurrentRow in range(Rows):

            # array.fromfile() appends, so start from an empty array on
            # each row; reusing one array would keep handing back row 0.
            floatData = array.array('f')
            floatData.fromfile(DataFH, Columns * 2)

            position = 0
            for CurrentColumn in range(Columns):

                self._realData[CurrentRow, CurrentColumn]      = floatData[position]
                self._imaginaryData[CurrentRow, CurrentColumn] = floatData[position + 1]
                position = position + 2


The above code works but is much too slow. If I comment out the body
of the "for CurrentColumn in range(Columns)" loop, the performance is
perfectly adequate, i.e. the function-call overhead associated with the
"fromfile(...)" call is not bad at all. What seems to be most
time-consuming are the simple assignment statements in the
"CurrentColumn" for-loop.

Does anyone see any ways of speeding this up at all? Reading
everything into a complex64 ndarray in one fell swoop would certainly
be easier and faster, but at some point I'll need to split this array
into two parts (real / imaginary). I'd like to have that done
initially to keep the memory usage down since the files are so
ginormous.

Psyco is out because I need 64 bits, and I didn't see anything on the
forums regarding a method that reads in every other 32-bit chunk from
a file into an array. I'm not sure what else to try.
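
For reference, the "one fell swoop" approach would look something like
this (just a sketch, with DataFH, Rows and Columns as in the code
above); the complex64 temporary is exactly what I'd like to avoid:

        import numpy

        # Read the interleaved file as complex64 in one shot...
        complexData = numpy.fromfile(DataFH, dtype=numpy.complex64,
                                     count=Rows * Columns).reshape(Rows, Columns)

        # ...then copy the components out and drop the temporary. The
        # .copy() calls are needed because .real and .imag are only
        # views into complexData.
        realData      = complexData.real.copy()
        imaginaryData = complexData.imag.copy()
        del complexData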

Thanks in advance.
L
 
K

Krzysztof Retel

Lionel wrote:
 > Hello,
 > I have data stored in binary files. Some of these files are
 > huge... upwards of 2 gigs or more. They consist of 32-bit float
 > complex numbers [...]
 >
 > What seems to be most time-consuming are the simple assignment
 > statements in the "CurrentColumn" for-loop.
[snip]
Try array slicing. floatData[0::2] will return the real parts and
floatData[1::2] will return the imaginary parts. You'll have to read up
on how to assign to a slice of the numpy array (it might be
"self._realData[CurrentRow] = real_parts" or
"self._realData[CurrentRow, :] = real_parts").

BTW, it's not the function-call overhead of fromfile() which takes the
time, but actually reading the data from the file.
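
Untested, but something along these lines inside your row loop (same
DataFH/Columns setup as in your code):

    floatData = array.array('f')
    floatData.fromfile(DataFH, Columns * 2)

    # Even indices hold the real parts, odd indices the imaginary parts.
    self._realData[CurrentRow, :]      = floatData[0::2]
    self._imaginaryData[CurrentRow, :] = floatData[1::2]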
L

Lionel

Very nice! I like that! I'll post the improvement (if any).
Okay, the following:
            self._realData[CurrentRow]      = floatData[0::2]
            self._imaginaryData[CurrentRow] = floatData[1::2]
gives a 3.5x improvement in execution speed over the original that I
posted. That's much better. Thank you for the suggestion.

Correction: improvement is around 7-8x.

I had similar issues while slicing network packets (TCP/UDP) in real
time. I was using 're' and found it a lot more time- and
resource-consuming than 'normal' string slicing, as suggested by MRAB.
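
For illustration, the kind of thing I mean (a simplified, made-up
UDP-style header; struct plus plain slicing instead of a regex):

    import struct

    packet = '\x12\x34\x00\x35\x00\x1c\xab\xcd'
    src_port = struct.unpack('!H', packet[0:2])[0]  # bytes 0-1
    dst_port = struct.unpack('!H', packet[2:4])[0]  # bytes 2-3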

K
 
S

Slaunger

You might also want to have a look at a numpy memmap viewed as a
recarray.

from numpy import dtype, memmap, recarray

# Define your record in the file: 4 bytes for the real value,
# and 4 bytes for the imaginary (assuming little-endian repr).
descriptor = dtype([("r", "<f4"), ("i", "<f4")])

# Now typecast a memmap (a memory-efficient array) onto the file in
# read-only mode. Viewing it as a recarray means the attributes
# data.r and data.i are accessible as ordinary numpy arrays.
data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
print "First 100 real values:", data.r[:100]

--Slaunger
 
