Skipping bytes while reading a binary file?

L

Lionel

Hello,
I have data stored in binary files. Some of these files are
huge... upwards of 2 gigs or more. They consist of 32-bit float complex
numbers, where the first 32 bits of the file are the real component, the
second 32 bits are the imaginary component, the third 32 bits are the
real component of the second number, etc.

I'd like to be able to read in just the real components and load them
into a numpy.ndarray, then load the imaginary components into another
numpy.ndarray. I need the real and imaginary components stored in
separate arrays; they cannot be in a single array of complex numbers
except temporarily. I'm trying to avoid temporary storage, though,
because of the size of the files.

I'm currently reading the file scanline-by-scanline to extract rows of
complex numbers, which I then loop over and load into the real/imaginary
arrays as follows:


        self._realData      = numpy.empty((Rows, Columns), dtype=numpy.float32)
        self._imaginaryData = numpy.empty((Rows, Columns), dtype=numpy.float32)

        for CurrentRow in range(Rows):

            # array.fromfile() appends, so start from an empty array on
            # each row; reusing one array would keep handing back row 0.
            floatData = array.array('f')
            floatData.fromfile(DataFH, Columns * 2)

            position = 0
            for CurrentColumn in range(Columns):

                self._realData[CurrentRow, CurrentColumn]      = floatData[position]
                self._imaginaryData[CurrentRow, CurrentColumn] = floatData[position + 1]
                position = position + 2


The above code works but is much too slow. If I comment out the body
of the "for CurrentColumn in range(Columns)" loop, the performance is
perfectly adequate, i.e. the function-call overhead associated with the
"fromfile(...)" call is not bad at all. What seems to be most
time-consuming are the simple assignment statements in the
"CurrentColumn" for-loop.

Does anyone see any ways of speeding this up at all? Reading
everything into a complex64 ndarray in one fell swoop would certainly
be easier and faster, but at some point I'll need to split this array
into two parts (real / imaginary). I'd like to have that done
initially to keep the memory usage down since the files are so
ginormous.

Psyco is out because I need 64 bits, and I didn't see anything on the
forums regarding a method that reads in every other 32-bit chunk from
a file into an array. I'm not sure what else to try.
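
For reference, the "one fell swoop" approach would look something like
this (just a sketch, with DataFH, Rows and Columns as in the code
above); the complex64 temporary is exactly what I'd like to avoid:

        import numpy

        # Read the interleaved file as complex64 in one shot...
        complexData = numpy.fromfile(DataFH, dtype=numpy.complex64,
                                     count=Rows * Columns).reshape(Rows, Columns)

        # ...then copy the components out and drop the temporary. The
        # .copy() calls are needed because .real and .imag are only
        # views into complexData.
        realData      = complexData.real.copy()
        imaginaryData = complexData.imag.copy()
        del complexData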

Thanks in advance.
L
 
K

Krzysztof Retel

Lionel wrote:
 > Hello,
 > I have data stored in binary files. Some of these files are
 > huge... upwards of 2 gigs or more. They consist of 32-bit float
 > complex numbers [...]
 >
 > What seems to be most time-consuming are the simple assignment
 > statements in the "CurrentColumn" for-loop.
[snip]
Try array slicing. floatData[0::2] will return the real parts and
floatData[1::2] will return the imaginary parts. You'll have to read up
on how to assign to a slice of the numpy array (it might be
"self._realData[CurrentRow] = real_parts" or
"self._realData[CurrentRow, :] = real_parts").

BTW, it's not the function-call overhead of fromfile() which takes the
time, but actually reading the data from the file.
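
Untested, but something along these lines inside your row loop (same
DataFH/Columns setup as in your code):

    floatData = array.array('f')
    floatData.fromfile(DataFH, Columns * 2)

    # Even indices hold the real parts, odd indices the imaginary parts.
    self._realData[CurrentRow, :]      = floatData[0::2]
    self._imaginaryData[CurrentRow, :] = floatData[1::2]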
L

Lionel

Very nice! I like that! I'll post the improvement (if any).
Okay, the following:
            self._realData[CurrentRow]      = floatData[0::2]
            self._imaginaryData[CurrentRow] = floatData[1::2]
gives a 3.5x improvement in execution speed over the original that I
posted. That's much better. Thank you for the suggestion.

Correction: improvement is around 7-8x.

I had similar issues while slicing network packets (TCP/UDP) in real
time. I was using 're' and found it a lot more time- and
resource-consuming than 'normal' string slicing, as suggested by MRAB.
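
For illustration, the kind of thing I mean (a simplified, made-up
UDP-style header; struct plus plain slicing instead of a regex):

    import struct

    packet = '\x12\x34\x00\x35\x00\x1c\xab\xcd'
    src_port = struct.unpack('!H', packet[0:2])[0]  # bytes 0-1
    dst_port = struct.unpack('!H', packet[2:4])[0]  # bytes 2-3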

K
 
S

Slaunger

You might also want to have a look at a numpy memmap viewed as a
recarray.

from numpy import dtype, memmap, recarray

# Define your record in the file: 4 bytes for the real value,
# and 4 bytes for the imaginary (assuming little-endian repr).
descriptor = dtype([("r", "<f4"), ("i", "<f4")])

# Now typecast a memmap (a memory-efficient array) onto the file in
# read-only mode. Viewing it as a recarray means the attributes
# data.r and data.i are accessible as ordinary numpy arrays.
data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
print "First 100 real values:", data.r[:100]

--Slaunger
 
