Skipping bytes while reading a binary file?

Discussion in 'Python' started by Lionel, Feb 5, 2009.

  1. Lionel

    Lionel Guest

    I have data stored in binary files. Some of these files are
    huge: 2 gigs or more. They consist of 32-bit float complex
    numbers where the first 32 bits of the file are the real component, the
    second 32 bits are the imaginary component, the third 32 bits are the
    real component of the second number, etc.

    I'd like to be able to read in just the real components, load them
    into a numpy.ndarray, then read the imaginary components and load them
    into a second numpy.ndarray. I need the real and imaginary components
    stored in separate arrays; they cannot be in a single array of complex
    numbers except temporarily. I'm trying to avoid temporary storage,
    though, because of the size of the files.

    I'm currently reading the file scanline-by-scanline to extract rows of
    complex numbers which I then loop over and load into the real/
    imaginary arrays as follows:

    self._realData = numpy.empty((Rows, Columns), dtype =
    self._imaginaryData = numpy.empty((Rows, Columns), dtype =

    floatData = array.array('f')

    for CurrentRow in range(Rows):

        floatData.fromfile(DataFH, (Columns*2))

        position = 0
        for CurrentColumn in range(Columns):

            self._realData[CurrentRow, CurrentColumn] =
            self._imaginaryData[CurrentRow, CurrentColumn] =
            position = position + 2

    The above code works but is much too slow. If I comment out the body
    of the "for CurrentColumn in range(Columns)" loop, the performance is
    perfectly adequate, i.e. the function call overhead associated with the
    "fromfile(...)" call is not very bad at all. What seems to be most
    time-consuming is the simple assignment statements in the
    "CurrentColumn" for-loop.

    Does anyone see any ways of speeding this up at all? Reading
    everything into a complex64 ndarray in one fell swoop would certainly
    be easier and faster, but at some point I'll need to split this array
    into two parts (real / imaginary). I'd like to have that done
    initially to keep the memory usage down since the files are so

    Psyco is out because I need 64-bit, and I didn't see anything on the
    forums regarding a method that reads in every other 32-bit chunk from
    a file into an array. I'm not sure what else to try.
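
Short of a reader that skips bytes, strided slicing on a freshly read row picks out every other 32-bit value without a per-element Python loop. The sketch below assumes little-endian float32 data and an already open binary file handle; `read_split_rows` is a made-up name for illustration.

```python
import numpy as np

def read_split_rows(fh, rows, columns):
    """Read interleaved (real, imag) float32 pairs one row at a time.

    Hypothetical helper: fh must be a real file opened in binary mode.
    """
    real = np.empty((rows, columns), dtype=np.float32)
    imag = np.empty((rows, columns), dtype=np.float32)
    for r in range(rows):
        # One row of complex values is 2 * columns float32 values on disk.
        row = np.fromfile(fh, dtype="<f4", count=2 * columns)
        if row.size != 2 * columns:
            raise EOFError("file ended before row %d was complete" % r)
        real[r, :] = row[0::2]  # even positions hold the real parts
        imag[r, :] = row[1::2]  # odd positions hold the imaginary parts
    return real, imag
```

Each slice assignment copies a whole row inside NumPy, so the only Python-level loop left is the one over rows.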

    Thanks in advance.
    Lionel, Feb 5, 2009

  2. I had similar issues while slicing network packets (TCP/UDP) in real
    time.
    I was using 're' and found it a lot more time- and resource-consuming
    than the 'normal' string slicing suggested by MRAB.

    Krzysztof Retel, Feb 6, 2009

  3. Lionel

    Slaunger Guest

    You might also want to have a look at a numpy memmap viewed as a
    recarray:

    from numpy import dtype, memmap, recarray
    # Define your record in the file: 4 bytes for the real value
    # and 4 bytes for the imaginary (assuming little-endian representation)
    descriptor = dtype([("r", "<f4"), ("i", "<f4")])
    # Now typecast a memmap (a memory-efficient array) onto the file in
    # read-only mode.
    # Viewing it as a recarray means the attributes data.r and data.i are
    # accessible as ordinary numpy arrays.
    data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
    print("First 100 real values:", data.r[:100])
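
If the components are needed as two separate in-memory arrays rather than views onto the file, each field of the memmap can be copied out on its own. This is a sketch under assumptions: the helper name `load_components` and the `rows`/`columns` arguments are illustrative, and the file is taken to be little-endian with a known shape.

```python
import numpy as np

def load_components(filename, rows, columns):
    """Copy the real and imaginary fields of a memory-mapped file.

    Hypothetical helper: returns two contiguous float32 arrays.
    """
    record = np.dtype([("r", "<f4"), ("i", "<f4")])
    data = np.memmap(filename, dtype=record, mode="r", shape=(rows, columns))
    # data["r"] and data["i"] are strided views onto the mapped file;
    # np.array(...) copies each one into its own contiguous array.
    real = np.array(data["r"])
    imag = np.array(data["i"])
    del data  # drop the reference to the mapping once the copies exist
    return real, imag
```

No complex-valued temporary is created here; the operating system pages the file in as the two copies are made.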

    Slaunger, Feb 6, 2009