Question about struct.unpack


OhKyu Yoon

Hi!
I have a really long binary file that I want to read.
The way I am doing it now is:

for i in xrange(N):  # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))
    # do something

Each loop takes about 0.2 ms on my computer, which means the whole for loop
takes about 2000 seconds.
I would like it to run faster.
Do you have any suggestions?
Thank you very much.

OhKyu
 

Steven D'Aprano

Hi!
I have a really long binary file that I want to read.
The way I am doing it now is:

for i in xrange(N):  # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))

I assume that is supposed to be infile.read()

# do something

Each loop takes about 0.2 ms on my computer, which means the whole for loop
takes about 2000 seconds.

You're reading 400 million bytes, or 400MB, in about half an hour. Whether
that's fast or slow depends on what the "do something" lines are doing.

I would like it to run faster.
Do you have any suggestions?

Disk I/O is slow, so don't read from files in tiny little chunks. Read a
bunch of records into memory, then process them.

# UNTESTED!
rsize = 8 + 32  # record size
for i in xrange(N//1000):
    buffer = infile.read(rsize*1000)  # read 1000 records at once
    for j in xrange(1000):  # process each record
        offset = j*rsize
        time = struct.unpack('=HHHH', buffer[offset:offset+8])
        # do something
        tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
        # do something


(Now I'm just waiting for somebody to tell me that file.read() already
buffers reads...)
 

eC

Hi!
I have a really long binary file that I want to read.
The way I am doing it now is:
for i in xrange(N):  # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))

I assume that is supposed to be infile.read()
# do something
Each loop takes about 0.2 ms on my computer, which means the whole for loop
takes about 2000 seconds.

You're reading 400 million bytes, or 400MB, in about half an hour. Whether
that's fast or slow depends on what the "do something" lines are doing.
I would like it to run faster.
Do you have any suggestions?

Disk I/O is slow, so don't read from files in tiny little chunks. Read a
bunch of records into memory, then process them.

# UNTESTED!
rsize = 8 + 32  # record size
for i in xrange(N//1000):
    buffer = infile.read(rsize*1000)  # read 1000 records at once
    for j in xrange(1000):  # process each record
        offset = j*rsize
        time = struct.unpack('=HHHH', buffer[offset:offset+8])
        # do something
        tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
        # do something

(Now I'm just waiting for somebody to tell me that file.read() already
buffers reads...)

I think file.read() already buffers reads... :)
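
If you want a bigger buffer than the default, you can also ask for one explicitly. A minimal, untested sketch, assuming CPython 2.x, where file objects wrap C stdio and open() accepts an approximate buffer size as its optional third argument (the filename and the 1 MB figure are just placeholders, not from the original post):

# untested sketch: request a ~1 MB stdio buffer up front
infile = open("data.bin", "rb", 1024 * 1024)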
 

Gabriel Genellina

I have a really long binary file that I want to read.
The way I am doing it now is:
for i in xrange(N):  # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))

Disk I/O is slow, so don't read from files in tiny little chunks. Read a
bunch of records into memory, then process them.

# UNTESTED!
rsize = 8 + 32  # record size
for i in xrange(N//1000):
    buffer = infile.read(rsize*1000)  # read 1000 records at once
    for j in xrange(1000):  # process each record
        offset = j*rsize
        time = struct.unpack('=HHHH', buffer[offset:offset+8])
        # do something
        tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
        # do something

(Now I'm just waiting for somebody to tell me that file.read() already
buffers reads...)

I think file.read() already buffers reads... :)

Now we need someone to actually measure it, to confirm the expected
behavior... Done.

--- begin code ---
import struct, timeit, os

fn = r"c:\temp\delete.me"
fsize = 1000000
if not os.path.isfile(fn):
    f = open(fn, "wb")
    f.write("\0" * fsize)
    f.close()
    os.system("sync")

def smallreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N):
        time = struct.unpack('=HHHH', f.read(8))
        tdc = struct.unpack('=LiLiLiLi', f.read(32))
    f.close()


def bigreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N//1000):
        buffer = f.read(rsize*1000)  # read 1000 records at once
        for j in xrange(1000):  # process each record
            offset = j*rsize
            time = struct.unpack('=HHHH', buffer[offset:offset+8])
            tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
    f.close()

print "smallreads", timeit.Timer("smallreads(fn)", "from __main__ import fn,smallreads,fsize").repeat(3, 1)
print "bigreads", timeit.Timer("bigreads(fn)", "from __main__ import fn,bigreads,fsize").repeat(3, 1)
--- end code ---

Output:
smallreads [4.2534193777646663, 4.126013885559789, 4.2389176672125458]
bigreads [1.2897319939456011, 1.3076018578892405, 1.2703250635695138]

So in this sample case, reading in big chunks is about 3 times faster than
reading many tiny pieces.
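
A possible further tweak on top of the chunked reads, as an untested sketch (assuming Python 2.5 or later, where struct.Struct was added): precompile the two formats and use unpack_from(), so each record is decoded straight out of the buffer instead of building temporary slice strings first. The function name and its arguments below are just illustrative, not part of the measured code above.

--- begin code ---
# Untested sketch, assuming Python 2.5+ for struct.Struct / unpack_from.
import struct

time_fmt = struct.Struct('=HHHH')      # 8-byte 'time' part of each record
tdc_fmt = struct.Struct('=LiLiLiLi')   # 32-byte 'tdc' part of each record
rsize = time_fmt.size + tdc_fmt.size   # 40 bytes per record

def bigreads_unpack_from(fn, fsize):
    f = open(fn, "rb")
    nrecords = fsize // rsize
    for i in xrange(nrecords // 1000):
        buffer = f.read(rsize * 1000)      # read 1000 records at once
        for j in xrange(1000):             # decode each record in place
            offset = j * rsize
            time = time_fmt.unpack_from(buffer, offset)
            tdc = tdc_fmt.unpack_from(buffer, offset + 8)
    f.close()
--- end code ---

Whether that buys anything noticeable on top of the chunked reads would have to be measured the same way.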
 

OhKyu Yoon

Wow, thank you all!

Gabriel Genellina said:
On Mon, 30 Apr 2007 00:45:22 -0700, OhKyu Yoon wrote:
I have a really long binary file that I want to read.
The way I am doing it now is:

for i in xrange(N):  # N is about 10,000,000
    time = struct.unpack('=HHHH', infile.read(8))
    # do something
    tdc = struct.unpack('=LiLiLiLi', self.lmf.read(32))

Disk I/O is slow, so don't read from files in tiny little chunks. Read a
bunch of records into memory, then process them.

# UNTESTED!
rsize = 8 + 32  # record size
for i in xrange(N//1000):
    buffer = infile.read(rsize*1000)  # read 1000 records at once
    for j in xrange(1000):  # process each record
        offset = j*rsize
        time = struct.unpack('=HHHH', buffer[offset:offset+8])
        # do something
        tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
        # do something

(Now I'm just waiting for somebody to tell me that file.read() already
buffers reads...)

I think file.read() already buffers reads... :)

Now we need someone to actually measure it, to confirm the expected
behavior... Done.

--- begin code ---
import struct, timeit, os

fn = r"c:\temp\delete.me"
fsize = 1000000
if not os.path.isfile(fn):
    f = open(fn, "wb")
    f.write("\0" * fsize)
    f.close()
    os.system("sync")

def smallreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N):
        time = struct.unpack('=HHHH', f.read(8))
        tdc = struct.unpack('=LiLiLiLi', f.read(32))
    f.close()


def bigreads(fn):
    rsize = 40
    N = fsize // rsize
    f = open(fn, "rb")
    for i in xrange(N//1000):
        buffer = f.read(rsize*1000)  # read 1000 records at once
        for j in xrange(1000):  # process each record
            offset = j*rsize
            time = struct.unpack('=HHHH', buffer[offset:offset+8])
            tdc = struct.unpack('=LiLiLiLi', buffer[offset+8:offset+rsize])
    f.close()

print "smallreads", timeit.Timer("smallreads(fn)", "from __main__ import fn,smallreads,fsize").repeat(3, 1)
print "bigreads", timeit.Timer("bigreads(fn)", "from __main__ import fn,bigreads,fsize").repeat(3, 1)
--- end code ---

Output:
smallreads [4.2534193777646663, 4.126013885559789, 4.2389176672125458]
bigreads [1.2897319939456011, 1.3076018578892405, 1.2703250635695138]

So in this sample case, reading in big chunks is about 3 times faster than
reading many tiny pieces.
 
