write a 20GB file

Jackie Lee

Hello there,

I have a 22 GB binary file, and I want to change values at specific
positions. Because of the volume of the file, I doubt my code is an
efficient one:

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    f.write(struct.pack('>h',1))
    f.seek(212,1)
    f.seek(ns*4,1)

f.close()
 
Dave Angel

Jackie said:
Hello there,

I have a 22 GB binary file, and I want to change values at specific
positions. Because of the volume of the file, I doubt my code is an
efficient one:

<snip>

I don't see a question anywhere. So perhaps you just want comments on
your code.

1) How do you plan to test this?
2) Consider doing a lot more checking to see that you have in fact a
file of the right type.
3) Fix indentation - perhaps you've accidentally used a tab in the source.
4) Provide a termination condition for the while True loop, which
currently will (I think) go forever, or perhaps until the disk fills up.
5) Depending on the purpose of this file, you should consider making the
changes on a copy, then deleting and renaming. As it stands, if the
program gets aborted part way through, there's no way to know how far it
got. Since it's just clobbering bytes, it would be safe to rerun the
same program again, but many times that's not the case. And this
program clearly isn't finished yet, so perhaps it's not true here either.
6) I don't see anything inefficient about it. The nature of the problem
is going to be very slow (for small values of ns), but I don't know what
your code could do to speed it up. Perhaps make sure the file is on a
fast drive, and not RAID 5.

DaveA
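
To make Dave's point 5 concrete, here is a minimal sketch (not from the
thread) of the modify-a-copy-then-rename pattern; the temporary-file name
and the single patched field are made up for illustration:

#! /usr/bin/env python
# Sketch only: copy the file, patch the copy, then atomically replace
# the original, so an aborted run never leaves the original half-modified.
import os
import shutil
import struct
import sys

src = sys.argv[1]
tmp = src + '.tmp'              # hypothetical temporary name
shutil.copyfile(src, tmp)       # full copy of the 22 GB file (slow but safe)

f = open(tmp, 'rb+')
try:
    f.seek(3600)                # skip EBCDIC (3200) + binary (400) headers
    f.seek(28, 1)
    f.write(struct.pack('>h', 1))   # patch one field as an example
finally:
    f.close()

os.rename(tmp, src)             # atomic on POSIX when on the same filesystem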
 
Jackie Lee

Thx, Dave,

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush. So I
don't know if the following one is more efficient (sorry, I forgot to
add a condition to break the loop):

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    if f.read(2) == '':
        break
    f.seek(-2,1)
    f.write(struct.pack('>h',1))
    f.seek(210,1)
    f.seek(ns*4,1)

f.close()
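
For comparison, here is a sketch (untested) of a variant of the loop above
that tracks the absolute position itself, so each trace costs one seek and
one write instead of several seeks plus a read. It assumes f, ns and struct
are already set up as in the script above:

# drop-in replacement for the while loop above (sketch only)
f.seek(0, 2)                      # seek to the end to get the file size
end = f.tell()
pos = 3600                        # first trace header (3200 + 400)
trace_len = 240 + ns * 4          # 240-byte trace header + ns 4-byte samples
while pos + 240 <= end:
    f.seek(pos + 28)              # the field at offset 28 of the trace header
    f.write(struct.pack('>h', 1))
    pos += trace_len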


Dave Angel said:
<snip>
 
Jeff

Thx, Dave,

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush. So I
don't know if the following one is more efficient (sorry, I forgot to
add a condition to break the loop):

Someone smarter than me can correct me, but file.write() will write
when its buffer is filled, or when close() or flush() are called.
I don't know what the default buffer size for file.write() is though.
close() flushes the buffer before closing the file, and flush()
flushes the buffer and leaves the file open for further writing.
try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
        scriptname segyfilename
'''

You can just add an f.flush() every time you write to the file, but I
tend to open files with a 0 buffer size, like this:

f = open(filename,"rb+",0)

Then again, I don't deal with files of that size, so there could be a
problem with my way once you start scaling up to the 20GB or larger
that you're working with.

Again, I could be wrong about all of that, so if so, I hope someone
will correct me and fix my understanding...

Cheers,

Jeff
 
Martin v. Loewis

The code works fine. I just don't know how f.write works. It says that
file.write won't write to the file until file.close or file.flush.

You are misinterpreting the documentation. It certainly won't keep the
entire file in memory. Instead, it has a fixed-size buffer (something
like 8kiB or 32kiB) in which it writes and which it flushes when that
buffer is full.

The comment about flush and close merely refers to the problem that some
data may still be in the buffer at any point in time, unless you just
called close or flush.

HTH,
Martin
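
As a small illustration of the buffer Martin describes (a sketch with a
made-up file name): the third argument to open() sets the buffer size, and
flush()/close() are what push pending data out:

f = open('demo.bin', 'wb', 8192)   # 8 kB buffer; 0 would mean unbuffered
f.write('\x00' * 100)              # probably still sitting in the buffer
f.flush()                          # force those 100 bytes out to the OS
f.write('\x01' * 100)
f.close()                          # close() flushes whatever is left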
 
Nobody

Someone smarter than me can correct me, but file.write() will write
when its buffer is filled, or when close() or flush() are called.

And, in all probability, seek() will either flush it immediately or cause
the next write() to flush it before writing anything.
 
Jeff

And, in all probability, seek() will either flush it immediately or cause
the next write() to flush it before writing anything.

Ahhh... I didn't know that... I thought seek() just moved the pointer
through the file a little further....

Cool.
 
Jackie Lee

Thanks to y'all. I should have been more careful reading the documentation.

Cheers
 
Nobody

Ahhh... I didn't know that... I thought seek() just moved the pointer
through the file a little further....

Think about how this affects buffering. write() writes at the current file
position. If you write, then seek, then write, it can't just concatenate
the two sets of data, as that would "lose" the seek.

Either the buffer has to contain multiple, distinct sets of data, each
with an associated position, or (far more likely), the original data must
be written to the correct location before the second set of data can be
stored.
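
A tiny sketch (file name made up) showing that the buffering stays
transparent even with interleaved write() and seek() calls; the bytes
still land at the right offsets:

f = open('demo.bin', 'wb+')
f.write('AAAA')          # bytes 0-3
f.seek(8)                # jump forward, leaving a gap
f.write('BBBB')          # bytes 8-11
f.close()

f = open('demo.bin', 'rb')
print repr(f.read())     # 'AAAA\x00\x00\x00\x00BBBB'
f.close()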
 
Dave Angel

Nathan said:
This is precisely the situation mmap was made for :) It has almost the same
methods as a file so it should be an easy replacement.

<snip>

Only on a 64-bit system, and I'm not sure it's even possible there in
every case. On a 32-bit system, it would be impossible to mmap a 20 GB
file. You only have 4 GB of address space to play with, total.

DaveA
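
For readers on a 64-bit system, a rough sketch of what the mmap approach
could look like for the script earlier in the thread (same offsets as that
script; treat it as an untested illustration, not a drop-in replacement):

#! /usr/bin/env python
# Map the whole file and patch the 2-byte field at offset 28 of every
# 240-byte trace header.  Needs a 64-bit Python for a 20+ GB file.
import mmap
import struct
import sys

f = open(sys.argv[1], 'rb+')
m = mmap.mmap(f.fileno(), 0)                # length 0 = map the entire file
ns = struct.unpack('>h', m[3220:3222])[0]   # samples per trace, binary header
pos = 3600                                  # first trace header
trace_len = 240 + ns * 4
while pos + 240 <= m.size():
    m[pos + 28:pos + 30] = struct.pack('>h', 1)
    pos += trace_len
m.flush()
m.close()
f.close()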
 
Patrick Maupin

Only on a 64-bit system, and I'm not sure it's even possible there in
every case. On a 32-bit system, it would be impossible to mmap a 20 GB
file. You only have 4 GB of address space to play with, total.

DaveA

Well, depending on the OS, I think you could have multiple mappings
per file. So you could maintain your own mapping cache. That could
get a bit ugly, but depending on what you are doing, it might not be
too bad.

Regards,
Pat
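
Pat's mapping-cache idea could look roughly like this (a hypothetical
helper, not from the thread): map a window of the file around the offset
you need instead of the whole thing, which also fits a 32-bit address
space:

import mmap

GRAN = mmap.ALLOCATIONGRANULARITY
WINDOW = 256 * 1024 * 1024           # 256 MB window; tune to taste

def write_at(f, file_size, abs_offset, data):
    # Sketch: map one window that contains abs_offset and patch it there.
    # Assumes the write fits inside a single window; mmap offsets must be
    # multiples of the allocation granularity.
    start = (abs_offset // GRAN) * GRAN
    length = min(WINDOW, file_size - start)
    m = mmap.mmap(f.fileno(), length, offset=start)
    rel = abs_offset - start
    m[rel:rel + len(data)] = data
    m.close()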
 
