writing large files quickly


rbt

I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Thanks!
 

superfun

One way to speed this up is to write larger strings:

fd = file('large_file.bin', 'wb')
for x in xrange(51200000):
    fd.write('00000000')
fd.close()

However, I bet within an hour or so you will have a much better answer
or 10. =)
 
casevh

rbt said:
I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Thanks!

Untested, but this should be faster.

block = '0' * 409600
fd = file('large_file.bin', 'wb')
for x in range(1000):
    fd.write('0')
fd.close()
 
Tim Chase

Untested, but this should be faster.
block = '0' * 409600
fd = file('large_file.bin', 'wb')
for x in range(1000):
    fd.write('0')
fd.close()

Just checking...you mean

fd.write(block)

right? :) Otherwise, you end up with just 1000 "0" characters in
your file :)

Is there anything preventing one from just doing the following?

fd.write("0" * 409600000)

It's one huge string for a very short time. It skips all the
looping and allows Python to pump the file out to the disk as
fast as the OS can handle it. (and sorta as fast as Python can
generate this humongous string)

-tkc
 
Paul Rubin

Tim Chase said:
Is there anything preventing one from just doing the following?
fd.write("0" * 409600000)
It's one huge string for a very short time. It skips all the looping
and allows Python to pump the file out to the disk as fast as the OS
can handle it. (and sorta as fast as Python can generate this
humongous string)

That's large enough that it might exceed your PC's memory and cause
swapping. Try strings of about 64k (65536).
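As a sketch of the chunked approach Paul describes (in modern Python 3 syntax, with a small TOTAL so it runs in a blink; scale it up for a real 400 MB file):

```python
import os
import tempfile

CHUNK = 65536       # 64 KiB per write, as suggested
TOTAL = 200000      # demo target size in bytes; use 409600000 for the real thing

def write_zeros(path, total, chunk):
    # Write fixed-size blocks so memory use stays at one block,
    # whatever the target file size.
    block = b'0' * chunk
    with open(path, 'wb') as f:
        for _ in range(total // chunk):
            f.write(block)
        f.write(b'0' * (total % chunk))  # remainder if total isn't a multiple

path = os.path.join(tempfile.mkdtemp(), 'large_file.bin')
write_zeros(path, TOTAL, CHUNK)
print(os.path.getsize(path))  # → 200000
```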
 
Grant Edwards

I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Don't write so much data.

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')
f.close()

That should be almost instantaneous, in that the time required
for those four lines of code is negligible compared to interpreter
startup and shutdown.
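The same seek trick in modern Python 3 (bytes instead of str), shrunk to 10 MB so the demo stays small. Only one byte is ever written; the reported file size is still the full amount. Whether the gap also occupies disk space depends on the filesystem:

```python
import os
import tempfile

SIZE = 10 * 1024 * 1024  # 10 MB demo; the thread uses ~400 MB

path = os.path.join(tempfile.mkdtemp(), 'sparse.bin')
with open(path, 'wb') as f:
    f.seek(SIZE - 1)    # jump past the end; the gap is never written
    f.write(b'\x00')    # one real byte at the final position

print(os.path.getsize(path))  # → 10485760
```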
 

casevh

Oops. I did mean

fd.write(block)

The only limit is available memory. I've used 1MB block sizes when I
did read/write tests. I was comparing NFS vs. local disk performance. I
know Python can do at least 100MB/sec.
 
Tim Chase

fd.write('0')
[cut]
f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

[email protected]:~/temp$ xxd op.bin
0000000: 3030 3030 3030 3030 3030                 0000000000
[email protected]:~/temp$ xxd new.bin
0000000: 0000 0000 0000 0000 0000                 ..........

(using only length 10 instead of 400 megs to save time and disk
space...)

-tkc
 
Grant Edwards

fd.write('0')
[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.
 
rbt

Grant said:
fd.write('0')
[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):


Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

How the heck does that make a 400 MB file that fast? It literally takes
a second or two while every other solution takes at least 2 - 5 minutes.
Awesome... thanks for the tip!!!

Thanks to all for the advice... one can really learn things here :)
 
rbt

Grant said:
Don't write so much data.

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')
f.close()

OK, I'm still trying to pick my jaw up off of the floor. One question...
how big of a file could this method create? 20GB, 30GB, limit depends
on filesystem, etc?
 

Donn Cave

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

How the heck does that make a 400 MB file that fast? It literally takes
a second or two while every other solution takes at least 2 - 5 minutes.
Awesome... thanks for the tip!!!

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Donn Cave, (e-mail address removed)
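A quick sketch of Donn's point: the bytes in the never-written "hole" read back as NULs. Only the last byte of this 1000-byte file is ever written:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'hole.bin')
with open(path, 'wb') as f:
    f.seek(999)         # skip the first 999 bytes entirely
    f.write(b'\x00')    # write only the final byte

with open(path, 'rb') as f:
    data = f.read()

print(len(data))               # → 1000
print(data == b'\x00' * 1000)  # → True: the hole reads back as NULs
```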
 
rbt

Donn said:
Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks.

Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.
 
Robert Kern

rbt said:
Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.

google("sparse files")

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Grant Edwards

fd.write('0')

[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

Then Bob's your uncle, eh?
How the heck does that make a 400 MB file that fast?

Most of the file isn't really there, it's just a big "hole" in
a sparse file containing a single allocation block that
contains the single 0x00 byte that was written:

$ ls -l large_file.bin
-rw-r--r-- 1 grante users 409600000 Jan 27 15:02 large_file.bin
$ du -h large_file.bin
12K large_file.bin

The filesystem code in the OS is written so that it returns
'0x00' bytes when you attempt to read data from the "hole" in
the file. So, if you open the file and start reading, you'll
get 400MB of 0x00 bytes before you get an EOF return. But the
file really only takes up a couple "chunks" of disk space, and
chunks are usually on the order of 4KB.
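The same numbers ls and du report are visible from Python via os.stat: st_size is the apparent length, and st_blocks (POSIX-only, counted in 512-byte units) reflects the space actually allocated. A rough sketch, assuming a filesystem with hole support:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'hole.bin')
with open(path, 'wb') as f:
    f.seek(5 * 1024 * 1024 - 1)  # 5 MB apparent size for a quick demo
    f.write(b'\x00')

st = os.stat(path)
print(st.st_size)          # apparent size: 5242880
print(st.st_blocks * 512)  # bytes actually allocated, typically a few KB here
```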
 
Grant Edwards

Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.

Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.
 

Grant Edwards

OK, I'm still trying to pick my jaw up off of the floor. One
question... how big of a file could this method create? 20GB,
30GB, limit depends on filesystem, etc?

Right. Back in the day, the old libc and ext2 code had a 2GB
file size limit at one point (it used a signed 32-bit value
to keep track of file size/position). That was back when a 1GB
drive was something to brag about, so it wasn't a big deal for
most people.

I think everything has large-file support enabled by default
now, so the limit is 2^63 for most "modern" filesystems --
that's the limit of the file size you can create using the
seek() trick. The limit for actual on-disk bytes may not be
that large.

Here's a good link:

http://www.suse.de/~aj/linux_lfs.html
 
Erik Andreas Brandstadmoen

Grant said:
Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.

And, this file is of course useless for FS benchmarking, since you're
barely reading data from disk at all. You'll just be testing the FS's
handling of sparse files. I suggest you go for one of the suggestions
with larger block sizes. That's probably your best bet.

Regards,

Erik Brandstadmoen
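A rough sketch of Erik's suggestion, timing real (non-sparse) writes at a few block sizes. The sizes here are deliberately small so it finishes quickly, but the trend shows tiny writes paying heavy per-call overhead:

```python
import os
import tempfile
import time

def timed_write(path, total, chunk):
    # Write `total` bytes in `chunk`-sized blocks and return elapsed seconds.
    block = b'0' * chunk
    start = time.perf_counter()
    with open(path, 'wb') as f:
        for _ in range(total // chunk):
            f.write(block)
    return time.perf_counter() - start

d = tempfile.mkdtemp()
TOTAL = 8 * 1024 * 1024  # 8 MB per pass; TOTAL is a multiple of each chunk size
for chunk in (64, 4096, 65536, 1024 * 1024):
    elapsed = timed_write(os.path.join(d, 'bench.bin'), TOTAL, chunk)
    print('%8d-byte blocks: %.4f s' % (chunk, elapsed))
```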
 
rbt

Grant said:
Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.

OK I finally get it. It's too good to be true :)

I'm going back to using _real_ files... not files that just look
as if they are there. BTW, the file 'size' and 'size on disk' were
identical on win 2003. That's a bit deceptive. According to the NTFS
docs, they should be drastically different... 'size on disk' should be
something like 64K.
 

Grant Edwards

And, this file is of course useless for FS benchmarking, since
you're barely reading data from disk at all.

Quite right. Copying such a sparse file is probably only
really testing the write performance of the filesystem
containing the destination file.
You'll just be testing the FS's handling of sparse files.

Which may be a useful thing to know, but I rather doubt it.
 
