writing large files quickly


rbt

I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Thanks!
 

superfun

One way to speed this up is to write larger strings:

fd = file('large_file.bin', 'wb')
for x in xrange(51200000):
    fd.write('00000000')
fd.close()

However, I bet within an hour or so you will have a much better answer
or 10. =)
 
casevh

rbt said:
I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Thanks!

Untested, but this should be faster.

block = '0' * 409600
fd = file('large_file.bin', 'wb')
for x in range(1000):
    fd.write('0')
fd.close()
 
Tim Chase

Untested, but this should be faster.
block = '0' * 409600
fd = file('large_file.bin', 'wb')
for x in range(1000):
    fd.write('0')
fd.close()

Just checking...you mean

fd.write(block)

right? :) Otherwise, you end up with just 1000 "0" characters in
your file :)

Is there anything preventing one from just doing the following?

fd.write("0" * 409600000)

It's one huge string for a very short time. It skips all the
looping and allows Python to pump the file out to the disk as
fast as the OS can handle it. (and sorta as fast as Python can
generate this humongous string)

-tkc
 
Paul Rubin

Tim Chase said:
Is there anything preventing one from just doing the following?
fd.write("0" * 409600000)
It's one huge string for a very short time. It skips all the looping
and allows Python to pump the file out to the disk as fast as the OS
can handle it. (and sorta as fast as Python can generate this
humongous string)

That's large enough that it might exceed your PC's memory and cause
swapping. Try strings of about 64k (65536).
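As a sketch of the chunked approach Paul describes (in modern Python 3 syntax, with a small TOTAL so it runs in a blink; scale it up for a real 400 MB file):

```python
import os
import tempfile

CHUNK = 65536       # 64 KiB per write, as suggested
TOTAL = 200000      # demo target size in bytes; use 409600000 for the real thing

def write_zeros(path, total, chunk):
    # Write fixed-size blocks so memory use stays at one block,
    # whatever the target file size.
    block = b'0' * chunk
    with open(path, 'wb') as f:
        for _ in range(total // chunk):
            f.write(block)
        f.write(b'0' * (total % chunk))  # remainder if total isn't a multiple

path = os.path.join(tempfile.mkdtemp(), 'large_file.bin')
write_zeros(path, TOTAL, CHUNK)
print(os.path.getsize(path))  # → 200000
```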
 
Grant Edwards

I've been doing some file system benchmarking. In the process, I need to
create a large file to copy around to various drives. I'm creating the
file like this:

fd = file('large_file.bin', 'wb')
for x in xrange(409600000):
    fd.write('0')
fd.close()

This takes a few minutes to do. How can I speed up the process?

Don't write so much data.

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')
f.close()

That should be almost instantaneous, in that the time required
for those four lines of code is negligible compared to interpreter
startup and shutdown.
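The same seek trick in modern Python 3 (bytes instead of str), shrunk to 10 MB so the demo stays small. Only one byte is ever written; the reported file size is still the full amount. Whether the gap also occupies disk space depends on the filesystem:

```python
import os
import tempfile

SIZE = 10 * 1024 * 1024  # 10 MB demo; the thread uses ~400 MB

path = os.path.join(tempfile.mkdtemp(), 'sparse.bin')
with open(path, 'wb') as f:
    f.seek(SIZE - 1)    # jump past the end; the gap is never written
    f.write(b'\x00')    # one real byte at the final position

print(os.path.getsize(path))  # → 10485760
```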
 

casevh

Oops. I did mean

fd.write(block)

The only limit is available memory. I've used 1MB block sizes when I
did read/write tests. I was comparing NFS vs. local disk performance. I
know Python can do at least 100MB/sec.
 
Tim Chase

fd.write('0')
[cut]
f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

[email protected]:~/temp$ xxd op.bin
0000000: 3030 3030 3030 3030 3030                 0000000000
[email protected]:~/temp$ xxd new.bin
0000000: 0000 0000 0000 0000 0000                 ..........

(using only length 10 instead of 400 megs to save time and disk
space...)

-tkc
 
Grant Edwards

fd.write('0')
[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.
 
rbt

Grant said:
fd.write('0')
[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):


Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

How the heck does that make a 400 MB file that fast? It literally takes
a second or two while every other solution takes at least 2 - 5 minutes.
Awesome... thanks for the tip!!!

Thanks to all for the advice... one can really learn things here :)
 
rbt

Grant said:
Don't write so much data.

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')
f.close()

OK, I'm still trying to pick my jaw up off of the floor. One question...
how big of a file could this method create? 20GB, 30GB, limit depends
on filesystem, etc?
 

Donn Cave

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

How the heck does that make a 400 MB file that fast? It literally takes
a second or two while every other solution takes at least 2 - 5 minutes.
Awesome... thanks for the tip!!!

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Donn Cave, (e-mail address removed)
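A quick sketch of Donn's point: the bytes in the never-written "hole" read back as NULs. Only the last byte of this 1000-byte file is ever written:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'hole.bin')
with open(path, 'wb') as f:
    f.seek(999)         # skip the first 999 bytes entirely
    f.write(b'\x00')    # write only the final byte

with open(path, 'rb') as f:
    data = f.read()

print(len(data))               # → 1000
print(data == b'\x00' * 1000)  # → True: the hole reads back as NULs
```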
 
rbt

Donn said:
Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks.

Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.
 
Robert Kern

rbt said:
Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.

google("sparse files")

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Grant Edwards

fd.write('0')

[cut]

f = file('large_file.bin','wb')
f.seek(409600000-1)
f.write('\x00')

While a mindblowingly simple/elegant/fast solution (kudos!), the
OP's file ends up full of the character zero (ASCII 0x30),
while your solution ends up full of the NUL character (ASCII 0x00):

Oops. I missed the fact that he was writing 0x30 and not 0x00.

Yes, the "hole" in the file will read as 0x00 bytes. If the OP
actually requires that the file contain something other than
0x00 bytes, then my solution won't work.

Won't work!? It's absolutely fabulous! I just need something big, quick
and zeros work great.

Then Bob's your uncle, eh?
How the heck does that make a 400 MB file that fast?

Most of the file isn't really there, it's just a big "hole" in
a sparse file containing a single allocation block that
contains the single 0x00 byte that was written:

$ ls -l large_file.bin
-rw-r--r-- 1 grante users 409600000 Jan 27 15:02 large_file.bin
$ du -h large_file.bin
12K large_file.bin

The filesystem code in the OS is written so that it returns
'0x00' bytes when you attempt to read data from the "hole" in
the file. So, if you open the file and start reading, you'll
get 400MB of 0x00 bytes before you get an EOF return. But the
file really only takes up a couple "chunks" of disk space, and
chunks are usually on the order of 4KB.
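The same numbers ls and du report are visible from Python via os.stat: st_size is the apparent length, and st_blocks (POSIX-only, counted in 512-byte units) reflects the space actually allocated. A rough sketch, assuming a filesystem with hole support:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'hole.bin')
with open(path, 'wb') as f:
    f.seek(5 * 1024 * 1024 - 1)  # 5 MB apparent size for a quick demo
    f.write(b'\x00')

st = os.stat(path)
print(st.st_size)          # apparent size: 5242880
print(st.st_blocks * 512)  # bytes actually allocated, typically a few KB here
```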
 
Grant Edwards

Hmmm... when I copy the file to a different drive, it takes up
409,600,000 bytes. Also, an md5 checksum on the generated file and on
copies placed on other drives are the same. It looks like a regular, big
file... I don't get it.

Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.
 

Grant Edwards

OK, I'm still trying to pick my jaw up off of the floor. One
question... how big of a file could this method create? 20GB,
30GB, limit depends on filesystem, etc?

Right. Back in the day, the old libc and ext2 code had a 2GB
file size limit at one point (it used a signed 32-bit value
to keep track of file size/position). That was back when a 1GB
drive was something to brag about, so it wasn't a big deal for
most people.

I think everything has large-file support enabled by default
now, so the limit is 2^63 for most "modern" filesystems --
that's the limit of the file size you can create using the
seek() trick. The limit for actual on-disk bytes may not be
that large.

Here's a good link:

http://www.suse.de/~aj/linux_lfs.html
 
Erik Andreas Brandstadmoen

Grant said:
Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.

And, this file is of course useless for FS benchmarking, since you're
barely reading data from disk at all. You'll just be testing the FS's
handling of sparse files. I suggest you go for one of the suggestions
with larger block sizes. That's probably your best bet.

Regards,

Erik Brandstadmoen
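A rough sketch of Erik's suggestion, timing real (non-sparse) writes at a few block sizes. The sizes here are deliberately small so it finishes quickly, but the trend shows tiny writes paying heavy per-call overhead:

```python
import os
import tempfile
import time

def timed_write(path, total, chunk):
    # Write `total` bytes in `chunk`-sized blocks and return elapsed seconds.
    block = b'0' * chunk
    start = time.perf_counter()
    with open(path, 'wb') as f:
        for _ in range(total // chunk):
            f.write(block)
    return time.perf_counter() - start

d = tempfile.mkdtemp()
TOTAL = 8 * 1024 * 1024  # 8 MB per pass; TOTAL is a multiple of each chunk size
for chunk in (64, 4096, 65536, 1024 * 1024):
    elapsed = timed_write(os.path.join(d, 'bench.bin'), TOTAL, chunk)
    print('%8d-byte blocks: %.4f s' % (chunk, elapsed))
```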
 
rbt

Grant said:
Because the filesystem code keeps track of where you are in
that 400MB stream, and returns 0x00 anytime you're reading from
a "hole". The "cp" program and the "md5sum" just open the file
and start read()ing. The filesystem code returns 0x00 bytes
for all of the read positions that are in the "hole", just like
Donn said.

OK I finally get it. It's too good to be true :)

I'm going back to using _real_ files... not files that just look
as if they are there. BTW, the file 'size' and 'size on disk' were
identical on win 2003. That's a bit deceptive. According to the NTFS
docs, they should be drastically different... 'size on disk' should be
something like 64K.
 

Grant Edwards

And, this file is of course useless for FS benchmarking, since
you're barely reading data from disk at all.

Quite right. Copying such a sparse file is probably only
really testing the write performance of the filesystem
containing the destination file.
You'll just be testing the FS's handling of sparse files.

Which may be a useful thing to know, but I rather doubt it.
 
