writing large files quickly


Grant Edwards

OK I finally get it. It's too good to be true :)

Sorry about that. I should have paid closer attention to what
you were going to do with the file.
I'm going back to using _real_ files... not files that merely look as if
they are there but aren't. BTW, the file 'size' and
'size on disk' were identical on win 2003. That's a bit
deceptive.

What?! Windows lying to the user? I don't believe it!
According to the NTFS docs, they should be drastically
different... 'size on disk' should be like 64K or something.

Probably.
 

Steven D'Aprano

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.
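
For reference, the trick being discussed is nothing more than a seek past
the intended end followed by a single write; a rough sketch (file name and
size here are only illustrative):

f = open('sparse.bin', 'wb')
f.seek(400 * 1024 * 1024 - 1)   # jump ~400 MB past the start without writing anything
f.write(b'\x00')                # one real byte; on a sparse-capable fs the rest is a "hole"
f.close()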


Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?


Here is another possible solution, if you are running Linux, farm the real
work out to some C code optimised for writing blocks to the disk:

# untested and, it goes without saying, untimed
os.system("dd if=/dev/zero of=largefile.bin bs=64K count=16384")


That should make a 1GB file (16384 blocks of 64K each) as fast as possible.
If you have lots and lots of memory, you could try upping the block size (bs=...).
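
For comparison, the same write-it-in-big-blocks idea in plain Python
(a sketch in the same spirit -- likewise untested and untimed; sizes
mirror the dd line above):

BLOCK = b'\x00' * (64 * 1024)      # one 64 KB block of zero bytes
f = open('largefile.bin', 'wb')
for _ in range(16384):             # 16384 * 64 KB = 1 GB
    f.write(BLOCK)
f.close()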
 

Ivan Voras

Steven said:
Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.
 

Grant Edwards

Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

If your fs doesn't support sparse files, then you'll end up with a
file that really does have 400MB of 0x00 bytes in it. Which is
what the OP really needed in the first place.
Here is another possible solution, if you are running Linux, farm the real
work out to some C code optimised for writing blocks to the disk:

# untested and, it goes without saying, untimed
os.system("dd if=/dev/zero of=largefile.bin bs=64K count=16384")

That should make a 1GB file (16384 blocks of 64K each) as fast as possible.
If you have lots and lots of memory, you could try upping the block size (bs=...).

I agree. That probably is the optimal solution for Unix boxes.
I messed around with something like that once, and block sizes
bigger than 64k didn't make much difference.
 

Jens Theisen

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Under which operating system/file system?

As far as I know this should be file system dependent at least under
Linux, as the calls to open and seek are served by the file system driver.

Jens
 

Jens Theisen

Ivan said:
Steven D'Aprano wrote:
Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.

I don't think ext2fs does this either. At least the du and df commands
suggest otherwise.

Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

On compressing filesystems such as NTFS you get this behaviour as a
special case of compression, and compression makes more sense.

Jens
 

Jens Theisen

Donn said:
Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Are you sure that's not just a case of asynchronous writing that can be
done in a particularly efficient way? df quite clearly tells me that I'm
running out of disk space on my ext2fs linux when I dump it full of
zeroes.

Jens
 

Ivan Voras

Jens said:
Ivan wrote:
Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.

I don't think ext2fs does this either. At least the du and df commands
suggest otherwise.

ext2 is a reimplementation of BSD UFS, so it does. Here:

f = open('bigfile', 'w')
f.seek(1024*1024)
f.write('a')
f.close()

$ ls -l bigfile
-rw-r--r-- 1 ivoras wheel 1048577 Jan 28 14:57 bigfile
$ du bigfile
8 bigfile
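
The same gap is visible from within Python via os.stat(); a quick check
(st_blocks is counted in 512-byte units on POSIX systems):

import os
st = os.stat('bigfile')
print(st.st_size)           # logical length: 1048577
print(st.st_blocks * 512)   # bytes actually allocated on disk -- far less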
Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

I read somewhere that it has a use in database software, but the only
thing I can imagine for this is when using heap queues
(http://python.active-venture.com/lib/node162.html).
 

Jens Theisen

Ivan said:
ext2 is a reimplementation of BSD UFS, so it does. Here:
f = open('bigfile', 'w')
f.seek(1024*1024)
f.write('a')
f.close()
$ ls -l bigfile
-rw-r--r-- 1 ivoras wheel 1048577 Jan 28 14:57 bigfile
$ du bigfile
8 bigfile

Interesting:

cp bigfile bigfile2

cat bigfile > bigfile3

du bigfile*
8 bigfile2
1032 bigfile3

So it's not consuming 0's. It just doesn't store unwritten data. And I
can think of an application for that: an application might want to write
the beginning of a file at a later point, so this makes it more efficient.
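
One common shape of that pattern, sketched here with made-up names and
sizes: stream out the payload first, then come back and fill in a
fixed-size header once the totals are known:

import struct

HEADER_SIZE = 4096

f = open('archive.bin', 'w+b')
f.seek(HEADER_SIZE)                       # reserve room for the header without writing it
f.write(b'payload...' * 1000)             # stream the bulk of the data
payload_end = f.tell()
f.seek(0)
f.write(struct.pack('<8sq', b'MYHDR', payload_end))   # header written last
f.close()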

I wonder how other file systems behave.
I read somewhere that it has a use in database software, but the only
thing I can imagine for this is when using heap queues
(http://python.active-venture.com/lib/node162.html).

That's an article about the heap data structure. Was it your
intention to link to this?

Jens
 

Scott David Daniels

I've used this feature eons ago where the file was essentially a single
large address space (memory mapped array) that was expected to never
fill all that full. I was tracking data from a year of (thousands? of)
students seeing Drill-and-practice questions from a huge database of
questions. The research criticism we got was that our analysis did not
rule out any kid seeing the same question more than once, and getting
"practice" that would improve performance w/o learning. I built a
bit-filter and copied tapes dropping any repeats seen by students.
We then just ran the same analysis we had on the raw data, and found
no significant difference.

The nice thing is that file size grew over time, so (for a while) I
could run on the machine with other users. By the last block of
tapes I was sitting alone in the machine room at 3:00 AM on Sat mornings
afraid to so much as fire up an editor.
 

Tim Peters

[Jens Theisen]
...
Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

In most cases, a guarantee that reading "uninitialized" file data will
return zeroes is a security promise, not an optimization. C doesn't
require this behavior, but POSIX does.

On FAT/FAT32, if you create a file, seek to a "large" offset, write a
byte, then read the uninitialized data from offset 0 up to the byte
just written, you get back whatever happened to be sitting on disk at
the locations now reserved for the file. That can include passwords,
other people's email, etc. -- anything whatsoever that may have been
written to disk at some time in the disk's history. Security weenies
get upset at stuff like that ;-)
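
That guarantee is easy to see on a POSIX system; a small, throwaway check
(file name is arbitrary):

import os

f = open('holetest.bin', 'wb')
f.seek(1024 * 1024)               # leave the first 1 MB unwritten
f.write(b'x')
f.close()

f = open('holetest.bin', 'rb')
head = f.read(4096)               # read from the never-written region
f.close()
assert head == b'\x00' * 4096     # POSIX requires zero bytes here
os.remove('holetest.bin')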
 

Ivan Voras

Jens said:
cp bigfile bigfile2

cat bigfile > bigfile3

du bigfile*
8 bigfile2
1032 bigfile3

So it's not consuming 0's. It just doesn't store unwritten data. And I

Very possibly cp "understands" sparse files and cat (doing what it's
meant to do) doesn't :)

That's an article about the heap efficient data structure. Was it your
intention to link this?

Yes. The idea is that in implementing such a structure, in which each
level is 2^x entries wide (x is the "level" of the structure and depends
on the number of entries the structure must hold), most of the blocks
could exist and yet never be written to (i.e. they'd be "empty"). Using
sparse files would save space :)

(It has nothing to do with Python; I remembered the article so I linked
to it. The sparse-file issue is useful only when implementing heaps
directly on a file or in an mmapped file.)
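
A rough sketch of that kind of layout (record format and file name are
invented here purely for illustration): each slot i lives at a fixed
offset, and slots that are never written remain holes:

import struct

RECORD = struct.Struct('<q')        # one 8-byte integer per heap slot

def write_slot(f, index, value):
    f.seek(index * RECORD.size)     # fixed offset per slot
    f.write(RECORD.pack(value))

def read_slot(f, index):
    f.seek(index * RECORD.size)
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None

f = open('heap.bin', 'w+b')
write_slot(f, 0, 42)                # the root
write_slot(f, 10**6, 7)             # a slot far down the mostly-empty structure
print(read_slot(f, 10**6))
f.close()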
 

Bengt Richter

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

Regards,
Bengt Richter
 

Dennis Lee Bieber

Very possibly cp "understands" sparse files and cat (doing what it's
meant to do) doesn't :)

I'd suspect that "cp" is working from the allocation data in the
directory entry, only copying those blocks that are flagged as
allocated.

"cat", OTOH, is supposed to show the logical contents of the file,
so will touch, in one way or another, every block whether allocated or
not (at the worst, I could see using "cat" would result in filling in
the previously unallocated blocks of the source file -- did you check
its size after the use of "cat"?)
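
Whether a plain read fills anything in is easy to check from Python,
assuming a platform whose os.stat() reports st_blocks (512-byte units):

import os

before = os.stat('bigfile').st_blocks
f = open('bigfile', 'rb')
while f.read(64 * 1024):            # read every byte, much as cat does
    pass
f.close()
after = os.stat('bigfile').st_blocks
print(after - before)               # 0 would mean the read allocated nothing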
 

Grant Edwards

I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

No, in my experience none of the Linux filesystems do that.
It's easy enough to test:

$ dd if=/dev/zero of=zeros bs=64k count=1024
1024+0 records in
1024+0 records out
$ ls -l zeros
-rw-r--r-- 1 grante users 67108864 Jan 28 14:49 zeros
$ du -h zeros
65M zeros


In my book that's 64MB not 65MB, but that's an argument for
another day.
 

Fredrik Lundh

Bengt said:
I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

I've seen this behaviour on "big iron" Unix systems, in a benchmark that
repeatedly copied data from a memory mapped section to an output file.

but for the general case, I doubt that adding "is this block all zeros" or
"does this block match something we recently wrote to disk" checks will
speed things up, on average...

</F>
 

Thomas Bellman

Grant Edwards said:
$ dd if=/dev/zero of=zeros bs=64k count=1024
1024+0 records in
1024+0 records out
$ ls -l zeros
-rw-r--r-- 1 grante users 67108864 Jan 28 14:49 zeros
$ du -h zeros
65M zeros
In my book that's 64MB not 65MB, but that's an argument for
another day.

You should be aware that the sizes that 'du' and 'ls -s' report
include any indirect blocks needed to keep track of the data
blocks of the file. Thus, you get the amount of space that the
file actually uses in the file system, and that would become free if
you removed it. That's why it is larger than 64 Mbyte. And 'du'
(at least GNU du) rounds upwards when you use -h.

Try for instance:

$ dd if=/dev/zero of=zeros bs=4k count=16367
16367+0 records in
16367+0 records out
$ ls -ls zeros
65536 -rw-rw-r-- 1 bellman bellman 67039232 Jan 29 13:57 zeros
$ du -h zeros
64M zeros

$ dd if=/dev/zero of=zeros bs=4k count=16368
16368+0 records in
16368+0 records out
$ ls -ls zeros
65540 -rw-rw-r-- 1 bellman bellman 67043328 Jan 29 13:58 zeros
$ du -h zeros
65M zeros

(You can infer from the above that my file system has a block
size of 4 Kbyte.)
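
The exact figures, without du -h's rounding, can also be read straight
from stat; a quick sketch (st_blocks again in 512-byte units):

import os

st = os.stat('zeros')
print(st.st_size)                      # logical length, as shown by ls -l
print(st.st_blocks * 512)              # space actually allocated, incl. indirect blocks
print(os.statvfs('zeros').f_frsize)    # the file system's fundamental block (fragment) size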
 
