writing large files quickly


Grant Edwards

OK I finally get it. It's too good to be true :)

Sorry about that. I should have paid closer attention to what
you were going to do with the file.
I'm going back to using _real_ files... not files that merely look as if
they are there but aren't. BTW, the file 'size' and
'size on disk' were identical on win 2003. That's a bit
deceptive.

What?! Windows lying to the user? I don't believe it!
According to the NTFS docs, they should be drastically
different... 'size on disk' should be like 64K or something.

Probably.
 

Steven D'Aprano

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.
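
For reference, the trick being discussed is nothing more than a seek past
the intended end followed by a single write; a rough sketch (file name and
size here are only illustrative):

f = open('sparse.bin', 'wb')
f.seek(400 * 1024 * 1024 - 1)   # jump ~400 MB past the start without writing anything
f.write(b'\x00')                # one real byte; on a sparse-capable fs the rest is a "hole"
f.close()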


Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?


Here is another possible solution, if you are running Linux, farm the real
work out to some C code optimised for writing blocks to the disk:

# untested and, it goes without saying, untimed
os.system("dd if=/dev/zero of=largefile.bin bs=64K count=16384")


That should make a 1GB file (16384 blocks of 64K each) as fast as possible.
If you have lots and lots of memory, you could try upping the block size (bs=...).
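
For comparison, the same write-it-in-big-blocks idea in plain Python
(a sketch in the same spirit -- likewise untested and untimed; sizes
mirror the dd line above):

BLOCK = b'\x00' * (64 * 1024)      # one 64 KB block of zero bytes
f = open('largefile.bin', 'wb')
for _ in range(16384):             # 16384 * 64 KB = 1 GB
    f.write(BLOCK)
f.close()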
 

Ivan Voras

Steven said:
Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.
 

Grant Edwards

Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

If your fs doesn't support sparse files, then you'll end up with a
file that really does have 400MB of 0x00 bytes in it. Which is
what the OP really needed in the first place.
Here is another possible solution, if you are running Linux, farm the real
work out to some C code optimised for writing blocks to the disk:

# untested and, it goes without saying, untimed
os.system("dd if=/dev/zero of=largefile.bin bs=64K count=16384")

That should make a 1GB file (16384 blocks of 64K each) as fast as possible.
If you have lots and lots of memory, you could try upping the block size (bs=...).

I agree. That probably is the optimal solution for Unix boxes.
I messed around with something like that once, and block sizes
bigger than 64k didn't make much difference.
 

Jens Theisen

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Under which operating system/file system?

As far as I know this should be file system dependent at least under
Linux, as the calls to open and seek are served by the file system driver.

Jens
 

Jens Theisen

Ivan said:
Steven D'Aprano wrote:
Isn't this a file system specific solution though? Won't your file system
need to have support for "sparse files", or else it won't work?

Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.

I don't think ext2fs does this either. At least the du and df commands
suggest otherwise.

Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

On compressing filesystems such as NTFS you get this behaviour as a
special case of compression, and compression makes more sense.

Jens
 

Jens Theisen

Donn said:
Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

Are you sure that's not just a case of asynchronous writing that can be
done in a particularly efficient way? df quite clearly tells me that I'm
running out of disk space on my ext2fs linux when I dump it full of
zeroes.

Jens
 

Ivan Voras

Jens said:
Ivan wrote:
Yes, but AFAIK the only "modern" (meaning: in wide use today) file
system that doesn't have this support is FAT/FAT32.

I don't think ext2fs does this either. At least the du and df commands
suggest otherwise.

ext2 is a reimplementation of BSD UFS, so it does. Here:

f = open('bigfile', 'w')
f.seek(1024*1024)
f.write('a')
f.close()

$ ls -l bigfile
-rw-r--r-- 1 ivoras wheel 1048577 Jan 28 14:57 bigfile
$ du bigfile
8 bigfile
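
The same gap is visible from within Python via os.stat(); a quick check
(st_blocks is counted in 512-byte units on POSIX systems):

import os
st = os.stat('bigfile')
print(st.st_size)           # logical length: 1048577
print(st.st_blocks * 512)   # bytes actually allocated on disk -- far less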
Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

I read somewhere that it has a use in database software, but the only
thing I can imagine for this is when using heap queues
(http://python.active-venture.com/lib/node162.html).
 

Jens Theisen

Ivan said:
ext2 is a reimplementation of BSD UFS, so it does. Here:
f = open('bigfile', 'w')
f.seek(1024*1024)
f.write('a')
f.close()
$ ls -l bigfile
-rw-r--r-- 1 ivoras wheel 1048577 Jan 28 14:57 bigfile
$ du bigfile
8 bigfile

Interesting:

cp bigfile bigfile2

cat bigfile > bigfile3

du bigfile*
8 bigfile2
1032 bigfile3

So it's not consuming 0's. It just doesn't store unwritten data. And I
can think of an application for that: an application might want to write
the beginning of a file at a later point, so this makes it more efficient.
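
One common shape of that pattern, sketched here with made-up names and
sizes: stream out the payload first, then come back and fill in a
fixed-size header once the totals are known:

import struct

HEADER_SIZE = 4096

f = open('archive.bin', 'w+b')
f.seek(HEADER_SIZE)                       # reserve room for the header without writing it
f.write(b'payload...' * 1000)             # stream the bulk of the data
payload_end = f.tell()
f.seek(0)
f.write(struct.pack('<8sq', b'MYHDR', payload_end))   # header written last
f.close()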

I wonder how other file systems behave.
I read somewhere that it has a use in database software, but the only
thing I can imagine for this is when using heap queues
(http://python.active-venture.com/lib/node162.html).

That's an article about the heap data structure. Was it your
intention to link to this?

Jens
 

Scott David Daniels

I've used this feature eons ago where the file was essentially a single
large address space (memory mapped array) that was expected to never
fill all that full. I was tracking data from a year of (thousands? of)
students seeing Drill-and-practice questions from a huge database of
questions. The research criticism we got was that our analysis did not
rule out any kid seeing the same question more than once, and getting
"practice" that would improve performance w/o learning. I built a
bit-filter and copied tapes dropping any repeats seen by students.
We then just ran the same analysis we had on the raw data, and found
no significant difference.

The nice thing is that file size grew over time, so (for a while) I
could run on the machine with other users. By the last block of
tapes I was sitting alone in the machine room at 3:00 AM on Sat mornings
afraid to so much as fire up an editor.
 

Tim Peters

[Jens Theisen]
...
Actually I'm not sure what this optimisation should give you anyway. The
only circumstance under which files with only zeroes are meaningful is
testing, and that's exactly when you don't want that optimisation.

In most cases, a guarantee that reading "uninitialized" file data will
return zeroes is a security promise, not an optimization. C doesn't
require this behavior, but POSIX does.

On FAT/FAT32, if you create a file, seek to a "large" offset, write a
byte, then read the uninitialized data from offset 0 up to the byte
just written, you get back whatever happened to be sitting on disk at
the locations now reserved for the file. That can include passwords,
other people's email, etc. -- anything whatsoever that may have been
written to disk at some time in the disk's history. Security weenies
get upset at stuff like that ;-)
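
That guarantee is easy to see on a POSIX system; a small, throwaway check
(file name is arbitrary):

import os

f = open('holetest.bin', 'wb')
f.seek(1024 * 1024)               # leave the first 1 MB unwritten
f.write(b'x')
f.close()

f = open('holetest.bin', 'rb')
head = f.read(4096)               # read from the never-written region
f.close()
assert head == b'\x00' * 4096     # POSIX requires zero bytes here
os.remove('holetest.bin')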
 

Ivan Voras

Jens said:
cp bigfile bigfile2

cat bigfile > bigfile3

du bigfile*
8 bigfile2
1032 bigfile3

So it's not consuming 0's. It just doesn't store unwritten data. And I

Very possibly cp "understands" sparse files and cat (doing what it's
meant to do) doesn't :)

That's an article about the heap efficient data structure. Was it your
intention to link this?

Yes. The idea is that in implementing such a structure, in which each
level is 2^x entries wide (x is the "level" of the structure and depends
on the number of entries the structure must hold), most of the blocks
could exist and yet never be written to (i.e. they'd be "empty"). Using
sparse files would save space :)

(It has nothing to do with Python; I remembered the article so I linked
to it. The sparse-file issue is useful only when implementing heaps
directly on a file or in an mmapped file.)
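
A rough sketch of that kind of layout (record format and file name are
invented here purely for illustration): each slot i lives at a fixed
offset, and slots that are never written remain holes:

import struct

RECORD = struct.Struct('<q')        # one 8-byte integer per heap slot

def write_slot(f, index, value):
    f.seek(index * RECORD.size)     # fixed offset per slot
    f.write(RECORD.pack(value))

def read_slot(f, index):
    f.seek(index * RECORD.size)
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None

f = open('heap.bin', 'w+b')
write_slot(f, 0, 42)                # the root
write_slot(f, 10**6, 7)             # a slot far down the mostly-empty structure
print(read_slot(f, 10**6))
f.close()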
 

Bengt Richter

Because it isn't really writing the zeros. You can make these
files all day long and not run out of disk space, because this
kind of file doesn't take very many blocks. The blocks that
were never written are virtual blocks, inasmuch as read() at
that location will cause the filesystem to return a block of NULs.

I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

Regards,
Bengt Richter
 

Dennis Lee Bieber

Very possibly cp "understands" sparse files and cat (doing what it's
meant to do) doesn't :)

I'd suspect that "cp" is working from the allocation data in the
directory entry, only copying those blocks that are flagged as
allocated.

"cat", OTOH, is supposed to show the logical contents of the file,
so will touch, in one way or another, every block whether allocated or
not (at the worst, I could see using "cat" would result in filling in
the previously unallocated blocks of the source file -- did you check
its size after the use of "cat"?)
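
Whether a plain read fills anything in is easy to check from Python,
assuming a platform whose os.stat() reports st_blocks (512-byte units):

import os

before = os.stat('bigfile').st_blocks
f = open('bigfile', 'rb')
while f.read(64 * 1024):            # read every byte, much as cat does
    pass
f.close()
after = os.stat('bigfile').st_blocks
print(after - before)               # 0 would mean the read allocated nothing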
 

Grant Edwards

I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

No, in my experience none of the Linux filesystems do that.
It's easy enough to test:

$ dd if=/dev/zero of=zeros bs=64k count=1024
1024+0 records in
1024+0 records out
$ ls -l zeros
-rw-r--r-- 1 grante users 67108864 Jan 28 14:49 zeros
$ du -h zeros
65M zeros


In my book that's 64MB not 65MB, but that's an argument for
another day.
 

Fredrik Lundh

Bengt said:
I wonder if it will also "write" virtual blocks when it gets real
zero blocks to write from a user, or even with file system copy utils?

I've seen this behaviour on "big iron" Unix systems, in a benchmark that
repeatedly copied data from a memory mapped section to an output file.

but for the general case, I doubt that adding "is this block all zeros" or
"does this block match something we recently wrote to disk" checks will
speed things up, on average...

</F>
 

Thomas Bellman

Grant Edwards said:
$ dd if=/dev/zero of=zeros bs=64k count=1024
1024+0 records in
1024+0 records out
$ ls -l zeros
-rw-r--r-- 1 grante users 67108864 Jan 28 14:49 zeros
$ du -h zeros
65M zeros
In my book that's 64MB not 65MB, but that's an argument for
another day.

You should be aware that the sizes that 'du' and 'ls -s' report
include any indirect blocks needed to keep track of the data
blocks of the file. Thus, you get the amount of space that the
file actually uses in the file system, and that would become free if
you removed it. That's why it is larger than 64 Mbyte. And 'du'
(at least GNU du) rounds upwards when you use -h.

Try for instance:

$ dd if=/dev/zero of=zeros bs=4k count=16367
16367+0 records in
16367+0 records out
$ ls -ls zeros
65536 -rw-rw-r-- 1 bellman bellman 67039232 Jan 29 13:57 zeros
$ du -h zeros
64M zeros

$ dd if=/dev/zero of=zeros bs=4k count=16368
16368+0 records in
16368+0 records out
$ ls -ls zeros
65540 -rw-rw-r-- 1 bellman bellman 67043328 Jan 29 13:58 zeros
$ du -h zeros
65M zeros

(You can infer from the above that my file system has a block
size of 4 Kbyte.)
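
The exact figures, without du -h's rounding, can also be read straight
from stat; a quick sketch (st_blocks again in 512-byte units):

import os

st = os.stat('zeros')
print(st.st_size)                      # logical length, as shown by ls -l
print(st.st_blocks * 512)              # space actually allocated, incl. indirect blocks
print(os.statvfs('zeros').f_frsize)    # the file system's fundamental block (fragment) size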
 
