creating/modifying sparse files on linux


draghuram

Hi,

Is there any special support for sparse file handling in Python? My
initial search didn't bring up much (not a thorough search). I wrote
the following piece of code:

options.size = 6442450944
options.ranges = ["4096,1024","30000,314572800"]
fd = open("testfile", "w")
fd.seek(options.size-1)
fd.write("a")
for drange in options.ranges:
    off = int(drange.split(",")[0])
    len = int(drange.split(",")[1])
    print "off =", off, " len =", len
    fd.seek(off)
    for x in range(len):
        fd.write("a")

fd.close()

This piece of code takes a very long time and in fact I had to kill it as
the Linux system started doing a lot of swapping. Am I doing something
wrong here? Is there a better way to create/modify sparse files?

Thanks,
Raghu.
 

Trent Mick

[[email protected] wrote]
> Is there any special support for sparse file handling in Python? My
> initial search didn't bring up much (not a thorough search). I wrote
> the following piece of code:
>
> [code snipped]
>
> This piece of code takes a very long time and in fact I had to kill it as
> the Linux system started doing a lot of swapping. Am I doing something
> wrong here? Is there a better way to create/modify sparse files?

test_largefile.py in the Python test suite does this kind of thing and
doesn't take very long for me to run on Linux (SuSE 9.0 box).
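
The basic trick is just seeking past the end of the file and writing a
single byte, roughly like this (a minimal sketch; the filename is made up,
the size is the one from your example):

# Create a ~6 GB file without ever writing its intermediate blocks.
# On a filesystem with sparse-file support the gap is not allocated.
size = 6442450944
f = open("sparsefile", "wb")
f.seek(size - 1)    # position just before the desired end of file
f.write("\0")       # writing one byte sets the file length to `size`
f.close()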

Trent
 

Marc 'BlackJack' Rintsch

In <[email protected]>,
> for drange in options.ranges:
>     off = int(drange.split(",")[0])
>     len = int(drange.split(",")[1])
>     print "off =", off, " len =", len
>     fd.seek(off)
>     for x in range(len):
>         fd.write("a")
>
> fd.close()
>
> This piece of code takes a very long time and in fact I had to kill it as
> the Linux system started doing a lot of swapping. Am I doing something
> wrong here? Is there a better way to create/modify sparse files?

`range(len)` creates a list of size `len` *in memory* so you are trying to
build a list with 314,572,800 numbers. That seems to eat up all your RAM
and causes the swapping.

You can use `xrange(len)` instead, which uses a constant amount of memory.
But be prepared to wait some time, because now you are writing 314,572,800
characters *one by one* into the file. It would be faster to write larger
strings in each step.
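
For example, something along these lines (the 1 MB chunk size and the
helper name are arbitrary, just to illustrate the idea):

CHUNK = 1024 * 1024          # write 1 MB per call; any largish size will do
filler = "a" * CHUNK

def write_range(f, offset, length):
    # Fill `length` bytes starting at `offset` using big writes
    # instead of one write() call per byte.
    f.seek(offset)
    whole, rest = divmod(length, CHUNK)
    for _ in xrange(whole):
        f.write(filler)
    if rest:
        f.write(filler[:rest])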

Ciao,
Marc 'BlackJack' Rintsch
 

Terry Reedy

> Is there any special support for sparse file handling in Python?

Since I have not heard of such in several years, I suspect not. CPython,
normally compiled, uses the standard C stdio lib. If your system+C has a
sparse I/O lib, you would probably have to compile specially to use it.

> options.size = 6442450944
> options.ranges = ["4096,1024","30000,314572800"]

options.ranges = [(4096,1024),(30000,314572800)] # makes below nicer

> fd = open("testfile", "w")
> fd.seek(options.size-1)
> fd.write("a")
> for drange in options.ranges:
>     off = int(drange.split(",")[0])
>     len = int(drange.split(",")[1])

off,len = map(int, drange.split(",")) # or
off,len = [int(s) for s in drange.split(",")] # or, for tuples as suggested above,
off,len = drange

>     print "off =", off, " len =", len
>     fd.seek(off)
>     for x in range(len):

If I read the above right, the 2nd len is 300,000,000+, making the space
needed for the range list a few gigabytes. I suspect this is where you
started thrashing ;-). Instead:

for x in xrange(len): # this is what xrange is for ;-)
    fd.write("a")

As originally posted, the loop bodies had lost their indentation, which is a
syntax error; so if your code ran at all, this cannot be an exact copy. Even
with the xrange fix, 300,000,000 writes will be slow. I would expect that a
real application should create or accumulate chunks larger than single chars.

> fd.close()
>
> This piece of code takes a very long time and in fact I had to kill it as
> the Linux system started doing a lot of swapping. Am I doing something
> wrong here?

See above.

> Is there a better way to create/modify sparse files?

Unless you can access built-in facilities, create your own mapping index.

Terry J. Reedy
 

draghuram

Thanks for the info on xrange. Writing a single char is just to get going
quickly. I knew that I would have to improve on that. I would like to
write chunks of 1 MB, which would require that I have a 1 MB string to
write. Is there any simple way of generating this 1 MB string (other
than repeatedly appending to a string until it reaches 1 MB)? I don't care
about the actual value of the string itself.

Thanks,
Raghu.
 

Terry Reedy

> Thanks for the info on xrange. Writing a single char is just to get going
> quickly. I knew that I would have to improve on that. I would like to
> write chunks of 1 MB, which would require that I have a 1 MB string to
> write. Is there any simple way of generating this 1 MB string

megastring = 1000000*'a' # t < 1 sec on my machine

> (other than repeatedly appending to a string until it reaches 1 MB)?

You mean like (unexecuted)

s = ''
for i in xrange(1000000): s += 'a' #?

This will allocate, copy, and deallocate 1000000 successively longer
temporary strings and is a noticeable O(n**2) operation. Since strings are
immutable, you cannot 'append' to them the way you can to lists.

Terry J. Reedy
 

François Pinard

[[email protected]]
> Is there any simple way of generating this 1 MB string (other than
> repeatedly appending to a string until it reaches 1 MB)?

You might of course use 'x' * 1000000 for fairly quickly generating a
single string holding one million 'x' characters.

Yet, your idea of generating a sparse file is interesting. I never
tried it with Python, but would not see why Python would not allow
it. Has anyone ever played with sparse files in Python? (One problem
with sparse files is that it is next to impossible for a normal user to
create an exact copy. There is no fast way to read them either.)
 

Bengt Richter

> Is there any special support for sparse file handling in Python? My
> initial search didn't bring up much (not a thorough search). I wrote
> the following piece of code:
>
> [code snipped]
>
> This piece of code takes a very long time and in fact I had to kill it as
> the Linux system started doing a lot of swapping. Am I doing something
> wrong here? Is there a better way to create/modify sparse files?

I'm unclear as to what your goal is. Do you just need an object that provides
an interface like a file object, but internally is more efficient than a
normal file object when you access it as above [1], or do you need to create
a real file and record all the bytes in full (with what default for gaps?)
on disk, so that it can be opened by another program and read as an ordinary file?

Some operating system file systems may have some support for virtual zero-block runs
and lazy allocation/representation of non-zero blocks in files. It's easy to imagine
the rudiments, but I don't know of such a file system, not having looked ;-)

You could write your own "sparse-file"-representation object, and maybe use pickle
for persistence. Or maybe you could use zipfiles. The kind of data you are creating above
would probably compress really well ;-)
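
For instance, a toy version might look roughly like this (the class name,
the dict-of-writes layout and the use of cPickle are illustrative choices
only, not an existing library):

import cPickle as pickle

class SparseData:
    # Toy "sparse file": remember the logical size plus only the
    # byte ranges that were actually written.
    def __init__(self, size):
        self.size = size
        self.writes = {}            # offset -> data written there

    def write(self, offset, data):
        # Overlapping writes are not merged; this is only a sketch.
        self.writes[offset] = data

    def save(self, filename):
        pickle.dump((self.size, self.writes), open(filename, "wb"), 2)

    def load(self, filename):
        self.size, self.writes = pickle.load(open(filename, "rb"))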

[1] writing 314+ million identical bytes one by one is silly, of course ;-)
BTW, len is a built-in function, and using built-in names for variables
is frowned upon as a bug-prone practice.

Regards,
Bengt Richter
 

Benji York

Terry said:
> megastring = 1000000*'a' # t < 1 sec on my machine
>
> You mean like (unexecuted)
>
> s = ''
> for i in xrange(1000000): s += 'a' #?
>
> This will allocate, copy, and deallocate 1000000 successively longer
> temporary strings and is a noticeable O(n**2) operation.

Not exactly. CPython 2.4 added an optimization of "+=" for strings.
The for loop above takes about 1 second to execute on my machine. You
are correct in that it will take *much* longer on 2.3.
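
If you want to see the difference on your own interpreter, timeit makes the
comparison easy (a quick sketch; the numbers will vary a lot between 2.3
and 2.4, as noted above):

import timeit

# Build a 1,000,000-character string two ways and compare.
mult = timeit.Timer("s = 'a' * 1000000")
loop = timeit.Timer("s = ''\nfor i in xrange(1000000): s += 'a'")

print "multiply:", min(mult.repeat(3, 1)), "seconds"
print "loop +=: ", min(loop.repeat(3, 1)), "seconds"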
 

draghuram

My goal is very simple: have a mechanism to create sparse files and
modify them by writing arbitrary ranges of bytes at arbitrary offsets.
I did get the information I want (xrange instead of range, and a simple
way to generate a 1 MB string in memory). Thanks for pointing out the use
of "len" as a variable name. It is indeed silly.

My only assumption about the underlying OS/file system is that if I seek
past the end of the file and write some data, it doesn't allocate blocks for
the data in between. This is indeed true on Linux (I tested on ext3).
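
One easy way to check is to compare the file's apparent size with the blocks
actually allocated (st_blocks is reported in 512-byte units on Linux), for
example against the "testfile" created earlier:

import os

st = os.stat("testfile")
print "apparent size:", st.st_size, "bytes"
print "allocated:    ", st.st_blocks * 512, "bytes"
# For a sparse file the allocated figure is far smaller than the
# apparent size; a fully written file would show roughly equal numbers.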

Thanks,
Raghu.
 

Mike Meyer

> My only assumption about the underlying OS/file system is that if I seek
> past the end of the file and write some data, it doesn't allocate blocks for
> the data in between. This is indeed true on Linux (I tested on ext3).

This better be true for anything claiming to be Unix. The results on
systems that break this aren't pretty.

<mike
 
