Creating huge data in very little time.

Discussion in 'Python' started by venutaurus539@gmail.com, Mar 31, 2009.

  1. venutaurus539@gmail.com Guest

    Hello all,
    I have a requirement where I need to create around 1000
    files under a given folder, each around 1GB in size. The
    constraints here are that each file should have random data and no two
    files should be unique even if I run the same script multiple times.
    Moreover, the filenames should also be unique every time I run the
    script. One possibility is that we can use the Unix time format for the
    file names with some extensions. Can this be done within a few minutes?
    Is it possible using only threads, or can it be done in some other
    way? This has to be done on Windows.

    Please mail back with any queries you may have,

    Thank you,
    Venu Madhav.
    venutaurus539@gmail.com, Mar 31, 2009
    #1

  2. CTO Guest

    1) How random is random enough? Some PRNGs are very fast, and some are
    very random, but there's always a compromise.
    2) How closely related can the files be? It would be easy to generate
    1GB of pseudorandom numbers, then just append UUIDs to them.
    3) Unique filenames can be generated with tmpnam.
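
    Something along these lines would combine those points -- a rough sketch
    only, with the file count, chunk size, and directory name made up for
    illustration (it uses uuid4 rather than tmpnam for the names):

    import os
    import uuid

    TARGET_DIR = 'testdata'            # illustrative values, not from the thread
    NUM_FILES = 1000
    FILE_SIZE = 1024 * 1024 * 1024     # ~1 GB per file
    CHUNK = os.urandom(1024 * 1024)    # 1 MB of random bytes, reused for speed

    if not os.path.isdir(TARGET_DIR):
        os.makedirs(TARGET_DIR)

    for n in range(NUM_FILES):
        # uuid4() gives a unique filename and a unique first block, so the
        # MD5 of every file differs, even across repeated runs of the script
        name = os.path.join(TARGET_DIR, '%s.dat' % uuid.uuid4())
        f = open(name, 'wb')
        f.write(uuid.uuid4().bytes)    # 16 unique bytes at the start of the file
        written = 16
        while written < FILE_SIZE:
            f.write(CHUNK)             # repeated pseudorandom filler
            written += len(CHUNK)
        f.close()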
    CTO, Mar 31, 2009
    #2

  3. John Machin Guest

    On Mar 31, 4:44 pm, venutaurus539@gmail.com wrote:
    > Hello all,
    > I have a requirement where I need to create around 1000
    > files under a given folder, each around 1GB in size. The
    > constraints here are that each file should have random data and no two
    > files should be unique even if I run the same script multiple times.
    > Moreover, the filenames should also be unique every time I run the
    > script. One possibility is that we can use the Unix time format for the
    > file names with some extensions. Can this be done within a few minutes?


    You should be able to write a simple script to create 1000 files with
    unique names and each containing 1GB of doesn't-matter-what data and
    find out for yourself how long that takes. If it takes much longer
    than a "few" (how many is a few?) minutes, then it's pointless
    worrying about other constraints like "no two files should be
    unique" (whatever that means) and "random data" (why do you want to
    create 1000GB of random data??) because imposing them certainly won't
    make it run faster.

    > Is it possible using only threads, or can it be done in some other
    > way? This has to be done on Windows.
    >
    > Please mail back for any queries you may have,
    >


    This looks VERY SIMILAR to a question you asked about 12 days ago ...
    John Machin, Mar 31, 2009
    #3
  4. Steven D'Aprano Guest

    On Mon, 30 Mar 2009 22:44:41 -0700, venutaurus539@gmail.com wrote:

    > Hello all,
    > I have a requirement where I need to create around 1000
    > files under a given folder, each around 1GB in size. The
    > constraints here are that each file should have random data and no two files
    > should be unique even if I run the same script multiple times.


    I don't understand what you mean. "No two files should be unique" means
    literally that only *one* file is unique, the others are copies of each
    other.

    Do you mean that no two files should be the same?


    > Moreover,
    > the filenames should also be unique every time I run the script. One
    > possibility is that we can use the Unix time format for the file names
    > with some extensions.


    That's easy. Start a counter at 0, and every time you create a new file,
    name the file by that counter, then increase the counter by one.


    > Can this be done within a few minutes? Is it
    > possible using only threads, or can it be done in some other way? This has to
    > be done on Windows.


    Is it possible? Sure. In a couple of minutes? I doubt it. 1000 files of
    1GB each means you are writing 1TB of data to a HDD. The fastest HDDs can
    reach about 125 MB per second under ideal circumstances, so that will
    take at least 8 seconds per 1GB file, or 8000 seconds (over two hours) in
    total. If you try to write them all in parallel, you'll probably just make
    the HDD waste time seeking backwards and forwards from one place to another.
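
    A rough way to check that figure on your own hardware is to time a few
    files and extrapolate -- a sketch only, with arbitrary file count and
    block size, and counter-based names as suggested above:

    import time

    BLOCK = b'\0' * (8 * 1024 * 1024)          # 8 MB of filler per write
    GB = 1024 * 1024 * 1024

    start = time.time()
    for counter in range(10):                  # time 10 files, then extrapolate
        f = open('%06d.dat' % counter, 'wb')   # counter-based unique name
        for _ in range(GB // len(BLOCK)):
            f.write(BLOCK)
        f.close()
    elapsed = time.time() - start
    print('%.1f s for 10 GB, so roughly %.0f s for 1000 GB' % (elapsed, 100 * elapsed))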



    --
    Steven
    Steven D'Aprano, Mar 31, 2009
    #4
  5. Tim Chase Guest

    andrea wrote:
    > On 31 Mar, 12:14, venutaurus539@gmail.com wrote:
    >> That time is reasonable. The randomness should be in such a way that
    >> the MD5 checksums of no two files should be the same. The main reason for
    >> having such huge data is for doing stress testing of our product.

    >
    >
    > If randomness is not necessary (as I understood), you can just create
    > one single file and then modify one bit of it, iteratively, 1000
    > times.
    > That's enough to make the checksum change.
    >
    > Is there a way to create a file that big without actually writing
    > anything in Python (just give me the garbage that is already on the
    > disk)?


    Not exactly AFAIK, but this line of thinking does remind me of
    sparse files[1] if your filesystem supports them:

    # e.g. one sparse file per i in 0..999
    for i in range(1000):
        f = file('%i.txt' % i, 'wb')
        data = str(i) + '\n'
        # seek to just under 1GB and write a few unique bytes; on a
        # sparse-file-aware FS the skipped range is never physically written
        f.seek(1024*1024*1024 - len(data))
        f.write(data)
        f.close()

    On FS's that support sparse files, it's blindingly fast and
    creates a virtual file of that size without the overhead of
    writing all the bits to the file. However, this same
    optimization may also throw off any benchmarking you do, as it
    doesn't have to read a gig off the physical media. This may be a
    good metric for hash calculation across such files, but not a
    good metric for I/O.

    -tkc

    [1]
    http://en.wikipedia.org/wiki/Sparse_file
    Tim Chase, Mar 31, 2009
    #5
  6. Dave Angel Guest

    I wrote a tiny DOS program called resize that simply did a seek out to a
    (user specified) point, and wrote zero bytes. One (documented) side
    effect of DOS was that writing zero bytes would truncate the file at
    that point. But it also worked to extend the file to that point without
    writing any actual data. The net effect was that it adjusted the FAT
    table, and none of the data. It was used frequently for file recovery,
    unformatting, etc. And it was very fast.

    Unfortunately, although the program still ran under NT (which includes
    Win 2000, XP, ...), the security system insists on zeroing all the
    intervening sectors, which takes much time, obviously.

    Still, if the data is not important (make the first sector unique, and
    the rest zeroes), this would probably be the fastest way to get all
    those files created. Just write the file name in the first sector
    (since we'll separately make sure the filename is unique), and then
    seek out to a billion, and write one more byte. I won't assume that
    writing zero bytes would work for Unix.
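
    In Python that idea looks roughly like this (a sketch only; the names and
    sizes are illustrative, and as noted above NT will still spend time
    zeroing the skipped range):

    for n in range(1000):
        name = 'big_%04d.bin' % n
        f = open(name, 'wb')
        f.write(name.encode())             # unique data in the first "sector"
        f.seek(1024 * 1024 * 1024 - 1)     # jump to just short of 1 GB
        f.write(b'\0')                     # a single final byte fixes the size
        f.close()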

    andrea wrote:
    > On 31 Mar, 12:14, venutaurus539@gmail.com wrote:
    >
    >> That time is reasonable. The randomness should be in such a way that
    >> the MD5 checksums of no two files should be the same. The main reason for
    >> having such huge data is for doing stress testing of our product.
    >>

    >
    >
    > If randomness is not necessary (as I understood), you can just create
    > one single file and then modify one bit of it, iteratively, 1000
    > times.
    > That's enough to make the checksum change.
    >
    > Is there a way to create a file that big without actually writing
    > anything in Python (just give me the garbage that is already on the
    > disk)?
    >
    >
    Dave Angel, Mar 31, 2009
    #6
  7. Terry Reedy Guest

    venutaurus539@gmail.com wrote:

    > That time is reasonable. The randomness should be in such a way that
    > the MD5 checksums of no two files should be the same. The main reason for
    > having such huge data is for doing stress testing of our product.


    For most purposes (other than stress testing the HD and HD read
    routines), I suspect you would be better off directly piping the data
    into your product (or a special version of it).
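
    For example, if the product can read its test data from stdin, something
    like this streams ~1 GB without touching the disk (a sketch only; the
    command line below is a placeholder, the real invocation depends on the
    product):

    import os
    import subprocess

    proc = subprocess.Popen(['yourproduct.exe', '--stdin'],   # hypothetical command
                            stdin=subprocess.PIPE)
    chunk = os.urandom(1024 * 1024)        # 1 MB of random bytes
    for _ in range(1024):                  # ~1 GB in total
        proc.stdin.write(chunk)
    proc.stdin.close()
    proc.wait()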
    Terry Reedy, Mar 31, 2009
    #7
  8. Tim Chase Guest

    >>> Is there a way to create a file that big without actually writing
    >>> anything in Python (just give me the garbage that is already on the
    >>> disk)?

    >
    > No. That would be a monstrous security hole.


    Sure... just install 26 hard drives, partition each one into 40
    1-GB unformatted partitions, and then read directly from
    /dev/hd[a-z][0-39]

    <gdr>

    -tkc
    (ponders to self, "does logical partitioning allow for that many
    partitions on a disk?")
    Tim Chase, Mar 31, 2009
    #8
  9. Irmen de Jong Guest

    venutaurus539@gmail.com wrote:
    > [... quote of the earlier exchange with Steven D'Aprano snipped ...]
    >
    > That time is reasonable. The randomness should be in such a way that
    > the MD5 checksums of no two files should be the same. The main reason for
    > having such huge data is for doing stress testing of our product.



    Does it really need to be *files* on the *hard disk*?

    What nobody has suggested yet is that you can *simulate* the files by making a large set
    of custom file-like objects and feeding those to your application. (If possible!)
    Each object could return a 1 GB byte stream consisting of a GUID followed by random bytes
    (or just millions of A's, because you write that the only requirement is to have a
    different MD5 checksum).
    That way you have no need of a 1 terabyte hard drive and the huge wait time to create
    the actual files...
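
    A minimal sketch of such an object (the class name is made up, and only
    read() is implemented, which may or may not be enough for the application
    in question):

    import uuid

    class FakeGigabyteFile(object):
        """File-like object: a fresh GUID followed by 'A' bytes, about
        1 GB in total, never stored on disk."""
        def __init__(self, size=1024 * 1024 * 1024):
            self.remaining = size
            self.prefix = uuid.uuid4().bytes   # makes each instance's MD5 differ

        def read(self, n=-1):
            if self.remaining <= 0:
                return b''
            if n < 0 or n > self.remaining:
                n = self.remaining
            chunk = self.prefix[:n]
            chunk += b'A' * (n - len(chunk))   # pad with constant filler
            self.prefix = self.prefix[n:]      # the GUID is emitted only once
            self.remaining -= n
            return chunk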

    --irmen
    Irmen de Jong, Mar 31, 2009
    #9
  10. Dave Angel Guest

    Re: Re: Creating huge data in very little time.

    The FAT file system does not support sparse files. They were added in
    NTFS, in the Windows 2000 timeframe, to my recollection.

    Don't try to install NTFS on a floppy.

    Grant Edwards wrote:
    > On 2009-03-31, Dave Angel wrote:
    >
    >
    >> I wrote a tiny DOS program called resize that simply did a
    >> seek out to a (user specified) point, and wrote zero bytes.
    >> One (documented) side effect of DOS was that writing zero
    >> bytes would truncate the file at that point. But it also
    >> worked to extend the file to that point without writing any
    >> actual data. The net effect was that it adjusted the FAT
    >> table, and none of the data. It was used frequently for file
    >> recovery, unformatting, etc. And it was very fast.
    >>
    >> Unfortunately, although the program still ran under NT (which includes
    >> Win 2000, XP, ...), the security system insists on zeroing all the
    >> intervening sectors, which takes much time, obviously.
    >>

    >
    > Why would it even _allocate_ intervening sectors? That's pretty
    > brain-dead.
    >
    >
    >>> Is there a way to create a file that big without actually writing
    >>> anything in Python (just give me the garbage that is already on the
    >>> disk)?
    >>>

    >
    > No. That would be a monstrous security hole.
    >
    >
    Dave Angel, Mar 31, 2009
    #10
