Creating huge data in very little time.

Discussion in 'Python' started by venutaurus539@gmail.com, Mar 31, 2009.

  1. venutaurus539@gmail.com Guest

    Hello all,
    I have a requirement where I need to create around 1000
    files under a given folder, each around 1GB in size. The
    constraints here are that each file should have random data and no two
    files should be unique even if I run the same script multiple times.
    Moreover, the filenames should also be unique every time I run the
    script. One possibility is that we can use the Unix time format for the
    file names with some extensions. Can this be done within a few minutes?
    Is it possible using only threads, or can it be done in some other
    way? This has to be done on Windows.

    Please mail back with any queries you may have,

    Thank you,
    Venu Madhav.
    venutaurus539@gmail.com, Mar 31, 2009
    #1

  2. CTO Guest

    1) How random is random enough? Some PRNGs are very fast, and some are
    very random, but there's always a compromise.
    2) How closely related can the files be? It would be easy to generate
    1GB of pseudorandom numbers, then just append UUIDs to them.
    3) Unique filenames can be generated with tmpnam.
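
    Something along these lines would combine those points -- a rough sketch
    only, with the file count, chunk size, and directory name made up for
    illustration (it uses uuid4 rather than tmpnam for the names):

    import os
    import uuid

    TARGET_DIR = 'testdata'            # illustrative values, not from the thread
    NUM_FILES = 1000
    FILE_SIZE = 1024 * 1024 * 1024     # ~1 GB per file
    CHUNK = os.urandom(1024 * 1024)    # 1 MB of random bytes, reused for speed

    if not os.path.isdir(TARGET_DIR):
        os.makedirs(TARGET_DIR)

    for n in range(NUM_FILES):
        # uuid4() gives a unique filename and a unique first block, so the
        # MD5 of every file differs, even across repeated runs of the script
        name = os.path.join(TARGET_DIR, '%s.dat' % uuid.uuid4())
        f = open(name, 'wb')
        f.write(uuid.uuid4().bytes)    # 16 unique bytes at the start of the file
        written = 16
        while written < FILE_SIZE:
            f.write(CHUNK)             # repeated pseudorandom filler
            written += len(CHUNK)
        f.close()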
    CTO, Mar 31, 2009
    #2

  3. John Machin Guest

    On Mar 31, 4:44 pm, venutaurus539@gmail.com wrote:
    > Hello all,
    > I have a requirement where I need to create around 1000
    > files under a given folder, each around 1GB in size. The
    > constraints here are that each file should have random data and no two
    > files should be unique even if I run the same script multiple times.
    > Moreover, the filenames should also be unique every time I run the
    > script. One possibility is that we can use the Unix time format for the
    > file names with some extensions. Can this be done within a few minutes?


    You should be able to write a simple script to create 1000 files with
    unique names and each containing 1GB of doesn't-matter-what data and
    find out for yourself how long that takes. If it takes much longer
    than a "few" (how many is a few?) minutes, then it's pointless
    worrying about other constraints like "no two files should be
    unique" (whatever that means) and "random data" (why do you want to
    create 1000GB of random data??) because imposing them certainly won't
    make it run faster.

    > Is it possible using only threads, or can it be done in some other
    > way? This has to be done on Windows.
    >
    > Please mail back for any queries you may have,
    >


    This looks VERY SIMILAR to a question you asked about 12 days ago ...
    John Machin, Mar 31, 2009
    #3
  4. Steven D'Aprano Guest

    On Mon, 30 Mar 2009 22:44:41 -0700, venutaurus539@gmail.com wrote:

    > Hello all,
    > I have a requirement where I need to create around 1000
    > files under a given folder, each around 1GB in size. The
    > constraints here are that each file should have random data and no two files
    > should be unique even if I run the same script multiple times.


    I don't understand what you mean. "No two files should be unique" means
    literally that only *one* file is unique, the others are copies of each
    other.

    Do you mean that no two files should be the same?


    > Moreover,
    > the filenames should also be unique every time I run the script. One
    > possibility is that we can use the Unix time format for the file names
    > with some extensions.


    That's easy. Start a counter at 0, and every time you create a new file,
    name the file by that counter, then increase the counter by one.


    > Can this be done within a few minutes? Is it
    > possible using only threads, or can it be done in some other way? This has to
    > be done on Windows.


    Is it possible? Sure. In a couple of minutes? I doubt it. 1000 files of
    1GB each means you are writing 1TB of data to a HDD. The fastest HDDs can
    reach about 125 MB per second under ideal circumstances, so that will
    take at least 8 seconds per 1GB file, or 8000 seconds (over two hours) in
    total. If you try to write them all in parallel, you'll probably just make
    the HDD waste time seeking backwards and forwards from one place to another.
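
    A rough way to check that figure on your own hardware is to time a few
    files and extrapolate -- a sketch only, with arbitrary file count and
    block size, and counter-based names as suggested above:

    import time

    BLOCK = b'\0' * (8 * 1024 * 1024)          # 8 MB of filler per write
    GB = 1024 * 1024 * 1024

    start = time.time()
    for counter in range(10):                  # time 10 files, then extrapolate
        f = open('%06d.dat' % counter, 'wb')   # counter-based unique name
        for _ in range(GB // len(BLOCK)):
            f.write(BLOCK)
        f.close()
    elapsed = time.time() - start
    print('%.1f s for 10 GB, so roughly %.0f s for 1000 GB' % (elapsed, 100 * elapsed))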



    --
    Steven
    Steven D'Aprano, Mar 31, 2009
    #4
  5. Tim Chase Guest

    andrea wrote:
    > On 31 Mar, 12:14, venutaurus539@gmail.com wrote:
    >> That time is reasonable. The randomness should be in such a way that
    >> the MD5 checksums of no two files should be the same. The main reason for
    >> having such huge data is for doing stress testing of our product.

    >
    >
    > If randomness is not necessary (as I understood), you can just create
    > one single file and then modify one bit of it, iteratively, 1000
    > times.
    > That's enough to make the checksum change.
    >
    > Is there a way to create a file that big without actually writing
    > anything in Python (just give me the garbage that is already on the
    > disk)?


    Not exactly AFAIK, but this line of thinking does remind me of
    sparse files[1] if your filesystem supports them:

    # e.g. one sparse file per i in 0..999
    for i in range(1000):
        f = file('%i.txt' % i, 'wb')
        data = str(i) + '\n'
        # seek to just under 1GB and write a few unique bytes; on a
        # sparse-file-aware FS the skipped range is never physically written
        f.seek(1024*1024*1024 - len(data))
        f.write(data)
        f.close()

    On FS's that support sparse files, it's blindingly fast and
    creates a virtual file of that size without the overhead of
    writing all the bits to the file. However, this same
    optimization may also throw off any benchmarking you do, as it
    doesn't have to read a gig off the physical media. This may be a
    good metric for hash calculation across such files, but not a
    good metric for I/O.

    -tkc

    [1]
    http://en.wikipedia.org/wiki/Sparse_file
    Tim Chase, Mar 31, 2009
    #5
  6. Dave Angel Guest

    I wrote a tiny DOS program called resize that simply did a seek out to a
    (user specified) point, and wrote zero bytes. One (documented) side
    effect of DOS was that writing zero bytes would truncate the file at
    that point. But it also worked to extend the file to that point without
    writing any actual data. The net effect was that it adjusted the FAT
    table, and none of the data. It was used frequently for file recovery,
    unformatting, etc. And it was very fast.

    Unfortunately, although the program still ran under NT (which includes
    Win 2000, XP, ...), the security system insists on zeroing all the
    intervening sectors, which takes much time, obviously.

    Still, if the data is not important (make the first sector unique, and
    the rest zeroes), this would probably be the fastest way to get all
    those files created. Just write the file name in the first sector
    (since we'll separately make sure the filename is unique), and then
    seek out to a billion, and write one more byte. I won't assume that
    writing zero bytes would work for Unix.
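
    In Python that idea looks roughly like this (a sketch only; the names and
    sizes are illustrative, and as noted above NT will still spend time
    zeroing the skipped range):

    for n in range(1000):
        name = 'big_%04d.bin' % n
        f = open(name, 'wb')
        f.write(name.encode())             # unique data in the first "sector"
        f.seek(1024 * 1024 * 1024 - 1)     # jump to just short of 1 GB
        f.write(b'\0')                     # a single final byte fixes the size
        f.close()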

    andrea wrote:
    > On 31 Mar, 12:14, venutaurus539@gmail.com wrote:
    >
    >> That time is reasonable. The randomness should be in such a way that
    >> the MD5 checksums of no two files should be the same. The main reason for
    >> having such huge data is for doing stress testing of our product.
    >>

    >
    >
    > If randomness is not necessary (as I understood), you can just create
    > one single file and then modify one bit of it, iteratively, 1000
    > times.
    > That's enough to make the checksum change.
    >
    > Is there a way to create a file that big without actually writing
    > anything in Python (just give me the garbage that is already on the
    > disk)?
    >
    >
    Dave Angel, Mar 31, 2009
    #6
  7. Terry Reedy Guest

    venutaurus539@gmail.com wrote:

    > That time is reasonable. The randomness should be in such a way that
    > the MD5 checksums of no two files should be the same. The main reason for
    > having such huge data is for doing stress testing of our product.


    For most purposes (other than stress testing the HD and HD read
    routines), I suspect you would be better off directly piping the data
    into your product (or a special version of it).
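
    For example, if the product can read its test data from stdin, something
    like this streams ~1 GB without touching the disk (a sketch only; the
    command line below is a placeholder, the real invocation depends on the
    product):

    import os
    import subprocess

    proc = subprocess.Popen(['yourproduct.exe', '--stdin'],   # hypothetical command
                            stdin=subprocess.PIPE)
    chunk = os.urandom(1024 * 1024)        # 1 MB of random bytes
    for _ in range(1024):                  # ~1 GB in total
        proc.stdin.write(chunk)
    proc.stdin.close()
    proc.wait()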
    Terry Reedy, Mar 31, 2009
    #7
  8. Tim Chase Guest

    >>> Is there a way to create a file that big without actually writing
    >>> anything in Python (just give me the garbage that is already on the
    >>> disk)?

    >
    > No. That would be a monstrous security hole.


    Sure... just install 26 hard drives, partition each one into 40
    1-GB unformatted partitions, and then read directly from
    /dev/hd[a-z][0-39]

    <gdr>

    -tkc
    (ponders to self, "does logical partitioning allow for that many
    partitions on a disk?")
    Tim Chase, Mar 31, 2009
    #8
  9. Irmen de Jong Guest

    venutaurus539@gmail.com wrote:
    > [... quote of the earlier exchange with Steven D'Aprano snipped ...]
    >
    > That time is reasonable. The randomness should be in such a way that
    > the MD5 checksums of no two files should be the same. The main reason for
    > having such huge data is for doing stress testing of our product.



    Does it really need to be *files* on the *hard disk*?

    What nobody has suggested yet is that you can *simulate* the files by making a large set
    of custom file-like objects and feeding those to your application. (If possible!)
    Each object could return a 1 GB byte stream consisting of a GUID followed by random bytes
    (or just millions of A's, because you write that the only requirement is to have a
    different MD5 checksum).
    That way you have no need of a 1 terabyte hard drive and the huge wait time to create
    the actual files...
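
    A minimal sketch of such an object (the class name is made up, and only
    read() is implemented, which may or may not be enough for the application
    in question):

    import uuid

    class FakeGigabyteFile(object):
        """File-like object: a fresh GUID followed by 'A' bytes, about
        1 GB in total, never stored on disk."""
        def __init__(self, size=1024 * 1024 * 1024):
            self.remaining = size
            self.prefix = uuid.uuid4().bytes   # makes each instance's MD5 differ

        def read(self, n=-1):
            if self.remaining <= 0:
                return b''
            if n < 0 or n > self.remaining:
                n = self.remaining
            chunk = self.prefix[:n]
            chunk += b'A' * (n - len(chunk))   # pad with constant filler
            self.prefix = self.prefix[n:]      # the GUID is emitted only once
            self.remaining -= n
            return chunk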

    --irmen
    Irmen de Jong, Mar 31, 2009
    #9
  10. Dave Angel Guest

    Re: Re: Creating huge data in very little time.

    The FAT file system does not support sparse files. They were added in
    NTFS, in the Windows 2000 timeframe, to my recollection.

    Don't try to install NTFS on a floppy.

    Grant Edwards wrote:
    > On 2009-03-31, Dave Angel wrote:
    >
    >
    >> I wrote a tiny DOS program called resize that simply did a
    >> seek out to a (user specified) point, and wrote zero bytes.
    >> One (documented) side effect of DOS was that writing zero
    >> bytes would truncate the file at that point. But it also
    >> worked to extend the file to that point without writing any
    >> actual data. The net effect was that it adjusted the FAT
    >> table, and none of the data. It was used frequently for file
    >> recovery, unformatting, etc. And it was very fast.
    >>
    >> Unfortunately, although the program still ran under NT (which includes
    >> Win 2000, XP, ...), the security system insists on zeroing all the
    >> intervening sectors, which takes much time, obviously.
    >>

    >
    > Why would it even _allocate_ intervening sectors? That's pretty
    > brain-dead.
    >
    >
    >>> Is there a way to create a file that big without actually writing
    >>> anything in Python (just give me the garbage that is already on the
    >>> disk)?
    >>>

    >
    > No. That would be a monstrous security hole.
    >
    >
    Dave Angel, Mar 31, 2009
    #10
