Creating 50K text files in python

V

venutaurus539

Hello all,
I've an application where I need to create 50K files spread
uniformly across 50 folders in python. The content can be the name of
file itself repeated 10 times.I wrote a code using normal for loops
but it is taking hours together for that. Can some one please share
the code for it using Multithreading. As am new to Python, I know
little about threading concepts.

This is my requiremnt:

C:\TestFolder....
That folder contains 5 Folders.. Folder1, Folder2, Folder3.....
Folder5
Each folder in turn contains 10 folders:
and Each of those folder contains 1000 text files.

Please let me know if you are not clear.

Thank you,
Venu Madhav.
 
G

Gabriel Genellina

En Wed, 18 Mar 2009 11:40:26 -0200, (e-mail address removed)
On Mar 18, 6:16 pm, "(e-mail address removed)"

I managed to get this code but the problem is main thread exits before
the completetion of child threads:

class MyThread ( threading.Thread ):

def run ( self ):

global theVar
print 'This is thread ' + str ( theVar ) + ' speaking.'
print 'Hello and good bye.'
theVar = theVar + 1
time.sleep (4)
print 'This is thread ' + str ( theVar ) + ' speaking. again'

for x in xrange ( 20 ):
MyThread().start()

You have to .join() each thread. Replace the above two lines with:

threads = [MyThread() for x in range(20)]
for thread in threads:
thread.start()
for thread in threads:
thread.join()

But this is unrelated to your main problem. You don't have to use threads
for this.
 
G

Gabriel Genellina

En Wed, 18 Mar 2009 11:50:32 -0200, (e-mail address removed)
Really...!!!! I just can't beleive it. I am running my scirpt on an
INTEL Core2duo CPU machine with 2GB of RAM. I've started it two hours
back, but still it is running. This is how my code looks like

You're creating 500000 processes, 500000 shells, just to call 'echo' (!)
def createFiles(path):
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Creating files in the
folder "+path+"\n")
global c
global d
os.chdir (path)
k = 1
for k in range (1,1001):
p = "%.2d"%(k)
FName = "TextFile"+c+"_"+d+"_"+p+".txt"
l =1
for l in range(1 , 11):
os.system ("\"echo "+FName+" >> "+FName+"\"")
l = l +1
k = k+1

with open(FName, "w") as f:
for i in range(10):
f.write(FName)
 
P

Peter Otten

Hello all,
I've an application where I need to create 50K files spread
uniformly across 50 folders in python. The content can be the name of
file itself repeated 10 times.I wrote a code using normal for loops
but it is taking hours together for that. Can some one please share
the code for it using Multithreading. As am new to Python, I know
little about threading concepts.

This is my requiremnt:

C:\TestFolder....
That folder contains 5 Folders.. Folder1, Folder2, Folder3.....
Folder5
Each folder in turn contains 10 folders:
and Each of those folder contains 1000 text files.

Please let me know if you are not clear.

Thank you,
Venu Madhav.

I've just tried it, creating the 50,000 text files took 17 seconds on my not
very fast machine. Python is certainly not the bottleneck here.

Peter
 
V

venutaurus539

Hello all,
          I've an application where I need to create 50K files spread
uniformly across 50 folders in python. The content can be the name of
file itself repeated 10 times.I wrote a code using normal for loops
but it is taking hours together for that. Can some one please share
the code for it using Multithreading. As am new to Python, I know
little about threading concepts.

This is my requiremnt:

C:\TestFolder....
That folder contains 5 Folders.. Folder1, Folder2, Folder3.....
Folder5
Each folder in turn contains 10 folders:
and Each of those folder contains 1000 text files.

Please let me know if you are not clear.

Thank you,
Venu Madhav.

-------------
I managed to get this code but the problem is main thread exits before
the completetion of child threads:

class MyThread ( threading.Thread ):

def run ( self ):

global theVar
print 'This is thread ' + str ( theVar ) + ' speaking.'
print 'Hello and good bye.'
theVar = theVar + 1
time.sleep (4)
print 'This is thread ' + str ( theVar ) + ' speaking. again'

for x in xrange ( 20 ):
MyThread().start()
-------------------------------------------------------------------------------------------------------
I can't use a sleep in the main thread because I don't know when the
child threads end.

Thank you,
venu madhav.
 
T

Tim Chase

I've an application where I need to create 50K files spread
uniformly across 50 folders in python. The content can be the name of
file itself repeated 10 times.I wrote a code using normal for loops
but it is taking hours together for that. Can some one please share
the code for it using Multithreading. As am new to Python, I know
little about threading concepts.

I'm not sure multi-threading will assist much with what appears
to be a filesystem & I/O bottleneck. Far more important will be
factors like

- what OS are you running?

- what filesystem are you writing to and does it journal? (NTFS?
FAT? ext[234]? JFS? XFS? tempfs/tmpfs? etc...) Some
filesystems deal much better with lots of small files while
others perform better with fewer-but-larger files.

- how is that filesystem mounted? Are syncs forced for all
writes? How large are your write caches?

- what sort of hardware is it hitting? A 4500 RPM drive will
have far worse performance than a 10k RPM drive; or you could try
against a flash drive or RAID-0 which should have higher
sustained write throughput.

- how sparse is the drive layout beforehand? If you're cramped
for space, the FS drivers have to scatter your files across
disparate space; while lots of free-space may allow it to spew
the files without great consideration.

Hope this gives you some ways to test your configuration...

-tkc
 
V

venutaurus539

I've just tried it, creating the 50,000 text files took 17 seconds on my not
very fast machine. Python is certainly not the bottleneck here.

Peter

Really...!!!! I just can't beleive it. I am running my scirpt on an
INTEL Core2duo CPU machine with 2GB of RAM. I've started it two hours
back, but still it is running. This is how my code looks like


def createFiles(path):
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Creating files in the
folder "+path+"\n")
global c
global d
os.chdir (path)
k = 1
for k in range (1,1001):
p = "%.2d"%(k)
FName = "TextFile"+c+"_"+d+"_"+p+".txt"
l =1
for l in range(1 , 11):
os.system ("\"echo "+FName+" >> "+FName+"\"")
l = l +1
k = k+1



MainPath = "C:\\Many_50000_1KB"
try:
os.mkdir (MainPath)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the base directory
\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" base directory already
exists\n")
os.chdir (MainPath)
for i in range (1 , 6):
j = 1
c = "%.2d"%(i)
FolderName ="Folder"+c
try:
os.mkdir (FolderName)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the folder
"+FolderName+"\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Folder "+FolderName+"
already exists \n")
os.chdir (FolderName)
path = os.getcwd ()
#createFiles(path)
for j in range ( 1 , 11):
d = "%.2d"%(j)
FolderName = "Folder"+c+"_"+d
try:
os.mkdir (FolderName)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the
folder "+FolderName+"\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" the folder
"+FolderName+" exists \n")
os.chdir (FolderName)
path = os.getcwd ()
createFiles(path)
os.chdir ("..")
j = j + 1
os.chdir ("..")
i = i + 1

Can you please let me know where do I have to modify it, to make it
faster.

Thank you,
Venu Madhav
 
V

venutaurus539

Man, you have a trouble with loops, all over.

But the situation demands it.:-(
I've to create 5 folders and 10 folders in each of them. Each folder
again should contain 1000 text files.
 
P

Peter Otten

Really...!!!! I just can't beleive it. I am running my scirpt on an
INTEL Core2duo CPU machine with 2GB of RAM. I've started it two hours
back, but still it is running. This is how my code looks like


def createFiles(path):
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Creating files in the
folder "+path+"\n")
global c
global d
os.chdir (path)
k = 1
for k in range (1,1001):
p = "%.2d"%(k)
FName = "TextFile"+c+"_"+d+"_"+p+".txt"
l =1
for l in range(1 , 11):
os.system ("\"echo "+FName+" >> "+FName+"\"")
l = l +1
k = k+1



MainPath = "C:\\Many_50000_1KB"
try:
os.mkdir (MainPath)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the base directory
\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" base directory already
exists\n")
os.chdir (MainPath)
for i in range (1 , 6):
j = 1
c = "%.2d"%(i)
FolderName ="Folder"+c
try:
os.mkdir (FolderName)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the folder
"+FolderName+"\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Folder "+FolderName+"
already exists \n")
os.chdir (FolderName)
path = os.getcwd ()
#createFiles(path)
for j in range ( 1 , 11):
d = "%.2d"%(j)
FolderName = "Folder"+c+"_"+d
try:
os.mkdir (FolderName)
m.write(strftime("%Y-%m-%d %H:%M:%S") +" Created the
folder "+FolderName+"\n")
except:
m.write(strftime("%Y-%m-%d %H:%M:%S") +" the folder
"+FolderName+" exists \n")
os.chdir (FolderName)
path = os.getcwd ()
createFiles(path)
os.chdir ("..")
j = j + 1
os.chdir ("..")
i = i + 1

Can you please let me know where do I have to modify it, to make it
faster.

Multiple (!) os.system() calls to create a file are certainly a bad idea.
Personally I don't use os.chdir() because I try to avoid global state when
possible.

Here's my somewhat carelessly written code. Change the

root = ...

line as appropriate before you run it (the root folder must not exist).


from __future__ import with_statement

import os

def makedir(*parts):
dir = os.path.join(*parts)
os.mkdir(dir)
return dir

def makefile(*parts):
fn = os.path.join(*parts)
with open(fn, "w") as out:
out.write((parts[-1]+"\n")*10)

if __name__ == "__main__":
root = "./tmp_tree"
makedir(root)
for i in range(1, 6):
outer = "Folder%d" % i
makedir(root, outer)
for k in range(1, 11):
inner = "Folder%02d" % k
dir = makedir(root, outer, inner)
for i in range(1, 1001):
makefile(dir, "file%04d.txt" % i)

Peter
 
J

John Machin

On Mar 19, 12:50 am, "(e-mail address removed)"
        FName = "TextFile"+c+"_"+d+"_"+p+".txt"
        l =1
        for l in range(1 , 11):
            os.system ("\"echo "+FName+" >> "+FName+"\"")
            l = l +1

That is not the most clear code that I've ever seen. Makeover time!
os.system ("\"echo "+FName+" >> "+FName+"\"")
os.system('"echo ' + FName + ' >> ' + FName + '"')
os.system('"echo %s >> %s"') % (FName, FName)

Some questions:
(1) Isn't that one level of quoting too many?
(2) Does os.system create a whole new shell process to execute that
command?
(3) Can you think of a better/faster way to write 10 lines to a file
than doing os.system in a loop?
(4) Why are you incrementing the loop variable yourself?
(5) Why are you setting the loop variable to 1 before the loop?
(6, 7) as for (4, 5) but this time in relation to the enclosing "for
k..." loop.
(8) Did you actually test that this script was working on some *small*
number of files before you fired it off?
 
S

S Arrowsmith

FName = "TextFile"+c+"_"+d+"_"+p+".txt"
l = 1
for l in range(1 , 11):
os.system ("\"echo "+FName+" >> "+FName+"\"")
l = l +1

1. os.system spawns a new process, which on Windows (I'm guessing
you're on Windows given the full path names you've given) is a
particularly expensive operation.

2. You really do have a problem with loops. Not how many your
task has landed you, but the way you're writing them. You
don't need to pre-assign the loop variable, you don't need to
increment it, and it's generally clearer if you use a 0-based
range and add 1 when you use the variable if you need it to be
1-based.

So rewrite the above as:

fname = "TextFile%s_%s_%s.txt" % (c, d, p)
f = open(fname, 'a')
for l in range(10):
f.write(fname)
f.close()

Also:

3. All the logging with m.write is likely to be slowing things
down. It might be useful to see just how much of an impact that's
having.

4. Lose the globals and pass them as function arguments.
 
L

Laurent Rahuel

Hi,

I don't know why you are forking shells using os.system. You should
either use system commands to do the job or plain python, but mixing
both should never be an option.

Here is a plain portable python script (using the with statement. Your
python should not be to old)


from __future__ import with_statement
import os

for i in xrange(1, 1001):
os.mkdir('Folder%s' %i)
for j in xrange(1, 51):
with open('Folder%s%sTextFile%s.txt' %(i, os.sep, j), "w") as f:
f.write('TextFile%s.txt' %j)

Here are the results on a ext3 file system

time python test.py

real 0m5.299s
user 0m0.939s
sys 0m3.539s

Regards,

Laurent


(e-mail address removed) a écrit :
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top