Break large file down into multiple files

brianrpsgt1

New to Python. I have a large file that I need to break up into
multiple smaller files. I need to break the large file into sections
of 65535 lines each and then write those sections to separate
files. I am familiar with opening and writing files; however, I am
struggling with creating the sections and writing the different
sections to their own files.

Thanks

B
 
Gabriel Genellina

This function copies at most n lines from fin to fout:

def copylines(fin, fout, n):
    for i, line in enumerate(fin):
        fout.write(line)
        if i+1>=n: break

Now you have to open the source file, create new files as needed, and
repeatedly call the above function until the end of the source file. You'll
have to enhance it a bit so that you know whether there are remaining lines or not.
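
To see how repeated calls behave, here is a minimal sketch assuming Python 3, with in-memory io.StringIO objects standing in for real files (the sample data is made up). A file object is its own iterator, so each call picks up where the previous one stopped.

```python
import io

def copylines(fin, fout, n):
    # Copy at most n lines from fin to fout (same logic as above).
    for i, line in enumerate(fin):
        fout.write(line)
        if i + 1 >= n:
            break

source = io.StringIO("a\nb\nc\nd\ne\n")  # stand-in for the large input file
first, second = io.StringIO(), io.StringIO()

copylines(source, first, 3)    # copies the first three lines
copylines(source, second, 3)   # resumes at the fourth line; only two remain

print(repr(first.getvalue()))   # 'a\nb\nc\n'
print(repr(second.getvalue()))  # 'd\ne\n'
```

Note that the second call cannot tell the caller it reached the end of the input, which is the enhancement mentioned above.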
 
brianrpsgt1

Gabriel ::

Thanks for the direction. Do I simply define fin, fout and n as
variables above the def copylines(fin, fout, n): line?

Would it look like this?

fin = open('C:\Path\file')
fout = open('C:\newfile.csv', 'w')
n = 65535

def copylines(fin, fout, n):
    for i, line in enumerate(fin):
        fout.write(line)
        if i+1>=n: break

Thanks

B
 
redbaron

If your lines are variable-length, then look at itertools recipes.

from itertools import zip_longest  # named izip_longest on Python 2

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open("/file", "r") as f:
    for lines in grouper(65535, f, ""):
        # Each line already ends with "\n", so join on the empty string;
        # the "" fill values padding the last group add nothing.
        data_to_write = ''.join(lines)
        # <write data where you need it here>
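
A padding-free variant of the same idea can be sketched with itertools.islice, assuming Python 3. The helper name chunks is hypothetical, and each chunk is held in memory as a list, so this trades memory for simplicity.

```python
from itertools import islice

def chunks(iterable, n):
    """Yield successive lists of up to n items from iterable (no padding)."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk

# Made-up sample data standing in for the lines of a file.
lines = ["line %d\n" % i for i in range(7)]
sizes = [len(chunk) for chunk in chunks(lines, 3)]
print(sizes)  # [3, 3, 1]
```

Passing an open file object instead of the sample list would yield groups of lines that can be written out with writelines.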
 
Gabriel Genellina

Regarding the paths in your snippet ('C:\Path\file' and 'C:\newfile.csv'), a
warning: in an ordinary string literal a backslash starts an escape sequence
('\f' is a form feed, '\n' is a newline), so these are not the paths you
intended. Use raw strings such as r'C:\Path\file' or forward slashes, and see
this FAQ entry:
http://www.python.org/doc/faq/general/#why-can-t-raw-strings-r-strings-end-with-a-backslash

As for whether defining fin, fout and n above the def copylines(fin, fout, n):
line is enough:
Almost. You have to *call* the copylines function, not just define it.
After calling it with: copylines(fin, fout, 65535), you'll have the
*first* chunk of lines copied. So you'll need to create a second file,
call the copylines function again, create a third file, call... You'll
need some kind of loop, and a way to detect when to stop and break out of
it.

The copylines function already knows what happened (whether there are more
lines to copy or not), so you should enhance it to return that information
to the caller.
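
One possible shape for that loop, as a minimal sketch assuming Python 3. The names split_file and name_pattern and the part0001.txt naming scheme are hypothetical, and this is only one of several ways to signal the end of the input (here, copylines returns the number of lines it copied).

```python
import itertools
import os

def copylines(fin, fout, n):
    """Copy at most n lines from fin to fout; return how many were copied."""
    count = 0
    for line in itertools.islice(fin, n):
        fout.write(line)
        count += 1
    return count

def split_file(source_path, n, name_pattern="part%04d.txt"):
    """Split source_path into files of at most n lines; return their paths."""
    out_dir = os.path.dirname(os.path.abspath(source_path))
    paths = []
    with open(source_path) as fin:
        while True:
            path = os.path.join(out_dir, name_pattern % (len(paths) + 1))
            with open(path, "w") as fout:
                copied = copylines(fin, fout, n)
            if copied == 0:
                # The source was already exhausted: drop the empty file and stop.
                os.remove(path)
                break
            paths.append(path)
    return paths
```

A call such as split_file("big.txt", 65535) would then write part0001.txt, part0002.txt and so on next to the source file.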

It isn't so hard: after working through the tutorial (linked from
http://wiki.python.org/moin/BeginnersGuide ) you'll know enough Python to
finish this program.

If you have some previous programming experience, Dive into Python (linked
from the Beginners Guide above) is a good online book.

Feel free to come back when you're stuck with something.
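
On the backslash warning above, a minimal sketch (assuming Python 3) of what happens to such a literal. The path is the one from the question and nothing is actually opened.

```python
plain = 'C:\newfile.csv'      # '\n' is read as a single newline character
raw = r'C:\newfile.csv'       # raw string: the backslash is kept literally
escaped = 'C:\\newfile.csv'   # doubled backslash: same text as the raw string
forward = 'C:/newfile.csv'    # forward slashes avoid the problem entirely

print(repr(plain))            # 'C:\newfile.csv' (contains a newline)
print(repr(raw))              # 'C:\\newfile.csv'
print(len(plain), len(raw))   # 13 14
```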
 
Chris

I really would not recommend joining a large number of lines as in the
grouper example; that will take some time.

fIn = open(input_filename, 'rb')
chunk_size = 65535

# Note: assumes the input has at least one line, otherwise fOut is never created.
for i, line in enumerate(fIn):
    if not i:  # First line in the file: create a file to start writing to
        # File number is i // chunk_size + 1 (not i % chunk_size)
        filenum = '%04d' % (i // chunk_size + 1)
        fOut = open('%s.txt' % filenum, 'wb')
    if i and not i % chunk_size:  # At chunk_size lines, close the old file object and create a new one
        fOut.close()
        filenum = '%04d' % (i // chunk_size + 1)
        fOut = open('%s.txt' % filenum, 'wb')
    if not i % 1000:
        fOut.flush()
    fOut.write(line)

fOut.close()
fIn.close()
 
Chris

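Whoops, day-dreaming mistake. Use "filenum = '%04d' % (i // chunk_size + 1)"
and not i % chunk_size. The + 1 also has to go inside the parentheses, since
it cannot be added to the already formatted string.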
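
A quick check of that arithmetic, as a sketch assuming Python 3, where // is integer division (on Python 2, / between two ints gives the same result). The helper name file_number is hypothetical.

```python
chunk_size = 65535

def file_number(i, chunk_size):
    """1-based number of the output file that 0-based line index i belongs to."""
    return i // chunk_size + 1

# Line indices around the chunk boundaries.
for i in (0, 65534, 65535, 131070):
    print(i, '%04d' % file_number(i, chunk_size), i % chunk_size)
# 0 0001 0
# 65534 0001 65534
# 65535 0002 0
# 131070 0003 0
```

The modulo column only says where a line sits inside its chunk, which is why it is useful for detecting a boundary but not for naming the file.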
 
Tim Chase

While this thread has offered many nice Python solutions, the
"split" command is pretty standard on most Linux boxes:

bash$ split -l 65535 infile.txt

which will do what you describe. You can read the man page for
more details.

So if you just want a fast route to a goal rather than going
through the learning process (I'm all for learning how to do it
too), this may be a quick answer.

-tkc
 
