Joining Big Files

mosscliffe

I have 4 text files, each approx 50 MB.

I need to join these into one large text file.

I only need to do this very occasionally, as the problem has occurred
because of upload limitations.

Bearing in mind file size and memory usage, would I be better off
reading every line in every file and writing each line to the output
file, or is there some way I could execute a shell command?

I am working on a hosted website with no direct telnet or similar
access.

I would appreciate any tips or best code, as I do not want to break
anything on the website.

Richard
 
Marc 'BlackJack' Rintsch

> I have 4 text files, each approx 50 MB.
>
> I need to join these into one large text file.
>
> I only need to do this very occasionally, as the problem has occurred
> because of upload limitations.
>
> Bearing in mind file size and memory usage, would I be better off
> reading every line in every file and writing each line to the output
> file, or is there some way I could execute a shell command?

There are some copy functions that work with file like objects in the
`shutil` module.
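
For example, a minimal (untested) sketch using shutil.copyfileobj,
which copies a fixed-size block at a time so a whole file is never
held in memory - the file names here are placeholders:

import shutil

outfile = open("joined.txt", "wb")
for name in ["part1.txt", "part2.txt", "part3.txt", "part4.txt"]:
    infile = open(name, "rb")
    # copyfileobj reads and writes one block at a time
    shutil.copyfileobj(infile, outfile)
    infile.close()
outfile.close()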

Ciao,
Marc 'BlackJack' Rintsch
 
beginner

> I have 4 text files, each approx 50 MB.
>
> I need to join these into one large text file.
>
> I only need to do this very occasionally, as the problem has occurred
> because of upload limitations.
>
> Bearing in mind file size and memory usage, would I be better off
> reading every line in every file and writing each line to the output
> file, or is there some way I could execute a shell command?
>
> I am working on a hosted website with no direct telnet or similar
> access.
>
> I would appreciate any tips or best code, as I do not want to break
> anything on the website.
>
> Richard

I would probably open the files in binary mode and copy 128 KB to
512 KB at a time. I wouldn't copy a line at a time, as it is possible
that one of the files contains a very long line.
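
One way to code that chunked copy (an untested sketch; 256 KB is an
arbitrary size within the suggested range, and the file names are
placeholders):

CHUNK = 256 * 1024  # 256 KB per read

outfile = open("joined.txt", "wb")
for name in ["part1.txt", "part2.txt", "part3.txt", "part4.txt"]:
    infile = open(name, "rb")
    while True:
        block = infile.read(CHUNK)
        if not block:        # empty read means end of file
            break
        outfile.write(block)
    infile.close()
outfile.close()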
 
Paul McGuire

> I have 4 text files, each approx 50 MB.

<yawn> 50 MB? Really? Did you actually try this and find out it was a
problem?

Try this:
import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname, "w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    outfile.write(infile.read())
    infile.close()
outfile.close()
end = time.clock()

print end-start, "seconds"

For four 30 MB files, this takes just over 1.3 seconds on my system.
(You may need to open the files in binary mode, depending on the
contents, but I was in a hurry.)

-- Paul
 
Paul McGuire

> For four 30 MB files, this takes just over 1.3 seconds on my system.
> (You may need to open the files in binary mode, depending on the
> contents, but I was in a hurry.)

My bad - my test file was not a text file but a binary file.
Retesting with a 50 MB text file took 24.6 seconds on my machine.

Still in your working range? If not, then you will need to pursue
more exotic approaches. But 25 seconds on an infrequent basis does
not sound too bad, especially since I don't think you will really get
any substantial boost from them (to benchmark this, I timed a raw
"copy" command at the OS level on the resulting 200 MB file, and it
took about 20 seconds).

Keep it simple.

-- Paul
 
vasudevram

> Still in your working range? If not, then you will need to pursue
> more exotic approaches. But 25 seconds on an infrequent basis does
> not sound too bad [...]
>
> Keep it simple.
>
> -- Paul

There are (at least) a couple of other approaches possible, each with
its own tradeoffs or requirements:

Approach 1. (Least amount of code to write - not that the others are
large :)

Just use os.system() and the UNIX cat command. The requirements here
are that:

a) your web site is hosted on *nix (you can do it on Windows too - use
copy instead of cat, you might have to add a "cmd /c " prefix in front
of the copy command, and you have to use the right copy syntax for
concatenating multiple input files into one output file - see the
sketch after the cat example below), and

b) your hosting plan allows you to execute OS-level commands like cat,
and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
for Windows hosts.)

import os
os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")

cat will take care of buffering, etc. transparently to you.
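
On Windows, the variant described in a) might look like this (a
sketch, not tested - the exact copy syntax and the need for the
"cmd /c" prefix can vary with the Windows version):

import os
# copy joins the "+"-separated inputs into the last name given
os.system('cmd /c copy file1.txt+file2.txt+file3.txt+file4.txt file_out.txt')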

Approach 2: Read (in a loop, as you originally thought of doing) each
line of each of the 4 input files and write it to the output file:

("Reusing" Paul McGuire's code above:)

import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname, "w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    for lin in infile:
        outfile.write(lin)
    infile.close()
outfile.close()
end = time.clock()

print end-start, "seconds"

(Note: iterating over a file object yields each line with its
trailing newline intact, so the newlines are preserved in the output
file and nothing needs to be added back.)

( Code not tested, test it locally first, though looks ok to me. )

The reason why this _may_ not be much slower than manually coded
buffering approaches is that:

a) Python's standard library is written in C (which is fast),
including its use of stdio (the C standard I/O library, which already
does intelligent buffering),

b) OSes do I/O buffering anyway, and so do hard disk controllers, and

c) since some recent Python version (I think it was 2.2), the idiom
"for lin in infile" has been stated (based on something I read in the
Python Cookbook) to be pretty efficient, and it is also (slightly)
more readable than the approaches to reading a text file that were
followed earlier.

Given all the above facts, it probably isn't worth your while to try
and optimize the code unless and until you find (by measurements) that
it's too slow - which is a good practice anyway:

http://en.wikipedia.org/wiki/Optimization_(computer_science)

Excerpt from the above page (it's long, but worth reading, IMO):

"Donald Knuth said, paraphrasing Hoare [1]:

'We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.' (Knuth, Donald.
Structured Programming with go to Statements, ACM Computing Surveys,
Vol. 6, No. 4, Dec. 1974, p. 268.)

Charles Cook commented,

"I agree with this. It's usually not worth spending a lot of time
micro-optimizing code before it's obvious where the performance
bottlenecks are. But, conversely, when designing software at a system
level, performance issues should always be considered from the
beginning. A good software developer will do this automatically,
having developed a feel for where performance issues will cause
problems. An inexperienced developer will not bother, misguidedly
believing that a bit of fine tuning at a later stage will fix any
problems." [2]
"

HTH
Vasudev
-----------------------------------------
Vasudev Ram
http://www.dancingbison.com
http://jugad.livejournal.com
http://sourceforge.net/projects/xtopdf
-----------------------------------------
 
mcl

> 'We should forget about small efficiencies, say about 97% of the
> time: premature optimization is the root of all evil.'
>
> HTH
> Vasudev

All,

Thank you very much.

As my background is with much smaller memory machines than today's
giants - 64K being a big machine and 640K gigantic - I get very
worried about crashing machines when copying or editing big files,
especially in a multi-user environment.

Mr Knuth - that brings back memories. I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8K drum,
and almost eliminating one shift per day of computer operator time.

Thanks again

Richard
 
vasudevram

> Mr Knuth - that brings back memories. I remember implementing some
> of his sort routines on a mainframe with 24 tape units and an 8K drum,
> and almost eliminating one shift per day of computer operator time.

I can imagine ... though I don't go back that far.
Cool ...

Vasudev
 
