SzH
I need to read very large text files and do some simple processing on
them. I'm trying to make this as fast as possible. Before spending a
lot of time with it and going through a series of futile attempts to
optimize this, I thought I'd post here and ask where to start.
Consider this very small program for outputting every 10th line:
------------ filt.cpp ----------------
#include <iostream>
#include <string>
using namespace std;
int main() {
    ios::sync_with_stdio(false);
    string line;
    unsigned long nr = 0;
    while (getline(cin, line))
        if (nr++ % 10 == 0)
            cout << line << '\n';
    return 0;
}
-----------------------------------
How can this be made faster? I know very little about C++ I/O. I
usually only do simple numerical stuff, and I think that to speed this
up, one needs to be familiar with how I/O works (internally) in the
C++ standard library.
First I found that ios::sync_with_stdio(false); really does help.
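(A related knob I have not verified: giving the stream a bigger
buffer with pubsetbuf. As far as I understand, whether this has any
effect on cin is implementation-defined, and it has to happen before
the first read, so take it only as a sketch of the idea; the 1 MiB
size is an arbitrary guess.)

#include <iostream>

int main() {
    std::ios::sync_with_stdio(false);
    // Implementation-defined: may or may not replace cin's buffer,
    // and must be done before anything is read from cin.
    static char buf[1 << 20];  // 1 MiB, arbitrary size
    std::cin.rdbuf()->pubsetbuf(buf, sizeof buf);
    // ... same filtering loop as in filt.cpp ...
}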
Then I noticed that compressing the data file with gzip, and piping it
with zcat to this simple program, speeds things up a *lot* (it's
several times faster). So I suppose that before compression was
applied, the speed of reading the uncompressed file was limited by the
hard drive.
Now this is the time it takes to decompress the file (tt2.gz), and
throw away the result:
timethis "zcat tt2.gz > NUL"
TimeThis : Command Line : zcat tt2.gz > NUL
TimeThis : Start Time : Mon Jan 21 19:10:11 2008
TimeThis : End Time : Mon Jan 21 19:10:26 2008
TimeThis : Elapsed Time : 00:00:14.750
This is filtering the decompressed data through the filt program from
above, and throwing away the result:
timethis "zcat tt2.gz |filt > NUL"
TimeThis : Command Line : zcat tt2.gz |filt > NUL
TimeThis : Start Time : Mon Jan 21 18:51:16 2008
TimeThis : End Time : Mon Jan 21 18:51:53 2008
TimeThis : Elapsed Time : 00:00:37.031
This is more than twice as slow. Could some knowledgeable people give
some hints on why simply reading the data line by line and outputting
every tenth line is more than twice as slow as decompressing it?
Is it because of the memory allocations (happening in string)? Is the
limiting factor the C++ I/O routines? Can this be sped up?
The compression ratio of the data is about 1:10, so zcat is reading
approx. the same amount of data that filt is outputting.
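One direction I was wondering about, in case getline and string are
themselves the problem: read fixed-size blocks and scan for newlines
by hand, so no line is ever copied into a string. Below is a minimal,
unmeasured sketch of that idea using C stdio (the 1 MiB block size is
an arbitrary choice). It should produce the same output as filt.cpp,
emitting lines that straddle a block boundary piece by piece:

#include <cstdio>
#include <cstring>

int main() {
    static char buf[1 << 20];                    // 1 MiB block, arbitrary
    unsigned long nr = 0;                        // index of the current line
    std::size_t got;
    while ((got = std::fread(buf, 1, sizeof buf, stdin)) > 0) {
        char *p = buf;
        char *end = buf + got;
        while (p < end) {
            // End of the current line within this block, if it is here.
            char *nl = static_cast<char *>(std::memchr(p, '\n', end - p));
            std::size_t len = nl ? (nl - p + 1) : (end - p);
            if (nr % 10 == 0)                    // emit every 10th line,
                std::fwrite(p, 1, len, stdout);  // possibly in pieces
            if (nl)                              // a line ends only at '\n';
                ++nr;                            // partial pieces keep nr
            p += len;
        }
    }
    return 0;
}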
Any insights will be most welcome!
Szabolcs
(P.S. I'm on WinXP, if this matters. The program was compiled with
MinGW GCC 4.2.1 with the -O3 option.)