Increasing C++ throughput

SzH

I need to read very large text files and do some simple processing on
them. I'm trying to make this as fast as possible. Before spending a
lot of time with it and going through a series of futile attempts to
optimize this, I thought I'd post here and ask where to start.

Consider this very small program for outputting every 10th line:

------------ filt.cpp ----------------

#include <iostream>
#include <string>

using namespace std;

int main() {
    ios::sync_with_stdio(false);

    string line;
    unsigned long nr = 0;
    while (getline(cin, line))
        if (nr++ % 10 == 0)
            cout << line << '\n';
    return 0;
}

-----------------------------------

How can this be made faster? I know very little about C++ I/O. I
usually only do simple numerical stuff, and I think that to speed this
up, one needs to be familiar with how I/O works (internally) in the
C++ standard library.

First I found that ios::sync_with_stdio(false); really does help.
Then I noticed that compressing the data file with gzip and piping it
with zcat to this simple program speeds things up a *lot* (it's several
times faster). So I suppose that before compression was applied, the
speed of reading the uncompressed file was limited by the hard drive.

Now this is the time it takes to decompress the file (tt2.gz), and
throw away the result:
timethis "zcat tt2.gz > NUL"

TimeThis : Command Line : zcat tt2.gz > NUL
TimeThis : Start Time : Mon Jan 21 19:10:11 2008

TimeThis : Command Line : zcat tt2.gz > NUL
TimeThis : Start Time : Mon Jan 21 19:10:11 2008
TimeThis : End Time : Mon Jan 21 19:10:26 2008
TimeThis : Elapsed Time : 00:00:14.750

This is filtering the decompressed data through the filt program from
above, and throwing away the result:
timethis "zcat tt2.gz |filt > NUL"

TimeThis : Command Line : zcat tt2.gz |filt > NUL
TimeThis : Start Time : Mon Jan 21 18:51:16 2008

TimeThis : Command Line : zcat tt2.gz |filt > NUL
TimeThis : Start Time : Mon Jan 21 18:51:16 2008
TimeThis : End Time : Mon Jan 21 18:51:53 2008
TimeThis : Elapsed Time : 00:00:37.031

This is more than twice as slow. Could some knowledgeable people give
some hints on why simply reading the data line by line and
outputting every tenth line is more than twice as slow as decompressing
it?

Is it because of the memory allocations (happening in string)? Are the
C++ I/O routines the limiting factor? Can this be sped up?

The compression ratio of the data is about 1:10, so zcat is reading
approx. the same amount of data that filt is outputting.

Any insights will be most welcome!

Szabolcs

(P.S. I'm on WinXP, if this matters. The program was compiled with
mingw gcc 4.2.1 with the -O3 option.)
 
fnegroni

I am no C++ expert.
But the most obvious optimization coming to mind is right there in the
specs.
You are only printing one line every ten.
Why are you storing the lines you are not printing?
You could just scan the file (using buffered IO) for line endings, and
store only the tenth line.
 
SzH

If you want to optimise text file processing, consider platform specific
file mapping facilities over iostreams.

So are you saying that there is no room for improvement here? The
reason why I am unsure about it is that if I change that string to
char, like this

----------- filt2.cpp ----------------

#include <iostream>
#include <string>

using namespace std;

int main() {
    ios::sync_with_stdio(false);

    char ch;
    unsigned long nr = 0;
    while (cin >> ch)
        if (nr++ % 10 == 0)
            cout << ch;
    return 0;
}

--------------------------------

, then it gets much-much-much slower.

Any pointers to tutorials/documentation which explain why this is/how
these things work (and how these things can be sped up---for example,
is it worth using stdio.h instead of iostreams?) would be much
appreciated.
 
Juha Nieminen

SzH said:
How can this be made faster?

You might want to try using the C I/O functions. They may in some
cases be significantly faster.

(And no, I don't have good suggestions about how to easily print each
10th line using C I/O functions. It's complicated.)
 