C++ programming challenge

Chris M. Thomasson · Jun 14, 2009

Hendrik Schober said:
Chris said:

[...]
I only care about the platforms in which it works.

Click to expand...

And what about customers?

You know of companies that are interested in the programs that this contest
has spawned?

Do you also only care about
customers for which it works? (I'd starve my kids doing
it this way.)

I do not foresee anybody purchasing the little program I created for this
contest.

Jerry Coffin · Jun 14, 2009

[ ... ]

Here's a simple solution that reads from stdin and doesn't do any timing

Simple, but wrong. The percentages it computes will (at least usually)
be incorrect.

[ ... ]

while ( (c = getchar()) != EOF )
{
f [c]++;
size++;

At least as I read the proposal, only letters should be counted toward
the total.

Jerry Coffin · Jun 14, 2009

(e-mail address removed)>, (e-mail address removed)
says...

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

Yet another bit of code:

#include <iostream>
#include <fstream>
#include <vector>
#include <numeric>
#include <iomanip>

class counter {
std::vector<double> counts;
long total;
public:
static const int num = 'z'-'a'+1;

counter() : counts(num), total(0) {}

counter operator+(char c) {
char ch = tolower(c);
if (isalpha(ch)) {
++total;
++counts[ch-'a'];
}
return *this;
}

friend std:

stream &operator<<(std:

stream &os, counter const &c)
{
for (int i=0; i<num; ++i)
if (isalpha('a'+i))
// The single most complex part of the program is getting the
// output formatted as your web site shows it!
os << char('a'+ i) << " "
<< std::fixed
<< std::setw(6)
<< std::setprecision(3)
<< c.counts/c.total*100.0 << "%\n";
return os;
}
};

int main(int argc, char **argv) {
std::ifstream input(argv[1]);

counter &counts = std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());

std::cout << counts;
return 0;
}

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously) don't really
think so, but I'll admit that especially in the latter case there's room
for question.

I don't have the right sort of machine handy to test it with, but I
believe this should work with something like EBCDIC. Making it work
correctly with something like almost any Unicode encoding would take
substantial modifications.

Chris M. Thomasson · Jun 14, 2009

Jonathan Lee said:
If this has already been brought up then please ignore me, but how are
you ensuring the programs are reading from disk? For example, the
current program listed as fastest on your blog clocks in at .13s. So I
downloaded the source, downloaded the text file and unzipped it, and
ran it. Result: 0.17s. Fine, but I notice that my hard disk light
doesn't flash. So I restart the computer and run it again. Result:
6.1s.

I suspect that some of the times you've posted may be benefiting from
the fact that the file is in memory, and others are genuinely reading
it from disk.

It might be interesting to see a posted result that executed on a system in
which portions of the target file were not in memory. Then, run the program
10 or 20 times; take the average time (e.g., time command) and post that
result as well.

Jerry Coffin · Jun 14, 2009

[ ... ]

counter operator+(char c) {

Oops, that should be:

counter &operator+(char c) {

Given the degree to which it's I/O bound, it probably doesn't make any
real difference, but at least in theory this should improve speed a bit.

Chris M. Thomasson · Jun 14, 2009

Jerry Coffin said:
(e-mail address removed)>, (e-mail address removed)
says...

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here: [...]

Click to expand...

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously) don't really
think so, but I'll admit that especially in the latter case there's room
for question.

I don't have the right sort of machine handy to test it with, but I
believe this should work with something like EBCDIC.

Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?

Making it work
correctly with something like almost any Unicode encoding would take
substantial modifications.

Indeed.

Jerry Coffin · Jun 14, 2009

"Jerry Coffin" <[email protected]> wrote in message

[ ... ]

Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?

Yes, it's possible to theorize an 8-bit character set for which it
wouldn't work -- but in reality, I'm pretty sure there's never been one
that would cause a problem. I hardly think it's worth worrying a great
deal about when there are so many real character sets that would cause
problems.

OTOH, a fair number of the other solutions that have been posted have
been far more dependent on the details of encoding for considerably less
benefit.

Ioannis Vranos · Jun 14, 2009

Jerry said:
[ ... ]

counter operator+(char c) {

Click to expand...

Oops, that should be:

counter &operator+(char c) {

Given the degree to which it's I/O bound, it probably doesn't make any
real difference, but at least in theory this should improve speed a bit.

Your code does not compile:

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
main.cc: In function â€˜int main(int, char**)â€™:
main.cc:43: error: â€˜istream_iteratorâ€™ is not a member of â€˜stdâ€™
main.cc:43: error: expected primary-expression before â€˜charâ€™
main.cc:44: error: â€˜istream_iteratorâ€™ is not a member of â€˜stdâ€™
main.cc:44: error: expected primary-expression before â€˜charâ€™
john@john-laptop:~/Projects/anjuta/cpp/src$

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 14, 2009

Opening the text file in binary mode is ==> definitely a mistake, so I think you should remove all such
answers from the valid answers.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Chris M. Thomasson · Jun 14, 2009

Peter Jansson said:
I agree. Do you know how to remove the target file from memory before
executing the program?

Reboot should do it, or read a huge file, before each subsequent test.

Chris M. Thomasson · Jun 14, 2009

Ioannis Vranos said:
Opening the text file in binary mode is ==> definitely a mistake, so I
think you should remove all such answers from the valid answers.

Even programs which allow one to customize the mode via pre-defined macros?

Alf P. Steinbach · Jun 14, 2009

* Ioannis Vranos:

Opening the text file in binary mode is ==> definitely a mistake, so I
think you should remove all such answers from the valid answers.

In principle you may be right, because binary mode won't necessarily work on a
system where a file is not simply a sequence of bytes.

But in practice, for systems like Windows and *nix, binary mode just works, and
for systems like Windows it's required to avoid time-consuming conversions of
e.g. newlines on some systems.

Since the programs are not counting lines binary mode has no other impact.

Cheers,

- Alf

Chris M. Thomasson · Jun 14, 2009

Chris M. Thomasson said:
Reboot should do it, or read a huge file, before each subsequent test.

Or perhaps create an individual copy of the input datum on a per-test basis,
reboot, and recomputed.

Anyway, I think the posted timing results are more realistic/consistent now.
However, it would be fairly interesting, IMVHO or course, to see times that
represent the average time a program consumes across 20+ runs... Forget
awesome cached results, execute program numerous times and give average.
IMVVHO, that would seem to be "fair" enough.

Forget individual timing codes. Perhaps you should strip all timing code out
of each entry and blast them all with the UNIX time command?

Perhaps execute a phantom program to get file in cache before executing the
"first" contest submission...

Chris M. Thomasson · Jun 14, 2009

Chris M. Thomasson said:
Or perhaps create an individual copy of the input datum on a per-test
basis, reboot, and recomputed.

Perhaps you should increase the target file size by a couple of gigs. Say,
50 gigs+? Then programs should be hitting hard drive sooner or later...

Chris M. Thomasson · Jun 14, 2009

Jerry Coffin said:
"Jerry Coffin" <[email protected]> wrote in message

Click to expand...

[ ... ]

Well, imagine a contrived character set in which the following:

static const int num = 'z'-'a'+1;

produced a negative number?

Click to expand...

Yes, it's possible to theorize an 8-bit character set for which it
wouldn't work -- but in reality, I'm pretty sure there's never been one
that would cause a problem. I hardly think it's worth worrying a great
deal about when there are so many real character sets that would cause
problems.

100% Agreed.

OTOH, a fair number of the other solutions that have been posted have
been far more dependent on the details of encoding for considerably less
benefit.

YES! ;^o

Slapping self!

Ouch...

Ioannis Vranos · Jun 14, 2009

Peter said:
I will keep the binary mode programs in the list since it interesting to
see if it makes any speed difference compared to the non-binary mode
programs.

Sincerely,
Peter Jansson

Data stored in binary files, are implementation-dependent. That is, if you store something in a binary file,
and then you read it as it is in another platform (e.g. with the same program recompiled either with a
different compiler and/or at a different OS), there is high probability that the binary file will be incompatible.

Text files on the other hand are portable.

Opening a text file with binary mode has two possibilities. Either binary mode is implemented as text mode and
the program works, or you are reading trash.

That's why we have "text mode" and "binary mode". So, opening the text file in binary mode is clearly a mistake.

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Jerry Coffin · Jun 14, 2009

[ ... ]

I had to #include <iterator>

Yup -- should have been there, though it happened to work as it was with
the compiler I tested with.

and also changed in your main a little bit:

Perhaps a slightly larger change would be worthwhile:

int main(int argc, char **argv) {
static char buffer[20*1024*1024];

std::ifstream input(argv[1]);
input.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
input.sync_with_stdio(false);

std::cout << std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());
return 0;
}

It's hard to guess exactly how much difference those changes will make,
but they might be significant. Of course, it's all a bit pointless -- if
you're really processing large files often enough to care about speed,
some non-portable code will usually make a much larger difference than
anything like this.

Jerry Coffin · Jun 14, 2009

Given that the problem is largely I/O bound, there's another possibility
for reading the data into memory quickly that should probably be tested.
Try replacing the main from my previous program with the following:

int main(int argc, char **argv) {
std::ifstream input(argv[1]);
std::stringstream data;

data << input.rdbuf();
std::string const &d=data.str();

std::cout << std::accumulate(d.begin(), d.end(), counter());
return 0;
}

To use this, you'll have to add:

#include <sstream>
#include <string>

up with the rest of the #include's, but the rest of the program remains
the same. At least on my computer, using the compilers I have handy,
this has a substantial performance advantage. It's hard to guess what
it'll do on your target though -- if your computer doesn't have enough
free memory to hold the entire input file, it could cause thrashing,
leading to truly awful performance.

Ioannis Vranos · Jun 14, 2009

Peter said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

With kind regards,
Peter Jansson

This is my final, corrected ISO C++ version. Please erase all my previous submissions from your site, and
include only this:

#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.
//
// The C++ basic character set is using the value range [0, 127].
// If we used vector<long double>, it would not have any run-time difference in any modern compiler.
valarray<long double> characterFrequencies(128);

// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";

long double totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];

for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters<< ": "<< (characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100<< "%\n";

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Ioannis Vranos · Jun 14, 2009

Corrected: Please include this one, and NOT the previous one.

#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// The C++ basic character set is using the value range [0, 127].
// If we used vector<unsigned long>, it would not have any run-time difference in any modern compiler.
valarray<unsigned long> characterFrequencies(128);

// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";

unsigned long totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];

for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters<< ": "<< static_cast<double>(characterFrequencies[ characters ]+
characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100<< "%\n";

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

C programming in 2011	148	May 26, 2011
Ann: Registration for the Python Game Programming Challenge is nowopen!	0	Jul 29, 2005
Requirement @ Appulse	0	Jul 24, 2008
Some errors in MIT's intro C++ course	109	Sep 8, 2010
Looking for C++ Professionals (Bangalore)	4	Aug 17, 2007
C language now truly universal	0	Jan 1, 2011
Seek Contract Programming Work - 17 Years Experience	0	Feb 22, 2005
Fundamentals of Financial Management Concise 7e Brigham Houston	0	May 1, 2011

C++ programming challenge

Chris M. Thomasson

Jerry Coffin

Jerry Coffin

Chris M. Thomasson

Jerry Coffin

Chris M. Thomasson

Jerry Coffin

Ioannis Vranos

Ioannis Vranos

Chris M. Thomasson

Chris M. Thomasson

Alf P. Steinbach

Chris M. Thomasson

Chris M. Thomasson

Chris M. Thomasson

Ioannis Vranos

Jerry Coffin

Jerry Coffin

Ioannis Vranos

Ioannis Vranos

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads