C++ programming challenge

I

Ioannis Vranos

Thomas said:
Because you used fstream::eof in a wrong way, as we concluded in the
thread "Correction of FAQ - [15.5]".

The loop should be:

do {
// ...
} while (f);

instead of
/*...*/ while (!f.eof());

I wonder where this wrong usage of eof comes from. Is it some bad book
or something?


Actually this is not "wrong usage". The loop just doesn't look for other error conditions, that is, it assumes
that the only error condition is the end of file. However if there is some other kind of error, and the end of
file has not been reached, the loop will continue endlessly.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
C

Chris M. Thomasson

Jonathan Lee said:
Nor I. It works on two computers I've tried. A Intel Core 2 Duo, 64-
bit. And a Xeon Quad, 32-bit. The latter was tested from copying and
pasting the above (i.e., the source, as posted, works at least on one
computer :/ )

The output you described would mean that the file hasn't reached EOF
and yet it is reading 0 bytes... Erm... don't know why that would be
right now.. ??

Maybe I've got a buffer overrun and the ifstream is getting corrupted.
I'll take a closer look later.

The code seems to work for me with MSVC, but not with MINGW; weird...


Anyway, the code seems to consistently perform slower than say, something
like:

http://groups.google.com/group/comp.lang.c++/msg/2f8903a47b5c493c



AFAICT, the reason lies within your main data-processing loop. Here is the
loop I use:

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize;

for (tmpsize = 0; tmpsize < size; ++tmpsize) {
++counts[buf[tmpsize]];
}

if (size < BUFFER) break;
}



Here is your loop:

if (f.good()) {
do {
f.read(buff, BUFFSIZE);
streamsize bytesread = f.gcount();
streamsize i = 0;

for (; i < bytesread; i += sizeof(unsigned long)) {
unsigned long c = *(reinterpret_cast<unsigned long*>(buff + i));
unsigned long d = c >> 32;
++count[c & 0xFFFF];
++count[d & 0xFFFF];
c >>= 16;
d >>= 16;
++count[c & 0xFFFF];
++count[d & 0xFFFF];
}
for (; i < bytesread; ++i)
++count[buff];
} while (!f.eof());




As you can see, you basically iterate the read buffer twice, which in turn
performs a heck of a lot more computations one a per-read basis as compared
to something like mine.




Then, after your main loop, you go on to peform another "lengthy"
post-processing loop:


for (int i = 0; i < 65536; ++i) {
count2[i & 0xFF] += count;
count2[(i >> 8) & 0xFF] += count;
}





then another short one:

for (int i = 0; i < 26; ++i) {
unsigned long x = count2[alpha] + count2[ALPHA];
total_alpha = x;
total += x;
}





While I only peform a single short post-processing loop after
data-processing:

for (i = 0; i < 26; ++i) {
ABC_COUNT(i, LOWER) += ABC_COUNT(i, UPPER);
total += ABC_COUNT(i, LOWER);
}




I don't see the need to all your extra looping.
 
C

Chris M. Thomasson

That leads to an interesting question. Can I do something like:

#if defined(_POSIX_VERSION) || defined(__linux__)
// mmap based implementation
#else
// iostream based implementation
#endif

I think the `mmap()' version would probablly be filed under the:


Using other libraries (not following the rules)


section on Peter's blog as he lists pthreads and boost as "other libraries".
 
J

Jonathan Lee

Anyway, the code seems to consistently perform slower than say, something
Perhaps it's because my processor is 64-bit. In that case, I'm reading
8 bytes at a time from memory and doing the processing in the
registers (so the theory goes). On a 32-bit machine the benefit of
reading 4 bytes at a time might be negated by the overhead of AND-ing
the index into the array.
AFAICT, the reason lies within your main data-processing loop. Here is the
loop I use:

That's what I was using in my original code posted a few days ago.
Actually, I'm sure everyone is using some variation of that. I was
trying to save time by going to memory less often.
As you can see, you basically iterate the read buffer twice
No, the second loop doesn't restart the counter variable "i". The
second loop is just to clean up in case the last block isn't a
multiple of 4 or 8 bytes (sizeof long).
Then, after your main loop, you go on to peform another "lengthy"
post-processing loop:
What happens here is clean up from above. Instead of incrementing an
entry in the "count" array for every byte in the file, I increment
once every two bytes. That is, I treat every pair of bytes as a
"double width" character. Again, this saves going to memory half as
often. The "lengthy" loop of 65536 saves us (filesize/2) writes to the
count array.
I don't see the need to all your extra looping.
Perhaps with the explanations above it will make more sense. Also, it
was running 20% faster than anything else on my computer so I didn't
really make an effort to make it better.
 
C

Chris M. Thomasson

Chris M. Thomasson said:
Peter Jansson said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is one more submission:
________________________________________________________________________
[...]
________________________________________________________________________
[...]

That program only computed frequencies of characters relative to the total
amount of "valid" characters (e.g., `isalpha()'). Here is another submission
that calculates requests against all characters, and only the valid ones.
Here is the output I get from the downloaded text file:



a %5.454 %6.920
b %0.916 %1.162
c %3.315 %4.205
d %2.615 %3.317
e %9.179 %11.645
f %2.017 %2.559
g %1.494 %1.895
h %3.013 %3.823
i %6.163 %7.818
j %0.080 %0.101
k %0.504 %0.639
l %2.677 %3.397
m %1.866 %2.368
n %5.412 %6.865
o %7.395 %9.381
p %2.208 %2.801
q %0.100 %0.126
r %6.200 %7.865
s %4.780 %6.064
t %6.954 %8.822
u %2.344 %2.974
v %0.930 %1.180
w %1.181 %1.498
x %0.159 %0.202
y %1.838 %2.332
z %0.031 %0.040





The first column of data represents frequencies off of the total characters
in the file, while the second column show frequencies off the total number
of "valid" characters.






Anyway, here is the augmented code:
________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>
#include <ctype.h>


#define CONCATX(mp_token1, mp_token2)mp_token1 ## mp_token2
#define CONCAT(mp_token1, mp_token2)CONCATX(mp_token1, mp_token2)


#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif


#define BUFFER 65536U


#define ABC_LOWER ("abcdefghijklmnopqrstuvwxyz")
#define ABC_UPPER ("ABCDEFGHIJKLMNOPQRSTUVWXYZ")


#define ABC_COUNT(mp_index, mp_case) ( \
counts[(unsigned char)(CONCAT(ABC_, mp_case)[(mp_index)])] \
)


#define CALC_FREQ(mp_index, mp_scale) ( \
((double)ABC_COUNT(mp_index, LOWER) * (double)(mp_scale)) \
)


int main(int argc, char** argv) {
if (argc > 1) {
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
unsigned char buf[BUFFER];
unsigned long counts[UCHAR_MAX + 1] = { 0 };
size_t size;
int status = EXIT_SUCCESS;
unsigned long total_char = 0, total_valid_char = 0;

setvbuf(file, 0, _IONBF, 0);

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize;

for (tmpsize = 0; tmpsize < size; ++tmpsize) {
++counts[buf[tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
size_t i;
double scale_char, scale_valid;

for (i = 0; i <= UCHAR_MAX; ++i) {
total_char += counts;
if (i < 26) {
ABC_COUNT(i, LOWER) += ABC_COUNT(i, UPPER);
total_valid_char += ABC_COUNT(i, LOWER);
if ((unsigned char)ABC_LOWER > i) {
total_char -= ABC_COUNT(i, UPPER);
}
}
}

scale_char = 100.0 / (double)total_char;
scale_valid = 100.0 / (double)total_valid_char;

for (i = 0; i < 26; ++i) {
printf("%c %%%.3f\t%%%.3f\n", ABC_LOWER,
CALC_FREQ(i, scale_char), CALC_FREQ(i, scale_valid));
}

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}

________________________________________________________________________




Not sure exactly which output would be more desirable, so I give both
frequencies even it means taking a little hit wrt performance...
 
C

Chris M. Thomasson

Jonathan Lee said:
Perhaps it's because my processor is 64-bit. In that case, I'm reading
8 bytes at a time from memory and doing the processing in the
registers (so the theory goes). On a 32-bit machine the benefit of
reading 4 bytes at a time might be negated by the overhead of AND-ing
the index into the array.


That's what I was using in my original code posted a few days ago.
Actually, I'm sure everyone is using some variation of that. I was
trying to save time by going to memory less often.

No, the second loop doesn't restart the counter variable "i". The
second loop is just to clean up in case the last block isn't a
multiple of 4 or 8 bytes (sizeof long).

DOH! Stupid retard me! Your 100% correct.



What happens here is clean up from above. Instead of incrementing an
entry in the "count" array for every byte in the file, I increment
once every two bytes. That is, I treat every pair of bytes as a
"double width" character. Again, this saves going to memory half as
often. The "lengthy" loop of 65536 saves us (filesize/2) writes to the
count array.

Perhaps with the explanations above it will make more sense.

YES... ;^/


Also, it
was running 20% faster than anything else on my computer so I didn't
really make an effort to make it better.

Great!
 
C

Chris M. Thomasson

Also, the platform I am testing on might have an extremely shi%ty C++ stream
implementation.



[...]


I think I notice a buffer bounds bug... Consider:



if (f.good()) {
do {
f.read(buff, BUFFSIZE);
streamsize bytesread = f.gcount();
streamsize i = 0;

// okay, lets assume that `bytesread == 10'.


for (; i < bytesread; i += sizeof(unsigned long)) {
unsigned long c = *(reinterpret_cast<unsigned long*>(buff + i));
unsigned long d = c >> 32;
++count[c & 0xFFFF];
++count[d & 0xFFFF];
c >>= 16;
d >>= 16;
++count[c & 0xFFFF];
++count[d & 0xFFFF];



// now, on the first iteration, `i' will be incremented and
// probably be equal to say, `8', which is less then
// `bytesread'. Therefore the loop condition will hold...
// then BAM, you read 8 more bytes even though you can
// only legally read 2 bytes.



}
for (; i < bytesread; ++i)
++count[buff];
} while (!f.eof());




Can you see what I am getting at? You can hard-code things so the bug will
show up. Alls you have to do is read from a file that is only 10 bytes long.


What am I missing here Jonathan?
 
J

Jonathan Lee

Hi Chris,
I think I notice a buffer bounds bug... Consider:
...
What am I missing here Jonathan?

You're not missing anything; you're quite correct about the bounds.
Really the test on the for loop should be something like "i + sizeof
(long) <= bytesread".

Or perhaps better, just access the array as an array of unsigned longs
while (++i <= (bytesread/sizeof(long)))? At least on x86 processors,
the address calculation will be free. And the cmp/jbe will be paired.
Like,

unsigned long* buffl = reinterpret_cast<unsigned long*>(buff);
long i = 0;
long nlongs = bytesread/sizeof(unsigned long);
if (nlongs > 0) {
do {
unsigned long c = buffl; // free
...
} while (++i <= nlongs); // paired instructions
}
i *= sizeof(unsigned long);
for (; i < bytesread; ++i)
++count[buff];

Anyways, you get the idea. To be honest I just hacked the code
together in a few minutes to show that some "parallelization" could be
done. I'm not surprised that there are these oversights, and there's
probably a number of tiny optimizations that could be done. I was more
interested in the idea than the execution.

--Jonathan
 
C

Chris M. Thomasson

Jonathan Lee said:
Hi Chris,
I think I notice a buffer bounds bug... Consider:
...
What am I missing here Jonathan?

You're not missing anything; you're quite correct about the bounds.
Really the test on the for loop should be something like "i + sizeof
(long) <= bytesread".

Or perhaps better, just access the array as an array of unsigned longs
while (++i <= (bytesread/sizeof(long)))? At least on x86 processors,
the address calculation will be free. And the cmp/jbe will be paired.
Like,
[...]

Anyways, you get the idea. To be honest I just hacked the code
together in a few minutes to show that some "parallelization" could be
done. I'm not surprised that there are these oversights, and there's
probably a number of tiny optimizations that could be done. I was more
interested in the idea than the execution.

Yes; I totally the idea and think that it is a good one.
 
C

Chris M. Thomasson

Chris M. Thomasson said:
Jonathan Lee said:
Hi Chris,
I think I notice a buffer bounds bug... Consider:
...
What am I missing here Jonathan?

You're not missing anything; you're quite correct about the bounds.
Really the test on the for loop should be something like "i + sizeof
(long) <= bytesread".

Or perhaps better, just access the array as an array of unsigned longs
while (++i <= (bytesread/sizeof(long)))? At least on x86 processors,
the address calculation will be free. And the cmp/jbe will be paired.
Like,
[...]

Anyways, you get the idea. To be honest I just hacked the code
together in a few minutes to show that some "parallelization" could be
done. I'm not surprised that there are these oversights, and there's
probably a number of tiny optimizations that could be done. I was more
interested in the idea than the execution.

Yes; I totally the idea and think that it is a good one.
^


understand







Well, I need to type text, then
read/read/read/read/read/read/read/read/READ/READ



__BEFORE__




posting!!!




Ouch.



;^(...
 
J

James Kanze

[ ... ]
I think it's a good example of the sort of thing accumulate
should be usable for. Note, however, that accumulate will
normally copy your counter twice for each element, which isn't
going to help performance much. In the past, I've used a
somewhat hacky solution to avoid this:
Right -- as you've seen, I used a truly crufty hack to avoid
it, which worked in this case (at least with the compiler I
tested on -- I'd have to do a bit to work to be sure whether
it's really guaranteed to work).

I think it is, although I don't know if it is intentional. The
standard says very clearly that the function must execute acc =
acc + *i. This may be accidental overspecification, rather than
an explicit intent to allow "funny" operator+ (and operator=, of
course---note that your version still depends on std::vector
recognizing self-asignment, and not doing anything in that
case).

I'll admit that in this case, I rather prefer my solution, with
an explicit binary operator (rather than unexpected semantics
for the + operator), returning a special type, for which a no-op
assignment operator is defined. Both types (the operator and
its return type) having names which are suggestive enough to
discourage the user using them elsewhere. (In addition, in my
case, the class was carefully designed to contain only very
simple PODs, so that the default copy constructor and copy
assignment operator are appropriate. With the hope that the
compiler knows how to implement these better than I could.)
My comment about this being an abuse of accumulate wasn't
about the basic idea of using accumulate to um...accumulate
the statistics -- to me, that seemed perfectly
straightforward. It was about the contortion of having
operator+ return a reference to its input to reduce copying.

Yes. I've often wondered why std::accumulate uses +, and not
+=. Or more interestingly, use meta-programming so that it uses
+= if it exists, + otherwise, and simply "binary_op(acc, *i)",
if binary_op takes a non-const reference for the first argument,
and the version with assignment otherwise.

Regretfully, I think it's too late to fix it now.
[ ... ]
It will work for EBCDIC, but it won't work for ISO 8859-1
(which is probably the most widespread single byte
encoding).
Right -- as it stands, it assumes that all lower case letters
fall in the range 'a'..'z', which is true for ASCII and
EBCDIC, but not much of anything else. Of course, if you
wanted to make it work for ISO 8859, you'd just use UCHAR_MAX
instead, and use (or cast to) unsigned char instead of char in
a couple of places. You'd only be using a fairly small
fraction of the allocated memory, but for typical values of
UCHAR_MAX (e.g., 255) it's not a big deal, at least a long as
you avoid copying much.
After more thought, however, I suspect this might work better
for Unicode/ISO 10646 than I'd previously assumed, at least on
most typical systems. The problem is that the array usage is
quite sparse. As long as we're careful, however, we can get
the virtual memory manager to handle the sparseness for us.

Supposing we have enough memory:).
As long as we never "touch" -- not even intialize -- the parts
of the array that don't represent alphabetic characters, the
virtual memory manager won't (at least in a typical case)
allocate any physical memory for that address. Even though the
alphabetic characters aren't all arranged in one place, I
believe they are arranged into clusters -- which should work
pretty well for a virtual memory manager.
I think I might have to write a bit of test code, to figure
out how much memory this really uses with something like UCS-4
(which I think would be about the worst case).

There are two approachs, I think. I have a template class,
StateTable (a poor name, but that's what I first used it for),
which maps UTF-8 to a StateType, using a tri internally; it
allocates lazily, so sequences never seen are never allocated.
The other alternative, of course, is to convert on the fly to
UTF-32, and index directly. That means a table of roughly 8 MB
on most machines (8 bytes per counter), but as you say, there
will be large blocks which will never be accessed. (Outside of
the first 256 characters, blocks with letters are fairly
distinct from those with other types of characters.) If you
declare it static, the system should take care of initializing
it to 0; I don't know how different systems do this, however
(whether they touch the pages or not). Another frequent
solution is a two level table, a table of pointers indexed by
the top bits, to tables of counters indexed by the bottom bits;
if you use lazy allocation for the second level, you should also
save significant memory.

With any of these solutions, you'll also need a mapping table
for case folding. This shouldn't be that big, however, since
most alphabets (and particularly the big ones) don't have case.
(My StateTable for simple lower case mapping only requires 37
data blocks, of 64 entries each.)
 
I

Ioannis Vranos

Peter said:
Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

With kind regards,
Peter Jansson


I don't give major importants with the contest, but your timings don't look OK to me.


For example the code that you marked as fastest:


/*
CStopWatch s;
s.startTimer();
The_Stuff_You_Want_To_Time_Executes_Here();
s.stopTimer();
You may then get the elapsed time in seconds
via the method getElapsedTime.
*/
#ifdef WIN32
#include <windows.h>
class CStopWatch
{
private:
typedef struct {
LARGE_INTEGER start;
LARGE_INTEGER stop;
} stopWatch;
stopWatch timer;
LARGE_INTEGER frequency;
double LIToSecs( LARGE_INTEGER & L)
{
return ((double)L.QuadPart/(double)frequency.QuadPart);
}
public:
CStopWatch()
{
timer.start.QuadPart=0;
timer.stop.QuadPart=0;
QueryPerformanceFrequency( &frequency );
}
void startTimer() { QueryPerformanceCounter(&timer.start); }
void stopTimer() { QueryPerformanceCounter(&timer.stop); }
double getElapsedTime()
{
LARGE_INTEGER time;
time.QuadPart=timer.stop.QuadPart-timer.start.QuadPart;
return LIToSecs( time );
}
};
#else
#include <sys/time.h>
class CStopWatch
{
private:
typedef struct {
timeval start;
timeval stop;
} stopWatch;
stopWatch timer;
public:
void startTimer( ) { gettimeofday(&(timer.start),0); }
void stopTimer( ) { gettimeofday(&(timer.stop),0); }
double getElapsedTime()
{
timeval res;
timersub(&(timer.stop),&(timer.start),&res);
return res.tv_sec + res.tv_usec/1000000.0;
}
};
#endif



#include <limits.h>
#include <stdio.h>
#include <ctype.h>

#include <iostream>


int main()
{
size_t f [UCHAR_MAX+1] = { 0 };
size_t size = 0;
int c;

CStopWatch timer;

timer.startTimer();

while ( (c = getchar()) != EOF )
{
f [c]++;
size++;
}

for ( c = 0; c <= UCHAR_MAX; ++c )
if ( islower( c ) )
f [toupper( c )] += f [c];

for ( c = 0; c <= UCHAR_MAX; ++c )
if ( f [c] && !islower( c ) && isprint( c ) )
printf( "%c %.3f%%\n", c, f[c]*100.0/size );


timer.stopTimer();


std::cout<< std::fixed<< "\nIt took "<< timer.getElapsedTime()<< "seconds.\n\n";


return 0;
}


at the first run it produces:


john@john-laptop:~/Projects/anjuta/cpp/src$ g++ -ansi -pedantic-errors -Wall -O3 main.cc -o foobar
john@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar <LetterFrequencyInput.txt
16.602%
" 0.233%
' 0.068%
( 0.128%
) 0.171%
, 0.891%
- 0.068%
.. 0.620%
/ 0.057%
0 0.040%
1 0.080%
2 0.037%
3 0.026%
4 0.014%
5 0.014%
6 0.023%
7 0.023%
8 0.006%
9 0.011%
: 0.031%
; 0.048%
< 0.028%
A 5.454%
B 0.916%
C 3.315%
D 2.615%
E 9.179%
F 2.017%
G 1.494%
H 3.013%
I 6.163%
J 0.080%
K 0.504%
L 2.677%
M 1.866%
N 5.412%
O 7.395%
P 2.208%
Q 0.100%
R 6.200%
S 4.780%
T 6.954%
U 2.344%
V 0.930%
W 1.181%
X 0.159%
Y 1.838%
Z 0.031%
` 0.011%

It took 3.024299seconds.

john@john-laptop:~/Projects/anjuta/cpp/src$





and after some runs it produces:

john@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar <LetterFrequencyInput.txt
16.602%
" 0.233%
' 0.068%
( 0.128%
) 0.171%
, 0.891%
- 0.068%
.. 0.620%
/ 0.057%
0 0.040%
1 0.080%
2 0.037%
3 0.026%
4 0.014%
5 0.014%
6 0.023%
7 0.023%
8 0.006%
9 0.011%
: 0.031%
; 0.048%
< 0.028%
A 5.454%
B 0.916%
C 3.315%
D 2.615%
E 9.179%
F 2.017%
G 1.494%
H 3.013%
I 6.163%
J 0.080%
K 0.504%
L 2.677%
M 1.866%
N 5.412%
O 7.395%
P 2.208%
Q 0.100%
R 6.200%
S 4.780%
T 6.954%
U 2.344%
V 0.930%
W 1.181%
X 0.159%
Y 1.838%
Z 0.031%
` 0.011%

It took 2.846462seconds.

john@john-laptop:~/Projects/anjuta/cpp/src$



while mine with your timer:



/*
CStopWatch s;
s.startTimer();
The_Stuff_You_Want_To_Time_Executes_Here();
s.stopTimer();
You may then get the elapsed time in seconds
via the method getElapsedTime.
*/
#ifdef WIN32
#include <windows.h>
class CStopWatch
{
private:
typedef struct {
LARGE_INTEGER start;
LARGE_INTEGER stop;
} stopWatch;
stopWatch timer;
LARGE_INTEGER frequency;
double LIToSecs( LARGE_INTEGER & L)
{
return ((double)L.QuadPart/(double)frequency.QuadPart);
}
public:
CStopWatch()
{
timer.start.QuadPart=0;
timer.stop.QuadPart=0;
QueryPerformanceFrequency( &frequency );
}
void startTimer() { QueryPerformanceCounter(&timer.start); }
void stopTimer() { QueryPerformanceCounter(&timer.stop); }
double getElapsedTime()
{
LARGE_INTEGER time;
time.QuadPart=timer.stop.QuadPart-timer.start.QuadPart;
return LIToSecs( time );
}
};
#else
#include <sys/time.h>
class CStopWatch
{
private:
typedef struct {
timeval start;
timeval stop;
} stopWatch;
stopWatch timer;
public:
void startTimer( ) { gettimeofday(&(timer.start),0); }
void stopTimer( ) { gettimeofday(&(timer.stop),0); }
double getElapsedTime()
{
timeval res;
timersub(&(timer.stop),&(timer.start),&res);
return res.tv_sec + res.tv_usec/1000000.0;
}
};
#endif



#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// We disable synchronisation with stdio, to speed up C++ standard I/O.
ios_base::sync_with_stdio(false);

cout<< fixed;

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// The C++ basic character set is using the value range [0, 127].
// If we used vector<unsigned long>, it would not have any run-time difference in any modern compiler.
valarray<unsigned long> characterFrequencies(128);

// The array where the read characters will be stored.
char buffer[4* BUFSIZ]= {};

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";


CStopWatch timer;

timer.startTimer();


// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile)
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer ];

}while(inputFile);

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<< "\n\n\nThe letter frequencies are:\n";

unsigned long totalcharacterFrequencies= 0LU;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ];

for(string::size_type i= 0; i< characters.size(); ++i)
{
double percentage= static_cast<double>(characterFrequencies[ characters ]+ characterFrequencies[
tolower(characters) ])/ totalcharacterFrequencies* 100;

cout<< characters<< ": "<< percentage<< "%\n";
}

timer.stopTimer();


// We convert the timing to seconds.
double totalTimeInSeconds= timer.getElapsedTime();

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";

}


at the first run produces:


hn@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar LetterFrequencyInput.txt



The letter frequencies are:
A: 6.919578%
B: 1.162287%
C: 4.205169%
D: 3.317211%
E: 11.644528%
F: 2.559197%
G: 1.895033%
H: 3.822553%
I: 7.818366%
J: 0.101068%
K: 0.638897%
L: 3.396621%
M: 2.367889%
N: 6.865435%
O: 9.381317%
P: 2.801040%
Q: 0.126336%
R: 7.865290%
S: 6.064106%
T: 8.821831%
U: 2.974300%
V: 1.180335%
W: 1.497979%
X: 0.202137%
Y: 2.331793%
Z: 0.039705%


The whole process took 0.456543 seconds.


Have a nice day!
john@john-laptop:~/Projects/anjuta/cpp/src$



and after some runs it produces:


ohn@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar LetterFrequencyInput.txt



The letter frequencies are:
A: 6.919578%
B: 1.162287%
C: 4.205169%
D: 3.317211%
E: 11.644528%
F: 2.559197%
G: 1.895033%
H: 3.822553%
I: 7.818366%
J: 0.101068%
K: 0.638897%
L: 3.396621%
M: 2.367889%
N: 6.865435%
O: 9.381317%
P: 2.801040%
Q: 0.126336%
R: 7.865290%
S: 6.064106%
T: 8.821831%
U: 2.974300%
V: 1.180335%
W: 1.497979%
X: 0.202137%
Y: 2.331793%
Z: 0.039705%


The whole process took 0.482213 seconds.


Have a nice day!
john@john-laptop:~/Projects/anjuta/cpp/src$





I think this challenge is bogus.



--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

Thomas said:
The loop should be:

do {
// ...
} while (f);

instead of
/*...*/ while (!f.eof());

I wonder where this wrong usage of eof comes from. Is it some bad book
or something?


In C it is common to use the style:


int c;


while((c= getchar())!= EOF)
/* We perform operations. */


So I guess this is also applied to C++. It is not wrong to check for EOF in the condition, however checking
only for EOF, if something else wrong happens, you end up with an infitite loop.




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
B

Bart van Ingen Schenau

Ioannis said:
In C it is common to use the style:


int c;


while((c= getchar())!= EOF)
/* We perform operations. */


So I guess this is also applied to C++. It is not wrong to check for
EOF in the condition, however checking only for EOF, if something else
wrong happens, you end up with an infitite loop.
It should be noted that in the C I/O functions the value EOF is used to
signal either an end-of-file condition or a read/write error.

A return value of EOF is more or less equivalent with the boolean
inverse of iostream::good(). iostream::eof() is equivalent with feof().

Bart v Ingen Schenau
 
J

James Kanze

Thomas J. Gritzan wrote:
In C it is common to use the style:
while((c= getchar())!= EOF)
/* We perform operations. */

It's not common among professional programmers. The usual idiom
is:

int c = getchar() ;
while ( c != EOF ) {
// ...
c = getchar() ;
}

The equivalent in C++ is:

int c = std::cin.get() ;
while ( c != EOF ) {
// ...
c = std::cin.get() ;
}

or (better):

char c ;
while ( std::cin.get( c ) ) {
// ...
}
So I guess this is also applied to C++. It is not wrong to
check for EOF in the condition, however checking only for EOF,
if something else wrong happens, you end up with an infitite
loop.

There are two different things. When you get a character
(getchar(), std::istream::get()) into an int, the function
returns EOF if the input failed, not because end of file was
seen (unless seeing end of file caused the function to fail).
In particular, std::istream::get() may return a character even
if std::ios::eof() returns true, and may return EOF even if
std::ios::eof() returns false.

After you've seen the EOF value, you use feof()/ferror() or
std::ios::eof()/std::ios::bad() to find out why the last input
failed.
 
J

James Kanze

It should be noted that in the C I/O functions the value EOF
is used to signal either an end-of-file condition or a
read/write error.

Actually, it is used (in both C and C++) to signal that the
attempted input failed. It's obviously possible to have EOF
when feof()/std::ios::eof() returns false, and depending on the
implementation, it may be possible to have something other than
EOF when std::ios::eof() returns true. (I'm not sure about
feof() in this case.)
A return value of EOF is more or less equivalent with the
boolean inverse of iostream::good(). iostream::eof() is
equivalent with feof().

No. A return value of EOF is exactly equivalent to
std::ios::fail(). If std::istream::get() returns EOF, then
std::ios::fail() will be true; if it returns anything else, then
std::ios::fail() will be false.

std::ios::good() includes the eofbit in its return value, which
makes it for all intents and purposes useless.
 
I

Ioannis Vranos

Bart said:
It should be noted that in the C I/O functions the value EOF is used to
signal either an end-of-file condition or a read/write error.\\


Yes you are right, but people tend to forget that (including me), that's why you see iostream::eof() checks in
loops.

Personally I corrected my code in this thread regarding this.




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
L

ld

ohn@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar LetterFrequencyInput.txt

The letter frequencies are:
A: 6.919578%
B: 1.162287%
C: 4.205169%
D: 3.317211%
E: 11.644528%
F: 2.559197%
G: 1.895033%
H: 3.822553%
I: 7.818366%
J: 0.101068%
K: 0.638897%
L: 3.396621%
M: 2.367889%
N: 6.865435%
O: 9.381317%
P: 2.801040%
Q: 0.126336%
R: 7.865290%
S: 6.064106%
T: 8.821831%
U: 2.974300%
V: 1.180335%
W: 1.497979%
X: 0.202137%
Y: 2.331793%
Z: 0.039705%

The whole process took 0.482213 seconds.

2GB/sec, nice disk performance!
I think this challenge is bogus.

It measures the speed of the OS caches, not the speed of the disk
after the first run (but which one matters?). For example the
following code below run about 5 times faster than the fastest one
from the web site (i.e. Blarrg), after the first run. You can see it
easily:

1st run: 35 sec -> ~ 32 MB/sec (disk : 5400 rpm)
2nd+ run: 1.7 sec -> ~ 650 MB/sec (memory: 667 Mhz DDR2)

the cpu (2.5 Ghz) is not really loaded.

by comparison, Blarrg version takes 10 sec after the 1st run on my
laptop.

a+, ld.

$ time ./foobar LetterFrequencyInput.txt
16.602%
" 0.233%
' 0.068%
( 0.128%
) 0.171%
, 0.891%
- 0.068%
.. 0.620%
/ 0.057%
0 0.040%
1 0.080%
2 0.037%
3 0.026%
4 0.014%
5 0.014%
6 0.023%
7 0.023%
8 0.006%
9 0.011%
: 0.031%
; 0.048%
< 0.028%
A 5.454%
B 0.916%
C 3.315%
D 2.615%
E 9.179%
F 2.017%
G 1.494%
H 3.013%
I 6.163%
J 0.080%
K 0.504%
L 2.677%
M 1.866%
N 5.412%
O 7.395%
P 2.208%
Q 0.100%
R 6.200%
S 4.780%
T 6.954%
U 2.344%
V 0.930%
W 1.181%
X 0.159%
Y 1.838%
Z 0.031%
` 0.011%

-- 1st run

real 0m34.823s
user 0m2.507s
sys 0m1.465s

-- next runs

real 0m1.698s
user 0m1.253s
sys 0m0.444s

--- foobar.c

#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
static unsigned char buf[1UL<<20]; // \approx BUFSIZ*100;
size_t freq [UCHAR_MAX+1] = { 0 };
size_t i, n = 0;
FILE *fp;

fp = argc == 2 ? fopen(argv[1], "r") : argc == 1 ? stdin : 0;
if ( !fp ) {
fprintf(stderr, "usage: %s [file]\n", argv[0]);
return EXIT_FAILURE;
}

setvbuf(fp, (char*)buf, _IOFBF, sizeof buf);

for (;;) {
size_t n = fread(buf, 1, sizeof buf, fp);

if ( !n ) break;

while ( n ) ++freq[ buf[--n] ];
}

if ( !feof(fp) ) {
fprintf(stderr, "unexpected error while reading %s\n", fp ==
stdin ? "-" : argv[1]);
return EXIT_FAILURE;
}

fclose(fp);

for ( i = 0; i <= UCHAR_MAX; ++i ) {
if ( islower(i) )
freq[ toupper(i) ] += freq;
n += freq;
}

for ( i = 0; i <= UCHAR_MAX; ++i )
if ( freq && !islower(i) && isprint(i) )
printf("%c %.3f%%\n", (int)i, freq*100.0/n);

return EXIT_SUCCESS;
}
 
I

Ioannis Vranos

James said:
On Jun 17, 8:26 am, Bart van Ingen Schenau <[email protected]>

It's obviously possible to have EOF
when feof()/std::ios::eof() returns false, and depending on the
implementation, it may be possible to have something other than
EOF when std::ios::eof() returns true. (I'm not sure about
feof() in this case.)


I assume you are talking about the return value of getchar(). I do not think "it may be possible to have
something other than EOF when std::ios::eof() returns true".





--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 
I

Ioannis Vranos

ld said:
2GB/sec, nice disk performance!


Actually the text file is 274.6 MB, so the performance is ~2 *274.6 MBs/sec= 549.2 MBs/sec but I see your point.

It measures the speed of the OS caches, not the speed of the disk
after the first run (but which one matters?).


Yes you are right. However I think a code that fails to take advantage the OS's implicit caching for no
apparent purpose, is problematic.




--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,056
Latest member
GlycogenSupporthealth

Latest Threads

Top