Can't read all data off file accurately with fstream


Don Kim

I'm writing a program to read 2,000,000 floating point numbers off a text
file, to compute the sum, mean, and median. This is a direct example of
Stroustrup's paper.

But the program will not display the total number of elements beyond 23,272:

#include <vector>
#include <fstream>
#include <iostream>
#include <algorithm>
using namespace std;

int main(int argc, char* argv[])
{
    char* file = argv[2];
    vector<double> buf;

    double median = 0;
    double mean = 0;

    fstream fin("num.txt", ios::in); // open file for input
    double d;

    while (fin >> d) {
        buf.push_back(d);
        mean = (buf.size()==1) ? d : mean+(d-mean)/buf.size(); // prone to rounding errors
    }

    sort(buf.begin(), buf.end());

    if (buf.size()) {
        int mid = buf.size()/2;
        median = (buf.size()%2) ? buf[mid] : (buf[mid-1]+buf[mid])/2;
    }

    cout << "Number of elements = " << buf.size()
         << ", median = " << median << ", mean = " << mean << "\n";
}

Anyone have any ideas?

-Don Kim
 

Gianni Mariani

Don said:
Anyone have any ideas?

I ran your code unmodified on almost 20 million values.

Number of elements = 19315296, median = 3.4, mean = 3.03333

Are you sure you have 2 million elements in your file ?
 

wittempj

I also ran it with 2,000,000 numbers in the file on Windows XP, compiled the program
with gcc - no problems...
 

Don Kim

Gianni Mariani said:
Are you sure you have 2 million elements in your file ?

Yes.

I wrote a program to generate random numbers to a text file. Originally, I
had it generate the numbers to file like this:

98.989
72.585
58.986

When the numbers are floating point and formatted like that, I get the odd
behavior. But when I make the program generate integer numbers like this
instead:

98
75
58

The program runs correctly.

I'm running this on Windows XP, and tested it on VC 7.1, 8.0, gcc 3.3.3 and
Digital Mars 8.4.1 and get the same odd behavior on all compilers.

Puzzled as to why this is the case.

-Don Kim
 

Mike Wahler

Don Kim said:
Yes.

I wrote a program to generate random numbers to a text file. Originally, I
had it generate the numbers to file like this:

98.989
72.585
58.986

When the numbers are floating point and formatted like that, I get the odd
behavior. But when I make the program generate integer numbers like this
instead:

98
75
58

The program runs correctly.

I'm running this on Windows XP, and tested it on VC 7.1, 8.0, gcc 3.3.3 and
Digital Mars 8.4.1 and get the same odd behavior on all compilers.

Puzzled as to why this is the case.

Show the code that created the file.

-Mike
 

Don Kim

Mike Wahler said:
Show the code that created the file.

-Mike

Ok, here it is:

#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
    srand(time(0));

    cout << "Enter a number: ";
    int n;
    cin >> n;

    ofstream numfile("num.txt");
    for (int i = 0; i < n; i++)
    {
        //numfile << (rand() % 99)*((double)rand()/rand()) << "\n"; //to create floats
        numfile << rand() % 99 << "\n"; //to create ints
    }
    numfile.close();
}
 

Don Kim

On another issue, the code was adapted from Stroustrup's fine article
"Learning Standard C++ as a New Language", and running the C code against
the C++ code, my average running times on the program were as follows:

C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.

This was done on VC 7.1 on a P3 500 MHz, 800 MB RAM PC running WinXP with 5
million integer input values.

This seems to be the opposite of Stroustrup's results. I haven't run this
with other compilers yet.

-Don Kim

P.S. - Here's the C version for those interested:

// C-style solution:
#include <stdlib.h>
#include <stdio.h>

#include "timecpp.hpp"
using namespace timecpp;

int compare(const void* p, const void* q) // comparison function for use by qsort()
{
    register double p0 = *(double*)p; // compare doubles
    register double q0 = *(double*)q;
    if (p0 > q0) return 1;
    if (p0 < q0) return -1;
    return 0;
}

void quit() // write error message and quit
{
    fprintf(stderr, "memory exhausted\n");
    exit(1);
}

int main(int argc, char* argv[])
{
    timer t;
    t.start();

    int res = 1000; // initial allocation
    char* file = argv[2];

    double* buf = (double*)malloc(sizeof(double)*res);
    if (buf==0) quit();

    double median = 0;
    double mean = 0;
    int n = 0; // number of elements

    FILE* fin = fopen("num.txt", "r"); // open file for reading
    double d;
    while (fscanf(fin, "%lg", &d)==1) { // read number, update running mean
        if (n==res) {
            res += res;
            buf = (double*)realloc(buf, sizeof(double)*res);
            if (buf==0) quit();
        }
        buf[n++] = d;
        mean = (n==1) ? d : mean+(d-mean); // prone to rounding errors
    }

    qsort(buf, n, sizeof(double), compare);

    if (n) {
        int mid = n/2;
        median = (n%2) ? buf[mid] : (buf[mid-1]+buf[mid])/2;
    }

    printf("number of elements = %d, median = %g, mean = %g\n", n, median, mean);

    t.stop("Time: ");
    free(buf);
}
 

Gianni Mariani

Don said:
Ok, here it is: ....


ofstream numfile("num.txt");
for (int i = 0; i < n; i++)
{
    //numfile << (rand() % 99)*((double)rand()/rand()) << "\n"; //to create floats

What happens on divide by zero ?
 

Mike Wahler

Don Kim said:
Ok, here it is:

The problem is a 'range' error in your data creation.
I used your program to write 2,000,000 values to a
file. I loaded it into a text editor and visually
verifed that 2 million lines were actually written.
The screenful of values I saw looked OK, but I made
no assumptions.

Then I tried to read the file in with the input
program you posted, but I added an error check to
it:

while(fin >> d) {
    buf.push_back(d); ++count; /* I defined 'count' as a 'size_t' */
    mean = (buf.size()==1) ? d : mean+(d-mean)/buf.size();
}

/* this tells us whether the above loop terminated because of
   error or EOF */
if(!fin.eof())
{
    cerr << "input error (count == " << count << ")\n";
    cerr << "last value read == " << d << '\n';
}

For my test run I got output of:

input error (count == 85459)
last value read == 1
Number of elements = 85459, median = 41.573, mean = 299.587

So I loaded up the file in an editor, and looked at line
85459. It looked like this:

1.#INF


So of course it showed 'last value read' as 1, and the '#'
character put the stream in a 'fail' state (because that
is an invalid character for a floating point value), terminating
the 'while' loop.

Since you're generating values with 'rand()', of course
the exact point where this happens, and how many times it
happens, if any, can vary. Also, exactly what happens will
vary among implementations. I think this is really a case of
undefined behavior, which the compiler I used (MSVC++)
manifested as the "1.#INF" output. Another compiler might
do something completely different, and not necessarily
consistently.


Morals:

1. *Always* check for *all* possible failures of the functions you call.

2. *Never* make assumptions about the integrity of your test data sets.
Ensure that you *know* their exact content.

3. Always be thinking about the possibility of overflow/underflow
in your numeric objects, and of ways to protect against it.
The facilities provided by the <numeric_limits> header can
help with this.

4. When initially testing a program, it's best to use a known, static
data set, rather than a randomly created one. It's much easier
to determine if a result is correct if you know what it should
be in advance. Only after you've proven that should you move
on to things like random inputs.

HTH,
-Mike
 

Mike Wahler

Mike Wahler said:
3. Always be thinking about the possibility of overflow/underflow
in your numeric objects, and of ways to protect against it.
The facilities provided by the <numeric_limits> header can
help with this.

And as Gianni mentions, and I overlooked, protect from
divide by zero.

-Mike
 

Gianni Mariani

Don said:
On another issue, the code was adapted from Stroustrup's fine article
"Learning Standard C++ as a New Language", and running the C code against
the C++ code, my average running times on the program were as follows:

C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.

Take these numbers with a grain-o-salt. A number of other optimizations
should be considered. e.g. When both C and C++ versions are linked
static on the amd64 version, the times are the same.

AMD Athlon(tm) MP 2400+
gcc version 4.0.0 20050102

C++
Unoptimized: 11.4
Optimized: 7

C
Unoptimized: 8.6
Optimized: 7.6


model name : AMD Opteron(tm) Processor 248
gcc-3.4.2 amd64

C++
Unoptimized: 6.9
Optimized: 3.4

C
Unoptimized: 3.8
Optimized: 2.9


model name : AMD Athlon(tm) MP 2400+
stepping : 1
cpu MHz : 2000.085
gcc version 4.0.0 20050102 (experimental)

$ text_rdr_mkr #integers
Enter a number: 5000000
$ g++ -O0 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
11.430u 0.240s 0:11.66 100.0% 0+0k 0+0io 262pf+0w

$ g++ -O2 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
7.040u 0.270s 0:07.30 100.1% 0+0k 0+0io 259pf+0w

$ g++ -O3 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
6.900u 0.320s 0:07.19 100.4% 0+0k 0+0io 259pf+0w


$ gcc -O0 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
8.560u 0.180s 0:08.74 100.0% 0+0k 0+0io 130pf+0w

$ gcc -O2 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.630u 0.240s 0:07.89 99.7% 0+0k 0+0io 130pf+0w

$ gcc -O3 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.580u 0.280s 0:07.85 100.1% 0+0k 0+0io 130pf+0w

model name : AMD Opteron(tm) Processor 248
stepping : 10
cpu MHz : 2191.059
gcc-3.4.2

$ g++ -O0 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
6.866u 0.130s 0:07.01 99.7% 0+0k 0+0io 0pf+0w

$ g++ -O2 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
3.365u 0.105s 0:03.47 99.7% 0+0k 0+0io 0pf+0w

$ g++ -O3 -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 49.0195
3.342u 0.105s 0:03.44 100.0% 0+0k 0+0io 0pf+0w


$ gcc -O0 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
3.787u 0.084s 0:03.87 99.7% 0+0k 0+0io 0pf+0w

$ gcc -O2 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
2.908u 0.076s 0:02.98 99.6% 0+0k 0+0io 0pf+0w

$ gcc -O3 -o text_rdr_c text_rdr_c.c
$ time text_rdr_c
number of elements = 5000000, median = 49, mean = 48
2.908u 0.080s 0:02.99 99.6% 0+0k 0+0io 0pf+0w

Other optimizations
32bit

$ g++ -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr.cpp
$ time ./text_rdr
Number of elements = 5000000, median = 49, mean = 48.9781
5.240u 0.270s 0:05.50 100.1% 0+0k 0+0io 102pf+0w

$ gcc -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr_c.c
$ time ./text_rdr_c
number of elements = 5000000, median = 49, mean = 88
7.740u 0.250s 0:07.97 100.2% 0+0k 0+0io 130pf+0w

64bit
$ g++ -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr.cpp
Number of elements = 5000000, median = 49, mean = 49.0195
3.007u 0.106s 0:03.11 99.6% 0+0k 0+0io 0pf+0w

$ gcc -fPIC -O3 -finline-limit=5000 -static -o text_rdr text_rdr_c.c
$ time ./text_rdr_c
number of elements = 5000000, median = 49, mean = 48
3.107u 0.105s 0:03.21 99.6% 0+0k 0+0io 0pf+0w
 

Don Kim

Mike Wahler said:
The screenful of values I saw looked OK, but I made
no assumptions.

Excellent points. That's what my problem was... I made assumptions about
the data.

Thanks.

-Don Kim
 

Francis Glassborow

Don Kim said:
C version:

Unoptimized: 25 secs.
Optimized: 26 secs.

C++ version:

Unoptimized: 75 secs.
Optimized: 35 secs.

This was done on VC 7.1 on a P3 500 MHz, 800 MB Ram PC running WinXP with 5
million integer input values.

This seems to be the opposite of Stroustrup's results. I haven't run this
with other compilers yet.

-Don Kim


And one of the points Bjarne Stroustrup has frequently made in the past
is that the variability in performance between different implementations
is far too high. We are sometimes getting an order of magnitude
variation in performance for different implementations of the Standard
Library.

In this case I suspect that the problem lies in whether a compiler (with
its current compilation switches) is inlining small functions or not.
For a compiler such as VC++7.x the term 'optimised' has no meaning
because there are so many different optimisation options available. Note
also that according to your figures the optimisation options you chose
did nothing to improve the C version.
 
