Binary file IO: Converting imported sequences of chars to desired type


Rune Allnor

Hi all.

I have used the method from this page,

http://www.cplusplus.com/reference/iostream/istream/read/

to read some binary data from a file to a char[] buffer.

The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?

The naive C way would be to use memcpy(). Is there a
better C++ way?

Rune
 

Maxim Yegorushkin

Hi all.

I have used the method from this page,

http://www.cplusplus.com/reference/iostream/istream/read/

to read some binary data from a file to a char[] buffer.

The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?

The naive C way would be to use memcpy(). Is there a
better C++ way?

This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.

Another way is to read data directly into the aligned object:

float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
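And if you want to convert from the char[] buffer you already have, the memcpy() route is just as short. A minimal sketch (the function name is only illustrative; it assumes the bytes in the buffer use the same 4-byte float representation as the machine reading them, i.e. the same endianness and IEEE layout):

    #include <cstring>   // std::memcpy

    // buf points into the char[] filled by istream::read(); its first
    // four bytes hold the float. memcpy() is fine even if buf is not
    // suitably aligned for a float.
    float to_float(const char* buf)
    {
        float f;
        std::memcpy(&f, buf, sizeof f);
        return f;
    }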
 

James Kanze

On 17/10/09 18:39, Rune Allnor wrote:
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
Another way is to read data directly into the aligned object:
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);

Neither, of course, work, except in very limited cases.

To convert bytes written in a binary byte stream to any internal
format, you have to know the format in the file; if you also
know the internal format, and have only limited portability
concerns, you can generally do the conversion much faster. A
truly portable read requires use of ldexp, etc., but if you are
willing to limit your portability to machines using IEEE
(Windows and mainstream Unix, but not mainframes), and the file
format is IEEE, you can simply read the data as a 32-bit
unsigned int, then use reinterpret_cast (or memcpy).
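A sketch of that IEEE-only shortcut (untested, and the helper name is only illustrative; it assumes the file stores the value big-endian, XDR style, and that the local float is 32-bit IEEE):

    #include <stdint.h>
    #include <cstring>
    #include <istream>

    // Read 4 bytes, assemble them big-endian into a uint32_t, then copy
    // the bits into the float. Valid only when both the file format and
    // the local float are 32-bit IEEE.
    bool readIeeeFloat( std::istream& stream, float& dest )
    {
        unsigned char       bytes[ 4 ] ;
        if ( ! stream.read( reinterpret_cast< char* >( bytes ), 4 ) ) {
            return false ;
        }
        uint32_t            tmp
            =   (uint32_t( bytes[ 0 ] ) << 24)
              | (uint32_t( bytes[ 1 ] ) << 16)
              | (uint32_t( bytes[ 2 ] ) <<  8)
              |  uint32_t( bytes[ 3 ] ) ;
        std::memcpy( &dest, &tmp, sizeof dest ) ;
        return true ;
    }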

FWIW: the fully portable solution is something like:

class ByteGetter
{
public:
    explicit ByteGetter( ixdrstream& stream )
        : mySentry( stream )
        , myStream( stream )
        , mySB( stream.rdbuf() )
        , myIsFirst( true )
    {
        if ( ! mySentry ) {
            mySB = NULL ;
        }
    }

    //  Extracts one byte, setting failbit/eofbit (and badbit if the
    //  stream ends in the middle of a value) on end of file.
    uint8_t get()
    {
        int                 result = 0 ;
        if ( mySB != NULL ) {
            result = mySB->sbumpc() ;
            if ( result == EOF ) {
                result = 0 ;
                myStream.setstate( myIsFirst
                    ? std::ios::failbit | std::ios::eofbit
                    : std::ios::failbit | std::ios::eofbit
                        | std::ios::badbit ) ;
            }
        }
        myIsFirst = false ;
        return result ;
    }

private:
    ixdrstream::sentry  mySentry ;
    ixdrstream&         myStream ;
    std::streambuf*     mySB ;
    bool                myIsFirst ;
} ;

ixdrstream&
ixdrstream::operator>>(
    uint32_t&           dest )
{
    ByteGetter          source( *this ) ;
    //  Bytes are stored big-endian (network order) in the file.
    uint32_t            tmp = uint32_t( source.get() ) << 24 ;
    tmp |= uint32_t( source.get() ) << 16 ;
    tmp |= uint32_t( source.get() ) <<  8 ;
    tmp |= uint32_t( source.get() ) ;
    if ( *this ) {
        dest = tmp ;
    }
    return *this ;
}

ixdrstream&
ixdrstream::operator>>(
    float&              dest )
{
    uint32_t            tmp ;
    operator>>( tmp ) ;
    if ( *this ) {
        //  Decode IEEE 754 single precision: sign bit, 8 bit exponent,
        //  23 bit mantissa with implicit leading 1.
        float               f = 0.0 ;
        if ( (tmp & 0x7FFFFFFF) != 0 ) {
            f = ldexp( (tmp & 0x007FFFFF) | 0x00800000,
                       (int)((tmp & 0x7F800000) >> 23) - 126 - 24 ) ;
        }
        if ( (tmp & 0x80000000) != 0 ) {
            f = -f ;
        }
        dest = f ;
    }
    return *this ;
}

The above code still needs work to handle NaN's and Infinity
correctly, but it should give a good idea of what is necessary.

If you aren't concerned about machines which aren't IEEE, of
course, you can just memcpy the tmp after having read it in the
last function above, or use a reinterpret_cast to force the
types.
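In other words, for the IEEE-only case the float overload above reduces to something like this (a sketch; no ldexp needed):

    ixdrstream&
    ixdrstream::operator>>(
        float&              dest )
    {
        uint32_t            tmp ;
        operator>>( tmp ) ;                             //  byte order handled as above
        if ( *this ) {
            std::memcpy( &dest, &tmp, sizeof dest ) ;   //  bits are already IEEE
        }
        return *this ;
    }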
 

Rune Allnor

I have used the method from this page,

to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?

This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.

Another way is to read data directly into the aligned object:

     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);

The naive

std::vector<float> v;
for (n=0;n<N;++n)
{
    file.read(reinterpret_cast<char*>(&f), sizeof f);
    v.push_back(f);
}

doesn't work as expected. Do I need to call 'seekg'
inbetween?

Rune
 

Alf P. Steinbach

* Rune Allnor:
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.

Another way is to read data directly into the aligned object:

float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);

The naive

std::vector<float> v;
for (n=0;n<N;++n)
{
file.read(reinterpret_cast<char*>(&f), sizeof f);
v.push_back(f);
}

doesn't work as expected. Do I need to call 'seekg'
inbetween?

post complete code

cheers & hth

- alf
 

Rune Allnor

* Rune Allnor:




On 17/10/09 18:39, Rune Allnor wrote:
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
Another way is to read data directly into the aligned object:
     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive
std::vector<float> v;
for (n=0;n<N;++n)
{
   file.read(reinterpret_cast<char*>(&f), sizeof f);
   v.push_back(f);
}
doesn't work as expected. Do I need to call 'seekg'
inbetween?

post complete code

Never mind. The project was compiled in 'release mode'
with every optimization flag I could find set to 11.
No reason to expect the source code to have anything
whatsoever to do with what actually goes on.

Once I switched back to debug mode, I was able to
track the progress.
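For reference, a cleaned-up version of the loop (a sketch; N and the already opened file stream are assumed from the surrounding code):

    std::vector<float> v;
    v.reserve(N);
    float f;
    for (std::size_t n = 0; n < N; ++n)
    {
        if (!file.read(reinterpret_cast<char*>(&f), sizeof f))
            break;                  // stop on short read or stream error
        v.push_back(f);
    }
    // No seekg() is needed between reads: read() advances the get position.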

Rune
 

Maxim Yegorushkin

On 17/10/09 18:39, Rune Allnor wrote:
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
Another way is to read data directly into the aligned object:
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);

Neither, of course, work, except in very limited cases.

The assumption was that the float was written by the same program or a
program with a compatible binary API. Is that the case you meant in
"except in very limited cases"?
 

James Kanze

On 17/10/09 18:39, Rune Allnor wrote:
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?
The naive C way would be to use memcpy(). Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
Another way is to read data directly into the aligned object:
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, work, except in very limited cases.
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that the
case you meant in "except in very limited cases"?

More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.

Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bit with the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible to the in memory
image. That way, if worse comes to worse, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)
 

Jorgen Grahn

On Oct 18, 12:13 pm, Maxim Yegorushkin <[email protected]>
wrote: ....

More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.

Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bit with the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible to the in memory
image. That way, if worse comes to worse, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)

But if you have a choice, it's IMO almost always better to write the
data as text, compressing it first using something like gzip if I/O or
disk space is an issue.

(Loss of precision when printing decimal floats could be a problem in
this case though ...)

/Jorgen
 

James Kanze

But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.

Totally agreed. Especially for the maintenance programmer, who
can see at a glance what is being written.
(Loss of precision when printing decimal floats could be a
problem in this case though ...)

It's a hard problem in general. If the writer and the reader use
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.
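In the same-precision case, "enough digits" means max_digits10 of the type (17 for IEEE double, 9 for IEEE float). A minimal sketch of such a writer:

    #include <iomanip>
    #include <limits>
    #include <ostream>

    // Writing with max_digits10 significant digits guarantees that the
    // text, read back into the *same* floating point type, recovers the
    // exact value. (max_digits10 is C++11; with older compilers, use 17
    // directly for IEEE double.)
    void writeDouble( std::ostream& dest, double value )
    {
        dest << std::setprecision(
                    std::numeric_limits< double >::max_digits10 )
             << std::scientific << value << '\n' ;
    }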
 

Jorgen Grahn

It's a hard problem in general. If the writer and the reader use
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.

Good point; I didn't think of that aspect (i.e. not give a false
impression of precision when the input is e.g. 3.14 and you output
it as 3.14000000).

I was more thinking about reading "0.20000000000000000" but printing
0.20000000000000001. But now that I think of it, it's a loss of
precision in the input; there is no way to avoid it and still use
float/double internally.

/Jorgen
 

Rune Allnor

Totally agreed.  Especially for the maintenance programmer, who
can see at a glance what is being written.

The user might have opinions, though.

File I/O operations with text-formatted floating-point data
take time. A *lot* of time. The rule-of-thumb is 30-60 seconds
per 100 MBytes of text-formatted FP numeric data, compared to
fractions of a second for the same data (natively) binary encoded
(just try it).

In heavy-duty data processing applications one just can not afford
to spend more time than absolutely necessary. Text-formatted data
is not an option.

If there are problems with binary floating point I/O formats, then
that's a question for the C++ standards committee. It ought to be
a simple technical (as opposed to political) matter to specify that
binary FP I/O could be set to comply with some already defined
standard, like e.g. IEEE 754.

The matter isn't fundamentally different from setting locales and
character encodings with text files.

Rune
 

James Kanze

Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).

I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. Given enough digits and the same basic format, it's
always possible to make a round trip, writing, then reading, and
getting the exact value back (even if the value output isn't the
exact value).
I was more thinking about reading "0.20000000000000000" but
printing 0.20000000000000001.

For data communications, the problem occurs in the opposite
sense. Except that with enough digits (17 for IEEE double, I
think), it won't occur.
But now that I think of it, it's a loss of precision in the
input; there is no way to avoid it and still use float/double
internally.

But for this application, if you know how many digits are needed
to ensure correct reading, the loss of precision when reading
will exactly offset the error when writing.

The problem only comes up when you don't know the number of
digits in the reader's format. This is particularly an issue
with double, since the second most widely used format (IBM
mainframe double) has more digits precision than IEEE double,
and 17 digits probably won't be enough; you'll get something
very close, but it might not be the closest possible
representation. Which in this case would be exactly the
starting value---I think that IBM mainframe double precision can
represent all IEEE double values in range exactly. (Warning:
this is all very much off the top of my head. I've not done any
real analysis to verify the actual case of IBM floating point
versus IEEE. The problem can definitely occur, however, and it
wouldn't be difficult to imagine a 128 bit double format where
it did.)
 

James Kanze

[...]
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.

A lot of time compared to what? My experience has always been
that the disk IO is the limiting factor (but my data sets have
generally been very mixed, with a lot of non floating point data
as well). And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).

Try it on what machine:). Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentioned).

Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format, and on a
slow medium, that can make a real difference. (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)
In heavy-duty data processing applications one just can not
afford to spend more time than absolutely necessary.
Text-formatted data is not an option.

I'm working in such an application at the moment, and our
external format(s) are all text. And the conversions of the
individual values has never been a problem. (One of the formats
is XML. And our disks and network are fast enough that even
that hasn't been a problem.)
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply with
some already defined standard, like e.g. IEEE 754.

So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
 

Rune Allnor

    [...]
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.

A lot of time compared to what?

Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
 My experience has always been
that the disk IO is the limiting factor

Disk IO is certainly *a* limiting factor. But not the
only one. In this case it's not even the dominant one.
See the example below.
(but my data sets have
generally been very mixed, with a lot of non floating point data
as well).  And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.


Try it on what machine:).

Any machine. The problem is to decode text-formatted numbers
to binary.
 Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentioned).

Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,

This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes alone don't account for
the 50-100x difference in time.

Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])

t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])

t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])

t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.

The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc. The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes. The first few lines
in the text file look like

-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001

(one leading whitespace, one negative sign or whitespace,
no trailing spaces) which is not excessive, neither with
respect to the number of significant digits, nor the number
of other characters.

The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
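A roughly equivalent C++ test (writes only, to keep it short), for anyone
who wants to try it on their own setup. A sketch, not tuned; the absolute
and relative numbers will of course vary with machine, compiler and
library:

    #include <cstddef>
    #include <cstdlib>
    #include <ctime>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <vector>

    int main()
    {
        const std::size_t N = 10000000;
        std::vector<double> d(N);
        for (std::size_t i = 0; i < N; ++i)
            d[i] = std::rand() / (RAND_MAX + 1.0);

        // Text write: ~8 significant digits, comparable to the text file above.
        std::clock_t t0 = std::clock();
        {
            std::ofstream txt("test.txt");
            txt << std::scientific << std::setprecision(7);
            for (std::size_t i = 0; i < N; ++i)
                txt << d[i] << '\n';
        }
        std::clock_t t1 = std::clock();

        // Binary write: one raw dump of the whole array.
        {
            std::ofstream bin("test.raw", std::ios::binary);
            bin.write(reinterpret_cast<const char*>(&d[0]),
                      static_cast<std::streamsize>(N * sizeof(double)));
        }
        std::clock_t t2 = std::clock();

        // clock() measures CPU time, like cputime in the script above.
        std::cout << "Wrote ASCII data in  "
                  << double(t1 - t0) / CLOCKS_PER_SEC << " seconds\n"
                  << "Wrote binary data in "
                  << double(t2 - t1) / CLOCKS_PER_SEC << " seconds\n";
        return 0;
    }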
and on a
slow medium, that can make a real difference.  (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)


I'm working in such an application at the moment, and our
external format(s) are all text.  And the conversions of the
individual values has never been a problem.  (One of the formats
is XML.  And our disks and network are fast enough that even
that hasn't been a problem.)

The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order of

1e10/1.75e8*42s = 2400s = 40 minutes.

There is no point in even considering using a text format
for these kinds of things.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.

I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many 16-bit
encodings. No one says which one should be used. Only which
ones should be available.

Rune
 

Jorgen Grahn

Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).

I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. [...]

I guess I am demonstrating why I try to stay away from
floating-point ;-) It is a tricky area.

/Jorgen
 

James Kanze

On Oct 23, 9:07 am, Jorgen Grahn <[email protected]> wrote:
[...]
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.

The only comparison that is relevant is compared to some other
way of doing it.
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.

And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
See the example below.

Which will only be for one compiler, on one particular CPU, with
one set of compiler options.

(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering. The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Any machine. The problem is to decode text-formatted numbers
to binary.

You're giving concrete figures. "Any machine" doesn't make
sense in such cases: I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's. (I've worked on 8 bit machines
which took on an average 10 ųs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)

The compiler and the library implementation also make a
significant difference. I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15. Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++. The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)

In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB). For
10000000, it was around 45 seconds under Windows (file size 250
MB).

It's interesting to note that the Windows version is clearly IO
dominated. The difference in speed between text and binary is
pretty much the same as the difference in file size.
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes alone don't account for
the 50-100x difference in time.

There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Here is a test I wrote in matlab a few years ago, to
demonstrate the problem (WinXP, 2.4GHz, no idea about disk):

I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.

Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.

If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
The first few lines in the text file look like

(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, nor the number of other
characters.

It's not sufficient with regards to the number of digits. You
won't read back in what you've written.
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.

I did, and they aren't. They're actually very different in two
separate C++ environments.
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.

But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time). The time spent writing the
results, even in XML, is only a small part of the total runtime.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.

The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

(The big difference, of course, is that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Whereas a raw dump of double doesn't work even between a PC and
a Sparc. Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip. Upgrade your machine, and you lose
your data.)
 

Rune Allnor

The only comparison that is relevant is compared to some other
way of doing it.

OK. Text-based IO compard to binary IO.
You're giving concrete figures.

Yep. But as a rule of thumb. My point is not to be accurate
(you have made a very convincing case why that would be
difficult), but to point out what performance costs and
trade-offs are involved when using text-based file formats.
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).

I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune
what you have better than most users. Or both.
There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);

Again, your assets might not be representative of the
average user.
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?


Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.

I'll rephrase: Eliminates *variability* due to file caches.
Whatever happens affects both files in equal amounts. It would
bias results if one file was cached and the other not.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.

It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.
It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.

I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.
I did, and they aren't.  They're actually very different in two
separate C++ environments.


But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.

The read? The application I am talking about would require
a fair bit of number crunching. If I could process 1 hour's worth
of measurements in 20 minutes, I'd rather cash in the remaining
40 minutes as early results, rather than spend them waiting
for disk IO to complete.
The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

Yep. Some formats, like IEEE 754 (and maybe descendants),
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data in that format.
(The big difference, of course, is that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Whereas a raw dump of double doesn't work even between a PC and
a Sparc.  Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip.  Upgrade your machine, and you lose
your data.)

Exactly. Which is why there ought to be a standardized
binary floating point format that is portable between
platforms.

Rune
 

Brian

    [...]
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.

The only comparison that is relevant is compared to some other
way of doing it.
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.

And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
See the example below.

Which will only be for one compiler, on one particular CPU, with
one set of compiler options.

(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering.  The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Any machine. The problem is to decode text-formatted numbers
to binary.

You're giving concrete figures.  "Any machine" doesn't make
sense in such cases:  I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's.  (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)

The compiler and the library implementation also make a
significant difference.  I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15.  Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++.  The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)

In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).

It's interesting to note that the Windows version is clearly IO
dominated.  The difference in speed between text and binary is
pretty much the same as the difference in file size.
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes alone don't account for
the 50-100x difference in time.

There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Here is a test I wrote in matlab a few years ago, to
demonstrate the problem (WinXP, 2.4GHz, no idea about disk):

I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.

Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.

If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
The first few lines in the text file look like
 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, nor the number of other
characters.

It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.

I did, and they aren't.  They're actually very different in two
separate C++ environments.
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.

But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.




I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.

The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

I haven't invested in text or XML marshalling because
I think binary formats are going to prevail. With the
portability edge taken away from text, there won't be
much reason to use text.


Brian Wood
http://webEbenezer.net

"All things (e.g. A camel's journey through
A needle's eye) are possible it's true.
But picture how the camel feels, squeezed out
In one long bloody thread from tail to snout."

C. S. Lewis
 

James Kanze

Yep. But as a rule of thumb. My point is not to be accurate (you
have made a very convincing case why that would be difficult),
but to point out what performance costs and trade-offs are
involved when using text-based file formats.

The problem is that there is no real rule-of-thumb possible.
Machines (and compilers) differ too much today.
I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune what
you have better than most users. Or both.

The code was written very quickly, with no tricks or anything.
It was tested on off the shelf PC's---one admittedly older than
those most people are using, the other fairly recent. The
compilers in question were the version of g++ installed with
Suse Linux, and the free download version of VC++. I don't
think that there's anything in there that can be considered
"funky" (except maybe that most people professionally concerned
with high input have professional class machines to do it, which
are out of my price range), and I certainly didn't tune
anything.
Again, your assets might not be representative of the
average user.

Well, I'm not sure there's such a thing as an average user. But
my machines are very off the shelf, and I'd consider VC++ and
g++ very "average" as well, in the sense that they're what an
average user is most likely to see.
I'll rephrase: Eliminates *variability* due to file caches.

By choosing the best case, which rarely exists in practice.
Whatever happens affects both files in equal amounts. It would
bias results if one file was cached and the other not.

What is cached depends on what the OS can fit in memory. In
other words, the first file you wrote was far more likely to be
cached than the second.
It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.

Strictly speaking, a KB is exactly 1000 bytes, not 1024:). But
I know, different programs treat this differently.
I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.

It was a constraint. Explicitly. At least in this thread, but
more generally: about the only time it won't be a constraint is
when the files are for human consumption, in which case, I think
you'd agree, binary isn't acceptable.
The read?

I don't know. It's by some other applications, in other
departments, and I have no idea what they do with the data.

You're probably right, however, that to be accurate, I should do
some comparisons including reading. For various reasons (having
to deal with possible errors, etc.), the CPU overhead when
reading is typically higher than when writing.

But I'm really only disputing your order of magnitude
differences, because they don't correspond with my experience
(nor my measurements). There's definitely more overhead with
text format. The only question is whether that overhead is more
expensive than the cost of the alternatives, and that depends
on what you're doing. Obviously, if you can't afford the
overhead (and I've worked on applications which couldn't), then
you use binary, but my experience is that a lot of people jump
to binary far too soon, because the overhead isn't that critical
that often.
Yep. Some formats, like IEEE 754 (and maybe descendants),
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data in that format.

To date, neither C nor C++ has made the slightest gesture in the
direction of standardizing any binary formats. There are other
(conflicting) standards which do: XDR, for example, or BER. I
personally think that adding a second set of streams, supporting
XDR, to the standard, would be a good thing, but I've never had
the time to actually write up such a proposal. And a general
binary format is quite complex to specify; it's one thing to say
you want to output a table of double, but to be standardized,
you also have to define what is output when a large mix of types
are streamed, and how much information is necessary about the
initial data in order to read them.
Exactly. Which is why there ought to be a standardized binary
floating point format that is portable between platforms.

There are several: I've used both XDR and BER in applications in
the past. One of the reasons C++ doesn't address this issue is
that there are several, and C++ doesn't want to choose one over
the others.
 
