Binary file IO: Converting imported sequences of chars to desired type


James Kanze

On Oct 26, 12:06 pm, James Kanze <[email protected]> wrote:
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.

Which binary format? There are quite a few to choose from.
With the portability edge taken away from text, there won't be
much reason to use text.

The main reason to use text is that it's an order of magnitude
easier to debug. And that's not likely to change.
 

Rune Allnor

The code was written very quickly, with no tricks or anything.

Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I have
missed it).

Rune
 

mzdude

The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
Is that text 8 bit ASCII, 16 bit, wchar_t, MBCS, UNICODE ... :^)
 

Mick

mzdude said:
Is that text 8 bit ASCII, 16 bit, wchar_t, MBCS, UNICODE ... :^)

Quill & Parchment.

--
 ------------
< I'm Karmic >
 ------------
       \
        \
           ___
         {~._.~}
          ( Y )
         ()~*~()
         (_)-(_)
 

Brian

Which binary format?  There are quite a few to choose from.

I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.

The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.

I was thinking that having a standard for binary would
help with debugging. I guess it is a tradeoff between
development costs and bandwidth costs.


Brian Wood
http://webEbenezer.net
 

Brian

I was thinking that having a standard for binary would
help with debugging.  I guess it is a tradeoff between
development costs and bandwidth costs.

Does this perspective seem accurate? Assuming the order
of magnitude is correct, the question becomes something
like this: language A takes 10 times longer to learn than
language B, but once you have learned A you can communicate
in 1/3 of the time it takes those using B. So those who
learn how to use A have an advantage over those who don't.


Brian Wood
 

Gerhard Fiedler

Rune said:
Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

[... Matlab code]

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds

In Matlab. This doesn't say much if anything about any other program.
Possibly Matlab has a lousy (in terms of speed) text IO.

Re the precision issue: When writing out text, there isn't really a need
to go decimal, too. Hex or octal numbers are also text. Speeds up the
conversion (probably not by much, but still) and provides a way to write
out the exact value that is in memory (and recreate that exact value --
no matter the involved precisions).
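
(A minimal sketch of that idea, using the C99 "%a" hexadecimal floating
point conversion, which most current C and C++ libraries provide; the
value 0.1 is just an example:)

#include <cassert>
#include <cstdio>
#include <cstdlib>

int main()
{
    double const original = 0.1;    // not exactly representable in binary

    // Write the value as a hexadecimal float.  "%a" is C99, but most
    // current implementations support it; the output is still text,
    // e.g. "0x1.999999999999ap-4", and loses no precision.
    char buffer[ 64 ];
    std::sprintf( buffer, "%a", original );

    // Read it back: strtod accepts the hexadecimal form as well (C99).
    double const restored = std::strtod( buffer, NULL );

    assert( restored == original ); // exact round trip, bit for bit
    std::printf( "%s -> %.17g\n", buffer, restored );
    return 0;
}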

Gerhard
 

James Kanze

Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I
have missed it).

I haven't posted it because it's on my machine at home (in
France), and I'm currently working in London, and don't have
immediate access to it. Redoing it here (from memory):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

class FileOutput
{
protected:
    std::string     my_type;
    std::ofstream   my_file;
    time_t          my_start;
    time_t          my_end;

public:
    FileOutput( std::string const& type, bool is_binary = true )
        : my_type( type )
        , my_file( ("test_" + type + ".dat").c_str(),
                   is_binary ? std::ios::out | std::ios::binary
                             : std::ios::out )
    {
        my_start = time( NULL );
    }
    ~FileOutput()
    {
        my_end = time( NULL );
        my_file.close();
        std::cout << my_type << ": "
                  << (my_end - my_start) << " sec." << std::endl;
    }

    virtual void output( double d ) = 0;
};

class RawOutput : public FileOutput
{
public:
    RawOutput() : FileOutput( "raw" ) {}
    virtual void output( double d )
    {
        my_file.write( reinterpret_cast< char* >( &d ), sizeof( d ) );
    }
};

class CookedOutput : public FileOutput
{
public:
    CookedOutput() : FileOutput( "cooked" ) {}
    virtual void output( double d )
    {
        unsigned long long const& tmp
            = reinterpret_cast< unsigned long long const& >( d );
        int shift = 64;
        while ( shift > 0 ) {
            shift -= 8;
            my_file.put( (tmp >> shift) & 0xFF );
        }
    }
};

class TextOutput : public FileOutput
{
public:
    TextOutput() : FileOutput( "text", false )
    {
        my_file.setf( std::ios::scientific,
                      std::ios::floatfield );
        my_file.precision( 17 );
    }
    virtual void output( double d )
    {
        my_file << d << '\n';
    }
};

template< typename File >
void
test( std::vector< double > const& values )
{
    File dest;
    for ( std::vector< double >::const_iterator iter = values.begin();
          iter != values.end();
          ++ iter ) {
        dest.output( *iter );
    }
}

int
main()
{
    size_t const size = 10000000;
    std::vector< double > v;
    while ( v.size() != size ) {
        v.push_back( (double)( rand() ) / (double)( RAND_MAX ) );
    }
    test< TextOutput >( v );
    test< CookedOutput >( v );
    test< RawOutput >( v );
    return 0;
}

Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
I get:
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is very
small. (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others. Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.) I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked. (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)
 

James Kanze

I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.
I was thinking that having a standard for binary would help
with debugging.

It might. It would certainly encourage tools for reading it.
On the other hand: we already have a couple of standards for
binary, and I haven't seen that many tools. Part of the reason
might be because one of the most common standards, XDR, is
basically untyped, so the tools wouldn't really know how to read
it anyway. (There are tools which display certain specific uses
of XDR in human readable format, e.g. tcpdump.)
 

James Kanze

Rune said:
Here is a test I wrote in matlab a few years ago, to
demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
[... Matlab code]
Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
In Matlab. This doesn't say much if anything about any other
program. Possibly Matlab has a lousy (in terms of speed) text
IO.

Obviously, not possibly. I get a factor of between 3 and 10,
depending on the compiler and the system. I get a significant
difference simply running what I think is the same program (more
or less) on two different machines, using the same compiler and
having the same architecture---one probably has a much higher
speed IO bus than the other, and that makes the difference.
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much, but
still) and provides a way to write out the exact value that is
in memory (and recreate that exact value -- no matter the
involved precisions).

But it defeats one of the major reasons for using text: human
readability.
 

Rune Allnor

Compiled with "cl /EHs /O2 timefmt.cc".  On my local disk here,
I get:
    text: 90 sec.
    cooked: 31 sec.
    raw: 9 sec.
The last is, of course, not significant, except that it is very
small.  (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others.  Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.)  I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked.  (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)

Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?

If so, the raw/cooked binary formats are a bit confusing.
According to this page,

http://publib.boulder.ibm.com/infoc...m.aix.progcomm/doc/progcomc/xdr_datatypes.htm

the XDR data type format uses "the IEEE standard" (I can find no
mention of exactly *which* IEEE standard...) to encode both single-
precision and double-precision floating point numbers.

IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754 format
essentially is a No-Op.

Even if XDR uses some other format than IEEE754, your numbers
show one significant effect:

1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.

Since you run the test with the same amounts of data on the
same local disk with the same delay factors, this factor of ~4
longer time spent on handling XDR data must be explained by
something other than mere disk IO.

The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.
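
(For comparison, a minimal sketch of the input side of such a "cooked"
format - the mirror image of the CookedOutput class posted above, with
the same non-portable assumptions: 64-bit unsigned long long and IEEE
doubles:)

#include <istream>

// Read one big-endian (XDR-style) double from src; returns false at
// end of file.  Same assumptions as CookedOutput above.
bool
readCooked( std::istream& src, double& d )
{
    unsigned long long tmp = 0;
    for ( int i = 0; i < 8; ++ i ) {
        int const c = src.get();
        if ( ! src ) {              // end of file or read error
            return false;
        }
        tmp = (tmp << 8) | static_cast< unsigned char >( c );
    }
    d = reinterpret_cast< double const& >( tmp );
    return true;
}

The native ("raw") case, by contrast, is a single
src.read( reinterpret_cast< char* >( &d ), sizeof( d ) ), which is
where the difference in handling comes from.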

The same happens - but on a larger scale - when dealing with
text-based formats:

1) Verify that the next sequence of characters represents a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number

True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where
one thread reads from disk and another thread converts the
formats while one waits for the next batch of data to arrive
from the disk, one has to do all of this sequentially in
addition to waiting for disk IO.

Nah, I still think that any additional non-trivial handling
of data will impact IO times of data. In single-thread
environments.

Rune
 

James Kanze

Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?

More or less. There are also caching effects, which I've not
tried to mask or control, which means that the results should be
taken with a grain of salt. More generally, there are a lot of
variables involved, and I've not made any attempts to control
any of them, which probably explains the differences I'm seeing
from one machine to the next.
If so, the raw/cooked binary formats are a bit confusing.
According to this page,

the XDR data type format uses "the IEEE standard" (I can find
no mention of exactly *which* IEEE standard...) to encode both
single- precision and double-precision floating point numbers.
IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754
format essentially is a No-Op.

I'm not sure what you're referring to. My "cooked" format is a
simplified, non-portable implementation of XDR---non portable
because it only works on machines which have 64-bit long longs and
use IEEE floating point.
Even if XDR uses some other format than IEEE754, your numbers
show one significant effect:
1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.

Again, that depends on the machine. On my tests at home, it
didn't. I've not had the occasion to determine where the
difference lies.
Since you run the test with the same amounts of data on the
same local disk with the same delay factors,

I don't know whether the delay factor is the same. A lot
depends on how the system caches disk accesses. A more
significant test would use synchronized writing, but
synchronized at what point?
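
(To illustrate the "at what point" question, a POSIX-specific sketch -
fsync and fileno are not standard C++:)

#include <cstddef>
#include <cstdio>
#include <unistd.h>     // fsync -- POSIX only

// Three different points at which a write might be considered "done".
void writeAndSync( std::FILE* fp, double const* data, std::size_t n )
{
    std::fwrite( data, sizeof( double ), n, fp );
    // 1) here the data may still sit in the stdio buffer;
    std::fflush( fp );
    // 2) here it has been handed to the OS, but may still be in the
    //    system's write-back cache;
    fsync( fileno( fp ) );
    // 3) only now has the OS been asked to push it to the physical disk.
}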
this factor of ~4 longer time spent on handling XDR data must
be explained by something other than mere disk IO.

*IF* there is no optimization, *AND* disk accesses cost nothing,
then a factor of about 4 sounds about right.
The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.
The same happens - but on a larger scale - when dealing with
text-based formats:
1) Verify that the next sequence of characters represents a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number

That's input, not output. Input is significantly harder for
text, since it has to be able to detect errors. For XDR, the
difference between input and output probably isn't significant,
since the only error that you can really detect is an end of
file in the middle of a value.
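
(A minimal sketch of what that error detection looks like on the text
input side - not code from the test program above, just an
illustration:)

#include <istream>
#include <vector>

// Read whitespace separated doubles until end of file; returns false
// if the stream contains something that isn't a valid number.
bool
readText( std::istream& src, std::vector< double >& dest )
{
    double d;
    while ( src >> d ) {
        dest.push_back( d );
    }
    return src.eof();   // failure before end of file means bad input
}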
True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where one
thread reads from disk and another thread converts the formats
while one waits for the next batch of data to arrive from the
disk, one has to do all of this sequentially in addition to
waiting for disk IO.
Nah, I still think that any additional non-trivial handling of
data will impact IO times of data. In single-thread
environments.

You can always use asynchronous IO:). And what if your
implementation of filebuf uses memory mapped files?

The issues are extremely complex, and can't easily be
summarized. About the most you can say is that using text I/O
won't increase the time more than about a factor of 10, and may
increase it significantly less. (I wish I could run the tests
on the drives we usually use---I suspect that the difference
between text and binary would be close to negligible, because of
the significantly lower data transfer rates.)
 

Gerhard Fiedler

James said:
But it defeats one of the major reasons for using text: human
readability.

Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex. The most important issue is that the
fields (mantissa sign, mantissa, exponent sign, exponent, etc.) are
decoded and appropriately presented. Whether the mantissa and the
exponent are then in decimal, octal or hexadecimal IMO doesn't make much
of a difference.

Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.

Gerhard
 

James Kanze

Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this type
of output (meant to be communication between programs) are
programmers, hence typically reasonably fluent in octal and
hex. The most important issue is that the fields (mantissa
sign, mantissa, exponent sign, exponent, etc.) are decoded and
appropriately presented. Whether the mantissa and the exponent
are then in decimal, octal or hexadecimal IMO doesn't make
much of a difference.

Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes. Separating out the mantissa and the
exponent is a simple and rapid compromise: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant difference.
Since what we're talking about is only relevant for huge
amounts of data, doing anything more with that data than just
a cursory look at some numbers (which IMO is fine in octal or
hex) generally needs a program anyway.

One would hope that you could start debugging with much smaller
sets of data. And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.
 

Rune Allnor

Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes.  Separating out the mantissa and the
exponent is a simple and rapid compromise: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant difference.


One would hope that you could start debugging with much smaller
sets of data.  And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.

So what do text-based formats actually buy you?

- Files are several times larger than binary dumps
- IO delays are several times (I'd say orders of magnitude)
slower for text than for binary
- Human users don't benefit from the text dumps anyway,
since they are too large to be useful
- Human readers would have to make an effort to
convert text dumps to a readable format

In the end, text formats require humans to do the same
work converting data to a readable format as would be
required with binary data, AND they add file sizes
and IO delays as additional nuisances.

Rune
 

Gerhard Fiedler

James said:
Agreed (sort of): I thought you were talking about outputting a hex
dump of the bytes. Separating out the mantissa and the exponent is a
simple and rapid compromize: it's not anywhere near as readable as
the normal format, but as you say, it should be sufficient for most
uses by a professional in the field.

I think the biggest advantage of doing it this way is that the text
representation makes it portable between different binary floating point
formats, and that the octal or hex representation avoids any rounding
problems and maintains the exact value, independently of precision and
other details of the binary representation (on both sides).
Having done that, however, I suspect that on most machines, outputting
the different fields in decimal, rather than hex, would probably not
make a significant difference.

That may well be. But the rounding aspect is still a problem.
One would hope that you could start debugging with much smaller sets
of data. And if you do end up one LSB off after reading, you'll
probably want to look at the exact value.

Sure. You always can use debug flags for outputting debug values.

Gerhard
 

James Kanze

I think the biggest advantage of doing it this way is that the
text representation makes it portable between different binary
floating point formats, and that the octal or hex
representation avoids any rounding problems and maintains the
exact value, independently of precision and other details of
the binary representation (on both sides).
That may well be. But the rounding aspect is still a problem.

No. You're basically outputting (and reading) two integers: the
exponent (expressed as a power of two), and the mantissa
(expressed as the actual value times some power of two,
depending on the number of bits). For an IEEE double, for
example, you'd do something like:

MyOStream&
operator<<( MyOStream& dest, double value )
{
    unsigned long long const& u
        = reinterpret_cast< unsigned long long const& >( value );
    dest << ((u & 0x8000000000000000ULL) != 0 ? '-' : '+')  // sign bit
         << (u & 0x000FFFFFFFFFFFFFULL) << 'b'              // 52 bit mantissa field
         << ((u >> 52) & 0x7FF);                            // 11 bit biased exponent
    return dest;
}
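
(A more portable sketch of the same sign/mantissa/exponent idea, with
no type punning - frexp and ldexp work for any binary floating point
format; NaN and infinity are not handled, and the standard stream types
stand in for MyOStream just for illustration:)

#include <cmath>
#include <istream>
#include <limits>
#include <ostream>

int const mantissaBits = std::numeric_limits< double >::digits; // 53 for IEEE doubles

// Write value as two exact integers, mantissa and binary exponent,
// so that value == mantissa * 2^exponent.
void
writeExact( std::ostream& dest, double value )
{
    int exp2;
    double const fraction = std::frexp( value, &exp2 );  // value == fraction * 2^exp2
    long long const mantissa
        = static_cast< long long >( std::ldexp( fraction, mantissaBits ) );
    dest << mantissa << 'b' << (exp2 - mantissaBits) << '\n';
}

// Read it back, reconstructing exactly the same value.
bool
readExact( std::istream& src, double& value )
{
    long long mantissa;
    int exp2;
    char separator;
    if ( src >> mantissa >> separator >> exp2 && separator == 'b' ) {
        value = std::ldexp( static_cast< double >( mantissa ), exp2 );
        return true;
    }
    return false;
}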
 

Rune Allnor

Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

As long as you keep two factors in mind:

1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth, etc.) are not yours (the programmer) to waste.

Those who want easy, not awfully challenging jobs might be
better off flipping burgers.

Rune
 
