binary file parsing

Christopher

Been awhile since I've done this.

I am at a point in my binary file where there is a 4 byte integral
value I need to read. I don't think I am guaranteed an int is 4 bytes
by C++, although it probably is. So, how do I read in 4 bytes and then
make it a numerical value?

Should I read char by char, put them together with some bitwise
operation magic, then try to convert it to an integer?

Or is there an easier way of telling the ifstream to extract exactly
four bytes into an actual int?

I am going to have the same problem in a few minutes with floats too.
 
joshuamaurice

Been awhile since I've done this.

I am at a point in my binary file where there is a 4 byte integral
value I need to read. I don't think I am guaranteed an int is 4 bytes
by C++, although it probably is. So, how do I read in 4 bytes and then
make it a numerical value?

Should I read char by char, put them together with some bitwise
operation magic, then try to convert it to an integer?

Or is there an easier way of telling the ifstream to extract exactly
four bytes into an actual int?

I am going to have the same problem in a few minutes with floats too.

Google ios::binary.
 
Neelesh

Been awhile since I've done this.

I am at a point in my binary file where there is a 4 byte integral
value I need to read. I don't think I am guaranteed an int is 4 bytes
by C++, although it probably is. So, how do I read in 4 bytes and then
make it a numerical value?

Should I read char by char, put them together with some bitwise
operation magic, then try to convert it to an integer?

Or is there an easier way of telling the ifstream to extract exactly
four bytes into an actual int?

I am going to have the same problem in a few minutes with floats too.

The following code works for me (uses reinterpret_cast):

#include <iostream>
#include <fstream>

int main()
{
    // a.txt has an int followed by a double (binary format)
    std::ifstream f("a.txt", std::ios::in | std::ios::binary);
    char x[4];
    char y[8];
    f.read(x, 4); // Read 4 bytes into x; this will read the int
    f.read(y, 8); // Read 8 bytes into y; this will read the double
    int m = *(reinterpret_cast<int*>(x));       // reinterpret_cast simply
                                                // reinterprets the bit pattern
    double n = *(reinterpret_cast<double*>(y)); // ditto
    std::cout << m << std::endl; // prints the int correctly
    std::cout << n << std::endl; // prints the double correctly
}

Of course, since the mapping performed by reinterpret_cast is
implementation defined, I said "this works for me"; I use such a cast
when I do not need to write portable code.
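
(An aside on that cast: dereferencing the result of the reinterpret_cast
can also run into alignment and strict-aliasing problems, on top of
being implementation defined. A variant that makes exactly the same
layout assumptions, a 4-byte int, an 8-byte double, and the same byte
order on writer and reader, but avoids the pointer cast is to memcpy
the bytes into the target object; a minimal sketch:

#include <cstring>
#include <fstream>
#include <iostream>

int main()
{
    // Same assumed file layout as above: an int followed by a double.
    std::ifstream f("a.txt", std::ios::in | std::ios::binary);
    char x[4];
    char y[8];
    f.read(x, 4);
    f.read(y, 8);
    int m;
    double n;
    std::memcpy(&m, x, sizeof m); // copy the bit pattern; no pointer cast needed
    std::memcpy(&n, y, sizeof n);
    std::cout << m << std::endl;
    std::cout << n << std::endl;
}

The bytes are still interpreted however the host happens to lay objects
out, so this is no more portable than the cast; it just removes the
undefined behavior.)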
 
Christopher

Google ios::binary.

Do you realize how many contradictory opinions come up in a google
search for something so basic, 95% of which are probably wrong?
Here is what I found in my google search:
1) make a template function that converts an array of bytes into an
integral type (neat, but probably overkill)
2) use >> and extract to an int, don't worry about it (probably
wrong)
3) read byte by byte and bit shift into an integral result
(probably overly complicated and still depends on size)
4) let's use fstream.h and do it the deprecated way
(definitely wrong)
5) well, first we better check the endianness of the machine... (I
can assume the same OS wrote it that is reading it, in my case)

I bet most of those are either not necessary, overly complicated, or
just wrong.


I think a better source is:
http://www.parashift.com/c++-faq-lite/serialization.html#faq-36.6

However, it still fails to explain why we need to avoid the extraction
operator in favor of read and write, although that tidbit did answer
my question.


---------------------------
Here is my summation

1) There is no guarantee of the size of an integral type
2) The extraction operator, as I've always dealt with it, is going to
extract a number of bytes from the stream equivalent to the sizeof
the data type you are attempting to extract into; it will try to
convert it, and if it fails, will set the fail bit

So, it wouldn't be safe to

int numBytes; // 4 bytes make this value
file >> numBytes;

Because the authoring tool that wrote the file wrote 4 bytes and
depends on you reading 4 bytes; if any implementation of an int comes
along that isn't 4 bytes, your parser is broken.

Regardless of whether it is incorrect or not, the fail bit gets set in
my code, and I would like to understand why.

//---------------------------------------------------------------------------
void PolygonSetParser::parseFile(const std::string & filepath,
                                 ID3D10Device & device,
                                 InputLayoutManager & inputLayoutManager,
                                 EffectManager & effectManager,
                                 bool generateTangentData)
{
    BaseException exception("Not Set",
                            "void PolygonSetParser::parseFile(const std::string & filepath)",
                            "PolygonSetParser.cpp");

    // Open the file
    std::ifstream file(filepath.c_str(), std::fstream::in | std::fstream::binary);

    // Check if the file was successfully opened
    if( file.fail() )
    {
        exception.m_msg = std::string("Failed to open file: ") + filepath;
        throw exception;
    }

    // snip

    // Read in a data identifier
    unsigned char dataID;

    file >> dataID;

    while( file.good() )
    {
        switch( static_cast<int>(dataID) )
        {
            case MATERIAL_START:
            {
                parseMaterial(file);
                break;
            }

            default:
            {
                std::stringstream msg;
                msg << "Unknown data identifier encountered in file: "
                    << filepath << " at streampos " << file.tellg();

                exception.m_msg = msg.str();
                throw exception;
            }
        }

        file >> dataID;
    }
}

//---------------------------------------------------------------------------
void PolygonSetParser::parseMaterial(std::ifstream & file)
{
    BaseException exception("Not Set",
                            "void PolygonSetParser::parseMaterial(std::ifstream & file)",
                            "PolygonSetParser.cpp");

    // Get how many bytes to parse
    long numBytes; // 4 bytes make this value
    file >> numBytes;

    if( !file.good() ) // fails test
    {
        exception.m_msg = "Error parsing number of bytes to read for material";
        throw exception;
    }
}

First few bytes of file in hex:
01 24 01 01 00 00 16 43 00 00 16 43 00 00 16 43




So, I am going to try it this way:
int numBytes;
file.read(static_cast<unsigned char *>(&numBytes), 4);

Which looks like it solves my problem of worrying about reading 4
bytes and the conversion.
I'd still like to know why the extraction operator fails though, if
anyone can explain.
 
Christopher

So, I am going to try it this way:
int numBytes;
file.read(static_cast<unsigned char *>(&numBytes), 4);

Oops, spoke too soon. It looks like I do indeed have to
reinterpret_cast

// Read 4 byte unsigned integral value
unsigned numBytes;
file.read(reinterpret_cast<char *>(&numBytes), 4);

yucky.
 
James Kanze

Not at all. I'm aware of implementations where ints are 16, 36
or 48 bits. 16 bits is, or at least used to be, quite common
(and the machines using 36 and 48 bits are still being sold).

That's about it. I'm not aware of any other solution that
actually works, except in extremely limited cases.

No. The iostream abstraction is formatted text, and it only
supports formatting to and from text. If you have a binary
format, then you have to write the formatting yourself (or find
it somewhere on the net).

That can be a much larger problem, since the differences in
format can be much greater.

In practice, you need to do two things: define the format you
will be reading, and decide your portability requirements. If
you need to handle float, and need to be portable to just about
everything, the code is far from trivial. If the format you're
reading used IEEE floats, however (often the case), and you
don't have to worry about machines which use other floating
point formats (mainly---perhaps only---mainframes), then it is a
lot easier.
Google ios::binary.

Which is totally irrelevant here. (The file must be opened in
binary mode, but that won't change the rules iostream uses for
formatting.)
 
Pascal J. Bourguignon

Christopher said:
Oops, spoke too soon. It looks like I do indeed have to
reinterpret_cast

// Read 4 byte unsigned integral value
unsigned numBytes;
file.read(reinterpret_cast<char *>(&numBytes), 4);

I don't know.

If I were you, I'd check very closely what the standard says about
padding, and how it would be processed by reinterpret_cast.

Then I would check closely how well the implementations match the
specifications.


Now, the most important point is to know the format you have in your
file. How are the integers stored? Byte size, bytesex, bit order,
how signed numbers are encoded (two's complement, one's complement,
sign+magnitude, BCD, etc.)? Same for your floating point numbers.
Beware that bytesex still matters for IEEE 754.


Then the safest bet is to read the file byte by byte, and to build the
values from the bytes. I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.


For example, for a two's complement, 32-bit integer stored as four
8-bit bytes in big-endian order, I would do:

#include <limits.h>
#if CHAR_BITS<8
#error "Won't be able to read the bytes"
#endif
#if (CHAR_BITS*sizeof(int))<32
#if (CHAR_BITS*sizeof(long int))<32
#error "int too small (would need to use a bignum library)"
#else
typedef long int          integer;
typedef unsigned long int cardinal;
#endif
#else
typedef int          integer;
typedef unsigned int cardinal;
#endif
typedef unsigned char byte;

cardinal read_cardinal(FILE* stream){
    cardinal u=0;
    for(int i=0;i<4;i++){
        byte b=read_byte(stream);
        u=(u*256)+b; /* big endian */
    }
    return(u);
}

integer read_integer(FILE* stream){
    cardinal u=read_cardinal(stream);
    integer  i;
    if(u<=2147483647UL){  /* two's complement */
        i=u;
    }else{
        i=((integer)(u-2147483648UL))-2147483648L;
    }
    return(i);
}
 
James Kanze

On May 4, 1:21 am, (e-mail address removed) wrote:
Do you realize how many contradictory opinions come up in a
google search for something so basic, 95% of which are
probably wrong?
:)

Here is what I found in my google search:
1) make a template function that converts an array of bytes
into an integral type (neat but probably overkill)

I'm not sure I see the use of a template here, either. You
generally have at most two or three cases to deal with (2 bytes,
4 bytes, and once in a while, 8 bytes). And the metaprogramming
necessary for the templates is probably more work and more lines
of code than just writing the three functions.
2) use >> and extract to an int, don't worry about it (probably
wrong)

If the format is binary, certainly wrong.
3) read byte by byte and bit shift into an integral result
(probably overly complicated and still depends on size)

It's the only solution I know. And it's not that complicated.
4) let's use fstream.h and do it the deprecated way
(definitely wrong)

And it doesn't work any better than with <fstream>.
5) well, first we better check the endianness of the machine...
(I can assume the same OS wrote it that is reading it in my
case)

Endianness doesn't really depend on the OS. I've seen it change
from one version of the compiler to the next. And of course,
you really can't assume that your users are never going to
upgrade their hardware, which, depending on the evolution path,
could mean a lot of things. It's simpler just to do it right to
begin with.
I bet most of those are either not necessary, overly
complicated, or just wrong.
However, it still fails to explain why we need to avoid the
extraction operator in favor of read and write, although that
tidbit did answer my question.

The extraction operator formats/unformats. For a specific
format. (You can control it somewhat via flags, but it always
handles text.) For binary input, *if* you use istream or
ostream (I wouldn't, for a file that is completely binary), you
want to bypass the iostream formatting, using unformatted input
and output (read/get and write/put), and do your own formatting.
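
To make that concrete, here is a minimal sketch of "doing your own
formatting" with unformatted input, for a four-byte unsigned value
stored little-endian in the file, assuming 8-bit bytes (the function
name and the little-endian choice are illustrative, not something the
thread's file format dictates):

#include <istream>

unsigned long readUint32LE(std::istream & source)
{
    unsigned long value = 0;
    for (int i = 0; i < 4; ++i) {
        int byte = source.get(); // unformatted: no whitespace skipping, no parsing
        if (byte < 0) {          // end of file; the stream state is already set
            return 0;
        }
        value |= static_cast<unsigned long>(byte) << (8 * i);
    }
    return value;
}

For a big-endian format, shift the accumulated value left by 8 and OR
in each new byte instead.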
1) There is no guarantee of the size of an integral type
2) The extraction operator, as I've always dealt with it, is
going to extract a number of bytes from the stream equivalent
to the sizeof the data type you are attempting to extract
into; it will try to convert it, and if it fails, will set the
fail bit

No. The extraction operator will read bytes, interpret them as
characters, and convert the resulting string into the type you
want. If the format in the source file is not text based, they
simply won't work.
So, it wouldn't be safe to
int numBytes; // 4 bytes make this value
file >> numBytes;
Because the authoring tool that wrote the file wrote 4 bytes

That is *NOT* enough information. Four bytes doesn't mean
anything. You still have to know the format used.
and depends on you reading 4 bytes, if any implementation of
an int comes along that isn't 4 bytes, your parser is broken.

In general, if you try to read in a different format than that
was written, it's not going to work.
Regardless of whether it is incorrect or not, the fail bit gets
set in my code, and I would like to understand why.

Probably, the extractor didn't find what looked like an int.
The extractor skips whitespace, then looks for an optional sign,
followed by one or more digits. In whatever encoding is imbued
into the stream (typically either ISO 8859-1 or UTF-8 in the
environments I work in, but YMMV).
//---------------------------------------------------------------------------
void PolygonSetParser::parseFile(const std::string & filepath,
                                 ID3D10Device & device,
                                 InputLayoutManager & inputLayoutManager,
                                 EffectManager & effectManager,
                                 bool generateTangentData)
{
    BaseException exception("Not Set",
                            "void PolygonSetParser::parseFile(const std::string & filepath)",
                            "PolygonSetParser.cpp");
    // Open the file
    std::ifstream file(filepath.c_str(), std::fstream::in | std::fstream::binary);
    // Check if the file was successfully opened
    if( file.fail() )
    {
        exception.m_msg = std::string("Failed to open file: ") + filepath;
        throw exception;
    }

You also want to imbue the "C" locale, to ensure that no code
translation occurs. (This is the sort of thing that works in
your test programs, because the "C" locale is the default, and
your test programs don't need to change it, but fails in actual
code, because the application has switched to some other
locale.)
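
For instance, a minimal sketch of that, imbuing before the open since
the file's code-conversion facet is picked up when the file is opened:

std::ifstream file;
file.imbue(std::locale::classic()); // the "C" locale: no code translation
file.open(filepath.c_str(), std::ios::in | std::ios::binary);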
// Read in a data identifier
unsigned char dataID;
file >> dataID;

Note that I'd use get for this. And pass through an
intermediate int:

int dataId = file.get() ;
if ( dataId == EOF ) {
    // bad format...
}
while( file.good() )

No. ios::good() is generally useless.

If you're reading into an int, as above, you can use:

while ( dataId != EOF )

Otherwise (and more generally):

while ( file )

This is one of the reasons I wouldn't use istream, but would
implement my own ibinstream, or whatever. The ibinstream (which
would still derive from ios) would then define the extraction
operator to handle the format you're reading, setting the
various status bits as appropriate.
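
A minimal sketch of one possible shape for such a class, wrapping a
stream rather than deriving from ios so the example stays short (the
class name, the failed() helper, and the four-byte little-endian field
format are all illustrative assumptions, not the design described
above):

#include <istream>

class ibinstream
{
public:
    explicit ibinstream(std::istream & source) : mySource(source) {}

    // Extraction with binary semantics: a four-byte little-endian field.
    ibinstream & operator>>(unsigned long & value)
    {
        value = 0;
        for (int i = 0; i < 4; ++i) {
            int byte = mySource.get();
            if (byte < 0) {
                return *this; // the wrapped stream's failbit/eofbit is already set
            }
            value |= static_cast<unsigned long>(byte) << (8 * i);
        }
        return *this;
    }

    bool failed() const { return mySource.fail(); }

private:
    std::istream & mySource;
};

A real version would derive from ios, as described above, so that the
usual while ( file ) style tests work on the binary stream itself.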
    {
        switch( static_cast<int>(dataID) )
        {
            case MATERIAL_START:
            {
                parseMaterial(file);
                break;
            }
            default:
            {
                std::stringstream msg;
                msg << "Unknown data identifier encountered in file: "
                    << filepath << " at streampos " << file.tellg();

                exception.m_msg = msg.str();
                throw exception;
            }
        }
        file >> dataID;
    }
}
//---------------------------------------------------------------------------
void PolygonSetParser::parseMaterial(std::ifstream & file)
{
    BaseException exception("Not Set",
                            "void PolygonSetParser::parseMaterial(std::ifstream & file)",
                            "PolygonSetParser.cpp");
    // Get how many bytes to parse
    long numBytes; // 4 bytes make this value

Not on my machines. Long is usually 64 bits today (although 32,
36 and 48 bits are not unknown, and 32 bits was common in the
past).

Note that this is one of the reasons why you must pass through
explicit serialization. Even if long is 32 bits on your machine
today, it's likely that if your user upgrades, it will be 64
bits. Whereas the format of the data file will not change.
file >> numBytes;

This reads ASCII.
if( !file.good() ) // fails test

And again, this may fail even if the input succeeded. Just use:

if ( ! file )
{
    exception.m_msg = "Error parsing number of bytes to read for material";
    throw exception;
}
}
First few bytes of file in hex:
01 24 01 01 00 00 16 43 00 00 16 43 00 00 16 43
So, I am going to try it this way:
int numBytes;
file.read(static_cast<unsigned char *>(&numBytes), 4);
Which looks like it solves my problem of worrying about
reading 4 bytes and the conversion.

Except that it doesn't. It may seem to work in a particular
case, but it doesn't provide a general solution, and may fail
the next time you recompile your code.
I'd still like to know why the extraction operator fails
though, if anyone can explain.

A better question is why you would expect it to work.
 
James Kanze

Oops, spoke too soon. It looks like I do indeed have to
reinterpret_cast
// Read 4 byte unsigned integral value
unsigned numBytes;
file.read(reinterpret_cast<char *>(&numBytes), 4);

The reinterpret_cast is necessary because you're doing something
that won't normally work.
 
James Kanze

On May 4, 10:39 am, Christopher <[email protected]> wrote:
The following code works for me (uses reinterpret_cast):

#include <iostream>
#include <fstream>

int main()
{
    // a.txt has an int followed by a double (binary format)
    std::ifstream f("a.txt", std::ios::in | std::ios::binary);
    char x[4];
    char y[8];
    f.read(x, 4); // Read 4 bytes into x; this will read the int
    f.read(y, 8); // Read 8 bytes into y; this will read the double
    int m = *(reinterpret_cast<int*>(x));       // reinterpret_cast simply
                                                // reinterprets the bit pattern
    double n = *(reinterpret_cast<double*>(y)); // ditto
    std::cout << m << std::endl; // prints the int correctly
    std::cout << n << std::endl; // prints the double correctly
}

Of course, since the mapping performed by reinterpret_cast is
implementation defined, I said "this works for me"; I use such a
cast when I do not need to write portable code.

Only because you've been lucky (or unlucky). It will generally
work if the exact same executable is doing both the reading and
the writing. As soon as you recompile one or the other, it may
fail.
 
Christopher

I don't know.  

If I were you, I'd check very closely what the standard says about
padding, and how it would be processed by reinterpret_cast.

Then I would check closely how well the implementations match the
specifications.

Now, the most important point is to know the format you have in your
file. How are the integers stored? Byte size, bytesex, bit order,
how signed numbers are encoded (two's complement, one's complement,
sign+magnitude, BCD, etc.)? Same for your floating point numbers.
Beware that bytesex still matters for IEEE 754.

Then the safest bet is to read the file byte by byte, and to build the
values from the bytes.  I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.

For example, for a two's complement, 32-bit integer stored as four
8-bit bytes in big-endian order, I would do:

#include <limits.h>
#if CHAR_BITS<8
#error "Won't be able to read the bytes"
#endif
#if (CHAR_BITS*sizeof(int))<32
#if (CHAR_BITS*sizeof(long int))<32
#error "int too small (would need to use a bignum library)"
#else
typedef long int          integer;
typedef unsigned long int cardinal;
#endif
#else
typedef int          integer;
typedef unsigned int cardinal;
#endif
typedef unsigned char byte;

cardinal read_cardinal(FILE* stream){
    cardinal u=0;
    for(int i=0;i<4;i++){
        byte b=read_byte(stream);
        u=(u*256)+b; /* big endian */
    }
    return(u);
}

integer read_integer(FILE* stream){
    cardinal u=read_cardinal(stream);
    integer  i;
    if(u<=2147483647UL){  /* two's complement */
        i=u;
    }else{
        i=((integer)(u-2147483648UL))-2147483648L;
    }
    return(i);
}

I am trying to expand on that. I am having trouble though and got
stuck.

I want to make it a template function, because the program writing
this binary file frequently writes 1, 2, or 4 byte integral values;
however, a client of my library is not necessarily going to be storing
these values in the same number of bytes. Often a 1 byte value may
be stored in a 4 byte type, or perhaps even more than 4 bytes,
considering I am doing this in 64-bit...

So here is what I have so far:

template <typename IntegralType>
IntegralType IntegralTypeFromBytes(const char * bytes,
                                   const unsigned numBytes,
                                   const bool isSigned)
{
    if( CHAR_BIT != 8 || sizeof(char) != 1 )
    {
        // Error
    }

    if( sizeof(IntegralType) < numBytes )
    {
        // Error
    }

    if( !isSigned )
    {
        IntegralType u = 0;

        // Little Endian
        for(unsigned i = numBytes; i > 0; --i)
        {
            unsigned char b = bytes[i - 1];
            u = u * 256 + b;
        }

        return u;
    }
    else
    {
        // ??? (signed case)
    }
}

I don't know what to do in the signed case. Can you explain your
example a little more?
Does my code so far look OK?
 
Pascal J. Bourguignon

Christopher said:
[...]
I am trying to expand on that. I am having trouble though and got
stuck.

I want to make it a template function, because the program writing
this binary file frequently writes 1, 2, or 4 byte integral values;
however, a client of my library is not necessarily going to be storing
these values in the same number of bytes. Often a 1 byte value may
be stored in a 4 byte type, or perhaps even more than 4 bytes,
considering I am doing this in 64-bit...

I would keep separate the physical file format from the application
level types.

There's little point in writing templates to handle the different
integer sizes you may have in the file; after all, there are only
three or four sizes.

On the other hand, when you convert to or from application types to
file format types, you want to check boundaries, so you may use
boost::lexical_cast<> which will throw an exception if the value is
out of the range of the target type.

So here is what I have so far:

template <typename IntegralType>
IntegralType IntegralTypeFromBytes(const char * bytes,
                                   const unsigned numBytes,
                                   const bool isSigned)
{
    if( CHAR_BIT != 8 || sizeof(char) != 1 )

sizeof(char) is always 1, by definition. If it wasn't 1, it would
mean the compiler is broken.

    {
        // Error
    }

    if( sizeof(IntegralType) < numBytes )
    {
        // Error
    }

    if( !isSigned )
    {
        IntegralType u = 0;

        // Little Endian
        for(unsigned i = numBytes; i > 0; --i)
        {
            unsigned char b = bytes[i - 1];
            u = u * 256 + b;
        }

        return u;
    }
    else
    {
        // ??? (signed case)
    }
}

I don't know what to do in the signed case. Can you explain your
example a little more?


The principle of two's complement representation is to offset the
negative values above the positive ones, where the most significant
bit of the word is 1.

For example, if the word is 8 bits wide, then we can code 256 (2^8)
different values. Unsigned will go from 0 to 255. We split the range
in two, keeping the codes from 0 to 127 to encode the positive
values from 0 to 127, while the remaining codes, from 128 to 255, are
offset by -256 to cover the range of values from -128 to -1.

0        ... 127,      128      ... 255
00000000 ... 01111111, 10000000 ... 11111111
0        ... 127,      -128     ... -1

-128 = 128 - 256
-1   = 255 - 256

So when you have a code in two's complement on w bits, to convert it
into a signed value, you check whether the code is between 0 and
2^(w-1) exclusive, in which case the value is equal to the code; if
the code is greater than or equal to 2^(w-1), then the value is the
code minus 2^w.
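
To make the rule concrete with w = 8: the code 0x42 = 66 is below
2^7 = 128, so the value is just 66; the code 0x9C = 156 is at least
128, so the value is 156 - 256 = -100.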

Does my code so far look OK?

I would do something like the following (not tested, not even compiled):


#include <assert.h>
#include <stdio.h>

typedef unsigned char byte;

template <class NativeUnsignedInt,int ByteSize,int FileUnsignedIntBytes>
NativeUnsignedInt read_big_endian_cardinal(FILE* stream){
    assert(ByteSize<=CHAR_BITS);
    assert(ByteSize*FileUnsignedIntBytes<=CHAR_BITS*sizeof(NativeUnsignedInt));
    NativeUnsignedInt u=0;
    for(int i=0;i<FileUnsignedIntBytes;i++){
        byte b=read_byte(stream);
        u=(u<<ByteSize)+b; /* big endian */
    }
    return(u);
}

template <class NativeUnsignedInt,int ByteSize,int FileUnsignedIntBytes>
NativeUnsignedInt read_little_endian_cardinal(FILE* stream){
    assert(ByteSize<=CHAR_BITS);
    assert(ByteSize*FileUnsignedIntBytes<=CHAR_BITS*sizeof(NativeUnsignedInt));
    NativeUnsignedInt u=0;
    NativeUnsignedInt w=NativeUnsignedInt(1)<<(ByteSize*(FileUnsignedIntBytes-1));
    for(int i=0;i<FileUnsignedIntBytes;i++){
        byte b=read_byte(stream);
        u+=(b*w); /* little endian */
        w>>=ByteSize;
    }
    return(u);
}


#include <limits.h>

template<class T> unsigned long long maximumOf();
template<> inline unsigned long long maximumOf<char>(){ return(CHAR_MAX); }
template<> inline unsigned long long maximumOf<unsigned char>(){ return(UCHAR_MAX); }
template<> inline unsigned long long maximumOf<short>(){ return(SHRT_MAX); }
template<> inline unsigned long long maximumOf<unsigned short>(){ return(USHRT_MAX); }
template<> inline unsigned long long maximumOf<int>(){ return(INT_MAX); }
template<> inline unsigned long long maximumOf<unsigned int>(){ return(UINT_MAX); }
template<> inline unsigned long long maximumOf<long int>(){ return(LONG_MAX); }
template<> inline unsigned long long maximumOf<unsigned long int>(){ return(ULONG_MAX); }
template<> inline unsigned long long maximumOf<long long int>(){ return(LLONG_MAX); }
template<> inline unsigned long long maximumOf<unsigned long long int>(){ return(ULLONG_MAX); }

template<class T> long long minimumOf();
template<> inline long long minimumOf<char>(){ return(CHAR_MIN); }
template<> inline long long minimumOf<unsigned char>(){ return(0); }
template<> inline long long minimumOf<short>(){ return(SHRT_MIN); }
template<> inline long long minimumOf<unsigned short>(){ return(0); }
template<> inline long long minimumOf<int>(){ return(INT_MIN); }
template<> inline long long minimumOf<unsigned int>(){ return(0); }
template<> inline long long minimumOf<long int>(){ return(LONG_MIN); }
template<> inline long long minimumOf<unsigned long int>(){ return(0); }
template<> inline long long minimumOf<long long int>(){ return(LLONG_MIN); }
template<> inline long long minimumOf<unsigned long long int>(){ return(0); }

template <class NativeSignedInt,class NativeUnsignedInt,int ByteSize,int FileUnsignedIntBytes>
NativeSignedInt two_complement(NativeUnsignedInt u){
    /* two's complement */
    NativeSignedInt i;
    if(u<=((1ULL<<(ByteSize*FileUnsignedIntBytes-1))-1)){
        assert(u<=maximumOf<NativeSignedInt>());
        i=u;
    }else{
        assert((1ULL<<(ByteSize*FileUnsignedIntBytes-1))+minimumOf<NativeSignedInt>()
               <=(u-(1ULL<<(ByteSize*FileUnsignedIntBytes-1))));
        i=((NativeSignedInt)(u-(1ULL<<(ByteSize*FileUnsignedIntBytes-1))))
          -(NativeSignedInt)((1ULL<<(ByteSize*FileUnsignedIntBytes-1))-1)-1;
    }
    return(i);
}

template <class NativeSignedInt,class NativeUnsignedInt,int ByteSize,int FileUnsignedIntBytes>
NativeSignedInt read_big_endian_integer(FILE* stream){
    return(two_complement<NativeSignedInt,NativeUnsignedInt,ByteSize,FileUnsignedIntBytes>(
        read_big_endian_cardinal<NativeUnsignedInt,ByteSize,FileUnsignedIntBytes>(stream)));
}

template <class NativeSignedInt,class NativeUnsignedInt,int ByteSize,int FileUnsignedIntBytes>
NativeSignedInt read_little_endian_integer(FILE* stream){
    return(two_complement<NativeSignedInt,NativeUnsignedInt,ByteSize,FileUnsignedIntBytes>(
        read_little_endian_cardinal<NativeUnsignedInt,ByteSize,FileUnsignedIntBytes>(stream)));
}



Here, the asserts in two_complement check the native type range
when going from unsigned to signed. You could call:

short s=read_little_endian_integer<short,unsigned long long,8,8>(stream);

if you had a short stored as a 64-bit two's complement value in
little-endian order.
 
James Kanze

Christopher <[email protected]> writes:

[...]
I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.

I fully agree with the rest of what you wrote, but I'm curious
about this. If the format is specified in terms of bits
(usually the case), it would seem to me that the bit operations
are more appropriate, in the sense that they are closer to the
specification. IOW, for a 32 bit unsigned integer: if the
format specification says that the first octet contains the
value divided by 16777216, the second the value divided by
65536, modulo 256, etc., then I'd use arithmetic operators. If
it says that the first octet contains the bits 24-31, the second
the bits 16-23, etc., I'd probably use bit operations (shifting
and masking).

Of course, the distinction isn't always that clear. For a
floating point value, the format probably specifies where the
sign, exponent and mantissa fields are located in terms of bits,
but the semantic value of each field will be specified
mathematically. So I'd naturally use bitwise operations to
extract the fields from the input, then mathematical functions
(e.g. ldexp) to merge them into the final value. Something like
(assume that I've set up source to handle errors like unexpected
end of file correctly):

unsigned char byte = source.get() ;
bool isNegative = (byte & 0x80) != 0 ;
int exponent = (byte & 0x7F) << 1 ;
int mantissa = 0 ;
byte = source.get() ;
exponent |= byte >> 7 ;
mantissa = (byte & 0x7F) << 16 ;
mantissa |= source.get() << 8 ;
mantissa |= source.get() ;
float result = 0.0F ;
if ( exponent != 0 ) {
    result = ldexp( mantissa | 0x00800000,
                    exponent - 126 - 24 ) ;
} else {
    result = ldexp( mantissa, -126 - 23 ) ;
}
if ( isNegative ) {
    result = -result ;
}

(This supposes input conform to XDR, and should work on any
machine where the results don't overflow. It doesn't support
NaN or Infinity, however---that would require extra handling.)
For example, for a two's complement, 32-bit integer stored as four
8-bit bytes in big-endian order, I would do:
#include <limits.h>
#if CHAR_BITS<8
#error "Won't be able to read the bytes"

#error "Isn't a legal C/C++ implementation"
#endif
#if (CHAR_BITS*sizeof(int))<32
#if (CHAR_BITS*sizeof(long int))<32
#error "int too small (would need to use a bignum library"
#else
typedef long int integer;
typedef unsigned long int cardinal;
#endif
#else
typedef int integer;
typedef unsigned int cardinal;
#endif
typedef unsigned char byte;

I'm lazy; for the moment, I just include <stdint.h>, and punt if
uint32_t and int32_t aren't defined. Otherwise, if <stdint.h>
(or <cstdint> in C++0x) is available, uint_fast32_t and
int_fast32_t should do the job. (I don't like the name
"cardinal", because C++'s unsigned doesn't really behave like a
"cardinal", which should be a sub-range of "integer", and behave
rationally when compared with "integer".)
cardinal read_cardinal(FILE* stream){
    cardinal u=0;
    for(int i=0;i<4;i++){
        byte b=read_byte(stream);
        u=(u*256)+b; /* big endian */
    }
    return(u);
}
integer read_integer(FILE* stream){
    cardinal u=read_cardinal(stream);
    integer i;
    if(u<=2147483647UL){ /* two's complement */
        i=u;
    }else{
        i=((integer)(u-2147483648UL))-2147483648L;
    }
    return(i);
}

In the end, the real question is how portable do you want (or
need) to be. For a lot of people, it's probably acceptable to
suppose that 1) there is a 32 bit integral type, using 2's
complement, and 2) that conversion unsigned to signed int
doesn't change the bit pattern. In such cases, you can make
reading an integer a lot simpler. Similarly, if in addition you
can suppose IEEE floating point, my floating point read can be
made a lot, lot simpler---just put the four bytes in an array,
and memcpy into the double. (Currently, there are only a few
exotic machines which don't have a 32 bit 2's complement
integral type. None of the mainframes I know use IEEE floating
point, however.) I would stress, however, that if you make such
simplifying assumptions, you document them clearly and
explicitly.
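
For instance, under exactly those documented assumptions (IEEE float
on both ends, four 8-bit bytes, file byte order matching host byte
order), the whole float read collapses to a sketch like this, with an
illustrative function name:

#include <cstring>
#include <istream>

float readIeeeFloat(std::istream & source)
{
    char buffer[4];
    source.read(buffer, 4); // exactly four bytes, unformatted
    float result;
    std::memcpy(&result, buffer, sizeof result); // assumes sizeof(float) == 4
    return result;
}

If the file's byte order differs from the host's, reverse buffer
before the memcpy; everything else stays the same.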
 
James Kanze

Christopher wrote:
A template seems like overkill. Use an unsigned long or
unsigned long long to accumulate the values from the bytes and
return the result. Yes, stupid compilers might warn you that
the value is being truncated if you store it into a smaller
integer type, but you're not writing code to please some
compiler writer's notion of good coding style.

He still needs different versions for different lengths in the
source file. He could use a loop for that, but it's probably
not much more difficult to maintain the 3 or 4 functions
separately, each returning the correct type. (This also makes
handling the sign a little bit easier.)

As for range checking, this really depends on higher level code
anyway. If the protocol is encoding e.g. Sudoku boards, it's
likely that the legal range for some integral values in the file
will be 1-9 (or 0-9, if you use 0 for an open square). You
really don't want to encode this sort of information in the
routine reading an int (or an unsigned char, or whatever).
 
Pascal J. Bourguignon

James Kanze said:
[...]
I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.

I fully agree with the rest of what you wrote, but I'm curious
about this. If the format is specified in terms of bits
(usually the case), it would seem to me that the bit operations
are more appropriate, in the sense that they are closer to the
specification. IOW, for a 32 bit unsigned integer: if the
format specification says that the first octet contains the
value divided by 16777216, the second the value divided by
65536, modulo 256, etc., then I'd use arithmetic operators. If
it says that the first octet contains the bits 24-31, the second
the bits 16-23, etc., I'd probably use bit operations (shifting
and masking).

For integers, I assume it wouldn't make a lot of difference. The
compiler probably generates the same code for n<<8 and for n*256
anyway. You'd just have to be careful not to leave uninitialized
bits in the case where the word size of the host is bigger than that
of the file (e.g. reading a 32-bit integer on a 36-bit host),
particularly for signed integers where you may need to do a sign
extend. That's where using the subtraction to convert a two's
complement value seems to me to be more interesting than just using bit-or.

Of course, the distinction isn't always that clear. For a
floating point value, the format probably specifies where the
sign, exponent and mantissa fields are located in terms of bits,
but the semantic value of each field will be specified
mathematically. So I'd naturally use bitwise operations to
extract the fields from the input, then mathematical functions
(e.g. ldexp) to merge them into the final value. Something like
(assume that I've set up source to handle errors like unexpected
end of file correctly):

unsigned char byte = source.get() ;
bool isNegative = (byte & 0x80) != 0 ;
int exponent = (byte & 0x7F) << 1 ;
int mantissa = 0 ;
byte = source.get() ;
exponent |= byte >> 7 ;
mantissa = (byte & 0x7F) << 16 ;
mantissa |= source.get() << 8 ;
mantissa |= source.get() ;
float result = 0.0F ;
if ( exponent != 0 ) {
    result = ldexp( mantissa | 0x00800000,
                    exponent - 126 - 24 ) ;
} else {
    result = ldexp( mantissa, -126 - 23 ) ;
}
if ( isNegative ) {
    result = -result ;
}

(This supposes input conform to XDR, and should work on any
machine where the results don't overflow. It doesn't support
NaN or Infinity, however---that would require extra handling.)

Absolutely. The file format will be specified at the bit or byte
level and will have to be processed with bit operations. But to build
the host value, integer or floating-point, it's safer to use
arithmetic operations such as ldexp() and negation.

#error "Isn't a legal C/C++ implementation"

Are you sure? AFAIK, char may be 6 bits; there are trigraphs to deal
with such hosts. But I haven't read the recent C standards.

In the end, the real question is how portable do you want (or
need) to be. For a lot of people, it's probably acceptable to
suppose that 1) there is a 32 bit integral type, using 2's
complement, and 2) that conversion unsigned to signed int
doesn't change the bit pattern. In such cases, you can make
reading an integer a lot simpler. Similarly, if in addition you
can suppose IEEE floating point, my floating point read can be
made a lot, lot simpler---just put the four bytes in an array,
and memcpy into the double. (Currently, there are only a few
exotic machines which don't have a 32 bit 2's complement
integral type. None of the mainframes I know use IEEE floating
point, however.) I would stress, however, that if you make such
simplifying assumptions, you document them clearly and
explicitly.

Of course, this is a question of specification.

If all that is called for is to be able to save and load binary data
on the same computer, we could just memory map the file.

However, I see one problem in relying on specifications: they are not
formally defined, and cannot automatically be checked (in general; I
have yet to see a real project using formal specification tools, seems
I'm not in the right industry for this kind of tool :-( ). So the
problem is that you will clearly document that your lateral
accelerometer outputs 16-bit values; you will document it and write it
all over. But when another team reuses your module, and embeds it in
hardware able to support lateral accelerations that need to be
expressed in more than 16 bits, no compiler will be there to check the
specification mismatch, and you'll lose costly hardware and perhaps
lives. OK, you've specified and documented the limits of your module,
so nobody can reproach you anything; nonetheless, this didn't prevent
a catastrophe.

So perhaps, and that's my point of view, if you don't assume a
restricted context, you can write your code to handle the unexpected
environment and avoid such problems, or at least have the mismatch
detected at compilation time.
 
joshuamaurice

Are you sure? AFAIK, char may be 6 bits; there are trigraphs to deal
with such hosts. But I haven't read the recent C standards.

It's technically nonconforming, at least for C++. I don't have the C
standard handy, but C++03 specifically says that CHAR_BITS must be 8
or larger.
 
Alf P. Steinbach

* (e-mail address removed):
It's technically nonconforming, at least for C++. I don't have the C
standard handy, but C++03 specifically says that CHAR_BITS must be 8
or larger.

Assuming you meant CHAR_BIT, where did you find that requirement in C++0x?

The C++ standard has an implicit size requirement on the bit size of 'char' via
§3.9.1/1, which requires a char to be able to hold any character in the
implementation's basic character set. That is a bit unclear, but anyway AFAIUI
it must refer to one of the character sets denoted "basic" in §2.2, and any of
those have at least 96 characters, which implies at least 7 bits. However, the
C++ standard refers to the C standard. And the C standard requires that CHAR_BIT
is minimum 8, in §5.2.4.2.1/1.

So, summing up, as far as I can see it's CHAR_BIT, not CHAR_BITS, and it's the C
standard, not the C++ standard.


Cheers,

- Alf
 
James Kanze

James Kanze said:
Christopher <[email protected]> writes:
[...]
I'd rather use arithmetic operations since
this would avoid the need for reinterpreting bits.
I fully agree with the rest of what you wrote, but I'm
curious about this. If the format is specified in terms
of bits (usually the case), it would seem to me that the
bit operations are more appropriate, in the sense that
they are closer to the specification. IOW, for a 32 bit
unsigned integer: if the format specification says that
the first octet contains the value divided by 16777216,
the second the value divided by 65536, modulo 256, etc.,
then I'd use arithmetic operators. If it says that the
first octet contains the bits 24-31, the second the bits
16-23, etc., I'd probably use bit operations (shifting
and masking).
For integers, I assume it wouldn't make a lot of
difference.

For anything: if the specification says "this byte
corresponds to bits 8-15", I find shifting left 8 more
intuitive (closer to what is written in the specification)
than multiplying by 256.
The compiler probably generates the same code
for n<<8 and for n*256 anyway. You'd just have to be
careful not to leave uninitialized bits in the case where
the word size of the host is bigger than that of the file
(e.g. reading a 32-bit integer on a 36-bit host),
particularly for signed integers where you may need to do
a sign extend. That's where using the subtraction to
convert a two's complement value seems to me to be more
interesting than just using bit-or.

That's a different issue. You use shifting, etc. to extract
the sign bit and the value bits, because that's how their
location in the input data is specified: by the bit position
of the data. You use arithmetic operations to create the
final value from the extracted fields, because you're now
dealing with mathematical identities.

The difference is obviously much more evident in floating
point: you're not going to be able to extract individual
fields using arithmetic operators on floating point, and
anything you do using bitwise operators to assemble the
fields will be very implementation dependent---for that, you
want the mathematical functions and operators.

But I see that later, after my example:

[...]
Absolutely. The file format will be specified at the bit
or byte level and will have to be processed with bit
operations. But to build the host value, integer or
floating-point, it's safer to use arithmetic operations
such as ldexp() and negation.

You actually agree with me. Bitwise for extracting (and
inserting when writing) the individual fields, arithmetic
for manipulating the values. (It seems reasonable to
consider each field an unsigned int of a specific size.)
Are you sure? AFAIK, char may be 6 bits; there are
trigraphs to deal with such hosts. But I haven't read
the recent C standards.

C90 required that UCHAR_MAX be at least 255, and that the
representation be pure binary. For an implementation to
conform with less than 8 bits, it would have to be able to
fit 255 in less than 8 bits, using a binary representation.
Which is impossible.

Admitted, it's a pretty indirect way to specify the minimum
size, but it's what C90 used, and it hasn't changed in C++
or C99.
Of course, this is a question of specification.
If all that is called for is to be able to save and load
binary data on the same computer, we could just memory map
the file.

Not necessarily. As I said, I've actually experienced the
case where the byte order of a long changed from one version
of the compiler to the next, and on the machine I currently
use, the size of a long depends on a compiler option (and
there are two different system API's, depending on which
compiler option was used).
However, I see one problem in relying on specifications:
they are not formally defined, and cannot automatically be
checked (in general; I have yet to see a real project using
formal specification tools, seems I'm not in the right
industry for this kind of tool :-( ). So the problem is
that you will clearly document that your lateral
accelerometer outputs 16-bit values; you will document it
and write it all over. But when another team reuses
your module, and embeds it in hardware able to support
lateral accelerations that need to be expressed in more
than 16 bits, no compiler will be there to check the
specification mismatch, and you'll lose costly hardware and
perhaps lives. OK, you've specified and documented the
limits of your module, so nobody can reproach you
anything; nonetheless, this didn't prevent a catastrophe.

I see you've read the details of the Ariane 5 explosion as
well :-). In practice, this is a real problem. The boss
says that the code will never have to be ported to another
machine, and the week following its delivery, he installs a
new machine, with different characteristics. I've found
that it is often worthwhile being "portable", even if it
isn't part of the specifications. How portable depends on
the extra effort required, but I'd certainly attempt to
handle at least byte order and the size of the basic
integral types. Beyond that, a lot depends---the type of
projects I usually work on might conceivably be run on a
mainframe, so I take mainframe architectures into
consideration (none of the mainframes I know use IEEE, and
some have 36 or 48 bit ints, padding bits in int, and 1's
complement or signed magnitude), but if I were writing, say,
a GUI interface, I probably wouldn't bother: I'd suppose
IEEE floating point, for example, because all current
workstations and PC's use it, and I can't imagine that
changing.
 
Jorgen Grahn

In practice, you need to do two things: define the format you
will be reading, and decide your portability requirements.

Yes. I hate maintaining code where the external data format is defined
by "whatever the first implementation ended up generating, on the
first machine it happened to run on."
If
you need to handle float, and need to be portable to just about
everything, the code is far from trivial. If the format you're
reading used IEEE floats, however (often the case), and you
don't have to worry about machines which use other floating
point formats (mainly---perhaps only---mainframes), then it is a
lot easier.

It seems to me that XDR would be useful here -- it defines storage of
floating-point types. Or better, don't insist on an unreadable data
format; use plain text instead.

/Jorgen
 
James Kanze

On Mon, 4 May 2009 01:35:40 -0700 (PDT), James Kanze
Yes. I hate maintaining code where the external data format is
defined by "whatever the first implementation ended up
generating, on the first machine it happened to run on."

Especially when they've been around for a while. Some of those
older machines had some really weird formats. (Byte order of a
long 2301, for example.)
It seems to me that XDR would be useful here -- it defines
storage of floating-point types.

It defines it to be IEEE. If your portability needs are
limited to machines with IEEE (PC's and most mid-sized Unix,
but not mainframes), then using IEEE in the external format is
particularly easy: just copy the bytes into an unsigned integer
of the same size, and output it as usual. If you need to
support other machines, however, you'll need to extract the
bits for each field, as they're defined by the format (which
defines them by reference to IEEE), and reassemble them using
things like ldexp.
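
A sketch of that easy case on the writing side, assuming an IEEE-float
host and a <stdint.h> that provides uint32_t (the function name is
illustrative):

#include <cstring>
#include <ostream>
#include <stdint.h>

void writeXdrFloat(std::ostream & dest, float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits); // IEEE float is the same size as uint32_t
    for (int shift = 24; shift >= 0; shift -= 8) {
        dest.put(static_cast<char>((bits >> shift) & 0xFF)); // XDR is big-endian
    }
}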
Or better, don't insist on an unreadable data format; use
plain text instead.

That's always to be preferred. If only because it makes
debugging several orders of magnitude easier.

Note that if you want a portable format, even if it is a text
format, you'll still have to open the file as a binary file.
And output whatever the format requires for line endings. For
that matter, if you really want to be portable, you'll also have
to consider the possibility that your implementation of C++ uses
a different encoding internally than that required by the
format. Alternatively, you define the format as pure text, in
the native format for text, and require translation when moving
between systems. Historically, this is the traditional
solution, but it doesn't work very well if the machines are
physically connected or if you're sharing disks.
 
