portable handling of binary data

SG

Hi!

I'm wondering what the preferred portable way of handling binary data
is. For example, I want to read a binary file which contains 32bit and
16bit integers in the little endian format. Now, I'm aware that a
character might have more bits than 8. But I don't care about this
case for now. So, I enclose my conversion routines for char* to some
int with preprocessor directives:

#include <climits>
#if CHAR_BIT == 8
// conversion code here
#endif

As far as I know the C++ standard doesn't specify whether a char is
signed or unsigned, nor does it specify what will happen if I convert
between signed and unsigned in case the original value can't be
represented. Also, signed integers don't need to be stored in two's
complement. Unfortunately, this seems to make decoding a 16 bit signed
number in two's complement & little endian byte order in a portable
way impossible. I came up with the following piece of code which still
invokes implementation defined behaviour:

// decode signed 16 bit int (two's complement & little endian)
inline int_fast16_t get_s16le(const char* p)
{
// we already know that CHAR_BIT == 8 but "char" might be signed
// as well as unsigned
unsigned char low = p[0]; // implementation-defined for p[0]<0
signed char hi = p[1]; // implementation-defined for p[1]>=128
return int_fast16_t(low) + int_fast16_t(hi) * 256;
}

Also, this is horribly slow. I'd much rather be able to query certain
implementation properties so I can use much faster code.

My latest incarnation looks like this:

inline uint_fast16_t swap_bytes_16bit(uint_fast16_t x) {
return ((x & 0xFF00u) >> 8) | ((x & 0x00FFu) << 8);
}

inline uint_fast16_t get_u16le(const char* p) {
uint_fast16_t x;
assert(sizeof(x)>=2);
std::memcpy(&x,p,2);
#if BYTE_ORDER == LITTLE_ENDIAN
return x;
#else
return swap_bytes_16bit(x);
#endif
}

inline int_least16_t get_s16le(const char * p) {
assert( signed(~0u) == -1 ); //< This is not guaranteed by the standard
return get_u16le(p);
}

What's the preferred way to do this in a reasonably portable way?

Cheers!
SG
 
Triple-DES

[snip]
I came up with the following piece of code which still
invokes implementation defined behaviour:

// decode signed 16 bit int (two's complement & little endian)
inline  int_fast16_t  get_s16le(const char* p)
{
   // we already know that CHAR_BIT == 8 but "char" might be signed
   // as well as unsigned
   unsigned char low = p[0]; // implementation-defined for p[0]<0
   signed   char hi  = p[1]; // implementation-defined for p[1]>=128
   return  int_fast16_t(low) + int_fast16_t(hi) * 256;

}

I must admit that I have only skimmed your post. I'd just like to
point out that the result of the conversion from char to unsigned char
is not implementation-defined, but is guaranteed to be equivalent to a
simple reinterpretation of bits in two's complement (regardless of the
representation actually used).

Therefore, in your example, low is guaranteed to hold the value
(2**CHAR_BIT + p[0]) mod 2**CHAR_BIT.
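
A trivial illustration of that guarantee (just a sketch, assuming
CHAR_BIT == 8; the concrete values are only examples):

#include <cassert>

int main()
{
    char c = -1;          // some negative plain char value (if char is signed)
    unsigned char u = c;  // well-defined: value reduced modulo 2**CHAR_BIT
    assert( u == 255 );   // (2**8 + (-1)) mod 2**8 == 255
}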
 
James Kanze

I'm wondering what the preferred portable way of handling
binary data is.

It depends on the format, and just how portable you have to be.
And the data types; truly portable floating point can be a
bitch.
For example, I want to read a binary file which contains 32bit
and 16bit integers in the little endian format.

Two's complement, or?
Now, I'm aware that a character might have more bits than 8.
But I don't care about this case for now. So, I enclose my
conversion routines for char* to some int with preprocessor
directives:
#include <climits>
#if CHAR_BIT == 8
// conversion code here
#endif

More usual would be

#include <climits>
#if CHAR_BIT != 8
#error Only 8 bit chars supported
#endif
As far as I know the C++ standard doesn't specify whether a
char is signed or unsigned

No, and it varies in practice. (Not that I think it makes a
difference in your case.)
nor does it specify what will happen if I convert between
signed and unsigned in case the original value can't be
represented.

Conversions to unsigned integral types are fully defined.
Also, signed integers don't need to be stored in two's
complement. Unfortunately, this seems to make decoding a 16
bit signed number in two's complement & little endian byte
order in a portable way impossible.

Not really. First, you do the input as unsigned:

uint16_t result = source.get() ;
result |= source.get() << 8 ;

(source.get() should return a value in the range 0-255.
std::istream::get() could be used here. The only time it's out
of range is if you read past EOF: the results are still well
defined in that case, even if they don't mean anything, there's
no undefined behavior; and you can test for the case
afterwards.)

For uint32_t, do the same thing with four bytes.
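
E.g. something along these lines (just a sketch; the casts keep the
shifts of the upper bytes in unsigned arithmetic):

uint32_t result = source.get() ;
result |= static_cast< uint32_t >( source.get() ) <<  8 ;
result |= static_cast< uint32_t >( source.get() ) << 16 ;
result |= static_cast< uint32_t >( source.get() ) << 24 ;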

Unless I knew I had to support a machine where it didn't work,
I'd just assign the results to an int16_t and be done with it.
(I only know of two machines where it wouldn't work, and neither
has a 16 bit integral type to begin with.) Otherwise, you might
have to do some juggling:

return result <= 0x7FFF
    ?   static_cast< int16_t >( result )
    : - static_cast< int16_t >( 0xFFFF - result ) - 1 ;
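
Putting the pieces together, a fully portable reader might look
roughly like this (only a sketch; the name read_s16le is arbitrary,
and <cstdint> assumes the fixed width typedefs are available):

#include <cstdint>
#include <istream>

// read a signed 16 bit little endian value without assuming anything
// about the representation of the native signed types
inline std::int16_t read_s16le( std::istream& source )
{
    std::uint16_t result = source.get() ;
    result |= static_cast< std::uint16_t >( source.get() ) << 8 ;
    return result <= 0x7FFF
        ?   static_cast< std::int16_t >( result )
        : - static_cast< std::int16_t >( 0xFFFF - result ) - 1 ;
}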
I came up with the following piece of code which still invokes
implementation defined behaviour:
// decode signed 16 bit int (two's complement & little endian)
inline int_fast16_t get_s16le(const char* p)
{
// we already know that CHAR_BIT == 8 but "char" might be signed
// as well as unsigned
unsigned char low = p[0]; // implementation-defined for p[0]<0
signed char hi = p[1]; // implementation-defined for p[1]>=128
return int_fast16_t(low) + int_fast16_t(hi) * 256;
}

Don't use char (or char const*) here. Use unsigned char, or
unsigned char const*. Or just use the istream directly (opened
in binary mode, of course), using istream::get() (and thus
leaving the problem up to the implementation of filebuf/istream
to make this work.)

(Actually, of course, *if* the chars have the correct values,
there's no problem. The problem only occurs if char is signed,
AND the machine doesn't use 2's complement---there would be one
unsigned char value that couldn't occur. And there's so much
code out there which uses char* for pointing to raw memory that
any implementation which doesn't use 2's complement will almost
certainly make char unsigned.)
Also, this is horribly slow.

Have you actually measured it? I've found no measurable
difference using the shifting technique, above.
I'd much rather be able to query certain implementation
properties so I can use much faster code.
My latest incarnation looks like this:
inline uint_fast16_t swap_bytes_16bit(uint_fast16_t x) {
return ((x & 0xFF00u) >> 8) | ((x & 0x00FFu) << 8);
}
inline uint_fast16_t get_u16le(const char* p) {
uint_fast16_t x;
assert(sizeof(x)>=2);
std::memcpy(&x,p,2);
#if BYTE_ORDER == LITTLE_ENDIAN
return x;
#else
return swap_bytes_16bit(x);
#endif
}

Such swapping is likely to be slower than just doing it right in
the first place, using the shifts immediately on reading.
inline int_least16_t get_s16le(const char * p) {
assert( signed(~0u) == -1 ); //< This is not guaranteed by the standard
return get_u16le(p);
}
What's the preferred way to do this in a reasonably portable
way?

See above. Most people, I suspect, count on the conversion of
the uint16_t to int16_t to do the right thing, although
formally, it's implementation defined (and may result in a
signal).
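
In code, that pragmatic approach is just (a sketch, reusing the
get_u16le above):

// relies on the implementation-defined uint16_t -> int16_t conversion,
// which does the expected thing on 2's complement machines
int16_t value = static_cast< int16_t >( get_u16le( p ) ) ;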
 
SG

Two's complement, or?
yes.

Conversions to unsigned integral types are fully defined.

Right. The remaining issue would be how bits are interpreted for the
value of signed char. That's why you recommended raw access as a
sequence of unsigned chars.
Not really. First, you do the input as unsigned:

uint16_t result = source.get() ;
result |= source.get() << 8 ;

I intend to use istream::read which requires a pointer to char. I
checked the current C++ specification draft again and it seems that
I'm allowed to cast a pointer to void* and then to unsigned char* to
access the raw data. So, I expect the following to work in case
CHAR_BIT == 8:

/// extract unsigned 16 bit int (little endian format)
inline uint_fast16_t get_u16le(const void* pv) {
const unsigned char* pc = static_cast<const unsigned char*>(pv);
return pc[0] | (pc[1] << 8);
}

char buff[123];
ifstream ifs ("somefile.dat", ifstream::binary | ifstream::in);
ifs.read(buff,123);
uint_fast16_t foo = get_u16le(buff);
Unless I knew I had to support a machine where it didn't work,
I'd just assign the results to an int16_t and be done with it.
[...] Most people, I suspect, count on the conversion of
the uint16_t to int16_t to do the right thing, although
formally, it's implementation defined (and may result in a
signal).

Is there an elegant way for querying this implementation-defined
behaviour at compile-time so I can make the compiler reject the code
if it won't work as intended?

Cheers!
SG
 
SG

/// extract unsigned 16 bit int (little endian format)
inline uint_fast16_t get_u16le(const void* pv) {
const unsigned char* pc = static_cast<const unsigned char*>(pv);
return pc[0] | (pc[1] << 8);
}

Sorry. That should be

return pc[0] | (uint_fast16_t(pc[1]) << 8);

Cheers!
SG
 
James Kanze

Right. The remaining issue would be how bits are interpreted
for the value of signed char. That's why you recommended raw
access as a sequence of unsigned chars.

Yes.

In practice, if the machine is 2's complement, you should be
able to type pun (i.e. reinterpret_cast on a pointer) plain
chars and unsigned chars without problems. But why bother?
I intend to use istream::read which requires a pointer to
char.

All "raw IO" in C++ is defined in terms of char. But I don't
really see any advantage of read() over using istream::get(), as
above, and I see several (very minor) disadvantages.
I checked the current C++ specification draft again and it
seems that I'm allowed to cast a pointer to void* and then to
unsigned char* to access the raw data.

You can just use reinterpret_cast. You're type punning, and
that's what reinterpret_cast was designed for.
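
I.e. your function could take char const* and cast directly,
something like this (a sketch, with the widening cast from your
follow-up):

/// extract unsigned 16 bit int (little endian format)
inline uint_fast16_t get_u16le(const char* p) {
    const unsigned char* pc = reinterpret_cast<const unsigned char*>(p);
    return pc[0] | (uint_fast16_t(pc[1]) << 8);
}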
So, I expect the following to work in case CHAR_BIT == 8:
/// extract unsigned 16 bit int (little endian format)
inline uint_fast16_t get_u16le(const void* pv) {
const unsigned char* pc = static_cast<const unsigned char*>(pv);
return pc[0] | (pc[1] << 8);
}
char buff[123];
ifstream ifs ("somefile.dat", ifstream::binary | ifstream::in);
ifs.read(buff,123);
uint_fast16_t foo = get_u16le(buff);

Yes. Provided your protocol requires 123 bytes to be available.
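
In practice that means checking the read afterwards, e.g. (a sketch;
the stream state alone would suffice, gcount() just documents the
intent):

ifs.read( buff, 123 ) ;
if ( ! ifs || ifs.gcount() != 123 ) {
    // format error: fewer than 123 bytes were available
}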
Unless I knew I had to support a machine where it didn't work,
I'd just assign the results to an int16_t and be done with it.
[...] Most people, I suspect, count on the conversion of
the uint16_t to int16_t to do the right thing, although
formally, it's implementation defined (and may result in a
signal).
Is there an elegant way for querying this
implementation-defined behaviour at compile-time so I can make
the compiler reject the code if it won't work as intended?

Not really, if int16_t is present. About all I can suggest is a
small program which actually tries it, and outputs the results,
compiled and executed from a script which generates a #define of
something with the appropriate value, invoked automatically from
your makefile. In practice, however, I probably wouldn't
bother. The unit tests will fail in an obvious way if the
compiler doesn't do the expected, at which point you can add
whatever you need to your configuration file.
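
For instance, the probe program might be no more than this (a sketch;
the macro name and the use of <cstdint> are just examples):

// probe.cpp -- compiled and run by the build script; stdout is
// redirected into a generated configuration header
#include <cstdint>
#include <iostream>

int main()
{
    std::uint16_t u = 0xFFFFu ;
    std::int16_t  s = static_cast< std::int16_t >( u ) ; // conversion under test
    std::cout << "#define UINT16_TO_INT16_WRAPS "
              << ( s == -1 ? 1 : 0 ) << '\n' ;
}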
 
Triple-DES

Unless I knew I had to support a machine where it didn't work,
I'd just assign the results to an int16_t and be done with it.
[...] Most people, I suspect, count on the conversion of
the uint16_t to int16_t to do the right thing, although
formally, it's implementation defined (and may result in a
signal).

Is there an elegant way for querying this implementation-defined
behaviour at compile-time so I can make the compiler reject the code
if it won't work as intended?

Warning: untested on anything but two's complement machines
// <code>
typedef unsigned short uint16_t;
typedef short int16_t;
#include <climits>
// unfortunately numeric_limits<int16_t>::max() is not a constant expr.
#define int16_MAX SHRT_MAX
#define uint16_MAX USHRT_MAX

template<uint16_t u> struct Check
{
// statically assert the conversion unsigned->signed->unsigned
char b[uint16_t ( int16_t(u) ) == u ];
// trigger the instantiation of the next check
typedef Check<u-1> Next;
};
// values from int16_MAX to zero yield well-defined results
template<> struct Check<int16_MAX> {};
// explicitly instantiate the check for uint16_max
template struct Check<uint16_MAX>;

// </code>

This should verify that for all values from uint16_MAX to int16_MAX,
the conversion from unsigned to signed yields a value that is
equivalent to a reinterpretation of the _unsigned_ value in binary as
a two's complement bit pattern.

I'm however not sure how a compiler would treat a compile-time
conversion that would cause a signal at runtime. For something like
overflow, the program would be ill-formed, per 5/5.

Also some compilers may have problems with it since it causes around
32k classes to be instantiated. I also thought of a simpler check that
may be sufficient:

// poor man's check:
// verify that int16_t can represent as many values as uint16_t
// (int16_MIN defined analogously to the macros above)
#define int16_MIN SHRT_MIN
sizeof(char [uint16_MAX == (-int16_MIN + int16_MAX)] );
 
SG

All "raw IO" in C++ is defined in terms of char. But I don't
really see any advantage of read() over using istream::get(), as
above, and I see several (very minor) disadvantages.

You mean

int istream::get();

But I'd like to be able to "decode" such 16bit and 32bit ints from raw
memory instead of having to use an istream object. As far as I can
tell

istream& istream::get(char* s, streamsize n);

is useless on binary data as it only reads until a delimiter ('\n') is
found whereas

istream& istream::read(char* s, streamsize n);

behaves exactly like I need it. I also suspect that calling
istream::get() for every single byte might hurt the performance -- and
no, I haven't tested it. I just don't see any reason to do it that way.

Cheers!
SG
 
James Kanze

int istream::get();
Yes.

But I'd like to be able to "decode" such 16bit and 32bit ints
from raw memory instead of having to use an istream object.

Well, you can always design a streambuf to do it. But why? The
only reason for serialization is IO.
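
If you really do want to decode from a buffer through a stream, a
minimal read-only streambuf over existing memory could look roughly
like this (only a sketch; membuf and the names in the usage comment
are made up):

#include <cstddef>
#include <istream>
#include <streambuf>

class membuf : public std::streambuf
{
public:
    membuf( const char* p, std::size_t n )
    {
        // setg() wants char*, so cast away const; the buffer is never written
        char* b = const_cast< char* >( p ) ;
        setg( b, b, b + n ) ;
    }
} ;

// usage:
//   membuf       buf( rawData, rawSize ) ;
//   std::istream in( &buf ) ;
//   int          lowByte = in.get() ;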
As far as I can tell
istream& istream::get(char* s, streamsize n);
is useless on binary data as it only reads until a delimiter
('\n') is found whereas
Yes.

istream& istream::read(char* s, streamsize n);
behaves exactly like I need it.

If you know in advance that there will be enough bytes in the
stream (or it is a format error). In practice, with most
protocols, you can't use it for more than about four bytes
anyway, and the bufferization of the stream means that there
isn't really any difference in speed compared to using
istream::get().
I also suspect that calling istream::get() for every single
byte might hurt the performance -- and no, I havn't tested it.
I just don't see any reason to do it that way.

The main reason is that it is a lot more convenient, and more
natural.
 
