# Standard way of converting a byte stream to two's complement

#### petek1976

Hi,

The C++ standard does guarantee any representation for
signed integers, which of course means it is implementation defined to
right-shift a signed numeric type (no guarantee of sign extension,
etc.). So if I were defining an interface that expected a
sequence of bytes in big-endian byte order (no matter what the
architecture/platform), and I wanted to convert a sequence of unsigned
char* to a signed number, do I really have to do this:

```cpp
unsigned long ExtractUInt32M(const unsigned char *apnOctets)
{
    const unsigned char *lpnOctets = apnOctets;
    unsigned long lnValue = *(lpnOctets++);
    lnValue = (lnValue << 8) | *(lpnOctets++);
    lnValue = (lnValue << 8) | *(lpnOctets++);
    lnValue = (lnValue << 8) | *(lpnOctets);
    return lnValue;
}

long ExtractSInt32M(const unsigned char *apnOctets)
{
    unsigned long lnBaseBits = ExtractUInt32M(apnOctets); // treat as unsigned first
    long lnValue;
    if ((lnBaseBits & 0x80000000) != 0) // Check for negative value
    {
        lnValue = lnBaseBits - 2147483648; // line 7
        lnValue -= 2147483648; // line 8
    }
    else
    {
        lnValue = lnBaseBits;
    }
    return lnValue;
}
```

Notice the lines that subtract 2^31. This is all based on the fact
that a negative number in two's complement is represented as 2^N -
abs(value). First, 2^32 (4294967296) is beyond the guaranteed range of
either a long or an unsigned long in C++. Second, the C++ language does
not define what happens if you convert an out-of-range value to long (or
any other integer type). In our case, if the two's complement number
represents a negative value, it is undefined what will happen if I
were to assign lnBaseBits directly to lnValue. I solve both of these
issues by performing the subtraction in two steps. First, 2^31 is
subtracted from lnBaseBits. Since the result is a non-negative value
that is guaranteed to be within the range of a long, I can safely assign
it to lnValue. In line 8 the process is completed by subtracting 2^31
again, which results in the correct negative value. Of course, what I am
doing here is still not really portable, since the standard only
guarantees a maximum long value of 2147483647. Isn't there an easier way
that does not violate the C++ standard and is still portable? Will the
next revision of the standard make such a task easier by doing something
like guaranteeing the representation of a signed number so that
bit-shifting will result in sign extension?

The main point is that I do not want to rely on the underlying data
representation of a particular machine, for portability reasons.

#### James Kanze

> The C++ standard does guarantee any representation for

You meant does not, of course.

> signed integers which of course means it is implementation
> defined to right-shift a signed numeric type (no guarantee of
> sign-extension etc.)

More correctly, it's implementation defined to right shift a
negative signed integer. The standard explicitly allows an
implementation to either insert a zero, or the sign bit, in the
upper bit.

This really has nothing to do with the representation, directly;
it affects all of the representations equally.

Note that the representation can't be just anything. Signed
integers are required to have the same representation as the
corresponding unsigned integer for the common subset of values,
and that representation is required to be pure binary. So the
sign bit of a positive value must be 0, for example, and 0 must
have all bits 0.

> So if I were defining an interface that expected a sequence of
> bytes in big-endian byte order (no matter what the
> architecture/platform) and I wanted to convert a sequence of
> unsigned char* to a signed number do I really have to do this:
>
> ```cpp
> unsigned long ExtractUInt32M(const unsigned char *apnOctets)
> {
>     const unsigned char *lpnOctets = apnOctets;
>     unsigned long lnValue = *(lpnOctets++);
>     lnValue = (lnValue << 8) | *(lpnOctets++);
>     lnValue = (lnValue << 8) | *(lpnOctets++);
>     lnValue = (lnValue << 8) | *(lpnOctets);
>     return lnValue;
> }
> ```

More or less. Something along those lines, at least.

> ```cpp
> long ExtractSInt32M(const unsigned char *apnOctets)
> {
>     unsigned long lnBaseBits = ExtractUInt32M(apnOctets); // treat as unsigned first
>     long lnValue;
>     if ((lnBaseBits & 0x80000000) != 0) // Check for negative value
>     {
>         lnValue = lnBaseBits - 2147483648; // line 7
>         lnValue -= 2147483648; // line 8
>     }
>     else
>     {
>         lnValue = lnBaseBits;
>     }
>     return lnValue;
> }
> ```

This is the tricky one. Formally, you really do have very few
guarantees; in particular, assigning a value out of range (and
ULONG_MAX will usually be out of range for a long) is
implementation defined behavior, and an implementation may even
generate an implementation defined signal in such cases. Which
means that you do have to check the range first, and "fiddle"
the value, something along the lines of what you've done. (But
I'm not sure that your code is really portable either. If long
is 32 bits, and the compiler doesn't support long long, it
shouldn't even compile, since 2147483648 is not a legal integral
constant in such cases.)

In practice, it depends on just how portable you have to be.
Just assigning the unsigned long to a long obviously doesn't
work. (At least, it doesn't work on most of the machines I have
access to.) For the restricted set of machines I use, I use
uint32_t and int32_t, from the <stdint.h> header from C99. This
guarantees at least the size of the integer, and that it is 2's
complement. (If these two conditions aren't met, int32_t won't
be defined, and the code won't compile.)

> Notice the lines that subtract 2^31.

Yes. They're the problem; unless the compiler supports long
long, an integral constant in base 10 cannot have a value larger
than LONG_MAX, which on a 32 bit machine is 2^31 - 1.

You can append a UL to the value for the first subtraction. You
can't for the second, however, because that would have the
result of converting the long to unsigned long, doing the
subtraction, and then converting the (positive) result back to
long.
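
The point about the UL suffix can be sketched as follows. This is only an illustration, not code from the thread: the name SInt32FromBitsTwoStep is hypothetical, and it assumes that long can hold -2147483648 (true on two's complement machines, but not guaranteed by the standard). The second subtraction is split so that no constant exceeds LONG_MAX.

```cpp
// Hypothetical helper (not from the thread): converts a 32-bit two's
// complement bit pattern held in an unsigned long to a long, keeping the
// two-step subtraction but using only constants that are legal everywhere.
// Assumption: long can represent -2147483648.
long SInt32FromBitsTwoStep(unsigned long bits)
{
    bits &= 0xFFFFFFFFUL;                    // keep only the low 32 bits
    if ((bits & 0x80000000UL) == 0) {
        return static_cast<long>(bits);      // 0 .. 2^31 - 1, always in range
    }
    // Step 1: the UL suffix keeps this subtraction unsigned. The result is
    // in 0 .. 2^31 - 1, so converting it to long is in range.
    long value = static_cast<long>(bits - 0x80000000UL);
    // Step 2: subtract 2^31 in signed arithmetic, written as
    // 0x7FFFFFFF followed by 1 so that neither constant exceeds LONG_MAX.
    value -= 0x7FFFFFFFL;
    value -= 1L;
    return value;
}
```

For example, the bit pattern 0xFFFFFFFF gives 2147483647 after step 1 and -1 after step 2.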

Another possibility might be something along the lines of:

```cpp
lnValue = ( lnBaseBits > LONG_MAX
    ? - static_cast< long >( ~(lnBaseBits - 1) )
    : static_cast< long >( lnBaseBits ) ) ;
```

Regretfully, at least as written above, it fails if lnBaseBits
is 0x80000000.

> This is all based on the fact
> that a negative number in two's complement is represented as 2^N -
> abs(value). First, 2^32 (4294967296) is beyond the guaranteed range of
> either a long or an unsigned long in C++. Second, the C++ language does
> not define what happens if you convert an out-of-range value to long (or
> any other integer type). In our case, if the two's complement number
> represents a negative value, it is undefined what will happen if I
> were to assign lnBaseBits directly to lnValue.

Not undefined, implementation defined. The C standard is more
explicit: it says that either you obtain an implementation
defined result, or an implementation defined signal is raised.

From the point of view of code quality, the signal is arguably
the best solution. C and C++ are traditionally so careless with
their typing, however, that any implementation actually doing
this would probably cause an enormous number of programs to
fail. (Think of the number of times an int in the range
0...UCHAR_MAX is assigned to a char. Of course, from the point
of view of supporting strict typing, char should probably be
required to be unsigned.)

Curiously enough, the standard defines the conversions in the
opposite direction (to an unsigned integral type). The whole
thing is really a bit hacky: logically, one would expect
undefined behavior in both cases, with standard functions (or a
special cast syntax) to do the conversions in both directions.
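
Because the conversion to an unsigned type is defined (the value is reduced modulo 2^N), the writing side of such an interface needs no special cases. A minimal sketch, assuming the value fits in 32 bits; the name StoreSInt32 is hypothetical and does not appear in the thread:

```cpp
// Hypothetical counterpart to the extraction functions (not from the
// thread): writes a signed value as four big-endian octets holding its
// 32-bit two's complement form.
// Assumption: the caller passes a value in -2^31 .. 2^31 - 1.
void StoreSInt32(long value, unsigned char* octets)
{
    // Signed to unsigned conversion is defined as reduction modulo 2^N,
    // so a negative value becomes 2^N - abs(value) on every implementation,
    // whatever representation the machine uses for signed integers.
    unsigned long bits = static_cast<unsigned long>(value);
    octets[0] = static_cast<unsigned char>((bits >> 24) & 0xFFUL);
    octets[1] = static_cast<unsigned char>((bits >> 16) & 0xFFUL);
    octets[2] = static_cast<unsigned char>((bits >> 8) & 0xFFUL);
    octets[3] = static_cast<unsigned char>(bits & 0xFFUL);
}
```

If unsigned long is wider than 32 bits, the extra high-order bits are simply never written, so the low four octets are the same either way.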

> I solve both of these issues by performing the subtraction in
> two steps. 2^31 is subtracted from lnBaseBits. Since that is a
> positive value which is guaranteed to be within the range of a
> long, I can safely assign that result to lnValue. In line 8
> the process is completed by subtracting 2^31 again, which will
> result in the correct negative value. Of course what I am
> doing here is still not really portable since the standard
> only guarantees a long value of 2147483647.
> Isn't there an easier way that does not violate the C++
> standard and is still portable?

If you want to be 100% portable, no.

> Will the next revision of the standard make such a task easier
> by doing something like guaranteeing the representation of a
> signed number so that bit-shifting will result in
> sign-extension?

That wouldn't change anything.

Note that in your examples above, you only shift left, never
right. And there's nothing implementation defined in left
shifting.

> The main point is that I do not want to rely on the underlying
> data representation of a particular machine for portability
> reasons.

And the real problem is that you probably won't be able to test
the code on a machine which doesn't use 2's complement, with
modulo arithmetic even on signed integral types, so you'll never
really know if your code works.

I'd strongly suggest just using int32_t and uint32_t (assigning
the uint32_t to the int32_t once you have it), documenting the
restriction, and just counting on the code either failing to
compile, or a test failing, if the conditions aren't met.
There's a very, very good chance you'll never actually have to do
more. Otherwise, I'd probably special case 0x80000000 and use
what I wrote above:

```cpp
long
ExtractSInt32( unsigned char const* buffer )
{
    unsigned long value = ExtractUInt32( buffer ) ;
    return value == 0x80000000UL
        ? -0x7FFFFFFFL - 1L
        : value > 0x80000000UL
        ? - static_cast< long >( ~( value - 1 ) )
        : static_cast< long >( value ) ;
}
```
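
A variant that may avoid the special case is sketched below. It is a different formulation from the one above, not the author's code: it relies on the identity -x = ~x + 1, computes ~x as 0xFFFFFFFF - x so that only 32 bits are involved even when unsigned long is wider, and applies the "+ 1" after the negation. The names are hypothetical, and it still assumes that long can hold -2147483648.

```cpp
// Hypothetical variant (not from the thread): decodes four big-endian
// octets as a 32-bit two's complement value with no special case.
// Assumption: long can represent -2147483648.
long ExtractSInt32NoSpecialCase(const unsigned char* octets)
{
    // Assemble the 32-bit pattern, using only the low 8 bits of each byte.
    unsigned long bits = 0;
    for (int i = 0; i < 4; ++i) {
        bits = (bits << 8) | (octets[i] & 0xFFUL);
    }
    if (bits <= 0x7FFFFFFFUL) {
        return static_cast<long>(bits);      // non-negative, always in range
    }
    // For a negative pattern, 0xFFFFFFFF - bits is the 32-bit complement
    // of bits and lies in 0 .. 2^31 - 1, so the cast is in range. Negating
    // it and subtracting 1 yields bits - 2^32 without ever forming 2^31
    // as a positive long.
    return -static_cast<long>(0xFFFFFFFFUL - bits) - 1L;
}
```

With the octets 80 00 00 00 the complement is 0x7FFFFFFF, and -2147483647 - 1 gives the most negative value directly.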

(Note that you still might have problems in ExtractUInt32 if the
bytes are more than 8 bits. But at that point, you'll have to
define how the data is actually read and written; presumably,
you'll only read and write the lower order 8 bits, and
everything will be OK.)
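
The int32_t/uint32_t approach suggested above, combined with masking each byte to its low-order 8 bits, might look roughly like this. It is a sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, it uses `<cstdint>` (the C++11 counterpart of C99's `<stdint.h>`), and it relies on the unsigned to signed conversion wrapping around, which is implementation defined before C++20 and defined as modular from C++20 on.

```cpp
#include <cstdint>

// Hypothetical names (not from the thread). If the implementation has no
// exact 32-bit two's complement type, std::int32_t does not exist and
// this fails to compile, which is the documented restriction.
std::uint32_t ExtractUInt32Fixed(const unsigned char* octets)
{
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i) {
        // Mask to the low 8 bits in case a byte is wider than an octet.
        value = (value << 8) | (octets[i] & 0xFFu);
    }
    return value;
}

std::int32_t ExtractSInt32Fixed(const unsigned char* octets)
{
    // Assumption: an out-of-range conversion wraps modulo 2^32.
    // Implementation defined before C++20; guaranteed from C++20 on.
    return static_cast<std::int32_t>(ExtractUInt32Fixed(octets));
}
```

A unit test that decodes a few known negative patterns (for example FF FF FF FE as -2) is the "test failing" safeguard mentioned above for platforms where the assumption does not hold.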