Reading little-endian data from a file in a portable manner



chrisbazley

Greetings,

Over the past few days I have been porting a C program to a platform
that doesn't support unaligned data access, using a compiler that
doesn't support packed structures.

The program currently loads large chunks of data from an uncompressed
binary file format and addresses them using packed structures. To make
it work on big-endian systems, someone has added extra functions to
iterate over arrays of these packed structs and reverse the endianness
of 16-bit and 32-bit members. This takes place after reading data from
file and before writing it back.

I don't particularly like packed structures because they are not
standard C and even when supported they are slower to access and
require more code than properly-aligned data. The solution adopted for
my platform by the last person to port this program (an older version)
was to realign all the data after having loaded it. Yet another ugly
and time-consuming post-fread() step!

I would prefer to abandon the practice of loading data from the file
in large chunks. If the program instead loads one value at a time then
it can simultaneously correctly align it and swap endianness (if
required). I'm assuming that the file is buffered, so the additional
cost of many calls to fread() and/or fgetc() shouldn't be unbearable.
Of course that must be weighed against the benefit of not needing to
post-process the data.

Does that sound like the right strategy?

Having to read data one byte at a time (or even one 16- or 32-bit
value at a time) negates one of the advantages of using a binary file
format in the first place, although there is still benefit in having a
smaller file size than a text format such as XML.

The second question I have been considering is the best method of
implementing the load-value-and-swap-endianness functions. So far, I
have tried two methods.

The first method is to load a whole value using fread() and then swap
the endianness if necessary:

int read_uint16(uint16_t *s, FILE *f)
{
    if (fread(s, sizeof(*s), 1, f) != 1)
        return 1; /* non-zero means error */
#ifdef BIG_ENDIAN
    *s = ((*s >> 8) & 0xFF) | ((*s << 8) & 0xFF00);
#endif
    return 0; /* zero means success */
}

int read_uint32(uint32_t *s, FILE *f)
{
    if (fread(s, sizeof(*s), 1, f) != 1)
        return 1; /* non-zero means error */
#ifdef BIG_ENDIAN
    *s = ((*s >> 24) & 0xFF) | ((*s >> 8) & 0xFF00) |
         ((*s << 8) & 0xFF0000) | ((*s << 24) & 0xFF000000);
#endif
    return 0; /* zero means success */
}

The second method is to load one byte at a time:

static int read_int(unsigned long *s, size_t size, FILE *f)
{
    unsigned long value = 0;
    size_t byte;
    for (byte = 0; byte < size; byte++)
    {
        int c = fgetc(f); /* assumes char is 8 bit */
        if (c == EOF)
            break;
        value |= ((unsigned long)c & 0xFF) << (8 * byte);
    }
    *s = value;
    return byte < size; /* non-zero means error */
}

int read_uint16(uint16_t *s, FILE *f)
{
    unsigned long value;
    int error = read_int(&value, sizeof(*s), f);
    *s = (uint16_t)value; /* narrowing cast */
    return error;
}

int read_uint32(uint32_t *s, FILE *f)
{
    unsigned long value;
    int error = read_int(&value, sizeof(*s), f);
    *s = (uint32_t)value; /* narrowing cast */
    return error;
}

At the moment I prefer the second method because it has the advantage
that it doesn't need to be configured for a big-endian or little-
endian system; the code works naturally on either. It seems more of a
'pure' solution for that reason, and I prefer to avoid conditional
compilation where possible.

However, the second method will be slower on both little and big-
endian systems, because of the loop and multiple calls to fgetc().
Also, it relies on an assumption that fgetc will always return an
octet (or EOF). Does anyone know (or can anyone guess) whether that
would be true on a system where the type 'char' is wider than 8 bits?

Unfortunately, I can't think of a portable endian-independent solution
that avoids both conditional compilation and the assumption that
'char' is an 8 bit type (or fgetc() returns an octet). Even methods
for determining endianness at run time would surely rely upon treating
'int' types as arrays of 'char'.
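For what it's worth, the usual run-time test illustrates exactly that dependency -- it can only discover the byte order by viewing an integer object through an unsigned char pointer. A minimal sketch (helper name hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: returns non-zero on a little-endian machine.
   Note that it works precisely by treating an integer object as an
   array of unsigned char -- the very assumption described above. */
static int is_little_endian(void)
{
    uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```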

I would be very interested in your opinions.

Cheers,
 


Thad Smith

At the moment I prefer the second method because it has the advantage
that it doesn't need to be configured for a big-endian or little-
endian system; the code works naturally on either. It seems more of a
'pure' solution for that reason, and I prefer to avoid conditional
compilation where possible.
Agreed.

However, the second method will be slower on both little and big-
endian systems, because of the loop and multiple calls to fgetc().

If you use getc() instead, it will normally be implemented by a macro or inline
function, eliminating the function call overhead.

Similarly, function read_int() could be made an inline function or long macro to
eliminate that call overhead.

Measure the difference in timing. If it is not significant, you could use the
more generalized approach.
Also, it relies on an assumption that fgetc will always return an
octet (or EOF). Does anyone know (or can anyone guess) whether that
would be true on a system where the type 'char' is wider than 8 bits?

I haven't worked with any (in C). It is important to have a specification for
the data you are reading. You said that the data is encoded in little endian.
Specify the width of characters and then code accordingly. If you want, add

#include <limits.h>
#if CHAR_BIT != 8
#error Code change is needed for char size != 8 bits
#endif

then worry about implementing other widths if you get the compiler error.
 

Malcolm McLean

Unfortunately, I can't think of a portable endian-independent solution
that avoids both conditional compilation and the assumption that
'char' is an 8 bit type (or fgetc() returns an octet). Even methods
for determining endianness at run time would surely rely upon treating
'int' types as arrays of 'char'.


void write16le(int x, FILE *fp)
{
    fputc(x & 0xFF, fp);
    fputc((x >> 8) & 0xFF, fp);
}

int read16le(FILE *fp)
{
    int extension = ~0 << 15;
    int x;

    x = fgetc(fp);
    x |= (fgetc(fp) << 8);
    if (x & 0x8000)
        x |= extension;

    return x;
}

This works as long as the machine is two's complement. A language
lawyer will note that it can break on a Deathstation (char is nine
bits, whilst ints are 2 bytes with the upper two bits reserved for
trap representations), but this won't occur in practice.
 

Ben Bacarisse

Malcolm McLean said:
void write16le(int x, FILE *fp)
{
fputc(x & 0xFF, fp);
fputc( (x >> 8) & 0xFF, fp);

This shift is implementation defined when x is negative. If you convert
x to unsigned it becomes portable.
}

int read16le(FILE *fp)
{
int extension = ~0 << 15;

This is UB. It "probably works" but there is no need to take the risk.
~0u << 15 is well-defined. You need, of course, to change the type of x
as well or the initialisation can become implementation defined.
int x;

x = fgetc(fp);
x |= (fgetc(fp) << 8);

This, too, can be UB if fgetc(fp) times 256 is not representable as an
int (and this is the usual case on a 16-bit machine). Again, if you use
unsigned types the problem goes away.
if(x & 0x8000)
x |= extension;

return x;
}

This works as long as the machine is two's complement. A language
lawyer will note that it can break on a Deathstation (char is nine
bits, whilst ints are 2 bytes with the upper two bits reserved for
trap representations), but this won't occur in practice.

It can break on any machine. However, I can't see the specific problem
you obviously do with the special "upper two bits" in an 18-bit int. I
think you are worrying about things that don't matter, and not worrying
about things that do.
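A sketch applying those fixes (all shifting and masking done in unsigned arithmetic; error handling omitted for brevity, and the final unsigned-to-int conversion is implementation-defined rather than undefined):

```c
#include <stdio.h>

void write16le(int x, FILE *fp)
{
    unsigned int ux = (unsigned int)x; /* conversion to unsigned is well defined */
    fputc(ux & 0xFFu, fp);
    fputc((ux >> 8) & 0xFFu, fp);
}

int read16le(FILE *fp)
{
    /* EOF checking omitted for brevity */
    unsigned int x = (unsigned int)fgetc(fp);
    x |= (unsigned int)fgetc(fp) << 8;
    if (x & 0x8000u)
        x |= ~0u << 15; /* sign extension, well defined on unsigned */
    return (int)x;      /* implementation-defined if x > INT_MAX */
}
```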
 

Nobody

If you use getc() instead, it will normally be implemented by a macro or
inline function, eliminating the function call overhead.

With glibc, you need to use getc_unlocked(). The standard functions lock
the stream so that use in multi-threaded programs works correctly. Most
such functions have *_unlocked() versions which are defined as inline
functions or macros.

Or you could just implement your own buffered I/O.
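For example, the byte-at-a-time reader could be rewritten like this (hypothetical name read_uint32le; note that getc_unlocked() is POSIX, not ISO C, so this trades some portability for speed, and it is only safe when the stream is not shared between threads or is wrapped in flockfile()/funlockfile()):

```c
#define _POSIX_C_SOURCE 200809L /* getc_unlocked() is POSIX, not ISO C */
#include <stdint.h>
#include <stdio.h>

/* Read a little-endian 32-bit value one byte at a time, using the
   unlocked getc to avoid per-call locking overhead. */
static int read_uint32le(uint32_t *s, FILE *f)
{
    uint32_t value = 0;
    int i;
    for (i = 0; i < 4; i++)
    {
        int c = getc_unlocked(f);
        if (c == EOF)
            return 1; /* non-zero means error */
        value |= (uint32_t)(c & 0xFF) << (8 * i);
    }
    *s = value;
    return 0; /* zero means success */
}
```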
 

chrisbazley

I haven't worked with any (in C).  It is important to have a specification for
the data you are reading.  You said that the data is encoded in little endian.
Specify the width of characters and then code accordingly.  If you want, add

#include <limits.h>
#if CHAR_BIT != 8
#error Code change is needed for char size != 8 bits
#endif

then worry about implementing other widths if you get the compiler error.

I like your idea, but I think perhaps I overlooked a simpler solution.
I'm not really reading characters, so using the 'char' type is not
really appropriate. If I use the C99 types instead then presumably
uint8_t and int8_t won't be defined on platforms that don't support 8
bit values.

Cheers,
 


chrisbazley

Rather than a series of fgetc() calls, you could use a single fread() into
an "unsigned char buf[sizeof(uint32_t)]" and construct the value from there.

x = buf[0] + (buf[1] << 8) + (buf[2] << 16) + (buf[3] << 24);

Of all the solutions mooted this is the one that I prefer (and will
adopt). It's a nice compromise because it avoids lots of calls to
fgetc() without requiring the preprocessor to adapt it for target
machines with different endianness. I think my original solution using
the loop was somewhat over-engineered! :)
Yes, this still assumes CHAR_BIT==8.  I'm not sure how you go about reading
octets from a file on CHAR_BIT>8 systems.  However, that holds true for the
rest of the file, not just these 16- and 32-bit little endian values.  I
would expect that such a system might include a system-specific way of
saying "this file is in octets".

Perhaps if I used 'uint8_t' instead of 'unsigned char' and #include
<stdint.h> then the compiler will give an error if 'uint8_t' is
undefined?
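For reference, the approach I plan to adopt might look something like this (function name as in my earlier post; casting each byte to uint32_t before shifting also avoids signed overflow when buf[3] >= 0x80 and int is 32 bits, which the one-liner above could hit):

```c
#include <stdint.h>
#include <stdio.h>

/* One fread() into a byte buffer, then portable reassembly of a
   little-endian 32-bit value.  Still assumes CHAR_BIT == 8. */
int read_uint32(uint32_t *s, FILE *f)
{
    unsigned char buf[4];
    if (fread(buf, sizeof(buf), 1, f) != 1)
        return 1; /* non-zero means error */
    *s = (uint32_t)buf[0]         | ((uint32_t)buf[1] << 8) |
         ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    return 0; /* zero means success */
}
```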

Cheers,
 

Thad Smith

I like your idea, but I think perhaps I overlooked a simpler solution.
I'm not really reading characters, so using the 'char' type is not
really appropriate. If I use the C99 types instead then presumably
uint8_t and int8_t won't be defined on platforms that don't support 8
bit values.

Many compilers still don't support C99 and don't have those data types anyway.

When you are serializing data, as you are here, the best representation, in my
opinion, is usually octets with explicit byte ordering. You might consider
ASN.1 BER encoding for a generalized approach.

If you are restricted to, say, graphic ASCII characters, you need another
encoding scheme.
 

Nick Keighley

I think the implementor is free to do what he pleases. I think any
modern system would give you octets or some easy way to lay your hands
on octets. Failure to do so would make writing communication
protocols unnecessarily hard.
I like your idea, but I think perhaps I overlooked a simpler solution.
I'm not really reading characters, so using the 'char' type is not
really appropriate.

a char holds a byte. To me the most natural thing to stuff an octet
into is an unsigned char. And I do comms stuff.
 

Malcolm McLean

This, too can be UB if fgetc(fp) times 256 is not representable as an
int (and this is the usual case on a 16-bit machine).  Again, if you use
unsigned types the problem goes away.
Except it doesn't, because on that architecture the routine will break
when you convert the unsigned to a signed.

The only way I can see of doing it is to manually apply 2s complement
conversion on an unsigned. Which isn't really acceptable.
 

Ben Bacarisse

Malcolm McLean said:
Except it doesn't, because on that architecture the routine will break
when you convert the unsigned to a signed.

Yes, my suggestion only fixes the arithmetic -- the problem of UB goes
away.

Since the result of the conversion is not UB (it's implementation
defined) one could wing it in most cases. Many implementations will
just do what you want and because it's not UB one could even test a
conversion at program start-up.
The only way I can see of doing it is to manually apply 2s complement
conversion on an unsigned. Which isn't really acceptable.

I can't see an obvious, 100% portable solution but I don't know what you
mean by "[it] isn't really acceptable". You seem to be saying that
because there is no perfect solution to the unsigned to signed
conversion you'd rather stick with the original. But the undefined
shift is a real practical problem (at least that's my recollection of it
on some hardware) whereas the implementation-defined conversion is very
often exactly what you want. What, in your opinion, is the best
compromise between complexity and portability?
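The start-up test could be as simple as this sketch (helper name hypothetical): because the out-of-range conversion is implementation-defined rather than undefined, it is legitimate to probe it at run time.

```c
/* Hypothetical start-up check: verify that the implementation-defined
   unsigned-to-signed conversion wraps the way a two's complement
   decoder needs.  If this returns 0, a different decoding strategy
   is required on this implementation. */
static int conversion_wraps(void)
{
    unsigned int u = (unsigned int)-2; /* well defined: UINT_MAX - 1 */
    return (int)u == -2;               /* implementation-defined result */
}
```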
 


Thad Smith

Malcolm said:
Except it doesn't, because on that architecture the routine will break
when you convert the unsigned to a signed.

The only way I can see of doing it is to manually apply 2s complement
conversion on an unsigned. Which isn't really acceptable.

In the grand scheme of portable data transfer, one must specify the encoding of
data in the file. Can there be negative values? If so, how are they encoded?
There is more than one way. Pick one, then ensure that the code that writes
encodes properly and the code that reads decodes properly.
 

Malcolm McLean

In the grand scheme of portable data transfer, one must specify the encoding of
data in the file.  Can there be negative values?  If so, how are they encoded?
There is more than one way.  Pick one, then ensure that the code that writes
encodes properly and the code that reads decodes properly.

I think we can do it

int fget16(FILE *fp)
{
    int a;
    int b;
    int x;
    unsigned int ux;

    a = fgets(fp);
    if (a == EOF) /* error handler */
    b = fgetc(fp);
    if (b == EOF) /* error handler */

    if (x > 255 || b > 255) /* error handler */

    ux = (((unsigned) b) << 8) | a;
    if (ux & 0x8000)
    {
        ux |= ~0u << 15;
        ux = ~ux;
        ux++;
        x = -ux;
    }
    else
        x = ux;

    return x;
}

However in my view this is unacceptably tricky to write. I got it
wrong first time.
 

Ben Bacarisse

Malcolm McLean said:
I think we can do it

int fget16(FILE *fp)
{
int a;
int b;
int x;
unsigned int ux;

a = fgets(fp);
fgetc

if(a == EOF) /* error handler */
b = fgetc(fp);
if(b == EOF) /* error handler */

if(x > 255 || b > 255) /* error handler */

a > 255
ux = (((unsigned) b) << 8) | a;
if(ux & 0x8000)
{
ux |= ~0u << 15;
ux = ~ux;
ux++;
x = -ux;

I don't think this avoids the problem you are trying to avoid. -ux is
of type unsigned int and can be arithmetically > INT_MAX.
}
else
x = ux;

return x;
}

However in my view this is unacceptably tricky to write. I got it
wrong first time.

Provided we are content to deal only with two's complement machines and
we rule out the peculiar case of a machine where an int with sign-bit 1
and all other bits zero is a trap representation I think we can do this:

if (ux >= 0x8000u) {
    x = 0xffffu - ux;
    x = -x - 1;
}
else x = ux;

When int has more than 16 bits, the above only assumes 2's complement.
If there is a chance that (a) int is 16 bits and (b) int has the special
permitted trap representation, then one could add a test for ux ==
0x8000u and INT_MIN == -32767 to catch this special case. (Of course, if
the data comes from the same machine originally, we don't need to test
for this special case since it won't even have been written in the first
place.)

However, on what machine would this work when the simple implementation-
defined conversion (return ux) would not? I think this is a case where
a pragmatic solution wins.
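Putting the pieces together, a full reader along these lines might look like this sketch (the signature is a hypothetical variant that separates errors from legitimate values, unlike the int-returning fget16 above): assembly is done in unsigned arithmetic and the sign is decoded explicitly.

```c
#include <stdio.h>

/* Read a little-endian two's complement 16-bit value.  The only
   remaining assumption is about the encoding in the *file*, not the
   machine (modulo the INT_MIN == -32767 caveat discussed above). */
int fget16(FILE *fp, int *out)
{
    int a = fgetc(fp);
    int b = fgetc(fp);
    unsigned int ux;

    if (a == EOF || b == EOF)
        return 1; /* non-zero means error */
    ux = ((unsigned int)(b & 0xFF) << 8) | (unsigned int)(a & 0xFF);
    if (ux >= 0x8000u)
        *out = -(int)(0xFFFFu - ux) - 1; /* explicit two's complement decode */
    else
        *out = (int)ux;
    return 0; /* zero means success */
}
```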
 

Malcolm McLean

a > 255


I don't think this avoids the problem you are trying to avoid.  -ux is
of type unsigned int and can be arithmetically > INT_MAX.
We check that neither a nor b is greater than 255. So -ux can be at
maximum 0x7FFF at this point, which is the minimum value allowed for
INT_MAX.
 

Ben Bacarisse

Malcolm McLean said:
We check that neither a nor b is greater than 255. So -ux can be at
maximum 0x7FFF at this point, which is the minimum value allowed for
INT_MAX.

All I can say is no and that gcc agrees with me (though there may be a
bug in gcc of course).

Let's take an example.  Given the input 0xff 0xff, both 'a' and 'b' will
be 255 and ux will start the code in question with the value 0xffff.
After the or, the not, and the increment, ux will be 1.  Do you agree so
far?

What unsigned value do you think -ux should have when ux is 1? I can
only think that it will be UINT_MAX.
 


Malcolm McLean

What unsigned value do you think -ux should have when ux is 1?  I can
only think that it will be UINT_MAX.
If ux is 1 then the test ux & 0x8000 fails, so x is set to 1.
If ux is 0xFFFF then the test succeeds, it is sign-extended, inverted
(ux = 0), and incremented, then negated, so x is -1.
If ux is 0x8000 then the test succeeds, it is sign extended, inverted
(ux = 0x7FFF) and incremented (ux = 0x8000), then negated. INT_MIN is
one bit larger than INT_MAX, so this is OK. Although you can see how
the standard was tweaked to work on two's complement machines.

So all the extremes are OK.

(My sister, who's a mathematician, says that negative numbers are just
a fiction anyway).
 

Ben Bacarisse

Malcolm McLean said:
If ux is 1 then the test ux & 0x8000 fails, so x is set to 1.

Not the case I was discussing of course. You snipped the part where I
explain how ux gets to be 1 when it starts as 0xffff. I.e. this case:
If ux is 0xFFFF then the test succeeds, it is sign-extended, inverted
(ux = 0), and incremented, then negated, so x is -1.

No. Once ux is incremented it is 1, so what is the value of the
expression -ux in this case? It is not -1 -- it is UINT_MAX. If
you disagree, please say what part of the standard supports your view.
If ux is 0x8000 then the test succeeds, it is sign extended, inverted
(ux = 0x7FFF) and incremented (ux = 0x8000), then negated. INT_MIN is
one bit larger than INT_MAX, so this is OK.

ux == 0x8000 or 32768. -ux is a *positive number*. On a 32-bit machine
it is 4294934528 and that does not fit into a 32-bit int. On a 16-bit
machine, it is 32768 and that does not fit in a 16-bit int either.
Although you can see how
the standard was tweaked to work on two's complement machines.

So all the extremes are OK.

No, neither extreme is OK (and nor are any of the data points in between).
 


Malcolm McLean

So you're saying we have to say

x = -(int) ux;

to negate an unsigned integer?

Which means that INT_MIN will break.

I think you've found a bug in the standard.
 
