Question about unpacking a binary file: endian troubles

David Buchan

Hi guys,

This may be a dumb question; I'm just getting into C language here.

I wrote a program to unpack a binary file and write out the contents to
a new file as a list of unsigned integers. It works on an IBM mainframe
under AIX with GCC, but has trouble on Intel with GCC (DJGPP). I think
it's an endian problem.

The trouble is, although most values in the extracted list are correct,
I know of at least two values that are definitely wrong. They should be
249 and I get 4294967289 (or -7, depending on whether I have the output
to unsigned integer or just integer, respectively).

Anyway, it sure looks like an endian mixup to me.

The extract program...

#include <stdio.h>

int main ()
{
    int fdi, n;
    char buf[1];
    FILE *fdo, *fopen();

    /* Open binary bitmap file */
    fdi=open("output.bmp", 0);
    if (fdi==-1) {
        printf ("Can't open bitmap file.\n");
        exit (1);
    }

    /* Open decimal output file */
    fdo=fopen("output.dec", "w");

    /* Copy contents of file */
    while ((n=read (fdi, buf, 1))>0) {
        fprintf (fdo, "%u\n", (unsigned int) buf[0]);
    }

    /* Close all files */
    close (fdi);
    close (fdo);
}

Any idea what to do on Intel to make it work?

Thanks,
Dave
 
Eric Sosman

David said:
Hi guys,

This may be a dumb question; I'm just getting into C language here.

I wrote a program to unpack a binary file and write out the contents to
a new file as a list of unsigned integers. It works on an IBM mainframe
under AIX with GCC, but has trouble on Intel with GCC (DJGPP). I think
it's an endian problem.

The trouble is, although most values in the extracted list are correct,
I know of at least two values that are definitely wrong. They should be
249 and I get 4294967289 (or -7, depending on whether I have the output
to unsigned integer or just integer, respectively).

Anyway, it sure looks like an endian mixup to me.

I don't think so, since you seem to be dealing with
single-byte quantities. "Endianness" only comes into play
with multi-byte objects: the question is how the multiple
bytes are organized. When there's only one byte, there's
only one organization.

More likely, you're running into confusion over whether
a `char' is signed or unsigned: on some implementations a
`char' is always zero or positive, while on others some
`char' values are negative. If that's the root of your
problem, the simplest (and most "honest") fix is to read
the data into an `unsigned char' rather than a plain `char'.
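
For instance, a minimal demonstration (assuming plain `char' is
signed, eight bits, two's complement, and a 32-bit int, as is typical
on Intel) reproduces both of the values you reported:

#include <stdio.h>

int main(void)
{
    char c = (char)0xF9;        /* the byte that ought to read as 249 */
    unsigned char u = 0xF9;

    printf("%d\n", c);                /* -7 if plain char is signed   */
    printf("%u\n", (unsigned int)c);  /* 4294967289 with 32-bit int   */
    printf("%u\n", (unsigned int)u);  /* 249, always                  */
    return 0;
}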

There are a few other issues with your code:
The extract program...

#include <stdio.h>

You're using the exit() function, so you should
include <stdlib.h>.
int main ()

`int main(void)' is preferable. There's no effect
on the correctness of the code, but it's considered
better style.
{
int fdi, n;
char buf[1];

This should be `unsigned char buf[1]'. If you adopt
a suggestion I'll make a little further down, it can
even be just `unsigned char buf'.
FILE *fdo, *fopen();

DON'T try to write free-hand declarations of library
functions; you'll sometimes get them wrong and even when
you get them right you may miss out on "implementation
magic" that makes them work more efficiently. In this
case you've already included <stdio.h> so fopen() is
already declared (correctly and efficiently); don't
muddy the waters.
/* Open binary bitmap file */
fdi=open("output.bmp", 0);

Sorry; open() is not a C function. Use fopen()
with "rb" (read, binary) as the second argument and
then use getc() to read each character. `fdi' becomes
a `FILE*' instead of an `int', and the subsequent test
for failure compares against NULL instead of -1.
if (fdi==-1) {
printf ("Can't open bitmap file.\n");
exit (1);

exit(EXIT_FAILURE) is better, because an exit
status of `1' means different things to different
systems. EXIT_FAILURE is declared in <stdlib.h>,
which you included so as to declare exit() properly,
remember?

Alternatively, you could `return EXIT_FAILURE;'
since this is the main() function.
}

/* Open decimal output file */
fdo=fopen("output.dec", "w");

Add `if (fdo == NULL) ...' to check for failure.
/* Copy contents of file */
while ((n=read (fdi, buf, 1))>0) {

When you switch to using fopen() and getc() on
the input file, you'll almost certainly experience a
sudden increase in your program's speed. The code
you've written is likely to be the slowest possible
way to read the data.
fprintf (fdo, "%u\n", (unsigned int) buf[0]);
}

/* Close all files */
close (fdi);

This would become fclose().
close (fdo);

This should have been fclose() in the first place.
As it stands, it's flat-out wrong.

Finally, you need `return EXIT_SUCCESS;' so your
`int'-valued main() can pass something back to the
environment. "Falling off the end," even though it's
been granted an indulgence in the latest version of
the C Standard, was incorrect under the older Standard
and has always been a sign of sloppiness.
 
Keith Thompson

Eric Sosman said:
David Buchan wrote: [...]
/* Open binary bitmap file */
fdi=open("output.bmp", 0);

Sorry; open() is not a C function. Use fopen()
with "rb" (read, binary) as the second argument and
then use getc() to read each character. `fdi' becomes
a `FILE*' instead of an `int', and the subsequent test
for failure compares against NULL instead of -1.

Using open() isn't exactly wrong, it's just non-portable. open() is a
standard function, it's just defined by the POSIX standard, not by the
C standard. You can use it if you like, but it has some drawbacks:
your program is no longer portable to implementations that support ISO
C but not POSIX, and you can't get good answers about it in comp.lang.c.

But in this case, I see no advantage in using open() rather than
fopen(), and a number of potential disadvantages.
 
Bjørn Augestad

Eric said:
I don't think so, since you seem to be dealing with
single-byte quantities. "Endianness" only comes into play
with multi-byte objects: the question is how the multiple
bytes are organized. When there's only one byte, there's
only one organization.

More likely, you're running into confusion over whether
a `char' is signed or unsigned: on some implementations a
`char' is always zero or positive, while on others some
`char' values are negative. If that's the root of your
problem, the simplest (and most "honest") fix is to read
the data into an `unsigned char' rather than a plain `char'.

Maybe the data was written to the file as 2 or 4 byte integers on an MSB
machine?

Bjørn

[snip]
 
David Buchan

Hi Eric,

Wow! That was a super post. I've learned a whole ton of stuff in one
shot.

I've redone the program and it works perfectly now.

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    unsigned int n;
    unsigned char buf[1];
    FILE *fdi, *fdo;

    /* Open binary bitmap file */
    fdi=fopen("clown.bmp", "rb");
    if (fdi==NULL) {
        printf ("Can't open bitmap file.\n");
        exit (EXIT_FAILURE);
    }

    /* Open decimal output file */
    fdo=fopen("output.dec", "w");
    if (fdo==NULL) {
        printf ("Can't open new decimal file.\n");
        exit (EXIT_FAILURE);
    }

    /* Copy contents of file */
    while ((n=getc(fdi)) !=EOF) {
        fprintf (fdo, "%u\n", n);
    }

    /* Close all files */
    fclose (fdi);
    fclose (fdo);
    return EXIT_SUCCESS;
}

I really appreciate you taking the time to type all that out. I've
printed a copy which I'll keep in my file as reference.

Also, as promised, this thing runs like it's supercharged now - an added
bonus.

Thanks again,

Dave
 
Eric Sosman

Bjørn Augestad said:
Eric Sosman wrote:

David said:
[...]
The trouble is, although most values in the extracted list are correct,
I know of at least two values that are definitely wrong. They should be
249 and I get 4294967289 (or -7, depending on whether I have the output
to unsigned integer or just integer, respectively).

Anyway, it sure looks like an endian mixup to me.

I don't think so, since you seem to be dealing with
single-byte quantities. "Endianness" only comes into play
with multi-byte objects: the question is how the multiple
bytes are organized. When there's only one byte, there's
only one organization.

More likely, you're running into confusion over whether
a `char' is signed or unsigned: on some implementations a
`char' is always zero or positive, while on others some
`char' values are negative. If that's the root of your
problem, the simplest (and most "honest") fix is to read
the data into an `unsigned char' rather than a plain `char'.

Maybe the data was written to the file as 2 or 4 byte integers on an MSB
machine?

Possibly; the original poster didn't describe the file
format. However, there are two clues that make me think
he's got a byte-signedness problem rather than an endianness
problem. First, his program reads one byte at a time and
converts that byte as a complete number; there is no attempt
to read multiple bytes at a shot or to combine bytes read
separately. Second, his specific complaint was about 249
being turned into -7, which is exactly what one would expect
from a byte-sign disagreement (with 8-bit two's complement
characters).
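
(In eight-bit two's complement, 249 is the bit pattern 11111001; read
as a signed char that pattern means 249 - 256 = -7, and converting -7
to a 32-bit unsigned int wraps modulo 2^32 to 4294967289, matching
both of the reported values.)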

It's not conclusive, but it's suggestive ...
 
David Buchan

Oh. I forgot.

I can remove...

unsigned char buf[1];

...since I'm not using it anymore.

Also, I might use fgetc instead of getc. Almost the same according to my
book, and seems more consistent.

Thanks again,
Dave
 
Eric Sosman

David said:
Oh. I forgot.

I can remove...

unsigned char buf[1];

...since I'm not using it anymore.

Also, I might use fgetc instead of getc. Almost the same according to my
book, and seems more consistent.

There's one tiny difference between them, a difference
that is almost never important. If evaluating the FILE*
argument has a side-effect or is "expensive," you need to
use fgetc(). Otherwise, you can use getc(), and it may be
just a tiny bit faster than fgetc().
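
A contrived sketch of the one case where the difference matters (a
hypothetical helper; the point is that getc() is allowed to be a
macro that evaluates its stream argument more than once):

#include <stdio.h>

/* Hypothetical example: read one byte from each of two open streams.
   Because getc() may be a macro that evaluates its argument twice,
   an argument with a side effect, like *p++, could be evaluated
   twice; fgetc() is a true function and evaluates it exactly once. */
void read_pair(FILE *f1, FILE *f2, int *a, int *b)
{
    FILE *streams[2];
    FILE **p = streams;

    streams[0] = f1;
    streams[1] = f2;
    *a = fgetc(*p++);   /* safe */
    *b = fgetc(*p++);   /* getc(*p++) here would be asking for trouble */
}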

... and if you're looking for consistency in the names
of library functions, the C library is the wrong place to
look! Different parts of the library were developed by
people with different ideas about how to name things, and
those differences were well-established long before the
Standard codified them. The result? Local lumps of
consistency bobbing in a sea of chaos, and no chance that
they will ever coalesce into Pangaea.
 
Old Wolf

David said:
unsigned int n;

/* Copy contents of file */
while ((n=getc(fdi)) !=EOF) {
fprintf (fdo, "%u\n", n);
}

I have a slight problem with this: you're comparing
the unsigned value n to the signed value -1.

This will do what you intended, because in
such a case the signed value is converted to
unsigned. But usually a signed-unsigned comparison
is an error and some compilers or warning levels
will give a diagnostic. Some people like to
avoid signed-unsigned comparisons on principle
because they are often a source of subtle bugs.

Preferable to me would be for 'n' to be signed
(to match the return type of fgetc), and
you can print it with "%d" because getc/fgetc
always returns a non-negative value, except for EOF.
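
A sketch of the loop with that change applied (same file handles as
in the program above):

#include <stdio.h>

/* n is int, matching getc()'s return type, so the EOF test mixes no
   signed and unsigned operands; the byte values printed are always
   in 0..UCHAR_MAX, so %d is safe. */
void copy_as_decimal(FILE *fdi, FILE *fdo)
{
    int n;

    while ((n = getc(fdi)) != EOF)
        fprintf(fdo, "%d\n", n);
}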
 
Robert

Ok, so taking this example further... what is the best strategy
for dealing with long[er] integers. Say, a time_t (among other
things) for inclusion in a URL. Nevermind the RFC1738 encoding
for the moment (non A-Z,a-z,0-9 characters as %dd in hex) -- or
possibly a slightly modified base64 encoding for URL safety.

First, one converts the 32-bit time_t to network byte order using
htonl(), correct? Assuming an application on the other end will use
ntohl(), what order do the octets get cast to u_char's? Big end first
like:

int x;
uint32_t i;
char dest[5];
time_t now;

now = time ((time_t *) NULL);

i = htonl (((uint32_t) now));

for (x=3; x >= 0; x--)
{
    dest[x] = (char) ((i & (0xff << (x*8))) >> (x*8));
}
dest[4] = '\0';

Or is there something simpler? Maybe it really doesn't matter, but if
the 'thing' on the other end is a perl cgi, it'd be nice if there were
something simple to deal with it.

TIA,
Robert
 
Chris Croughton

Ok, so taking this example further... what is the best strategy
for dealing with long[er] integers. Say, a time_t (among other
things) for inclusion in a URL. Nevermind the RFC1738 encoding
for the moment (non A-Z,a-z,0-9 characters as %dd in hex) -- or
possibly a slightly modified base64 encoding for URL safety.

You need to ask in a CGI or web newsgroup about that, it will be
specific to that environment.
First, one converts the 32-bit time_t to network byte order using
htonl(), correct? Assuming an application on the other end will use
ntohl(), what order do the octets get cast to u_char's? Big end first
like:

You will be dealing with octets with strange values (0x00 in particular
may well give problems). You need to encode that data in the format
expected by the application receiving it, which may be decimal, hex,
base64 or something else.
Or is there something simpler? Maybe it really doesn't matter, but if
the 'thing' on the other end is a perl cgi, it'd be nice if there were
something simple to deal with it.

There may well be, but you need to find out about that specific Perl
CGI. First find out what application you're using, and what it wants as
input. I would suspect either decimal or hex if it's expecting a
number, or a time value (hh:mm:ss for instance) for a time. If the value
is an IPv4 host address then you probably need it as a dotted quad
(192.168.2.15 for instance). There is no generic way of encoding a
value in a URL.
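
For the decimal-or-hex case the standard library already suffices; a
minimal sketch (assuming the value fits in an unsigned long):

#include <stdio.h>

/* Render a value in the two forms a CGI script is most likely to
   expect; both use only URL-safe characters. */
void render(unsigned long v)
{
    char dec[32], hex[32];

    sprintf(dec, "%lu", v);   /* decimal, e.g. "4294967289" */
    sprintf(hex, "%lx", v);   /* hex,     e.g. "fffffff9"   */
    printf("dec=%s hex=%s\n", dec, hex);
}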

Chris C
 
Richard Bos

Robert said:
Ok, so taking this example further... what is the best strategy
for dealing with long[er] integers. Say, a time_t (among other
things) for inclusion in a URL. Nevermind the RFC1738 encoding
for the moment (non A-Z,a-z,0-9 characters as %dd in hex) -- or
possibly a slightly modified base64 encoding for URL safety.

First, one converts the 32-bit time_t to network byte order using
htonl(), correct?

You don't know that a time_t is an integer; you don't know that it is 32
bits large; and htonl() isn't ISO C. IIRC POSIX defines all of this, but
ISO does not.

What's more, using a straight, untranslated time_t in _anything_
external to your program is a bad idea indeed. If nothing else, sooner
or later you'll want to start using 64-bits time_t's - at the very least
before 2038.
i = htonl (((uint32_t) now));

for (x=3; x >= 0; x--)
{
    dest[x] = (char) ((i & (0xff << (x*8))) >> (x*8));
}
dest[4] = '\0';

Or is there something simpler?

Assuming for the moment that you have a valid 32 bit unsigned integer,
with a value that holds meaning (i.e., _not_ a time_t), then you can
leave out the htonl(). &, << and >> work by value, not by bit pattern.
Also, you want to use unsigned chars for your bytes (which can hold at
least an octet, maybe even more), not a plain char, which can be signed.
So you would do

int x;
unsigned char dest[4];
uint_fast32_t i; /* This type is required; uint32_t is not. */

i= some_value;
for (x=0; x<4; x++) {
    dest[x]=i&0xFF;
    i>>=8;
}

if you want dest[0] to hold the low byte of i, or

for (x=3; x>=0; x--) {
    dest[x]=i&0xFF;
    i>>=8;
}

if you want a big-endian array.
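
A sketch of the reverse operation for the first (low-byte-first)
layout; shifts and ORs again work by value, so nothing endian-specific
is needed:

#include <stdint.h>

uint_fast32_t unpack32(const unsigned char dest[4])
{
    int x;
    uint_fast32_t v = 0;

    for (x = 3; x >= 0; x--)
        v = (v << 8) | dest[x];   /* dest[0] ends up as the low byte */
    return v;
}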

Richard
 
David Buchan

I have a slight problem with this: you're comparing
the unsigned value n to the signed value -1.

This will do what you intended, because in
such a case the signed value is converted to
unsigned. But usually a signed-unsigned comparison
is an error and some compilers or warning levels
will give a diagnostic. Some people like to
avoid signed-unsigned comparisons on principle
because they are often a source of subtle bugs.

Preferable to me would be for 'n' to be signed
(to match the return type of fgetc), and
you can print it with "%d" because getc/fgetc
always returns a non-negative value, except for EOF.
Good point!

I've made the correction.

Thanks,
Dave
 
Robert

Richard said:
Robert said:
Ok, so taking this example further... what is the best strategy
for dealing with long[er] integers. Say, a time_t (among other
things) for inclusion in a URL. Nevermind the RFC1738 encoding
for the moment (non A-Z,a-z,0-9 characters as %dd in hex) -- or
possibly a slightly modified base64 encoding for URL safety.

First, one converts the 32-bit time_t to network byte order using
htonl(), correct?

You don't know that a time_t is an integer; you don't know that it is 32
bits large; and htonl() isn't ISO C. IIRC POSIX defines all of this, but
ISO does not.

Yes, you are right. An ill-chosen example, but instructive
nonetheless. On some platforms the epoch doesn't even begin at
the same point in time, making things worse.
What's more, using a straight, untranslated time_t in _anything_
external to your program is a bad idea indeed. If nothing else, sooner
or later you'll want to start using 64-bits time_t's - at the very least
before 2038.

Understood. What would be your recommendation for passing time values
between systems then? I assume by "untranslated" you mean the raw,
perhaps system-dependent format, and translated would mean some ASCII
representation?

I found a single reference to some kind of proposal for ISO C 200X to
modernize time-related functionality here:

http://www.cl.cam.ac.uk/~mgk25/time/c/

Any other efforts to update/modernize how time is handled?

Regarding applications that use/pass 32-bit time values, take a
look at the mod_unique_id module in Apache -- source here:

http://lxr.webperf.org/source.cgi/modules/metadata/mod_unique_id.c#112

At line 112, you'll note a comment about converting from using time_t
to an unsigned int to avoid problems with platforms that use 64-bit
time_t's. All that says is that perhaps it was a bad idea, they're
stuck trying to maintain it for backwards compatibility, but on the
other hand... for something handling lots of requests per second, you
don't want to translate something to some character set, then encode
it.
One probably should find some means of dealing with seconds or
microseconds.
Perhaps I should dig into some sources for protocols that require time
synchronization? But I guess I'm wandering OT for this group.

[...]
Assuming for the moment that you have a valid 32 bit unsigned integer,
with a value that holds meaning (i.e., _not_ a time_t), then you can
leave out the htonl(). &, << and >> work by value, not by bit pattern.
Also, you want to use unsigned chars for your bytes (which can hold at
least an octet, maybe even more), not a plain char, which can be signed.
So you would do

int x;
unsigned char dest[4];
uint_fast32_t i; /* This type is required; uint32_t is not. */

This intrigued me. I can't find much on uint_fast32_t, except that
it is supposed to be the most efficient means of storage for a given
implementation. Please expound on your comment a bit! Thanks,
Richard!

-Robert
 
Michael Mair

Robert said:
This intrigued me. I can't find much on uint_fast32_t, except that
it is supposed to be the most efficient means of storage for a given
implementation. Please expound on your comment a bit!

In the C99 standard library header file <stdint.h>, typedefs
for int_leastN_t, int_fastN_t, uint_leastN_t, uint_fastN_t
are provided where N can at least be 8, 16, 32, 64 (but in principle,
uint_fast24_t also could be provided). These types are both
at least N bits wide. Moreover, it is possible but not guaranteed
that also intN_t and uintN_t are provided which are exact width
integers. You can check for a certain width by checking whether
the corresponding maximal (or minimal value) is defined,
e.g.

#include <stdint.h>
#ifdef UINT32_MAX
typedef uint32_t Counter;
#else
typedef uint_least32_t Counter;
#endif


Cheers
Michael
 
Robert

Michael said:
In the C99 standard library header file <stdint.h>, typedefs
for int_leastN_t, int_fastN_t, uint_leastN_t, uint_fastN_t
are provided where N can at least be 8, 16, 32, 64 (but in principle,
uint_fast24_t also could be provided). These types are both
at least N bits wide. Moreover, it is possible but not guaranteed
that also intN_t and uintN_t are provided which are exact width
integers. You can check for a certain width by checking whether
the corresponding maximal (or minimal value) is defined,
e.g.

#include <stdint.h>
#ifdef UINT32_MAX
typedef uint32_t Counter;
#else
typedef uint_least32_t Counter;
#endif

Ah, that makes sense. Well sort of... Re, the standard, I found the
text -- at least I have something to read now. ;-)

7.18.1.1 Exact-width integer types

1 The typedef name intN_t designates a signed integer type with width N,
no padding bits, and a two's complement representation. Thus, int8_t
denotes a signed integer type with a width of exactly 8 bits.

2 The typedef name uintN_t designates an unsigned integer type with
width N. Thus, uint24_t denotes an unsigned integer type with a width
of exactly 24 bits.

3 These types are optional. However, if an implementation provides
integer types with widths of 8, 16, 32, or 64 bits, it shall define
the corresponding typedef names.

Can you name an implementation that doesn't provide integer types with
widths of 8, 16, 32, or 64 bits?? That means explicit widths, correct?
Not just short and long. Yes, I know, I shouldn't base any code on
existing (or non-existing) implementations, so forget I asked. ;-)
Makes me wish I could have been a fly on the wall to figure out why
point 3 was added at all.

Why does BSD add the u_intN[_t] types, or do they come from another
place? An earlier standard? BTW, I've taken to using an autoconf
macro to autogenerate a stdint.h, and it may include the standard
library header file, and create any missing typedefs, including the
ones labeled 'optional' above. But now I understand why it has to
do what it does! Thanks!!

-Robert
 
Keith Thompson

Robert said:
Can you name an implementation that doesn't provide integer types with
widths of 8, 16, 32, or 64 bits?? That means explicit widths, correct?
Not just short and long. Yes, I know, I shouldn't base any code on
existing (or non-existing) implementations, so forget I asked. ;-)
Makes me wish I could have been a fly on the wall to figure out why
point 3 was added at all.

Cray T3E has 8-bit char, 32-bit short, 64-bit int, long, and long long;
there is no 16-bit integer type.

Cray T90 and SV1 have 8-bit char, 64-bit short, int, long, and long long;
there is no 16-bit or 32-bit integer type.

Some systems (DSPs, I think) have CHAR_BIT equal to 16 or 32, and thus
no 8-bit integer type.

This makes porting software interesting.
 
Richard Bos

Robert said:
Understood. What would be your recommendation for passing time values
between systems then? I assume by "untranslated" you mean the raw,
perhaps system-dependent format, and translated would mean some ASCII
representation?

Yes, and possibly. My preference would be either textual (not
necessarily ASCII; may be Unicode, may still be EBCDIC on some systems)
YYYYMMDDhhmmss.fraction or the relevant portion thereof, or the Julian
Day Number in some kind of well-defined floating point format.
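
A sketch of producing that textual form with nothing beyond the
standard library (gmtime() for UTC; no fractional part, since <time.h>
only guarantees whole seconds):

#include <stdio.h>
#include <time.h>

/* Format the current time as YYYYMMDDhhmmss in UTC.
   Both gmtime() and strftime() can fail, so check them. */
int print_stamp(void)
{
    char stamp[15];              /* 14 digits plus '\0' */
    time_t now = time(NULL);
    struct tm *utc = gmtime(&now);

    if (utc == NULL)
        return -1;
    if (strftime(stamp, sizeof stamp, "%Y%m%d%H%M%S", utc) == 0)
        return -1;
    puts(stamp);
    return 0;
}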
Any other efforts to update/modernize how time is handled?

Dozens. There's a thread on comp.std.c every now and then.
Unfortunately, time handling that is all of implementable, complete,
usable and accurate is much harder than it appears at first sight.
Assuming for the moment that you have a valid 32 bit unsigned integer,
with a value that holds meaning (i.e., _not_ a time_t), then you can
leave out the htonl(). &, << and >> work by value, not by bit pattern.
Also, you want to use unsigned chars for your bytes (which can hold at
least an octet, maybe even more), not a plain char, which can be signed.
So you would do

int x;
unsigned char dest[4];
uint_fast32_t i; /* This type is required; uint32_t is not. */

This intrigued me. I can't find much on uint_fast32_t, except that
it is supposed to be the most efficient means of storage for a given
implementation. Please expound on your comment a bit!

There's little to expound on.

uint32_t does not exist in C89; it does exist in C99, but is not a
required type under that Standard. If the implementation provides a
suitable integer type that could serve as a uint32_t, then it must also
provide uint32_t; but it is not required to provide such a type. For
example, on a hypothetical mainframe on which all types are 64 bits
large, there would not be any unsigned type of exactly 32 bits;
therefore, there would not be a uint32_t.

uint_fast32_t _is_ required to exist under C99, but isn't required to be
exactly 32 bits large; it may be larger. Ditto for uint_least32_t. The
difference is that uint_fast32_t should be the fastest type that can
contain at least 32 bits, while uint_least32_t should be the smallest
type of 32 bits or more. On most modern systems, both these types would
probably be equal to unsigned long, and so would uint32_t, but this
isn't required by the Standard.

In any case, if you write ISO C code and you have uint32_t, you are
using a C99 compiler; therefore, you also have uint_fast32_t; and the
latter type is more portable. It's also possibly faster, certainly not
slower; and any possible extra bits won't hurt you unless you have a lot
of them and it happens to be a 64-bit type, which is unlikely.
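
A quick sketch to see what a given implementation chose (CHAR_BIT is
from <limits.h>; note these figures count storage bits, so padding
bits, if any, could make the actual width smaller):

#include <stdio.h>
#include <limits.h>
#include <stdint.h>

int main(void)
{
    printf("uint_fast32_t:  %lu storage bits\n",
           (unsigned long)(sizeof(uint_fast32_t) * CHAR_BIT));
    printf("uint_least32_t: %lu storage bits\n",
           (unsigned long)(sizeof(uint_least32_t) * CHAR_BIT));
#ifdef UINT32_MAX
    puts("uint32_t:       provided, exactly 32 value bits");
#else
    puts("uint32_t:       not provided");
#endif
    return 0;
}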

Richard
 
Flash Gordon

Cray T3E has 8-bit char, 32-bit short, 64-bit int, long, and long
long; there is no 16-bit integer type.

Cray T90 and SV1 have 8-bit char, 64-bit short, int, long, and long
long; there is no 16-bit or 32-bit integer type.

Some systems (DSPs, I think) have CHAR_BIT equal to 16 or 32, and thus
no 8-bit integer type.

Not forgetting the old machines that people keep mentioning which have
CHAR_BIT == 9.
This makes porting software interesting.

Yes, it's wonderful fun.

<OT>
Dual port RAM shared between processors with differing endianess can
make life even more fun.
</OT>
 
