little endian

Joe Wright

Eric said:
You needn't know or care: Richard's method works unchanged
on Little-, Big-, Middle-, and Mixed-Endian platforms. (Extra
credit: Devise a corresponding "endian-oblivious" scheme for
writing integers.)

Are you worried about offending the Little Tin God? Given
that there's I/O going on, it's dollars to doughnuts that your
CPU is loafing anyhow and could afford to do a half-dozen FFTs
between reads without slowing anything down.

I fear I am being misunderstood. Consider that the first 32 bytes of a
.DBF file have a structure like this..

typedef struct {
uchar version; /* 00 0x03 or 0x83 (with .dbt file) */
uchar date[3]; /* 01 Date YY MM DD in binary */
ulong numrecs; /* 04 Number of records in data file */
ushort hdrlen; /* 08 Offset to first record */
ushort reclen; /* 0A Length of each record */
uchar reserved[20];/* 0C Balance of 32 bytes */
} HEADER;

The ulong (unsigned long [32]) and ushort (unsigned short [16]) are
little endian. The .DBF is native format of dBASE II, dBASE III, FoxPro,
Clipper and other database management systems born on Intel processors.

The values are stored Little Endian!

In manipulating the .DBF we have to read the various numrecs, hdrlen and
reclen values into a HEADER structure. A straightforward fread().

If numrecs is ten, those four bytes will be 0A000000. All well and good
on x86 hardware. But what about Sparc hardware? Our numrecs is still
four bytes but its value is now 167 million and more.

Clearly my C programs must know whether they are running on big or
little endian hardware so that they know the value of 0A000000.
 
Chris Torek

I fear I am being misunderstood. Consider that the first 32 bytes of a
.DBF file have a structure like this..

typedef struct {
uchar version; /* 00 0x03 or 0x83 (with .dbt file) */
uchar date[3]; /* 01 Date YY MM DD in binary */
ulong numrecs; /* 04 Number of records in data file */
ushort hdrlen; /* 08 Offset to first record */
ushort reclen; /* 0A Length of each record */
uchar reserved[20];/* 0C Balance of 32 bytes */
} HEADER;

The ulong (unsigned long [32]) and ushort (unsigned short [16]) are
little endian. The .DBF is native format of dBASE II, dBASE III, FoxPro,
Clipper and other database management systems born on Intel processors.

The values are stored Little Endian!

In manipulating the .DBF we have to read the various numrecs, hdrlen and
reclen values into a HEADER structure. A straightforward fread().

"Doctor, doctor, it hurts when I do this!"

"Well, don't do that, then."

:)

Seriously, the above is a classic example of "what not to do". Using
fread() and fwrite() on the raw internal data structures not
only makes you endian-dependent, it also makes you alignment- and
size-dependent. The code will fail horribly on machines where
"ulong" is 8 bytes instead of 4. Even "ushort" is 8 bytes on a Cray
(assuming "ushort" is an alias for "unsigned short" -- there is no
2-byte data type).

To read or write a ".DBF" file, "don't do that, then". Instead of:

result = fread(&header, sizeof header, 1, fp);
if (result != 1) ... handle error ...

-- which is admittedly very short -- use the more verbose:

unsigned char buf[32];

result = fread(buf, sizeof buf, 1, fp);
if (result != 1) ... handle error ...
/* optionally, check buf[0] here */
header.version = buf[0];
header.date[0] = buf[1];
header.date[1] = buf[2];
header.date[2] = buf[3];
header.numrecs = buf[4] + (buf[5] << 8) +
((ulong)buf[6] << 16) + ((ulong)buf[7] << 24);
header.hdrlen = buf[8] + (buf[9] << 8);
header.reclen = buf[10] + (buf[11] << 8);
/* ignore or copy "spare" bytes as desired */

If numrecs is ten, those four bytes will be 0A000000. All well and good
on x86 hardware. But what about Sparc hardware? Our numrecs is still
four bytes but its value is now 167 million and more.

The above works fine on SPARC, Cray, x86-64, and even the hardware
that has not been invented yet that will come out in four years.

Clearly my C programs must know whether they are running on big or
little endian hardware so that they know the value of 0A000000.

Or, maybe not. :)

Endianness is a problem only if you let someone else take apart
and put together your data into sub-units like "bytes" (unsigned
char, in C). If you do it "manually", instead of having the machine
do it for you inside fread() and fwrite(), *you* control the details.
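
The same approach covers writing, which is also one answer to the "extra credit" question about an endian-oblivious scheme for writing integers. What follows is only a sketch, assuming 8-bit bytes and the 32-byte layout quoted above. The names DbfHeader, put_le16, put_le32, pack_header and write_header are made up for illustration, and writing the reserved bytes as zero is an assumption, not something the thread specifies.

```c
#include <stdio.h>
#include <string.h>

/* In-memory header; the member types only need to be wide enough. */
typedef struct {
    unsigned char  version;
    unsigned char  date[3];
    unsigned long  numrecs;   /* at least 32 bits */
    unsigned short hdrlen;    /* at least 16 bits */
    unsigned short reclen;
} DbfHeader;

/* Store the low 16 bits of v at p[0..1], least significant byte first. */
void put_le16(unsigned char *p, unsigned int v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
}

/* Store the low 32 bits of v at p[0..3], least significant byte first. */
void put_le32(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Build the 32-byte on-disk image; reserved bytes are zeroed (assumption). */
void pack_header(unsigned char buf[32], const DbfHeader *h)
{
    memset(buf, 0, 32);
    buf[0] = h->version;
    memcpy(buf + 1, h->date, 3);
    put_le32(buf + 4, h->numrecs);
    put_le16(buf + 8, h->hdrlen);
    put_le16(buf + 10, h->reclen);
}

/* Returns 1 on success, 0 on a short write. */
int write_header(FILE *fp, const DbfHeader *h)
{
    unsigned char buf[32];

    pack_header(buf, h);
    return fwrite(buf, sizeof buf, 1, fp) == 1;
}
```

The shifts and masks work on values, not on storage, so the bytes land in the same file order whatever the host's byte order or the actual width of unsigned long.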
 
CBFalconer

Joe said:
.... snip ...

I fear I am being misunderstood. Consider that the first 32 bytes
of a .DBF file have a structure like this..

typedef struct {
uchar version; /* 00 0x03 or 0x83 (with .dbt file) */
uchar date[3]; /* 01 Date YY MM DD in binary */
ulong numrecs; /* 04 Number of records in data file */
ushort hdrlen; /* 08 Offset to first record */
ushort reclen; /* 0A Length of each record */
uchar reserved[20];/* 0C Balance of 32 bytes */
} HEADER;

The ulong (unsigned long [32]) and ushort (unsigned short [16]) are
little endian. The .DBF is native format of dBASE II, dBASE III, FoxPro,
Clipper and other database management systems born on Intel processors.

The values are stored Little Endian!

The point is that the file arrangement is fixed. Get rid of that
definition and define the fields as arrays of unsigned char, which
you know to be describing little endian values of various lengths
in some places. Then use the methods that have been described here
to make local endian independent conversions to and from those
fields. Remember that C file i/o routines are always reading and
writing sequences of bytes.
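
One way this suggestion might look in practice is sketched below, assuming 8-bit bytes. DbfRawHeader, get_le and set_le are hypothetical names. Compilers generally add no padding to a struct whose members are all unsigned char, but the standard does not promise that, so the sketch includes a compile-time size check.

```c
#include <stddef.h>

/* On-disk image of the header: byte arrays only, no host integers. */
typedef struct {
    unsigned char version;
    unsigned char date[3];
    unsigned char numrecs[4];    /* little endian */
    unsigned char hdrlen[2];     /* little endian */
    unsigned char reclen[2];     /* little endian */
    unsigned char reserved[20];
} DbfRawHeader;

/* Fails to compile (negative array size) if the struct is not 32 bytes. */
typedef char dbf_raw_header_must_be_32_bytes
    [sizeof(DbfRawHeader) == 32 ? 1 : -1];

/* Read an n-byte little-endian field into a host integer. */
unsigned long get_le(const unsigned char *p, size_t n)
{
    unsigned long v = 0;

    while (n-- > 0)
        v = (v << 8) | p[n];     /* most significant byte is consumed first */
    return v;
}

/* Store the low n bytes of v into a little-endian field. */
void set_le(unsigned char *p, size_t n, unsigned long v)
{
    size_t i;

    for (i = 0; i < n; i++) {
        p[i] = (unsigned char)(v & 0xFF);
        v >>= 8;
    }
}
```

With this, reading the record count becomes get_le(h.numrecs, sizeof h.numrecs), and the struct itself can be passed to fread() or fwrite() because it is nothing but a sequence of bytes.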
 
Richard Heathfield

Joe Wright said:

I fear I am being misunderstood.

No, you're just stuck in a mind-rut. (As are we all, from time to time.)

Consider that the first 32 bytes of a
.DBF file have a structure like this..

typedef struct {
uchar version; /* 00 0x03 or 0x83 (with .dbt file) */
uchar date[3]; /* 01 Date YY MM DD in binary */
ulong numrecs; /* 04 Number of records in data file */
ushort hdrlen; /* 08 Offset to first record */
ushort reclen; /* 0A Length of each record */
uchar reserved[20];/* 0C Balance of 32 bytes */
} HEADER;

Now consider the possibility that the first 32 bytes of a .DBF file are
represented on the disk like this:

unsigned char firstbyte;
unsigned char secondbyte;
unsigned char thirdbyte;
......etc

The ulong (unsigned long [32]) and ushort (unsigned short [16]) are
little endian. The .DBF is native format of dBASE II, dBASE III, FoxPro,
Clipper and other database management systems born on Intel processors.

The values are stored Little Endian!

No, the values stored are just bytes.

In manipulating the .DBF we have to read the various numrecs, hdrlen and
reclen values into a HEADER structure. A straightforward fread().

Stop right there. Rewind. Re-read. This time, byte by byte, assembling
aggregate (multi-byte) values "manually".

If numrecs is ten, those four bytes will be 0A000000. All well and good
on x86 hardware. But what about Sparc hardware? Our numrecs is still
four bytes but its value is now 167 million and more.

No, you read the first byte: 0A. Okay, store that in your unsigned long, and
now read the second byte, and multiply by 256, and add in. 0000000A +
00000000 = 0000000A. Now read the third byte, and multiply by 256^2, and
add in. That's 0000000A + 00000000 which is 0000000A. Now read the fourth
byte, multiply by 256^3, and add in. That's 0000000A + 00000000 = 0000000A,
which is the correct answer, irrespective of what end your CPU is ianing.

Clearly my C programs must know whether they are running on big or
little endian hardware so that they know the value of 0A000000.

What matters is the file's endianism, not the hardware's endianism. If your
file's ends are the other way about, your reader needs to be the other way
about, too. In fact, the number of different readers you need = number of
different integer formats (endianisms, signs) * number of different integer
sizes. That's always assuming you share with the file generating program a
common notion of the number of bits in a byte, of course!
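
The multiply-and-add arithmetic above, and the point that the reader has to match the file's byte order, can be sketched as two small functions. The names le_value and be_value are made up for illustration, and 8-bit bytes are assumed.

```c
/* Little endian file order: the first byte has weight 1, the next 256,
   then 256^2, and so on. */
unsigned long le_value(const unsigned char *b, int n)
{
    unsigned long value = 0, weight = 1;
    int i;

    for (i = 0; i < n; i++) {
        value += b[i] * weight;
        weight *= 256;
    }
    return value;
}

/* Big endian file order: each new byte pushes the earlier ones up
   by a factor of 256. */
unsigned long be_value(const unsigned char *b, int n)
{
    unsigned long value = 0;
    int i;

    for (i = 0; i < n; i++)
        value = value * 256 + b[i];
    return value;
}
```

For the bytes 0A 00 00 00, le_value gives 10 and be_value gives 167772160 (0x0A000000), on any host. The difference comes from which reader is chosen for the file, not from the CPU running it.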
 
Joe Wright

Richard said:
Joe Wright said:

I fear I am being misunderstood.

No, you're just stuck in a mind-rut. (As are we all, from time to time.)

Consider that the first 32 bytes of a
.DBF file have a structure like this..

typedef struct {
uchar version; /* 00 0x03 or 0x83 (with .dbt file) */
uchar date[3]; /* 01 Date YY MM DD in binary */
ulong numrecs; /* 04 Number of records in data file */
ushort hdrlen; /* 08 Offset to first record */
ushort reclen; /* 0A Length of each record */
uchar reserved[20];/* 0C Balance of 32 bytes */
} HEADER;

Now consider the possibility that the first 32 bytes of a .DBF file are
represented on the disk like this:

unsigned char firstbyte;
unsigned char secondbyte;
unsigned char thirdbyte;
.....etc

The ulong (unsigned long [32]) and ushort (unsigned short [16]) are
little endian. The .DBF is native format of dBASE II, dBASE III, FoxPro,
Clipper and other database management systems born on Intel processors.

The values are stored Little Endian!

No, the values stored are just bytes.

In manipulating the .DBF we have to read the various numrecs, hdrlen and
reclen values into a HEADER structure. A straightforward fread().

Stop right there. Rewind. Re-read. This time, byte by byte, assembling
aggregate (multi-byte) values "manually".

If numrecs is ten, those four bytes will be 0A000000. All well and good
on x86 hardware. But what about Sparc hardware? Our numrecs is still
four bytes but its value is now 167 million and more.

No, you read the first byte: 0A. Okay, store that in your unsigned long, and
now read the second byte, and multiply by 256, and add in. 0000000A +
00000000 = 0000000A. Now read the third byte, and multiply by 256^2, and
add in. That's 0000000A + 00000000 which is 0000000A. Now read the fourth
byte, multiply by 256^3, and add in. That's 0000000A + 00000000 = 0000000A,
which is the correct answer, irrespective of what end your CPU is ianing.

Clearly my C programs must know whether they are running on big or
little endian hardware so that they know the value of 0A000000.

What matters is the file's endianism, not the hardware's endianism. If your
file's ends are the other way about, your reader needs to be the other way
about, too. In fact, the number of different readers you need = number of
different integer formats (endianisms, signs) * number of different integer
sizes. That's always assuming you share with the file generating program a
common notion of the number of bits in a byte, of course!

Thanks. I see the rut now and I'm climbing out.
 
christian.bau

Joe said:
Hi Jay, tell me how.

I have Standard C programs which must read, write and manipulate .DBF
data files. The .DBF file contains interesting data in 16 and 32 bit
integers in little endian format. My C programs must perform identically
on Sparc (big endian) and x86 (little endian) boxes. How shall I do that
without knowing and caring about endianess of the box?

Let's say you have an array

unsigned char data [6];

of eight-bit bytes which contains one 16 bit and one 32 bit unsigned
integer in little-endian format. You want to get the values into two
variables

unsigned int x1;
unsigned long x2; // x1 and x2 are guaranteed to be big enough

x1 = data [0] + (data [1] << 8);
x2 = data [2] + (data [3] << 8) + (((unsigned long) data [4]) << 16) +
(((unsigned long) data [5]) << 24);

This works even if your box has some weird mixed-endian format.
 
Richard Bos

dick said:
easy.

short int a=0x0001;

if( (*(char*)&a)>'\000')
{
// little endian
}
else
{
// big endian
}

good enough?

No. System has 16-bit chars, 16-bit shorts, 32-bit ints, and is
big-endian for chars and shorts within ints. Program falls over.

There is no general solution.

Richard
 
Kenneth Brody

raghu said:
Is it possible to know whether a system is little endian or big endian
by writing a C program? If so, can anyone please give me the idea to
approach...
[...]

Well, ignoring the "why should it matter" angle...

Overlay an int with an unsigned char array, store 0x11223344
(assuming 32-bit ints here) and examine it.

And don't forget that there are more than "little endian" and "big
endian" systems out there. I seem to recall that some form of
VAXen use an inside-out type order, where each 16-bit "word" is
stored little-endian, but each "word" in a 32-bit "dword" is
stored big-endian. (ie: 0x11223344 would be stored in the order
"22 11 44 33".)

Back to the "why should it matter" angle, I have the same issue.
I manage a program for which certain modules need to know the byte
ordering -- such as the program which converts the data files from
one byte order to another to move data between platforms. I wrote
a simple program which does the above (and more, as I need to know
the padding used as well) and dumps the unsigned char array to
stdout for me to examine. From there, I tweak the header file used
by the program to specify which byte order and padding is used
natively. (This could probably be automated nowadays, like many
gnu "configure" scripts do, but this predates gnu, and it's
probably overkill for this particular need.)
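
A dump program of the kind described here might look roughly like the sketch below. It is not the poster's actual program. hex_bytes and show_native_order are hypothetical names, 8-bit bytes are assumed, and the object is inspected through an unsigned char pointer instead of a literal overlay, which gives the same view of the bytes.

```c
#include <stdio.h>
#include <string.h>

/* Format the n bytes of obj as two-digit hex values separated by
   spaces, in increasing address order. out must hold 3*n chars. */
void hex_bytes(const void *obj, size_t n, char *out)
{
    const unsigned char *p = obj;
    size_t i;

    out[0] = '\0';
    for (i = 0; i < n; i++)
        sprintf(out + 3 * i, i + 1 < n ? "%02X " : "%02X", (unsigned)p[i]);
}

/* Print how this machine lays out 0x11223344 in an unsigned long. */
void show_native_order(void)
{
    unsigned long probe = 0x11223344UL;
    char text[3 * sizeof probe];

    hex_bytes(&probe, sizeof probe, text);
    printf("0x11223344 is stored as: %s\n", text);
}
```

The output depends on the platform, which is the whole point: the result is meant to be read by a person (or a configure step) and fed into a header, not relied on by portable file-handling code. Note that unsigned long may be wider than 32 bits, in which case extra zero bytes appear in the dump.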

--
+-------------------------+--------------------+-----------------------+
| Kenneth J. Brody | www.hvcomputer.com | #include |
| kenbrody/at\spamcop.net | www.fptech.com | <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------+
 
sjdevnull

Kenneth said:
raghu said:
Is it possible to know whether a system is little endian or big endian
by writing a C program? If so, can anyone please give me the idea to
approach...
[...]

Well, ignoring the "why should it matter" angle...

Overlay an int with an unsigned char array, store 0x11223344
(assuming 32-bit ints here) and examine it.

And don't forget that there are more than "little endian" and "big
endian" systems out there. I seem to recall that some form of
VAXen use an inside-out type order, where each 16-bit "word" is
stored little-endian, but each "word" in a 32-bit "dword" is
stored big-endian. (ie: 0x11223344 would be stored in the order
"22 11 44 33".)

It was the PDP-11 (and probably other PDPs).

http://www.idiap.ch/~formaz/doc/glibdocs/glib-byte-order-macros.html
says:

"Finally, to complicate matters, some other processors store the bytes
in a rather curious order known as PDP-endian. For a 4-byte word, the
3rd most significant byte is stored first, then the 4th, then the 1st
and finally the 2nd."

Which I think is equivalent to what you said.
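
A mixed order like this needs no special treatment if the value is assembled byte by byte. The sketch below assumes 8-bit bytes and the layout given earlier in the thread, where 0x11223344 is stored as "22 11 44 33". get_pdp32 is a made-up name.

```c
/* Decode four bytes stored high 16-bit word first, with each word's
   least significant byte first, e.g. "22 11 44 33" for 0x11223344. */
unsigned long get_pdp32(const unsigned char *b)
{
    return ((unsigned long)b[1] << 24) | ((unsigned long)b[0] << 16) |
           ((unsigned long)b[3] << 8)  |  (unsigned long)b[2];
}
```

Only the index-to-shift mapping differs from the little-endian and big-endian readers. The host's own byte order still plays no part.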
 
Dave Thompson

For example (untested):

Apparently.

/* convert 4 octet little endian to integer */
/* assumes each byte contains one octet */
/* also that UINT_MAX is >= 2 ** 32 */
/* (else use longs, which will always work) */
unsigned int convert4(const char *s) {

If you make s const unsigned char * (and assume byte=octet as already)
you don't need to mask below. Or if you want to keep plain char for
the convenience of (the) caller(s), copy s to say slocal and use that.

unsigned int i, val;

for (i = val = 0; i < 4; i++)
val = val * 256 + ((s[i] & 0ffh) << (8 * i));


You want _either_ val = val * 256 + s[i]
/* big-endian, run i from 3 downto 0 for littleendian */

_or_ val = val + (s[i] << 8*i)
/* littleendian, 8*(3-i) for bigendian */
except you actually need (unsigned int)s[i] because otherwise the
shift is done in _signed_ int possibly 1+31-bit and overflow is UB.

And you can make the + a | instead, which I consider clearer in at
least the second case, which emphasizes the bit-representation of
numbers, plus there it doesn't need the parentheses for grouping.
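
Putting these corrections together, the function might end up as in the sketch below. It takes const unsigned char *, uses the shift form with |, and casts before shifting. Returning unsigned long instead of unsigned int is a choice made here, following the quoted comment that longs will always work.

```c
/* Convert 4 octets, least significant first, to an integer.
   Assumes each byte holds one octet. */
unsigned long convert4(const unsigned char *s)
{
    unsigned long val = 0;
    int i;

    for (i = 0; i < 4; i++)
        val |= (unsigned long)s[i] << 8 * i;   /* cast first, then shift */
    return val;
}
```

For the big-endian variant mentioned above, the shift count would be 8 * (3 - i) instead.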

- David.Thompson1 at worldnet.att.net
 
