Transmitting/receiving binary content portably


Clint O

Hi:

I know the FAQ recommends using text where possible, but I've always
been somewhat intrigued by the idea of writing/reading binary data in
a portable way.

Rob Pike published a paper about this and other topics in:

http://plan9.bell-labs.com/sys/doc/comp.html

In there he suggested an implementation that manages to read an
unsigned long assuming that such a type holds 4 bytes and sidesteps
issues of endianness:

ulong getlong(void)
{
    ulong l;

    l = (getchar()&0xFF)<<24;
    l |= (getchar()&0xFF)<<16;
    l |= (getchar()&0xFF)<<8;
    l |= (getchar()&0xFF)<<0;

    return l;
}

It got me thinking how tractable it would be doing things in this way.

So, to check my understanding and see whether I understood the
nuances, I did it myself, this time doing it the way *I* think is the
most intuitive: shifting the bytes right instead of left:

#include <stdio.h>

unsigned long getlong(void)
{
    unsigned long l;

    l = (getchar() & 0xFFUL);
    l |= (getchar() & 0xFFUL) << 8;
    l |= (getchar() & 0xFFUL) << 16;
    l |= (getchar() & 0xFFUL) << 24;

    return l;
}

int main(void)
{
    printf("Received %lu\n", getlong());
    return 0;
}

Likewise I wrote something that serialized an unsigned long on the
transmitting side:

#include <stdio.h>

int main(void)
{
    unsigned long l = 0xdeadbeef;

    /* Emit the low 32 bits, least significant octet first. */
    putchar(l & 0xFFUL);
    putchar((l >> 8) & 0xFFUL);
    putchar((l >> 16) & 0xFFUL);
    putchar((l >> 24) & 0xFFUL);

    fprintf(stderr, "Outputting %lu\n", l);
    /* sizeof yields a size_t, so convert it to match %lu. */
    fprintf(stderr, "Unsigned long is %lu\n",
            (unsigned long) sizeof(unsigned long));

    return 0;
}

On the receiving side, I did run into one caveat. The intermediate
expression of (getchar() & 0xFF) << 24 caused the result to get
extended with 1s presumably because of operand size differences. Note
that on this Linux platform unsigned long is 8 bytes, not 4.

What I found interesting is that he alludes to the fact that you could
use this technique to send structures etc. and issues of padding and
alignment would not be a problem. I assume this is because you'd
transmit every struct member individually using this scheme? I could
imagine a scheme where you took an array of structure offsets and a
pointer to a struct and somehow naively transmitted them using helper
functions.

Also, he does not say how you'd handle floating point data. I assume
there's no way to do that strictly in binary form regardless of
architecture. I also hadn't mentally sorted out how you'd accommodate
differences in sizes of the basic types like unsigned long and
integer.

I'm curious if anyone on here has solved this problem in one way or
another and how they chose to handle it.

Thanks,

-Clint
 
Eric Sosman

Hi:

I know the FAQ recommends using text where possible, but I've always
been somewhat intrigued by the idea of writing/reading binary data in
a portable way.

Rob Pike published a paper about this and other topics in:

http://plan9.bell-labs.com/sys/doc/comp.html

In there he suggested an implementation that manages to read an
unsigned long assuming that such a type holds 4 bytes and sidesteps
issues of endianness:

ulong getlong(void)
{
    ulong l;

I really, really, hate the use of "l" as an identifier.
IMHO, C compilers should issue warning messages whenever they
encounter it.

    int x = 8765432l;

... also gives me the heebie-jeebies. "But tush! I am puling."

    l = (getchar()&0xFF)<<24;
    l |= (getchar()&0xFF)<<16;
    l |= (getchar()&0xFF)<<8;
    l |= (getchar()&0xFF)<<0;

    return l;
}

This requires a >32-bit int, or a 32-bit int that plays
nicely when a 1-bit is shifted into the sign position (and even
then it'll likely misbehave). Since int arithmetic is used for
all the crucial bits, one wonders what help the author thought
"ulong" would bring to the party.

It got me thinking how tractable it would be doing things in this way.

So, to check my understanding and see whether I understood the
nuances, I did it myself, this time doing it the way *I* think is the
most intuitive: shifting the bytes right instead of left:

#include <stdio.h>

unsigned long getlong(void)
{
    unsigned long l;

    l = (getchar() & 0xFFUL);
    l |= (getchar() & 0xFFUL) << 8;
    l |= (getchar() & 0xFFUL) << 16;
    l |= (getchar() & 0xFFUL) << 24;

    return l;
}

This fixes the width issue, and also changes from a
Big-Endian to a Little-Endian "on the wire" format.

int main(void)
{
    printf("Received %lu\n", getlong());
    return 0;
}

Likewise I wrote something that serialized an unsigned long on the
transmitting side:

#include <stdio.h>

int main(void)
{
    unsigned long l = 0xdeadbeef;

    putchar(l & 0xFFUL);
    putchar((l >> 8) & 0xFFUL);
    putchar((l >> 16) & 0xFFUL);
    putchar((l >> 24) & 0xFFUL);

    fprintf(stderr, "Outputting %lu\n", l);
    fprintf(stderr, "Unsigned long is %lu\n",
            (unsigned long) sizeof(unsigned long));

    return 0;
}

The careful "UL" suffixes on the constants are harmless,
but unnecessary.

On the receiving side, I did run into one caveat. The intermediate
expression of (getchar() & 0xFF) << 24 caused the result to get
extended with 1s presumably because of operand size differences. Note
that on this Linux platform unsigned long is 8 bytes, not 4.

That's the consequence (on this platform) of shifting a 1-bit
into an int's sign position. Fix the masks, as in the modified
version, and the problem should go away.
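
As a minimal sketch of why the masks matter (the helper name u32_from_le and the buffer-based interface are assumptions for illustration, not code from the thread): masking with 0xFFUL turns each octet into an unsigned long before the shift, so no arithmetic happens in int and no 1-bit can land in an int's sign position and be extended when the result is widened to a 64-bit unsigned long.

```c
/* Hypothetical helper: assemble four octets, least significant first,
   into an unsigned long. The 0xFFUL mask converts each operand to
   unsigned long BEFORE the shift, so nothing is computed in int. */
unsigned long u32_from_le(const unsigned char *p)
{
    return  (p[0] & 0xFFUL)
         | ((p[1] & 0xFFUL) << 8)
         | ((p[2] & 0xFFUL) << 16)
         | ((p[3] & 0xFFUL) << 24);
}
```

With the four octets EF BE AD DE in a buffer, this returns 0xDEADBEEF on both 32-bit and 64-bit unsigned long, with no stray high bits.
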
What I found interesting is that he alludes to the fact that you could
use this technique to send structures etc. and issues of padding and
alignment would not be a problem. I assume this is because you'd
transmit every struct member individually using this scheme? I could
imagine a scheme where you took an array of structure offsets and a
pointer to a struct and somehow naively transmitted them using helper
functions.

That's the usual approach. A useful technique is to generate
both the struct declarations and the serialize/deserialize functions
from an intermediate "little language" of some kind.
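
A hand-written sketch of the member-by-member approach might look like the following. The struct, its field widths, and the function names are all hypothetical; the point is only that the struct itself never touches the wire, so its padding and alignment cannot matter.

```c
/* Hypothetical record. The compiler may pad it however it likes,
   because only individual members are ever serialized. */
struct sample {
    unsigned long  id;     /* sent as 32 bits */
    unsigned short flags;  /* sent as 16 bits */
};

enum { SAMPLE_WIRE_SIZE = 6 };   /* 4 + 2 octets, no padding */

/* Write each member separately, most significant octet first. */
void sample_pack(unsigned char out[SAMPLE_WIRE_SIZE], const struct sample *s)
{
    out[0] = (unsigned char) ((s->id >> 24) & 0xFF);
    out[1] = (unsigned char) ((s->id >> 16) & 0xFF);
    out[2] = (unsigned char) ((s->id >>  8) & 0xFF);
    out[3] = (unsigned char) ( s->id        & 0xFF);
    out[4] = (unsigned char) ((s->flags >> 8) & 0xFF);
    out[5] = (unsigned char) ( s->flags       & 0xFF);
}

/* Rebuild each member from its octets, using unsigned arithmetic only. */
void sample_unpack(struct sample *s, const unsigned char in[SAMPLE_WIRE_SIZE])
{
    s->id = ((unsigned long) in[0] << 24)
          | ((unsigned long) in[1] << 16)
          | ((unsigned long) in[2] <<  8)
          |  (unsigned long) in[3];
    s->flags = (unsigned short) (((unsigned) in[4] << 8) | in[5]);
}
```

A generator driven by a "little language" would emit pairs of functions like these, plus the struct declaration, from a single description of the record.
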
Also, he does not say how you'd handle floating point data. I assume
there's no way to do that strictly in binary form regardless of
architecture. I also hadn't mentally sorted out how you'd accommodate
differences in sizes of the basic types like unsigned long and
integer.

For integer sizes, you step away from the "I'm sending an int"
point of view and say instead "I'm sending a 32-bit integer,"
the idea being that the platforms on each side will (1) pick a
suitable local type, and (2) not try to send values that the local
type might support but that are out of range for the "wire format."
That is, if the local type is a 36-bit int, you promise not to
try to send anything outside the 32-bit window. (If you haven't
settled on two's complement, note that INT_MIN might be out of
range.)
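
One way to make that promise checkable is to have the encoder refuse values outside the wire range. This is only a sketch; the name put_u32_be and the 0 / -1 return convention are assumptions for illustration.

```c
/* Hypothetical encoder for a 32-bit unsigned wire field, most
   significant octet first. Returns 0 on success, or -1 if the local
   value does not fit in the 32-bit window. */
int put_u32_be(unsigned char out[4], unsigned long v)
{
    /* On a platform where unsigned long is exactly 32 bits this test
       can never be true; on wider platforms it enforces the promise. */
    if (v > 0xFFFFFFFFUL)
        return -1;

    out[0] = (unsigned char) ((v >> 24) & 0xFF);
    out[1] = (unsigned char) ((v >> 16) & 0xFF);
    out[2] = (unsigned char) ((v >>  8) & 0xFF);
    out[3] = (unsigned char) ( v        & 0xFF);
    return 0;
}
```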

For floating-point, see below.

I'm curious if anyone on here has solved this problem in one way or
another and how they chose to handle it.

The FAQ (Question 20.5) mentions XDR, ASN.1, CDF, netCDF,
and HDF. I confess abysmal ignorance about all of these, and
about further developments that may have supplanted or enhanced
them, but it seems a good place to start your research.
 
Peter Nilsson

Clint O said:
Hi:

I know the FAQ recommends using text where possible,
but I've always been somewhat intrigued by the idea
of writing/reading binary data in a portable way.

Rob Pike published a paper about this and other topics in:

http://plan9.bell-labs.com/sys/doc/comp.html ....
It got me thinking how tractable it would be doing things
in this way.

So, to check my understanding and see whether I understood
the nuances, I did it myself, this time doing it the way
*I* think is the most intuitive: shifting the bytes right
instead of left: ....
On the receiving side, I did run into one caveat.  The
intermediate expression of (getchar() & 0xFF) << 24 caused
the result to get extended with 1s presumably because of
operand size differences.

That, and getchar() & 0xFF is an int, not an unsigned long.
An int needn't be even 24 bits wide. Left-shifting signed integer
types, which is what happens here, is not recommended.

Note that on this Linux platform unsigned long is 8 bytes,
not 4.

The point is to write code that works on any platform! ;)
If you're happy to make platform assumptions there are simpler
ways of reading and writing binary files.

What I found interesting is that he alludes to the fact that
you could use this technique to send structures etc. and
issues of padding and alignment would not be a problem.  I
assume this is because you'd transmit every struct member
individually using this scheme?  I could imagine a scheme
where you took an array of structure offsets and a pointer
to a struct and somehow naively transmitted them using
helper functions.

Also, he does not say how you'd handle floating point data.

With text, or use frexp() and print integers.
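
A rough sketch of the frexp() route, under stated assumptions: the struct and function names are hypothetical, the double is assumed to be binary with at most 53 mantissa bits, and only finite values are handled (NaN, infinities and negative zero are not). The result is four integers that can be printed as text or sent with octet-by-octet helpers.

```c
#include <math.h>

/* Hypothetical integer-only wire form of a finite double. */
struct wire_double {
    int           negative;   /* 1 if the value is below zero */
    unsigned long hi;         /* upper 21 bits of the 53-bit mantissa */
    unsigned long lo;         /* lower 32 bits of the mantissa */
    int           exponent;   /* power of two reported by frexp() */
};

void double_to_wire(double x, struct wire_double *w)
{
    int e;
    double mag = (x < 0) ? -x : x;
    double m = frexp(mag, &e);       /* mag == m * 2^e, m in [0.5, 1) or 0 */
    double scaled = ldexp(m, 53);    /* whole number below 2^53 (assumed) */
    unsigned long hi = (unsigned long) (scaled / 4294967296.0);

    w->negative = x < 0;
    w->hi = hi;
    w->lo = (unsigned long) (scaled - (double) hi * 4294967296.0);
    w->exponent = e;
}

double wire_to_double(const struct wire_double *w)
{
    double mant = (double) w->hi * 4294967296.0 + (double) w->lo;
    double mag = ldexp(mant, w->exponent - 53);
    return w->negative ? -mag : mag;
}
```

For example, 1.0 becomes negative 0, hi 0x100000, lo 0, exponent 1, and converts back to exactly 1.0.
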

I assume there's no way to do that strictly in binary form
regardless of architecture.

Someone recently posted code that prints an IEEE 64-bit double,
whatever the native platform.

I also hadn't mentally sorted out how you'd accommodate
differences in sizes of the basic types like unsigned long
and integer.

I'm curious if anyone on here has solved this problem in one
way or another and how they chose to handle it.

Here's a highly stripped down example of how to parse Microsoft
'Docfile' headers. It reads a fixed number of bytes and parses
it into a struct of native type members that may be wider than
the data set they capture.

#include <stdio.h>
#include <stdlib.h>
#include "util.h"

#define pp_comma ,
#define pp_semicolon ;

#define pp_null
#define pp_null_4(a, b, c, d)

typedef unsigned char u8_t;
typedef unsigned short u16_t;
typedef unsigned long u32_t;

typedef u32_t secID_t;

#define little_endian_k 0xFEFF
#define big_endian_k 0xFFFE
#define network_endian_k 0xFFFE

#define pp_df_hdr(M, _) \
M( 0, 8, u8_t, docfile_id, [8], read_u8 ) _ \
M( 8, 16, u8_t, uid, [16], read_u8 ) _ \
M( 24, 2, u16_t, revision, , read_u16 ) _ \
M( 26, 2, u16_t, version, , read_u16 ) _ \
M( 28, 2, u16_t, byte_order, , pp_null_4) _ \
M( 30, 2, u16_t, sector_size, , read_u16 ) _ \
M( 32, 2, u16_t, short_sec_size, , read_u16 ) _ \
M( 34, 10, u8_t, ignore_1, [10], read_u8 ) _ \
M( 44, 4, u32_t, total_secs, , read_u32 ) _ \
M( 48, 4, secID_t, first_sec, , read_u32 ) _ \
M( 52, 4, u8_t, ignore_2, [4], read_u8 ) _ \
M( 56, 4, u32_t, min_std_strm_sz, , read_u32 ) _ \
M( 60, 4, secID_t, first_short_sec, , read_u32 ) _ \
M( 64, 4, u32_t, total_short_secs, , read_u32 ) _ \
M( 68, 4, secID_t, first_msat_sec, , read_u32 ) _ \
M( 72, 4, u32_t, total_msat_secs, , read_u32 ) _ \
M( 76, 436, secID_t, msat, [109], read_u32 )

#define df_hdr_size_k 512

#define pp_as_member(offset_pp, \
size_pp, \
type_pp, \
name_pp, \
array_pp, \
read_pp) \
type_pp name_pp array_pp

#define pp_as_offset_enum(offset_pp, \
size_pp, \
type_pp, \
name_pp, \
array_pp, \
read_pp) \
name_pp ## _offset_k = offset_pp

#define pp_as_read(offset_pp, \
size_pp, \
type_pp, \
name_pp, \
array_pp, \
read_pp) \
read_pp((type_pp *) &s->name_pp, t + offset_pp, size_pp, bo)

struct df_file_hdr
{
    pp_df_hdr(pp_as_member, pp_semicolon);

    size_t ssz;
    size_t sssz;
};

enum offset_enum
{
    pp_df_hdr(pp_as_offset_enum, pp_comma)
};

void *read_u8(u8_t *s, const void *t, size_t n, u16_t bo)
{
    u8_t *p = s;
    const unsigned char *q = t;

    while (n--)
        *p++ = *q++;

    return s;
}

#define u16_byte_mask_k (0x0000FFFFu - 0x0000FFFFu + 0xFF)
#define u32_byte_mask_k (0xFFFFFFFFu - 0xFFFFFFFFu + 0xFF)

void *read_u16(u16_t *s, const void *t, size_t n, u16_t bo)
{
    u16_t *p = s;
    const unsigned char *q = t;

    for (; n; q += 2, n -= 2)
    {
        if (bo == little_endian_k)
        {
            *p++ = ((q[0] & u16_byte_mask_k ) << 0)
                 | ((q[1] & u16_byte_mask_k ) << 8);
        }
        else
        {
            *p++ = ((q[0] & u16_byte_mask_k ) << 8)
                 | ((q[1] & u16_byte_mask_k ) << 0);
        }
    }

    return s;
}

void *read_u32(u32_t *s, const void *t, size_t n, u16_t bo)
{
    u32_t *p = s;
    const unsigned char *q = t;

    for (; n; q += 4, n -= 4)
    {
        if (bo == little_endian_k)
        {
            *p++ = ((q[0] & u32_byte_mask_k ) << 0)
                 | ((q[1] & u32_byte_mask_k ) << 8)
                 | ((q[2] & u32_byte_mask_k ) << 16)
                 | ((q[3] & u32_byte_mask_k ) << 24);
        }
        else
        {
            *p++ = ((q[0] & u32_byte_mask_k ) << 24)
                 | ((q[1] & u32_byte_mask_k ) << 16)
                 | ((q[2] & u32_byte_mask_k ) << 8)
                 | ((q[3] & u32_byte_mask_k ) << 0);
        }
    }

    return s;
}

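void parse_hdr(struct df_file_hdr *s, const unsigned char *t)
{
    /* Read the byte-order mark most significant octet first:
       the octets FE FF give 0xFEFF, i.e. little_endian_k. */
    const unsigned char *bop = t + byte_order_offset_k;
    u16_t bo = ((bop[0] & u16_byte_mask_k) << 8)
             | ((bop[1] & u16_byte_mask_k)     );

    /* The table entry for byte_order uses pp_null_4 and reads nothing,
       so store the value here; main() inspects hdr.byte_order. */
    s->byte_order = bo;

    pp_df_hdr(pp_as_read, pp_semicolon);
}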

int main(int argc, char **argv)
{
    size_t i;
    FILE *fi;
    struct df_file_hdr hdr;
    static unsigned char buf[df_hdr_size_k];

    if (argc != 2) return 0;

    fi = fopen(argv[1], "rb");
    if (!fi) return EXIT_FAILURE;

    if (fread(buf, sizeof(buf), 1, fi) != 1)
        return EXIT_FAILURE;

    parse_hdr(&hdr, buf);

    printf("Docfile ID: ");
    for (i = 0; i < sizeof hdr.docfile_id; i++)
        printf(" %02X", (unsigned) hdr.docfile_id[i]);
    puts("");
    printf("UID: ");
    for (i = 0; i < sizeof hdr.uid; i++)
        printf(" %02X", (unsigned) hdr.uid[i]);
    puts("");
    printf("Revision: %04X\n", (unsigned) hdr.revision);
    printf("Version: %04X\n", (unsigned) hdr.version);
    printf("Byte Order: %s\n",
           hdr.byte_order == network_endian_k ?
               "Network" :
               "Little Endian");
    printf("Sector Size: %u\n", (unsigned) hdr.sector_size);

    return 0;
}

Sample output:

% docfile "New Microsoft Excel Worksheet.xls"
Docfile ID: D0 CF 11 E0 A1 B1 1A E1
UID: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Revision: 003E
Version: 0003
Byte Order: Little Endian
Sector Size: 9

[Aside: actual sector size is 2**9 or 512 bytes.]
 
Malcolm McLean

Hi:

I know the FAQ recommends using text where possible, but I've always
been somewhat intrigued by the idea of writing/reading binary data in
a portable way.

There's a recent thread on writing a floating point value portably.
Integers are easy. You just break them down byte by byte and read/
write the bytes. The only tricky bit is sign-extending negatives where
the machine width is greater than the width of the data.
ASCII is in practice easy, in theory there's a catch. Some machines
don't use ASCII internally, so you need a lookup table to convert to
and from ASCII.
 
santosh

Malcolm McLean said:
There's a recent thread on writing a floating point value portably.
Integers are easy. You just break them down byte by byte and read/
write the bytes. The only tricky bit is sign-extending negatives
where the machine width is greater than the width of the data.
ASCII is in practice easy, in theory there's a catch. Some machines
don't use ASCII internally, so you need a lookup table to convert
to and from ASCII.

The OP was talking about text representation, as defined by standard
C, wasn't he? Why limit yourself to ASCII when C's text
representation is more portable?
 
Phred Phungus

santosh said:
The OP was talking about text representation, as defined by standard
C, wasn't he? Why limit yourself to ASCII when C's text
representation is more portable?

I don't understand why the whole thing isn't unsigned chars.
 
Nick Keighley

I know the FAQ recommends using text where possible, but I've always
been somewhat intrigued by the idea of writing/reading binary data in
a portable way.

have you looked at ASN.1 and XDR?

<snip>
 
Ben Bacarisse

Eric Sosman said:
On 2/22/2010 7:19 PM, Clint O wrote:

This requires a >32-bit int, or a 32-bit int that plays
nicely when a 1-bit is shifted into the sign position (and even
then it'll likely misbehave). Since int arithmetic is used for
all the crucial bits, one wonders what help the author thought
"ulong" would bring to the party.

I think the author (Rob Pike) knew what he was doing but the OP has
misrepresented what this code is for. The paper is about Plan 9
programming and the Plan 9 C compiler that "accept a dialect of
ANSI C". The code is for moving data about within Plan 9 where "the
operating system is fixed and the compiler, headers and libraries are
constant so most of the stumbling blocks to portability are removed".

The paper has almost nothing to say about portability outside of Plan
9 except a few remarks like: "to port programs beyond Plan 9 ... it is
probably necessary to use pcc and hope that the target machine
supports ANSI C and POSIX". pcc is Plan 9's ANSI+POSIX C compiler.

<snip>
 
Clint O

I think the author (Rob Pike) knew what he was doing but the OP has
misrepresented what this code is for.  The paper is about Plan 9
programming and the Plan 9 C compiler that "accept a dialect of
ANSI C".  The code is for moving data about within Plan 9 where "the
operating system is fixed and the compiler, headers and libraries are
constant so most of the stumbling blocks to portability are removed".


I didn't misrepresent anything. I posted the link specifically so
that people could go and read the background context for his code so
they knew why he did what he did. It was I who then posed the
question about the viability of using this mechanism for transporting
data in a binary fashion in a more general sense.

Plan 9 runs/ran on many different machine types with different
endianness and alignment requirements, so his usage is more than
relevant to my question.

-Clint
 
Phred Phungus

Clint said:
I think the author (Rob Pike) knew what he was doing but the OP has
misrepresented what this code is for. The paper is about Plan 9
programming and the Plan 9 C compiler that "accept a dialect of
ANSI C". The code is for moving data about within Plan 9 where "the
operating system is fixed and the compiler, headers and libraries are
constant so most of the stumbling blocks to portability are removed".


I didn't misrepresent anything. I posted the link specifically so
that people could go and read the background context for his code so
they knew why he did what he did. It was I who then posed the
question about the viability of using this mechanism for transporting
data in a binary fashion in a more general sense.

Plan 9 runs/ran on many different machine types with different
endianness and alignment requirements, so his usage is more than
relevant to my question.

-Clint


This stuff looks really interesting, Clint. I wish I had more time to
spend with it now.

I think many of us in this forum have completely different notions of
what "portability" entails. I've got a dual boot windows/*nix machine,
and my gcc capability on the windows side is always gummed up. Maybe
pcc could help.
 
Clint O

There's a recent thread on writing a floating point value portably.

Ok, I'll see if I can find it.

Integers are easy. You just break them down byte by byte and read/
write the bytes. The only tricky bit is sign-extending negatives where
the machine width is greater than the width of the data.

Ahh yes, I hadn't considered that. In fact I tried to send (-1) over
and I received a properly extended 32-bit value, but I'm not sure how
I would recover that on the receiving side.

% ./putlong | ./getlong
Outputting 18446744073709551615
Received 4294967295

Thanks,

-Clint
 
Ben Bacarisse

Clint O said:
I think the author (Rob Pike) knew what he was doing but the OP has
misrepresented what this code is for.  The paper is about Plan 9
programming and the Plan 9 C compiler that "accept a dialect of
ANSI C".  The code is for moving data about within Plan 9 where "the
operating system is fixed and the compiler, headers and libraries are
constant so most of the stumbling blocks to portability are removed".


I didn't misrepresent anything. I posted the link specifically so
that people could go and read the background context for his code so
they knew why he did what he did. It was I who then posed the
question about the viability of using this mechanism for transporting
data in a binary fashion in a more general sense.


That's true and I am sorry that my remark was so definite. The
appearance of misrepresenting the code comes simply from posting it in
comp.lang.c where portability usually means between C
implementations. Plan 9 has a curious definition of "portable C"
that is not the most common one here.

Eric would not have been surprised had you flagged the code as not
portable in the c.l.c sense but only in the Plan 9 sense. Not
everyone has the time to read all the references!

My apologies for misrepresenting *you*.
 
Ben Bacarisse

Clint O said:
Ok, I'll see if I can find it.


Ahh yes, I hadn't considered that. In fact I tried to send (-1) over
and I received a properly extended 32-bit value, but I'm not sure how
I would recover that on the receiving side.

% ./putlong | ./getlong
Outputting 18446744073709551615
Received 4294967295

It might help to take a step back. If the "wire" format you define is
simply 32 bits and 2's complement (with a specified order of
significance for the consecutive octets) then you can't distinguish
between -1 and 4294967295. The receiver just can't know what it
should do when it sees 32 bits all set to 1.

There has to be some more data somewhere. It can be assumed ("all
data is unsigned") or it can be explicit, say by also sending a
signedness and width code. This has the advantage of making the data
stream closer to being self-describing.
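
As a sketch of the explicit option, here is one possible tagged encoding. The tag layout (top bit for signedness, low four bits for the width in octets) and the function names are invented for illustration; widths of 1 to 4 octets are assumed.

```c
#include <stddef.h>

/* Hypothetical tag layout: bit 7 set means "signed", and the low
   4 bits give the payload width in octets (1..4 assumed here). */
#define TAG_SIGNED 0x80u
#define TAG_WIDTH  0x0Fu

/* Write the tag, then the payload most significant octet first.
   Returns the number of octets written. */
size_t put_tagged(unsigned char *out, int is_signed, unsigned width,
                  unsigned long bits)
{
    size_t i;

    out[0] = (unsigned char) ((is_signed ? TAG_SIGNED : 0u)
                              | (width & TAG_WIDTH));
    for (i = 0; i < width; i++)
        out[1 + i] = (unsigned char) ((bits >> (8 * (width - 1 - i))) & 0xFF);
    return 1 + width;
}

/* Read the tag and payload back. Returns the number of octets consumed.
   The receiver learns from *is_signed how to interpret *bits. */
size_t get_tagged(const unsigned char *in, int *is_signed, unsigned *width,
                  unsigned long *bits)
{
    size_t i;

    *is_signed = (in[0] & TAG_SIGNED) != 0;
    *width = in[0] & TAG_WIDTH;
    *bits = 0;
    for (i = 0; i < *width; i++)
        *bits = (*bits << 8) | in[1 + i];
    return 1 + *width;
}
```

With this layout, 32 one-bits sent with tag 0x84 are a signed field and with tag 0x04 an unsigned one, so the receiver no longer has to guess.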

Once you pin that down, you will hit the issue that Malcolm is
describing. In general, if either end does not have a type whose
properties match the wire type, your task gets considerably harder.
Can you rule out 1's complement and sign-magnitude machines? Can you
rule out machines that don't have 8-bit bytes and 16, 32 and 64 bit
integer types (with no padding bits)?
 
Thad Smith

There are many good ways to do this. To me, it comes down to an adequate
interface definition.

have you looked at ASN.1 and XDR?

Yes, I agree that ASN.1 is a good place to start. I particularly like that
ASN.1 has a way to extend an existing specification in a compatible way.
 
Keith Thompson

Ben Bacarisse said:
It might help to take a step back. If the "wire" format you define is
simply 32 bits and 2's complement (with a specified order of
significance for the consecutive octets) then you can't distinguish
between -1 and 4294967295. The receiver just can't know what it
should do when it sees 32 bits all set to 1.
[...]

To take another step back, if the defined wire format is 2's
complement, then it's a signed representation, and all-bits-1
unambiguously means -1, not 4294967295.
 
Ben Bacarisse

Keith Thompson said:
Ben Bacarisse said:
It might help to take a step back. If the "wire" format you define is
simply 32 bits and 2's complement (with a specified order of
significance for the consecutive octets) then you can't distinguish
between -1 and 4294967295. The receiver just can't know what it
should do when it sees 32 bits all set to 1.
[...]

To take another step back, if the defined wire format is 2's
complement, then it's a signed representation, and all-bits-1
unambiguously means -1, not 4294967295.

OK, but I meant 2's complement *when* it's signed. I describe my
machine as 2's complement even though that has no relevance to many of
the values it stores. Sloppy, I know.
 
Flash Gordon

Malcolm said:
There's a recent thread on writing a floating point value portably.
Integers are easy. You just break them down byte by byte and read/
write the bytes. The only tricky bit is sign-extending negatives where
the machine width is greater than the width of the data.

Negatives are more problematic than that, since you have to allow for 1s
complement and sign-magnitude representation as well. The encoding is
easy though, just assign them to an unsigned integer of the appropriate
type and *then* extract octets (a byte might be more than 8 bits, and
yes, people DO interface DSPs to other processors in ways where this
matters). Decoding is more work, but possible once you know what you are
doing.
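
A sketch of both directions, assuming a 32-bit two's complement wire field sent most significant octet first (the function names are hypothetical). Encoding relies on the fact that converting a signed value to unsigned long is defined as reduction modulo ULONG_MAX + 1, whatever the native sign representation. Decoding rebuilds the negative value arithmetically instead of relying on an implementation-defined unsigned-to-signed conversion.

```c
/* Hypothetical encoder: the caller promises v lies in [-2^31, 2^31 - 1]. */
void put_s32_be(unsigned char out[4], long v)
{
    /* Well-defined wrap-around: the low 32 bits of u are the
       two's complement pattern for v on any conforming platform. */
    unsigned long u = (unsigned long) v;

    out[0] = (unsigned char) ((u >> 24) & 0xFF);
    out[1] = (unsigned char) ((u >> 16) & 0xFF);
    out[2] = (unsigned char) ((u >>  8) & 0xFF);
    out[3] = (unsigned char) ( u        & 0xFF);
}

/* Hypothetical decoder for the same field. */
long get_s32_be(const unsigned char in[4])
{
    unsigned long u = ((in[0] & 0xFFUL) << 24)
                    | ((in[1] & 0xFFUL) << 16)
                    | ((in[2] & 0xFFUL) <<  8)
                    |  (in[3] & 0xFFUL);

    if (u <= 0x7FFFFFFFUL)
        return (long) u;              /* non-negative: fits directly */

    /* The pattern stands for u - 2^32. Compute that as -(2^32 - 1 - u) - 1
       so no intermediate value overflows long. The single value -2^31 is
       only recoverable if the local long can represent it. */
    return -(long) (0xFFFFFFFFUL - u) - 1;
}
```

With these, sending -1 puts FF FF FF FF on the wire and the receiver gets -1 back, even where long is 64 bits wide.
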

ASCII is in practice easy, in theory there's a catch. Some machines
don't use ASCII internally, so you need a lookup table to convert to
and from ASCII.

Far better normally to use native character encoding and let the file
transfer programs deal with converting the text files as you copy them
on/off the machine.
 
