Binary-mode I/O, width of char, endianness

T Koster

Hi group,

I'm having some difficulty figuring out the most portable way to read 24
bits from a file. This is related to a Base-64 encoding.

The file is opened in binary mode, and I'm using fread to read three
bytes from it. The question, though, is where fread should put this. I
have considered two alternatives, but neither seems like a good idea:

In most cases, the width of a char is 8 bits, so an array of 3 chars
would suffice, but the width of a char is guaranteed to be only *at
least* 8 bits, so the actual number of chars required would be 24 /
CHAR_BIT, rounded up. Since you can't round in a constant integral
expression, 3 chars is a good safe buffer size because it's guaranteed
to be at least 24 bits. However, since I need to be able to divide
those 24 bits into four 6-bit numbers, indices into the char array
become more complicated as the 6-bit numbers do not fall evenly on the
(presumably) 8-bit boundaries that indices into the array would give me.
If the width of a char is not 8 bits, then knowing which indices to look
at and shift/mask is even more difficult. As such, I thought of the
second option.

The second option is to allocate the input buffer as simply one int
object that is guaranteed to be at least 24 bits wide: the long int,
which even has 8 bytes to spare. fread can safely write 3 bytes of data
into a long int. My only worry is that because a long int is a
multi-byte integer, accessing various parts of it may be dangerous due
to endianness considerations. Or is endianness only relevant to the
represented *value* of the multi-byte integer as a whole? fread doesn't
care about that: it writes three bytes into the address of the long int,
starting at the lowest-positioned byte, but would the shifting/masking
be portable? For example a multi-byte integer constant 0x1234 has a
most-significant byte of value 0x12, but on a big-endian machine would
be stored on the *lowest* memory address of the space it takes up. As
such, the mask required to leave only the *lowest* 6 bits of a 32-bit
integer could be either 0x3F000000 or 0x0000003F depending on
endianness, right? Or are hexadecimal integer constants always stored
as-is? That is, the lowest byte is positioned last in an integer
constant instead of the least significant byte positioned last? This
seems counter-intuitive.

If neither of these options is good, is there another way?

Thanks in advance,
Thomas
 
infobahn

T said:
Hi group,

I'm having some difficulty figuring out the most portable way to read 24
bits from a file. This is related to a Base-64 encoding.

The file is opened in binary mode, and I'm using fread to read three
bytes from it. The question, though, is where fread should put this. I
have considered two alternatives, but neither seems like a good idea:

In most cases, the width of a char is 8 bits, so an array of 3 chars
would suffice, but the width of a char is guaranteed to be only *at
least* 8 bits, so the actual number of chars required would be 24 /
CHAR_BIT, rounded up. Since you can't round in a constant integral
expression, 3 chars is a good safe buffer size because it's guaranteed
to be at least 24 bits.

To store BITS bits, you need at least (BITS + CHAR_BIT - 1) / CHAR_BIT
bytes. If BITS is constant:

#define BITS 24

then:

unsigned char buf[(BITS + CHAR_BIT - 1) / CHAR_BIT] = {0};

is legal.
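(With CHAR_BIT == 8 or 9, that works out to 3 bytes; with CHAR_BIT == 16,
2 bytes; and with CHAR_BIT == 24 or more, a single byte.)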
However, since I need to be able to divide
those 24 bits into four 6-bit numbers, indices into the char array
become more complicated as the 6-bit numbers do not fall evenly on the
(presumably) 8-bit boundaries that indices into the array would give me.

So you need to mask and shift. If we assume that each octet of data
is stored in a separate byte, then this isn't as hard as it sounds.

/* 1. get bits 7 through 2 of first octet */
num[0] = (buf[0] & 0xFC) >> 2;
/* 2. get bits 1 and 0 of first octet, and bits 7 through 4 of
second octet */
num[1] = ((buf[0] & 0x03) << 4) | ((buf[1] & 0xF0) >> 4);

etc.
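Spelled out, the remaining two (using the same buf and num arrays) would
be:

/* 3. get bits 3 through 0 of second octet, and bits 7 and 6 of
third octet */
num[2] = ((buf[1] & 0x0F) << 2) | ((buf[2] & 0xC0) >> 6);
/* 4. get bits 5 through 0 of third octet */
num[3] = buf[2] & 0x3F;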
If the width of a char is not 8 bits, then knowing which indices to look
at and shift/mask is even more difficult.

See above if they're spread out, with 8 value bits to each byte
(the remaining bits being unused). If they're packed in, you just
have to be a little clever with CHAR_BIT. Once you start to analyse
this problem, you'll see that it isn't as hard as it sounds.
As such, I thought of the
second option.

The second option is to allocate the input buffer as simply one int
object that is guaranteed to be at least 24 bits wide: the long int,
which even has 8 bytes to spare.

Well, at least 8 *bits* to spare. :)
fread can safely write 3 bytes of data
into a long int.

Not necessarily. On platforms such as the kind you are worrying about
(CHAR_BIT > 8), long int may well be fewer than four bytes wide!

Consider a platform with 11-bit bytes. On such a platform, long ints
may only occupy 3 bytes. On (perhaps more common) platforms with
16-bit or 32-bit bytes, long int may be only 2 bytes, or even 1 byte.

I would stick to unsigned char for this project. Long ints will
multiply your headaches, divide your attention, add to your
worries, and subtract from your understanding (modulo their
day-to-day uses, obviously).
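As for your endianness worry: the bitwise operators work on *values*,
not on stored byte order, so a mask like 0x3F always selects the six
least-significant value bits, whatever the machine's layout. A minimal
sketch (assuming nothing beyond ISO C):

#include <stdio.h>

int main(void)
{
    unsigned long x = 0x1234;
    /* & operates on the value of x, not on its byte layout, so this
       prints 0x34 on big- and little-endian machines alike */
    printf("0x%lX\n", x & 0x3FUL);
    return 0;
}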
 
T Koster

infobahn said:
T said:
I'm having some difficulty figuring out the most portable way to read 24
bits from a file. This is related to a Base-64 encoding.

The file is opened in binary mode, and I'm using fread to read three
bytes from it. The question, though, is where fread should put this. I
have considered two alternatives, but neither seems like a good idea:

In most cases, the width of a char is 8 bits, so an array of 3 chars
would suffice, but the width of a char is guaranteed to be only *at
least* 8 bits, so the actual number of chars required would be 24 /
CHAR_BIT, rounded up. Since you can't round in a constant integral
expression, 3 chars is a good safe buffer size because it's guaranteed
to be at least 24 bits.

To store BITS bits, you need at least (BITS + CHAR_BIT - 1) / CHAR_BIT
bytes. If BITS is constant:

#define BITS 24

then:

unsigned char buf[(BITS + CHAR_BIT - 1) / CHAR_BIT] = {0};

is legal.

Ahh, good idea.
However, since I need to be able to divide
those 24 bits into four 6-bit numbers, indices into the char array
become more complicated as the 6-bit numbers do not fall evenly on the
(presumably) 8-bit boundaries that indices into the array would give me.

So you need to mask and shift. If we assume that each octet of data
is stored in a separate byte, then this isn't as hard as it sounds.

/* 1. get bits 7 through 2 of first octet */
num[0] = (buf[0] & 0xFC) >> 2;
/* 2. get bits 1 and 0 of first octet, and bits 7 through 4 of
second octet */
num[1] = ((buf[0] & 0x03) << 4) | ((buf[1] & 0xF0) >> 4);

etc.
If the width of a char is not 8 bits, then knowing which indices to look
at and shift/mask is even more difficult.

See above if they're spread out, with 8 value bits to each byte
(the remaining bits being unused). If they're packed in, you just
have to be a little clever with CHAR_BIT. Once you start to analyse
this problem, you'll see that it isn't as hard as it sounds.

We seem to be using the term 'byte' with different meanings...see below.
Well, at least 8 *bits* to spare. :)

Certainly :)
Not necessarily. On platforms such as the kind you are worrying about
(CHAR_BIT > 8), long int may well be fewer than four bytes wide!

Consider a platform with 11-bit bytes. On such a platform, long ints
may only occupy 3 bytes. On (perhaps more common) platforms with
16-bit or 32-bit bytes, long int may be only 2 bytes, or even 1 byte.

Hmmm, this appears to be becoming a question of terminology. I thought
that by definition, one byte is eight bits wide. I'm not using the C
type 'char' interchangeably with 'an int that is one _byte_ big'. When I
consider that CHAR_BIT may be greater than 8, I mean exactly that, and
not that a byte of storage on this platform has more than eight bits,
since I thought that was nonsense. That is, a char may occupy more than
one byte of storage, but a byte is still an 8-bit byte. Calling fread
and asking for three bytes implies that 24 bits will be read,
irrespective of platform, correct? As such, a long int, being
guaranteed to have at least 32 bits, is guaranteed to occupy at least
four bytes of storage, which is why I say that fread can safely store
three bytes (24 bits by definition) in a long int. Correct me if I'm
wrong here.
I would stick to unsigned char for this project. Long ints will
multiply your headaches, divide your attention, add to your
worries, and subtract from your understanding (modulo their
day-to-day uses, obviously).

Thanks,
Thomas.
 
infobahn

T said:
Hmmm, this appears to be becoming a question of terminology. I thought
that by definition, one byte is eight bits wide.

Not in comp.lang.c, it isn't, because ISO C recognises that it
simply isn't true on some platforms.
I'm not using the C
type 'char' interchangeably with 'an int that is one _byte_ big'. When I
consider that CHAR_BIT may be greater than 8, I mean exactly that, and
not that a byte of storage on this platform has more than eight bits,
since I thought that was nonsense.

Many people think that, and many people are wrong. If CHAR_BIT is
greater than 8, it is because bytes are greater than 8 bits wide
for that implementation.
That is, a char may occupy more than
one byte of storage, but a byte is still an 8-bit byte.

In C, by definition, a char is exactly one byte in size.
sizeof(char) always yields 1 as its value. But chars can
be wider than 8 bits. If they are, then so are bytes.
Calling fread
and asking for three bytes implies that 24 bits will be read,
irrespective of platform, correct?

Nope. Consider a typical modern DSP. You might get anything from
48 to 96 bits! (Maybe even more, nowadays.)
As such, a long int, being
guaranteed to have at least 32 bits, is guaranteed to occupy at least
four bytes of storage, which is why I say that fread can safely store
three bytes (24 bits by definition) in a long int. Correct me if I'm
wrong here.

You can certainly guarantee to get 24 bits into a long int, yes.
But you might not need three bytes to do it in, and if you read
three bytes you might end up with more than you can chew.

(Pun definitely intended.)
 
Lew Pitcher


T said:
infobahn wrote: [snip]
Consider a platform with 11-bit bytes. On such a platform, long ints
may only occupy 3 bytes. On (perhaps more common) platforms with
16-bit or 32-bit bytes, long int may be only 2 bytes, or even 1 byte.


Hmmm, this appears to be becoming a question of terminology.

Yes. This is one of those differences in terms that catch a lot of people.
I thought
that by definition, one byte is eight bits wide.

Not in the C language, although your observation holds true in many
other areas of computing.

Here, a byte is as wide as a byte is wide, and that is 8 bits /or more/.
I'm not using the C
type 'char' interchangeably with 'an int that is one _byte_ big'.

A C char data type is, by definition, one byte big. A byte, however, has
no defined size, other than a minimum size of 8 bits. A byte (and hence
a char) could be 9 bits wide, or 32 bits wide, or even 64 bits wide,
depending on the implementation.
When I
consider that CHAR_BIT may be greater than 8, I mean exactly that,

As do we
and
not that a byte of storage on this platform has more than eight bits,

There may be a difference between the machine's minimum unit of
addressable memory and the minimum unit of addressable memory that the C
'virtual machine' recognizes. Here, in comp.lang.c, the C 'virtual
machine' interpretation wins, and a byte is CHAR_BIT bits long, no
matter how the physical machine implements its load and store
instructions. Think of it this way: in C, you can't get a smaller
unit of memory than a char, so a char /is/ the minimum unit of
addressable memory.
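A two-line illustration of that relationship (the output, of course,
varies by implementation):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* sizeof counts C bytes; each C byte is CHAR_BIT bits wide */
    printf("char: %d bits; long: %u bits\n",
           CHAR_BIT, (unsigned)(sizeof(long) * CHAR_BIT));
    return 0;
}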
since I thought that was nonsense.

I'll let this pass :)
That is, a char may occupy more than
one byte of storage, but a byte is still an 8-bit byte.

Difference in terms. A byte is CHAR_BIT bits long, and that can be more
than 8 bits.
Calling fread
and asking for three bytes implies that 24 bits will be read,
irrespective of platform, correct?

Nope. Calling fread(,3,1,) (one element, 3 char in length) will read one
block of 3 C char elements. If each C char element is 24 bits long
(CHAR_BIT == 24), then you get 72 bits worth of data in one fread().
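In code terms, the size argument of fread() is measured in C bytes, not
octets; a small sketch of the call described above (stdin stands in for
the real binary-mode stream):

#include <stdio.h>

int main(void)
{
    unsigned char buf[3];
    /* asks for one element of 3 * CHAR_BIT bits, however wide a
       byte happens to be on this implementation */
    size_t got = fread(buf, sizeof buf, 1, stdin);
    return got == 1 ? 0 : 1;
}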
As such, a long int, being
guaranteed to have at least 32 bits, is guaranteed to occupy at least
four bytes of storage,

No, it is guaranteed to occupy at least CHAR_BIT * sizeof(long int) bits of
storage, or sizeof(long int) bytes of storage.
which is why I say that fread can safely store
three bytes (24 bits by definition) in a long int.
Correct me if I'm wrong here.

See above.


--

Lew Pitcher, IT Specialist, Enterprise Data Systems
Enterprise Technology Solutions, TD Bank Financial Group

(Opinions expressed here are my own, not my employer's)
 
Keith Thompson

infobahn said:
T Koster wrote: [...]
Hmmm, this appears to be becoming a question of terminology. I thought
that by definition, one byte is eight bits wide.

Not in comp.lang.c, it isn't, because ISO C recognises that it
simply isn't true on some platforms.
I'm not using the C
type 'char' interchangeably with 'an int that is one _byte_ big'. When I
consider that CHAR_BIT may be greater than 8, I mean exactly that, and
not that a byte of storage on this platform has more than eight bits,
since I thought that was nonsense.

Many people think that, and many people are wrong. If CHAR_BIT is
greater than 8, it is because bytes are greater than 8 bits wide
for that implementation.

That's not quite the way I'd put it. The fact that a "byte" is
defined to be CHAR_BIT bits isn't necessarily something imposed by the
underlying platform; it's a choice that can be made by the author of
the C implementation. It's about the definition of the word "byte",
not necessarily about any inherent attribute of the platform.

For example, I've worked on systems where the natural fundamental unit
of storage is 64 bits, but the C implementation defines CHAR_BIT==8.
It has to jump through some hoops to make this work; the advantage is
compatibility with other systems.

The C standard could have defined the term "byte" as a unit of storage
consisting of exactly 8 bits, but it didn't.

The end result is the same, though. Out in the real world a "byte" is
almost always 8 bits ("octet" is an unambiguous term for this). In
comp.lang.c, a "byte" is always exactly CHAR_BIT bits, and maximally
portable code must not assume that CHAR_BIT==8.
 
infobahn

Keith said:
That's not quite the way I'd put it. The fact that a "byte" is
defined to be CHAR_BIT bits isn't necessarily something imposed by the
underlying platform; it's a choice that can be made by the author of
the C implementation.

That's why I said "implementation", not "platform".
It's about the definition of the word "byte",
not necessarily about any inherent attribute of the platform.

That's why I said "implementation", not "platform".
For example, I've worked on systems where the natural fundamental unit
of storage is 64 bits, but the C implementation defines CHAR_BIT==8.
It has to jump through some hoops to make this work; the advantage is
compatibility with other systems.

Cray, by any chance?
 
T Koster

infobahn said:
Not in comp.lang.c, it isn't, because ISO C recognises that it
simply isn't true on some platforms.


Many people think that, and many people are wrong. If CHAR_BIT is
greater than 8, it is because bytes are greater than 8 bits wide
for that implementation.


In C, by definition, a char is exactly one byte in size.
sizeof(char) always yields 1 as its value. But chars can
be wider than 8 bits. If they are, then so are bytes.


Nope. Consider a typical modern DSP. You might get anything from
48 to 96 bits! (Maybe even more, nowadays.)


You can certainly guarantee to get 24 bits into a long int, yes.
But you might not need three bytes to do it in, and if you read
three bytes you might end up with more than you can chew.

(Pun definitely intended.)

Okay okay, so it is impossible to portably tell fread to read exactly 24
bits of data, since you can only ask it to give you an integer number of
bytes, however big they may be on the particular implementation.

As such, should I allocate a larger buffer, of a fixed number of bytes
(regardless of how wide a byte is), and get fread to read in that fixed
number of bytes, and then do the shifting/masking as appropriate
depending on CHAR_BIT, reading more data as needed?

Also, will I need to generate the masks as expressions in terms of
CHAR_BIT, rather than a hexadecimal integer constant? The mask 0xFC
(for the first six bits) defines only 8 bits worth of a char, which we
know can be wider than 8 bits.
Would something like
x[0] = (buf[0] >> (CHAR_BIT - 6)) & 63;
be more appropriate? What about the next six bits? With an 8-bit char,
four bits will have to come from buf[1], but with a 12-bit-or-more char
buf[0] has all we need. In general, I guess since a char is at least
eight bits wide, we only need to examine at most two chars. I'm just
trying to figure out some sort of expression for x[n] (any nth 6-bit
number contained in the 'stream' of chars). I'll have to think about
this one.

Thanks,
Thomas
 
Walter Roberson

:Okay okay, so it is impossible to portably tell fread to read exactly 24
:bits of data, since you can only ask it to give you an integer number of
:bytes, however big they may be on the particular implementation.

:As such, should I allocate a larger buffer, of a fixed number of bytes
:(regardless of how wide a byte is), and get fread to read in that fixed
:number of bytes, and then do the shifting/masking as appropriate
:depending on CHAR_BIT, reading more data as needed?

You could find the lowest common multiple of 24 and CHAR_BIT
(e.g., 72 for CHAR_BIT==9, 120 for CHAR_BIT==10, 264 for CHAR_BIT==11,
24 for CHAR_BIT==12). Divide that by CHAR_BIT and you get the number
of bytes you need to read at a time in order to be able to unpack
on exact boundaries.
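Since lcm(24, CHAR_BIT) / CHAR_BIT == 24 / gcd(24, CHAR_BIT), that block
size is easy to compute; a sketch (names here are illustrative):

#include <stdio.h>
#include <limits.h>

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    /* lcm(24, CHAR_BIT) / CHAR_BIT == 24 / gcd(24, CHAR_BIT) bytes */
    printf("read %u bytes per block\n", 24 / gcd(24, CHAR_BIT));
    return 0;
}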


:I'm just
:trying to figure out some sort of expression for x[n] (any nth 6-bit
:number contained in the 'stream' of chars).

if ( (n * 6) % CHAR_BIT + 6 > CHAR_BIT ) then you need to span bytes.
 
Keith Thompson

T Koster said:
Okay okay, so it is impossible to portably tell fread to read exactly
24 bits of data, since you can only ask it to give you an integer
number of bytes, however big they may be on the particular
implementation.

As such, should I allocate a larger buffer, of a fixed number of bytes
(regardless of how wide a byte is), and get fread to read in that
fixed number of bytes, and then do the shifting/masking as appropriate
depending on CHAR_BIT, reading more data as needed?
[...]

I wonder how much effort this is worth. I don't know of any hosted C
implementations with CHAR_BIT != 8 (and freestanding implementations
don't necessarily support fread).

If you want absolute portability, you need to write your code so it
works on systems with CHAR_BIT != 8. But then it's not clear what a
data file would look like after being copied to such a system.
Whatever mechanism is used to copy the file will need to deal with the
difference in byte sizes, and there are a number of ways it could do
so. If CHAR_BIT % 8 == 0, it could pack multiple octets into each
byte. If not, it could either treat the file as a stream of bits
(with some octets crossing byte boundaries), or it could just store
each octet in a byte and zero out the high-order bits.

It might be sufficient to put the following code in an application
header file:

#include <limits.h>
#if CHAR_BIT != 8
#error This application is supported only on systems with CHAR_BIT==8
#endif

On the other hand, if this is intended as an exercise in theoretical
manic portability, have fun!
 
T Koster

Walter said:
You could find the lowest common multiple of 24 and CHAR_BIT
(e.g., 72 for CHAR_BIT==9, 120 for CHAR_BIT==10, 264 for CHAR_BIT==11,
24 for CHAR_BIT==12). Divide that by CHAR_BIT and you get the number
of bytes you need to read at a time in order to be able to unpack
on exact boundaries.

What do you mean by unpacking on exact boundaries? That there will be no
partial 6-bit numbers left on the end of the buffer? How about making
the buffer lcm(6,CHAR_BIT)/CHAR_BIT chars (lcm(6,CHAR_BIT) bits) big and
reading that many instead? There will always be a whole number of 6-bit
numbers in there.
I'm just
trying to figure out some sort of expression for x[n] (any nth 6-bit
number contained in the 'stream' of chars).

if ( (n * 6) % CHAR_BIT + 6 > CHAR_BIT ) then you need to span bytes.

Here is what I have so far:
The index i of the nth 6-bit number in an array of CHAR_BIT-bit chars is:
i = (n * 6) / CHAR_BIT
(with integral division of course)
For example, with 8-bit chars:
00000011 11112222 22333333 44444455 55556666 ... etc.
the n=0th 6-bit number starts in the 0*6/8 = 0th char
the n=1th 6-bit number starts in the 1*6/8 = 0th char
the n=2th 6-bit number starts in the 2*6/8 = 1th char
n=3 -> i=2
n=4 -> i=3
n=5 -> i=3
With 9-bit chars:
000000111 111222222 333333444 444555555 ... etc.
n=0 -> i=0*6/9 = 0
n=1 -> i=1*6/9 = 0
n=2 -> i=2*6/9 = 1
n=3 -> i=3*6/9 = 2
n=4 -> i=4*6/9 = 2
n=5 -> i=5*6/9 = 3

So now I know which char to start looking at, given any char width;
I have yet to find a suitable shift/mask expression that generalises to
all legal values of CHAR_BIT.

Thomas
 
T Koster

Keith said:
T Koster said:
Okay okay, so it is impossible to portably tell fread to read exactly
24 bits of data, since you can only ask it to give you an integer
number of bytes, however big they may be on the particular
implementation.

As such, should I allocate a larger buffer, of a fixed number of bytes
(regardless of how wide a byte is), and get fread to read in that
fixed number of bytes, and then do the shifting/masking as appropriate
depending on CHAR_BIT, reading more data as needed?
[...]

I wonder how much effort this is worth. I don't know of any hosted C
implementations with CHAR_BIT != 8 (and freestanding implementations
don't necessarily support fread).

Me neither.
If you want absolute portability, you need to write your code so it
works on systems with CHAR_BIT != 8. But then it's not clear what a
data file would look like after being copied to such a system.
Whatever mechanism is used to copy the file will need to deal with the
difference in byte sizes, and there are a number of ways it could do
so. If CHAR_BIT % 8 == 0, it could pack multiple octets into each
byte. If not, it could either treat the file as a stream of bits
(with some octets crossing byte boundaries), or it could just store
each octet in a byte and zero out the high-order bits.

It might be sufficient to put the following code in an application
header file:

#include <limits.h>
#if CHAR_BIT != 8
#error This application is supported only on systems with CHAR_BIT==8
#endif

I have definitely given this option serious consideration :) but...
On the other hand, if this is intended as an exercise in theoretical
manic portability, have fun!

I originally set out to make a command-line, ISO C, Base-64
encoder/decoder, since all the open-source ones I found were not to my
liking, problems ranging from using platform-specific libraries to just
total spag bol.

It worked fine, but I got to thinking about maximising its portability
even further by not taking CHAR_BIT==8 for granted. This is turning out
to be a nightmare, but now I *have* to prove to myself that it is
possible in ISO C :).

Thomas
 
Walter Roberson

:What do you mean by unpacking on exact boundaries? That there will be no
:partial 6-bit numbers left on the end of the buffer?

lcm(24,CHAR_BIT) bits will be convertible into groups of 4 output
characters of 6 bits each, with the only left-overs being the trailing
case at eof.

: How about making
:the buffer lcm(6,CHAR_BIT)/CHAR_BIT chars (lcm(6,CHAR_BIT) bits) big and
:reading that many instead? There will always be a whole number of 6-bit
:numbers in there.

True... and that is really what you are after, since the 4-character
output is really just an artifact of 8-bit bytes.

:So now I know which char to start looking at, given any char width;
:I have yet to find a suitable shift/mask expression that generalises to
:all legal values of CHAR_BIT.

As you noted, i = (int) (n * 6 / CHAR_BIT) gives you the starting character.
j = n * 6 - i * CHAR_BIT [or, more simply, (n*6) % CHAR_BIT] then gives
you the bit index of the starting bit relative to the left-most (most
significant) bit.

If (k = (CHAR_BIT - j)) >= 6, then you have a full 6 bits in this byte,
and the mask is 63 << (k - 6). Otherwise you only have k bits of it:
the mask for this byte is m1 = (63 >> (6-k)), and the mask for the
next byte is (63^m1) << (CHAR_BIT-6).
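Putting those formulas together, a sketch of the whole extraction (the
function name and signature are illustrative, not from the thread):

#include <stdio.h>
#include <limits.h>

/* value of the nth 6-bit group in a buffer packed MSB-first */
static unsigned sextet(const unsigned char *buf, unsigned n)
{
    unsigned i = n * 6 / CHAR_BIT;  /* starting byte */
    unsigned j = n * 6 % CHAR_BIT;  /* bit offset from the MSB */
    unsigned k = CHAR_BIT - j;      /* bits available in this byte */

    if (k >= 6)                     /* whole group fits in one byte */
        return (buf[i] >> (k - 6)) & 63;
    /* top k bits from this byte, the remaining 6-k from the next */
    return ((buf[i] & ((1u << k) - 1)) << (6 - k))
         | (buf[i + 1] >> (CHAR_BIT - (6 - k)));
}

int main(void)
{
    /* when CHAR_BIT == 8, these 3 bytes pack the sextets 63, 0, 15, 63 */
    unsigned char buf[3] = { 0xFC, 0x03, 0xFF };
    unsigned n;
    for (n = 0; n < 4; n++)
        printf("%u\n", sextet(buf, n));
    return 0;
}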
 
T Koster

Walter said:
What do you mean by unpacking on exact boundaries? That there will be no
partial 6-bit numbers left on the end of the buffer?

lcm(24,CHAR_BIT) bits will be convertible into groups of 4 output
characters of 6 bits each, with the only left-overs being the trailing
case at eof.
How about making
the buffer lcm(6,CHAR_BIT)/CHAR_BIT chars (lcm(6,CHAR_BIT) bits) big and
reading that many instead? There will always be a whole number of 6-bit
numbers in there.

True... and that is really what you are after, since the 4-character
output is really just an artifact of 8-bit bytes.
So now I know which char to start looking at, given any char width;
I have yet to find a suitable shift/mask expression that generalises to
all legal values of CHAR_BIT.

As you noted, i = (int) (n * 6 / CHAR_BIT) gives you the starting character.
j = n * 6 - i * CHAR_BIT [or, more simply, (n*6) % CHAR_BIT] then gives
you the bit index of the starting bit relative to the left-most (most
significant) bit.

If (k = (CHAR_BIT - j)) >= 6, then you have a full 6 bits in this byte,
and the mask is 63 << (k - 6). Otherwise you only have k bits of it:
the mask for this byte is m1 = (63 >> (6-k)), and the mask for the
next byte is (63^m1) << (CHAR_BIT-6).

Thanks. This makes sense. I'll give it a go.

Thomas
 
Tim Rentsch

T Koster said:
I originally set out to make a command-line, ISO C, Base-64
encoder/decoder, since all the open-source ones I found were not to my
liking, problems ranging from using platform-specific libraries to just
total spag bol.

It worked fine, but I got to thinking about maximising its portability
even further by not taking CHAR_BIT==8 for granted. This is turning out
to be a nightmare, but now I *have* to prove to myself that it is
possible in ISO C :).

Here is code that illustrates an approach. The second program
(b64d.c) depends on characters in the encoding set having values less
than 256 when viewed as unsigned chars (or ints); it's easy enough to
program around that requirement if it isn't met. Also both programs
rely on getchar()/EOF working properly, which conceivably could break
if sizeof(char) == sizeof(int); I don't know what ISO has to say
about such a possibility, but as long as there's a way to input a
single character and tell if end of file was reached instead, that's
also easy to adapt.

Incidentally, the final while loop (on 'n != 0') in b64e.c is there to
put in some extra characters (equal signs) to "round out the buffer".
That code is there because the 'mmencode' program does that, which
makes me think some programs might depend on it. The extra characters
are not, strictly speaking, necessary; in particular b64d.c will work
just fine whether the equal signs are there or not.


/* b64e.c - do a base64 encode on stdin */

#include <stdio.h>
#include <limits.h>

char A[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int
main(){
    int n = 0, k = 0, c;        /* n: bits pending in b; k: output column */
    unsigned char b = 0;        /* bit accumulator, filled from the MSB down */

    while( c = getchar(), c != EOF ){
        unsigned char a = c;
        int t = CHAR_BIT;       /* bits of a not yet moved into b */
        while( t > CHAR_BIT - n ){   /* a won't all fit: emit sextets to make room */
            b |= a >> n;
            a <<= CHAR_BIT - n;
            t -= CHAR_BIT - n;
            n += CHAR_BIT - n;
            if( ++k > 72 ) k = 1, putchar( '\n' );
            putchar( A[ b >> (CHAR_BIT-6) ] );   /* top 6 bits of b */
            b <<= 6;
            n -= 6;
        }
        b |= a >> n;            /* the rest of a fits below b's n bits */
        n += t;
    }

    while( n > 0 ){             /* flush remaining (possibly partial) sextets */
        if( ++k > 72 ) k = 1, putchar( '\n' );
        putchar( A[ b >> (CHAR_BIT-6) ] );
        b <<= 6;
        n -= 6;
    }

    while( n != 0 ){            /* pad with '=' as mmencode does */
        if( n < 0 ) n += CHAR_BIT;
        n -= 6;
        if( ++k > 72 ) k = 1, putchar( '\n' );
        putchar( '=' );
    }

    if( k > 0 ) putchar( '\n' );
    fflush( stdout );
    return 0;
}



/* b64d.c - do a base64 decode on stdin */

#include <stdio.h>
#include <limits.h>

char A[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int
main(){
    int n = 0, k = 256, c;                  /* n: bits pending in b */
    unsigned char b = 0, v[256], uc;

    while( k > 0 ) v[--k] = 64;             /* 64 marks "not a base64 digit" */
    while( uc = A[k] ) v[ uc & 255 ] = k++; /* map each digit to its 6-bit value */

    while( c = getchar(), c != EOF ){
        if( c > 255 || c < 0 || v[c] > 63 ) continue;   /* skip non-digits */

        n += 6;
        if( n > CHAR_BIT ){                 /* b is full: top it up and emit it */
            b |= v[c] >> (n - CHAR_BIT);
            putchar( b );
            n -= CHAR_BIT;
            b = 0;
        }
        b |= v[c] << (CHAR_BIT - n);        /* stash remaining bits at the top */
    }

    if( n == CHAR_BIT ) putchar( b );       /* final complete byte, if any */

    fflush( stdout );
    return 0;
}
 
 
