Question about disparate CHAR_BIT systems and file access


charles_gero

Hi Everyone,

I have a quick question regarding access to a file from disparate
CHAR_BIT systems. Has anyone had experience writing a file on a system
where CHAR_BIT is one value (let's use the value of 10) and then
reading said file from a system where this value is different (let's
say the common value of 8)?

I'm just curious how this would play out with respect to the standards,
etc. So for example, if I have a system where CHAR_BIT is 10 and I
write a single character to a hard disk file (using fputc, POSIX write,
etc...), and then move this hard drive to a system with CHAR_BIT set to
8 and attempt to read, what would occur? Obviously I would need at
least two "char" reads, but what happens to the 2 bits in the second
read? Are they treated most significant, least significant, etc.? What
file size would even be reported on such a system?
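
To spell out the arithmetic I have in mind, here is a sketch (the
10 is of course hypothetical; I only know CHAR_BIT for the machine
the code runs on):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int writer_char_bit = 10;  /* hypothetical writer's CHAR_BIT */
        /* ceiling division: local chars needed per writer char */
        int reads = (writer_char_bit + CHAR_BIT - 1) / CHAR_BIT;
        printf("one 10-bit char would span %d chars here\n", reads);
        return 0;
    }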

I ask not because I've seen this; as a matter of fact, I don't believe
I've ever personally run into a system where CHAR_BIT is anything other
than 8 (although we know they do exist), but rather in an effort to
understand how to write the most portable code possible. NOTE: I'm
not limiting this to disk file discussion only, just using it as an
example. The file could be generated on machine A and network
transferred to B. I'm just curious how this would work.

All comments are extremely appreciated. Thank you so much.

-Charlie
 

Ben Pfaff

charles_gero said:
I have a quick question regarding access to a file from disparate
CHAR_BIT systems. Has anyone had experience writing a file on a system
where CHAR_BIT is one value (let's use the value of 10) and then
reading said file from a system where this value is different (let's
say the common value of 8)?

When this has been brought up in the past, if I recall correctly
the most common suggestion has been that, if you want to write
portable data files in C, you should only use the least
significant 8 bits of each byte (and zero the rest).
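
Something along these lines, untested (the function names are
mine, purely illustrative):

    #include <stdio.h>

    /* Write one octet per C byte, zeroing everything above bit 7,
       so the file means the same thing whatever the writer's
       CHAR_BIT happens to be. */
    int write_octet(FILE *fp, unsigned value)
    {
        return fputc((int)(value & 0xFF), fp);
    }

    /* Fix the byte order by convention (most significant octet
       first) instead of inheriting the host's representation. */
    int write_u32(FILE *fp, unsigned long v)
    {
        if (write_octet(fp, (unsigned)(v >> 24)) == EOF) return EOF;
        if (write_octet(fp, (unsigned)(v >> 16)) == EOF) return EOF;
        if (write_octet(fp, (unsigned)(v >> 8)) == EOF) return EOF;
        return write_octet(fp, (unsigned)v);
    }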
 

Keith Thompson

Ben Pfaff said:
When this has been brought up in the past, if I recall correctly
the most common suggestion has been that, if you want to write
portable data files in C, you should only use the least
significant 8 bits of each byte (and zero the rest).

And even then, transferring and possibly translating the data is
likely to be non-trivial; it's certainly not defined by the standard.

If both systems have mechanisms for sending and receiving data as
streams of bits, then those mechanisms can be used to achieve a sort
of commonality; bits are bits. Or, if the CHAR_BIT==10 system
supports some networking standard, it will probably be able to send
and receive streams of octets somehow, since that's how most modern
networking protocols are defined.

It's not likely that a CHAR_BIT==8 system and a CHAR_BIT==10 system
would be able to share a common file system; CHAR_BIT!=8 systems tend
to be embedded, and might not even support a file system. But the
standard certainly doesn't preclude the possibility, and if this is
done, the details are going to be system-specific.
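
If you adopt Ben's convention, reading is the mirror image; an
untested sketch (the function name is mine):

    #include <stdio.h>

    /* Rebuild a 32-bit value from four octets, most significant
       first, independently of either machine's CHAR_BIT or native
       byte order. */
    int read_u32(FILE *fp, unsigned long *out)
    {
        unsigned long v = 0;
        int i, c;

        for (i = 0; i < 4; i++) {
            if ((c = getc(fp)) == EOF)
                return EOF;
            v = (v << 8) | (unsigned long)(c & 0xFF);
        }
        *out = v;
        return 0;
    }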
 

Chris Torek

charles_gero said:
I have a quick question regarding access to a file from disparate
CHAR_BIT systems. Has anyone had experience writing a file on a system
where CHAR_BIT is one value (let's use the value of 10) and then
reading said file from a system where this value is different (let's
say the common value of 8)?

For actual historical implementations, just look at the standard
for FTP. (I assume you have access to a "raw-style" ftp command,
rather than the all-automatic, usually-passive implementations
built into various browsers under the "ftp://user@host/path"
syntax.) Note that there is usually a "binary" command, which
corresponds to the protocol-level operation "TYPE L 8" (local byte size 8).

charles_gero said:
I'm just curious how this would play out with respect to the standards,
etc. So for example, if I have a system where CHAR_BIT is 10 and I
write a single character to a hard disk file (using fputc, POSIX write,
etc...), and then move this hard drive to a system with CHAR_BIT set to
8 and attempt to read, what would occur?

Your first problem turns out to be "and then move this hard drive".
The kinds of "hard drive"s that plug into 6, 7, 9, or 10-bit byte
hardware do not plug into 8-bit-byte hardware. (For one thing,
they have the wrong number of pins on the end of the connector,
since they have a different bus width.)

As it turns out, however, there usually are *some* pieces of
hardware you can use to transfer the data. When you do, one of
several things happens:

- "Extra" bits simply vanish. If they were not predictable,
you are in trouble.

- "Extra" bits are re-coded according to some scheme, e.g., a
36-bit word is reported as four octets (8-bit-bytes), plus a
fifth octet in which at most four bits are ones.

- "Missing" bits are reported as constant, usually 0 (i.e., 6-bit
FIELDATA character data comes out as octets in the range 0..63).

- "Missing" bits are filled in with junk, which you must mask
off.
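
For the last two cases, the usual defense is to mask as you read.
An untested sketch, assuming octet-oriented stdio on the reading
side (the function name is mine):

    #include <stdio.h>

    /* Keep only the low 8 bits of each char read, discarding any
       filler bits the transfer hardware invented. */
    int read_octet(FILE *fp)
    {
        int c = getc(fp);
        return (c == EOF) ? EOF : (c & 0xFF);
    }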

charles_gero said:
Obviously I would need at least two "char" reads, but what happens
to the 2 bits in the second read? Are they treated most significant,
least significant, etc.?

Yes, or sometimes no. :)

charles_gero said:
What file size would even be reported on such a system?

On most of these systems, the concept of "file size" was pretty
nebulous in the first place. A file had a different number of
bytes (of whatever byte-size) stored in it depending on how you
accessed it. These systems had a plethora of "access methods",
which -- as Ken Thompson put it -- "filled a much-needed gap".
 

Stephen Sprunk

charles_gero said:
I have a quick question regarding access to a file from disparate
CHAR_BIT systems. Has anyone had experience writing a file on a system
where CHAR_BIT is one value (let's use the value of 10) and then
reading said file from a system where this value is different (let's
say the common value of 8)?

I'm just curious how this would play out with respect to the standards,
etc. So for example, if I have a system where CHAR_BIT is 10 and I
write a single character to a hard disk file (using fputc, POSIX write,
etc...), and then move this hard drive to a system with CHAR_BIT set to
8 and attempt to read, what would occur? Obviously I would need at
least two "char" reads, but what happens to the 2 bits in the second
read? Are they treated most significant, least significant, etc.? What
file size would even be reported on such a system?

You shouldn't be able to physically connect the drive to both systems in
that specific case since a system that uses a non-power-of-two char size
will, by necessity, have a different interface than a power-of-two char
one (i.e. the number of data pins will differ, among other likely
problems). In the more common case where one system uses a CHAR_BIT
that is a multiple of 8, you could likely connect it and get the
data with a logical multiplication of chars, e.g. one 24-bit char
written equals three 8-bit chars read.
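
The multiplication is just compile-time arithmetic; a sketch,
assuming the wide system's CHAR_BIT is a multiple of 8:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* prints 3 on a CHAR_BIT==24 system, 1 where CHAR_BIT==8 */
        printf("1 char here = %d octets there\n", CHAR_BIT / 8);
        return 0;
    }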

The good news is that people who use such systems are used to these
problems and will likely have tools to convert data (to the extent
conversion is possible). As long as your data doesn't stray outside of
the basic execution character set, you can safely ignore the problem in
practice. It's binary data that will bite you, and there's no portable
answer to that problem.

If there's a light at the end of the tunnel, it's that every mainstream
system (and even most embedded and HPC ones these days) has CHAR_BIT==8.
While it makes sense to ensure your code still works elsewhere, you
generally won't have to deal with moving data between worlds -- it'll
stay stuck in the world where it was created and your program can handle
it natively. Dealing with endianness issues is a far, far worse
problem.

charles_gero said:
I ask not because I've seen this; as a matter of fact, I don't believe
I've ever personally run into a system where CHAR_BIT is anything other
than 8 (although we know they do exist), but rather in an effort to
understand how to write the most portable code possible. NOTE: I'm
not limiting this to disk file discussion only, just using it as an
example. The file could be generated on machine A and network
transferred to B. I'm just curious how this would work.

Network protocols are defined to have a specific number of bits per
byte, usually 8. The IETF goes so far as to specify its protocols (like
TCP/IP) in terms of "octets" to avoid any possible confusion. If a
system uses some other number of bits, it's required to adapt the data
before transmission or after reception to comply with the protocol.
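
The adaptation can be as simple as masking on the way out; an
untested sketch (the function name is mine; a real CHAR_BIT > 8
system would need a documented encoding for the high bits rather
than silently dropping them):

    #include <stddef.h>

    /* Reduce each char in the buffer to an octet before handing
       it to the network stack. */
    void to_octets(unsigned char *buf, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            buf[i] &= 0xFF;
    }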

S
 

Walter Roberson

Stephen Sprunk said:
You shouldn't be able to physically connect the drive to both systems in
that specific case since a system that uses a non-power-of-two char size
will, by necessity, have a different interface than a power-of-two char
one (i.e. the number of data pins will differ, among other likely
problems).

That depends on how far deep you want to get in your definition of
"physically connect". SATA and similar technologies use serial
interfaces, and so at the connection point are not bound to any
particular byte width.

Stephen Sprunk said:
Network protocols are defined to have a specific number of bits per
byte, usually 8. The IETF goes so far as to specify its protocols (like
TCP/IP) in terms of "octets" to avoid any possible confusion. If a
system uses some other numbers of bits, it's required to adapt the data
before transmission or after reception to comply with the protocol.

The wording you have used might be interpreted by some as indicating
that all networking must be done in 8-bit bytes. That is not the case:
protocols not defined by the IETF (or similar bodies) can use whatever
they want internally, subject to the limitation that they have to
pad out to an octet boundary at the end if they want to use ethernet
(and even now, not everything is ethernet).
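
The terminal padding is simple arithmetic; a sketch (the function
name is mine):

    /* Bits of padding needed to reach the next octet boundary
       before a frame can be handed to ethernet. */
    unsigned pad_bits(unsigned nbits)
    {
        return (8 - nbits % 8) % 8;
    }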
 

Gordon Burditt

charles_gero said:
I have a quick question regarding access to a file from disparate
CHAR_BIT systems. Has anyone had experience writing a file on a system
where CHAR_BIT is one value (let's use the value of 10) and then
reading said file from a system where this value is different (let's
say the common value of 8)?

I'm just curious how this would play out with respect to the standards,
etc. So for example, if I have a system where CHAR_BIT is 10 and I
write a single character to a hard disk file (using fputc, POSIX write,
etc...), and then move this hard drive to a system with CHAR_BIT set to
8 and attempt to read, what would occur? Obviously I would need at
least two "char" reads, but what happens to the 2 bits in the second
read? Are they treated most significant, least significant, etc.? What
file size would even be reported on such a system?

I seem to recall some early DEC hardware (tape, and possibly disks)
that read and wrote data in 16-bit chunks. You could maybe re-connect
this hardware to different systems (e.g. PDP-11 vs. IBM 360), or
move media between them. The interesting things here:

- 16-bit words (aligned) DID NOT have byte-order problems.
- Strings COULD have byte-order problems (even a zero-length string!).
I believe the UNIX 'dd' command option conv=swab was invented to
deal with this. Related options were various ways of translating
between ASCII and EBCDIC.
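
What conv=swab does is trivial to write yourself; an untested
sketch (the function name is mine):

    #include <stddef.h>

    /* Exchange each pair of octets, undoing the 16-bit word
       byte-order mismatch described above.  A trailing odd byte,
       if any, is left in place here (implementations of dd differ
       on how they handle that case). */
    void swab_buf(unsigned char *buf, size_t n)
    {
        size_t i;

        for (i = 0; i + 1 < n; i += 2) {
            unsigned char t = buf[i];
            buf[i] = buf[i + 1];
            buf[i + 1] = t;
        }
    }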
 
