Don Knuth and the C language


James Kuyper

The general rule is that a T* points to an object of size
sizeof( T ).

»Static« means that the size of the buffer is known at
compile time. ...

As a general rule in this newsgroup, when using a term for which the C
standard provides its own definition, it's best to avoid the confusion
that can be caused by using that term with a conflicting meaning. Jokes
to the contrary notwithstanding, the C standard only provides a few
different definitions for 'static', and that isn't one of them.

A C term that comes close to that meaning is defined in 6.2.5p23: "A
type has _known constant size_ if the type is not incomplete and is not
a variable length array type." I've used underscores to indicate that
"known constant size" is italicized, and ISO convention indicating that
this sentence defines the meaning of that phrase.
Let's assume the size is 75. In this case, one
can statically encode the buffer size in the pointer type:

#include <stddef.h>
#include <stdio.h>

void f( char( * const buf )[ 75 ]);

For instance, 'buf' is 'static' according to your definition, but it
isn't 'static' according to any of the C standard's several definitions.
However, *buf does have a known constant size.
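
To make that concrete, a minimal sketch built around the declaration
above (the body and the caller are added for illustration):

#include <stdio.h>

/* The pointee type char[75] is complete and not a variable length
   array type, so it has "known constant size". */
void f( char( * const buf )[ 75 ])
{
    printf( "sizeof *buf = %zu\n", sizeof *buf );  /* prints 75 */
}

int main( void )
{
    char buffer[ 75 ];
    f( &buffer );
    return 0;
}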
 

Stefan Ram

James Kuyper said:
As a general rule in this newsgroup, when using a term for which the C
standard provides its own definition, it's best to avoid the confusion

»All objects with static storage duration shall be initialized
(set to their initial values) before program startup.« (N1570)

This hints at the idea: »static« ~ »before program startup«.
Also, think about why »static_assert« is called »static_assert«.

N1570 always carefully distinguishes between the English word
»static« and the C keyword »static«.

An object whose identifier is declared without the
storage-class specifier _Thread_local and without »static«
and either with external or internal linkage has static
storage duration, but it is not declared with the keyword
»static«.
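
For instance, a small sketch (my example, not from N1570):

/* x has static storage duration (it has external linkage) and is
   initialized before program startup, although the keyword
   »static« never appears in its declaration. */
int x = 42;

/* y also has static storage duration; here the keyword »static«
   does appear, but its job is to give y internal linkage. */
static int y = 43;

int main( void ) { return x + y; }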
 

James Kuyper

.....
N1570 always carefully distinguishes between the English word
»static« and the C keyword »static«.

But it also uses the English word (NOT the keyword) with a specialized
meaning in C, in the phrase "static storage duration". One could
reasonably assume, as I did, that you were using the term "static
buffer" to mean a buffer that had "static storage duration". The phrase
"known constant size", being precisely defined by the standard, and
closely connected to the point you were making, would have been a much
better term to use (though it's not a drop-in replacement for "static" -
the sentence would require significant rearrangement).
 

Keith Thompson

»All objects with static storage duration shall be initialized
(set to their initial values) before program startup.« (N1570)

This hints at the idea: »static« ~ »before program startup«.
Also, think about why »static_assert« is called »static_assert«.

N1570 always carefully distinguishes between the English word
»static« and the C keyword »static«.

An object whose identifier is declared without the
storage-class specifier _Thread_local and without »static«
and either with external or internal linkage has static
storage duration, but it is not declared with the keyword
»static«.

The standard usually uses the English word "static" to refer to the
storage duration. The keyword is (ab)used for a couple of other
meanings.

Your statement upthread was:

In C one actually only gets the start address, but has to
learn the size of the buffer by other means. (The size of
the pointee which is provided by the type system of C can
only be employed for static buffers.)

I can't think of any meaning of "static", consistent with the C
standard's usage of the word or not, for which that statement is true.
Consider:

#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int (*p0)[10] = malloc(sizeof *p0);
    int random_size = rand() % 10 + 10;
    int (*p1)[random_size] = malloc(sizeof *p1);
    printf("p0 points to a %zu byte object\n", sizeof *p0);
    printf("p1 points to a %zu byte object\n", sizeof *p1);
}

Nothing here has static storage duration. Are both *p0 and *p1
"static buffers"? If so, what exactly do you mean by that?
 

glen herrmannsfeldt

(snip)
(snip, I wrote)
Yes, C allows you to consider any object as an array of unsigned
char, i.e., of bytes; that's how the standard defines "object
representation".
I still don't understand what you mean by "the C assumption".
Certainly a lot of C programmers write code that assumes CHAR_BIT==8
(and depending on the context, that can be a perfectly reasonable
assumption). The language can't *force* programmers to write code
that's portable to systems with CHAR_BIT > 8. It provides all
the tools to do so, but interoperability between, say, 8-bit and
9-bit systems is trickier.

(snip, I also wrote)
So is that what you mean by "the C assumption", that C *programmers*
make an assumption that isn't imposed by the C standard? If so, that's
a perfectly valid point, but I wouldn't use that phrase to describe it.

It started as I was trying to understand how C pointers compare
to PL/I pointers. PL/I doesn't really have anything like
(unsigned char *), though you can have arrays or strings of CHAR.
(While PL/I wasn't all that popular, it was for some time one
of the more popular languages with pointers.)

The PL/I way is with variables, or strings of type BIT, which have
many of the same properties as CHAR, such as the ability to use
SUBSTR and string concatenation for substrings.

If you have a FLOAT BIN(21) and a FIXED BIN(31,0) (That is, 32 bit
floating and fixed point values, with a little luck) you can

DCL I FIXED BIN(31,0), X FLOAT BIN(21);
X=3.14;
UNSPEC(I)=UNSPEC(X);

Where UNSPEC on the right converts to a bit string, and on the
left converts a bit string back to a non-BIT type.
(And no problem with alignment that could happen in other
ways of doing the assignment.)
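
A rough C analogue of that UNSPEC round-trip, as a sketch that
assumes float is 32 bits:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float x = 3.14f;  /* 32-bit float, like FLOAT BIN(21) */
    int32_t i;        /* 32-bit integer, like FIXED BIN(31,0) */

    /* Copy the object representation across, which is what
       UNSPEC(I) = UNSPEC(X) does, with no alignment trouble. */
    memcpy(&i, &x, sizeof i);
    printf("bits of 3.14f: 0x%08" PRIX32 "\n", (uint32_t)i);
    return 0;
}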

On the other hand, one tends to write more efficient bit-moving
code in C, as long as one can work with more than one bit at
a time. For PL/I, you hope that the compiler figures out where
the byte (or word) boundaries are and does efficient moves, but
you can't usually be sure.

You can shift and AND to extract and insert bits into an 8-bit
char, but the operations are different enough for a 9-bit char
that, pretty much, no-one will write code to do it.

-- glen
 

Kaz Kylheku

void * is not a "fudge". It's an inherent feature of the C language. We can pass
around memory buffers, and either treat them as sequences of bytes, or
pass them to client code which understands them.
It's C's way of providing flexibility, loose coupling, and abstraction.

You can use unsigned char * instead of void *.

The benefits are:
- cast required in both directions, so more safety.
- ready for byte access and arithmetic: cumbersome
conversions that do not add safety are eliminated.

I have experimented with using unsigned char * as a generic pointer
to any object: for allocator returns, polymorphism such as
the context for callbacks and so on. It is perfectly fine.
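
A minimal sketch of that style, with made-up names: an unsigned
char * serves as the generic context pointer for a callback, with
explicit casts at both ends:

#include <stdio.h>

struct counter { int n; };

/* The callback receives its context as unsigned char *. */
static void bump(unsigned char *ctx)
{
    struct counter *c = (struct counter *)ctx;  /* cast back out */
    c->n++;
}

static void run(void (*fn)(unsigned char *), unsigned char *ctx)
{
    fn(ctx);
}

int main(void)
{
    struct counter c = { 0 };
    run(bump, (unsigned char *)&c);  /* cast in is required, too */
    printf("%d\n", c.n);
    return 0;
}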
 

Malcolm McLean

So the compressed data goes into the same buffer as the uncompressed
data? If compression makes it bigger, where does the extra data go?
But that's not directly relevant to the current point, so let's ignore
it for now. (Of course I'd use size_t rather than int, but we can set
that aside as well.)
The function returns a void *. So it's a pretty fair guess that it
mallocs a buffer, returns it, and writes the length to clen. It could
pathologically return a pointer to a static buffer, but few real
programmers would be that stupid.
It's unlikely, and frankly unacceptable, that that would be the only
documentation. The algorithm used needn't be documented, but there had
better be something that tells me how to use it.
People aren't perfect. "How does this function behave if CHAR_BIT is not 8?"
is something that is quite likely not to be documented. A good language is
one which is robust to a bit of sloppiness, poor design, people not
documenting things or even misdocumenting things.
If it operates on bitstreams (which needn't be a whole number of octets
or of bytes), how does it tell you how many bits of the final octet or
byte are part of the bitstream?
For the compressor, it would almost certainly pad the input bitstream
to a whole number of bytes. So you get a few trailing clear bits at the
end if you try to compress a bitstream that's not a multiple of bytes.
So the caller has to set up his bitstream so that it can tolerate
trailing clear bits.
For the compressed stream, it is a bitstream set up so that it tolerates
trailing clear bits. Typically there's a sentinel sequence to indicate
end of data.
A function with a built-in assumption of CHAR_BIT==8 could exhibit a
wide variety of behaviors given byte values greater than 255; I'm not
convinced that examining the output of a single call would be that
useful.
You probably want to pass it a slightly longer sequence to be absolutely
sure. If you can compress and recover 0x100, then it's unlikely that
the system has an assumption that CHAR_BIT is 8, however.
 

Keith Thompson

Malcolm McLean said:
The function returns a void *. So it's a pretty fair guess that it
mallocs a buffer, returns it, and writes the length to clen. It could
pathologically return a pointer to a static buffer, but few real
programmers would be that stupid.

Having to guess is unacceptable. If a function allocates a buffer by
calling malloc() and doesn't document the fact that the caller will have
to free() it, I won't be using that function, thankyouverymuch.
People aren't perfect. "How does this function behave if CHAR_BIT is not 8?"
is something that is quite likely not to be documented. A good language is
one which is robust to a bit of sloppiness, poor design, people not
documenting things or even misdocumenting things.

What does the language have to do with whether a function is documented?
For the compressor, it would almost certainly pad the input bitstream
to a whole number of bytes. So you get a few trailing clear bits at the
end if you try to compress a bitstream that's not a multiple of bytes.
So the caller has to set up his bitstream so that it can tolerate
trailing clear bits.

If it operates on bitstreams, but it doesn't distinguish between a
bitstream consisting of 9 bits and one consisting of 16 bits, with the
last 7 equal to 0, then it's not a valid compression function. Unless
it's meant to be lossy -- something that would need to be mentioned in
the documentation if there were any.
For the compressed stream, it is a bitstream set up so that it tolerates
trailing clear bits. Typically there's a sentinel sequence to indicate
end of data.

And if that sentinel sequence occurs as valid data in the middle of the
bitstream? Or is it not intended to operate on arbitrary data?
You probably want to pass it a slightly longer sequence to be absolutely
sure. If you can compress and recover 0x100, then it's unlikely that
the system has an assumption that CHAR_BIT is 8, however.

I can't be sure what the function does in the normal case, where
CHAR_BIT==8. If I had to compress data and decompress data on a 9-bit
system I'd find something else to use. If I *had* to use this one for
some reason, I'd want to examine the source code and/or perform very
thorough testing; seeing it behave sensibly with a byte value of 0x101
wouldn't be enough to give me confidence that it won't corrupt my data.

In real life, such functions *do* have documentation -- perhaps good,
perhaps bad, perhaps incomplete, but more than just a bare declaration.

For this hypothetical example, and for the sake of discussion, I'd be
willing to accept that documentation does exist, and that it describes
the behavior adequately and correctly. Lacking that, I see little
reason to consider using it.
 

Malcolm McLean

Having to guess is unacceptable. If a function allocates a buffer by
calling malloc() and doesn't document the fact that the caller will have
to free() it, I won't be using that function, thankyouverymuch.

What does the language have to do with whether a function is documented?

If it operates on bitstreams, but it doesn't distinguish between a
bitstream consisting of 9 bits and one consisting of 16 bits, with the
last 7 equal to 0, then it's not a valid compression function. Unless
it's meant to be lossy -- something that would need to be mentioned in
the documentation if there were any.
It is and it isn't. It's not "lossy", that has another meaning. The last
few bits are often a problem for a bitstream, because conventional
backing store interfaces don't normally allow for storage of a specified
number of bits. So the true end of data is going to have to be tagged
specially, somehow.
And if that sentinel sequence occurs as valid data in the middle of the
bitstream? Or is it not intended to operate on arbitrary data?
A bitstream is data, not random bits. So a sentinel is like a zero in
a string. If you need to represent a string with embedded zeroes, you can
have an escape. But it has to be parsed by something which understands
it. As a bitstream gets passed about on systems with varying byte sizes,
it will tend to accumulate trailing bits, inevitably. Until it is
parsed and trimmed back to its genuine size. Unlikely to be much of
a practical problem, and we're only talking about one or two bytes
each time.
I can't be sure what the function does in the normal case, where
CHAR_BIT==8. If I had to compress data and decompress data on a 9-bit
system I'd find something else to use. If I *had* to use this one for
some reason, I'd want to examine the source code and/or perform very
thorough testing; seeing it behave sensibly with a byte value of 0x101
wouldn't be enough to give me confidence that it won't corrupt my data.
Any function can have bugs. The test tells you that CHAR_BIT isn't
hard-coded to 8, it treats larger bytes as larger. There might be
more bugs lurking there, for example if it uses a "rack" of 32 bits, and
bytes are also 32 bits long, the "rack" might be too short. But that's
true of almost any function written in any language.
In real life, such functions *do* have documentation -- perhaps good,
perhaps bad, perhaps incomplete, but more than just a bare declaration.


For this hypothetical example, and for the sake of discussion, I'd be
willing to accept that documentation does exist, and that it describes
the behavior adequately and correctly. Lacking that, I see little
reason to consider using it.
If you can employ perfect programmers who never make any mistakes, then
it really doesn't matter much what language you use. They never make
mistakes, so everything will always go very smoothly.
The question is how the language responds to a programmer being sloppy,
or miscommunication (meticulous documentation, but in Chinese), or
designs not being done, or being compromised by urgent changes to
requirements.
We see that, given a difficult situation - an undocumented compress
function and a system which doesn't use 8-bit bytes - C doesn't respond
too badly. We can work out how the function works relatively easily,
and we can isolate any bugs / limitations.

No-one's saying that these are ideal circumstances, or that code shouldn't
be documented.
 

Kenny McCormack

....
If you can employ perfect programmers who never make any mistakes, then
it really doesn't matter much what language you use. They never make
mistakes, so everything will always go very smoothly.
The question is how the language responds to a programmer being sloppy,
or miscommunication (meticulous documentation, but in Chinese), or
designs not being done, or being compromised by urgent changes to
requirements.

You guys really need to get a room!
We see that, given a difficult situation - an undocumented compress
function and a system which doesn't use 8-bit bytes - C doesn't respond
too badly. We can work out how the function works relatively easily,
and we can isolate any bugs / limitations.

No-one's saying that these are ideal circumstances, or that code shouldn't
be documented.

Kiki doesn't know how to operate in other than ideal circumstances.

That's why he prefers this newsgroup to anything resembling the real world.

--
No, I haven't, that's why I'm asking questions. If you won't help me,
why don't you just go find your lost manhood elsewhere.

CLC in a nutshell.
 

Kenny McCormack

Robert Wessel said:
The standard way to pad a bit stream is to append a zero bit, plus as
many one bits as needed to round out to an even number of storage
units. Then when reading, you check the end of the input and discard
any trailing ones and the immediately preceding zero. So there's no
need to decode a sentinel that might appear in the middle of
the stream.

Chapter & verse, please?

c89 or c99 or later?

Please do be specific.

--
About that whole "sent His Son to die for us thing", I've never been able
to understand that one. It's not like Jesus isn't going back to Heaven
after his Earthly self dies, right? So, having him be executed, and
resurrect a few days later strikes me as being more akin to spending the
weekend at the non-custodial parent's house than "dying", doesn't it?
 

Malcolm McLean

The standard way to pad a bit stream is to append a zero bit, plus as
many one bits as needed to round out to an even number of storage
units. Then when reading, you check the end of the input and discard
any trailing ones and the immediately preceding zero. So there's no
need to decode a sentinel that might appear in the middle of
the stream.
Thanks, that's worth knowing.
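
A sketch of the reading side of that scheme, assuming CHAR_BIT == 8
and MSB-first bit order within a byte:

#include <stdio.h>

/* Returns how many bits of the final byte are data, after
   discarding the trailing 1 bits and the single 0 bit that
   precedes them. A return of 0 means the byte was all padding. */
static unsigned unpad_last_byte(unsigned char last)
{
    unsigned bits = 8;
    while (bits > 0 && (last & 1u)) {  /* discard trailing 1s */
        last >>= 1;
        bits--;
    }
    if (bits > 0)  /* discard the 0 that marks the padding */
        bits--;
    return bits;
}

int main(void)
{
    /* data bits 101, padded out to 10101111: expect 3 */
    printf("%u\n", unpad_last_byte(0xAF));
    return 0;
}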
 

Keith Thompson

Robert Wessel said:
On Thu, 1 May 2014 23:15:46 -0700 (PDT), Malcolm McLean
The standard way to pad a bit stream is to append a zero bit, plus as
many one bits as needed to round out to an even number of storage
units.

Do you mean an even number or a whole number? If it's really an even
number, why would an odd number of bytes be forbidden?
Then when reading, you check the end of the input and discard
any trailing ones and the immediately preceding zero. So there's no
need to decode a sentinel that might appear in the middle of
the stream.

Is this actually a standard? Can you cite a reference?

I'll be pleasantly surprised if there's really just one consistently
used standard.
 

Stephen Sprunk

Do you mean an even number or a whole number? If it's really an
even number, why would an odd number of bytes be forbidden?

I'm sure he meant a whole number.
Is this actually a standard? Can you cite a reference?

I'll be pleasantly surprised if there's really just one consistently
used standard.

The only official standard like this that I'm aware of is for hashes and
block encryption, to pad a variable-sized input up to a multiple of the
block size, but IIRC it uses one 1 followed by one or more 0s, not one 0
followed by one or more 1s as given above.

Note that if your input size is an exact multiple of the block size, you
end up with an entire block of padding; this is necessary to distinguish
between a padded input and an unpadded input that happens to end with
the padding sequence.

This scheme has become a common convention for similar needs in other
domains as well, verging on a de facto standard.

S
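
For illustration, a byte-level sketch of that 1-then-0s padding
(the block size is made up, and the length field that real hash
standards also append is omitted):

#include <stdio.h>
#include <string.h>

#define BLOCK 16  /* hypothetical block size in bytes */

/* Pad buf (len data bytes, cap bytes of capacity) to a multiple
   of BLOCK: one 1 bit (the byte 0x80, since the input here is
   byte-aligned) followed by 0 bits. At least one byte is always
   added, so a len that is already a multiple of BLOCK gains an
   entire block of padding, as noted above. Returns the padded
   length, or 0 if there is not enough room. */
static size_t pad(unsigned char *buf, size_t len, size_t cap)
{
    size_t padded = (len / BLOCK + 1) * BLOCK;
    if (padded > cap)
        return 0;
    buf[len] = 0x80;
    memset(buf + len + 1, 0, padded - len - 1);
    return padded;
}

int main(void)
{
    unsigned char buf[2 * BLOCK] = "hello";
    printf("padded to %zu bytes\n", pad(buf, 5, sizeof buf));
    return 0;
}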
 

glen herrmannsfeldt

Stephen Sprunk said:
On 02-May-14 10:15, Keith Thompson wrote:
(snip)
The only official standard like this that I'm aware of is for hashes and
block encryption, to pad a variable-sized input up to a multiple of the
block size, but IIRC it uses one 1 followed by one or more 0s, not one 0
followed by one or more 1s as given above.
Note that if your input size is an exact multiple of the block size, you
end up with an entire block of padding; this is necessary to distinguish
between a padded input and an unpadded input that happens to end with
the padding sequence.

There is also the CP/M tradition. The CP/M file system only stores the
number of blocks, not the number of bytes. For text files, CP/M marked
the end of the actual text with X'1A' (control-Z).

For some reason that I never knew, this tradition continued with
MS-DOS files, even though the file system does count the bytes.

Even today, it is not unusual to see files with X'1A' at the end,
and for programs reading text files to consider it the end.

-- glen
 

Stefan Ram

glen herrmannsfeldt said:
Even today, it is not unusual to see files with X'1A' at the end,
and for programs reading text files to consider it the end.

In ASCII (1968) we actually have:

0011010 1A 26 ^Z SUB Substitute
0011011 1B 27 ^[ ESC Escape
0011100 1C 28 ^\ FS File Separator
 

Ben Bacarisse

Note that if your input size is an exact multiple of the block size, you
end up with an entire block of padding; this is necessary to distinguish
between a padded input and an unpadded input that happens to end with
the padding sequence.

...unless the sequence happens to end with something that can't be
padding. E.g. using 0+1s as the padding, a sequence that ends xxx0 can
end on a block/byte boundary without needing any "fake" padding.

(This is just a clarification. You don't say that every input must be
padded.)
 

Ben Bacarisse

Robert Wessel said:
Stephen Sprunk said:
On 02-May-14 10:15, Keith Thompson wrote:
[...]
The standard way to pad a bit stream is to append a zero bit, plus
as many one bits as needed to round out to an even number of
storage units.
Note that if your input size is an exact multiple of the block size, you
end up with an entire block of padding; this is necessary to distinguish
between a padded input and an unpadded input that happens to end with
the padding sequence.

...unless the sequence happens to end with something that can't be
padding. E.g. using 0+1s as the padding, a sequence that ends xxx0 can
end on a block/byte boundary without needing any "fake" padding.

(This is just a clarification. You don't say that every input must be
padded.)


Actually that would result in the last zero bit being discarded.
Unless you have some other way to indicate that there was no padding
(perhaps file meta-data), you always need to add at least one trailing
bit. Otherwise the reader has no way to determine if that last zero
is pad or data.

Ah, I thought the proposal was for 0 and *one* or more 1s. No idea why
I thought that, just an incorrect assumption.
 

glen herrmannsfeldt

(snip on EOF indication, then I wrote)
Actually that's a bit different, most CP/M and MS-DOS programs reading
a text file wil assume an 0x1a is EOF, even if encountered in the
middle of the file.

Yes. Well, for CP/M it would only need to be in the last block, but
I presume that it was actually tested anywhere in the file.
One of the major goals of MS-DOS (and Tim Paterson's 86-DOS from
which it derived), was CP/M compatibility. Hence the ability to
invoke many of original APIs by putting a function number in CL and
calling location 5 (which was an exact translation of the OS call
mechanism in CP/M-80), plus a bunch of other stuff (format of FCBs,
etc.). Add the ability to mechanically translate a good chunk of many
CP/M-80 programs, and the convention got carried forward, even though
it was pointless for MS-DOS itself (although if you were interchanging
files with CP/M systems, you'd see the EOFs).

Nothing against back compatibility, but it is over 30 years now,
and I am pretty sure that by now no-one is developing on CP/M
to port to DOS/Windows.
We still have plenty of code that will strip a (single) 0x1a from the
end of a (text) file. But we won't consider one in the middle of a
file to be an EOF. Within the last couple of years we actually had a
customer for one of our products complain that we had managed to break*
the option to *add* an 0x1a to the end of an output file (apparently
whatever they were feeding that into was looking for it).

The one I ran into for a long time was the MS-DOS PRINT spooler,
at least to 3.x, and probably longer.

About 10 years after MS-DOS, I was writing programs to do bit-mapped
graphics on different printers. Printing stops at X'1A'. For parallel
printers, you could copy to LPT1, but for serial printers, it didn't
do any flow control at all, so about the only way was to use the
print spooler, which did.
*It had actually been broken for several years and releases, but the
customer was upgrading a fairly old installation, so it's not like
this is actually a common issue, but it did happen.

-- glen
 

Keith Thompson

glen herrmannsfeldt said:
(snip on EOF indication, then I wrote)
Actually that's a bit different; most CP/M and MS-DOS programs reading
a text file will assume a 0x1a is EOF, even if encountered in the
middle of the file.

Yes. Well, for CP/M it would only need to be in the last block, but
I presume that it was actually tested anywhere in the file. [...]
Nothing against back compatibility, but it is over 30 years now,
and I am pretty sure that by now no-one is developing on CP/M
to port to DOS/Windows.
[...]

Even today, a Windows C program reading input in text mode treats
Control-Z (character 26) as an end-of-file indicator. (I just
tried it on Windows 7 with MSVC 2010 Express.)
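
A sketch of that experiment ("input.txt" is a placeholder name):

#include <stdio.h>

/* Read a file in text mode and count the characters; on Windows,
   text mode stops at the first Control-Z (0x1A). */
int main(void)
{
    FILE *f = fopen("input.txt", "r");  /* "r" opens in text mode */
    int c;
    long n = 0;

    if (f == NULL) {
        perror("input.txt");
        return 1;
    }
    while ((c = getc(f)) != EOF)
        n++;
    printf("read %ld characters\n", n);
    fclose(f);
    return 0;
}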
 
