Don Knuth and the C language



jacob navia

On 05/05/2014 00:14, BartC wrote:
Apart from MSVC which apparently gives 1,0,0, that is also the result
with gcc,

It uses the msvc run time.


PellesC, DMC, Clang and g++, all running on Windows.

Yes, all are bug compatible with msvc.
(gcc under Linux gave 1,1,1.)

So lcc-win is the odd-one-out, in text mode.

Of course. Reporting a correct result is not bug compatible!

Yes, lcc-win is an odd compiler. It doesn't treat 26 as EOF but as 26.

Odd isn't it?
 

Keith Thompson

BartC said:
jacob navia said:
On 04/05/2014 03:09, Keith Thompson wrote:
lcc64 ctrlz.c // Compile it with a good compiler
lcclnk64 ctrlz.obj // Link it with a good linker
ctrlz.exe // Execute
saw_A = 1
saw_Z = 1
saw_Ctrl_Z = 1
[...]

Apart from MSVC which apparently gives 1,0,0, that is also the result with
gcc, PellesC, DMC, Clang and g++, all running on Windows.

gcc on Windows is often installed as part of a POSIX-like layer
that imposes its own semantics. There are several ports of gcc
to Windows, and I doubt that they all behave the same way. And of
course it's the runtime library, not the compiler, that's relevant;
some implementations (like MinGW, I think) combine gcc with the
MS runtime library, others combine gcc with some other library
(like Cygwin).

And g++ is not a C compiler, so ...

There are two major differences between POSIX-style and Windows-style
text files: the end-of-line representation and the treatment
of Ctrl-Z as an end-of-file marker. It would be interesting to
see, for each of the compilers you mention, how it treats both.
For Ctrl-Z handling, you can use the program I posted earlier.
For end-of-line representation, you can write '\n' to a text file
in text mode, then read it back in binary mode.

I'm not surprised that some implementations might use Windows-style
handling for end-of-line and POSIX-style handling of Ctrl-Z, nor
do I suggest that that approach is better or worse than any other.
(gcc under Linux gave 1,1,1.)

So lcc-win is the odd-one-out, in text mode.

(In binary mode, which I generally use, that gives 1,1,1 always.)

And either behavior is perfectly valid as far as the C standard is
concerned.

N1570 7.21.2p2:

A text stream is an ordered sequence of characters
composed into lines, each line consisting of zero or more
characters plus a terminating new-line character. Whether
the last line requires a terminating new-line character is
implementation-defined. Characters may have to be added,
altered, or deleted on input and output to conform to
differing conventions for representing text in the host
environment. Thus, there need not be a one-to-one correspondence
between the characters in a stream and those in the external
representation. Data read in from a text stream will necessarily
compare equal to the data that were earlier written out to that
stream only if: the data consist only of printing characters
and the control characters horizontal tab and new-line; no
new-line character is immediately preceded by space characters;
and the last character is a new-line character. Whether space
characters that are written out immediately before a new-line
character appear when read in is implementation-defined.

My advice: Use text mode for text, binary mode for non-text. If you
care what happens when you write '\x1a' to a file and read it back
(assuming, as in most commonly used character sets, that '\x1a' is a
control character), then you're not working with text.

Admittedly it's not always that simple, especially if you have a
requirement to deal with "foreign" text files.
 

glen herrmannsfeldt

Lew Pitcher said:
On Sunday 04 May 2014 15:49, in comp.lang.c, "ralph"
(snip)
In ASCII (and derivatives), 0x1a (aka ^Z) has been given the mnemonic "SUB",
with the explanation: "SUB is used in the place of a character that has
been found to be invalid or in error. SUB is intended to be introduced by
automatic means."
"Unknown character" would fit the intent of the SUB (0x1A ^Z) character.

I haven't thought about this for some years, but didn't DEC systems
start the use of ^Z for EOF on terminal input streams?

-- glen
 

Malcolm McLean

Like I said in another thread I was using "shell" only generically - a
term for that run-time thingy that isolates an app from the kernel. I
should have realized that UNIX people tend to apply a more limited
meaning. If it is of any consequence, I've received the same from
Windows people.

I should have stuck to "run-time thingy". <bg>
In the olden days, it made sense to talk about "an EBCDIC machine".
The screen would be memory-mapped to bytes representing EBCDIC
characters. Most programs would operate via system-level utilities
which were hardcoded to accept EBCDIC text strings. All the text
files on the system would be EBCDIC.
Nowadays, there's still that concept. The filing system will have
a fixed representation for file names, for example. But it's less
meaningful than it was. Most programs use raster displays, and it's
relatively easy to set up a font to display any character set.
There will be a mixture of text files on the system, downloaded
from the internet, and users expect that most software read all of
the common formats.

Jacob's approach of reading all files in binary is probably a good
intermediate step. Long-term, of course, we want all text files
to return utf-8 strings when read, transparently.
 

Ben Bacarisse

jacob navia said:
On 05/05/2014 00:14, BartC wrote:

Of course. Reporting a correct result is not bug compatible!

Yes, lcc-win is an odd compiler. It doesn't treat 26 as EOF but as 26.

Odd isn't it?

In case there is a linguistic confusion here, saying that something is
the "odd one out" is not in the least critical and does not mean that
the thing is odd. Calling a thing "odd" is mildly critical, but being
the "odd one out" simply means that it is the exception -- exceptional
if you like.
 

Ben Bacarisse

ralph said:
I think it did.

Yes, I am pretty sure that TOPS-20, for one, did this with Ctrl-Z. It
was, however, a command to the TTY driver and had no meaning in a data
stream. I would put money on that twist being a CP/M invention (or at
least an invention from a very small system that needed some way to
signal end-of-data but did not want the full tty driver mechanism).

<snip>
 

Lew Pitcher

Agree. (I should be looking this up, instead of from memory, but ...)

The Ctrl-Z came from the fact that the CP/M file system didn't track
or deliver a 'count' of bytes, but rather always delivered a 128-byte
(or 256-byte?) block

CP/M 2.2 delivered 128-byte blocks to the function 20 Read Sequential and
function 33 Read Random BDOS calls
with no indication that anything inside was "trailing" or
actually part of a file, thus one needed a specific end-of-file
delimiter. It did however let the CRT know whether the block was or
was not the "last block".

The A register was set to a non-zero value "if no data exists at the next
record position (e.g. end of file occurs)."
 

glen herrmannsfeldt

(snip)
In the olden days, it made sense to talk about "an EBCDIC machine".
The screen would be memory-mapped to bytes representing EBCDIC
characters. Most programs would operate via system-level utilities
which were hardcoded to accept EBCDIC text strings. All the text
files on the system would be EBCDIC.

The IBM printing terminals used with S/360 and later didn't
use EBCDIC, but there was a conversion along the way.

I believe line printers like the 1403 have a map somewhere
indicating which character (bit pattern) is where on the
print train.

The 3800 (a laser printer that doesn't look anything like
one you would put on a desk) does all the character coding
in software.

The 2250 and 2260 use hardware character generators, but I
don't know which code they use.

And ASCII terminals were commonly used with translation
somewhere along the way.

There are some characters in EBCDIC and not ASCII (not and cent,
for two) and some in ASCII not in EBCDIC (caret and tilde).

The PL/I (F) compiler (the only one I have noted such in) has
comments in some modules noting that they are character code
independent, and others noting that:

* THE OPERATION OF THIS MODULE DEPENDS UPON AN INTERNAL
* REPRESENTATION OF THE EXTERNAL CHARACTER SET WHICH IS
* EQUIVALENT TO THE ONE USED AT ASSEMBLY TIME. THE CODING HAS
* BEEN ARRANGED SO THAT REDEFINITION OF 'CHARACTER' CONSTANTS,
* BY RE-ASSEMBLY, WILL RESULT IN A CORRECT MODULE FOR THE NEW
* DEFINITIONS.
Nowadays, there's still that concept. The filing system will have
a fixed representation for file names, for example. But it's less
meaningful than it was. Most programs use raster displays, and it's
relatively easy to set up a font to display any character set.
There will be a mixture of text files on the system, downloaded
from the internet, and users expect that most software read all of
the common formats.
Jacob's approach of reading all files in binary is probably a good
intermediate step. Long-term, of course, we want all text files
to return utf-8 strings when read, transparently.

-- glen
 

jacob navia

On 05/05/2014 01:20, Ben Bacarisse wrote:
Is there really any need for that sort of thing?

Yes, this ensures that I remain in his killfile!
With an older (32 bit) version of your lcc, I get the output that Keith
gets using MS's compiler. Is the new behaviour part of the 64 bit
changes?

Yes. I rewrote ALL stdio. I am Microsoft clean now. Before, in the 32
bit version I used MSVCRT.DLL for stdio.
Is there any replacement mechanism? On a newish version of Windows,
using either your lcc64 IO library or even native Windows IO primitives,
is there a way to tell an interactive console program that there is no
more input?

<snip>

d:\lcc-src\libc\test>type tstdin.c
#include <stdio.h>
int main(void)
{
int c;
while ((c=getchar()) != EOF) {
putchar(c);
}
}

d:\lcc-src\libc\test>lc64 tstdin.c

d:\lcc-src\libc\test>tstdin.exe
abc^Z
abc
^Z

d:\lcc-src\libc\test>

In the first line of input I write "abc" then Ctrl-z. Nothing happens,
the Ctrl-Z is ignored.
In the second line of input I type Ctrl-Z at the START of the line. The
ReadFile() function of the OS returns an end of file condition.

Just as in Unix.
 

Richard Tobin

Only if there are no characters waiting to be passed to the process.
Otherwise it is discarded and any waiting characters are sent. Typing
"abc^D" does not result in an EOF condition.
No but, often, it flushes the (pseudo-) tty input so the program gets
it

Yes, that's the first case above, where there are characters ("abc")
waiting. They get passed to the program,
and a second one then causes the input to be closed.

And now there are no characters waiting, so read() returns 0 which
indicates end of file (it doesn't "close" anything).

-- Richard
 

Richard Tobin

What's more, the "EOF condition" only exists at the stdio level.
The underlying read() system call merely returns 0, and further
reads may return more data. And some stdio implementations (in
particular Linux) do not correctly implement the EOF condition.
How is the implementation incorrect?

If you type the sequence "a", linefeed, end-of-file-character,
successive calls to getchar() should return 'a', '\n', EOF, EOF, EOF,
... because the EOF condition should be set and remain set until
clearerr() is called.

However, on Linux (I think I should really say "using glibc") only
one EOF is returned and the system waits for another character to be
typed.

There is a discussion about fixing it at
https://sourceware.org/ml/libc-alpha/2012-09/msg00356.html

-- Richard
 

Noob

James said:
"A type has _known constant size_ if the type is not incomplete and is
not a variable length array type." I've used underscores to indicate that
"known constant size" is italicized, an ISO convention indicating that
this sentence defines the meaning of that phrase.

(Waaay off-topic)

Some mail/news clients support so-called "enhanced plain-text features"
to display *bold* /italic/ or _underlined_ text.

http://kb.mozillazine.org/Thunderbird_:_FAQs_:_Viewing_Headers#Enhanced_plain_text_features

Regards.
 

James Kuyper

(Waaay off-topic)

Some mail/news clients support so-called "enhanced plain-text features"
to display *bold* /italic/ or _underlined_ text.

http://kb.mozillazine.org/Thunderbird_:_FAQs_:_Viewing_Headers#Enhanced_plain_text_features

I know - I've got that feature turned off, because I've seen too many
messages become unreadable when viewed using an incompatible client when
it's turned on. Messages using none of the enhanced features are
readable everywhere, though, as shown above, there is some corresponding
inconvenience.
 

Ken Brody

I know - I've got that feature turned off, because I've seen too many
messages become unreadable when viewed using an incompatible client when
it's turned on. Messages using none of the enhanced features are
readable everywhere, though, as shown above, there is some corresponding
inconvenience.

Even more fun (FSVO) are clients that turn plain text into emoticons, making
things such as:

(foo+8)

rather "interesting" to read.
 

Anand Hariharan

BartC said:
jacob navia said:
On 04/05/2014 03:09, Keith Thompson wrote:
lcc64 ctrlz.c // Compile it with a good compiler
lcclnk64 ctrlz.obj // Link it with a good linker
ctrlz.exe // Execute
saw_A = 1
saw_Z = 1
saw_Ctrl_Z = 1
[...]

Apart from MSVC which apparently gives 1,0,0, that is also the result with
gcc, PellesC, DMC, Clang and g++, all running on Windows.
gcc on Windows is often installed as part of a POSIX-like layer
that imposes its own semantics. There are several ports of gcc
to Windows, and I doubt that they all behave the same way. And of
course it's the runtime library, not the compiler, that's relevant;
some implementations (like MinGW, I think) combine gcc with the
MS runtime library, others combine gcc with some other library
(like Cygwin).

Using the compiler that came with VS 2010 ("Microsoft (R) 32-bit C/C++
Optimizing Compiler Version 16.00.40219.01 for 80x86"), I got a 1, 0, 0
but I got a 1, 1, 1 using gcc 4.8.2 on i686-pc-cygwin.

- Anand
 

David Thompson

On 01/05/2014 04:23, glen herrmannsfeldt wrote:
Also in C++. In both cases only in non-static methods of course.
COBOL also allows it, and I'm nearly certain that's where PL/I got it.
(To a first approximation PL/I is COBOL + FORTRAN + spices, beat well
and bake until crisp.) Similarly the PL/I syntax for declaring
structures is much closer to COBOL than the algol ... C tribe.

<snip rest>
 

glen herrmannsfeldt

(snip, I wrote)
(snip)

COBOL also allows it, and I'm nearly certain that's where PL/I got it.
(To a first approximation PL/I is COBOL + FORTRAN + spices, beat well

and also ALGOL
and bake until crisp.) Similarly the PL/I syntax for declaring
structures is much closer to COBOL than the algol ... C tribe.

Yes. As I understand it, COBOL allows only 1D arrays, but you can
build arrays of structures of arrays of ... to an appropriate depth.

So, partial qualification is a convenient way not to have to write
all the qualifiers that you put in only to allow enough dimensions.
PL/I does allow more than one dimension, though.

I believe PL/I also inherited the ability to move subscripts around
between structure qualifiers, or at least move them right.
(I am not sure if you can move them left or not, I never tried.)

-- glen
 
