Malcolm's new book

Malcolm McLean

Al Balmer said:
So write pseudocode. Then there's no danger that your readers will
assume it's good, properly implemented C code.

This would also satisfy your goal (as stated to Santosh) of making it
easier to port.
There's a case for that. But the reader is more likely to know C than any
other algorithm notation.
 
Kelsey Bjarnason

[snips]

Any standard function may *in addition* be implemented as a macro, as
long as the macro is well-behaved. Such a macro must evaluate each of
its arguments exactly once

Exactly once, or _at most_ once?

I ponder the case of, say, strchr, where it might do something akin to:

if ( *haystack == '\0' )
    break;

and as a result, never evaluate "needle" at all.

Is it required to evaluate needle anyways? If so, to what end?
 
Richard Heathfield

Malcolm McLean said:
No. That sort of code is too hard to read.

It needn't be.
What I am after is the idea of specifying functions.

Bad ones.
That's the purpose of chapter one.
The idea is not to implement a library for production use.

Bad code is a bad teacher.
The function is a reasonable drop-in for fgets() in a non-security,
non-safety-critical environment.

No, it isn't a drop-in for fgets (it has a different interface), and no
it isn't reasonable (it's too slow when used with arbitrarily long
lines, which are its very raison d'etre).
It means the programmer doesn't have
to worry about buffer size. A malicious user can crash things by
passing a massive line to the function, but we don't all have to
consider that possibility.

It does, however, create a serious housekeeping overhead which fgets has
never had. There are solutions which do not create a significant
overhead.
 
Keith Thompson

Malcolm McLean said:
Enlightenment dawns. Sure it is. No results beats wrong results.

Maybe. In some circumstances, partial data may be better than no
data, as long as something indicates that it's partial. See the
behavior of fread(), for example.
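For instance (a sketch only, to illustrate the fread() behaviour I mean;
the function name is made up):

#include <stdio.h>

/* sketch: a short count from fread() signals partial data rather
   than failure; feof()/ferror() tell the caller why it stopped */
size_t read_block(FILE *fp, unsigned char *buf, size_t want)
{
    size_t got = fread(buf, 1, want, fp);

    if (got < want && ferror(fp)) {
        /* an error occurred, but buf[0..got-1] is still valid data */
    }
    return got;
}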

I'll grant you that it's not a big deal for a sample program in a book
on algorithms, but it might be worth mentioning the issue.
 
Richard Heathfield

Malcolm McLean said:
There's a case for that. But the reader is more likely to know C than
any other algorithm notation.

If the reader knows C, properly written C should not be too hard for
them to read.
 
Flash Gordon

Malcolm McLean wrote, On 21/08/07 21:53:
That's a design decision. Insisting on certain data types means that the
algorithms are harder to port to languages other than C, which may lack
unsigned types.

Not really. When checking for overflow you still check for overflow,
just as you should with an unsigned type.
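To illustrate (a sketch only; the helper name is made up), with an
unsigned type you test *before* the operation, since wraparound is
silent rather than undefined:

#include <stddef.h>

/* sketch: returns 1 and stores a + b if it fits, 0 if it would wrap */
int size_add_ok(size_t a, size_t b, size_t *result)
{
    if (a > (size_t)-1 - b)      /* a + b would wrap past SIZE_MAX */
        return 0;
    *result = a + b;
    return 1;
}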
 
Eric Sosman

Kelsey Bjarnason wrote On 08/21/07 17:28:
[snips]

Any standard function may *in addition* be implemented as a macro, as
long as the macro is well-behaved. Such a macro must evaluate each of
its arguments exactly once


Exactly once, or _at most_ once?

Exactly once (7.1.4p1).
I ponder the case of, say, strchr, where it might do something akin to:

if ( *haystack == '\0' )
    break;

and as a result, never evaluate "needle" at all.

Is it required to evaluate needle anyways? If so, to what end?

Yes, it is required, to ensure that

strchr(lineptr[i++], delim[--j])

work the same way with strchr-the-macro as with strchr-
the-function.
 
Flash Gordon

CBFalconer wrote, On 21/08/07 22:21:
Because getc can be implemented as a macro, provided its action
doesn't have to evaluate the argument more than once.

You seem to have the above backwards. getc is explicitly allowed to
evaluate its parameter more than once when implemented as a macro,
unlike fgetc.
This can
avoid a good deal of procedure calling, and also avoid having to
assign an input buffer. There are special provisions for getc and
putc in the standard to allow this.

Indeed getc and putc are a special case and it is to save on overheads.
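For illustration only, something like the classic scheme (the structure
members and the __fillbuf name are invented here, not taken from any
particular implementation):

/* sketch of a typical getc macro: the stream argument appears more
 * than once, which is exactly why getc(f++) is unsafe while
 * fgetc(f++) is fine */
#define getc(stream)                                    \
    (--(stream)->_cnt >= 0                              \
        ? (int)(unsigned char)*(stream)->_ptr++         \
        : __fillbuf(stream))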
 
Peter J. Holzer

The parameter in question is used to specify the size of a passed-in
buffer. Since a buffer can never have a size less than zero, there
*cannot* ever be a legitimate value of the length parameter less than
zero.

By allowing a size value of less than zero, the implication is obvious: it
is not, in fact, a length value, but a length+magic value;

This is not at all obvious. I would never get that idea.
Exactly. And there is a good reason for using something else here: the
variable specifies the size of a passed-in buffer, which can *never* be
negative, so allowing for negative values implies something "magic" must
happen in such cases.

It doesn't imply that to me. Not all representable input values are
useful.
The question before him is what magic occurs when the function is
passed a negative value - which the function, as prototyped, is
explicitly designed to accept?

What magic happens when you pass a negative value (except EOF)
or a positive value > UCHAR_MAX to one of the ctype isxxx functions?
What magic happens if you pass a value outside of [-1.0, 1.0] to an
arcsine function? You'll have to check the documentation.

If there is no magic, if negative values are invalid - as one would
*expect* for buffer sizes - then there's no justification for using a
signed type,

I don't need a justification for using a signed type. I need a
justification for using an unsigned type. And "it cannot be negative"
isn't enough justification for me.
particularly when it means it can no longer cope with
perfectly legitimate ranges of valid values which it should be able to
cope with.

Yes, but that would be true just the same if he had used "unsigned int".

Again: There are good reasons to use size_t for buffer sizes (mostly
that it is large enough to hold all possible buffer sizes, but also that
using a standard type serves as documentation), so I would use size_t.
But there is only a weak reason to use unsigned int (it cannot be
negative) which doesn't outweigh the disadvantages of unsigned types,
for example:

* you cannot naturally count down to zero:

for (i = n; i >= 0; i--)

doesn't terminate if i is unsigned.

* comparison between signed and unsigned values isn't value-preserving:

#include <stdio.h>

int main(void) {
    int i;
    unsigned u;

    /*
     * some
     * stuff
     * here
     */

    i = -3;
    u = 5;
    if (i < u) {
        printf("ok\n");
    } else {
        printf("oops\n");
    }
    return 0;
}

prints "oops".



["C is stupid because you cannot pass chars to the isxxx functions"]
Yes, but that's just the point I made: nothing in those functions says -
as claimed - that you cannot pass in a char;

Actually, the prototype says you can only pass an int. The point about
"you cannot pass a char" is that if you come across code like

char c; .... isspace(c) ...

or

char *s; ... isspace(*s) ...

there is a very high probability that this is a bug. You simply cannot
assume that a char contains only values which are in the allowed range
for the isxxx functions - you have to check this explicitly. So you
either need to pepper your code with conditionals (which probably
severely limit the usability of your program) or with ugly casts to
unsigned char (which still may not be completely portable, as Harald
pointed out).
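A minimal sketch of that cast idiom (the wrapper name is made up):

#include <ctype.h>

/* sketch: convert to unsigned char first, so the value passed to
   isspace() is always in the range the ctype functions accept */
int is_space_char(char c)
{
    return isspace((unsigned char)c) != 0;
}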

what is required is that you pass in values in specific ranges. As
long as your chars _are_ in that range, pass 'em on in, everybody's
happy.

Except that on somebody else's systems the chars _are not_ in that range,
and your program will crash or - worse - produce some slightly wrong
output.

hp
 
Keith Thompson

Kelsey Bjarnason said:
[snips]
Any standard function may *in addition* be implemented as a macro, as
long as the macro is well-behaved. Such a macro must evaluate each of
its arguments exactly once

Exactly once, or _at most_ once?

Exactly once. C99 7.1.4:

Any invocation of a library function that is implemented as a
macro shall expand to code that evaluates each of its arguments
exactly once, fully protected by parentheses where necessary, so
it is generally safe to use arbitrary expressions as arguments.
I ponder the case of, say, strchr, where it might do something akin to:

if ( *haystack == '\0' )
    break;

and as a result, never evaluate "needle" at all.

Is it required to evaluate needle anyways? If so, to what end?

Yes, it must evaluate needle anyway, so that any side effects occur
exactly as they would in a function call. This function call:

(strchr)(foo++, bar++);

must increment foo and bar exactly once. This possible macro
invocation:

strchr(foo++, bar++);

must do the same thing.
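Just for illustration (not any real implementation), a conforming macro
can simply forward each argument exactly once to a hidden helper; the
helper name here is invented:

/* sketch: each argument is expanded exactly once, so strchr-the-macro
   has the same side effects as strchr-the-function */
char *__my_strchr(const char *s, int c);          /* hypothetical */
#define strchr(s, c) (__my_strchr((s), (c)))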
 
CBFalconer

Malcolm said:
.... snip ...

The function is reasonable drop-in for fgets() in a non-security,
non-safety critical environment. It means the programmer doesn't
have to worry about buffer size. A malicious user can crash things
by passing a massive line to the function, but we don't all have
to consider that possibility.

That is precisely the purpose of ggets(char **). It is simple and
can't crash. Take a look at:

<http://cbfalconer.home.att.net/download/>
 
CBFalconer

Malcolm said:
.... snip ...

You're right. Converting to size_t won't necessarily solve the
problem. You can fix it with fancy coding, inappropriate for what
the purpose is. That's why I specify the functions can fail for
extreme inputs. I specifically reiterate that the function is not
acceptable as a library function.

Very simple. Just pass malloc (and realloc) through extenders:

void *xmalloc(size_t sz) {
    if (!sz) sz++;       /* never request zero bytes */
    return malloc(sz);
}

and now the NULL test for failure is accurate.
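The realloc half could be sketched along the same lines (the name
xrealloc and this version are mine, not from the download page):

#include <stdlib.h>

/* sketch: same trick for realloc, so NULL can only mean failure */
void *xrealloc(void *ptr, size_t sz)
{
    if (!sz) sz++;       /* never request zero bytes */
    return realloc(ptr, sz);
}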
 
Peter J. Holzer

Peter said:
On many systems char has a range of -128 .. 127, and much of that range
is actually used by common text files (e.g. -96 .. 126 on ISO-8859
systems), so no, you cannot portably pass a char to one of the isxxx
functions unless you have checked that the value is non-negative. But
then what do you do with the negative values? The correct way is almost
always to cast the char to unsigned char. (unless you have multibyte
strings, but there are different functions for that)

Is the correct way to cast the char to unsigned char, or is it to
reinterpret the char as an unsigned char? In other words,

#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(void) {
    char line[100];
    if (fgets(line, sizeof line, stdin) && strchr(line, '\n')) {
#ifdef MAYBE
        char *p;
        for (p = line; *p; p++)
            *p = toupper((unsigned char) *p);
#else
        unsigned char *p;
        for (p = (unsigned char *) line; *p; p++)
            *p = toupper(*p);
#endif
        fputs(line, stdout);
    }
    return 0;
}

Should MAYBE be defined or undefined for a correct program?

Good question.
On most systems,
there will be no difference, and I sincerely hope there will be no
difference on any system (in other words, I hope that on any system where
signed char has fewer representable values than unsigned char, plain char is
unsigned), but I don't believe that's required, so I'm curious.

I hope so, too, but I can't see a requirement, either. I was trying to
construct one from the "interpreted as unsigned char" requirement for
strcmp, but anything I try invokes undefined behaviour.

In any case even if it was allowed, I would consider an implementation
which forced me to cast all strings to a pointer to an incompatible type
to correctly access the individual characters as unacceptably bad.

hp
 
pete

Harald said:
Harald said:
pete wrote:
Harald van Dijk wrote:
Is the correct way to cast the char to unsigned char, or is it to
reinterpret the char as an unsigned char? In other words,

#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(void) {
    char line[100];
    if (fgets(line, sizeof line, stdin) && strchr(line, '\n')) {
#ifdef MAYBE
        char *p;
        for (p = line; *p; p++)
            *p = toupper((unsigned char) *p);
#else
        unsigned char *p;
        for (p = (unsigned char *) line; *p; p++)
            *p = toupper(*p);
#endif
        fputs(line, stdout);
    }
    return 0;
}

Should MAYBE be defined or undefined
for a correct program? On most systems,
there will be no difference, and I sincerely hope there will be no
difference on any system
(in other words, I hope that on any system where
signed char has fewer representable values than unsigned char,
plain char is unsigned),
but I don't believe that's required, so I'm curious.

My feeling is that whether an implementation
uses signed magnitude or one's complement,
to represent negative integers,
shouldn't come into play with ctype functions.
I prefer a cast.

Both forms use a cast,
so I'm not completely sure which form you believe is correct.

"cast the char to unsigned char"

Thanks.
Does this also imply that if you want to have helper functions that
operate on arrays of unsigned char (that contain text),
you should not pass them a converted char *,
but you should use an array of unsigned char right
from the start?

As I said in my other post, it's complicated.

I just remembered what it is that I like about casting the values:
It's because putchar works that way.

If you define these negative int values:
#define NEG_5 ('5' - 1 - (unsigned char)-1)
#define NEG_A ('A' - 1 - (unsigned char)-1)

Then putchar(NEG_5) will equal '5'
and putchar(NEG_A) will equal 'A'
(or EOF, but that's not the point)

Also
isupper((unsigned char)NEG_5) is 0
islower((unsigned char)NEG_5) is 0
toupper((unsigned char)NEG_5) is '5'
tolower((unsigned char)NEG_5) is '5'

isupper((unsigned char)NEG_A) is 1
islower((unsigned char)NEG_A) is 0
toupper((unsigned char)NEG_A) is 'A'
tolower((unsigned char)NEG_A) is 'a'
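(A little program to check those values; it's only a sketch and assumes
an int wide enough to hold NEG_5 and NEG_A:)

#include <stdio.h>
#include <ctype.h>

#define NEG_5 ('5' - 1 - (unsigned char)-1)
#define NEG_A ('A' - 1 - (unsigned char)-1)

int main(void)
{
    /* putchar converts its argument to unsigned char, so NEG_5 comes
       out as '5' and the return value is '5', not NEG_5 */
    int r = putchar(NEG_5);
    printf("  putchar returned '%c'\n", r);

    printf("isupper: %d  islower: %d  toupper: '%c'  tolower: '%c'\n",
           isupper((unsigned char)NEG_A) != 0,
           islower((unsigned char)NEG_A) != 0,
           toupper((unsigned char)NEG_A),
           tolower((unsigned char)NEG_A));
    return 0;
}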
 
Kelsey Bjarnason

[snips]

This is not at all obvious. I would never get that idea.

No? Yet the functions as defined _cannot_ accept a legitimately sized
buffer (ints being too small) but *can* accept negative values.
Obviously, the requirement to take negatives was so compelling that it
overrode the ability to handle perfectly valid sizes; there _must_ be
"magic" involved, else there would be no such compelling reason.
It doesn't imply that to me. Not all representable input values are
useful.

No, they're not. On a system with 64-bit ints, one could reasonably
assume it is most unlikely they'd have buffers as large as 2^63-1. On a
16-bit system, by contrast, ints are not large enough to handle useful
sizes - yet he uses ints. Thus, again, the "magic" triggered by negative
values must be sufficiently compelling it outweighs the cost of losing
support for perfectly legitimate buffers. Again I ask, what _is_ that
magic?

What magic happens when you pass a negative value (except EOF) or a
positive value > UCHAR_MAX to one of the ctype isxxx functions?

Presumably none at all, but then on all but some relatively odd systems,
the issue never comes up, as these functions are not excluding legitimate
values in the process.

The correct comparison would be to ask why, if the character set contains
values from 0..255, would anyone design the is_xxx functions to accept
values from, say, -17 to 192? Doing so excludes perfectly legitimate
values, so there must be a compelling reason to do so... and it is done to
allow a greater range of negative values, meaning there must be a
compelling reason to allow those, to the exclusion of legitimate values.

The is_xxx functions do not work this way. They allow the full range of
legitimate values (with some hoofooraw when it comes to 32-bit chars,
32-bit ints, and where EOF fits in) but they do not go out of their way to
exclude legitimate options for the sake of undescribed magic.
What
magic happens if you pass a value outside of [-1.0, 1.0] to an arcsine
function? You'll have to check the documentation.

Again, does the arcsin function clip its input range, to reject legitimate
sensible values for no particular reason? Let's see... nope, it's defined
in terms of values _between_ -1 and 1, so it would not be excluding
legitimate values if it allows passing in (with bogus results or
otherwise) of a larger range.

You fail to compare cases. Your exemplars involve functions which accept
the entire legitimate range of possible values, then you ask what happens
beyond those ranges?

That is not the case with MM's code. MM's code *excludes* perfectly
legitimate values. Why? Oh, right, to use ints instead of size_ts. Why
would one want to use ints over size_ts, when dealing with buffers and
sizes? One wouldn't - unless there were a really compelling reason to
accept negative values, to the exclusion of perfectly legitimately sized
buffers outside the range an int can store.

He does not present that compelling reason, though; it is "magic" and one
is, presumably, supposed to simply guess at what the magic is.

I don't need a justification for using a signed type.

You do if, in the process of using it, you limit the range of perfectly
valid inputs for no good reason.
I need a
justification for using an unsigned type. And "it cannot be negative"
isn't enough justification for me.

I see. So you're the sort who would write code as he does, with ints
instead of size_ts as size values.

Very good, and when a legitimate buffer needs to use your functions, but
can't, because you've used an int instead of a size_t and the size can
no longer be passed in?

Oh, right, your code's as broken as his as a result, and should be turfed
along with his.
Yes, but that would be true just the same if he had used "unsigned int".

Who the hell said unsigned int? Granted, it would be an improvement
over his abortion, but it's largely irrelevant - we're talking about
negative values, or, rather, the complete lack of justification for them
in a context where the proper type to use is a size_t.
Again: There are good reasons to use size_t for buffer sizes (mostly
that it is large enough to hold all possible buffer sizes, but also that
using a standard type serves as documentation), so I would use size_t.
But there is only a weak reason to use unsigned int

Then don't use it. Use the *proper* type - size_t.
(it cannot be
negative) which doesn't outweigh the disadvantages of unsigned types,
for example:

* you cannot naturally count down to zero:

for (i = n; i >= 0; i--)

Check MM's code. He's got a loop in there:

for ( i = 0; i < len; i++ )

but len, if passed in a legitimate value for the size of a buffer, just
one too large for an int, may well result in len being less than zero. So
signed doesn't buy anything here, either, and the objections to using the
correct types don't add up.
* comparison between signed and unsigned values isn't value-preserving:

And the effects of signed integer overflow are...? As in the code he
presents, using the wrong type?
Actually, the prototype says you can only pass an int. The point about
"you cannot pass a char" is that if you come across code like

char c; .... isspace(c) ...

or

char *s; ... isspace(*s) ...

there is a very high probability that this is a bug.

Quite possibly, unless you happen to be *damned* sure of the source of the
data - which is, in fact, not likely to be all that often.

Still, if you _are_ sure the data is in the legitimate range, you _can_
"pass a char to the isxxx functions", in direct contradiction of what he
said. What you cannot do is blithely assume that _all_ characters passed
will have values in the range the isxxx functions will accept, unless you
take steps to ensure they are, in fact, representable as an unsigned char
or EOF. But that's not what he said; he said you can't pass chars to the
functions; this is simply wrong. You can.
Except that on somebody else's systems the chars _are not_ in that
range, and your program will crash or - worse - produce some slightly
wrong output.

As I said, as long as the chars *are* in that range, pass 'em in,
everyone's happy.

What is it today? You're at least the third person who seems to prefer to
read something which has absolutely nothing to do with what was written,
and you've done so several times in one post.

You *can* pass chars to isxxx, as long as those chars are in range.

The question of types was not about *extending* the range, but of
*limiting* it.

The question of types was not about "unsigned int", but the inclusion of
negatives for no reason, to the exclusion of legitimate ranges... none of
your exemplars remotely relate.
 
Kelsey Bjarnason

[snips]

Exactly once (7.1.4p1).
Yes, it is required, to ensure that

strchr(lineptr[i++], delim[--j])

work the same way with strchr-the-macro as with strchr-
the-function.

Er... umm... <smacks forehead> of course. 'Scuse me while I brain fart.
 
Kelsey Bjarnason

[snips]

Yes, it must evaluate needle anyway, so that any side effects occur
exactly as they would in a function call. This function call:

(strchr)(foo++, bar++);

Zackly.

It's one of those cases of "Yeah, I know it works just like the function...
and that the code will work either way... but I missed the implication of
it."
 
pete

Philip said:
fgets() is required to be implemented as a function AFAIK.

Both of them are required to be implemented as functions.

getc is typically also implemented as a macro.
fgetc is typically not also implemented as a macro.

Ah that's where my confusion comes from. I had assumed that getc would
be implemented in terms of fgetc rather than the other way round.
But of
course your version makes much more sense.

I suspect that fgetc and fputc exist, at least in part,
to simplify the standard's description of input and output.

N869
7.19.3 Files
[#11]
The byte input functions
read characters from the stream as if by successive calls to
the fgetc function.

However, programmers are likely to think of getc,
as the real building block, as the description of getchar suggests.

Description
[#2] The getchar function is equivalent to getc with the
argument stdin.
 
CBFalconer

Flash said:
CBFalconer wrote, On 21/08/07 22:21:

You seem to have the above backwards. getc is explicitly allowed to
evaluate its parameter more than once when implemented as a macro,
unlike fgetc.

Yup. Wrote too fast. Thanks for the correction.
 
Keith Thompson

CBFalconer said:
That is precisely the purpose of ggets(char **). It is simple and
can't crash. Take a look at:

<http://cbfalconer.home.att.net/download/>

It can't crash if malloc and realloc behave properly. It initially
mallocs 112 bytes, then reallocs more space 128 bytes at a time for
long lines.

But, as we've discussed here before, malloc doesn't behave properly on
all systems. On some systems, malloc can return a non-null result
even if the memory isn't actually available. The memory isn't
actually allocated until you try to write to it. Of course, by then
it's too late to indicate the failure via the result of malloc, so the
system kills your process -- or, perhaps worse, some other process.

I'm sure you dislike the idea of catering to such systems as much as I
do, but you might consider implementing a way to (optionally) limit
the maximum line length, to avoid attempting to allocate a gigabyte of
memory if somebody feeds your program a file with a gigabyte-long line
of text.
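Something along these lines, just as a sketch of the kind of cap I mean
(this is not ggets; the name and the numbers are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* sketch: read one line into a growing buffer, but give up and return
 * NULL once the line exceeds maxlen bytes, so a gigabyte-long line
 * cannot drive the allocator into the ground */
char *get_line_capped(FILE *fp, size_t maxlen)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    /* the int cast assumes cap - len stays within INT_MAX */
    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';
            return buf;                  /* complete line read */
        }
        if (len >= maxlen) {             /* line too long: give up */
            free(buf);
            return NULL;
        }
        if (cap - len < 2) {             /* buffer full: grow it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (len > 0)
        return buf;                      /* final line had no '\n' */
    free(buf);
    return NULL;                         /* EOF or error, nothing read */
}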
 
