Bounds checking and safety in C


Flash Gordon

Richard Heathfield wrote, On 30/07/07 14:32:
Flash Gordon said:



Well, presumably that feature would only be used during debugging.

In that case I would like it to cause the ICE to break the program so it
can be debugged :)
You won't have to - the M25 doesn't /have/ a fast lane.

You forget that a program invoking undefined behaviour can cause
anything* to happen! Anyway, I've been known to be driving on the M25
after midnight when it can actually be fast.
 

Kelsey Bjarnason

[snips]

So theory then, and almost impossible to achieve in real life, since it's
impossible, IMO, to predetermine that your program will react
properly under all variations of potential user input.

Actually, no, it's not all that difficult to achieve. It does require
some skill and some patience, but it's not all that difficult, actually.

Perhaps the most obvious example of this is gets vs fgets. If you use
gets, forget it, your code is virtually guaranteed, at some point, to
invoke UB. Using fgets, on the other hand, allows you to prevent buffer
overflows on input and thus avoid UB as a result of the overflow.
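A minimal sketch of the gets-vs-fgets point above; read_line is an illustrative helper name, not a standard function:

```c
#include <stdio.h>
#include <string.h>

/* Unlike gets, fgets never writes more than n-1 characters plus a
   terminator into buf, so an over-long line is truncated instead of
   overflowing the buffer.  Returns 0 on success, -1 on EOF/error. */
static int read_line(char *buf, size_t n, FILE *fp)
{
    if (fgets(buf, (int)n, fp) == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}
```

The caller passes the buffer size explicitly, so input that does not fit is clipped rather than invoking UB.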
 

JT

The most obvious example is the development of
length delimited strings [strlen] [strcat]
I have been promoting this change [..] for several years.

First of all, it was thought of long ago [by OTHER people]
Second of all, its overall efficiency is still debated [*]
Third of all, companies such as Microsoft already REQUIRE it
internally
Thanks for the advice but what makes you think I haven't done it?

Because you appear painfully naive in thinking
that you have the answer, and others are just too
stubborn to see it.

* I'll leave the literature search and other
examples for others to cite. One quick example is
the tokenization and manipulation of an input text block.
With NUL-terminated strings, you can tokenize
the input buffer in-place (by replacing whitespace
with \0), but with length-denoted strings you need
to always malloc a new space since there's no room
for the length field.
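The in-place tokenization described here can be sketched as follows (tokenize is an illustrative helper, not part of any standard library):

```c
#include <ctype.h>
#include <stddef.h>

/* Split buf into whitespace-separated tokens in place, overwriting
   separators with '\0'.  Stores up to max pointers into tok[] and
   returns the number of tokens found.  No allocation is needed:
   each token is terminated where the whitespace used to be. */
static size_t tokenize(char *buf, char **tok, size_t max)
{
    size_t n = 0;
    while (n < max) {
        while (*buf && isspace((unsigned char)*buf))
            *buf++ = '\0';                /* separator becomes terminator */
        if (*buf == '\0')
            break;
        tok[n++] = buf;
        while (*buf && !isspace((unsigned char)*buf))
            buf++;                        /* skip over the token body */
    }
    return n;
}
```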

** For a few years now, Microsoft has internally required that
EVERY METHOD in their internal code (eg. Office, Windows...)
always pass a length argument for every variable-length
argument. So an internal method that accepts
a string arg must also have an arg that contains its size.

You did NOT come up with the idea.

And researchers at Microsoft have already tried
and evaluated the practice of always storing/passing
the string length AT ALL TIMES.

I leave it up to you to look at their whitepaper
(you know, real research) and see what their evaluation
of the performance penalty, programmer productivity penalty,
and percent of bug reduction is.
 

¬a\\/b

Yes, but unlike you the authors of the paper actually did some research.

I say that a program that uses bounds checking would be nearly 0%
slower than one that does not use it.
 

Kelsey Bjarnason

[snips]

The second is the "spirit of C". C is for macho programmers
that do not need bounds checking because they never make mistakes.

Er... no. C programmers can make mistakes, but that's not relevant here;
all that's relevant here is whether an accomplished C programmer is liable
to make significant (UB-inducing) errors as pertain to buffer/array
management - the stuff which boundary checking helps with.

Check the code produced by most of the seasoned regulars around here; see
how much of it is prone to such things. Then stop and consider, of that,
how much makes it past review and into "gold" status.
 

jacob navia

JT said:
The most obvious example is the development of
length delimited strings [strlen] [strcat]
I have been promoting this change [..] for several years.

First of all, it was thought of long ago [by OTHER people]

Obvious, did you see any copyright in my message?
Second of all, its overall efficiency is still debated [*]

I presented you with obvious examples.
Third of all, companies such as Microsoft already REQUIRE it
internally

Yes, this is maybe a hint that it is not so BAD as it seems.
Because you appear painfully naive in thinking
that you have the answer, and others are just too
stubborn to see it.

I am proposing a change in the handling of strings, using
Strcmp, Strlen, and other functions.
I have developed an implementation that I distribute (with the
source code) with my compiler system.

Maybe I am "naive" as you say, but I believe that discussing
this proposal and related items is worthwhile.

* I'll leave the literature search and other
examples for others to cite. One quick example is
the tokenization and manipulation of an input text block.
With NUL-terminated strings, you can tokenize
the input buffer in-place (by replacing whitespace
with \0), but with length-denoted strings you need
to always malloc a new space since there's no room
for the length field.

Yes, in that example zero terminated strings take less space
and do not require an allocation.
** For a few years now, Microsoft has internally required that
EVERY METHOD in their internal code (eg. Office, Windows...)
always pass a length argument for every variable-length
argument. So an internal method that accepts
a string arg must also have an arg that contains its size.

That is the problem. The programmer should NOT pass
the length. Strings should carry the length field
so the programmers should just pass a string and
do not have to figure out what the length is!

You did NOT come up with the idea.

I feel that the only objective of your posts is to make me feel
bad, as if I was caught in the act of stealing Strings at the
grocery store.
And researchers at Microsoft have already tried
and evaluated the practice of always storing/passing
the string length AT ALL TIMES.

Ditto. It is a BAD idea to pass that length.
I leave it up to you to look at their whitepaper
(you know, real research) and see what their evaluation
of the performance penalty, programmer productivity penalty,
and percent of bug reduction is.

Maybe I will have time to look at their research. Since it will take
long to search for, I would suggest you stop these games and
give your reference.

I gave the reference of the article I cited.
 

Kelsey Bjarnason

So "if conforming program violates" then bounds checking is legal.

Correct, as if it violates object bounds, it invokes undefined behaviour
and thus *any* result is allowable.
Since "its not possible for conforming program to violate", checking is
legal.

Correct, because a strictly conforming program *cannot* tell the
difference between a bounds-checking implementation and one that does not
do bounds-checking.
I do have my thick head on. I don't understand anything of the past few
points about "conforming" programs. It sounds like a load of mumbo
jumbo.

It is not terribly difficult, once you understand the import of undefined
behaviour.
 

Flash Gordon

jacob navia wrote, On 30/07/07 15:42:
Maybe, maybe not, it depends. In any case nobody is advocating making
it mandatory.

The real problem behind this is the difficulty with the standard library:
if you store the meta data with the object, the object layout
changes slightly, and all library routines not compiled with bounds
checking will not work.

So do what MS do and build different versions of the libraries for
different purposes.
That is why a standard procedure would be much better.

The "standard" procedure would be the same as the procedure for
selecting any other compiler mode and version of the C library.
It would be possible to build compatible libraries.

So why complain when you already know the solution? Since the C standard
does not specify how to do linking why should it specify how to enable
bounds checking?

<OT>
MS already have a system for specifying calling conventions, they also
have (or had last I looked) different versions of the C library for use
with different build options.
</OT>
 

Kelsey Bjarnason

[snips]

Yes, there are a lot of tools, and their existence PROVES the gaping
hole in the language.

Thus Intercal's "come from" proves the need for such a construct? I think
not.

What it proves, if anything, is that diagnostic tools are useful
particularly in the early development phase.
 

santosh

JT said:
The most obvious example is the development of
length delimited strings [strlen] [strcat]
I have been promoting this change [..] for several years.

First of all, it was thought of long ago [by OTHER people]
Second of all, its overall efficiency is still debated [*]
Third of all, companies such as Microsoft already REQUIRE it
internally

[ ... ]
** For a few years now, Microsoft has internally required that
EVERY METHOD in their internal code (eg. Office, Windows...)
always pass a length argument for every variable-length
argument. So an internal method that accepts
a string arg must also have an arg that contains its size.

You did NOT come up with the idea.

Neither did Microsoft. The concept of length delimited strings is many
decades old.
 

santosh

Kelsey said:
[snips]

So theory then, and almost impossible to achieve in real life, since it's
impossible, IMO, to predetermine that your program will react
properly under all variations of potential user input.

Actually, no, it's not all that difficult to achieve. It does require
some skill and some patience, but it's not all that difficult, actually.

Perhaps the most obvious example of this is gets vs fgets. If you use
gets, forget it, your code is virtually guaranteed, at some point, to
invoke UB. Using fgets, on the other hand, allows you to prevent buffer
overflows on input and thus avoid UB as a result of the overflow.

I think he was talking about whether a given program, that took user input,
could be determined as strictly conforming or not. Not just the I/O
functions, but the entire program.
 

Kelsey Bjarnason

[snips]

It's nowhere near that bad. Yes there is a performance penalty, but
this can be mitigated by only applying the full set of checks to
selected parts of the application.

Ah, so the parts where you used strcpy safely you can skip, but the parts
where you didn't use it safely, you should bounds-check?

Seems silly to me; if you suspect a particular piece of code of needing
such hand-holding, fix the code. If you simply don't know, then why do
you assume some parts are safe and others not?

As to speed...

void mycpy( char *d, char *s )
{
    while( *s )
        *d++ = *s++;

    /* ... */
}

How's your bounds-checking implementation going to handle that? Is it
going to try to figure out where d was defined (or allocated), track the
space involved... okay, assume it does, then what? Calculate length of s,
add to start of d, compare to size of buffer and go "Okay, carry on"?
Or is it going to trap the pointer increments and compare them to the
buffer?

One involves a single set of calculations and almost zero impact; the
other involves potentially tens of thousands of operations and may well
have serious impact. If I - as the coder - am particularly worried about
this, I can pretty trivially code in the passing of the buffer sizes and a
singular up-front check which tells me whether I have enough room or not.

So how does your implementation work to avoid the test-every-pointer-op
mode and the subsequent performance hit?
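The "singular up-front check" described above might look like this (mycpy_checked is an illustrative name, not a proposed API):

```c
#include <stddef.h>
#include <string.h>

/* One length computation and one comparison before the copy, rather
   than a check on every pointer increment.  Returns -1 if s (plus its
   terminator) will not fit in the dsize bytes at d. */
static int mycpy_checked(char *d, size_t dsize, const char *s)
{
    size_t n = strlen(s);
    if (n >= dsize)
        return -1;              /* single bounds test, done once */
    memcpy(d, s, n + 1);        /* copy including the '\0' */
    return 0;
}
```

The cost is one strlen and one comparison per call, regardless of string length, instead of a per-increment trap.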
 

jacob navia

Kelsey said:
[snips]

It's nowhere near that bad. Yes there is a performance penalty, but
this can be mitigated by only applying the full set of checks to
selected parts of the application.

Ah, so the parts where you used strcpy safely you can skip, but the parts
where you didn't use it safely, you should bounds-check?

Seems silly to me; if you suspect a particular piece of code of needing
such hand-holding, fix the code. If you simply don't know, then why do
you assume some parts are safe and others not?

As to speed...

void mycpy( char *d, char *s )
{
    while( *s )
        *d++ = *s++;

    /* ... */
}

How's your bounds-checking implementation going to handle that? Is it
going to try to figure out where d was defined (or allocated), track the
space involved... okay, assume it does, then what? Calculate length of s,
add to start of d, compare to size of buffer and go "Okay, carry on"?
Or is it going to trap the pointer increments and compare them to the
buffer?

One involves a single set of calculations and almost zero impact; the
other involves potentially tens of thousands of operations and may well
have serious impact. If I - as the coder - am particularly worried about
this, I can pretty trivially code in the passing of the buffer sizes and a
singular up-front check which tells me whether I have enough room or not.

So how does your implementation work to avoid the test-every-pointer-op
mode and the subsequent performance hit?

Please read the article I mentioned in my first message.
There it is explained how to do it precisely.
 

William Hughes

That is the justification I hear most often.

So you say. I do not know what you hear most often.
However, I have seen enough of

Whoever: Changing the language to allow easier bounds
checking is not practical as there is too
much legacy code, so you either break legacy
code or end up with duplication

JN: What you really mean is that bounds checking
is too expensive in terms of run time
and "real" C programmers do not need bounds
checking because they never make mistakes

that I don't give much weight to your claims
of what you hear most often as I have no
confidence that what you hear is what other people say.

One of your main points, that we should
add counted strings to C, seems to sum this up
perfectly. Complete counted string implementations
exist, but changing C to mandate a specific
implementation does not seem useful. Even if
you continued to support the present string
library and provided helper functions for conversion,
the old and new strings would not coexist well.
And if you did not support the existing string
library, existing code could not be maintained.

But strings are not really the problem.
I do not use C for string manipulation,
but I certainly have bounds problems.
I use a common approach, use bounds
checking tools for development and remove
them for production. For this a two times
speed penalty is not a problem (and even
an order of magnitude can be accommodated).
I have zero interest in making my life harder
so that the tools could be faster
or the people who write the tools would
have an easier time.

- William Hughes
 

jacob navia

William Hughes wrote:
[snip]
I use a common approach, use bounds
checking tools for development and remove
them for production.

What a BRIGHT idea.

So, in production, when your code needs the most
security you have none.

You say:

A factor of ten slowdown is no problem, since I use it only
when debugging.

The objective is to use it at runtime since the speed penalty is
not great.

Another point is that you do not explain why the counted strings
would not work well.

In my implementation:

char *str = (char *)String;

and that is it.

It is designed to have almost 100% compatibility with the old zero
terminated strings.
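One hypothetical layout under which a plain cast like `char *str = (char *)String;` works — not necessarily how jacob's actual implementation does it — stores the length in a header just before the characters and keeps the zero terminator as well:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative counted string: a String value points directly at the
   characters, the length lives in a header immediately before them,
   and the characters stay zero-terminated.  Casting to char* therefore
   yields a valid classic C string with no conversion work. */
typedef char *String;

static String String_new(const char *src)
{
    size_t n = strlen(src);
    size_t *hdr = malloc(sizeof *hdr + n + 1);
    if (hdr == NULL)
        return NULL;
    *hdr = n;                                   /* length header */
    char *chars = (char *)(hdr + 1);
    memcpy(chars, src, n + 1);                  /* keep the '\0' too */
    return chars;
}

/* O(1) length: read the header instead of scanning for '\0'. */
static size_t String_len(String s)
{
    return ((size_t *)s)[-1];
}

static void String_free(String s)
{
    free((size_t *)s - 1);
}
```

The same prefix-header trick is used by several real counted-string libraries precisely because it preserves char* compatibility.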
 

Keith Thompson

No, it isn't. If it _were_ possible for a strictly conforming program to
violate object bounds, a bounds checking implementation would be legal.

You mean "would be illegal", of course.
Since it is _not_ possible, a bounds checking implementation is legal
and, get this, on occasion very _practical_ to discover where your
program is not strictly conforming in a bounds-violating way.

As I've mentioned, an implementation that correctly handles all
strictly conforming programs is not necessarily a conforming
implementation. The standard requires more of implementations than
that. An implementation that rejects or mishandles this:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("INT_MAX = %d\n", INT_MAX);
    return 0;
}

is not conforming.
 

William Hughes

Another point is that you do not explain why the counted strings
would not work well.

In my implementation:

char *str = (char *)String;

and that is it.

It is designed to have almost 100% compatibility with the old zero
terminated strings.

No, you have a simple way of converting from the new
strings to the old strings. In particular, you can use
the new string as an argument to a function call
that expects an old string.
This is nice, but it is a long way from 100% compatibility.

For example, strings and Strings do not have the same
size. Consider a set of strings
in an array. With the old zero terminated strings
I can find the next string by next_string = string_var +
strlen(string_var) + 1
This will not work with the new strings. Yes the fix is
simple but a fix is needed.
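The packed-strings idiom William describes can be sketched as follows (count_packed is an illustrative helper):

```c
#include <string.h>
#include <stddef.h>

/* Several zero-terminated strings packed back to back in one buffer,
   with an empty string marking the end of the list.  The next string
   starts one past the current terminator, which is exactly the
   next_string = string_var + strlen(string_var) + 1 idiom. */
static size_t count_packed(const char *p)
{
    size_t n = 0;
    while (*p != '\0') {
        n++;
        p += strlen(p) + 1;     /* step over the string and its '\0' */
    }
    return n;
}
```

With a counted String whose size differs from its character count, this pointer arithmetic would need adjusting, which is William's point.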

Another example. Suppose I have a String_var of length 100.
I pass it to my error output routine; during
the function call String_var is converted to string_var.
The first thing that happens is I check the length
of the string and if necessary I truncate to length
30 by string_var[30]='\0'. String_var now contains an embedded null,
so strlen((char*)String_var) and new_Strlen(String_var) are different.

Will these be a problem in real programs? I don't know
(though I have seen both lists of strings and truncation).
Are there other similar problems? I don't know.


In maintaining a program
you would have to stick with the old strings unless you were
*SURE* that converting will not cause a problem or that
you had made all the fixes necessary.
In practice you will end up with two parallel string
implementations.

- William Hughes
 

Keith Thompson

jacob navia said:
I never said that I wanted to make zero terminated strings illegal.
Good.

I just propose that OTHER types of strings could be as well supported by
the language, nothing else.

Other types of strings can already be well supported by the language.
The only current limitation is that some operations require a bit more
syntax than you might like. For example, if you have a type String
that's really a structure, you can't use a cast to 'char*' to convert
a String to a classic C string -- but if one of the members is a char*
pointing to a C string, you can just use something like "obj.str".
You can't use string literals for String values, but you can use a
function call.

For example:

...
String s = Str("hello");
s = append(s, Str(", world"));
printf("%s\n", s.str);
...

If you want the convenience of using string literals and so forth, I
think that only a few minor changes to the language would be required.
I haven't thought this through, but I suspect that most or all of
these changes could be implemented as conforming extensions (i.e.,
extensions that don't alter the behavior of any strictly conforming
code; see C99 4p6). Any programs that depend on such extensions would
of course be restricted to implementations that support them, but it
could be the first step in establishing existing practice and possibly
getting the extensions adopted in a future C standard.

Incidentally, depending on how this hypothetical String type is
implemented, aliasing could be an issue. For example;

String s1, s2;
s1 = Str("hello");
s2 = s1;
s2.str[0] = 'j';

s2 is now equal to Str("jello"). Is s1 equal to Str("jello"), or to
Str("hello")? In other words, does assignment of Strings copy the
entire string value, or does it just create a new reference to the
same string value?

Certainly classic C strings have the same issue, but there's bound to
be considerable work to be done in deciding how a new (standard?)
String type will deal with it.
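The aliasing question above can be sketched with a struct-based String (names illustrative, not from any real library): plain assignment copies the struct, but both copies share the same character buffer, so a write through one is visible through the other.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative counted-string struct. */
struct String {
    size_t len;
    char  *str;
};

/* Build a String from a C string (error handling kept minimal:
   on allocation failure, str is NULL). */
static struct String Str(const char *src)
{
    struct String s;
    s.len = strlen(src);
    s.str = malloc(s.len + 1);
    if (s.str != NULL)
        memcpy(s.str, src, s.len + 1);
    return s;
}
```

With this layout, `s2 = s1;` is a shallow copy; value semantics would require an explicit deep-copy function, which is exactly the design decision Keith is pointing at.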
 

Richard Tobin

I say that a program that uses bounds checking would be nearly 0%
slower than one that does not use it.

Yes, we know you say it, but we don't take any notice of you because you
just make it up. If you can do bounds checking in C at zero cost, then
do it and we'll all be impressed. But we won't be holding our breath.

I believe the Americans have a phrase "put up or shut up" which seems
to fit.

-- Richard
 

Keith Thompson

William Hughes said:
Indeed, but this just means that an executable must
be produced.

And often that it must execute correctly. For example:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("INT_MAX = %d\n", INT_MAX);
    return 0;
}

This example isn't relevant to bounds checking, but it is an example
of a non-strictly-conforming program that must compile and execute
correctly under any conforming implementation.

See C99 4p3:

A program that is correct in all other aspects, operating on
correct data, containing unspecified behavior shall be a correct
program and act in accordance with 5.1.2.3.
Even if there is a conforming program that can be proven during
compilation to have bounds violations, all this means is that the
compiler
may have to produce an executable (A reasonable behaviour
would be to output a warning that there is a bounds violation
and let the dynamic bounds checking stuff deal with it).


I am not convinced that an example of a conforming program that
could be shown at compile time to violate bounds exists.

That wasn't my claim. A program that can be shown at compile time to
violate bounds invokes undefined behavior, so it's neither strictly
conforming (C99 4p5) nor "correct" (C99 4p3). An implementation can do
anything it likes with such a program.

My argument is that a bounds-checking implementation that doesn't
affect any strictly conforming program (except perhaps for
performance), but that does break some "correct" programs (i.e.,
programs that do not invoke UB but that depend on unspecified
behavior) is not a conforming implementation. In other words, it's
not the effect on strictly conforming programs we have to worry about;
it's the effect on the much larger set of "correct" programs.
 
