Changing the meaning of the array indexing brackets "[ ]" to do bounds checking and then ...

  • Thread starter Casey Hawthorne
  • Start date
C

Casey Hawthorne

Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.
 
R

robertwessel2

Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.


There's nothing to prevent someone from doing that now, nor is it a
new idea. It might break some programs which were relying on
undefined behavior, but I don't think too many people would find that
too objectionable (and assuming there was a way to turn off the
function, presumably with a compiler switch).

There are two problems. The minor one is overhead. Even if you know
the object being accessed, bounds checking is not free, and that will
cause issues in many places where C is a good choice of languages. To
some extent clever compilers can minimize the overhead, but they
cannot eliminate it altogether.

The major problem is that at the point where the array access happens,
whether that be the brackets or pointer arithmetic with deferencing,
the meta information about the "array," or whatever other object is
being accessed, is not really available to the C compiler. So you
don't really know the bounds of the array (or whatever) you're
accessing at that point. The only real solution is a fat pointer of
some sort that contains information about the object it's pointing to,
but that has very considerable overhead in both size and space. And
again, while compilers can mitigate that somewhat, there are limits,
especially considering the space overhead.
 
J

jacob navia

Casey Hawthorne a écrit :
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.

The lcc-win C compiler implements this. It is called "operator overloading".

You can define your own [ ] operators, and the compiler will call your function
that can do bounds checking in the array.

You define a structure with size information, and a pointer to the data stored
(and maybe other data). For this structure you redefine the operator [ ] and
you check the array bounds.

When you do not want to check anymore you do NOT define any operator overloading
and you replace the array structure by a normal array. The code that uses the
array doesn't need to be modified.

I have been arguing for this extension to C since 2002.

jacob
 
K

Keith Thompson

jacob navia said:
Casey Hawthorne a écrit :
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.

The lcc-win C compiler implements this. It is called "operator
overloading".

You can define your own [ ] operators, and the compiler will call
your function that can do bounds checking in the array.
[...]

Does lcc-win allow a [] operator to be defined for existing types that
already have a built-in [] operator? For example, can I do something
like this (I'm not sure of the syntax)?

int operator[](int *arr, size_t index)
{
return /* something */;
}

I suspect the answer is no.

I think the OP is suggesting changing the semantics of the existing
built-in [] operator, so that existing code gains bounds checking
with no modifications to the source.

Of course implementations are already allowed to do this, even
without operator overloading. All cases in which a bounds check
would fail invoke undefined behavior anyway, so the implementation
is free to generate code that detects the error.

The overhead would be substantial, since you'd need fat pointers to
retain bounds information for function parameters. And it would
break some existing code whose behavior is, strictly speaking,
undefined, but that works in practice. The most obvious examples
are the "struct hack" (which could be modified to use the C99 feature
that replaces it, but the point is to avoid changing the source), and
code that treats a multidimensional array as a one-dimensional array.

Code that avoids these odd corners of the language could certainly
benefit from a bounds checking implementation. And in fact I believe
such implementations exist, though I don't know if they go so far as
to implement fat pointers to enable checking on pointer parameters.
(At the very least, *testing* code under such an implementation could
be instructive.)
 
B

bartc

Eric Sosman said:
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.
Structs and unions are splendid sources of complication here.
For example,

union u { int i; char c[10]; };
union u *p = malloc(sizeof *p); /* assume success */

Imagine a not uncommon platform where sizeof(int) == 4 and
int requires 4-byte alignment; we'll probably wind up with a 12-
byte union, so the final line would be equivalent to

union u *p = malloc(12); ....
But now it gets even more entertaining:

int *ip = &p->i;
char *cp = p->c;

What bounds are carried with the fat pointers ip and cp? For
....
This must be the case even if the conversion is indirect, via
umpty-seven layers of void* and separately compiled modules
where union u has never been heard of:

strcpy (p->c, "x");
char *q = strchr(p->c, 'x');
union u *p2 = (union u*)q;

The pointer p2 must cover all twelve bytes, even though it is
derived from a pointer that we'd prefer to have cover only ten.

Seems to me that doing a thorough job might require not just
a pointer-plus-bounds, but a pointer-with-history -- not just a
fat pointer, but an obese pointer.

Your points all seem valid, but they remind me of when my boss wanted some
special feature or wanted something to work as he thought it should. I would
then proceed to explain a dozen ways why it didn't make sense and was
completely impossible.

But a day or so later I got the thing (hardware, software or both) working
just like he wanted! It wasn't quite so impossible after all.

The OP is only talking about arrays, and not pointers. But these are
inextricably linked so I guess we need 'proper' arrays, ie. distinct from
the sort of arrays/pointers currently in C. Then there wouldn't be all this
baggage to make the implementation of array bounds 'impossible'. That's
still a major language extension though.
 
S

Stephen Sprunk

Keith said:
The overhead would be substantial, since you'd need fat pointers to
retain bounds information for function parameters. And it would
break some existing code whose behavior is, strictly speaking,
undefined, but that works in practice. The most obvious examples
are the "struct hack" (which could be modified to use the C99 feature
that replaces it, but the point is to avoid changing the source), and
code that treats a multidimensional array as a one-dimensional array.

Why would it break the struct hack? The fat pointer would have the
bounds of the actual allocation, not the struct type, right?

Or are you proposing that converting a fat pointer from one type to
another changes the bounds? I can see some bugs that would catch, but
it would also break a lot of perfectly valid code, which IMHO is
unacceptable.
Code that avoids these odd corners of the language could certainly
benefit from a bounds checking implementation. And in fact I believe
such implementations exist, though I don't know if they go so far as
to implement fat pointers to enable checking on pointer parameters.

There was a GCC branch that did fat pointers and bounds checking, but it
didn't get very far since it changed the ABI, requiring a recompile of
every library on the system. Few people are willing to go through that.

Presumably they could have added another feature to disable bounds
checking on certain functions or arguments, but they apparently didn't
or that caused other problems...
(At the very least, *testing* code under such an implementation could
be instructive.)

It would be a good start, but presumably if you knew all the cases to
test, you could eliminate the bugs themselves and not need the bounds
checking in the first place. If you don't test every case, the bounds
violation will escape into the wild where there is no checking.

S
 
C

Casey Hawthorne

jacob navia said:
Casey Hawthorne a ?crit :
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.

The lcc-win C compiler implements this. It is called "operator
overloading".

You can define your own [ ] operators, and the compiler will call
your function that can do bounds checking in the array.
[...]

Does lcc-win allow a [] operator to be defined for existing types that
already have a built-in [] operator? For example, can I do something
like this (I'm not sure of the syntax)?

int operator[](int *arr, size_t index)
{
return /* something */;
}

I suspect the answer is no.

I think the OP is suggesting changing the semantics of the existing
built-in [] operator, so that existing code gains bounds checking
with no modifications to the source.

Of course implementations are already allowed to do this, even
without operator overloading. All cases in which a bounds check
would fail invoke undefined behavior anyway, so the implementation
is free to generate code that detects the error.

The overhead would be substantial, since you'd need fat pointers to
retain bounds information for function parameters. And it would
break some existing code whose behavior is, strictly speaking,
undefined, but that works in practice. The most obvious examples
are the "struct hack" (which could be modified to use the C99 feature
that replaces it, but the point is to avoid changing the source), and
code that treats a multidimensional array as a one-dimensional array.

Code that avoids these odd corners of the language could certainly
benefit from a bounds checking implementation. And in fact I believe
such implementations exist, though I don't know if they go so far as
to implement fat pointers to enable checking on pointer parameters.
(At the very least, *testing* code under such an implementation could
be instructive.)


I thought of this question, of buffer overruns, after one of the
people interviewed for the book "Coders at Work" said that C was great
for systems programming by well trained programmers, but that C had
leaked out into the applications area.
For systems programming you do need the access to the machine that C
provides, but for applications programming, you don't need/shouldn't
have such access.
 
N

Nick Keighley

On Nov 2, 12:21 pm, Casey Hawthorne <[email protected]>
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

note: C++'s vector class did this the other way round
Won't work in C, as C does not have operator overloading.

what? He's (She's?) talking about making a change to the language.
 
N

Nick Keighley

Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.
By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.

There's nothing to prevent someone from doing that now, nor is it a
new idea.  It might break some programs which were relying on
undefined behavior, but I don't think too many people would find that
too objectionable

the struct hack

typedef struct
{
Byte type;
Byte data[1];
} Message;

void process_msg (byte raw [64], size_t size)
{
Message *msg;
msg = (Message*)raw;

if (msg->data[1] == LOCAL)
....
}

the struct hack may be deprecated but it does occur quite often. (I
prefer it to C99;s variable array)

(and assuming there was a way to turn off the
function, presumably with a compiler switch).

There are two problems.  The minor one is overhead.  Even if you know
the object being accessed, bounds checking is not free, and that will
cause issues in many places where C is a good choice of languages.  To
some extent clever compilers can minimize the overhead, but they
cannot eliminate it altogether.

The major problem is that at the point where the array access happens,
whether that be the brackets or pointer arithmetic with deferencing,
the meta information about the "array," or whatever other object is
being accessed, is not really available to the C compiler.  So you
don't really know the bounds of the array (or whatever) you're
accessing at that point.  The only real solution is a fat pointer of
some sort that contains information about the object it's pointing to,
but that has very considerable overhead in both size and space.  And
again, while compilers can mitigate that somewhat, there are limits,
especially considering the space overhead.

Yeah, exactly, much of the time the compiler cannot easily determine
what the bounds are.

void refrobnicate (char*);

void snafu (size_t n)
{
char *fonz = malloc (n);

if (fonz == 0) handle_memory_error();

if (fonz[42] == 0) /* is there a bounds error here? */
{
char *yolonde = fonz + 42;
refrobnicate (yalonde);
}
}

void refrobnicate (char *yalonde)
{
if (yalonde [23] == 0) /* or here? */
more_stuff();
}

so you have to fat pointerise the code (pass some hidden info around
with the memory block).
 
N

Nick Keighley

Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.
    Structs and unions are splendid sources of complication here.
For example,
union u { int i; char c[10]; };
union u *p = malloc(sizeof *p);  /* assume success */
    Imagine a not uncommon platform where sizeof(int) == 4 and
int requires 4-byte alignment; we'll probably wind up with a 12-
byte union, so the final line would be equivalent to
union u *p = malloc(12); ...
    But now it gets even more entertaining:
int *ip = &p->i;
char *cp = p->c;
What bounds are carried with the fat pointers ip and cp?  For

...
 This must be the case even if the conversion is indirect, via
umpty-seven layers of void* and separately compiled modules
where union u has never been heard of:
strcpy (p->c, "x");
char *q = strchr(p->c, 'x');
union u *p2 = (union u*)q;
The pointer p2 must cover all twelve bytes, even though it is
derived from a pointer that we'd prefer to have cover only ten.
    Seems to me that doing a thorough job might require not just
a pointer-plus-bounds, but a pointer-with-history -- not just a
fat pointer, but an obese pointer.

Your points all seem valid, but they remind me of when my boss wanted some
special feature or wanted something to work as he thought it should. I would
then proceed to explain a dozen ways why it didn't make sense and was
completely impossible.

But a day or so later I got the thing (hardware, software or both) working
just like he wanted! It wasn't quite so impossible after all.

The OP is only talking about arrays, and not pointers.

can you really do that?

void process_ptr (char*);
void process_arr (char[]);

Do they take the same type? They do in C but do they in Hawthorne-C?


But these are
inextricably linked so I guess we need 'proper' arrays, ie. distinct from
the sort of arrays/pointers currently in C.

I don't think you have C anymore...

Then there wouldn't be all this
baggage to make the implementation of array bounds 'impossible'. That's
still a major language extension though.

looks a language redesign to me. Why not just abolish pointers and
call it Java?
 
N

Nick Keighley

jacob navia said:
Casey Hawthorne a ?crit :
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.
By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.
The lcc-win C compiler implements this. It is called "operator
overloading".
You can define your own [ ] operators, and the compiler will call
your function that can do bounds checking in the array. [...]

Does lcc-win allow a [] operator to be defined for existing types that
already have a built-in [] operator?  For example, can I do something
like this (I'm not sure of the syntax)?
   int operator[](int *arr, size_t index)
   {
       return /* something */;
   }
I suspect the answer is no.
I think the OP is suggesting changing the semantics of the existing
built-in [] operator, so that existing code gains bounds checking
with no modifications to the source.
Of course implementations are already allowed to do this, even
without operator overloading.  All cases in which a bounds check
would fail invoke undefined behavior anyway, so the implementation
is free to generate code that detects the error.
The overhead would be substantial, since you'd need fat pointers to
retain bounds information for function parameters.  And it would
break some existing code whose behavior is, strictly speaking,
undefined, but that works in practice.  The most obvious examples
are the "struct hack" (which could be modified to use the C99 feature
that replaces it, but the point is to avoid changing the source), and
code that treats a multidimensional array as a one-dimensional array.
Code that avoids these odd corners of the language could certainly
benefit from a bounds checking implementation.  And in fact I believe
such implementations exist, though I don't know if they go so far as
to implement fat pointers to enable checking on pointer parameters.
(At the very least, *testing* code under such an implementation could
be instructive.)

I thought of this question, of buffer overruns, after one of the
people interviewed for the book "Coders at Work" said that C was great
for systems programming by well trained programmers, but that C had
leaked out into the applications area.
For systems programming you do need the access to the machine that C
provides, but for applications programming, you don't need/shouldn't
have such access.

I'm nor sure the distinction is as clean as you'd like. Are
communication protocols system or application? What about embedded
programming? I suppose C-Hash's Safe and Unsafe modes (or whatever
they're called) might be a way to go.
 
B

bartc

Nick said:
The OP is only talking about arrays, and not pointers.

can you really do that?

void process_ptr (char*);
void process_arr (char[]);

Do they take the same type? They do in C but do they in Hawthorne-C?
But these are
inextricably linked so I guess we need 'proper' arrays, ie. distinct
from the sort of arrays/pointers currently in C.

I don't think you have C anymore...

I think it can be done by adding a type, 'checked array' or some such name.
But you can't mess around with these as freely as you can with regular
arrays.

A pointer to an element is possible, but then you aren't protected if you
decide to derefence that pointer with a too-large offset. But you can
probably create a slice instead, which is another checked array.

The implementation? Probably just another fat pointer (pointer+length), but
ring-fenced for extra protection.
looks a language redesign to me. Why not just abolish pointers and
call it Java?

Switching language is one solution.
 
N

Noob

Casey said:
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unless explicitly turned off.

You might find the following articles interesting.

http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging
http://gcc.fyxm.net/summit/2003/mudflap.pdf

On an unrelated note, it seems like the mudflap library itself recently
introduced a vulnerability.

http://xorl.wordpress.com/2009/11/01/gcc-mudflap-arbitrary-code-execution/
 
E

Eric Sosman

bartc said:
[... array bounds checking ...]

I think it can be done by adding a type, 'checked array' or some such
name. But you can't mess around with these as freely as you can with
regular arrays.

A pointer to an element is possible, but then you aren't protected if
you decide to derefence that pointer with a too-large offset. [...]

So, no protection for

checked_array char hello[5];
strcpy (hello, "Hello there, world, old buddy, old pal!");

? Seems to me this is exactly the use case the O.P. tries to
address! You might want to re-count the babies after throwing
out the bathwater ...

Or you could forbid using a checked_array with library
functions that take pointers, which would make checked_array
pretty useless. Or you could implement an entire parallel
suite of library functions to handle checked_arrays instead
of pointers. There'd be a certain amount of bloat: strcpy()
alone would need four versions, and I don't want to *think*
about scanf() ...
 
B

bartc

Eric Sosman said:
bartc said:
[... array bounds checking ...]

I think it can be done by adding a type, 'checked array' or some such
name. But you can't mess around with these as freely as you can with
regular arrays.

A pointer to an element is possible, but then you aren't protected if you
decide to derefence that pointer with a too-large offset. [...]

So, no protection for

checked_array char hello[5];
strcpy (hello, "Hello there, world, old buddy, old pal!");

The checked_array is just converted to a normal pointer in this case.
? Seems to me this is exactly the use case the O.P. tries to
address! You might want to re-count the babies after throwing
out the bathwater ...

The OP was talking about indexing arrays. strcpy() and friends are mainly
about pointers.
Or you could forbid using a checked_array with library
functions that take pointers, which would make checked_array
pretty useless.

A warning would suffice.

Or you could implement an entire parallel
suite of library functions to handle checked_arrays instead
of pointers. There'd be a certain amount of bloat: strcpy()
alone would need four versions, and I don't want to *think*
about scanf() ...

It's possible an implementation might have parallel versions of some
functions (ones that write to an array), and make it so that their use is
transparent.

Strcpy wouldn't need 4 versions, just two. And the one taking the
checked_array param can just be implemented in terms of a regular function.
Or maybe there can just be one version, the one with checked_array
(requiring widening of an ordinary pointer to a fat pointer with no upper
limit).
 
N

Noob

Casey said:
Changing the meaning of the array indexing brackets "[ ]" to do bounds
checking and then using another notation for array indexing without
bounds checking, say "[| |]", might help minimize buffer overruns.

By default, previous programs would have bounds checking turned on,
unsless explicilty turned off.

What is your opinion of Cyclone?

http://cyclone.thelanguage.org/wiki/Cyclone for C Programmers
http://en.wikipedia.org/wiki/Cyclone_(programming_language)

"""
Cyclone attempts to avoid some of the common pitfalls of the C
programming language, while still maintaining the look and performance
of C. To this end, Cyclone places the following restrictions upon programs:

* NULL checks are inserted to prevent segmentation faults
* Pointer arithmetic is restricted
* Pointers must be initialized before use (this is enforced by
definite assignment analysis)
* Dangling pointers are prevented through region analysis and
limitations on free()
* Only "safe" casts and unions are allowed
* goto into scopes is disallowed
* switch labels in different scopes are disallowed
* Pointer-returning functions must execute return
* setjmp and longjmp are not supported
"""

Regards.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top