Null terminated strings: bad or good?


CBFalconer

Malcolm said:
.... snip ...

Whether the decision was the best one or not is open to question.
For short strings having a sentinel value makes programming
easier, for long strings it can lead to performance problems.
Almost every man and his dog tries to write a better C string
library at some point or other.

And most strings used are short. However, my dog never attempted
to improve the C string library throughout his life. He died of
old age about 5 years ago.
 

JC

If you wish you are perfectly free to write another string
library.  Face the fact that it will not be compatible with normal
C programming, and that it will not be a component of the standard
library.

If it was written with a C interface it would be compatible. It would
not need to be a standard library component to be useful.

Then also consider that, for most string purposes, the existing
library is quite adequate, and efficient.  In fact, it is even
debugged.

Correct on all accounts, but this does not affect the usefulness of a
counted string library, which might be useful for at least some of the
purposes not included in "most string purposes".
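
To make the "C interface" point concrete, here is a minimal sketch of what a counted string type could look like. The names cstr, cstr_from, cstr_append and cstr_free are hypothetical and do not come from any existing library; the sketch assumes the buffer is kept 0-terminated so it can still be handed to ordinary C functions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type: the length is stored, not searched for. */
typedef struct {
    size_t len;   /* number of bytes in data, not counting the trailing 0 */
    char  *data;  /* kept 0-terminated so it can be passed to C APIs */
} cstr;

/* Build a counted string from an ordinary C string; returns 0 on success. */
int cstr_from(cstr *s, const char *src)
{
    size_t n = strlen(src);
    char *p = malloc(n + 1);
    if (p == NULL)
        return -1;
    memcpy(p, src, n + 1);
    s->len = n;
    s->data = p;
    return 0;
}

/* Append b to a. No scan for the terminator is needed because both
   lengths are already known. a and b must be distinct objects. */
int cstr_append(cstr *a, const cstr *b)
{
    if (b->len > SIZE_MAX - a->len - 1)
        return -1;                      /* combined length would overflow */
    char *p = realloc(a->data, a->len + b->len + 1);
    if (p == NULL)
        return -1;
    memcpy(p + a->len, b->data, b->len + 1);
    a->data = p;
    a->len += b->len;
    return 0;
}

void cstr_free(cstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Because data stays 0-terminated, a cstr can be passed to printf("%s") or strcmp() unchanged, which is one way such a library could remain compatible with normal C programming.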

Jason
 

JC

JC wrote:
We consider only the abstract machine here.
Existing implementations and what occurs "in practice" aren't relevant.

In this case, "in practice" refers to the interaction of the
standard's definition of "string" with other things defined in the
standard frequently used with strings (such as malloc() and strlen()),
independent of the underlying implementation. The implementation issue
raised about a maximum file size for fopen() and friends was to point
out an implementation-independent workaround for an implementation-
dependent issue (specifically, to work around tying the maximum string
length to the maximum file size). I addressed both theory and
practice; in this discussion you may safely ignore one or the other
without losing any meaning.

There are, in fact, even practical situations where an infinite length
string can be encountered (which I will give an example of
elsethread).

That the entire enclosed character sequence of a string literal needn't be part
of a string can be seen in the string literal "straw\0man". The closest thing
the standard offers us for string literal limits, in 5.2.4.1, is its
notorious "one program" requirement, a true mark of genius from whatever
pencil-neck devised it.

This is correct, but we seem to have miscommunicated. Sorry if I was
unclear. What I meant was: An implementation-defined maximum string
literal length does not imply that the maximum length of a string is
that same length. This is, of course, irrelevant if you weren't
confusing the maximum string literal length with the maximum string
length to begin with.
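
The "straw\0man" observation can be checked directly. This is only an illustrative sketch; the function names are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* The array behind the literal holds all ten bytes:
   's','t','r','a','w','\0','m','a','n','\0'. */
static const char literal[] = "straw\0man";

/* Size of the whole array, embedded and trailing null characters included. */
size_t literal_array_size(void)
{
    return sizeof literal;
}

/* Length of the string, which ends at the first null character. */
size_t literal_string_length(void)
{
    return strlen(literal);
}
```

The array size is 10 while the string length is 5, so a limit on string literal length says nothing by itself about the length of the string that starts at the literal's first byte.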

Jason
 

JC

In fact, the maximum length of a string is not even limited by
available memory. [7.1.1/1] (in C99, TC2) defines a "string" and does
not define a limit on its length. There is no number that exists such
that if the length of a string exceeded that number, it would not be
considered a "string" as defined by the standard. The maximum length
of a string is actually infinite.

     In a freestanding implementation, perhaps, but in a hosted
implementation the strlen() function must be able to return "the
number of characters that precede the terminating null character"
(7.21.6.3p3).  Since it returns this count as a size_t value, it
follows that the count cannot exceed SIZE_MAX, a finite number.


Another example (that just occurred to me) that breaks the SIZE_MAX
limit is if you are reading a string from the standard input stream,
one character at a time (e.g. with fgetc). This still fits under the
definition of a string -- a consecutive sequence of bytes terminated
by a null character -- but the length of the string could easily
exceed SIZE_MAX. This is independent of the defined behavior of
strlen(). There is nothing in the standard that specifies that a string
may not be read from an input stream that is not in the domain of
strlen().

Note that there is also nothing in the standard that includes storage
in memory, ability for random access, or requirements of the existence
of a "pointer to a string" as part of the definition of a string,
therefore a character sequence read one character at a time from a
stream is still a C string.

Then, it follows that the maximum length of a string (at least, one
obtained in the above manner) can still be infinite, even in a hosted
implementation. This may be a more practical example than the
calloc(SIZE_MAX-1,SIZE_MAX-1) case. It may also be a slightly more complete
example in that it shows a string with an infinite length rather than
one with a length that is simply greater than SIZE_MAX.
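
As a rough sketch of the stream case, the function below counts the characters that precede the first null character on a stream without ever storing them. The name stream_string_length is hypothetical. It counts in uintmax_t, which is at least as wide as size_t but still finite, so it only illustrates that the count is not tied to SIZE_MAX or to available memory; it cannot represent an unbounded length.

```c
#include <stdint.h>
#include <stdio.h>

/* Count the bytes read from 'in' before the first null character.
   Returns 1 if a terminating null character was seen, 0 if end-of-file
   (or a read error) came first. The count is stored in *len either way.
   Nothing is buffered, so memory use does not grow with the length. */
int stream_string_length(FILE *in, uintmax_t *len)
{
    uintmax_t n = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (c == '\0') {
            *len = n;
            return 1;
        }
        n++;
    }
    *len = n;
    return 0;
}
```

Whether such a character sequence counts as a "string" in the standard's sense is exactly what is being argued here; the code only shows that the counting itself does not need strlen() or an object to hold the characters.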

Jason
 

JC

No, you've bungled the indexes again. This time you've gone one too
many.

Indeed; I'm not doing too good with this one (although I'm laughing).
It looks like I had it right on the try before that. Change to
calloc(SIZE_MAX,SIZE_MAX) and it should be OK; if it's not OK, then I give
up -- maybe try calloc(rand(),rand()).

My sincerest apologies to anybody who has become dumber after reading
these examples. Might be time to drop programming and consider a
career in circus performance, or at least make more liberal use of xna
in the future.

Jason
 

Keith Thompson

Han from China - Master Troll said:
The biggest meal of all was when Keith Thompson sent out emails to
some of the "regulars", requesting they not reply to my posts or
discuss me.
[...]

That is a pathetic lie.
 

JC



1. Because I can not allocate a block of memory > SIZE_MAX bytes on
this machine, and

2. Because tested or not, it is meant to illustrate a point. If I
replaced the broken loop with
"fill_entire_block_with_a_character_and_add_a_0_to_the_end()" then it
would have been just as acceptable. Instead I quickly hacked together
code to make what was going on explicit. In retrospect, I shouldn't
have, and should have just described what was going on in English
instead of in C.


Jason
 

Keith Thompson

christian.bau said:
There was a lengthy discussion about this a while ago on comp.std.c, I
think. The main question then was what would be the behaviour of
calloc (nmemb, size) if the mathematical product of nmemb and size
exceeds the range of size_t. It seemed conceivable that for example on
a system with 64 bit pointers and 32 bit size_t a call
calloc(0x10001, 0x10000) would return a pointer to 4 GiB + 64 KiB, which
could then be filled with a string whose length exceeds SIZE_MAX. That
would be completely legal; a call to strlen () in this case would
invoke undefined behaviour.

As I recall, there was a counter-argument that no object can be
larger than SIZE_MAX bytes, and so any call calloc(x, y) where
x * y > SIZE_MAX must fail (return a null pointer value).
I don't think this is supported by the wording of the standard,
but I can easily believe that it was the intent. The language is
more internally consistent if objects larger than SIZE_MAX bytes
cannot exist.

Note that the expression x * y > SIZE_MAX is intended to denote
the mathematical multiplication of x by y as unbounded integers,
not using operations on size_t that can wrap around. Some calloc()
implementations fail to check for wraparound; if x * y yields a very
large value that wraps around to a smaller value, the allocation
can succeed. This is a bug in those implementations.
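
A sketch of the check being described, written so that the comparison itself cannot wrap around. The helper names are invented for this example, and checked_calloc reflects the "must fail" reading discussed above, not something the standard is known to require. The last function reproduces christian.bau's numbers in 32-bit arithmetic and assumes the implementation provides uint32_t.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if nmemb * size, taken as unbounded integers, exceeds
   SIZE_MAX. The division avoids computing the wrapping product. */
int product_exceeds_size_max(size_t nmemb, size_t size)
{
    return size != 0 && nmemb > SIZE_MAX / size;
}

/* Hypothetical wrapper: refuse any request whose total size cannot be
   represented in size_t, otherwise defer to calloc(). */
void *checked_calloc(size_t nmemb, size_t size)
{
    if (product_exceeds_size_max(nmemb, size))
        return NULL;
    return calloc(nmemb, size);
}

/* With a 32-bit size, 0x10001 * 0x10000 is mathematically 0x100010000
   (4 GiB + 64 KiB), which wraps around to 0x10000 (64 KiB). */
uint32_t wrapped_product_32(void)
{
    return (uint32_t)((uint32_t)0x10001u * (uint32_t)0x10000u);
}
```

An implementation that multiplied first and allocated the wrapped result would hand back 64 KiB for a request that was meant to be just over 4 GiB, which is the bug described above.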
 

JC

This is, of course, what I was trying to show in my botched example.
Although, I didn't know there was already a discussion about it. I
found it here, for reference:

http://groups.google.com/group/comp.std.c/browse_frm/thread/35ec063d81174821

As I recall, there was a counter-argument that no object can be
larger than SIZE_MAX bytes, and so any call calloc(x, y) where
x * y > SIZE_MAX must fail (return a null pointer value).
I don't think this is supported by the wording of the standard,
but I can easily believe that it was the intent.  The language is
more internally consistent if objects larger than SIZE_MAX bytes
cannot exist.

[7.20.3.1/2] of C99 states: The calloc function allocates space for an
array of nmemb objects, each of whose size is size. The space is
initialized to all bits zero.

The way I read that, objects larger than SIZE_MAX still can not exist.
Even though calloc() can return a block of memory larger than
SIZE_MAX, what it's really doing is allocating space for some number
of objects whose individual sizes do not exceed SIZE_MAX.

If you read it that way, the issue actually becomes whether or not
it's "undefined" to treat consecutive "objects" returned by calloc()
as a single object (e.g. a "string" that spans multiple "object"
boundaries), rather than what happens when strings longer than
SIZE_MAX are passed to strlen().

It does seem to be a bit vague, doesn't it? The standard places no
limitation on a "string", but it also provides calloc, but calloc
doesn't necessarily mean that an "object" can exceed SIZE_MAX bytes,
but a "string" is not necessarily defined as being bound by an
"object", but strlen is only defined when a "string" fits in an
"object".

In the end, of course, sensible applications will rarely, if ever, run
into problems with any of this. On a system with a 32-bit size_t, if
your application is dealing with 4 gigabyte strings, you probably have
a few other issues worth looking at, and you're probably not going to
be using strlen() to determine the length (at least on today's
hardware).

Jason
 

Keith Thompson

JC said:
As I recall, there was a counter-argument that no object can be
larger than SIZE_MAX bytes, and so any call calloc(x, y) where
x * y > SIZE_MAX must fail (return a null pointer value).
I don't think this is supported by the wording of the standard,
but I can easily believe that it was the intent.  The language is
more internally consistent if objects larger than SIZE_MAX bytes
cannot exist.

[7.20.3.1/2] of C99 states: The calloc function allocates space for an
array of nmemb objects, each of whose size is size. The space is
initialized to all bits zero.

The way I read that, objects larger than SIZE_MAX still can not exist.
Even though calloc() can return a block of memory larger than
SIZE_MAX, what it's really doing is allocating space for some number
of objects whose individual sizes do not exceed SIZE_MAX.

The array is itself an object.

[...]
 

James Kuyper

Keith said:
As I recall, there was a counter-argument that no object can be
larger than SIZE_MAX bytes, and so any call calloc(x, y) where
x * y > SIZE_MAX must fail (return a null pointer value).
I don't think this is supported by the wording of the standard,

The strongest argument I've seen for that point of view was based upon
the fact that sizeof(type) is supposed to return the size of an object
of the specified type. Since it's not possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct size of the
specified type, it must not be possible to use calloc() to allocate such
an array.

Personally, I don't see the connection. Even if calloc() is not able to
allocate such an array, that still doesn't make it possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct value.

but I can easily believe that it was the intent. The language is
more internally consistent if objects larger than SIZE_MAX bytes
cannot exist.

Agreed.
 

JC

As I recall, there was a counter-argument that no object can be
larger than SIZE_MAX bytes, and so any call calloc(x, y) where
x * y > SIZE_MAX must fail (return a null pointer value).
I don't think this is supported by the wording of the standard,

The strongest argument I've seen for that point of view was based upon
the fact that sizeof(type) is supposed to return the size of an object
of the specified type. Since it's not possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct size of the
specified type, it must not be possible to use calloc() to allocate such
an array.

Personally, I don't see the connection. Even if calloc() is not able to
allocate such an array, that still doesn't make it possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct value.

On top of that, I don't even see the connection between
sizeof(char[SIZE_MAX][SIZE_MAX]) being undefined and char[SIZE_MAX][SIZE_MAX]
being disallowed. The array is larger than SIZE_MAX, and can exist --
nothing seems to say that just because you can't determine the size of
something with sizeof(), that something can't exist.

Jason
 

JC

[7.20.3.1/2] of C99 states: The calloc function allocates space for an
array of nmemb objects, each of whose size is size. The space is
initialized to all bits zero.
The way I read that, objects larger than SIZE_MAX still can not exist.
Even though calloc() can return a block of memory larger than
SIZE_MAX, what it's really doing is allocating space for some number
of objects whose individual sizes do not exceed SIZE_MAX.

The array is itself an object.

Fair enough. Then, what is the rationale for the premise that "no
object can be larger than SIZE_MAX bytes"? Is it because sizeof() is
undefined for such an object? Just because sizeof() is undefined for
objects larger than SIZE_MAX bytes doesn't seem to imply that objects
larger than SIZE_MAX bytes can't exist -- just that you can't
determine the size of them with sizeof(). That's my same rationale for
why undefined behavior of strlen() doesn't affect the maximum length
of a string.

The relation between all of this, as I see it (retracting my previous
incorrect statement that calloc returns an "array of objects" that is
not in itself an "object"), is:

- The upper limit of sizeof() does not imply the upper limit of the
size of an object.
- Nothing else in the standard seems to specify the maximum size of
an object either.
- Therefore an object has infinite maximum size, in theory.
- Therefore, the use of calloc() to allocate more than SIZE_MAX
bytes does not violate any upper limits on the size of a theoretical
object (which is infinite).
- Therefore, calloc() can be used to allocate an "extra large"
object larger than SIZE_MAX bytes. The maximum number of bytes that
calloc() can allocate is SIZE_MAX*SIZE_MAX, and *this* is therefore an
upper limit on the size of an object, in practice.
- So, that "extra large" object can exist regardless of sizeof()'s
limits, and can be obtained via calloc(). It is a real object.
- The standard does not explicitly place an upper limit on the
length of a string. A string has infinite maximum length, in theory.
- If the extra large object is filled with a consecutive sequence of
characters terminated by a 0 character, it fits the definition of a
string, and does not exceed any upper limits on the length of a
theoretical string (which is infinite).
- The upper limit of strlen() does not imply the upper limit of the
length of a string.
- The standard defines the "length of a string" as the "number of
bytes preceding the null character". Therefore the length of a string
placed in the maximum amount of memory allocatable by calloc() is
"SIZE_MAX * SIZE_MAX - 1". This, then, is the upper limit on the
length of a string in practice.

In other words, this shows that:

1) An object of maximum size SIZE_MAX * SIZE_MAX can be obtained
without violating the standard.
2) A string of maximum length SIZE_MAX * SIZE_MAX - 1 can be
constructed in that object without violating the standard.
3) The maximum theoretical size of an object (and length of a
string) is infinite, according to the standard (but correct me if I'm
wrong about the object max size, that blows a hole in the whole
thing).
4) The maximum practical size of an object is SIZE_MAX * SIZE_MAX,
because there is no way to obtain a larger object.
5) The maximum practical length of a string residing in memory is
SIZE_MAX * SIZE_MAX - 1, because there is no way to obtain a larger
object that the string can be constructed in.

So I guess that means the maximum length of a string *in memory* is
SIZE_MAX * SIZE_MAX - 1. However, note that that only applies to
strings that exist in memory; also note that the standard does not
require strings to reside in memory (and does not require that a
"pointer to a string", as defined in 7.1.1/1, must exist for a string
to exist).

Going back to the example of a string read from the standard input
stream with fgetc(), then, the actual maximum length of a string is
still infinite.

Jason
 

James Kuyper

JC said:
The strongest argument I've seen for that point of view was based upon
the fact that sizeof(type) is supposed to return the size of an object
of the specified type. Since it's not possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct size of the
specified type, it must not be possible to use calloc() to allocate such
an array.

Personally, I don't see the connection. Even if calloc() is not able to
allocate such an array, that still doesn't make it possible for
sizeof(char[SIZE_MAX][SIZE_MAX]) to return the correct value.

On top of that, I don't even see the connection between
sizeof(char[SIZE_MAX][SIZE_MAX]) being undefined and char[SIZE_MAX][SIZE_MAX]
being disallowed. The array is larger than SIZE_MAX, and can exist --
nothing seems to say that just because you can't determine the size of
something with sizeof(), that something can't exist.

For any object that you can actually define in your program, with either
static or automatic storage duration, "sizeof object" is supposed to
give the size of that object; the standard provides no exceptions to this
requirement. An implementation can meet this requirement by either
limiting the maximum size of objects defined in the program to a size
smaller than SIZE_MAX, or by choosing a type for size_t sufficiently big
that SIZE_MAX is bigger than the largest object that can be defined.
Because there is always a way for an implementation to meet that
requirement, I believe that the meaning of that requirement is precisely
that an implementation must use whichever of those methods it wishes to
use, in order to meet that requirement.

The requirement that sizeof(type) always give the correct size has quite
different implications. Since it will always be possible to declare
types that have a size greater than SIZE_MAX, no matter what value
SIZE_MAX has, there's no course of action that an implementation has
available to it to ensure meeting that requirement. I therefore consider
that requirement to be a defect in the standard. I imagine that the
actual intent was that sizeof(type) is only required to give the correct
size when the correct size is smaller than SIZE_MAX. However, the
standard as currently written contains no wording that actually allows
sizeof to ever fail to give the correct size.

I've been told that the standard requires sizeof(type) to return the
size, after conversion to size_t, even if that conversion results in a
number which is not the actual size; that interpretation would avoid
this problem. However, I see no such wording in the standard, not even
when I looked at the precise citations that were made with the intent of
supporting that claim.

Use of calloc() presents a different issue. I can declare

char (*array)[SIZE_MAX][SIZE_MAX] = calloc(SIZE_MAX, SIZE_MAX);

if(array)
{
size_t size = sizeof *array;

Now, if we reach this point in the code, then *array is an actual
object, and it would appear that sizeof *array is required to return the
correct size of that object, but it equally obviously cannot. Since an
implementation always has the option of having calloc() fail for such
allocations, it might seem that it could be argued that objects with
allocated storage duration also must be limited to SIZE_MAX bytes, just
the same as objects with static or automatic storage duration.

However, there's a key difference here. "sizeof expression" does not
evaluate the expression. Therefore, "sizeof *array" must produce the
same result no matter what the value of 'array' is. 'array' doesn't even
have to be initialized. Therefore, the problem is not the calloc() call,
but the declaration of 'array'. It's not at all clear that an
implementation has any legitimate reason it could give for rejecting
such a declaration, but doing so is the only way available to it for
avoiding the possibility of having to meet an impossible requirement.

If, instead, you wrote

char (*array)[SIZE_MAX] = calloc(SIZE_MAX, SIZE_MAX);

then there is no way to use an argument based upon 'sizeof' to conclude
that this call must fail.
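
The point that "sizeof expression" does not evaluate its operand can be seen with a small stand-in. ROW is a hypothetical small constant used in place of SIZE_MAX so that the sketch compiles everywhere, and the function name is made up for the example. The rule shown applies to non-variable-length array types; with a variable length array type the operand is evaluated.

```c
#include <stddef.h>

enum { ROW = 16 };  /* small stand-in for SIZE_MAX */

/* 'array' is deliberately never initialized and no allocation is made.
   sizeof *array is computed from the type char[ROW] alone, so the
   pointer is never dereferenced. */
size_t row_size_without_allocation(void)
{
    char (*array)[ROW];
    return sizeof *array;
}
```

The result is the same whether or not a calloc() call ever succeeds, which is why the argument has to be about the declared type and not about the allocation.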
 

James Kuyper

JC wrote:
....
Fair enough. Then, what is the rationale for the premise that "no
object can be larger than SIZE_MAX bytes"? Is it because sizeof() is
undefined for such an object?

No, it's because sizeof such an object is DEFINED by the standard to
have a value that it cannot actually have; an implementation is
therefore required to make sure that such a situation does not actually
come up. This argument has flaws, which I've discussed in the message I
just posted a few minutes ago.

- The upper limit of sizeof() does not imply the upper limit of the
size of an object.

Since "sizeof object" is defined as having a value (of type size_t)
which is the size of the object, it does not have the behavior mandated
by the standard if there's any object it can be applied to which has a
size greater than that limit. If an implementation allows the creation
of such objects in code which applies the sizeof operator to those
objects, it has no choice but to fail to conform to the standard, one
way or another.
 

Harald van Dijk

[...]
char (*array)[SIZE_MAX][SIZE_MAX] = calloc(SIZE_MAX, SIZE_MAX);

if(array)
{
size_t size = sizeof *array;
[...]
However, there's a key difference here. "sizeof expression" does not
evaluate the expression. Therefore, "sizeof *array" must produce the
same result no matter what the value of 'array' is. 'array' doesn't even
have to be initialized. Therefore, the problem is not the calloc() call,
but the declaration of 'array'. It's not at all clear that an
implementation has any legitimate reason it could give for rejecting
such a declaration, but doing so is the only way available to it for
avoiding the possibility of having to meet an impossible requirement.

Can an implementation state that this code exceeds a translation limit?
5.2.4.1 only states that an implementation "shall be able to translate and
execute at least one program that contains at least one instance of every
one of the following limits", which seems to allow for the possibility of
completely different types of translation limits, so long as they do not
disallow at least the one program.
 

Antoninus Twink

Trolls lie, not because it is in their interest, but because it is
in their nature. Not every comp.lang.c liar is a troll, but I think
I'm correct in claiming that every comp.lang.c troll is a liar.

We've heard it all now - Heathfield, the very prince of lies, accuses
others of lying.
Of course, he's just as much a troll as the others, and therefore a
liar. (I can prove, if necessary, that he's a liar.)

Why would it be necessary? And why would we believe the testimony of a
liar like you?
 

James Kuyper

Harald said:
[...]
char (*array)[SIZE_MAX][SIZE_MAX] = calloc(SIZE_MAX, SIZE_MAX);

if(array)
{
size_t size = sizeof *array;
[...]
However, there's a key difference here. "sizeof expression" does not
evaluate the expression. Therefore, "sizeof *array" must produce the
same result no matter what the value of 'array' is. 'array' doesn't even
have to be initialized. Therefore, the problem is not the calloc() call,
but the declaration of 'array'. It's not at all clear that an
implementation has any legitimate reason it could give for rejecting
such a declaration, but doing so is the only way available to it for
avoiding the possibility of having to meet an impossible requirement.

Can an implementation state that this code exceeds a translation limit?
5.2.4.1 only states that an implementation "shall be able to translate and
execute at least one program that contains at least one instance of every
one of the following limits", which seems to allow for the possibility of
completely different types of translation limits, so long as they do not
disallow at least the one program.

It's hard for me to imagine a reasonable limit that could be used to
justify this. However, I can't rule out the possibility that someone
might be able to come up with one, which is why I said "It's not at all
clear".
My key point was that it's the declaration of 'array', if anything,
which must be constrained by this requirement, not the call to calloc().
 

Harald van Dijk

Harald said:
[...]
char (*array)[SIZE_MAX][SIZE_MAX] = calloc(SIZE_MAX, SIZE_MAX);

if(array)
{
size_t size = sizeof *array;
[...]
It's not at all clear
that an implementation has any legitimate reason it could give for
rejecting such a declaration, [...]

Can an implementation state that this code exceeds a translation limit?
[...]

It's hard for me to imagine a reasonable limit that could be used to
justify this. [...]

Is it unreasonable to make "SIZE_MAX bytes in an object type" a
translation limit? That's exactly the limit you want to set, right?
 
