Null terminated strings: bad or good?

Keith Thompson

CBFalconer said:
I claim that calloc didn't really succeed. It just converted the
total size requested using the usual unsigned conversions. It
returned a pointer to a physical object, which was NOT (SIZE_MAX *
2) big. It can't be, since no object can exceed SIZE_MAX. THIS IS
NOT A TYPE. THIS calloc IS FAULTY.

You misunderstood my example. In my hypothetical implementation,
calloc() *did* succeed. It returned a pointer to an object whose size
exceeds SIZE_MAX bytes.

You assert that "no object can exceed SIZE_MAX". I see no *direct*
statement of this in the standard. If there is one, surely you can
provide a citation. (You don't get to just make up rules like this.)

However, Harald van D?k (sorry, I can't type his name properly in my
current environment) presented an interesting indirect argument
involving the strlen() function. It's in this thread; since you and I
use the same news server I'm sure you can access it.

[...]
I believe (but may be wrong) that somewhere the standard specifies
that the action of calloc is to call malloc, and then initialize
the result.

Yes, you're wrong. Since we're discussing subtleties in the wording
of the standard, I'm surprised you didn't just look it up.

C99 7.20.3.1:

The *calloc* function allocates space for an array of *nmemb* objects,
each of whose size is *size*. The space is initialized to all bits
zero.

...

The *calloc* function returns either a null pointer or a pointer
to the allocated space.

(I've added '*'s to denote boldface.)
 
qarnos

You are a knowledgeable C programmer. And yet you have
proved that you can forget adding the damned terminating
zero.

Whereas forgetting to update the character count would be a far
superior bug.
 
JC

I posted it first, at least in this thread. That was the example I
botched else-thread in a reply to Eric Sosman.

Very interesting.

For reference, here's what C99 says about strlen:

7.21.6.3 The strlen function

Synopsis
1 #include <string.h>
size_t strlen(const char *s);

Description
2 The strlen function computes the length of the string pointed to
by s.

Returns
3 The strlen function returns the number of characters that precede
the terminating null character.

I'm still not comfortable with the indirectness of the reasoning
(that's a criticism of the standard, not of your analysis of it). But
I think you're right.

This is exactly what I was trying to say as well, I think. There seem
to be two basic arguments here: either strlen()'s requirements are taken
to be invariant, or strlen() is considered undefined for strings longer
than SIZE_MAX.

A counterargument could be made that if the length of the string can't
be represented as a size_t then the behavior is undefined by omission,
just as if s doesn't point to a string.

There's nothing in the definition of a string that presents acceptance
by strlen() as one of the requirements. This counterargument says the
same thing: strlen() is undefined for strings longer than SIZE_MAX.

What I think I'd *like* the standard to say is that no object can
exceed SIZE_MAX bytes, that any attempt to declare such an object
invokes undefined behavior, and that calloc(X, Y), where X * Y exceeds
SIZE_MAX, must return NULL. Your argument suggests that it already
implies all of this; I wish it did so more explicitly.
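The behaviour wished for above can be sketched as a wrapper around calloc(). The name checked_calloc is hypothetical; this shows what "must return NULL" would look like in user code and makes no claim about how any particular library implements calloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: refuses any request whose total size
   nmemb * size cannot be represented in a size_t, instead of
   letting the multiplication wrap around. */
void *checked_calloc(size_t nmemb, size_t size)
{
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;            /* product would exceed SIZE_MAX */
    return calloc(nmemb, size); /* product is known to fit */
}
```

With this check in place, checked_calloc(SIZE_MAX, 2) yields a null pointer rather than a pointer to an object smaller than the caller asked for.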

For completeness, it would also have to state that the maximum size of
a string is the maximum size of an object, although I believe that to
be too arbitrary. A sequence of null-terminated characters returned by
successive calls to fgetc() is still a string by the standard's
definition of a string, and yet its length is clearly arbitrary --
even infinite is within reason in that case.


Jason
 
Keith Thompson

JC said:
For completeness, it would also have to state that the maximum size of
a string is the maximum size of an object, although I believe that to
be too arbitrary. A sequence of null-terminated characters returned by
successive calls to fgetc() is still a string by the standard's
definition of a string, and yet its length is clearly arbitrary --
even infinite is within reason in that case.

I think that a string is necessarily an object. The standard's
definition is:

A _string_ is a contiguous sequence of characters terminated by
and including the first null character.

I assume that "contiguous" means "contiguous in memory", so a sequence
of characters returned by calls to fgetc() isn't a string unless you
store them contiguously in some object.
 
JC

I think that a string is necessarily an object.  The standard's
definition is:

    A _string_ is a contiguous sequence of characters terminated by
    and including the first null character.

I assume that "contiguous" means "contiguous in memory", so a sequence
of characters returned by calls to fgetc() isn't a string unless you
store them contiguously in some object.


Yes, I made a different assumption. The entire argument of whether or
not character data returned by calls to fgetc() is a "string" depends
on whether or not one assumes an implied "in memory" there. I'll
concede that there's definitely strong evidence that suggests it's
implied (esp. if it's not implied, a *lot* of other things in the
standard become vague or possibly undefined, e.g. the "pointer to
string" defined in that same paragraph).

In any case, I can't really come up with any new arguments for
*either* side (a string is or is not an object), and it seems like a
few extra words in the standard to clarify it all wouldn't hurt.


Jason
 
Tony

Dik T. Winter said:
...

And the advantage of that?

1. Ease of IO (writing to/from disk files, for example).
2. No need to calculate length (a real bummer with null terminated strings).
3. ?

Actually, the above isn't enough. One would want to maintain the size of the
buffer that data points to.

Note that I use C++ and have a string class based upon an array class and
that the above was just a quickie thought as to how a counted string could
look in C (for starters anyway). I think null terminated strings are a nail
in C's coffin now.

See:
struct string a = {10, "abcde"};

Every construct requires documentation/learning-how-to-use.
a.length = 100;

One is not supposed to manipulate the struct directly. There would be all
the standard functions that operate on the struct. (Again though, I'm not
proposing this for C (though it would be simple enough to make a library to
do so)).

Tony
 
Tony

Thank Han. I thought I was going to have to read many more posts in this
thread but now you've marked the prune point and I need not read this
branch beyond your post at all!


"It's safe (in a manner of speaking). The subtopic is the maximum
length of a string."

I don't see that as an issue. Everything doesn't have to be scalable to the
largest integer size on a machine. Strings to me are of "reasonable" length.
For instance, if there is a period marking the end of a sentence, then that
is probably one string. A whole file of sentences and paragraphs, is not a
string. 32 bits for a length field is just because it's easy to use on a
32-bit platform. If someone needs a billion byte string, well I'm not even
going to try to conceive of that because it sounds silly.

Tony
 
Tony

James Kuyper said:
You criticized the definition of the term, rather than the design of the
library that it was used to describe. I gave you an answer that matched
your criticism. Next time, try to be clearer about what it is that you're
really criticizing.


You've got cause and effect reversed. The fact that the standard chose to
define strings as being null-terminated follows from the fact that the C
standard library was designed to use null-terminated strings, not
vice-versa.

C'mon now. You can't build the library (functions) without first deciding
what a "string" is (what the functions operate on).

Tony
 
JC

Every construct requires documentation/learning-how-to-use.

I'm slightly confused now; are you referring to an actual struct with
a count and data member? Or are you referring to changing the
underlying implementation of strings to counted strings? Or something
else? I assumed the second, where the user could not manipulate the
count and data independently of each other, and all manipulation must
be done via library functions, and initialization is simply a matter
of:

countedstring a = "thestring";

In other words, which of the following are you talking about:

1. Standard structures with a publicly accessible count and data
field.
2. Standard structures only accessible with library functions.
3. Language change with change to underlying implementation (e.g.
string literals represent a count followed by the data rather than the
data followed by a 0).

That question applies to both C and C++ contexts.


Jason
 
Keith Thompson

Tony said:
I don't see that as an issue. Everything doesn't have to be scalable
to the largest integer size on a machine. Strings to me are of
"reasonable" length. For instance, if there is a period marking the
end of a sentence, then that is probably one string. A whole file of
sentences and paragraphs, is not a string. 32 bits for a length
field is just because it's easy to use on a 32-bit platform. If
someone needs a billion byte string, well I'm not even going to try
to conceive of that because it sounds silly.

Limiting a feature to what you consider to be "reasonable" is likely
to prevent perfectly valid uses. I've written programs (not in C)
that slurp the entire contents of a file into memory as a single
string. Why should that be disallowed because you think it's "silly"?
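For what it's worth, the same thing is easy enough to sketch in C. The helper name slurp and the initial 4096-byte buffer are assumptions made for this example. It keeps an explicit byte count next to the data and also appends a zero, so the result works either as a counted buffer or, when the file contains no zero bytes, as an ordinary C string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything remaining in fp into one
   malloc'd buffer, stores the byte count in *len and appends '\0'.
   Returns NULL on read or allocation failure. */
char *slurp(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0;
    char *buf;

    assert(fp != NULL && len != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte spare for the terminator. */
        size_t got = fread(buf + used, 1, cap - used - 1, fp);
        used += got;
        if (got == 0)
            break;                      /* end of file or error */
        if (used == cap - 1) {          /* buffer full: double it */
            char *tmp;
            if (cap > SIZE_MAX / 2) { free(buf); return NULL; }
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) { free(buf); return NULL; }

    buf[used] = '\0';
    *len = used;
    return buf;
}
```

The only size limits here are available memory and SIZE_MAX; nothing in the code cares whether the contents are one sentence or a whole file.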
 
Guest

C'mon now. You can't build the [string] library (functions) without first deciding
what a "string" is (what the functions operate on).

well you could build a lot of the library with some sort
of abstract definition of "string" and only concretise
it later; even at run time if you wanted really late binding.

make-string, string-ref (read a char), string-set! (write a char),
string-length would probably do it.
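As a rough sketch of that idea in C: the four operations map onto four functions, and callers that stick to them never depend on the representation. All names here (struct counted, cs_make, cs_ref, cs_set, cs_length, cs_free) are made up for the example, and the length-plus-pointer layout is just one choice that could be swapped for another.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One possible representation. Code that only uses the functions
   below does not care whether this is counted, terminated, or
   something else entirely. */
struct counted {
    size_t len;
    char  *data;
};

/* make-string: a string of len characters, each set to fill. */
struct counted *cs_make(size_t len, char fill)
{
    struct counted *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->data = malloc(len ? len : 1);
    if (s->data == NULL) { free(s); return NULL; }
    memset(s->data, fill, len);
    s->len = len;
    return s;
}

/* string-length */
size_t cs_length(const struct counted *s) { return s->len; }

/* string-ref: read one character. */
char cs_ref(const struct counted *s, size_t i)
{
    assert(i < s->len);
    return s->data[i];
}

/* string-set!: write one character. */
void cs_set(struct counted *s, size_t i, char c)
{
    assert(i < s->len);
    s->data[i] = c;
}

void cs_free(struct counted *s)
{
    if (s != NULL) { free(s->data); free(s); }
}
```

Everything else (copy, compare, concatenate) can be written on top of these without knowing the layout, at the cost of a function call per character unless the compiler inlines them.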

Of course this may not be a good idea if your language is supposed to
be hyper-efficient bare metal programming.
 
James Kuyper

Tony said:
C'mon now. You can't build the library (functions) without first deciding
what a "string" is (what the functions operate on).

Why not? All you need is the concept; naming the concept can come later.
And the concept needed to write those routines is "null-terminated
string". The decision about whether or not the term "string" should be
used exclusively for "null-terminated string" can be deferred
practically indefinitely without affecting the feasibility of writing
programs for manipulating null-terminated strings.
 
Bartc

Tony said:
My thinking was/is also that null terminated strings suck. I was just
being a good citizen and asking the more general question.

Just curious: what do you do with your length+char-array string when you
need to pass it to an OS function that needs a zero-terminated one (or
Asciiz as they used to be called)?
 
jacob navia

Bartc said:
Just curious: what do you do with your length+char-array string when you
need to pass it to an OS function that needs a zero-terminated one (or
Asciiz as they used to be called)?

I pass what it expects obviously. In my implementation,
I always append a zero to the string stored so the conversion
is very fast.
 
Bartc

jacob navia said:
I pass what it expects obviously. In my implementation,
I always append a zero to the string stored so the conversion
is very fast.

OK so you use a length *and* a zero-terminated string?

Makes sense (that's what I do). But then you lose one advantage of
length+string which is dealing with arbitrary binary data, which can include
zeros.

(I suppose also on zero length strings you store a single 0 char?)
 
Rui Maciel

jacob said:
I pass what it expects obviously. In my implementation,
I always append a zero to the string stored so the conversion
is very fast.

So in the end it's a standard C null terminated string with a wrapper that holds a line length variable?


Rui Maciel
 
jacob navia

Rui said:
So in the end it's a standard C null terminated string with a wrapper that holds a line length variable?


Rui Maciel

The string has the following fields:
1) length: The number of data characters in the string.
2) capacity: The length of the allocated string buffer for this string
3) Flags (used to implement read only strings and other goodies)
4) A pointer to the data characters.

Since I always allocate at least 1 char more, I use the zero as a
sanity check since at all times

String.buffer[String.length] == 0

Embedded zeroes are supported of course
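A minimal sketch of such a layout, for illustration only: the field names follow the list above, but the types, the helpers String_init and String_ok, and the exact capacity rule are assumptions made for this example and are not taken from the actual library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t   length;    /* number of data characters */
    size_t   capacity;  /* bytes allocated for buffer, always > length */
    unsigned flags;     /* e.g. a read-only bit; unused in this sketch */
    char    *buffer;    /* the data, followed by one guard zero */
} String;

/* Hypothetical predicate: nonzero if the sanity check holds,
   i.e. String.buffer[String.length] == 0. */
int String_ok(const String *s)
{
    return s->buffer != NULL
        && s->length < s->capacity
        && s->buffer[s->length] == 0;
}

/* Hypothetical constructor: copies n bytes, which may include
   zeros, and appends the guard zero. Returns 0 on success. */
int String_init(String *s, const void *data, size_t n)
{
    assert(s != NULL && data != NULL);
    if (n == SIZE_MAX)
        return -1;              /* no room for the guard byte */
    s->buffer = malloc(n + 1);
    if (s->buffer == NULL)
        return -1;
    memcpy(s->buffer, data, n);
    s->buffer[n] = 0;
    s->length = n;
    s->capacity = n + 1;
    s->flags = 0;
    return 0;
}
```

Because the guard zero is always there, buffer can be handed directly to a function that expects a zero-terminated string, as long as the data itself contains no zeros. The check is only a heuristic: a corrupted length that happens to land on an embedded zero still passes it.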
 
jacob navia

Bartc said:
OK so you use a length *and* a zero-terminated string?

No, the zero is just there to double check the length.

Suppose some rogue program changes the length in a wrong
way. The string will not be accepted any more since
String.buffer[String.length] will not be zero.

Makes sense (that's what I do). But then you lose one advantage of
length+string which is dealing with arbitrary binary data, which can
include zeros.

It is only the zero at String.length that counts. Embedded zeroes inside
the string are NOT significant.

(I suppose also on zero length strings you store a single 0 char?)

Yes
 
Antoninus Twink

Suppose some rogue program changes the length in a wrong
way. The string will not be accepted any more since
String.buffer[String.length] will not be zero.

What do you mean by "not accepted"? Where and how often does the check
on String.buffer[String.length] take place?
 
