Null terminated strings: bad or good?

Tony

James Kuyper said:
Within the context of the C language, there's lots of routines in the C
standard library that work with null-terminated strings, and only a few
that work with counted strings.

Well "duh". Once you define "string" to mean "null terminated sequence of
chars", then obviously you write the functions to fit that PARADIGM.
Every routine that treats a string as a series of characters rather than
as raw memory recognizes null termination, even those functions that also
accept a count.

Yeah, so? That is obvious and follows directly from the definition of a C
string as being a null-terminated thing.
Therefore, if the standard used 'string' as shorthand for 'counted
string' rather than 'null terminated string', it would be marginally
longer and harder to read.

Did you just give the following argument: "god created all things, hence
there must be a god"?

Tony
 
Guest

My C++ string class is a length and data ptr (pretty much, since the
underpinnings are an array class).




And didn't the null terminated string give us the buffer overrun fiasco?

No. Could you give an example of the buffer overrun fiasco?

String representations, like many things in life, are a compromise.
Zero terminated strings have some pros and cons. So do counted
strings (read the rest of the thread for some).

If you *really* can't live with zero terminated strings then write (or
find) a library that doesn't use them! Jacob Navia or Paul Hsieh (I think)
have publicly available versions (though Jacob's uses an extension to the C
language).
 
JC

Perhaps strings should be akin to width-specified integers:

string16 (a string with up to 65535 chars)
string32 ... etc.

This is almost not a bad idea, but the major problem with it is
string16 and string32 are now different types, when they arguably
should not be. Are there special rules for converting from string16 ->
string32? What if you want to pass a string8 to a function that
expects a string32? What if the function modifies the string through
its parameter? Then the string8 -> string32 conversion must make
changes to the string32 visible in the string8. What if you want to
convert a string32 to a string16? Should it do a run-time check and
fail if the string32's length is larger than 65535?
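
To make the type problem concrete, here is a hypothetical sketch (the names
and layouts are invented; standard C has no such types):

#include <stdint.h>

/* Hypothetical width-specified string types: two distinct, unrelated structs. */
struct string16 { uint16_t len; char *data; };
struct string32 { uint32_t len; char *data; };

void shout(struct string32 *s);   /* imagine this upcases s in place */

void demo(struct string16 *name)
{
    /* A string16* is not a string32*, so the call needs a hand-written
       conversion... */
    struct string32 tmp = { name->len, name->data };
    shout(&tmp);
    /* ...and if shout() changed tmp.len, the caller only sees that if we
       copy it back by hand, with a range check. */
    if (tmp.len <= UINT16_MAX)
        name->len = (uint16_t)tmp.len;
}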

"These are normally known as "C-style" strings. The main advantage is
the length of the string is limited only by available memory, and the
length field is not stored with the string, thus conserving storage
space."

The "main advantage" above, is actually a disadvantage. It causes
programmers to write code that is susceptible to buffer overrun attacks.

Storing the length with the string does not protect against buffer
overrun attacks.

Storage space conservation? Only in the exceptional case nowadays.

Not on embedded systems. The ATMega168 I have sitting on my desk right
now has 512 bytes of read-only storage for data, and 16kB more
available for program + data. Every byte counts. These are not
exceptional cases; it's fairly common hardware, albeit somewhat
specialized.

"removing the prefix of a string can be done simply by referring to a
location past the beginning,"

That operation is the same in the "Pascal-type" string also, but then the
length has to be updated. No big deal.

No. This is not correct. Consider this function, which takes as input
a full path name (e.g. "c:\\windows\\notepad.exe") and prints the last
path component ("notepad.exe"), without modifying the input.


void print_file_name(const char *fullpath) {
    const char *filename = strrchr(fullpath, PATH_SEPARATOR);
    /* skip the separator itself when one was found */
    printf("file name only: %s\n", filename ? filename + 1 : fullpath);
}


I challenge you to implement that with counted strings without leaking
memory or allocating new memory. As a handicap, you may assume that a
function "last_index_of(counted_string, char)" exists that returns the
index of the last occurrence of the character in the specified
counted_string (say it returns -1 if not found).

"dividing strings into substrings can be done by
placing 0's where appropriate. As an exercise, try implementing strtok()
with Pascal-style strings. You may be surprised at the difficulty."

Well, one function is an exceptional case. The rule is to program for the
common case and handle special things as required, rather than complicate
the general case.


It's one function because I didn't list all the others. How about
strrchr, strchr, strstr? Not to mention that strtok is not an uncommon
function.

Additionally, if the rule is to program for the common case, it seems
that null-terminated strings satisfy that rule quite well, as you can
already see. Counted strings would complicate the general case. Not
that counted strings aren't useful in certain cases, but I would
assume they are useful in the minority based on the success I've had
with null-terminated strings so far.

"The main disadvantage of C-style strings is computing the length is O
(n)"

I'd say there are a FEW issues and that is just one of them.


Please elaborate.

And this, of course, mitigates that issue:
", but applications that need to reduce this to constant time can
easily do so by storing the length elsewhere, if they need it."



Jason
 
Richard

jacob navia said:
There are many advantages to zero terminated strings:

o Performance. You must scan the whole string millions of times
each time you want to know how long it is. This always needs
a faster processor, so you can count on C to get that new game
machine you were dreaming of.

Huh? Oh! You are taking the piss!
 
Amandil

Well said (meaning, I read that, understand it). I still don't see the
benefit of null terminated strings over a struct-like thing:
struct string
{
   uint32 length;
   char* data;
};

Here are some potential advantages:

(1) The struct takes more space - 4 bytes for the length vs 1 for
the nul.

(2) The code for a nul terminated string typically is more
compact.  Compare

    for (i = 0; i<s.length; i++) {... s.data ...}

vs

    for (ptr = s; *ptr; ptr++) {... *ptr ...}

or even (if we don't need to preserve the pointer to the string)

    for (; *s; s++) {... *s ...}


An additional advantage may be that one might not know how long a
string is, for example if it was read from user input, like gets().
Yes, gets() could store the length of the string at the beginning, but
that's not its job.
By the way, if you're not satisfied with NUL terminated strings, I
believe DOS (this is going into x86 assembly for DOS) recognized the
'$' character as a string terminator.
NUL terminated strings are recognized in many more places than just
the C programming language, and they have their advantages. They may also
have their disadvantages, mostly in the form of attracting bugs, but I
(personally) believe the advantages win out. Of course, anyone who
disagrees with me is perfectly free to create a string type based on
the C++ std::string class, and a complete library to go along with it
- I believe J. Navia claims to have done something similar, and that's
quite understandable - but "It's all over but the shouting" and
standard C is planning to stick with NUL terminated character strings.
BTW, please don't confuse NUL (ASCII character code 0x00) with NULL
(the null pointer constant).

Cheers

-- Marty Amandil (Yes, it's a pseudonym, but I hope that *he* doesn't
consider me a liar or a troll.)
 
Antoninus Twink

However, my dog never attempted to improve the C string library
throughout his life. He died of old age about 5 years ago.

Damn, he must have been really ancient if he died of old age before you
did.

Linking a C program? "Off topic". Death dates of family pets? "On
topic".
 
Phil Carmody

Well said (meaning, I read that, understand it). I still don't see the
benefit of null terminated strings over a struct-like thing:

struct string
{
uint32 length;
char* data;
};

Here are some potential advantages:

(1) The struct takes more space - 4 bytes for the length vs 1 for
the nul.

(2) The code for a nul terminated string typically is more
compact. Compare

for (i = 0; i<s.length; i++) {... s.data ...}

vs

for (ptr = s; *ptr; ptr++) {... *ptr ...}

or even (if we don't need to preserve the pointer to the string)

for (; *s; s++) {... *s ...}


It may be more compact, but is it easier to make a silly mistake
that compiles without a warning?

Phil
 
jacob navia

JC said:
This is almost not a bad idea, but the major problem with it is
string16 and string32 are now different types, when they arguably
should not be. Are there special rules for converting from string16 ->
string32? What if you want to pass a string8 to a function that
expects a string32? What if the function modifies the string through
its parameter? Then the string8 -> string32 conversion must make
changes to the string32 visible in the string8. What if you want to
convert a string32 to a string16? Should it do a run-time check and
fail if the string32's length is larger than 65535?

The library of lcc-win uses 32 bits for the string length.
This gives up to 4GB strings. I think that for most applications
this will be enough.

Other schemes are possible. For instance you store the length in
the first byte if the length is smaller than 255. If it is bigger than
254, you store 255 in the first byte and the length in the following two
bytes. If the length is bigger than 65534, you store 65535 in the
two bytes and store the length of the string in the next 4 bytes.
Etc.

Since most strings are smaller than 255 chars it will save a lot of
space for the normal cases.
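
A rough sketch of one way such a prefix could be decoded (little-endian byte
order assumed for illustration; the lcc-win library itself uses the flat
32-bit length mentioned above):

#include <stddef.h>

/* Variable-width length prefix as described above: 1 byte for lengths
   below 255, 1+2 bytes up to 65534, and 1+2+4 bytes beyond that. */
static size_t decode_length(const unsigned char *p, size_t *header_bytes)
{
    if (p[0] < 255) {
        *header_bytes = 1;
        return p[0];
    }
    size_t len16 = (size_t)p[1] | ((size_t)p[2] << 8);
    if (len16 < 65535) {
        *header_bytes = 3;
        return len16;
    }
    *header_bytes = 7;
    return (size_t)p[3] | ((size_t)p[4] << 8)
         | ((size_t)p[5] << 16) | ((size_t)p[6] << 24);
}

Every access to the string pays this small decode step, which is the parsing
cost mentioned in the reply below.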

Storing the length with the string does not protect against buffer
overrun attacks.

Well, it makes strcpy much faster and 100% sure. Surely a way of making
a buffer overflow attack is to supply a string too long for strcpy, so
your statement makes no sense at all...

Not on embedded systems. The ATMega168 I have sitting on my desk right
now has 512 bytes of read-only storage for data, and 16kB more
available for program + data. Every byte counts. These are not
exceptional cases; it's fairly common hardware, albeit somewhat
specialized.

Do those processors need a lot of string handling?

And, by the way, why should the language be tailored to shitty
hardware? A scheme as I described above would in most cases
just waste ZERO bytes: since most strings in the ATMega168 will be
smaller than 255 bytes, and we do not need the terminating
zero, the space wasted is NIL!!!

No. This is not correct. Consider this function, which takes as input
a full path name (e.g. "c:\\windows\\notepad.exe") and prints the last
path component ("notepad.exe"), without modifying the input.

Well, you make a temporary copy or you make a fat pointer to it.
Both can be done with the lcc-win library.

void print_file_name(const char *fullpath) {
    const char *filename = strrchr(fullpath, PATH_SEPARATOR);
    /* skip the separator itself when one was found */
    printf("file name only: %s\n", filename ? filename + 1 : fullpath);
}


I challenge you to implement that with counted strings without leaking
memory or allocating new memory. As a handicap, you may assume that a
function "last_index_of(counted_string, char)" exists that returns the
index of the last occurrence of the character in the specified
counted_string (say it returns -1 if not found).

This is very easy. Just copy the part you want.

Note that your solution makes an alias to the string, an alias that can
mean a crash if it's stored away somewhere and the original string is
discarded with free().


It is absolutely not a difficulty. But (as you may know) modifying the
input string is not possible in many cases.
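
As an aside, such a "fat pointer" could be as little as a non-owning view:
a length plus a pointer into the original buffer. A hypothetical sketch of
that idea (not the lcc-win type):

#include <stddef.h>

/* Non-owning view into memory owned by someone else. */
struct strview {
    const char *data;
    size_t      len;
};

/* View of the last path component, without copying or allocating.
   The view is only valid as long as the original buffer is. */
struct strview file_name_view(struct strview path, char sep)
{
    size_t i = path.len;
    while (i > 0 && path.data[i - 1] != sep)
        i--;
    return (struct strview){ path.data + i, path.len - i };
}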
 
JC

Here are some potential advantages:
(1) The struct takes more space - 4 bytes for the length vs 1 for
the nul.
(2) The code for a nul terminated string typically is more
compact.  Compare
    for (i = 0; i<s.length; i++) {... s.data ...}

    for (ptr = s; *ptr; ptr++) {... *ptr ...}

or even (if we don't need to preserve the pointer to the string)
    for (; *s; s++) {... *s ...}

It may be more compact, but is it easier to make a silly mistake
that compiles without a warning?



Well, it's always easy to make a silly mistake that compiles without a
warning, e.g.:

for (i = 0; i>s.length; i++) {... s.data ...}

Anyways, as much as I like null-terminated strings, I don't think
"compact" code is a compelling reason to use them (where "compact" = 5
or 6 characters shorter). There are far more compelling reasons to
stick to null-terminated strings, most notably ease of manipulation.
There are compelling reasons to stick to counted strings, too, e.g.
constant-time length computation and the ability to have strings with
0's as part of the data.

Jason
 
JC

This is almost not a bad idea, but the major problem with it is
string16 and string32 are now different types, when they arguably
should not be.
[snip]

Other schemes are possible. For instance you store the length in
the first byte if the length is smaller than 255. If it is bigger than
254, you store 255 in the first byte and the length in the following two
bytes. If the length is bigger than 65534, you store 65535 in the
two bytes and store the length of the string in the next 4 bytes.
Etc.


Heavens no. Having to parse your strings every time you want to access
them isn't something I'm comfortable with. And what happens if you
have a 254-character string that you want to append a single character
to? You have to copy the whole thing?
Well, it makes strcpy much faster and 100% sure. Surely a way of making
a buffer overflow attack is to supply a string too long for strcpy, so
your statement makes no sense at all...


No. The count is the length of the string, not its capacity. It does
not protect against buffer overruns and has no bearing on strcpy's
security (although it certainly can improve performance, that much you
are right about). You still have to use the counted string equivalent
of strncpy and pass the buffer size as a separate parameter.
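
To make that concrete, a hypothetical counted-string copy might look like
this (the names are invented for illustration). The stored length describes
the source; the destination's capacity still has to be supplied, and trusted,
separately:

#include <stddef.h>
#include <string.h>

struct cstring { size_t len; char *data; };   /* hypothetical counted string */

/* Copy src into dst, which can hold at most dst_capacity bytes of data.
   The source's count does nothing to protect dst; only the capacity
   argument does, and the caller has to get it right. */
int cstring_copy(struct cstring *dst, size_t dst_capacity,
                 const struct cstring *src)
{
    if (src->len > dst_capacity)
        return -1;                 /* refuse rather than overrun */
    memcpy(dst->data, src->data, src->len);
    dst->len = src->len;
    return 0;
}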

Do those processors need a lot of string handling?

Maybe, maybe not. Depends on your application.

And, by the way, why should the language be tailored to shitty
hardware?
Erm...


A scheme as I described above would in most cases
just waste ZERO bytes: since most strings in the ATMega168 will be
smaller than 255 bytes, and we do not need the terminating
zero, the space wasted is NIL!!!

Or you could just use null-terminated strings, which always require 1
byte more than the length of the string, and don't have any of the
string parsing issues mentioned above. It is never possible to have a
counted string that takes up less space than a null-terminated string,
although it is possible to have them take the same amount of space.
Also note that the size of the code required to parse the length out
of a string becomes significant on those platforms.

Well, you make a temporary copy or you make a fat pointer to it.
Both can be done with the lcc-win library.

There are certainly compelling reasons to use counted strings, but
that is not one of them. Avoiding temporary copies and fat pointers is
a compelling reason to *not* use counted strings.

This is very easy. Just copy the part you want.

Note that I said "without ... allocating new memory". Copying the part
you want does not fit the requirements.

Note that your solution makes an alias to the string, an alias that can
mean a crash if it's stored away somewhere and the original string is
discarded with free().

The 'filename' never escapes from that function and this is not a
possibility. If 'filename' did escape, it would be with the
understanding that it becomes invalid when the original string is
freed. This is the same caveat that applies to, say, strchr(). It is
generally not problematic.

If an application requires that the original string be freed without
invalidating the part, then a copy can be made. That is orthogonal to
whether or not you're using null-terminated or counted strings,
although it may appear related because using counted strings always
forces you to make a copy whether you need to or not.

It is absolutely not a difficulty.

I'm curious to see your implementation, please elaborate.

But (as you may know) modifying the
input string is not possible in many cases.

This is not a valid premise as it is not only false, but does not
relate to any of the arguments being made anyways.


Jason
 
JC

This is not a valid premise as it is not only false, but does not
relate to any of the arguments being made anyways.

I should say 'groundless', not 'false'. This is something that is
heavily dependent on the nature of the code you're writing and you
really can't make any assumptions about the ratio of non-const to
const strings in an application, let alone use those assumptions as a
basis for any type of argument.

Jason
 
James Kuyper

Tony said:
Well "duh". Once you define "string" to mean "null terminated sequence of
chars", then obviously you write the functions to fit that PARADIGM.

You criticized the definition of the term, rather than the design of the
library that it was used to describe. I gave you an answer that matched
your criticism. Next time, try to be clearer about what it is that
you're really criticizing.
Yeah, so? That is obvious and follows directly from the definition of a C
string as being a null-terminated thing.

You've got cause and effect reversed. The fact that the standard chose
to define strings as being null-terminated follows from the fact that
the C standard library was designed to use null-terminated strings, not
vice-versa.
Did you just give the following argument: "god created all things, hence
there must be a god"?

No. It only seems that way because of your misconception that the
definition of "string" came before the design of the relevant standard
library functions. Formal definitions came much later; initially
there was just code (presumably with some design documentation, though I
have no idea how much documentation existed prior to the time that the
first string-handling function was written for what eventually became
that library).

What I actually just gave you was the argument that if you kept the
design of the standard library exactly the way it is, and changed the
definition of 'string' in the C standard so that it now referred to
counted strings rather than null terminated ones, then re-writing the
standard to correctly describe the unchanged design of the C standard
library would require a major rewrite that would make the standard both
marginally longer and harder to understand. That is because it would not
contain a single use of the word "string" that was not modified with the
phrase "null terminated".
 
Dik T. Winter

> Well said (meaning, I read that, understand it). I still don't see the
> benefit of null terminated strings over a struct-like thing:
>
> struct string
> {
> uint32 length;
> char* data;
> };

And the advantage of that? See:
struct string a = {10, "abcde"};
a.length = 100;
...
 
CBFalconer

Keith said:
.... snip ...
If calloc accepts calls for items with a net size greater than
SIZE_MAX, the code is in error. calloc should simply return NULL
for such a call.

An implementation can certainly avoid the whole issue by having
calloc() return NULL whenever the requested size exceeds SIZE_MAX,
and I suspect most implementations do that.

But suppose an implementation returns a non-null result for
calloc(SIZE_MAX, 2). You can't directly apply sizeof to the
resulting object, since it's anonymous. You can try to compute,
for example, "sizeof (char[SIZE_MAX][2])", but you could try that
even if calloc() didn't exist. One compiler I tried issued a
compile-time diagnostic:

error: size of array 'type name' is too large

Here I maintain that calloc is faulty. The standard specifies no
failure mechanism, simply a failure to supply the memory, signalled
by a NULL. If it occurs only on compilation, it might be
acceptable. But that can't handle calls with variable parameters.
calloc itself can.
 
Keith Thompson

CBFalconer said:
Keith said:
... snip ...
If calloc accepts calls for items with a net size greater than
SIZE_MAX, the code is in error. calloc should simply return NULL
for such a call.

An implementation can certainly avoid the whole issue by having
calloc() return NULL whenever the requested size exceeds SIZE_MAX,
and I suspect most implementations do that.

But suppose an implementation returns a non-null result for
calloc(SIZE_MAX, 2). You can't directly apply sizeof to the
resulting object, since it's anonymous. You can try to compute,
for example, "sizeof (char[SIZE_MAX][2])", but you could try that
even if calloc() didn't exist. One compiler I tried issued a
compile-time diagnostic:

error: size of array 'type name' is too large

Here I maintain that calloc is faulty. The standard specifies no
failure mechanism, simply a failure to supply the memory, signalled
by a NULL. If it occurs only on compilation, it might be
acceptable. But that can't handle calls with variable parameters.
calloc itself can.

You snipped the actual question, which was to cite a specific clause
in the standard that such an implementation would violate.

Once again, assume the following (I'll change the example a bit):

sizeof (short) == 2

The following:
short *ptr = calloc(SIZE_MAX, sizeof *ptr);
succeeds, causing ptr to point to the first element of an array
object whose size is SIZE_MAX*2 bytes. You can successfully read
and write all SIZE_MAX short objects in the allocated object, from
ptr[0] to ptr[SIZE_MAX-1].

sizeof (short[SIZE_MAX]) is rejected with a diagnostic at compile
time. (I would expect this to happen for any compiler where
sizeof (short) > 1.)

You claim that this hypothetical implementation is non-conforming.
What specific clause of the standard does it violate?
 
Harald van Dijk

You snipped the actual question, which was to cite a specific clause in
the standard that such an implementation would violate.

Once again, assume the following (I'll change the example a bit):

sizeof (short) == 2

The following:
short *ptr = calloc(SIZE_MAX, sizeof *ptr);
succeeds, causing ptr to point to the first element of an array
object whose size is SIZE_MAX*2 bytes. You can successfully read
and write all SIZE_MAX short objects in the allocated object, from
ptr[0] to ptr[SIZE_MAX-1].

sizeof (short[SIZE_MAX]) is rejected with a diagnostic at compile
time. (I would expect this to happen for any compiler where sizeof
(short) > 1.)

You claim that this hypothetical implementation is non-conforming. What
specific clause of the standard does it violate?

I don't know who posted this first, but fill the memory, call strlen, and
any result it can give renders the implementation nonconforming as
strlen's description has no exception for strings longer than SIZE_MAX
characters.
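
A sketch of the construction being described, assuming the hypothetical
implementation in which this calloc call succeeds (any real implementation
should simply return NULL here):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = calloc(SIZE_MAX, 2);  /* assumed to succeed for the argument */
    if (p != NULL) {
        memset(p, 'x', SIZE_MAX);   /* SIZE_MAX non-zero characters...      */
        p[SIZE_MAX] = 'x';          /* ...plus one; the rest is still zero  */
        /* The string now has SIZE_MAX + 1 characters before its terminator,
           so there is no value strlen can correctly return. */
        printf("%zu\n", strlen(p));
    }
    return 0;
}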
 
Keith Thompson

Harald van Dijk said:
You snipped the actual question, which was to cite a specific clause in
the standard that such an implementation would violate.

Once again, assume the following (I'll change the example a bit):

sizeof (short) == 2

The following:
short *ptr = calloc(SIZE_MAX, sizeof *ptr);
succeeds, causing ptr to point to the first element of an array
object whose size is SIZE_MAX*2 bytes. You can successfully read
and write all SIZE_MAX short objects in the allocated object, from
ptr[0] to ptr[SIZE_MAX-1].

sizeof (short[SIZE_MAX]) is rejected with a diagnostic at compile
time. (I would expect this to happen for any compiler where sizeof
(short) > 1.)

You claim that this hypothetical implementation is non-conforming. What
specific clause of the standard does it violate?

I don't know who posted this first, but fill the memory, call strlen, and
any result it can give renders the implementation nonconforming as
strlen's description has no exception for strings longer than SIZE_MAX
characters.

Very interesting.

For reference, here's what C99 says about strlen:

7.21.6.3 The strlen function

Synopsis
1 #include <string.h>
size_t strlen(const char *s);

Description
2 The strlen function computes the length of the string pointed to
by s.

Returns
3 The strlen function returns the number of characters that precede
the terminating null character.

I'm still not comfortable with the indirectness of the reasoning
(that's a criticism of the standard, not of your analysis of it). But
I think you're right.

A counterargument could be made that if the length of the string can't
be represented as a size_t then the behavior is undefined by omission,
just as if s doesn't point to a string.

What I think I'd *like* the standard to say is that no object can
exceed SIZE_MAX bytes, that any attempt to declare such an object
invokes undefined behavior, and that calloc(X, Y), where X * Y exceeds
SIZE_MAX, must return NULL. Your argument suggests that it already
implies all of this; I wish it did so more explicitly.
 
CBFalconer

Keith said:
CBFalconer said:
Keith said:
... snip ...

If calloc accepts calls for items with a net size greater than
SIZE_MAX, the code is in error. calloc should simply return NULL
for such a call.

An implementation can certainly avoid the whole issue by having
calloc() return NULL whenever the requested size exceeds SIZE_MAX,
and I suspect most implementations do that.

But suppose an implementation returns a non-null result for
calloc(SIZE_MAX, 2). You can't directly apply sizeof to the
resulting object, since it's anonymous. You can try to compute,
for example, "sizeof (char[SIZE_MAX][2])", but you could try that
even if calloc() didn't exist. One compiler I tried issued a
compile-time diagnostic:

error: size of array 'type name' is too large

Here I maintain that calloc is faulty. The standard specifies no
failure mechanism, simply a failure to supply the memory, signalled
by a NULL. If it occurs only on compilation, it might be
acceptable. But that can't handle calls with variable parameters.
calloc itself can.

You snipped the actual question, which was to cite a specific clause
in the standard that such an implementation would violate.

Once again, assume the following (I'll change the example a bit):

sizeof (short) == 2

The following:
short *ptr = calloc(SIZE_MAX, sizeof *ptr);
succeeds, causing ptr to point to the first element of an array
object whose size is SIZE_MAX*2 bytes. You can successfully read
and write all SIZE_MAX short objects in the allocated object, from
ptr[0] to ptr[SIZE_MAX-1].

sizeof (short[SIZE_MAX]) is rejected with a diagnostic at compile
time. (I would expect this to happen for any compiler where
sizeof (short) > 1.)

You claim that this hypothetical implementation is non-conforming.
What specific clause of the standard does it violate?

I claim that calloc didn't really succeed. It just converted the
total size requested using the usual unsigned conversions. It
returned a pointer to a physical object, which was NOT (SIZE_MAX *
2) big. It can't be, since no object can exceed SIZE_MAX. THIS IS
NOT A TYPE. THIS calloc IS FAULTY.

I had exactly this problem on my nmalloc package, and it forced me
to include calloc in the package to ensure that all assignments
were accurate.

I believe (but may be wrong) that somewhere the standard specifies
that the action of calloc is to call malloc, and then initialize
the result.
 
