Null terminated strings: bad or good?

Tony

I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length. I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

Tony
 
JC

Tony said:
I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length.

These are sometimes known as "Pascal-style" strings. The main issue is
that the length of the string is limited by the maximum value that can
be stored in the length field; in Pascal, it was a single byte, limiting
strings to 255 characters. Other variants are in use that you may run
into in C. For example, the Windows API defines a "BSTR" type, which
consists of a 4-byte length field followed by the string data; the
pointers you deal with point to the start of the data (4 bytes after
the start of the allocated block).

Tony said:
I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null.

These are normally known as "C-style" strings. The main advantage is
the length of the string is limited only by available memory, and the
length field is not stored with the string, thus conserving storage
space.

Another major advantage to storing null-terminated strings is the
strings can be modified in place with minimal effort; truncating a
string is a matter of simply setting the new end byte to 0, "removing"
the prefix of a string can be done simply by referring to a location
past the beginning, dividing strings into substrings can be done by
placing 0's where appropriate. As an exercise, try implementing
strtok() with Pascal-style strings. You may be surprised at the
difficulty.
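
A minimal sketch of the three in-place operations described above. The helper names (truncate_at, skip_prefix, split_commas) are made up for illustration; only strlen() and strtok() come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncate s to at most n characters by writing a new terminator. */
void truncate_at(char *s, size_t n)
{
    assert(s != NULL);
    if (n < strlen(s))
        s[n] = '\0';
}

/* "Remove" a prefix of n characters: no copy, just a pointer further in. */
char *skip_prefix(char *s, size_t n)
{
    assert(s != NULL && n <= strlen(s));
    return s + n;
}

/* Split s in place on commas. Stores up to max token pointers in out and
   returns how many were stored. strtok() overwrites each delimiter with a
   null, so every token is itself a C string living inside s. */
size_t split_commas(char *s, char **out, size_t max)
{
    size_t n = 0;
    char *tok;

    for (tok = strtok(s, ","); tok != NULL && n < max; tok = strtok(NULL, ","))
        out[n++] = tok;
    return n;
}
```

None of these allocate or copy anything; the substrings share storage with the original buffer, which is also why the original contents are destroyed.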

The main disadvantage of C-style strings is that computing the length
is O(n), but applications that need constant-time length can easily
get it by storing the length elsewhere.

Jason
 
Tomás Ó hÉilidhe

Tony said:
I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length. I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

Tony


For one thing it's handy. You have code such as:

#include <ctype.h>

char *p;

/* Lowercase str in place; the loop stops at the terminating null. */
for (p = str; *p; ++p)
{
    *p = tolower((unsigned char)*p);
}

I use null-terminated arrays in my programs, not just for strings. For
instance in my current project I have a null-terminated array of IP
addresses, and also a null-terminated array of MAC addresses. It
allows for simpler code such as:

for (p = ip_addresses; *p; ++p)
{
    if (*p == ip_default_gateway)
        DoSomething();
}

If the number of IP addresses changes at runtime, I don't have to
update some global integer variable that indicates the size. If I did,
it would be something like:

char *p;
char const *const pend = ip_addresses +
                         global_variable_amount_ip_addresses;

for (p = ip_addresses; p != pend; ++p)
{
    if (*p == ip_default_gateway)
        DoSomething();
}
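
For comparison, here is a self-contained sketch of both loop styles. The element type is not given in the post, so this assumes IPv4 addresses stored as uint32_t with 0 (0.0.0.0) reserved as the terminator; ip4_t and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed representation: one IPv4 address per uint32_t, with the value 0
   reserved as the end-of-array sentinel. */
typedef uint32_t ip4_t;

/* Sentinel style: walk until the terminating 0 entry. */
size_t count_matches_sentinel(const ip4_t *addrs, ip4_t wanted)
{
    size_t hits = 0;
    const ip4_t *p;

    assert(wanted != 0);  /* the sentinel value itself cannot be searched for */
    for (p = addrs; *p != 0; ++p)
        if (*p == wanted)
            ++hits;
    return hits;
}

/* Counted style: the caller passes the number of entries. */
size_t count_matches_counted(const ip4_t *addrs, size_t n, ip4_t wanted)
{
    size_t hits = 0, i;

    for (i = 0; i < n; ++i)
        if (addrs[i] == wanted)
            ++hits;
    return hits;
}
```

The sentinel version needs no separate size variable, but one value of the element type can no longer be stored in the array. The counted version has no such restriction.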

Also I'd say the null-terminated way is faster.
 
Bartc

Tony said:
I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length. I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

I have used both at the same time: null-terminated strings together with a
length.

The null-terminatedness made it easy to interface to C-runtime and OS
functions.

The length made a few functions trivial, such as a version of strlen().

However you lose the advantage of having a zero-character as part of the
string data.
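
A rough sketch of such a hybrid, assuming a small struct that carries the length alongside a null-terminated buffer. The type hstr and its functions are invented for this example and are not taken from any existing library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hybrid string: a stored length plus a terminating null. */
typedef struct {
    size_t len;   /* number of bytes before the terminator */
    char  *data;  /* len bytes followed by '\0' */
} hstr;

/* Build an hstr from a C string; returns 0 on success, -1 on failure. */
int hstr_from_cstr(hstr *out, const char *s)
{
    size_t n;
    char *buf;

    assert(out != NULL && s != NULL);
    n = strlen(s);              /* the only O(n) scan */
    buf = malloc(n + 1);        /* +1 for the terminator */
    if (buf == NULL)
        return -1;
    memcpy(buf, s, n + 1);
    out->len = n;
    out->data = buf;
    return 0;
}

/* Length in constant time: the trivial strlen() replacement. */
size_t hstr_len(const hstr *h) { return h->len; }

/* The buffer can go straight to functions that expect a C string. */
const char *hstr_cstr(const hstr *h) { return h->data; }

void hstr_free(hstr *h)
{
    free(h->data);
    h->data = NULL;
    h->len = 0;
}
```

If the data ever contained a zero byte, the stored length and what the C library sees would disagree, which is the lost advantage mentioned above.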
 
Richard Tobin

Tony said:
I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length. I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

Your question is strangely phrased. Do you really want to ask about
whether "string" should mean that, or whether a programming language
should use null-terminated strings?

-- Richard
 
James Kuyper

Tony said:
I can think of a few good reasons to have "string" mean a contiguous series
of bytes and a length. I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

Within the context of the C language, there's lots of routines in the C
standard library that work with null-terminated strings, and only a few
that work with counted strings. Every routine that treats a string as a
series of characters rather than as raw memory recognizes null
termination, even those functions that also accept a count.
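
A small illustration of that last point, using two made-up 5-byte buffers: strncmp() accepts a count but still stops at the first null, whereas memcmp() treats its arguments as raw memory.

```c
#include <string.h>

/* Two 5-byte buffers that agree up to an embedded null and differ after it. */
static const char buf_a[5] = { 'a', 'b', '\0', 'x', 'x' };
static const char buf_b[5] = { 'a', 'b', '\0', 'y', 'y' };

/* strncmp() takes a count, but characters after a null are not compared. */
int same_as_strings(void)
{
    return strncmp(buf_a, buf_b, sizeof buf_a) == 0;
}

/* memcmp() compares all 5 bytes, including those after the null. */
int same_as_memory(void)
{
    return memcmp(buf_a, buf_b, sizeof buf_a) == 0;
}
```

Here same_as_strings() reports the buffers as equal and same_as_memory() reports them as different.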

Therefore, if the standard used 'string' as shorthand for 'counted
string' rather than 'null-terminated string', the text of the standard
would be marginally longer and harder to read.
 
Thad Smith

Richard said:
Your question is strangely phrased.

Strictly speaking, there was no question. I sense an implied question
of "what are the advantages of null-terminated strings?" for the purpose
of increasing general programming knowledge. I agree with the earlier
posts that it is an easier implementation and eliminates the need for
selecting a size for a length variable.
 
jacob navia

Thad said:
Strictly speaking, there was no question. I sense an implied question
of "what are the advantages of null-terminated strings?" for the purpose
of increasing general programming knowledge. I agree with the earlier
posts that it is an easier implementation and eliminates the need for
selecting a size for a length variable.

There are many advantages to zero terminated strings:

o Performance. You must scan the whole string every time you want to
  know how long it is, possibly millions of times. This always needs
  a faster processor, so you can count on C to get that new game
  machine you were dreaming of.
o Security. Since there is no way to know the length without scanning
  the supposed string, if there is no terminating zero your program will
  crash, or even corrupt other variables if you are writing to the
  string. This will increase the security provided by the already lax
  standards of C and will SURELY give the C++ people one more reason
  to say: "You see? C sucks".
o Ease of implementation. Instead of just keeping the length in a
  correct data structure and avoiding all the above problems, you have
  to program around them, increasing the code length and bug
  surface area. For instance, to catenate two strings (strcat operation),
  instead of just adding the two lengths, allocating space, then
  copying, you have to scan both strings to find their lengths,
  allocate one byte more to store the zero (how many millions
  of newbie bugs could have been avoided), etc.
  Zero terminated strings make it easy to inject bugs in your code,
  hence they are easy to implement.
o Virus writers would have had a lot of work to do if those strings
  weren't there to ease the job for them.
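
To make the concatenation comparison concrete, here is a hedged sketch of both variants. The counted_str type and both function names are invented for this illustration (this is not the lcc-win library), and overflow of the size arithmetic is not checked.

```c
#include <stdlib.h>
#include <string.h>

/* Null-terminated: both inputs are scanned to find their lengths, and the
   allocation needs one extra byte for the terminator. Caller frees. */
char *concat_cstr(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *r = malloc(la + lb + 1);

    if (r == NULL)
        return NULL;
    memcpy(r, a, la);
    memcpy(r + la, b, lb + 1);   /* copies b's terminator too */
    return r;
}

/* Hypothetical counted string: length stored with the bytes, no terminator. */
typedef struct {
    size_t len;
    char  *data;
} counted_str;

/* Counted: the lengths are already known, so there is no scan and no
   extra byte. Returns 0 on success, -1 on failure; caller frees out->data. */
int concat_counted(counted_str *out, const counted_str *a, const counted_str *b)
{
    size_t total = a->len + b->len;
    char *r = malloc(total ? total : 1);  /* avoid a zero-size request */

    if (r == NULL)
        return -1;
    memcpy(r, a->data, a->len);
    memcpy(r + a->len, b->data, b->len);
    out->len = total;
    out->data = r;
    return 0;
}
```

The forgotten "+ 1" in the first function is the classic bug referred to in the list; the second function has no equivalent place to make that mistake.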

The lcc-win compiler features a string library without any zero
termination.

I have discussed this subject many times in this group, which is full
of retrograde people that LOVE zero terminated strings. I have had
my dose of them and will not answer any of their "arguments".

jacob
 
JC

You'll notice I did heed your desire
for me to reply in-thread

Thanks! :)

Please cite the C standard for your claim that the length of a C string
is limited *only* by available memory.

In fact, the maximum length of a string is not even limited by
available memory. [7.1.1.1] (in C99, TC2) defines a "string" and does
not define a limit on its length. There is no number that exists such
that if the length of a string exceeded that number, it would not be
considered a "string" as defined by the standard. The maximum length
of a string is actually infinite.

In practice, the maximum length of a string is a function of the
amount of storage space available where you are storing the string
(the standard does not restrict its definition of a string to include
only data that resides in memory as opposed to, say, on disk), and
also of the methods you are using to access that data.

For example, strings in memory are, in practice, limited by the
maximum value of size_t (including the null), since that's the largest
block of memory you would be able to allocate using standard functions
like malloc(). The size_t maximum in memory, however, is only a
consequence of how you would obtain the memory to store the string in,
and is independent of the definition of a string. The behavior of
strlen() is also not defined if the string length exceeds SIZE_MAX,
although that does not in itself place any limits on the maximum
length.

As another example, a string residing in a file on a disk or other
medium is limited only by available space on that medium. While you
may not necessarily be able to directly access it with C's string
functions (instead having to read it sequentially with functions like
fgetc()), it still does not violate the definition of a string: "a
contiguous sequence of characters terminated by and including the
first null character." [7.1.1.1]

In that example, any limits on the maximum size of a file that an
implementation of fopen() can deal with are only a matter of
inconvenience. You could concoct a scheme of storing a string in
multiple files, opened with fopen(), accessing it only sequentially
(note that the C standard does not include a random-access requirement
in the definition of a string), and it still is, in fact, a C string.

You may be thinking of the maximum length of a string literal. AFAIK
the standard does not define a limit on that, either, but I have
encountered compilers with limits (I've had issues with GCC in the
past -- I don't recall the version or the length, 4096 rings a bell).
The maximum length of a string literal does not affect the maximum
length of a string.

Jason
 
JC

You'll notice I did heed your desire
for me to reply in-thread

Thanks! :)
Please cite the C standard for your claim that the length of a C string
is limited *only* by available memory.

In fact, the maximum length of a string is not even limited by
available memory. [7.1.1.1] (in C99, TC2) defines a "string" and does
not define a limit on its length. There is no number that exists such
that if the length of a string exceeded that number, it would not be
considered a "string" as defined by the standard. The maximum length
of a string is actually infinite.


For reference, the entire paragraph is:

[7.1.1.1] A string is a contiguous sequence of characters terminated
by and including the first null character. The term multibyte string
is sometimes used instead to emphasize special processing given to
multibyte characters contained in the string or to avoid confusion
with a wide string. A pointer to a string is a pointer to its initial
(lowest addressed) character. The length of a string is the number of
bytes preceding the null character and the value of a string is the
sequence of the values of the contained characters, in order.


Jason
 
Eric Sosman

JC said:
You'll notice I did heed your desire
for me to reply in-thread

Thanks! :)
Please cite the C standard for your claim that the length of a C string
is limited *only* by available memory.

In fact, the maximum length of a string is not even limited by
available memory. [7.1.1.1] (in C99, TC2) defines a "string" and does
not define a limit on its length. There is no number that exists such
that if the length of a string exceeded that number, it would not be
considered a "string" as defined by the standard. The maximum length
of a string is actually infinite.

In a freestanding implementation, perhaps, but in a hosted
implementation the strlen() function must be able to return "the
number of characters that precede the terminating null character"
(7.21.6.3p3). Since it returns this count as a size_t value, it
follows that the count cannot exceed SIZE_MAX, a finite number.
 
JC

In fact, the maximum length of a string is not even limited by
available memory. [7.1.1.1] (in C99, TC2) defines a "string" and does
not define a limit on its length. There is no number that exists such
that if the length of a string exceeded that number, it would not be
considered a "string" as defined by the standard. The maximum length
of a string is actually infinite.

     In a freestanding implementation, perhaps, but in a hosted
implementation the strlen() function must be able to return "the
number of characters that precede the terminating null character"
(7.21.6.3p3).  Since it returns this count as a size_t value, it
follows that the count cannot exceed SIZE_MAX, a finite number.

Right, but that only defines the behavior of strlen(), and does not
place a restriction on the theoretical maximum length of a string. It
does place a practical restriction on how long you would want a string
to be, but it leaves the behavior of strlen() for strings larger than
SIZE_MAX undefined. A more important practical limitation is how you
would obtain the char* that you would pass to strlen() in the first
place.

In fact, AFAIK, calloc() will allow you to allocate a block of
memory larger than SIZE_MAX bytes. In that case, something like this
seems like valid C that uses a valid string, but invokes UB in
strlen() (assuming that you are on a platform where calloc() succeeds):


// Note: C99
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main (void) {

  char *str = calloc(SIZE_MAX, SIZE_MAX), *ptr;
  size_t a, b, len;
  assert(str);

  for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
    for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
      *ptr = 'a';

  len = strlen(str); // undefined

  return 0;

}


In other words, strlen()'s range is independent of the maximum length
of a string, even though it does place a limitation on the maximum
length of a useful string.
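
Whether calloc() can really hand back such a block depends on the implementation. As a hedged aside: many C libraries appear to check whether nmemb * size fits in a size_t and return a null pointer when it does not, which is an assumption about typical behaviour rather than something taken from the paragraph quoted above. A sketch of that kind of check, with hypothetical names product_overflows and checked_calloc:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if nmemb * size does not fit in a size_t, else 0. */
int product_overflows(size_t nmemb, size_t size)
{
    return size != 0 && nmemb > SIZE_MAX / size;
}

/* calloc() wrapper that refuses any request whose total size would
   exceed SIZE_MAX, so strlen() on the result can always return a
   representable value. */
void *checked_calloc(size_t nmemb, size_t size)
{
    if (product_overflows(nmemb, size))
        return NULL;
    return calloc(nmemb, size);
}
```

With a wrapper like this, checked_calloc(SIZE_MAX, SIZE_MAX) and checked_calloc(3, SIZE_MAX / 2) both fail up front, so the undefined strlen() call in the example can never be reached.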

Jason
 
JC

JC said:
// Note: C99
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main (void) {

  char *str = calloc(SIZE_MAX, SIZE_MAX), *ptr;
  size_t a, b, len;
  assert(str);

  for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
    for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
      *ptr = 'a';

Oops, I'm sorry, this has a bug in it and does not fill the entire
area, replace that loop with:

   for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
     for (b = 0; b != SIZE_MAX; ++ b, ++ ptr)
       *ptr = 'a';
   ptr[-1] = 0;


Jason
 
JC

  for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
    for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
      *ptr = 'a';

Oops, I'm sorry, this has a bug in it and does not fill the entire
area, replace that loop with:

   for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
     for (b = 0; b != SIZE_MAX; ++ b, ++ ptr)
       *ptr = 'a';
   ptr[-1] = 0;


It may have some other bugs in it. How embarrassing. Anyways, just
take it to mean "filling the entire allocated area with the character
'a', except for a 0 at the very end". Apologies for the sloppy
example.

Jason
 
jacob navia

JC said:
for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
*ptr = 'a';
Oops, I'm sorry, this has a bug in it and does not fill the entire
area, replace that loop with:

for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
for (b = 0; b != SIZE_MAX; ++ b, ++ ptr)
*ptr = 'a';
ptr[-1] = 0;


It may have some other bugs in it. How embarrassing.

As I said in my message:

Zero terminated strings are a CATASTROPHE.

You are a knowledgeable C programmer. And yet you have
proved that you can forget adding the damned terminating
zero.

WHY do we go on doing this nonsense?

I have proposed a string library with counted strings. It was
rejected because it is "non conforming"...
 
Flash Gordon

jacob said:
JC said:
for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
*ptr = 'a';
Oops, I'm sorry, this has a bug in it and does not fill the entire
area, replace that loop with:

for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
for (b = 0; b != SIZE_MAX; ++ b, ++ ptr)
*ptr = 'a';
ptr[-1] = 0;


It may have some other bugs in it. How embarrassing.

As I said in my message:

Zero terminated strings are a CATASTROPHE.

Which is an overstatement.

You are a knowledgeable C programmer. And yet you have
proved that you can forget adding the damned terminating
zero.

On quick hack code where any testing would have shown the problem.

WHY do we go on doing this nonsense?

I have proposed a string library with counted strings. It was
rejected because it is "non conforming"...

Implement it in a way that allows it to be used with any compiler and
people *might* be more interested. Stop engaging in hyperbole and people
are more likely to take you seriously.
 
JC

JC said:
  for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
    for (b = 0; b != SIZE_MAX - 1; ++ b, ++ ptr)
      *ptr = 'a';
Oops, I'm sorry, this has a bug in it and does not fill the entire
area, replace that loop with:
   for (a = 0, ptr = str; a != SIZE_MAX; ++ a)
     for (b = 0; b != SIZE_MAX; ++ b, ++ ptr)
       *ptr = 'a';
   ptr[-1] = 0;
It may have some other bugs in it. How embarrassing.

You are a knowledgeable C programmer. And yet you have
proved that you can forget adding the damned terminating
zero.

There were two problems in my code, both related to broken loop logic:

1) I left too many terminating 0's (note that calloc zero-
initializes the data, hence why the original example did not
explicitly add a 0).

2) The fill does not cover the entire range of a size_t (the
condition != SIZE_MAX is one short).

You are right in that if strings were preceded by a length field,
problem (1) would not exist, and I would not have had to consider it.
That does simplify certain things. On the other hand, problem (2) is
independent of the string representation, and was just a result of a
shoddily coded example.

To be fair, the example is an edge case, and the issues I ran into
were mostly related to dealing with upper limits of integers, and that
testing is not possible on my machine. My hasty example is not
compelling evidence against 0-terminated strings, although it is
compelling evidence for being more careful about loop indexes.
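
The loop-index problem can be reproduced at a size that actually runs. The sketch below uses uint8_t as a stand-in for size_t (an assumption made purely so the counts are small); the function names are made up.

```c
#include <stdint.h>

/* A loop written like the example, with the condition i != MAX.
   With an 8-bit index it runs for i = 0..254 and never visits 255. */
unsigned visits_with_ne_max(void)
{
    unsigned n = 0;
    uint8_t i;

    for (i = 0; i != UINT8_MAX; ++i)
        ++n;
    return n;
}

/* Visits every value 0..255 exactly once: the test comes after the body,
   and the loop stops once the index has just been UINT8_MAX. */
unsigned visits_full_range(void)
{
    unsigned n = 0;
    uint8_t i = 0;

    do {
        ++n;
    } while (i++ != UINT8_MAX);
    return n;
}
```

The first function counts 255 iterations and the second 256. A plain "i <= UINT8_MAX" condition would not help, since it is always true for an 8-bit unsigned index and the loop would never end.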

A more manageable example would probably have been something like
calloc(SIZE_MAX - 1, SIZE_MAX - 1) (although things like calloc(3,
SIZE_MAX / 2) still illustrate the point).

So, here is another attempt to redeem my poor example, this time with
a bit more care; it's untested but, with a bit of luck, *may* be
correct:


// Note: C99
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

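int main (void) {

  /* (SIZE_MAX-1) * (SIZE_MAX-1) bytes, zero-initialized by calloc */
  char *str = calloc(SIZE_MAX-1, SIZE_MAX-1), *ptr;
  size_t a, b, len;
  assert(str);

  /* fill exactly the allocated block, then restore the final 0 */
  for (a = 0, ptr = str; a < SIZE_MAX-1; ++ a)
    for (b = 0; b < SIZE_MAX-1; ++ b, ++ ptr)
      *ptr = 'a';
  ptr[-1] = 0;

  len = strlen(str); // undefined

  return 0;

}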

I have proposed a string library with counted strings. It was
rejected because it is "non conforming"...

A string library with counted strings could certainly be useful in
some situations, have you implemented it and made it available on
line?

Jason
 
Keith Thompson

jacob navia said:
As I said in my message:

Zero terminated strings are a CATASTROPHE.

OH NO! IT'S THE END OF THE WORLD AS WE KNOW IT!! WE'RE ALL DOOMED!!!

Be serious.

Programmers have been using zero terminated strings for decades. They
have advantages and disadvantages.

You are a knowledgeable C programmer. And yet you have
proved that you can forget adding the damned terminating
zero.

Yes, anyone can make a mistake -- especially in a tiny piece of
software posted to Usenet. One mistake proves nothing.

WHY do we go on doing this nonsense?

I have proposed a string library with counted strings. It was
rejected because it is "non conforming"...

As I recall, it can only be used with the lcc-win compiler, because it
relies on language extensions, particularly operator overloading. Can
I use it with, say, gcc on my Linux machine? Didn't think so. On
that basis, I "reject" it only in the sense that it's not of much use
to me; I have no objection to anyone else using it, as long as they're
aware of its limitations.
 
CBFalconer

JC said:
.... snip ...


Thanks! :)
.... snip ...

--
[ASCII-art sign: PLEASE DO NOT FEED THE TROLLS. Thank you, Management]

fix (vb.): 1. to paper over, obscure, hide from public view; 2.
to work around, in a way that produces unintended consequences
that are worse than the original problem. Usage: "Windows ME
fixes many of the shortcomings of Windows 98 SE". - Hutchinson
 
CBFalconer

jacob said:
.... snip ...

Zero terminated strings are a CATASTROPHE.

You are a knowledgeable C programmer. And yet you have proved that
you can forget adding the damned terminating zero.

If you wish you are perfectly free to write another string
library. Face the fact that it will not be compatible with normal
C programming, and that it will not be a component of the standard
library.

Then also consider that, for most string purposes, the existing
library is quite adequate, and efficient. In fact, it is even
debugged.
 
