Null terminated strings: bad or good?

Hallvard B Furuseth

CBFalconer said:
Wojtek said:
CBFalconer said:
long array[SIZE_MAX];
I maintain that, whenever (sizeof (long) > 1), that is a compile
error.

We know that you do, but we don't believe that you have
demonstrated that to be true.

I have quoted the appropriate portion of the standard. Any other
interpretation involves a contradiction.

Maybe nobody has made this clear yet, but the point is that the
standard has no concept of "compile error". It has circumstances
where the compiler is required to issue a diagnostic: for violations
of constraints, syntax rules, and for #error. See C99 5.1.1.3.

That's why people are pointing out that your quoted standard text is not
a constraint. There is indeed a contradiction, but it is in the
standard.

Faced with a requirement to produce an impossible executable, in real
life it makes sense for a compiler to refuse to compile. But that's
another matter.
 
Keith Thompson

Ben Pfaff said:
I would use size_t for the string length. In normal C
implementations, size_t is sufficient to hold the longest
possible string.

We've been debating that point in comp.std.c, but in practice, yes,
it's reasonable to assume that size_t is sufficient.

Using something smaller than size_t to hold the length of a
string-like object is what I'd consider to be a micro-optimization.
Sometimes you need to do that kind of thing (for example, storing 1 or
2 bytes rather than 4 on a severely memory-constrained system might
make sense), but in most cases it's a waste of effort. Or you can
define a separate "short string" type in addition to one that can
potentially store up to SIZE_MAX characters, but then you have to deal
with converting between the two types.

My advice: Keep It Simple -- use size_t to hold the length unless you
have a really good reason not to.
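
By way of illustration (not code from any post in this thread), a
counted-string type along the lines Keith describes might look like
the sketch below; the names str and str_from are hypothetical:

#include <stdlib.h>
#include <string.h>

/* Counted string: length and capacity are size_t, so the length
   field can represent any object the implementation can allocate. */
struct str {
    char *data;     /* not necessarily NUL-terminated */
    size_t length;  /* bytes in use */
    size_t cap;     /* bytes allocated */
};

/* Build a str from a C string; data is NULL on allocation failure. */
struct str str_from(const char *s)
{
    struct str r = { NULL, 0, 0 };
    size_t n = strlen(s);
    r.data = malloc(n ? n : 1);   /* avoid malloc(0) ambiguity */
    if (r.data != NULL) {
        memcpy(r.data, s, n);
        r.length = n;
        r.cap = n ? n : 1;
    }
    return r;
}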
 
Ian Collins

Tony said:
That's a relevant example then. But will I use a 32-bit length field in the
string class I am currently reworking based on that? Maybe, but probably
not. Maybe that will be "bigstring" rather than "string". Should a 32-bit
length field be used in a language-level, general string class? Maybe more
compelling there, but maybe a *choice* of lower overhead would be attractive:
"string", "smallstring".
As Ben says, use size_t. If you want to optimise for short strings,
follow the example of the commonly used short string optimisation in C++
standard libraries.
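
For readers unfamiliar with it, the short string optimisation Ian
mentions stores small strings inline in the object and uses the heap
only for long ones. A rough C analogue (a hypothetical layout, not any
actual library's) might be:

#include <stddef.h>

#define SSO_CAP 15  /* bytes storable inline, excluding the terminator */

struct sso_str {
    size_t len;
    union {
        char *heap;               /* used when len > SSO_CAP */
        char small[SSO_CAP + 1];  /* used when len <= SSO_CAP */
    } u;
};

/* Return the character data, wherever it currently lives. */
static char *sso_data(struct sso_str *s)
{
    return s->len <= SSO_CAP ? s->u.small : s->u.heap;
}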
 
Ian Collins

Amandil said:
Another point in defense of the NUL-terminated string is the
unwillingness on the part of the creators of C to create a new data
type whose handling happens transparently, out of the programmer's
sight. Other examples of "transparent" code would include
constructors, destructors, and the like, which you seem used to in
C++. But in C, the idea is to do exactly what the programmer expects
and writes, with nothing going on behind the scenes. The biggest
exception to that is copying structs (struct file a = b;), and perhaps
casting (usually explicit, or with a prototype). Adding a special
string type is not in C's design.
Just like the wildly popular _Complex type?
 
Phil Carmody

Tony said:
"only"? Did you say ONLY 65535-byte lines?! (Note: ASCII chars assumed).
Please do tell how "much more useful" it is to have lines longer than 64k?
The one example I can think of that may need "a lot" of chars is macro
preprocessor lines but 64k seems like a stretch there too.

Well, I've got files consisting of lines which are factors of
trinomials of degree >42 million, some of which are thousands
of terms long (and, as the exponents are up to 8 digits, you
can expect such lines to be well over 64k in length).

But evidently such expressions are inexpressible as strings,
and are probably simply infinite, using Hottentot metrics.

Phil
 
Phil Carmody

Tony said:
It's bizarre to handle buffers of data (and remember, the discussion stemmed
from reading-in a whole textfile as a single string) like that in the
everyday case is what I was saying. Your example is a special case that
requires special abstractions/handling.

With all due respect, I think you'll find that embedding null
characters into a C string requires special abstractions and
handling.

Phil
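
A small demonstration of Phil's point: the str* functions treat the
first zero byte as the end of the data, so embedded nulls force you to
carry an explicit length alongside the buffer.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "abc\0def";          /* 7 payload bytes + terminator */
    printf("%zu\n", strlen(buf));     /* prints 3: strlen stops at the NUL */
    printf("%zu\n", sizeof buf - 1);  /* prints 7: the full payload */
    return 0;
}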
 
CBFalconer

Keith said:
.... snip ...

Ok, let's consider another case, setting aside the huge allocation
issue for the moment:

int count = <some number>;
int *p = calloc(count, sizeof(int));

If calloc succeeds, it allocates an object whose size is
count*sizeof(int). What expression refers to that object? Note
that *p refers to an object whose size is sizeof(int); that's not
the object I'm talking about.

To put it another way:

sizeof <BLANK> == count*sizeof(int)

Fill in the blank in a manner that makes this expression true and
is directly relevant to the object allocated by calloc() in the
code above.

In this case things are no different than calling malloc with that
product as argument, except that the memory is not initialized. I
don't think there is any argument that that size is unavailable
with standard C methods. Replace BLANK by "((count) * sizeof
(int))", bearing in mind the unsigned calculation.

I greatly doubt that you will find ANY calloc that does other than
compute the product and call malloc. That means that the system
may report success, but has actually allocated much less than the
desired memory. To me, that is a glaring calloc fault.

To justify that, consider that it is not portably possible to
allocate more than one memory segment with malloc, and guarantee
that the segments are combinable.
 
CBFalconer

James said:
.... snip ...

For that matter, you've also failed to identify a problem that
having calloc() return NULL would solve. sizeof(*ptr) has the
exact same value, whether or not calloc() returned NULL.

i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.
 
CBFalconer

James said:
CBFalconer wrote:
.... snip ...


No, the problem is quite independent of calloc(), as has been
repeatedly pointed out. It is equally impossible for sizeof to
return the specified value when applied to types that are too big.

There are no "types that are too big". Their sizes have been
calculated using unsigned arithmetic, which cannot return a value
larger than SIZE_MAX. That result is in the symbol tables for
compilation.

If the compilation is on a cross-compiler, the authors should have
handled the appropriate size limits.
 
CBFalconer

James said:
.... snip ...

My interpretation of the relevant words implies that a conforming
implementation of C can

1. reject any declaration that refers to a type bigger than
SIZE_MAX, as exceeding an implementation limit.

2. have calloc(nmemb, size) return a non-null pointer to enough
memory to store an array of nmemb objects of the specified size, even
if nmemb*size has a mathematical value that is greater than
SIZE_MAX. It will return sufficient memory for the specified number
of objects of the specified size, even though the amount of memory
required is greater than the value of nmemb*size, interpreted as a
C expression rather than a mathematical one.

Please demonstrate the contradiction that you see in that
interpretation.

size_t is specified to be able to specify the size of ANY object.
The maximum value for a size_t item is SIZE_MAX. All calloc has to
do is check that the product of the specifications does not
overflow a size_t, and pass that to malloc. If it does overflow,
simply return NULL.

Also consider that the system has no known means of allocating that
'over SIZE_MAX' space in a contiguous block.
 
CBFalconer

jameskuyper said:
I can't think of a good reason why an implementation would want to
do that, but I don't see how accepting such a program violates any
of the standard's requirements.

It's a fine point, but the types are harmless. They don't create a
problem unless used in a declaration or sizeof. At that point, the
compiler should find an error, if it hasn't already flagged the
type.
 
Ben Pfaff

CBFalconer said:
i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

No, the operand of sizeof is not evaluated, except in one special
case:

The sizeof operator yields the size (in bytes) of its
operand, which may be an expression or the parenthesized
name of a type. The size is determined from the type of the
operand. The result is an integer. If the type of the
operand is a variable length array type, the operand is
evaluated; otherwise, the operand is not evaluated and the
result is an integer constant.
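
A minimal demonstration of that rule (C99 6.5.3.4p2): because *ptr is
not a variable length array type, it is never evaluated, and this
program is well defined even though ptr is a null pointer.

#include <stdio.h>

int main(void)
{
    int *ptr = NULL;                /* deliberately null */
    /* *ptr is not evaluated; sizeof uses only its type, int */
    printf("%zu\n", sizeof(*ptr));  /* prints sizeof(int), e.g. 4 */
    return 0;
}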
 
CBFalconer

Wojtek said:
CBFalconer said:
Wojtek said:
Where does the standard say that char[SIZE_MAX][SIZE_MAX] is
not a declarable type?

It says sizeof can return the size of a type. But it returns a
size_t, which has a maximum value of SIZE_MAX. This requires
that the declaration be an error, or at least unusable.

Unusable as an operand of sizeof, maybe. But it doesn't follow
that it must be unusable for other purposes.

No restriction on using it as a number. But size_t is intended to
measure the size of ANY object, which must first be created. That
means the object fits into memory (which may include disk
simulation of memory space). That also means that size_t matches
the addressing capabilities of the computing unit.
 
Keith Thompson

CBFalconer said:
In this case things are no different than calling malloc with that
product as argument, except that the memory is not initialized. I
don't think there is any argument that that size is unavailable
with standard C methods. Replace BLANK by "((count) * sizeof
(int))", bearing in mind the unsigned calculation.

So you claim that

sizeof ((count) * sizeof (int)) == count*sizeof(int)

is true? I don't think that's what you meant -- and I don't think
you're taking this seriously.

I greatly doubt that you will find ANY calloc that does other than
compute the product and call malloc. That means that the system
may report success, but has actually allocated much less than the
desired memory. To me, that is a glaring calloc fault.

A calloc implementation that blindly multiplies its two arguments,
ignoring any wraparound, is buggy. I think I've seen such
implementations, but three systems I've just tried don't have this
problem. On all three systems, this program:

#include <stdio.h>
#include <stdlib.h>

#define MY_SIZE_MAX ((size_t)-1)
/* SIZE_MAX isn't always available */

int main(void)
{
    void *c = calloc(MY_SIZE_MAX, MY_SIZE_MAX);
    void *m = malloc(MY_SIZE_MAX * MY_SIZE_MAX);
    printf("calloc %s\n", c == NULL ? "failed" : "succeeded");
    printf("malloc %s\n", m == NULL ? "failed" : "succeeded");
    return 0;
}

produces this output:

calloc failed
malloc succeeded

To justify that, consider that it is not portably possible to
allocate more than one memory segment with malloc, and guarantee
that the segments are combinable.

I fail to see the relevance. What is a "memory segment", and why
would I want to combine more than one of them?

No implementation is *required* to support allocating objects larger
than SIZE_MAX bytes using calloc(). The argument is whether an
implementation is *allowed* to do so.
 
CBFalconer

Keith said:
.... snip ...

Consider an implementation where size_t is 16 bits, with
SIZE_MAX==65535. Then SIZE_MAX * SIZE_MAX, after wraparound,
yields a result of 1, since (2^16 - 1)^2 = 2^32 - 2^17 + 1, which
is congruent to 1 modulo 2^16. (The same happens with a 32-bit size_t.)
By your argument the expression

calloc(SIZE_MAX, SIZE_MAX)

must attempt to allocate just 1 byte. But an implementation
that did that would violate the standard's requirements for
calloc(), which I'll leave you to look up for yourself.

No. calloc should check that the multiplication does not overflow
before making any attempt to allocate memory. On overflow, return
NULL. Otherwise, call malloc. That's all that is required.
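
A sketch of the check being described, assuming (as CBFalconer does) a
calloc that simply defers to malloc; my_calloc is a hypothetical name,
not any real library's implementation:

#include <stdlib.h>
#include <string.h>

void *my_calloc(size_t nmemb, size_t size)
{
    void *p;
    if (size != 0 && nmemb > (size_t)-1 / size)
        return NULL;               /* nmemb * size would wrap around */
    p = malloc(nmemb * size);
    if (p != NULL)
        memset(p, 0, nmemb * size);
    return p;
}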
 
Keith Thompson

CBFalconer said:
i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

Why haven't you checked? And what kind of "error" do you think it is?

In sizeof(*ptr), the expression *ptr is not evaluated, so the result
is not affected by whether calloc() succeeded.

You've quoted 6.5.3.4p2 several times here.
 
Richard Tobin

CBFalconer said:
There are no "types that are too big". Their sizes have been
calculated using unsigned arithmetic

Do you have any evidence for that? I don't recall anything that
implies that the size of objects is calculated using C arithmetic
(rather than everyday arithmetic) at all.

-- Richard
 
Keith Thompson

CBFalconer said:
size_t is specified to be able to specify the size of ANY object.

For the N'th time, no, it bloody well is not. size_t "is the unsigned
integer type of the result of the sizeof operator". Re-read 6.5.3.4;
the sizeof operator yields the size of a type, not of an object.
The maximum value for a size_t item is SIZE_MAX. All calloc has to
do is check that the product of the specifications does not
overflow a size_t, and pass that to malloc. If it does overflow,
simply return NULL.

Yes, that's all it *has* to do. The debate is whether it's *allowed*
to do more.
Also consider that the system has no known means of allocating that
'over SIZE_MAX' space in a continuous block.

How do you know that? What system are you referring to?
 
Tony

Ben Pfaff said:
I would use size_t for the string length. In normal C
implementations, size_t is sufficient to hold the longest
possible string.

I'm thinking that the overhead of such a design is unnecessary. Consider:

struct string_t
{
    char *data;
    size_t length;   // 4 bytes on a 32-bit platform, 8 bytes on a 64-bit platform assumed
    size_t buff_sz;
};

12 bytes on a 32-bit platform. 24 bytes on a 64-bit platform.

vs.

struct string16
{
    char *data;
    uint16 length;   // assumed 16-bit unsigned typedef
    uint16 buff_sz;
};


8 bytes on a 32-bit platform. 16 bytes (effectively after alignment) on a
64-bit platform.

vs.

struct string_null
{
    char *data;
    uint16 buff_sz;
};


8 bytes on a 32-bit platform (after alignment). 16 bytes on a 64-bit
platform (after alignment).

Now consider the overhead of the above while handling a 1 MB text file as
one string per line with an average line length of 80 bytes:

1000000 bytes / (80 bytes/line) = 12500 lines

On a 32-bit platform:

string16:    12500 lines * ( 8 bytes overhead/line) = 100 KB overhead/MB = 10%
string_t:    12500 lines * (12 bytes overhead/line) = 150 KB overhead/MB = 15%
string_null: 12500 lines * ( 8 bytes overhead/line) = 100 KB overhead/MB = 10%

On a 64-bit platform:

string16:    12500 lines * (16 bytes overhead/line) = 200 KB overhead/MB = 20%
string_t:    12500 lines * (24 bytes overhead/line) = 300 KB overhead/MB = 30%
string_null: 12500 lines * (16 bytes overhead/line) = 200 KB overhead/MB = 20%

Therefore:

string_t carries 50% more overhead than string16 for the design given.

When the above analysis is applied to structs that lack the buff_sz
field, string16 and string_t have equivalent overhead on both 32-bit
and 64-bit platforms once alignment is considered, so design and
alignment issues are important when determining space overhead.

Tony
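
The sizes Tony quotes depend on padding, which a quick test makes
visible; here uint16_t from <stdint.h> stands in for his uint16, and
the exact numbers are ABI-specific:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct string_t    { char *data; size_t   length; size_t   buff_sz; };
struct string16    { char *data; uint16_t length; uint16_t buff_sz; };
struct string_null { char *data; uint16_t buff_sz; };

int main(void)
{
    /* Typically 24/16/16 on an LP64 platform and 12/8/8 on a 32-bit
       one, matching the figures above, padding included. */
    printf("string_t    %zu\n", sizeof(struct string_t));
    printf("string16    %zu\n", sizeof(struct string16));
    printf("string_null %zu\n", sizeof(struct string_null));
    return 0;
}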
 
CBFalconer

James said:
He didn't suggest that file == string. However, a file can be used
to store a string, and a single string can be created using the
contents of any sufficiently small file, simply by ignoring any
null characters it contains. I presume that's what he's suggesting
for the "minor modification of ggets.c". Whether there's any point
in doing so depends upon the context; but it's certainly not
universally pointless.

No need to ignore zero bytes. Just install them in the output.
The system is terminated by receiving an EOF in place of a char.
You have to do something else about signalling length.

A string is not a type. A char array is a type.
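
A sketch of the scheme CBFalconer outlines (a hypothetical interface,
not ggets itself): read until EOF, keep any zero bytes, and signal the
length out of band rather than by a terminator.

#include <stdio.h>
#include <stdlib.h>

char *read_all(FILE *stream, size_t *len)
{
    size_t used = 0, cap = 64;
    char *buf = malloc(cap);
    int ch;

    if (buf == NULL)
        return NULL;
    while ((ch = getc(stream)) != EOF) {
        if (used == cap) {                  /* grow geometrically */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
        buf[used++] = (char)ch;             /* zero bytes kept as data */
    }
    *len = used;                            /* length signalled here */
    return buf;
}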
 
