Null terminated strings: bad or good?

Hallvard B Furuseth

CBFalconer said:
Wojtek said:
CBFalconer said:
long array[SIZE_MAX];
I maintain that, whenever (sizeof (long) > 1), that is a compile
error.

We know that you do, but we don't believe that you have
demonstrated that to be true.

I have quoted the appropriate portion of the standard. Any other
interpretation involves a contradiction.

Maybe nobody has made this clear yet, but the point is that the
standard has no concept of "compile error". It has circumstances
where the compiler is required to issue a diagnostic: for violations
of constraints, syntax rules, and for #error. See C99 5.1.1.3.

That's why people are pointing out that your quoted standard text is not
a constraint. There is indeed a contradiction, but it is in the
standard.

Faced with a requirement to produce an impossible executable, in real
life it makes sense for a compiler to refuse to compile. But that's
another matter.
 
Keith Thompson

Ben Pfaff said:
I would use size_t for the string length. In normal C
implementations, size_t is sufficient to hold the longest
possible string.

We've been debating that point in comp.std.c, but in practice, yes,
it's reasonable to assume that size_t is sufficient.

Using something smaller than size_t to hold the length of a
string-like object is what I'd consider to be a micro-optimization.
Sometimes you need to do that kind of thing (for example, storing 1 or
2 bytes rather than 4 on a severely memory-constrained system might
make sense), but in most cases it's a waste of effort. Or you can
define a separate "short string" type in addition to one that can
potentially store up to SIZE_MAX characters, but then you have to deal
with converting between the two types.

My advice: Keep It Simple -- use size_t to hold the length unless you
have a really good reason not to.
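
By way of illustration (not code from any post in this thread), a
counted-string type along the lines Keith describes might look like
the sketch below; the names str and str_from are hypothetical:

#include <stdlib.h>
#include <string.h>

/* Counted string: length and capacity are size_t, so the length
   field can represent any object the implementation can allocate. */
struct str {
    char *data;     /* not necessarily NUL-terminated */
    size_t length;  /* bytes in use */
    size_t cap;     /* bytes allocated */
};

/* Build a str from a C string; data is NULL on allocation failure. */
struct str str_from(const char *s)
{
    struct str r = { NULL, 0, 0 };
    size_t n = strlen(s);
    r.data = malloc(n ? n : 1);   /* avoid malloc(0) ambiguity */
    if (r.data != NULL) {
        memcpy(r.data, s, n);
        r.length = n;
        r.cap = n ? n : 1;
    }
    return r;
}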
 
Ian Collins

Tony said:
That's a relevant example then. But will I use a 32-bit length field in the
string class I am currently reworking based on that? Maybe, but probably
not. Maybe that will be "bigstring" rather than "string". Should a 32-bit
length field be used in a language-level, general string class? Maybe more
compelling there, but maybe a *choice* of lower overhead would be attractive:
"string", "smallstring".
As Ben says, use size_t. If you want to optimise for short strings,
follow the example of the commonly used short string optimisation in C++
standard libraries.
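
For readers unfamiliar with it, the short string optimisation Ian
mentions stores small strings inline in the object and uses the heap
only for long ones. A rough C analogue (a hypothetical layout, not any
actual library's) might be:

#include <stddef.h>

#define SSO_CAP 15  /* bytes storable inline, excluding the terminator */

struct sso_str {
    size_t len;
    union {
        char *heap;               /* used when len > SSO_CAP */
        char small[SSO_CAP + 1];  /* used when len <= SSO_CAP */
    } u;
};

/* Return the character data, wherever it currently lives. */
static char *sso_data(struct sso_str *s)
{
    return s->len <= SSO_CAP ? s->u.small : s->u.heap;
}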
 
Ian Collins

Amandil said:
Another point in defense of the NUL-terminated string is the
unwillingness on the part of the creators of C to create a new data
type whose handling happens transparently, out of the programmer's
sight. Other examples of "transparent" code would include
constructors, destructors, and the like, which you seem used to in
C++. But in C, the idea is to do exactly what the programmer expects
and writes, with nothing going on behind the scenes. The biggest
exception to that is copying structs (struct file a = b;), and perhaps
casting (usually explicit, or with a prototype). Adding a special
string type is not in C's design.
Just like the wildly popular _Complex type?
 
Phil Carmody

Tony said:
"only"? Did you say ONLY 65535-byte lines?! (Note: ASCII chars assumed).
Please do tell how "much more useful" it is to have lines longer than 64k?
The one example I can think of that may need "a lot" of chars is macro
preprocessor lines but 64k seems like a stretch there too.

Well, I've got files consisting of lines which are factors of
trinomials of degree >42 million, some of which are thousands
of terms long (and, as the exponents are up to 8 digits, you
can expect such lines to be well over 64k in length).

But evidently such expressions are inexpressible as strings,
and are probably simply infinite, using Hottentot metrics.

Phil
 
Phil Carmody

Tony said:
It's bizarre to handle buffers of data (and remember, the discussion stemmed
from reading-in a whole textfile as a single string) like that in the
everyday case is what I was saying. Your example is a special case that
requires special abstractions/handling.

With all due respect, I think you'll find that embedding null
characters into a C string requires special abstractions and
handling.

Phil
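
A small demonstration of Phil's point: the str* functions treat the
first zero byte as the end of the data, so embedded nulls force you to
carry an explicit length alongside the buffer.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[] = "abc\0def";          /* 7 payload bytes + terminator */
    printf("%zu\n", strlen(buf));     /* prints 3: strlen stops at the NUL */
    printf("%zu\n", sizeof buf - 1);  /* prints 7: the full payload */
    return 0;
}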
 
CBFalconer

Keith said:
.... snip ...

Ok, let's consider another case, setting aside the huge allocation
issue for the moment:

int count = <some number>;
int *p = calloc(count, sizeof(int));

If calloc succeeds, it allocates an object whose size is
count*sizeof(int). What expression refers to that object? Note
that *p refers to an object whose size is sizeof(int); that's not
the object I'm talking about.

To put it another way:

sizeof <BLANK> == count*sizeof(int)

Fill in the blank in a manner that makes this expression true and
is directly relevant to the object allocated by calloc() in the
code above.

In this case things are no different than calling malloc with that
product as argument, except that the memory is not initialized. I
don't think there is any argument that that size is unavailable
with standard C methods. Replace BLANK by "((count) * sizeof
(int))", bearing in mind the unsigned calculation.

I greatly doubt that you will find ANY calloc that does other than
compute the product and call malloc. That means that the system
may report success, but has actually allocated much less than the
desired memory. To me, that is a glaring calloc fault.

To justify that, consider that it is not portably possible to
allocate more than one memory segment with malloc, and guarantee
that the segments are combinable.
 
CBFalconer

James said:
.... snip ...

For that matter, you've also failed to identify a problem that
having calloc() return NULL would solve. sizeof(*ptr) has the
exact same value, whether or not calloc() returned NULL.

i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.
 
CBFalconer

James said:
CBFalconer wrote:
.... snip ...


No, the problem is quite independent of calloc(), as has been
repeatedly pointed out. It is equally impossible for sizeof to
return the specified value when applied to types that are too big.

There are no "types that are too big". Their sizes have been
calculated using unsigned arithmetic, which cannot return a value
larger than SIZE_MAX. That result is in the symbol tables for
compilation.

If the compilation is on a cross-compiler, the authors should have
handled the appropriate size limits.
 
CBFalconer

James said:
.... snip ...

My interpretation of the relevant words implies that a conforming
implementation of C can

1. reject any declaration that refers to a type bigger than
SIZE_MAX, as exceeding an implementation limit.

2. have calloc(nmemb, size) return a non-null pointer to enough
memory to store an array of nmemb objects of the specified size, even
if nmemb*size has a mathematical value that is greater than
SIZE_MAX. It will return sufficient memory for the specified number
of objects of the specified size, even though the amount of memory
required is greater than the value of nmemb*size, interpreted as a
C expression rather than a mathematical one.

Please demonstrate the contradiction that you see in that
interpretation.

size_t is specified to be able to specify the size of ANY object.
The maximum value for a size_t item is SIZE_MAX. All calloc has to
do is check that the product of the specifications does not
overflow a size_t, and pass that to malloc. If it does overflow,
simply return NULL.

Also consider that the system has no known means of allocating that
'over SIZE_MAX' space in a contiguous block.
 
CBFalconer

jameskuyper said:
I can't think of a good reason why an implementation would want to
do that, but I don't see how accepting such a program violates any
of the standard's requirements.

It's a fine point, but the types are harmless. They don't create a
problem unless used in a declaration or sizeof. At that point, the
compiler should find an error, if it hasn't already flagged the
type.
 
Ben Pfaff

CBFalconer said:
i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

No, the operand of sizeof is not evaluated, except in one special
case:

The sizeof operator yields the size (in bytes) of its
operand, which may be an expression or the parenthesized
name of a type. The size is determined from the type of the
operand. The result is an integer. If the type of the
operand is a variable length array type, the operand is
evaluated; otherwise, the operand is not evaluated and the
result is an integer constant.
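
A minimal demonstration of that rule (C99 6.5.3.4p2): because *ptr is
not a variable length array type, it is never evaluated, and this
program is well defined even though ptr is a null pointer.

#include <stdio.h>

int main(void)
{
    int *ptr = NULL;                /* deliberately null */
    /* *ptr is not evaluated; sizeof uses only its type, int */
    printf("%zu\n", sizeof(*ptr));  /* prints sizeof(int), e.g. 4 */
    return 0;
}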
 
CBFalconer

Wojtek said:
CBFalconer said:
Wojtek said:
Where does the standard say that char[SIZE_MAX][SIZE_MAX] is
not a declarable type?

It says sizeof can return the size of a type. But it returns a
size_t, which has a maximum value of SIZE_MAX. This requires
that the declaration be an error, or at least unusable.

Unusable as an operand of sizeof, maybe. But it doesn't follow
that it must be unusable for other purposes.

No restriction on using it as a number. But size_t is intended to
measure the size of ANY object, which must first be created. That
means the object fits into memory (which may include disk
simulation of memory space). That also means that size_t matches
the addressing capabilities of the computing unit.
 
Keith Thompson

CBFalconer said:
In this case things are no different than calling malloc with that
product as argument, except that the memory is not initialized. I
don't think there is any argument that that size is unavailable
with standard C methods. Replace BLANK by "((count) * sizeof
(int))", bearing in mind the unsigned calculation.

So you claim that

sizeof ((count) * sizeof (int)) == count*sizeof(int)

is true? I don't think that's what you meant -- and I don't think
you're taking this seriously.

I greatly doubt that you will find ANY calloc that does other than
compute the product and call malloc. That means that the system
may report success, but has actually allocated much less than the
desired memory. To me, that is a glaring calloc fault.

A calloc implementation that blindly multiplies its two arguments,
ignoring any wraparound, is buggy. I think I've seen such
implementations, but three systems I've just tried don't have this
problem. On all three systems, this program:

#include <stdio.h>
#include <stdlib.h>

#define MY_SIZE_MAX ((size_t)-1)
/* SIZE_MAX isn't always available */

int main(void)
{
    void *c = calloc(MY_SIZE_MAX, MY_SIZE_MAX);
    void *m = malloc(MY_SIZE_MAX * MY_SIZE_MAX);
    printf("calloc %s\n", c == NULL ? "failed" : "succeeded");
    printf("malloc %s\n", m == NULL ? "failed" : "succeeded");
    return 0;
}

produces this output:

calloc failed
malloc succeeded

To justify that, consider that it is not portably possible to
allocate more than one memory segment with malloc, and guarantee
that the segments are combinable.

I fail to see the relevance. What is a "memory segment", and why
would I want to combine more than one of them?

No implementation is *required* to support allocating objects larger
than SIZE_MAX bytes using calloc(). The argument is whether an
implementation is *allowed* to do so.
 
CBFalconer

Keith said:
.... snip ...

Consider an implementation where size_t is 16 bits, with
SIZE_MAX==65535. Then SIZE_MAX * SIZE_MAX, after wraparound,
yields a result of 1, since (2^16 - 1)^2 = 2^32 - 2^17 + 1, which
is congruent to 1 modulo 2^16. (The same happens with a 32-bit size_t.)
By your argument the expression

calloc(SIZE_MAX, SIZE_MAX)

must attempt to allocate just 1 byte. But an implementation
that did that would violate the standard's requirements for
calloc(), which I'll leave you to look up for yourself.

No. calloc should check that the multiplication does not overflow
before making any attempt to allocate memory. On overflow, return
NULL. Otherwise, call malloc. That's all that is required.
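
A sketch of the check being described, assuming (as CBFalconer does) a
calloc that simply defers to malloc; my_calloc is a hypothetical name,
not any real library's implementation:

#include <stdlib.h>
#include <string.h>

void *my_calloc(size_t nmemb, size_t size)
{
    void *p;
    if (size != 0 && nmemb > (size_t)-1 / size)
        return NULL;               /* nmemb * size would wrap around */
    p = malloc(nmemb * size);
    if (p != NULL)
        memset(p, 0, nmemb * size);
    return p;
}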
 
Keith Thompson

CBFalconer said:
i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

Why haven't you checked? And what kind of "error" do you think it is?

In sizeof(*ptr), the expression *ptr is not evaluated, so the result
is not affected by whether calloc() succeeded.

You've quoted 6.5.3.4p2 several times here.
 
Richard Tobin

CBFalconer said:
There are no "types that are too big". Their sizes have been
calculated using unsigned arithmetic

Do you have any evidence for that? I don't recall anything that
implies that the size of objects is calculated using C arithmetic
(rather than everyday arithmetic) at all.

-- Richard
 
Keith Thompson

CBFalconer said:
size_t is specified to be able to specify the size of ANY object.

For the N'th time, no, it bloody well is not. size_t "is the unsigned
integer type of the result of the sizeof operator". Re-read 6.5.3.4;
the sizeof operator yields the size of a type, not of an object.
The maximum value for a size_t item is SIZE_MAX. All calloc has to
do is check that the product of the specifications does not
overflow a size_t, and pass that to malloc. If it does overflow,
simply return NULL.

Yes, that's all it *has* to do. The debate is whether it's *allowed*
to do more.
Also consider that the system has no known means of allocating that
'over SIZE_MAX' space in a continuous block.

How do you know that? What system are you referring to?
 
Tony

Ben Pfaff said:
I would use size_t for the string length. In normal C
implementations, size_t is sufficient to hold the longest
possible string.

I'm thinking that the overhead of such a design is unnecessary. Consider:

struct string_t
{
    char *data;
    size_t length;   // 4 bytes on a 32-bit platform, 8 bytes on a 64-bit platform assumed
    size_t buff_sz;
};

12 bytes on a 32-bit platform. 24 bytes on a 64-bit platform.

vs.

struct string16
{
    char *data;
    uint16 length;   // assumed 16-bit unsigned typedef
    uint16 buff_sz;
};


8 bytes on a 32-bit platform. 16 bytes (effectively after alignment) on a
64-bit platform.

vs.

struct string_null
{
    char *data;
    uint16 buff_sz;
};


8 bytes on a 32-bit platform (after alignment). 16 bytes on a 64-bit
platform (after alignment).

Now consider the overhead of the above while handling a 1 MB text file as
one string per line with an average line length of 80 bytes:

1000000 bytes / (80 bytes/line) = 12500 lines

On a 32-bit platform:

string16:    12500 lines * ( 8 bytes overhead/line) = 100 KB overhead/MB = 10%
string_t:    12500 lines * (12 bytes overhead/line) = 150 KB overhead/MB = 15%
string_null: 12500 lines * ( 8 bytes overhead/line) = 100 KB overhead/MB = 10%

On a 64-bit platform:

string16:    12500 lines * (16 bytes overhead/line) = 200 KB overhead/MB = 20%
string_t:    12500 lines * (24 bytes overhead/line) = 300 KB overhead/MB = 30%
string_null: 12500 lines * (16 bytes overhead/line) = 200 KB overhead/MB = 20%

Therefore:

string_t carries 50% more overhead than string16 for the design given.

When the above analysis is applied to structs that lack the buff_sz
field, string16 and string_t have equivalent overhead on both 32-bit
and 64-bit platforms once alignment is considered, so design and
alignment issues are important when determining space overhead.

Tony
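
The sizes Tony quotes depend on padding, which a quick test makes
visible; here uint16_t from <stdint.h> stands in for his uint16, and
the exact numbers are ABI-specific:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct string_t    { char *data; size_t   length; size_t   buff_sz; };
struct string16    { char *data; uint16_t length; uint16_t buff_sz; };
struct string_null { char *data; uint16_t buff_sz; };

int main(void)
{
    /* Typically 24/16/16 on an LP64 platform and 12/8/8 on a 32-bit
       one, matching the figures above, padding included. */
    printf("string_t    %zu\n", sizeof(struct string_t));
    printf("string16    %zu\n", sizeof(struct string16));
    printf("string_null %zu\n", sizeof(struct string_null));
    return 0;
}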
 
CBFalconer

James said:
He didn't suggest that file == string. However, a file can be used
to store a string, and a single string can be created using the
contents of any sufficiently small file, simply by ignoring any
null characters it contains. I presume that's what he's suggesting
for the "minor modification of ggets.c". Whether there's any point
in doing so depends upon the context; but it's certainly not
universally pointless.

No need to ignore zero bytes. Just install them in the output.
The system is terminated by receiving an EOF in place of a char.
You have to do something else about signalling length.

A string is not a type. A char array is a type.
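
A sketch of the scheme CBFalconer outlines (a hypothetical interface,
not ggets itself): read until EOF, keep any zero bytes, and signal the
length out of band rather than by a terminator.

#include <stdio.h>
#include <stdlib.h>

char *read_all(FILE *stream, size_t *len)
{
    size_t used = 0, cap = 64;
    char *buf = malloc(cap);
    int ch;

    if (buf == NULL)
        return NULL;
    while ((ch = getc(stream)) != EOF) {
        if (used == cap) {                  /* grow geometrically */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
        buf[used++] = (char)ch;             /* zero bytes kept as data */
    }
    *len = used;                            /* length signalled here */
    return buf;
}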
 
