Null terminated strings: bad or good?

Wojtek Lerch

This argument only arises because of the prototype of calloc.

No, I was talking about specifying types larger than SIZE_MAX bytes. The
argument about calloc is a separate issue.

Nowhere else is there any possibility of creating oversized objects
(considering the definition of size_t and SIZE_MAX).

It's the other way around: the standard never forbids declaring objects or
types larger than SIZE_MAX. The definition of size_t and SIZE_MAX implies a
limit on the types for which sizeof can possibly return its specified value,
but it says nothing about types that aren't operands of the sizeof operator.

calloc is quite capable of protecting itself by checking that the size
requested does not exceed SIZE_MAX. It has the ability to signal
this (or other failures) by returning NULL.

Sure, but the question is whether it's *required* to do so. And that's not
the question that I was originally arguing about.
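
For readers following the calloc point, here is a minimal sketch of the kind of self-protection described above. The name checked_calloc is hypothetical and not part of any standard library, and nothing here settles whether the standard *requires* calloc to behave this way; it only shows that the check is easy to write.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption for illustration): allocate nmemb
 * objects of `size` bytes each, but refuse the request when the
 * mathematical product nmemb * size would exceed SIZE_MAX. */
void *checked_calloc(size_t nmemb, size_t size)
{
    /* Dividing avoids ever computing the product, which would wrap
     * around silently because size_t is an unsigned type. */
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;    /* signal failure the same way calloc does */

    return calloc(nmemb, size);
}
```

With this wrapper, a call such as checked_calloc(SIZE_MAX, 2) returns NULL instead of handing back a block sized by the wrapped product.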
 
Ben Pfaff

Tony said:
That's bizarre. file != string, IMO. YMMV.

I like my software to be able to handle extreme cases as best it
can. If general-purpose software for operating on text files is
limited to lines that are only, say, 65535 bytes long, then it is
much less useful than it might otherwise be.
 
Phil Carmody

Tony said:
That's just the C paradigm. It's unnatural. I can see the reason for
character buffers, but null terminating them is bizarre.

You've obviously never worked in an environment where the
streaming of data of an arbitrary, and unknown in advance,
length needs to be done. The concept of in-band control is
absolutely essential in such cases, and that's precisely
what NUL is. It's far from bizarre.

Phil
 
Tony

Ben Pfaff said:
I like my software to be able to handle extreme cases as best it
can. If general-purpose software for operating on text files is
limited to lines that are only, say, 65535 bytes long, then it is
much less useful than it might otherwise be.

"only"? Did you say ONLY 65535-byte lines?! (Note: ASCII chars assumed).
Please do tell how "much more useful" it is to have lines longer than 64k?
The one example I can think of that may need "a lot" of chars is macro
preprocessor lines but 64k seems like a stretch there too.

Tony
 
Bartc

Tony said:
"only"? Did you say ONLY 65535-byte lines?! (Note: ASCII chars assumed).
Please do tell how "much more useful" it is to have lines longer than 64k?
The one example I can think of that may need "a lot" of chars is macro
preprocessor lines but 64k seems like a stretch there too.

If a file has the wrong sort of line separator (or doesn't have any), then
quite likely the first line you read may be the entire file. And these days
a 64KB file limit won't really do.
 
Tony

Phil Carmody said:
You've obviously never worked in an environment where the
streaming of data of an arbitrary, and unknown in advance,
length needs to be done. The concept of in-band control is
absolutely essential in such cases, and that's precisely
what NUL is. It's far from bizarre.

What I was saying is that it's bizarre to handle buffers of data like that in
the everyday case (and remember, the discussion stemmed from reading in a
whole text file as a single string). Your example is a special case that
requires special abstractions/handling. Trying to cover too many bases with
a single abstraction is what makes the standard library so unwieldy.
Null-terminated strings seem to be one of those "over-thought" ideas. Though
I don't mind anyone trying to convince me that they are the way to go (then
again, I don't really use C. I use C++, though that shouldn't really
matter). The string-literal case is probably the point to argue in defense
of C-strings. Perhaps that's the ONLY place that null-terminated makes
sense. Literals are almost always short, too.

Tony
 
James Kuyper

Tony said:
That's bizarre. file != string, IMO. YMMV.

He didn't suggest that file == string. However, a file can be used to
store a string, and a single string can be created using the contents of
any sufficiently small file, simply by ignoring any null characters it
contains. I presume that's what he's suggesting for the "minor
modification of ggets.c". Whether there's any point in doing so depends
upon the context; but it's certainly not universally pointless.
 
Richard Tobin

Tony said:
"only"? Did you say ONLY 65535-byte lines?! (Note: ASCII chars assumed).
Please do tell how "much more useful" it is to have lines longer than 64k?

I frequently deal with text files (XML) that have no line breaks at
all, or only incidental ones where they are part of the data. It is
not unusual to have "lines" that are many megabytes long.

-- Richard
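
To make the "no fixed line limit" idea concrete, here is a rough sketch of a reader that grows its buffer as needed, so a multi-megabyte "line" is limited only by available memory. The function name read_long_line and the starting capacity of 128 are assumptions for illustration; this is not ggets.c or any other code discussed in the thread.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one line of any length from fp.
 * Returns a malloc'd, NUL-terminated buffer without the newline, or
 * NULL at end of input or on allocation failure. If len_out is not
 * NULL, the line length is stored there. */
char *read_long_line(FILE *fp, size_t *len_out)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 == cap) {            /* keep one byte for the NUL */
            char *tmp;
            if (cap > SIZE_MAX / 2) {    /* doubling would overflow */
                free(buf);
                return NULL;
            }
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {          /* nothing left to read */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

The caller owns the returned buffer and must free it. Doubling the capacity keeps the number of realloc calls logarithmic in the line length.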
 
Wojtek Lerch

James Kuyper said:
My interpretation of the relevant words implies that a conforming
implementation of C can

1. reject any declaration that refers to a type bigger than SIZE_MAX, as
exceeding an implementation limit.

Out of curiosity, do you disagree that a conforming implementation can also
accept programs that refer to types bigger than SIZE_MAX but never apply the
sizeof operator to such a type?
 
Amandil

What I was saying is that it's bizarre to handle buffers of data like that in
the everyday case (and remember, the discussion stemmed from reading in a
whole text file as a single string). Your example is a special case that
requires special abstractions/handling. Trying to cover too many bases with
a single abstraction is what makes the standard library so unwieldy.
Null-terminated strings seem to be one of those "over-thought" ideas. Though
I don't mind anyone trying to convince me that they are the way to go (then
again, I don't really use C. I use C++, though that shouldn't really
matter). The string-literal case is probably the point to argue in defense
of C-strings. Perhaps that's the ONLY place that null-terminated makes
sense. Literals are almost always short, too.

Tony

Another point in defense of the NUL-terminated string is the
unwillingness on the part of the creators of C to create a new data
type whose handling is done "transparently", out of the programmer's sight.
Other examples of "transparent" code would include constructors,
destructors, and the like, that you seem used to in C++. But in C, the
idea is to do exactly what the programmer expects and writes, with nothing
happening behind the scenes. The biggest exception to that is copying structs
(struct file a = b;), and perhaps casting (usually explicit, or with a
prototype). Adding a special string type is not in C's design.

Cheers,

-- Marty Amandil
 
jameskuyper

Wojtek said:
Out of curiosity, do you disagree that a conforming implementation can also
accept programs that refer to types bigger than SIZE_MAX but never apply the
sizeof operator to such a type?

I can't think of a good reason why an implementation would want to do
that, but I don't see how accepting such a program violates any of the
standard's requirements.
 
Amandil

CBFalconer said:
Wojtek said:
    long array[SIZE_MAX];
I maintain that, whenever (sizeof (long) > 1), that is a compile
error.
We know that you do, but we don't believe that you have
demonstrated that to be true.
I have quoted the appropriate portion of the standard.  Any other
interpretation involves a contradiction.

My interpretation of the relevant words implies that a conforming
implementation of C can

1. reject any declaration that refers to a type bigger than SIZE_MAX, as
exceeding an implementation limit.

That's definitely true. It is also allowed to set both SIZE_MAX and
the maximum size of an object to 65535. However, the standard allows
these two limitations separately: SIZE_MAX in 7.18.3.2, regarding
<stdint.h>, and object size in 5.2.4.1, Translation limits. It has not
been proven, by Chuck or anyone else, that these two values are
related or must be the same. From the lack of any relationship between
those two values mentioned in the standard - other than that both must
be at least 65535 - the conclusion can be drawn that there
indeed is not any such relationship. As a matter of fact, the maximum
object size may be less than 65535 in a freestanding implementation,
but no such exception exists for SIZE_MAX.

SIZE_MAX is simply the largest possible value to be contained in an
object of type size_t. I suggest that objects such as

int ia[65536];
char ca[250000];

are valid even when size_t is only as wide as a short (SIZE_MAX = 65535).
The result of sizeof ia or sizeof ca in such a case is
implementation-defined, same as in

int i = 100000;
short s = (short) i;

according to 6.3.1.3.3, or whatever the standard says in 6.3.1.3.2
(Can someone please explain that to me?)
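
On the 6.3.1.3 question: paragraph 2 covers conversion to an unsigned type, where an out-of-range value is reduced modulo one more than the maximum of the new type, while paragraph 3 covers conversion to a signed type, where the result is implementation-defined (or an implementation-defined signal is raised). Since size_t is always unsigned, paragraph 2 is the one that would govern an ordinary conversion to it. A small sketch of that rule, assuming the implementation provides uint16_t; the helper name to_u16 is made up for illustration. It shows the conversion rule only and does not settle what sizeof yields for an oversized type.

```c
#include <stdint.h>

/* Hypothetical helper: convert to a 16-bit unsigned type. Per
 * 6.3.1.3p2 the result is the value reduced modulo 65536, so this is
 * fully defined, unlike the conversion to short in the post above. */
uint16_t to_u16(long value)
{
    return (uint16_t)value;
}
```

For example, to_u16(100000L) yields 34464, because 100000 - 65536 = 34464.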
2. have calloc(nmemb, size) return a non-null pointer to enough memory
to store an array nmemb objects of the specified size, even if
nmemb*size has a mathematical value that is greater than SIZE_MAX. It
will return sufficient memory for the specified number of objects of the
specified size, even though the amount of memory required is greater
than the value of nmemb*size, interpreted as a C expression rather than
a mathematical one.

And the memory returned is addressable by array subscript, which is not
necessarily of type size_t, for two reasons. One I mentioned above;
the other is that size_t is (or may be) an unsigned integer type,
while an array subscript may be of any integer type,
including a negative long long (6.5.2.1.1, by exclusion of any other
constraints).

-- Marty Amandil


I hope I didn't make
 
lawrence.jones

Tony said:
Getting the length of a string is such a common operation that its
implementation should have been considered when designing a string library and
any associated structures.

Whether it's a common operation or not depends a great deal on what
paradigm you're using. In many languages, you need to get the length
before you can do anything to the contents of the string. In C, on the
other hand, you can usually just start processing the contents and stop
when you find the null byte without ever explicitly getting the length.
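
A small sketch of that style of processing, for illustration only: the loop below walks the string until it meets the terminating null byte and never asks for the length. The function name count_digits is made up for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical example: count the decimal digits in a C string by
 * scanning until the terminating null byte. strlen() is never called. */
size_t count_digits(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* The cast avoids undefined behaviour for negative char values. */
        if (isdigit((unsigned char)*s))
            n++;
    }
    return n;
}
```

Calling count_digits("a1b22c") returns 3 after a single pass over the data.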
 
Drew Lawson

That's bizarre. file != string, IMO. YMMV.

I agree with both parts.

However, I have also written code that stores (in excess of) entire
files in single strings. That code was doing a gateway fetch/forward
of PDFs from another system -- SOAP request sent, entire response
(containing a PDF in binary) stored in a single string, then chop
out the body and return it.

This was a C++ wrapper class, but built on C-model string utilities.
That's how I know the PDFs were binary. I had to fix the piece of
code that used strlen() when it was supposed to support arbitrary
data.

The string was not a file. But the string and the file contained
the same bytes.
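
The strlen() pitfall mentioned above can be shown in a few lines. The payload below is invented for illustration (it is not real PDF data): five bytes with a zero byte in the middle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sample data: 'P', 'D', 0, 'F', '!'. The string literal
 * also appends the usual terminating null byte as a sixth byte. */
static const char payload[] = "PD\0F!";

/* What string functions believe: they stop at the first zero byte. */
size_t bytes_seen_by_strlen(void)
{
    return strlen(payload);
}

/* What is really there: the array size minus the literal's terminator. */
size_t bytes_actually_present(void)
{
    return sizeof payload - 1;
}
```

Here strlen() reports 2 while the payload holds 5 bytes, which is why code meant to carry arbitrary data needs an explicit length kept alongside the buffer.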
 
Tony

Bartc said:
If a file has the wrong sort of line separator (or doesn't have any), then
quite likely the first line you read may be the entire file.

One should handle that error at the limit rather than just continue reading!
Indeed, a program may have some a priori knowledge about the file structure
and error out way before 64k bytes have been read.

Tony
 
Tony

Richard Tobin said:
I frequently deal with text files (XML) that have no line breaks at
all, or only incidental ones where they are part of the data. It is
not unusual to have "lines" that are many megabytes long.

That's a relevant example then. But will I use a 32-bit length field in the
string class I am currently reworking based on that? Maybe, but probably
not. Maybe that will be "bigstring" rather than "string". Should a 32-bit
length field be used in a language level, general string class? Maybe more
compelling there, but maybe a *choice* of lower overhead would be lucrative:
"string", "smallstring".

Tony
 
Tony

James Kuyper said:
He didn't suggest that file == string. However, a file can be used to
store a string, and a single string can be created using the contents of
any sufficiently small file, simply by ignoring any null characters it
contains. I presume that's what he's suggesting for the "minor
modification of ggets.c". Whether there's any point in doing so depends
upon the context; but it's certainly not universally pointless.

I wouldn't include those requirements in a string library specification,
which would simplify the implementation or at least make it more space
efficient. I'd actually include that as a NON-specification, preferring to
specify "grep file" and requirements like that elsewhere, especially since
files are likely to be large in bytes (the max unsigned integer size).

Tony
 
Ben Pfaff

Tony said:
That's a relevant example then. But will I use a 32-bit length field in the
string class I am currently reworking based on that? Maybe, but probably
not. Maybe that will be "bigstring" rather than "string". Should a 32-bit
length field be used in a language level, general string class? Maybe more
compelling there, but maybe a *choice* of lower overhead would be lucrative:
"string", "smallstring".

I would use size_t for the string length. In normal C
implementations, size_t is sufficient to hold the longest
possible string.
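
As a sketch of what a size_t length field might look like in practice: the struct and function names below are hypothetical and are not taken from any library mentioned in the thread. The data is stored with an explicit length, so embedded null bytes are preserved, and one extra terminator is appended so the buffer can still be handed to C string functions.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: explicit length plus owned storage. */
struct counted_string {
    size_t len;     /* number of data bytes, excluding the terminator */
    char *data;     /* malloc'd buffer of len + 1 bytes */
};

/* Copy len bytes into a new counted string.
 * Returns 0 on success, -1 on failure. */
int counted_string_init(struct counted_string *s, const void *bytes,
                        size_t len)
{
    if (len == SIZE_MAX)            /* len + 1 would wrap to zero */
        return -1;

    s->data = malloc(len + 1);
    if (s->data == NULL)
        return -1;

    if (len > 0)
        memcpy(s->data, bytes, len);
    s->data[len] = '\0';            /* convenience terminator */
    s->len = len;
    return 0;
}
```

The cost is one size_t per string, which is the overhead being weighed against a smaller fixed-width length field in the posts above.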
 
Flash Gordon

CBFalconer said:
See my reply to Keith Thompson. You can't legally get calloc to
create anything larger than SIZE_MAX, because you can't specify
such a value.

Yes you can as has already been shown in this thread. For one argument
to calloc you pass SIZE_MAX and for the other argument any value greater
than one. Other argument combinations will also achieve the same result.
The size_t operands are unsigned integers and follow
the rules for unsigned overflows etc.

Irrelevant unless you can show somewhere in the specification for size_t
that the standard requires that calloc perform a multiplication of its
two parameters using the size_t type. Hint, it doesn't.

If you want to say something cannot be done legally you need to provide
a quote from the standard that proves it. The definition of sizeof has
already been demonstrated NOT to be relevant here.
 
