Null terminated strings: bad or good?


CBFalconer

JC said:
Fair enough. Then, what is the rationale for the premise that "no
object can be larger than SIZE_MAX bytes"? Is it because sizeof()
is undefined for such an object? Just because sizeof() is
undefined for objects larger than SIZE_MAX bytes doesn't seem to
imply that objects larger than SIZE_MAX bytes can't exist -- just
that you can't determine the size of them with sizeof(). That's
the same rationale I use for why the undefined behavior of strlen()
doesn't affect the maximum length of a string.

SIZE_MAX is defined as the maximum value of a size_t integer. At
the same time, size_t type is the type of the value returned by
sizeof, and sizeof is designed to return the size of any object.
Repeat, ANY OBJECT. Thus no object can be larger than SIZE_MAX
bytes.

At the same time calloc returns memory space for the requested
count of the requested size objects. This can be used for an
array. An array is a single object. Repeat, SINGLE object. Thus
the size cannot exceed SIZE_MAX.

If calloc accepts calls for items with a net size greater than
SIZE_MAX, the code is in error. calloc should simply return NULL
for such a call.
 

CBFalconer

pete said:
.... snip ...

calloc allocates an object that doesn't have to have a declared
type or an array name. If calloc returns the address of an object
which has more than ((size_t)-1) bytes, that doesn't break sizeof.

However, it indicates that calloc is faulty.
 

CBFalconer

James said:
.... snip ...

The requirement that sizeof(type) always give the correct size
has quite different implications. Since it will always be
possible to declare types that have a size greater than SIZE_MAX,
no matter what value SIZE_MAX has, there's no course of action
that an implementation has available to it to ensure meeting that
requirement. I therefore consider that requirement to be a defect
in the standard. I imagine that the actual intent was that
sizeof(type) is only required to give the correct size when the
correct size is smaller than SIZE_MAX. However, the standard as
currently written contains no wording that actually allows sizeof
to ever fail to give the correct size.

No. While it may be possible to declare something that is too big,
it is not possible (with a correct compiler) to successfully
compile that code. For direct declarations (automatic or static)
the error should appear at compile time. For use with malloc,
calloc, realloc the routine should return a NULL.

Note that this means calloc must perform a multiplication, and
check that the result does not exceed SIZE_MAX. This is not a
requirement that calloc, malloc, or realloc be able to supply
blocks of size SIZE_MAX - they are allowed to reject the call
because insufficient memory is available.
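A minimal sketch of that multiplication check, using a hypothetical
wrapper name rather than any real libc's internals:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: refuse any request whose total byte count
   would exceed SIZE_MAX, as argued above. */
void *checked_calloc(size_t nmemb, size_t size)
{
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;            /* nmemb * size would overflow size_t */
    return calloc(nmemb, size); /* may still be NULL: not enough memory */
}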
 

jameskuyper

CBFalconer said:
No. While it may be possible to declare something that is too big,
it is not possible (with a correct compiler) to successfully
compile that code.

I'm not talking about declaring an object that big. I'm talking about
referring to a type that big. It is possible to evaluate sizeof(type)
even if no object of the specified type is created anywhere in the
program. It's pretty clear that an implementation can have a limit on
the size of an object; it's not clear (at least to me) that an
implementation can arbitrarily limit the size of a type that is never
used to actually define an object.
... For direct declarations (automatic or static)
the error should appear at compile time. For use with malloc,
calloc, realloc the routine should return a NULL.

When, and on what basis, should the code fail (presumably at compile
time?) when no attempt is made to create any object of the specified
type, whether with automatic, static, or allocated storage duration?
 

JC

CBFalconer said:
SIZE_MAX is defined as the maximum value of a size_t integer. At
the same time, size_t type is the type of the value returned by
sizeof, and sizeof is designed to return the size of any object.
Repeat, ANY OBJECT.  Thus no object can be larger than SIZE_MAX
bytes.

Additionally:

Since "sizeof object" is defined as having a value (of type size_t)
which is the size of the object, it does not have the behavior mandated
by the standard if there's any object it can be applied to which has a
size greater than that limit. If an implementation allows the creation
of such objects in code which applies the sizeof operator to those
objects, it has no choice but to fail to conform to the standard, one
way or another.


It looks like there can be more than one way to work through this
logic. I believe this is the root of the disagreement.

(a) On one hand, you can say if sizeof *must* return the size of an
object, and sizeof returns a size_t, then in order to satisfy the
constraints of sizeof, an object's size *must* fit in a size_t. That
is the argument you, James Kuyper, and others are putting forward.

(b) On the other hand, it seems that you can say if sizeof returns the
size of an object, and sizeof returns a size_t, then sizeof is
undefined if an object's size cannot fit in a size_t, and that sizeof
is not the limiting factor on an object's size. I do not see anything
in the standard that disallows this line of reasoning either. This is
the argument I am putting forward.

It appears that both arguments are equally valid. With (b), to deduce
what the maximum size of an object can be, you must look elsewhere in
the standard, as sizeof's definition is insufficient to determine it.
AFAIK there are no other statements of the object's maximum size. It
would follow, then, with (b), that while the maximum theoretical size
of an object is infinite (i.e. there's no number such that if the size
of an object exceeded that number, it would no longer be considered an
"object"), the actual maximum size of an object is limited by the
largest amount of memory you could obtain. The largest amount of
memory you can obtain, AFAIK, is by calloc(SIZE_MAX,SIZE_MAX). This
would have to be on a system where the number of bits in a pointer was
at least double the number of bits in a size_t (e.g. 64-bit pointers,
32-bit size_t).
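For what it's worth, a quick probe of what any given implementation
does with such a request (a sketch; since this behavior is exactly
what is in dispute, the output is implementation-specific):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Request far more than SIZE_MAX bytes in total and report
       what calloc hands back. */
    void *p = calloc(SIZE_MAX, SIZE_MAX);
    printf("calloc(SIZE_MAX, SIZE_MAX) returned %p\n", p);
    free(p);
    return 0;
}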

Jason
 

Richard

CBFalconer said:
No. While it may be possible to declare something that is too big,
it is not possible (with a correct compiler) to successfully
compile that code. For direct declarations (automatic or static)
the error should appear at compile time. For use with malloc,
calloc, realloc the routine should return a NULL.

No it should not.

And I will leave it as an exercise for you to figure out why.
 

Richard

Han from China - Master Troll said:
Idiot. The functions malloc() and realloc() won't receive a
size larger than SIZE_MAX to begin with.


Yours,
Han from China

Damn. You beat me to it. Is this Falconer guy always such an
egotistical "big I am"?

The day a function like calloc can determine that the size_t passed in
is too big for a size_t is the day we may as well go home.
 

Keith Thompson

CBFalconer said:
SIZE_MAX is defined as the maximum value of a size_t integer. At
the same time, size_t type is the type of the value returned by
sizeof, and sizeof is designed to return the size of any object.
Repeat, ANY OBJECT. Thus no object can be larger than SIZE_MAX
bytes.

Not quite. The sizeof operator can be applied either to a
parenthesized type name or to an *expression* (specifically a
"unary-expression"). That expression can be an object name, but in
the case of an object created by calloc(), there's no name for the
object. The expression needn't even be an lvalue; "sizeof (2+2)" is
equivalent to "sizeof (int)".
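A two-line illustration of that point (a sketch; the operand of
sizeof is not evaluated, only its type matters):

#include <stdio.h>

int main(void)
{
    /* (2+2) has type int, so both expressions yield the same value. */
    printf("%zu %zu\n", sizeof (2 + 2), sizeof (int));
    return 0;
}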
CBFalconer also said:
At the same time calloc returns memory space for the requested
count of the requested size objects. This can be used for an
array. An array is a single object. Repeat, SINGLE object. Thus
the size cannot exceed SIZE_MAX.

If calloc accepts calls for items with a net size greater than
SIZE_MAX, the code is in error. calloc should simply return NULL
for such a call.

An implementation can certainly avoid the whole issue by having
calloc() return NULL whenever the requested size exceeds SIZE_MAX, and
I suspect most implementations do that.

But suppose an implementation returns a non-null result for
calloc(SIZE_MAX, 2). You can't directly apply sizeof to the resulting
object, since it's anonymous. You can try to compute, for example,
"sizeof char[SIZE_MAX][2]", but you could try that even if calloc()
didn't exist. One compiler I tried issued a compile-time diagnostic:

error: size of array 'type name' is too large

but in that implementation calloc() returns a null pointer for
huge allocations.
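For reference, a translation unit that reproduces that experiment
(whether it compiles at all is the implementation-specific question):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* char[SIZE_MAX][2] names a type larger than SIZE_MAX bytes; a
       compiler may reject this line with a diagnostic like the one
       quoted above. */
    printf("%zu\n", sizeof (char[SIZE_MAX][2]));
    return 0;
}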

So here's the question. Suppose an implementation (a) returns a valid
non-null pointer for calloc(SIZE_MAX, 2), but (b) rejects (i.e.,
issues a compile-time diagnostic and fails to process the translation
unit) the expression ``sizeof char[SIZE_MAX][2]''. What clause of the
standard would this violate? If you claim that "sizeof is designed to
return the size of any object", please cite specific text from the
standard that supports this claim.
 

saul.plonkerton

I think Chuck earned his bragging rights when he wrote a gets() replacement
that causes a bottleneck tighter than a nun's asshole.

That sounds like a good bottleneck to me. But maybe you just don't
know the right nuns.
 

Tony

blargg said:
It's about the tradeoff between having meta-information in a separate
channel and embedding it in the same channel via reserved symbols. Null
termination unifies pointer to string and pointer into string, and
eliminates the need for a specialized string type. This is often used for
arrays of other types too, with a terminator element (often -1 for
integral types).

Well said (meaning: I read that and understand it). I still don't see the
benefit of null terminated strings over a struct-like thing:

struct string
{
    uint32 length;
    char  *data;
};

Tony
 

Tony

I can think of a few good reasons to have "string" mean a contiguous
series of bytes and a length.

"These are sometimes known as "Pascal-style" strings. The main issue is
the length of the string is limited by the maximum value that can be
stored in the length field; in Pascal, it was a single byte, limiting
strings to 255 characters. There are other variants that are in use
and you may run into in C, for example, the Windows API defines a
"BSTR" type, which consists of a 4-byte length field followed by
string data, the pointers you deal with point to the start of the data
(4 bytes after the start of the allocated block)."

Perhaps strings should be akin to width-specified integers:

string16 (a string with up to 65535 chars)
string32 ... etc.
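As a sketch, those width-specified strings might be declared like so
(illustrative types, not a standard facility):

#include <stdint.h>

struct string16 { uint16_t length; char *data; };  /* up to 65535 bytes */
struct string32 { uint32_t length; char *data; };  /* up to 4294967295 bytes */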
I have a hard time finding any value in having
"string" mean a contiguous series of bytes terminated by a null.

"These are normally known as "C-style" strings. The main advantage is
the length of the string is limited only by available memory, and the
length field is not stored with the string, thus conserving storage
space."

The "main advantage" above, is actually a disadvantage. It causes
programmers to write code that is succeptible to buffer overrun attacks.
Storage space conservation? Only in the exceptional case nowadays.

"Another major advantage to storing null-terminated strings is the
strings can be modified in place with minimal effort; truncating a
string is a matter of simply setting the new end byte to 0,"

As if changing the length field was harder to do?

""removing"
the prefix of a string can be done simply by referring to a location
past the beginning,"

That operation is the same with the "Pascal-style" string too, but then
the length has to be updated. No big deal.
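For concreteness, the two in-place operations side by side (a sketch
with illustrative names; assumes k <= n):

#include <stddef.h>
#include <stdint.h>

struct string { uint32_t length; char *data; };

/* Null-terminated style: truncate at index n, drop a k-byte prefix. */
char *nts_trim(char *s, size_t n, size_t k)
{
    s[n] = '\0';   /* truncation: set the new end byte to 0 */
    return s + k;  /* prefix removal: point past the beginning */
}

/* Length-carrying style: the same two operations plus the length
   update mentioned above. */
void ls_trim(struct string *s, uint32_t n, uint32_t k)
{
    s->length = n - k;  /* new length after truncation and prefix drop */
    s->data  += k;      /* same pointer arithmetic as above */
}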

"dividing strings into substrings can be done by
placing 0's where appropriate. As an exercise, try implementing strtok
() with Pascal-style strings. You may be surprised at the difficulty."

Well, one function is an exceptional case. The rule is to program for the
common case and make special things as required rather than complicate the
general case.

"The main disadvantage of C-style strings is computing the length is O
(n)"

I'd say there are a FEW issues and that is just one of them.

", but applications that need to reduce this to constant time can
easily do so by storing the length elsewhere, if they need it."

Tony
 

Tony

Well, judging by your groundless accusations about me in recent days,
I gather you wanted my attention. You'll notice I did heed your desire
for me to reply in-thread, but after I'm done with you, I think you'll
probably wish I was tucked hidden away out-thread after all.

Welcome to the fire. Round one.

Please cite the C standard for your claim that the length of a C string
is limited *only* by available memory.

C'mon, bro, show me what you've got. I've got a bunch of quotes from
our venerable "regulars" at my disposal, so either tread very carefully
here or concede by silence that you've pissed your pants and bailed.

Yours,
Han from China

Thanks, Han. I thought I was going to have to read many more posts in
this thread, but now you've marked the prune point and I need not read
this branch beyond your post at all! Thx!

Tony
 

Tony

I can think of a few good reasons to have "string" mean a contiguous
series of bytes and a length. I have a hard time finding any value in
having
"string" mean a contiguous series of bytes terminated by a null. Help me
with this please.

Tony


"For one thing it's handy. You have code such as:

char *p;

for (p = str; *p; ++p)
{
*p = tolower((char unsigned)*p);
}"

You can still do that with:

struct string
{
    uint32 len;
    char  *data;
};
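Spelled out, the earlier lowercasing loop over such a length-carrying
string might look like this (a sketch; uint32_t stands in for the
uint32 above):

#include <ctype.h>
#include <stdint.h>

struct string
{
    uint32_t length;
    char    *data;
};

/* Same in-place lowercasing, bounded by the stored length rather
   than by a terminating null. */
void string_tolower(struct string *s)
{
    uint32_t i;
    for (i = 0; i < s->length; ++i)
        s->data[i] = (char)tolower((unsigned char)s->data[i]);
}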


"I use null-terminated arrays in my programs, not just for strings. For
instance in my current project I have a null-terminated array of IP
addresses, and also a null-terminated array of MAC addresses. It
allows for simpler code such as:

for (p = ip_addresses; *p; ++p)
{
    if (*p == ip_default_gateway)
        DoSomething();
}"

Though semantically easier in C++, examples like the above are not a
compelling reason IMO.


"If the amount of IP addresses changes at runtime, I don't have to
change some global integer variable that indicates the size. If I did
it would be something like:

char *p;
char const *const pend =
    ip_addresses + global_variable_amount_ip_addresses;

for (p = ip_addresses; p != pend; ++p)
{
    if (*p == ip_default_gateway)
        DoSomething();
}

Also I'd say the null-terminated way is faster."
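The sentinel pattern quoted above generalizes beyond characters; a
self-contained sketch with illustrative data:

#include <stdio.h>

/* A null-terminated array of pointers: the NULL sentinel plays the
   role that '\0' plays in a C string. */
static const char *names[] = { "alpha", "beta", "gamma", NULL };

int main(void)
{
    const char *const *p;
    for (p = names; *p; ++p)
        puts(*p);
    return 0;
}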

I didn't take time to grok your whole example, but I don't think I see it as
justification for null terminated strings (especially since I use C++).

Tony
 

Tony

Richard Heathfield said:
Tony said:


So can I.


Ease of implementation.

Try it. Write a set of functions for manipulating strings (e.g.
implement the string functions from the standard library). Then
write a similar set of functions for manipulating length+data
"strings".

My C++ string class is a length and data ptr (pretty much, since the
underpinnings are an array class).
The latter is, in my view, more worthwhile, especially if you
incorporate "stretchiness" into the strings. Nevertheless, it is a
little harder to do and takes a little more time.

And didn't the null terminated string give us the buffer overrun fiasco?

Tony
 

Tony

Richard Tobin said:
Your question is strangely phrased. Do you really want to ask about
whether "string" should mean that, or whether a programming language
should use null-terminated strings?

I indeed meant it in a broad perspective. I was questioning the sanity of the C
implementation of strings.

Tony
 

Tony

Thad Smith said:
Strictly speaking, there was no question. I sense an implied question of
"what are the advantages of null-terminated strings?" for the purpose of
increasing general programming knowledge. I agree with the earlier posts
that it is an easier implementation and eliminates the need for selecting
a size for a length variable.
I use C++ and very often "ponder" the deficiencies of the language that are
there for reasons of backward compatibility with C. My C++ string class IS a
length and a ptr to data. So I've already decided long ago. I was just
wondering if someone knew something I didn't about it because it seems so
obvious.

Tony
 

Tony

jacob navia said:
There are many advantages to zero terminated strings:

o Performance. You must scan the whole string millions of times
each time you want to know how long it is. This always needs a
faster processor, so you can count on C to get that new game
machine you were dreaming of.
o Security. Since there is no way to know the length without scanning
the supposed string, if there is no terminating zero your program will
crash, or even corrupt other variables if you are writing to the
string. This will increase the security provided by the already lax
standards of C and will SURELY give the C++ people one more reason
to say: "You see? C sucks".
o Ease of implementation. Instead of just keeping the length in a
correct data structure and avoiding all the above problems, you have
to program around them, increasing the code length and bug
surface area. For instance, to catenate two strings (strcat operation),
instead of just adding the two lengths, allocating space, then
copying, you have to scan both strings to find their lengths, and
allocate one byte more to store the zero (how many millions
of newbie bugs could have been avoided), etc.
Zero terminated strings make it easy to inject bugs in your code,
hence they are easy to implement.
o Virus writers would have had a lot of work to do if those strings
weren't there to ease their job.

The lcc-win compiler features a string library without any zero
termination.

I have discussed this subject many times in this group, which is full
of retrograde people that LOVE zero terminated strings. I have had
my dose of them and will not answer any of their "arguments".
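For concreteness, the catenation jacob describes, done length-style
(a sketch with an illustrative struct, not lcc-win's actual API):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct string { uint32_t length; char *data; };

/* Catenate a and b into *out: add the lengths, allocate, copy.
   No scanning and no terminator byte. Returns 0 on success, -1 on
   length overflow or allocation failure. */
int string_cat(struct string *out, const struct string *a,
               const struct string *b)
{
    if (a->length > UINT32_MAX - b->length)
        return -1;   /* sum would not fit the length field */
    out->data = malloc((size_t)a->length + b->length);
    if (out->data == NULL)
        return -1;   /* out of memory (or zero total size on some libcs) */
    memcpy(out->data, a->data, a->length);
    memcpy(out->data + a->length, b->data, b->length);
    out->length = a->length + b->length;
    return 0;
}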

My thinking was/is also that null terminated strings suck. I was just being
a good citizen and asking the more general question.

Tony
 

Tony

Malcolm McLean said:
"string" in programming terms means "human-readable sequence of character
data". The obvious way of representing characters is to have a code, and
to store the character codes contiguously in memory.

In C the decision was made to terminate string literals (quoted sequences
embedded in C source) with a nul. The obvious alternative would be to have
a structure with a hidden length element. Whilst you can use alternative
string representations in C, you can't embed them. So most C string
functions operate on nul terminated strings, simply because of the ease of
writing foo("Fred");
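A one-liner showing the embedded terminator Malcolm refers to (a
sketch):

#include <stdio.h>

int main(void)
{
    /* The literal "Fred" occupies five bytes: 'F','r','e','d','\0'. */
    printf("%zu\n", sizeof "Fred");  /* prints 5 */
    return 0;
}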

Thank you! I wasn't considering that. But to me, it seems like that is a
special case deserving special machinery that doesn't penalize the common
case. They are different kinds of strings! (Just like there is not just one
integer width).
Whether the decision was the best one or not is open to question.

Not to me anymore. I think you wrapped it up for me. The literal
implementation should be different maybe. What do I care? Why should I as a
programmer be penalized by compiler implementors?!
For short strings, having a sentinel value makes programming easier; for
long strings it can lead to performance problems. Almost every man and his
dog tries to write a better C string library at some point or other.

I have one of those. Except I architected it in C++.

Tony
 

Tony

CBFalconer said:
And most strings used are short. However, my dog never attempted
to improve the C string library throughout his life. He died of
old age about 5 years ago.

Duh. Was he Pekinese? Well of course THEY wouldn't do that! (Too
temperamental.) (Sorry to hear about your loss.)

Tony
 
