Null terminated strings: bad or good?

Bartc

Tony said:
I'm thinking that the overhead of such a design is unnecessary. Consider: .....

Now consider the overhead of the above while handling a 1 MB text file as
one string per line with an average line length of 80 bytes:

1000000 b/(80 b/line) = 12500 lines

On a 32-bit platform:

string16: 12500 lines*(8 bytes overhead/line) = 100 KB overhead/MB = 10%
string_t: 12500 lines*(12 bytes overhead/line) = 150 KB overhead/MB =
15%
string_null: 12500 lines*(8 bytes overhead/line) = 100 KB overhead/MB =
10%

If the file is split into one line per string, then that's already 4 bytes
per line or 50KB/5% overhead.

Adding a length makes that 8 bytes per line or 100KB/10% overhead. But it is
only 5% /extra/. And a total size of 1.1MB instead of 1MB.

The buff_sz as you say can be optional, especially as it could double the
overhead (from 8 to 16 bytes). Where the 1MB text has been loaded into a
single memory block, then each line is not individually allocated anyway.
(And even if it was, there may be other ways of deriving the capacity of a
string of length N without storing that value with every string of length
N.)

(My efforts at this use a 16-byte descriptor, but this is part of a dynamic
type system (for a separate language). The spare capacity of a string
allocation is not stored but handled by the memory allocator, but then my
strings tend to be immutable when it comes to extending them.)
On a 64-bit platform:
string16: 12500 lines*(16 bytes overhead/line) = 200 KB overhead/MB =
20%
string_t: 12500 lines*(24 bytes overhead/line) = 300 KB overhead/MB =
30%
string_null: 12500 lines*(16 bytes overhead/line) = 200 KB overhead/MB =
20%

64-bit is always going to have this problem, effectively doubling the memory
and bandwidth requirements for the (likely) majority of programs that fit
happily into 32 bits.
 
Ian Collins

Richard said:
Do you have any evidence for that? I don't recall anything that
implies that the size of objects is calculated using C arithmetic
(rather than everyday arithmetic) at all.
The size of objects is calculated by the compiler (except for VLAs) and
is an unsigned integer type (6.5.3.4/4).
 
Wojtek Lerch

CBFalconer said:
Wojtek said:
CBFalconer said:
Wojtek Lerch wrote:

Where does the standard say that char[SIZE_MAX][SIZE_MAX] is
not a declarable type?

It says sizeof can return the size of a type. But it returns a
size_t, which has a maximum value of SIZE_MAX. This requires
that the declaration be an error, or at least unusable.

Unusable as an operand of sizeof, maybe. But it doesn't follow
that it must be unusable for other purposes.

No restriction on using it as a number.

Using a type as a number? The standard doesn't allow that, does it? I'd
call that a restriction.
But size_t is intended to
measure the size of ANY object, which must first be created.

No, for the Nth time, size_t measures the size of a type, regardless of
whether that type is ever involved in the creation of an object. When the
operand is an expression, only its type matters, and whether the expression
dereferences an object (or is an lvalue in the first place) does not affect
the correctness of the code.
That
means the object fits into memory (which may include disk
simulation of memory space).

It might mean that, but since it's untrue anyway, it doesn't matter. The C
standard that the rest of us have read doesn't forbid SIZE_MAX or the size
of some types to be greater than the size of memory. A lot of
implementations use a 32-bit size_t on hardware incapable of addressing 4GB
of memory.

Think about it: you seem to be claiming that sizeof(char[N]) must be a
"compile error" if the machine that the program will run on has less than N
bytes of memory. Do you seriously believe that the standard requires
compilers to have that kind of knowledge of the machines that will run the
compiled program?
That also means that size_t matches
the addressing capabilities of the computing unit.

No it doesn't. The standard doesn't say anything about the addressing
capabilities of the computing unit.
 
Richard Tobin

Do you have any evidence for that? I don't recall anything that
implies that the size of objects is calculated using C arithmetic
(rather than everyday arithmetic) at all.
The size of objects is calculated by the compiler (except for VLAs) and
is an unsigned integer type (6.5.3.4/4).

That's about the type of sizeof()'s return value. I'm referring to
CBFalconer's contention that calloc() (and presumably the compiler
when handling definitions) uses C unsigned arithmetic (with its
wraparound) to determine how much space to allocate.

-- Richard
 
Ben Bacarisse

Han from China - Master Troll said:
One such case may be string literals. Since the object created by a string
literal need not be modifiable, a compiler could store its length somewhere
at compile-time.

My point is that it's wrong to assume strlen() has O(n) performance in
all cases.

Storing the lengths of literals can't affect the "Big-O" for strlen.
What you probably meant was that it is wrong to assume that strlen
needs to count the characters in all cases.
 
Tony

Keith Thompson said:
We've been debating that point in comp.std.c, but in practice, yes,
it's reasonable to assume that size_t is sufficient.

Using something smaller than size_t to hold the length of a
string-like object is what I'd consider to be a micro-optimization.
Sometimes you need to do that kind of thing (for example, storing 1 or
2 bytes rather than 4 on a severely memory-constrained system might
make sense), but in most cases it's a waste of effort. Or you can
define a separate "short string" type in addition to one that can
potentially store up to SIZE_MAX characters, but then you have to deal
with converting between the two types.

My advice: Keep It Simple -- use size_t to hold the length unless you
have a really good reason not to.

It's not any less simple to use a 32-bit or a 16-bit integer. I think you're
chasing that elusive "portability" goal needlessly. Who needs 64-bit
strings? (I'd get by just fine with a 65535-byte limit). A string, at least
in the vast majority of uses ("99.9%"?), need not gain capacity on a 64-bit
platform, for example. The case for the other direction may have a tad of
meat, but only if one were implementing one of those "standard library"
things (everything including the kitchen sink, and ATTEMPTS to be all things
in all cases). My greatest simplification: don't use the standard libraries!

Tony
 
Tony

Where the 1MB text has been loaded into a
single memory block, then each line is not individually allocated anyway.
(And even if it was, there may be other ways of deriving the capacity of a
string of length N without storing that value with every string of length
N.)

I knew someone was going to say that. But think about something like a text
editor (that's the scenario I had in mind) and having to edit the lines: the
easiest implementation is one string per line. Let's assume that is the
example scenario since that is what I had in mind.

The other thing: getting the length from somewhere else could be done, but
it sounds like a function call (or more than one) every time you want the
length, so it's not very nice for that reason, and it couples the string with
something else (the memory manager).

Tony
 
Tony

Ian Collins said:
As Ben says, use size_t.

I don't think it is appropriate, and I prefer to consider the intended
practical usage and bound things to practical limits. I don't foresee myself
developing on other than 32-bit and 64-bit platforms, and if I do, it will be
with a different codebase, because those platforms are likely to be WAAAAY
different in many/every way. The standard library builders have to deal with
the complexity, but by "rolling my own" libraries, I don't have to, and
thereby have a much simpler implementation.
If you want to optimise for short strings,
follow the example of the commonly used short string optimisation in C++
standard libraries.

Or viewed from my perspective: don't penalize the common case by using
size_t. Doing nothing is simpler than doing something. To contrive a greater
requirement than exists, and then deal with the need to fixup the common
case, leads to a lot more work than is necessary.

Tony
 
Tony

Phil Carmody said:
With all due respect, I think you'll find that embedding null
characters into a C string requires special abstractions and
handling.

If you read the thread from the beginning, you'll find that I'm not a
proponent of null-terminated strings, but am more than willing to listen to
what others feel is good (or bad) about them.

Tony
 
Tony

Han from China - Master Troll said:
I meant both. Big-O can be applied to restricted domains, such as the
domain of string literals.

But looking at my statement and yours, I'd like to double down and
say that it's wrong to assume that strlen() needs to count the characters
in *any* cases.

Do most implementations do that though? Mine does, but I prefer
obvious/simple implementations over optimized ones that are more obscure,
unless of course, optimization is required and that's the only option.

Tony
 
Phil Carmody

Tony said:
If you read the thread from the beginning, you'll find that I'm not a
proponent of null-terminated strings, but am more than willing to listen to
what others feel is good (or bad) about them.

Was that supposed to be an agreement or disagreement with what
I said? Your ability to follow and continue a logical argument
seems to be sub-par.

Phil
 
Phil Carmody

CBFalconer said:
i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

Stop. Think. Have you never encountered a construct such as this:

nodger_t *ptr = malloc(COUNT * sizeof(*ptr));

?

What do you think about the sizeof(*ptr) there, before ptr even
has a value?

And I'm obliged to add a question enquiring why you didn't check?
This is comp.lang.c, not comp.lang.cbf.guesses.incorrectly.about.c .

Phil
 
Phil Carmody

CBFalconer said:
Wojtek said:
CBFalconer said:
Wojtek Lerch wrote:

Where does the standard say that char[SIZE_MAX][SIZE_MAX] is
not a declarable type?

It says sizeof can return the size of a type. But it returns a
size_t, which has a maximum value of SIZE_MAX. This requires
that the declaration be an error, or at least unusable.

Unusable as an operand of sizeof, maybe. But it doesn't follow
that it must be unusable for other purposes.

No restriction on using it as a number. But size_t is intended to
measure the size of ANY object, which must first be created.

Absolute nonsense. Sizeof measures types. If you give it an object,
it will give you the size of the type of that object. No creation
of any objects is necessary. Are any int[n] arrays created in the
following:

#include <stdio.h>
#include <stdlib.h>

int* foo(int n)
{
    static int p1[10], p2[20];
    if (sizeof(int[n]) <= sizeof(p1)) { return p1; }
    if (sizeof(int[n]) <= sizeof(p2)) { return p2; }
    return NULL;
}

int main(void)
{
    for (int i = 5; i <= 25; i += 5)
    {
        printf("%d -> %p\n", i, (void*)foo(i));
    }
    return 0;
}

Phil
 
just.a.garbageman

I meant both. Big-O can be applied to restricted domains, such as the
domain of string literals.  

But looking at my statement and yours, I'd like to double down and
say that it's wrong to assume that strlen() needs to count the characters
in *any* cases. What do you think of that?

If you provide a counterexample, I think I could conceive of a machine
that would still allow for O(1) over the full domain.

The standard allows implementations to store the length of the string
somewhere (like, say, at string_ptr - sizeof (size_t)). To be more
precise, it does not disallow it (nor does it mention it).

An implementation is free to provide an O(1) or O(n) strlen. I'm glad
the standard avoids putting restrictions on efficiency; that allows even
the least efficient implementations to be written, for comparison
purposes.
 
Flash Gordon

CBFalconer said:
Keith Thompson wrote:
... snip ...

No. calloc should check that the multiplication does not overflow
before making any attempt to allocate memory. On overflow, return
NULL. Otherwise, call malloc. That's all that is required.

C&V for where the standard requires this as opposed to, for example,
doing the multiplication using a type with twice the number of value
bits as size_t and successfully allocating the memory.

A sensible implementation on a segmented architecture where pointers are
n-bit segment and 16-bit offset could have
size_t 16 bits
ptrdiff_t 16 bits
unsigned long 32 bits
Maximum declarable object <= 65535 bytes
calloc able to create objects larger than 65535 bytes
 
Ben Bacarisse

The standard allows implementations to store the length of the string
somewhere (like, say, at string_ptr - sizeof (size_t)). To be more
precise, it does not disallow it (nor does it mention it).

An implementation is free to provide an O(1) or O(n) strlen. I'm glad
the standard avoids putting restrictions on efficiency; that allows even
the least efficient implementations to be written, for comparison
purposes.

It is hard (for me at least) to conceive of an O(1) algorithm for
strlen that does not require an absurd amount of extra data (and other
overheads). Do you have one in mind?
 
Harald van Dijk

The standard allows implementations to store the length of the string
somewhere (like, say at string_ptr - sizeof (size_t)). To be more
precise, it does not disallow it. (nor it mentions it)

The standard allows this for string literals, but it would have little
benefit as functions cannot make use of it reliably. The standard
guarantees that strlen("Hello") is 5, but equally that
strlen("Hello" + 1) is 4. 4 cannot be stored in the bytes before "ello".
 
Bartc

Harald van Dijk said:
The standard allows this for string literals, but it would have little
benefit as functions cannot make use of it reliably. The standard
guarantees that strlen("Hello") is 5, but equally that
strlen("Hello" + 1) is 4. 4 cannot be stored in the bytes before "ello".

That's another advantage of a (char*,length) structure which does not
require anything (length and/or zero byte) at either end of what it points
to. So it can point into the middle of another string.

You can't directly tell from such a structure however whether it 'owns' (by
allocation) what it points to, or not. But that is no different from a
regular char*.

There are so many ways of playing this, all with pros and cons, that a
null-terminated sequence was a good compromise to build into C.
 
James Kuyper

CBFalconer said:
James Kuyper wrote:
... snip ...

i.e. ptr = calloc(<SPECS>);
sz = sizeof(*ptr);

I haven't checked, but if ptr is NULL I believe that is an error.

ptr is the name of a variable. Its value can be null; and if so it must
compare equal to NULL, but ptr can't actually be NULL.

You actually have to check? 6.5.3.4p2: "If the type of the operand is a
variable length array type, the operand is evaluated; otherwise, the
operand is not evaluated ...".

If you can't even remember that the operand of sizeof isn't (usually)
evaluated, I'd recommend not arguing further about the sizeof until
you've re-read the description thoroughly.
 
James Kuyper

CBFalconer said:
James Kuyper wrote: ....

There are no "types that are too big". Their sizes have been
calculated using unsigned arithmetic,

I'm sure you believe that, but there's no support for that concept in
the standard.
 
