Increasing efficiency in C


jacob navia

As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

The length operation becomes just a memory read. This would considerably speed up programs. The basic idea is to use a string type that is length-prefixed and allows run-time checking against UB (undefined behavior).

Comparing strings is also sped up: when testing for equality, an initial length comparison can tell the whole story with just a couple of memory reads.
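
A minimal sketch of those two operations (str_length and str_equal are just placeholder names, not an existing API):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct string {
    size_t length;
    char data[];              /* C99 flexible array member */
};

/* Length is a single memory read instead of a scan. */
static size_t str_length(const struct string *s)
{
    return s->length;
}

/* Equality can give up after comparing the two lengths. */
static bool str_equal(const struct string *a, const struct string *b)
{
    return a->length == b->length &&
           memcmp(a->data, b->data, a->length) == 0;
}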

A string like the one described above is not able to resize itself. Any pointers to it would cease to be valid if the memory allocator is forced to move memory around when it is resized. The block where that string was allocated is bounded by other blocks in memory, and it is not possible to resize it in place.

A pointer (an indirect representation) costs an extra sizeof(void *), but allows strings to be resized without invalidating the pointers to them.

struct string {
    size_t length;
    char *data;
};
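
As a sketch only (str_resize is a made-up name, not part of any standard library), the indirection means the struct itself never moves, only the character block does:

#include <stdlib.h>
#include <stddef.h>

struct string {
    size_t length;
    char *data;
};

/* Grow or shrink the character block. Callers holding a pointer to the
   struct keep a valid pointer even when realloc moves the data. */
static int str_resize(struct string *s, size_t new_length)
{
    char *p = realloc(s->data, new_length + 1);  /* room for a '\0' */
    if (p == NULL)
        return -1;               /* old block is still intact */
    s->data = p;
    s->length = new_length;
    return 0;
}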

There is no compelling reason to choose one or the other.
It depends on the application. In any case, the standard
library could be complemented by
Strcmp
Strcpy
etc., all using length prefixed strings.

Syntactic sugar.

I have added some sugar to this coffee. I always liked coffee
with a bit of sugar. I feel that is too acid without it.

Current strings are accessed using the [ ] notation. These strings could have the same privilege, couldn't they?

The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

Not a big deal for today's compilers.

Length checked strings can then use:

String s;
....
s[2] = 'a';
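
Until such an extension exists, the same check has to be spelled as a function call in today's C; a minimal sketch, with str_set as a hypothetical helper name:

#include <assert.h>
#include <stddef.h>

struct string {
    size_t length;
    char *data;
};

/* What s[i] = c would expand to under the proposed [ ] overloading:
   a bounds check, then the store. */
static void str_set(struct string *s, size_t i, char c)
{
    assert(i < s->length);       /* run-time check instead of UB */
    s->data[i] = c;
}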

I think I am proposing the obvious.

Do you agree?

jacob
 

Mike Wahler

The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

Not a big deal for today's compilers.

Length checked strings can then use:

String s;
...
s[2] = 'a';

I think I am proposing the obvious.

I think you're proposing C++. Rather than try to 'reinvent' it,
I just use it.

-Mike
 

jacob navia

Mike Wahler said:
I think you're proposing C++. Rather than try to 'reinvent' it,
I just use it.

Well I can't use it Mike.

Just too complex.

Default instantiated template traits?

No thanks. Just characters. What I like about C is that it is not
"object oriented".

It is not oriented at all. It is the programmer who gives the program its orientation.

C++ has good ideas, but the complexity of the whole is so staggering that it is a reminder of where not knowing when to stop leads.

The crux of the matter is knowing when to stop. When a feature becomes a nuisance and doesn't simplify the task, it is better to drop it.

Syntactic sugar can lead to caries in the teeth. I said I like a *bit* of sugar, not five spoonfuls, you see?

I want a bit of sugar in my coffee and not some coffee in my
sugar.

jacob
 

Charles Harrison Caudill

jacob navia said:
As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.
This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

Correction: strcmp is inefficient; I'd like to see someone produce a more efficient model for strings, even using assembly.
A more efficient representation is:
struct string {
    size_t length;
    char data[];
};

I'm not really sure what the question is; you seem to have asked and answered it all in one...
The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

foo[x] is the value held in the array foo at location x, not string foo at location x. Strings don't even exist in C, just memory.
Length checked strings can then use:
String s;
...
s[2] = 'a';
I think I am proposing the obvious.

See C++
 

Mike Wahler

jacob navia said:
Well I can't use it Mike.
Whatever.

Just too complex.

You needn't use all of it. You seem to be wanting
a 'real' string type, which C++ has, and it's not
at all difficult to use. Actually, if such a type
is needed, imo that's a good enough reason to use C++
(even if everything else is only the common subset
of the two languages).
Default instantiated template traits?

So don't use 'em.
No thanks. Just characters. What I like about C is that it is not "object oriented".

C++ does not require OO design. This is a very common misconception.
It is not oriented at all. It is the programmer who gives the program its orientation.

Right. Which is why I find C++ useful for very many things.
(and C as well, and other languages too).
C++ has good ideas, but the complexity of the whole is so staggering that it is a reminder of where not knowing when to stop leads.

I use the parts I find useful, discard the rest.
The crux of the matter is knowing when to stop. When a feature becomes a nuisance and doesn't simplify the task, it is better to drop it.

Or ignore it. Simple, huh?
Syntactic sugar can lead to caries in the teeth. I said I like a *bit* of sugar, not five spoonfuls, you see?

Ever hear of self-discipline? Anything can be abused.
I want a bit of sugar in my coffee and not some coffee in my
sugar.

Who's got control of the spoon, you or someone else? :)

I'll stop now, I don't want to be accused of language
advocacy in a group about a different language. (It's
probably too late, though :))

-Mike
 

Leor Zolman

As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

zero terminated, anyway.
This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

It is "extremely inefficient" only if you're continuously recalculating the
length. For applications where you're not, it is extremely efficient.
A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

What is that [] about? That's not a legal definition. Are you implying that
a fixed-length array implementation (with an actual size in there) is an
improvement in any significant way over a simple char *? I don't think so.
The length operation becomes just a memory read. This would considerably speed up programs. The basic idea is to use a string type that is length-prefixed and allows run-time checking against UB (undefined behavior).

Now it is starting to sound like Java.
Comparing strings is also sped up: when testing for equality, an initial length comparison can tell the whole story with just a couple of memory reads.

Perhaps a bit; but on average, inequality is determined pretty quick the
conventional way, and equality would actually take /more/ time to
determine. But yes, you might net a teeny bit of an improvement.
A string like the one described above is not able to
resize itself.

Are we talking about the one with the fixed-length array, or the version
with the mysterious empty brackets? Either way, /nothing/ in C can "resize
itself"...
Any pointers to it would cease to be valid if the memory allocator is forced to move memory around when it is resized. The block where that string was allocated is bounded by other blocks in memory, and it is not possible to resize it in place.

A pointer (an indirect representation) costs an extra sizeof(void *), but allows strings to be resized without invalidating the pointers to them.

struct string {
    size_t length;
    char *data;
};

This is the classic first C++ class implementation exercise. Thinking about
it yields some good fundamental principles about class design. But, to
achieve a true performance benefit in a string service, it ultimately
requires tailoring the string implementation to the specific circumstances
in which it will be used. There's no magic bullet; the "irrational
exuberance" surrounding the rise-and-fall of reference counted Standard C++
string implementations is a case in point.
There is no compelling reason to choose one or the other.
It depends on the application. In any case, the standard
library could be complemented by
Strcmp
Strcpy
etc., all using length prefixed strings.

Syntactic sugar.

I have added some sugar to this coffee. I always liked coffee
with a bit of sugar. I feel that is too acid without it.

Current strings are accessed using the [ ] notation. These strings could have the same privilege, couldn't they?

The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

Not a big deal for today's compilers.

Might be a bigger deal for the C Standards committee ;-)
Length checked strings can then use:

String s;
...
s[2] = 'a';

I think I am proposing the obvious.

Do you agree?

Mike's right: use C++.
-leor

Leor Zolman
BD Software
(e-mail address removed)
www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
C++ users: Download BD Software's free STL Error Message
Decryptor at www.bdsoft.com/tools/stlfilt.html
 

Leor Zolman

You needn't use all of it. You seem to be wanting
a 'real' string type, which C++ has, and it's not
at all difficult to use. Actually, if such a type
is needed, imo that's a good enough reason to use C++
(even if everything else is only the common subset
of the two languages).

Jacob: There's even a common term used to describe C++ when used only for
those features that are a direct "clean-up" of messiness left over from C's
need to be backward-compatible with earlier incarnations of itself: "A
Better C". I don't think the C++ string class is typically lumped in with
those features (which I'm reluctant to go into here since this isn't a C++
group), but I'd think some more about it before discarding the idea of
using C++ as "C plus strings." I even wrote a CUJ article to help folks
with this very issue:
http://www.bdsoft.com/resources/thinking.html
(The title has "STL" in it, but the article is really about migration of
char *-based code to using strings)
-leor

Leor Zolman
BD Software
(e-mail address removed)
www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
C++ users: Download BD Software's free STL Error Message
Decryptor at www.bdsoft.com/tools/stlfilt.html
 

jacob navia

Leor Zolman said:
zero terminated, anyway.

Yes Sir!
Zero terminated and surely NOT zero delimited. What a deep
difference :)
It is "extremely inefficient" only if you're continuously recalculating the
length.

Obviously. And this is a very common use, haven't you noticed it?

For applications where you're not, it is extremely efficient.

Sorry, but this string was once constructed, and the length was known. Why not keep this information?

What about the security?

What about the failure modes of unbounded pointers?
A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

What is that [] about? That's not a legal definition.

C99 introduces variable length arrays. This is standard
notation.

Are you implying that
a fixed-length array implementation (with an actual size in there) is an
improvement in any significant way over a simple char *?

Yes.

1 Length operation is trivial
2 Comparisons for equality are cheaper when the length
of the strings differ. You never know this in C strings
and you have to start scanning for that zero...
3 Bounds checked strings can be implemented.
I don't think so.

Well, I think so, for the reasons above. Can you maybe go into those reasons in detail?
Now it is starting to sound like Java.

In matters of languages I do not despise any. I am
sorry, I like C but I am not a zealot, and see
C's problems and weakness. A bad string type
is the reason for many bugs we could really get
rid of.
Perhaps a bit; but on average, inequality is determined pretty quick the
conventional way, and equality would actually take /more/ time to
determine. But yes, you might net a teeny bit of an improvement.

And also net a big security improvement...
Are we talking about the one with the fixed-length array, or the version
with the mysterious empty brackets? Either way, /nothing/ in C can "resize
itself"...
Sorry, I thought realloc was part of C...
This is the classic first C++ class implementation exercise. Thinking about
it yields some good fundamental principles about class design.

Maybe but I do not want any class design. There are no classes
in C. I want strings for holding text. As I said, no
default instantiated template traits. Just chars please.
But, to
achieve a true performance benefit in a string service, it ultimately
requires tailoring the string implementation to the specific circumstances
in which it will be used. There's no magic bullet; the "irrational
exuberance" surrounding the rise-and-fall of reference counted Standard C++
string implementations is a case in point.

Yes, each application has its own needs. That's why I would
propose that the user writes many specialized string
structures, that share a common description.

Length delimited strings are infinitely extensible with other
features.
Mike's right: use C++.

I answered that to Mike. See my answer in a parallel thread.
I think C is the last non-object-oriented language around.
That makes it very interesting.

jacob
http://www.cs.virginia.edu/~lcc-win32
 

Leor Zolman

Yes Sir!
Zero terminated and surely NOT zero delimited. What a deep
difference :)

I think of delimiters as a matched set, terminated as asymmetric. Just
seemed off to use it there, but yes, I'm sure everyone knew what you meant.
Obviously. And this is a very common use, haven't you noticed it?

Knowing something about how the strings are going to be used is precisely
what drives the design decision of which flavor to use. When there's going
to be a lot of repeated length testing, that fact may contribute to a
decision against my using plain old char *'s /in that application/.
Sorry, but this string was once constructed, and the length was known. Why not keep this information?

Keeping and maintaining it has a spatial and temporal cost. Is it always justified? Sometimes, probably. Usually? Always?
What about the security?

What about the failure modes of unbounded pointers?

C doesn't provide any automatic protection for these things. The spirit of
C is to let the programmer program them if they're needed. Period.
A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

What is that [] about? That's not a legal definition.

C99 introduces variable length arrays. This is standard
notation.

Darn, I'm really going to actually have to write a piece of code using VLAs
some day, so I can at least recognize them when they get used (blush). But
the problem is, I don't like them ;-)
Yes.

1 Length operation is trivial
2 Comparisons for equality are cheaper when the length
of the strings differ. You never know this in C strings
and you have to start scanning for that zero...

....or the first mismatch. If you happen to know that enough of your strings
will be identical for their first several characters /and/ be of different
lengths for this to make a significant difference, you'd have good reason
to use your implementation /in that application/.
3 Bounds checked strings can be implemented.

They can, but lots of things /can/ be implemented, it is just that C has no
pretense of supporting such things at the core language level. Neither does
C++, for that matter.
Well, I think so, for the reasons above. Can you maybe go into those reasons in detail?

I'm not compelled to, no.
In matters of languages I do not despise any. I am
sorry, I like C but I am not a zealot, and see
C's problems and weakness. A bad string type
is the reason for many bugs we could really get
rid of.

Nor do I despise Java (I've even written an article, still available on
line somewhere, outlining why I believe Java makes a great "first"
programming language.) But hand-holding features are just /not/ in C's job
description, I'm sorry.
And also net a big security improvement...

Which you may or may not want to pay for.
Sorry, I thought realloc was part of C...

What I'm saying is that nothing "resizes itself"; there has to be user code to recognize the need, dispatch to the appropriate functions, etc. At any given point in a design, a C programmer can choose whether or not to do that stuff. She may choose not to, for reasons that make all the sense in the world for that application. She may not want that overhead forced upon her.
Maybe but I do not want any class design. There are no classes
in C. I want strings for holding text. As I said, no
default instantiated template traits. Just chars please.

I'm not trying to force class design down your throat; I'm just saying that "black-box" string management is always going to be either prejudicial to some quality of the data being operated upon, or middle-of-the-road and thus probably not optimal for /your/ situation, whatever that may be; it can't be sufficiently general-purpose and really efficient for some special case...
Yes, each application has its own needs. That's why I would
propose that the user writes many specialized string
structures, that share a common description.

Length delimited strings are infinitely extensible with other
features.


I answered that to Mike. See my answer in a parallel thread.
I think C is the last non-object-oriented language around.
That makes it very interesting.

jacob
http://www.cs.virginia.edu/~lcc-win32

Okay. Good luck in your quest,
-leor



Leor Zolman
BD Software
(e-mail address removed)
www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
C++ users: Download BD Software's free STL Error Message
Decryptor at www.bdsoft.com/tools/stlfilt.html
 

Mike Wahler

jacob navia said:
Sorry, I thought realloc was part of C...

Ever notice that the return value from 'realloc()' is
often not the same as returned by the original 'malloc()' or
'calloc()'? 'realloc()' often (in my experience almost
always, except in the cases of 'trivial' size increases)
does a new allocation followed by a copy and a deallocation.
Not really a 'resizing'.
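
For instance, a quick way to watch this happen (illustrative only; the printed addresses are implementation-specific):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *p = malloc(16);
    if (p == NULL)
        return 1;
    printf("original block:  %p\n", (void *)p);

    char *q = realloc(p, 1 << 20);   /* ask for a much larger block */
    if (q == NULL) {                 /* on failure the old block is untouched */
        free(p);
        return 1;
    }
    /* The address printed here may well differ from the one above:
       realloc is free to allocate a new block, copy, and free the old one. */
    printf("after realloc(): %p\n", (void *)q);
    free(q);
    return 0;
}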


-Mike
 

Arthur J. O'Dwyer

As everybody knows, C uses a [null-terminated string model]
for its representation of strings.
A more efficient representation is:
struct string {
    size_t length;
    char *data;
};

There is no compelling reason to choose one or the other.

Except the aforementioned efficiency reasons, of course. ;-)
So far, so good; but here you start to go downhill.
It depends on the application. In any case, the standard
library could be complemented by
Strcmp
Strcpy
etc., all using length prefixed strings.

Except for the fact that nobody in the world would accept that
kind of library bloat in C0x. I don't even like the dozens of
transcendental and date-manipulation functions in C90. :-D Besides,
if there's one thing a "simple" language *doesn't* need, it's two
different and incompatible implementations of one fundamental
concept (i.e., "string").
Syntactic sugar.

Not if it's accomplished by making the programmer do the bookkeeping
on those new library functions, it's not. I don't want to have to
remember what kind of string I'm using! Let the computer do it!
(This is one of the reasons I hate Java's library model: I don't care
whether I'm using a TreeFoo or a ListFoo or a HashFoo, I just want
some kind of Foo. They're all equally capable; why should I be
forced to keep books on which one I'm using at the moment?)
The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

I.e., you're proposing to take a C++ compiler and strip it of
most of the goodies. Okay, but that won't be the C language any
more. Consider simplicity and orthogonality: you really think you
should be able to re-define the semantics of [] but not * or ->?
That's foolishness waiting to happen.
Do you agree?

Of course not!

But remember where I said "so far, so good"? What you *should*
have concluded, there, was that since the "Pascal-style" string
model is so superior to the "C-style" string model, that wouldn't
it be neat if somebody implemented a C compiler that used the
Pascal model internally?!
That is, the *programmer* would still see a completely conforming
standard C implementation; but when he writes

a[i] = strlen(a);

where a typical C compiler would assemble the equivalent of
[completely untested and not-real code, but you get the point:]

; strlen(a)
MOV BX, a
L1: CMP BYTE PTR [BX], 0
JZ L2
INC BX
JMP L1
L2: MOV AX, BX
SUB AX, a
; a
MOV BX, a
ADD BX, i
; =
MOV [BX], AX

this hypothetical "Pascal-style" implementation could do the
much faster

; strlen(a)
MOV AX, [internal_a]
; a
MOV BX, [internal_a+4]
ADD BX, i
; =
MOV [BX], AX

Of course, this would require a *lot* of thinking-out ahead of
time, and a *lot* of compiler support (including clever workarounds
for users' trying to memcpy() over strings, or storing strings in
unions, or simply using 'malloc' in creative ways)... but it would
be really neat IMHO if you could get it to work.
It would certainly be a bigger and more widely interesting challenge
than simply re-implementing half of C++.

-Arthur
 

Chris Torek

In matters of languages I do not despise any. I am
sorry, I like C but I am not a zealot, and see
C's problems and weakness. A bad string type
is the reason for many bugs we could really get
rid of.

I do not think C's "string" data format is necessarily "bad", merely
"limited". The counted-strings other languages have used have their
own advantages and drawbacks.

Perhaps the biggest problem (as it were) with C is that it provides
such a limited built-in syntax for *generating* these anonymous
arrays. Anything inside double quotes is, ignoring the exception
for initializers, one of these special anonymous arrays. In C99
we at least can create anonymous "struct"s:

struct counted_string { size_t len; char *data; };
...
func((struct counted_string){sizeof "foo" - 1, "foo"});

Of course, as shown here, you cannot even use flexible array members
(initializing a variant of counted_string with "char data[]" instead
of "char *data" appears to be invalid). It might be nice if one
could get the above without resorting to preprocessor tricks.
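
The usual preprocessor trick would look something like this (CSTR is just an illustrative name):

#include <stddef.h>

struct counted_string { size_t len; char *data; };

/* Build a counted_string from a string literal; sizeof counts the
   terminating '\0', hence the "- 1". */
#define CSTR(lit) ((struct counted_string){ sizeof(lit) - 1, (lit) })

static size_t take(struct counted_string s) { return s.len; }

int main(void)
{
    return (int)take(CSTR("foo"));   /* take() receives { 3, "foo" } */
}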

(On the other hand, if you want Lisp, you know where to find it. :) )
 

those who know me have no need of my name

in comp.lang.c i read:
since the "Pascal-style" string
model is so superior to the "C-style" string model, that wouldn't
it be neat if somebody implemented a C compiler that used the
Pascal model internally?!
That is, the *programmer* would still see a completely conforming
standard C implementation;
Of course, this would require a *lot* of thinking-out ahead of
time, and a *lot* of compiler support (including clever workarounds
for users' trying to memcpy() over strings, or storing strings in
unions, or simply using 'malloc' in creative ways)...

or even the bog simple: a[strlen(a)-2] = 0;
 

Old Wolf

As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

Pointers cannot be bounded or unbounded, as such.
String constants are represented by an array. C arrays
are bounded (but it is not required for these bounds
to ever be checked).
C library functions expect a pointer into an array
or some other piece of allocated memory, which contains
a zero-terminator. Calling a C library function with
anything else is a programming error.
This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

It is inefficient to repeatedly calculate the length of the
string, as you say. However most of us are clever enough to
not do that.
A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

Efficient? Apart from requiring C99 (harder to get a compiler
for than C++) , it uses a lot more memory and is slower at
runtime because you have to do extra checks and updates every
time you modify the string. Also it cannot be declared
statically or allocated using malloc (if my understanding of
C99 VLA is correct). This is not my idea of efficient.

If you meant "char data[N]" for some N, then it wastes
even more memory and limits your string size.
The length operation becomes just a memory read.
This would considerably speed up programs.

Programs which check length a lot and do not do much else
with the string, possibly. This is a small subset of programs.
The basic idea is to use a string type that is length-prefixed and allows run-time checking against UB (undefined behavior).

How slow and inefficient. I prefer to prevent UB by coding
correctly. Also, how do you propose to check against UB
this way. Anyone can introduce UB by modifying the string
directly, unless you also enforce all modifications to it
to be via a library. (how slow).
Comparing strings is also sped up: when testing for equality, an initial length comparison can tell the whole story with just a couple of memory reads.

And if they were equal length you have wasted a comparison.
If your application is the sort where you test string equality
so much that it is important, you would probably introduce
some other method of equality checking (eg. a hash).

Most instances of "comparing strings" are actually interested
in which one comes first alphabetically, for which your
counted string is slower.
A string like the one described above is not able to resize itself. Any pointers to it would cease to be valid if the memory allocator is forced to move memory around when it is resized. The block where that string was allocated is bounded by other blocks in memory, and it is not possible to resize it in place.

This seems to be a problem with any string representation
(including C's builtin one)
A pointer (an indirect representation) costs an extra sizeof(void *), but allows strings to be resized without invalidating the pointers to them.

struct string {
    size_t length;
    char *data;
};

How does this allow resizing without invalidating pointers?
I presume you have something in mind like this:

char *ptr = str.data + 6;
resize_string(&str);
/* keep using ptr... */

To make a string bigger you have to get new memory (There is no
portable way of getting more memory at the same location because,
as you pointed out before, there might be something else already
using the desired memory).
There is no compelling reason to choose one or the other.
It depends on the application.

Congratulations, some sense. Wisely, the standard library
does not make either choice, leaving it up to the programmer
to do what is most efficient for his/her program.
In any case, the standard
library could be complemented by
Strcmp
Strcpy
etc., all using length prefixed strings.

What bloat. The C library is big enough as it is.
Syntactic sugar.

I have added some sugar to this coffee. I always liked coffee
with a bit of sugar. I feel that is too acid without it.

Current strings are accessed using the [ ] notation. These strings could have the same privilege, couldn't they?

The language extension I propose is that the user has the right to
define the operation [ ] for any data type he/she wishes.

Not a big deal for today's compilers.

Length checked strings can then use:

String s;
...
s[2] = 'a';

I think I am proposing the obvious.

I think you are proposing bounds-checked arrays. The standard
allows you to implement this already. Why not go and do it
and then advertise your compiler as supporting bounds checking.
In fact, why not write a library as you have proposed, and package
it with your compiler. Then people can choose if it suits them or not.
Do you agree?

Not really, no. You seem to be making big assumptions about
the rest of the world's programming requirements.
 

Richard Bos

jacob navia said:
As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

Possibly efficient in time, for certain programs, but almost certainly
inefficient in storage. The problem is two-pronged: either your strings
are large relative to your size_t, or they aren't. In the first case,
you will, sooner or later, run into a string that simply won't fit
inside a size_t.
Ok, so you need a size_t that is large enough to contain every possible
string length on the system, large enough to address all of your memory.
But in that case, you'll hit the second case: now, you're using four
bytes, maybe eight, to address a single string, AOT C's one-byte null
terminator. Not a problem, perhaps, if all your strings are dozens of
kilobytes, but most applications seem to use lots and lots of small
strings, every single one potentially costing you seven bytes extra, and
only a few large ones.
You might get away with this on systems where sizeof(size_t) ==
sizeof('\0'). IOW, an embedded device, most likely. But how many strings
does a microwave oven need, anyway? The only application I can think of
that makes this worthwhile is a mobile phone, where you may need quite a
few strings, most of them static.
The length operation becomes just a memory read. This would considerably speed up programs. The basic idea is to use a string type that is length-prefixed and allows run-time checking against UB (undefined behavior).

How? What if I want a 30-char array, of which the first 20 to 29 chars
contain a string, and the last char contains a checksum? Modifying the
checksum wouldn't be undefined behaviour at all, but it would write
beyond data[length-1].
Comparing strings is also sped up: when testing for equality, an initial length comparison can tell the whole story with just a couple of memory reads.

Erm... no. No, it doesn't at all. Because, you see, "aaa" < "zzzzz", but
"zzz" < "aaaaa". Length has _nothing_ to do with the lexicographical
ordering of strings, except when you've already compared the contents of
the strings and found them identical up to the length of the shortest.
In fact, I think you'll be hard pushed indeed to beat the efficiency of

int strcmp(const char *s1, const char *s2)
{
    while (*s1 == *s2 && *s1) { s1++; s2++; }
    return *s1 - *s2;
}

using your length-indicated string.
Syntactic sugar.

Syntactic sugar causes cancer of the semi-colon.
I have added some sugar to this coffee. I always liked coffee
with a bit of sugar. I feel that is too acid without it.

YM bitter, I suspect. And that shows that truly civilised people drink
tea, without sugar, and program in C, without ++ :)
Current strings are accessed using the [ ] notation. These strings could have the same privilege, couldn't they?

No, they couldn't. Puzzle for you: how would you extend the
interoperation between strings and char pointers, using your
length-strings? You can't point a char * inside one, because if you do,
you've lost track of its length and there isn't a null terminator to
help you find it...
And of course, without pointer arithmetic, array indexing is impossible
in the current Standard. You'd need to define an explicit exception for
your length-strings.

Richard
 

Richard Bos

Leor Zolman said:
Jacob: There's even a common term used to describe C++ when used only for
those features that are a direct "clean-up" of messiness left over from C's
need to be backward-compatible with earlier incarnations of itself: "A
Better C".

Common, and used, by whom? C++ programmers, I suspect. _My_ term for
such a hybrid monstrum would be "A Bloody Mess".

If you want C++, be a man and program in C++. Don't go pretend that
you're almost using C.

Richard
 

Dik T. Winter

> But remember where I said "so far, so good"? What you *should*
> have concluded, there, was that since the "Pascal-style" string
> model is so superior to the "C-style" string model, that wouldn't
> it be neat if somebody implemented a C compiler that used the
> Pascal model internally?!

Eh? The "Pascal-style"? In the only Pascal I ever used (on the CDC
Cyber, the original Pascal from ETH Zuerich), a string was implemented
as a sequence of characters, and that was it. Nearly the same as in C,
except that there was no terminator.
 

jacob navia

Richard Bos said:
jacob navia said:
As everybody knows, C uses a zero delimited unbounded
pointer for its representation of strings.

This is extremely inefficient because at each query of the
length of the string, the computer starts an unbounded
memory scan searching for a zero that ends the string.

A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

Possibly efficient in time, for certain programs, but almost certainly
inefficient in storage. The problem is two-pronged: either your strings
are large relative to your size_t, or they aren't. In the first case,
you will, sooner or later, run into a string that simply won't fit
inside a size_t.

In lcc-win32, a size_t can describe a string of up to 4GB.
I think that searching for the terminating zero in a 1GB string would not be very efficient anyway... :)

Ok, so you need a size_t that is large enough to contain every possible
string length on the system, large enough to address all of your memory.

Yes, normally size_t would do it.
But in that case, you'll hit the second case: now, you're using four
bytes, maybe eight, to address a single string, AOT C's one-byte null
terminator. Not a problem, perhaps, if all your strings are dozens of
kilobytes, but most applications seem to use lots and lots of small
strings, every single one potentially costing you seven bytes extra, and
only a few large ones.

There is no free lunch.
Yes, it will cost at least sizeof(size_t) bytes more than a zero-terminated string.
So what?
Many applications can afford this.
The length operation becomes just a memory read. This would considerably speed up programs. The basic idea is to use a string type that is length-prefixed and allows run-time checking against UB (undefined behavior).

How? What if I want a 30-char array, of which the first 20 to 29 chars
contain a string, and the last char contains a checksum? Modifying the
checksum wouldn't be undefined behaviour at all, but it would write
beyond data[length-1].

You can implement checksums in my scheme. Anyway, I am proposing string handling where normal zero-terminated strings are assumed.

Erm... no. No, it doesn't at all. Because, you see, "aaa" < "zzzzz", but
"zzz" < "aaaaa". Length has _nothing_ to do with the lexicographical
ordering of strings, except when you've already compared the contents of
the strings and found them identical up to the length of the shortest.

Please read carefully. I said
"When comparing strings for equality"

and not strcmp!!!

strcmp gives a lexicographical ordering as its result. That is NOT NEEDED when I just want to know whether a == b.
Syntactic sugar causes cancer of the semi-colon.
In great doses, YES. (See my answer as to why I do not use C++ in a parallel thread.)

In small doses sugar is useful.

It comes down to the dosage, you see?

It is a question of knowing when to stop.
Current strings are accessed using the [ ] notation. These strings could have the same privilege, couldn't they?

No, they couldn't. Puzzle for you: how would you extend the
interoperation between strings and char pointers, using your
length-strings?

I have started a library in lcc-win32 that does exactly that.
It can be done.

You can't point a char * inside one, because if you do,
you've lost track of its length and there isn't a null terminator to
help you find it...

All the strings are null terminated to keep interoperability
with existing code.
And of course, without pointer arithmetic, array indexing is impossible
in the current Standard. You'd need to define an explicit exception for
your length-strings.

Yes. I have done that.

jacob
 

Tor Husabø

Dik said:
Eh? The "Pascal-style"? In the only Pascal I ever used (on the CDC
Cyber, the original Pascal from ETH Zuerich), a string was implemented
as a sequence of characters, and that was it. Nearly the same as in C,
except that there was no terminator.

Meaning you had to keep track of the length in a separate variable?
Sounds unlikely...
 

jacob navia

Old Wolf said:
Pointers cannot be bounded or unbounded, as such.
String constants are represented by an array. C arrays
are bounded (but it is not required for these bounds
to ever be checked).

A bounded pointer has size/limits information associated with it. For
example:

fread(buffer,1,100,file);

the buffer pointer is implicitly bounded by the 1 and 100 arguments.

Or

int process(int datalength, char *data);

C library functions expect a pointer into an array
or some other piece of allocated memory, which contains
a zero-terminator. Calling a C library function with
anything else is a programming error.

Yes, I do not dispute that this is an error. My proposal is aimed at avoiding it.
It is inefficient to repeatedly calculate the length of the
string, as you say. However most of us are clever enough to
not do that.

Then you keep the length and the string in separate variables.
You have to name each string length that you use more than
once, and never mix them up. This is very error prone and
doesn't fit into structured programming.

Why do we use:

struct s {
    int age;
    bool sex;
    char *Name;
} Employee;

instead of

int ageEmployee1; bool sexEmployee1; char *NameEmployee1; ...???

Having a structure eases the way you program and avoids errors!

A more efficient representation is:

struct string {
    size_t length;
    char data[];
};

Efficient? Apart from requiring C99 (harder to get a compiler
for than C++) , it uses a lot more memory and is slower at
runtime because you have to do extra checks and updates every
time you modify the string.

This is important for security reasons. A length check is just 4 assembly
instructions at most!

Programs which check length a lot and do not do much else
with the string, possibly. This is a small subset of programs.

Sorry, but the length operation is used VERY often.
How slow and inefficient. I prefer to prevent UB by coding
correctly.

And you never make mistakes of course. You are Superman.
Also, how do you propose to check against UB
this way. Anyone can introduce UB by modifying the string
directly, unless you also enforce all modifications to it
to be via a library. (how slow).

Strings should not be modified directly. Very simple.
Maybe 0.0000000001 seconds slower, but much faster to
develop.
And if they were equal length you have wasted a comparison.

This is around 2-3 assembly instructions...
At 2 GHz it is 0.000000000something seconds.
This seems to be a problem with any string representation
(including C's builtin one)


How does this allow resizing without invalidating pointers?

This is a misunderstanding. I suppose you have pointers to the string, not to the middle of the data.
What bloat. The C library is big enough as it is.

I am sorry but there are functions for calculating the
complex square root but not for using a reasonable
string type.

The C string type is completely outdated. Yes, you can use it on small machines, but it is too ERROR PRONE!
I think you are proposing bounds-checked arrays. The standard
allows you to implement this already. Why not go and do it
and then advertise your compiler as supporting bounds checking.

Well, I am doing that, but the intent here is to make people aware that these problems should be solved in a general fashion.

I do not want to just add a "special" solution, but to see if we can solve a general problem in a collective way, i.e. through standards.
In fact, why not write a library as you have proposed, and package
it with your compiler. Then people can choose if it suits them or not.

Yes, I am doing that as a "proof of concept"
Not really, no. You seem to be making big assumptions about
the rest of the world's programming requirements.

Yes.
I assume that security is important, that extreme efficiency is not that important, and that you can spare sizeof(size_t) bytes for the length.

jacob
 
