Strings in C are less optimal than in (say) Pascal - correct?


pemo

I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.



In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.



Therefore, string functions (the majority I can think of require some
'end searching') are necessarily slower than their Pascal counterparts.
Which seems to go against a bit of the C philosophy - speed is everything!



Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.



However, this would impose a limit on a string's length - but, surely, a
practical one!?



So, to the questions!



Are my assertions correct?

Would such an implementation break the C standard?

Does anyone know of an implementation that perhaps does something like this?
 

Antonio Contreras

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

Yes, that's true. However, this imposes certain overhead in other
operations. In Pascal every access to an element of an array is checked
to be inside the array range. IIRC Pascal arrays can have any type as
index (integral types, enumerated types...) and they need not be 0
based, so some sort of operation must be performed with the index value
to get the actual address. The C implementation is much more
straightforward.
In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

True, but I don't think this imposes a great overhead except in extreme
circumstances.
Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.
Which, seems to go against a bit of C philosophy - speed is everything!

I don't think the C philosophy is "speed is everything". I think it's
more something like "keep the language simple and assume the programmer
knows what (s)he is doing".
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.



However, this would impose a limit on a string's length - but, surely, a
practical one!?

I can think of many situations where you would need strings longer than
255 characters (assuming CHAR_BIT == 8).
So, to the questions!



Are my assertions correct?

Would such an implementation break the C standard?
Yes.

Does anyone know of an implementation that perhaps does something like this?

You could build your own abstract data type and a set of functions to
handle it if you feel like it, but I don't think it deserves the effort.
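Such an abstract data type might look something like this (a minimal
sketch; the struct and function names are invented for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* A counted-string ADT of the kind suggested above. The data stays
 * '\0'-terminated so it still interoperates with the str*() functions. */
struct cstring {
    size_t len;
    char *data;
};

/* Copy src into a freshly allocated counted string; returns 0 on success. */
static int cstring_init(struct cstring *cs, const char *src)
{
    cs->len = strlen(src);
    cs->data = malloc(cs->len + 1);
    if (cs->data == NULL)
        return -1;
    memcpy(cs->data, src, cs->len + 1);
    return 0;
}

/* O(1): no scan for '\0' needed. */
static size_t cstring_length(const struct cstring *cs)
{
    return cs->len;
}

static void cstring_free(struct cstring *cs)
{
    free(cs->data);
    cs->data = NULL;
    cs->len = 0;
}
```

The cost, of course, is that every mutating operation must keep len and
the terminator in sync.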
 

Targeur fou

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

Perhaps, I don't know.
In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Yes.


Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.
Which, seems to go against a bit of C philosophy - speed is everything!

Do you think this reason alone is sufficient to assert such a thing?
And do you think that speed was the main goal of the C designers?
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

Sure, it's UB.
[snip]

Would such an implementation break the C standard?

If you plan to break the current pointer arithmetic rules and rewrite a
big part of the libc...yes.
Does anyone know of an implementation that perhaps does something like this?

Use your own defined type, for instance a struct with an int member to
store the current string length and a char[] or char * member. Write
all the necessary supporting functions and you have your
implementation.

A+
Regis
 

pemo

Do you think this reason alone is sufficient to assert such a thing?
And do you think that speed was the main goal of the C designers?

I believe that speed *was* a goal - wasn't C originally used to write an
OS? One would assume that speed was certainly a goal in that case -
wouldn't one?

Of course, flexibility is there too. Take the case that C doesn't check
array bounds. One reason *I think* it doesn't is that such a check
takes time (not something one wants to readily sacrifice if one is sure
that an access is in range?). Also, it's quite possible that stepping
off the end or the beginning of an array is legitimate and useful -
e.g., if you're writing a memory manager, you actually know what's at
those locations, and might want to access them (arena headers etc.).
 

SM Ryan

#
#
# I believe that in languages like Pascal, the length of a string is encoded
# in some sort of 'header' data structure, and that if a programmer wants to
# know the length of such a string, the resultant discovery is therefore very
# fast.

Yes, we know. There are plenty of libraries available that use
counted strings and allow zero bytes in strings. But it's too
late to change the interpretation of "...." and str...() functions.
 

Rob Thorpe

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.
Yes.

In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.
Yes

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.
Which, seems to go against a bit of C philosophy - speed is everything!

Not generally. strlen() is slower. Most of the others iterate across
the array of chars looking for the '\0' character at the end, rather
than finding it with strlen() first.
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

No, it can't do that; '\0' must mark the end of a string.
However, this would impose a limit on a string's length - but, surely, a
practical one!?

Not on a modern machine. On a modern machine, a language that stores
the length of a string will do so using a full integer, either 32 or
64 bits, giving strings gigabytes of maximum length.

The only times when the C implementation is better (in terms of memory
consumption) are:
- very small embedded machines (16-bitters)
- situations with very many very short strings
So, to the questions!



Are my assertions correct?

Would such an implementation break the C standard?

Does anyone know of an implementation that perhaps does something like this?

Paul Hsieh has written an alternative string library that uses stored
lengths. Other libraries do similar things.

Many programs use stored lengths internally rather than using
c-strings.
 

Eric Sosman

pemo wrote on 11/03/05 09:39:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Yes, although some machines can use tricky methods to
avoid inspecting each character individually.
Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.

Faulty argument, I believe. Let's say you're looking
for the first '/' in a string. The fact that Pascal happens
to know how long the string is doesn't help it much: like C,
it must hunt through the entire string starting at the front.
(Again, with possible trickery to reduce the "obvious" number
of operations needed.)

On the other hand, some operations could (potentially) be
faster if the string length were cheap to compute. strrchr()
could search backwards from the end of the string; strstr()
might be able to use fancier algorithms, and so on.

The case of string concatenation is a little different.
The inefficiency of repeated strcat() doesn't really arise
from the way strings are represented, but from the fact that
it and strcpy() return a silly value. If they were to return
pointers to the end of the output string instead of to the
beginning (a datum the caller already knows), it would be
easy to use repeated strcpy() instead of repeated strcat()
and not incur the overhead of hunting for the ever-receding
end of the string.
Which, seems to go against a bit of C philosophy - speed is everything!

If that's your idea of C's philosophy, I hope you choose
another language or maybe even another career.
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

Maybe, maybe not; see above. You're also imposing a
significant penalty on small machines whose unsigned long
must actually be handled as a multi-precision integer: an
eight-bit machine, for example, would need to treat your
string length as at least a four-place number. Don't forget
the cost of keeping track of the length as the string is
altered: each transaction may be small, but there might be
a horrendous number of them.

Also, see below for my answer to your second question.
However, this would impose a limit on a string's length - but, surely, a
practical one!?

Well, one could always use a size_t.
So, to the questions!

Are my assertions correct?

Some are, some probably aren't, some are debatable.
Would such an implementation break the C standard?

Yes, obviously.

char *p = strchr("Hello, world!", ',');
size_t len = strlen(p);

A pointer to "the middle" of a string is a valid string
pointer.
Does anyone know of an implementation that perhaps does something like this?

No C implementation could do what you describe, because
the resulting language would not be C.
 

Netocrat

On Thu, 03 Nov 2005 14:39:42 +0000, pemo wrote:

[on speeding up strlen()]
Of course, if a C implementation chose to keep track of, and
store, a C string's length at, say, a negative offset, e.g. strlen(s)
== ((unsigned long int *)s)[-1], this would result in faster string
operations.
[...]

C has no string type. Strings are a concept (a sequence of characters
ending in '\0') that share a type with character arrays and character
pointers, so as far as I can see it is impossible for an implementation to
support your idea and be compliant with the standard.

The problem is that the starting address of the string is not the same
as the starting address of the underlying object. So neither
strcpy(deststr, srcstr) nor the simple assignment loop
while (*deststr++ = *srcstr++) ; will copy the string fully and
properly. It's not even clear when they would be required to copy the
prefix, since, as I said, a char array or pointer needn't be a string.
Also, a string can be shortened by incrementing a pointer to it, in
which case it is a string without the prefix you propose, and
attempting to write to that prefix would corrupt an unrelated part of
the string.

The second problem (which doesn't affect standards-compliance but
does affect the likelihood of implementation) is that the prefix length
must be synchronised with the location of the first '\0', and for the
implementation to keep track of this may well not be worth the complexity
- and might even be so runtime-intensive in some situations as to negate
the gain in strlen() speed.

The only ways I can see to support this are to introduce either a
string type or a string library with such support (and only manipulate
strings using that library's functions). There are a few such libraries
already in existence - e.g. Paul Hsieh markets his Better String
Library as performing this task while maintaining good compatibility
with current C strings.
 

pete

Netocrat said:
On Thu, 03 Nov 2005 14:39:42 +0000, pemo wrote:

[on speeding up strlen()]
Of course, if a C implementation chose to keep track of, and
store, a C string's length at, say, a negative offset,
e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.
[...]

C has no string type. Strings are a concept (a sequence of characters
ending in '\0') that share a type with character arrays and character
pointers, so as far as I can see it
is impossible for an implementation to
support your idea and be compliant with the standard.

The problem is that the starting address of the string
is not the same as
the starting address of the underlying object.

An object may hold several non-overlapping strings.

Each byte in a string is also the
first byte of a string.
 

Keith Thompson

Antonio Contreras said:
pemo wrote: [...]
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

However, this would impose a limit on a string's length - but, surely, a
practical one!?

I can think of many situations where you would need strings longer than
255 characters (assuming CHAR_BIT == 8).

pemo's proposed solution stores the length in an unsigned long int.
(Logically, it should be a size_t.)

It's not possible to replace C's concept of strings like this without
breaking existing code. For example, given
char *s = "hello, world";
s points to a valid string with the value "hello, world" -- and s+7
points to a valid string with the value "world". Proper alignment
of the length field could also be a problem.
 

Mark B

Let me be the first to disagree.

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore
very fast.
On the other hand, the initial stores take longer as this 'header' structure
needs to be filled in... no?
In C, functions like strlen() have to traverse the string in order to find
the null terminator.
Three functions, no? strlen(), strcat() and strrchr().
This requires a comparison of each char in the string to '\0' of course.

Therefore, string functions (the majority I can think of all require some
'end searching')
Other than the 3 I've mentioned, what others are required to 'search'
to the end?
are necessarily slower than their Pascal counterparts.
strcat() is not 'necessarily' slower... and may be faster (no encoding
of this header structure necessary)
Which, seems to go against a bit of C philosophy - speed is everything!

Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.
You would 'slow down' as many as you sped up...
 

Gordon Burditt

I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

And making tail substrings of a string is made MUCH slower, since
in order to, say, generate a substring with leading spaces stripped
off, you have to COPY the string rather than simply increment a
pointer. This can get very messy in trying to parse strings.
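For example, stripping leading whitespace in C is just a pointer
increment; the resulting suffix is itself a valid string and nothing is
copied (a sketch; the function name is invented):

```c
#include <ctype.h>

/* Return a pointer to the first non-whitespace character of s.
 * The returned pointer is a valid string sharing storage with s;
 * no allocation or copying takes place. */
static const char *skip_leading_space(const char *s)
{
    while (isspace((unsigned char)*s))
        s++;
    return s;
}
```

In a length-prefixed representation the same operation would require
either copying the suffix or introducing a separate slice type, since
the interior pointer has no length prefix of its own.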
In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.

There are a lot of things in C that are string functions that you
don't even think of as string functions. These can get much, much
slower.
Which, seems to go against a bit of C philosophy - speed is everything!

The philosophy of C is also not to do expensive operations (such
as malloc()) behind the programmer's back.
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

But something has to keep that offset updated.
However, this would impose a limit on a string's length - but, surely, a
practical one!?

Sure, just like the buffer passed to gets() is "long enough". NOT!!!
So, to the questions!

Are my assertions correct?

If you look at speeding up a small portion of the job at a potential
large cost of speed in the REST of the job, you can find something
that looks like a speed improvement which isn't.
Would such an implementation break the C standard?

Yes. Where do you expect the storage for your string length
to magically come from? C programs that statically allocate
string buffers would need to distinguish between char x[100]
and string x[100], which no existing program does.

char p[100];
char *cp;
strcpy(p, "Hello, world");

printf("%s\n", p+7); /* prints "world\n" */

for (cp = p; isalpha((unsigned char) *cp); cp++) {
;
}
*cp = '\0';
printf("%s\n", p); /* prints "Hello\n" */

How would your proposed implementation get the output results required?
Does anyone know of an implementation that perhaps does something like this?


Gordon L. Burditt
 

Antonio Contreras

Keith said:
Antonio Contreras said:
pemo wrote: [...]
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

However, this would impose a limit on a string's length - but, surely, a
practical one!?

I can think of many situations where you would need strings longer than
255 characters (assuming CHAR_BIT == 8).

pemo's proposed solution stores the length in an unsigned long int.
(Logically, it should be a size_t.)

Yep, I stand corrected. I forgot about the cast and assumed he meant
s[-1].

<snip to EOF>
 

Keith Thompson

Sure, just like the buffer passed to gets() is "long enough". NOT!!!

That shouldn't be a problem if the length is stored in a size_t.
There are problems with the proposal, but that's not one of them.
 

Netocrat

Netocrat said:
On Thu, 03 Nov 2005 14:39:42 +0000, pemo wrote:

[on speeding up strlen()]
Of course, if a C implementation chose to keep track of, and
store, a C string's length at, say, a negative offset,
e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.
[...]

C has no string type. Strings are a concept (a sequence of characters
ending in '\0') that share a type with character arrays and character
pointers, so as far as I can see it
is impossible for an implementation to
support your idea and be compliant with the standard.

The problem is that the starting address of the string
is not the same as
the starting address of the underlying object.

An object may hold several non-overlapping strings.

Each byte in a string is also the
first byte of a string.

I agree.

In this situation a "string" is really an aggregate object equating to the
unpadded:

struct {unsigned long len; char str[];} somestr;

The "starting address of the string" is somestr.str. The
"starting address of the underlying object" is &somestr.

The problem - and the point I was making - is that in this situation your
second statement cannot hold, because (void *)somestr.str != (void *)&somestr.
 

Rich Gibbs

pemo said the following, on 11/03/05 09:39:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

Yes, that's true, although remember that something has to keep that
prefix updated, so what Pascal gives with one hand it may take away with
the other.

In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.
Which, seems to go against a bit of C philosophy - speed is everything!

I don't know that I think that "speed is everything" is the C
philosophy. I would say being simple and staying "close to the machine"
were also important.

Also, you are (I think) implicitly assuming that looking for the end of
the string requires some operation roughly like a 'for' loop in C, which
is not the case. Some machines have special instructions to facilitate
this sort of search. IBM mainframes, for example, have since System/360
had a "translate and test" instruction, which could be used to find the
first occurrence of a character in a string, at the cost of setting up a
256-byte table (trivial for C strings since you'd only do it once per
execution). Admittedly, I don't think it was originally meant for that
purpose (I think it was there to facilitate the insertion of a floating
'$' in currency amounts), and it was probably implemented in microcode.
But it still was not a software loop, and hence was probably faster.
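At the C level, the portable way to express such a bounded scan is
memchr, which library implementations are free to accelerate with
exactly this kind of hardware support (a sketch; bounded_strlen is an
invented name, essentially what POSIX calls strnlen):

```c
#include <stddef.h>
#include <string.h>

/* strlen expressed as a bounded scan for '\0'. The memchr call is a
 * single library primitive, so an implementation can use word-at-a-time
 * tricks or special scan instructions underneath rather than a
 * byte-by-byte software loop. */
static size_t bounded_strlen(const char *s, size_t maxlen)
{
    const char *nul = memchr(s, '\0', maxlen);
    return nul ? (size_t)(nul - s) : maxlen;
}
```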
Of course, if a C implementation chose to keep track of, and store, a
C string's length at, say, a negative offset, e.g. strlen(s) ==
((unsigned long int *)s)[-1], this would result in faster string
operations.

Yes, but as has been pointed out elsewhere in the thread, it breaks
other things (e.g., s+1 no longer points to a valid string).
However, this would impose a limit on a string's length - but, surely, a
[snip]
 

Skarmander

Gordon said:
And making tail substrings of a string is made MUCH slower, since
in order to, say, generate a substring with leading spaces stripped
off, you have to COPY the string rather than simply increment a
pointer. This can get very messy in trying to parse strings.

What you're doing on a conceptual level when you're "simply incrementing
a pointer" (if we accept strings as a distinct data type, which is the
point of the exercise) is generating a new string that is a slice of the
existing string. The slice and original happen to share memory.

In a length-based system, this sort of operation would be accommodated
by different means: the pointer would become an index into a string, or
index-based operations are encapsulated by introducing an explicit type
for string slices. A naive implementation would indeed be much slower
in these cases -- but then, so would a naive implementation of repeated
string concatenation in a terminator system.
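A string slice of that kind can be sketched in C as an explicit
(pointer, length) view into an existing string (names invented for
illustration):

```c
#include <stddef.h>

/* A non-owning view into someone else's string: the length-based
 * analogue of an interior pointer. The slice shares memory with the
 * original and carries its own length instead of relying on '\0'. */
struct slice {
    const char *ptr;
    size_t len;
};

/* Take len characters of s starting at index start. The caller is
 * responsible for keeping start and len within bounds. */
static struct slice slice_of(const char *s, size_t start, size_t len)
{
    struct slice sl = { s + start, len };
    return sl;
}
```

The trade-off is the one described above: the view is cheap to make,
but it is no longer directly usable with functions that expect a
'\0'-terminated string.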
There are a lot of things in C that are string functions that you
don't even think of as string functions. These can get much, much
slower.
You're probably thinking of applications like the above: manipulating
virtual strings through pointers. This system can be built on top of
length-based strings -- at a cost, but a constant one.

Sure, just like the buffer passed to gets() is "long enough". NOT!!!

Well, store the length in a size_t and it will certainly be "long
enough". Either that or you'll run out of addressable memory. If a 4G
long string just isn't long enough, how the length of your strings is
indicated will be a minor problem in comparison.

And it's a tad ironic that gets() would be cited as an example. In a
system with length-based strings, gets() might crash your application
with an out-of-memory, but you couldn't use it for a buffer overflow attack.
If you look at speeding up a small portion of the job at a potential
large cost of speed in the REST of the job, you can find something
that looks like a speed improvement which isn't.
And you could also find something that looks like a speed improvement
which is, through the 90/10 rule. Use the right tool for the right job.
Sometimes you *will* get definite improvements by keeping track of
length. At other times, you won't.

The systems are interchangeable in that you can always explicitly keep
track of length yourself in a terminator system (at a cost), and you can
always use indices to simulate pointers in a length-based system (at a
cost). C's choice for zero-terminated strings makes sense in light of
its philosophy; so does Pascal's (and most object-oriented languages')
choice of length-based strings.

Of course, endless flame wars can^H^H^H rage over what system ought to
be picked as the basis... Asking which one is "optimal" is sure to
provoke lots of factless debate, and I'm not sure the unqualified
question has merit to begin with.

S.
 

Dik T. Winter

> I believe that in languages like Pascal, the length of a string is encoded
> in some sort of 'header' data structure, and that if a programmer wants to
> know the length of such a string, the resultant discovery is therefore very
> fast.

As already answered by many, yes.
> Of course, if a C implementation chose to keep track of, and store, a
> C string's length at, say, a negative offset, e.g. strlen(s) ==
> ((unsigned long int *)s)[-1], this would result in faster string
> operations. ....
> Would such an implementation break the C standard?

Yes, it is seriously against the standard. How about:
char *s = "0123456789abcdef";
char *t = s + 10;
in C both s and t point to strings.
 

Dennis Willson

While I know you're talking about 'C', one of the advantages of 'C++'
is the ability to create your own String object that has any of the
attributes you want. You would still have to implement your own
versions of any of the standard library functions you intended to use
with that object, but they could also be part of the object itself.

However, there are only a few 'C' functions that I can think of that
would see any speed increase from knowing the string length in
advance.

OK, now the 'C' purest can beat me up...

Dennis
 

Martin Ambuhl

Dennis Willson top-posted, for reasons known only to him, writing:
While I know you're talking about 'C', one of the advantages of 'C++' is
the ability to create your own String object that has any of those
attributes you want.

This is trivially done in C. If you are going to post advocacy for C++
in newsgroups where C++ is off-topic, you should at least not give the
completely false impression that C is not perfectly capable of handling
self-defined String objects.

[...]
OK, now the 'C' purest can beat me up...

The word is "purist," and any minimally competent C programmer is
capable of creating the type and functions you stupidly think that C++
is needed for.
 
