Strings in C are less optimal than in (say) Pascal - correct?


Richard Tobin

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.

Many string functions require traversing the string, or the relevant
part of it, anyway (strcmp, strchr, strcat). And in some cases the
length is already known (e.g. operations after mallocing()). So I
doubt that this inefficiency is as great as you think.

There are other disadvantages to counted strings. The overhead of
the count itself, of course, and the fact that strings can't share
or be modified in place (strtok).

Of course, if a C implementation chose to perhaps keep track of, and store a
C string's length at perhaps a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1]; this would result in faster string
operations.

*Some* faster operations. And it would break almost all existing
programs (any that change the string length by moving the nul manually
of course, but also any that use memcpy to copy them, or allocate space
for them with malloc, etc).

You can of course define your own structures of this kind (without
even the struct hack in C99) and functions to operate on them.
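
A minimal sketch of such a structure, using a C99 flexible array member
(the type and function names here are invented purely for illustration):

#include <stdlib.h>
#include <string.h>

/* A counted string: the length travels with the characters. */
struct cstring {
    size_t len;
    char data[];                    /* C99 flexible array member */
};

/* Build a counted string from an ordinary '\0'-terminated string. */
struct cstring *cstring_new(const char *s)
{
    size_t len = strlen(s);
    struct cstring *cs = malloc(sizeof *cs + len + 1);
    if (cs != NULL) {
        cs->len = len;
        memcpy(cs->data, s, len + 1);   /* copy the '\0' too */
    }
    return cs;
}

/* O(1) length query, unlike strlen(). */
size_t cstring_len(const struct cstring *cs)
{
    return cs->len;
}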

-- Richard
 

Dennis Willson

Martin said:
Dennis Willson top-posted, for reasons known only to him, writing:
While I know you're talking about 'C', one of the advantages of 'C++' is
the ability to create your own String object that has any of those
attributes you want.


This is trivially done in C. If you are going to post advocacy for C++
in newsgroups where C++ is off-topic, you should at least not give the
completely false impression that C is not perfectly capable of handling
self-defined String objects.

[...]

Except C doesn't allow for function overloading, therefore you cannot "add"
to the string function library so that the new functions look and feel the
same but just use the new String object instead of char * (for example).
Yes, you can implement string functions in C that do the same thing, but you
would have to call them something else.
The word is "purist," and any minimally competent C programmer is
Sorry, but I believe you know what I meant anyway.
capable of creating the type and functions you stupidly think that C++
is needed for.
Overloading is done easier in C++, I don't advocate one language over another because I like one better, I use the correct language
for the job at hand. I program in many different languages for that reason.

With your attitude, I couldn't take anything YOU recommend without
verifying that it isn't just an emotional response instead of a factual one.
 

Keith Thompson

Dennis Willson said:
While I know you're talking about 'C', one of the advantages of 'C++'
is the ability to create your own String object that has any of
those attributes you want. You would still have to implement your
own versions of any of the standard library functions you intended
to use with that object, but they could also be part of the object
itself.
[snip]

I think everyone here who cares already knows that.
 

Keith Thompson

Many string functions require traversing the string, or the relevant
part of it, anyway (strcmp, strchr, strcat). And in some cases the
length is already known (e.g. operations after mallocing()). So I
doubt that this inefficiency is as great as you think.

I fully agree with the basic point, but strcat() isn't a good example.
strcat() has to scan to find the end of the first argument before it
can start appending characters from the second. With counted strings,
it could jump directly to the end of the first argument.

(Quite often you can do this yourself by keeping track of the end of the
string manually.)
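
A rough sketch of that manual bookkeeping (the helper name append_at is
invented here; it assumes the destination buffer is large enough):

#include <string.h>

/* Append src at 'end', which points at the current terminating '\0',
   and return the new end so nothing ever has to be rescanned. */
char *append_at(char *end, const char *src)
{
    size_t n = strlen(src);
    memcpy(end, src, n + 1);    /* copies the '\0' as well */
    return end + n;
}

So instead of repeated strcat(buf, ...) calls, one would initialize
end = buf + strlen(buf) once and then write end = append_at(end, "...")
for each piece.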
 

Christian Bau

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.

Of course, if a C implementation chose to perhaps keep track of, and store a
C string's length at perhaps a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1]; this would result in faster string
operations.

*Some* faster operations. And it would break almost all existing
programs (any that change the string length by moving the nul manually
of course, but also any that use memcpy to copy them, or allocate space
for them with malloc, etc).

It would be a nice challenge for a compiler to keep track of knowledge
of C strings and their lengths behind the scenes, especially when
standard library functions are used.

Example:

char* concat3 (char* string1, char* string2, char* string3) {
    char* result = malloc (strlen (string1)
                           + strlen (string2) + strlen (string3) + 1);
    strcpy (result, string1);
    strcat (result, string2);
    strcat (result, string3);
    return result;
}

The compiler could keep track of the three string lengths, and change
the calls to strcpy and strcat to calls to memcpy.
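
Done by hand, that transformation might look roughly like this (a sketch
only, assuming <stdlib.h> and <string.h> and minimal error handling):

char *concat3 (char *string1, char *string2, char *string3) {
    size_t len1 = strlen (string1);
    size_t len2 = strlen (string2);
    size_t len3 = strlen (string3);
    char *result = malloc (len1 + len2 + len3 + 1);
    if (result != NULL) {
        memcpy (result, string1, len1);
        memcpy (result + len1, string2, len2);
        memcpy (result + len1 + len2, string3, len3 + 1);  /* with '\0' */
    }
    return result;
}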
 

Richard Tobin

Keith Thompson said:
I fully agree with the basic point, but strcat() isn't a good example.

Nor was strcmp(), since often it would be a big win to compare lengths
first.

-- Richard
 

Keith Thompson

Nor was strcmp(), since often it would be a big win to compare lengths
first.

Not really; strcmp() still has to return <0, 0, or >0 depending on a
comparison of the first non-matching character.

On the other hand, if the compiler wants to optimize a call like
    if (strcmp(a, b) != 0) { ... }
then knowing the lengths differ would be helpful.
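
With counted strings, that equality-only test could be sketched as follows
(counted_equal is a hypothetical helper, not a standard function):

#include <stdbool.h>
#include <string.h>

/* Equality only: a length mismatch settles the question without
   examining a single character. */
bool counted_equal(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}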
 

Martin Ambuhl

Dennis said:
Except C doesn't allow for function overloading,
[etc.]

Overloading operators is a sin for which BS should never be forgiven.
The overloading of normally arithmetic operators for non-arithmetic
types always looks to have an "obvious" meaning to the original
implementor. He is almost always wrong.

Trolling with your advocacy of C++ by pointing to one of the flaws of
C++ as a virtue marks you as someone who should be writing "that bug is
a feature" documentation for Microsoft.
 

Keith Thompson

Martin Ambuhl said:
Dennis said:
Except C doesn't allow for function overloading,
[etc.]

Overloading operators is a sin for which BS should never be
forgiven. The overloading of normally arithmetic operators for
non-arithmetic types always looks to have an "obvious" meaning to the
original implementor. He is almost always wrong.

I was going to express an opinion, but I thought better of it.
I'll just mention that either advocating or flaming C++ is off-topic.
 

icub3d

Keith Thompson:
I was going to express an opinion, but I thought better of it.
I'll just mention that either advocating or flaming C++ is off-topic.

Ahh, some sense to this topic. :) It's at times like this that I always
wonder why we can't all just get along. You do yours, I'll do mine.
 

websnarf

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

Yes, but most Pascal implementations I am aware of have a string length
limitation of 255 characters.

In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Correct. If code contains strlen(), strcat(), or other "traverse to
compute length" operations, you know it is definitely suboptimal. I.e.,
if someone claims some code is written for performance and they are
calling one of those functions in the critical path, you know they are
wrong.

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.

Yes, that's certainly true for anything that includes string length
computations.

Which seems to go against a bit of C philosophy - speed is everything!

Since when has C's philosophy been speed is everything? C was designed
by AT&T hackers specifically for the purpose of making UNIX -- it
didn't really have any other design purpose. UNIX is not an operating
system with any significant focus on strings, and it should not be
surprising that C's poor performance on strings has no effect on UNIX's
performance.

Of course, if a C implementation chose to perhaps keep track of, and store a
C string's length at perhaps a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1]; this would result in faster string
operations.

This is not going to be helpful when used in conjunction with pointer
arithmetic. What do you expect to happen here:

    strcat (dest + prefix, src);  /* Suppose we know strlen(dest) > prefix in length. */

if prefix is just some offset that we know is in the middle of dest?
If you write the length into the bytes just before dest + prefix, it will
overwrite contents that dest points to.

However, this would impose a limit on a string's length - but, surely, a
practical one!?

Well, even the '\0' delimited strings are limited in size (essentially
by SIZE_MAX, which may actually one day be defined on your platform.)
On real platforms, int is easily large enough to hold any practical
string. You could choose size_t as the length type if you really want to
use every ounce of memory assignable to a single object in C.

Would such an implementation break the C standard?

Yes. None of the pointer arithmetic could work the same as it
currently does.

Does anyone know of an implementation that perhaps does something like this?

Try "The Better String Library" (http://bstring.sf.net/). It uses
length prefix semantics, and derives a large average performance
advantage because of it, just as you suspect. But there are other
things in the library that enhance performance. You can see benchmark
data here: http://bstring.sf.net/features.html#benchmarks . Hopefully
these benchmarks should suggest to you that "speed is everything" is
not a philosophy that has been actually carried out in the C language.

A common complaint with other string libraries which are not char *
based (and sometimes erroneously cited as a reason to avoid Bstrlib) is
that they don't support pointer arithmetic, and therefore lose out on
some performance tricks because of it. Bstrlib actually supports a
feature superior to pointer arithmetic, which you could call "segment
arithmetic". Basically, with pointer arithmetic, you can generate a
quick reference to any tail of any string, which you can use as a
parameter to the C string library functions. With Bstrlib, you can
reference any string segment (head, tail, or any interior segment) and
use it as a source string parameter to any Bstrlib function. I.e.,
hacked-up tricks that you could pull with C strings are now well-defined
activities that are more generalized in Bstrlib. To try to pull the same
trick in C, you would have to (even if temporarily) modify the source
string (inserting '\0's), which means that you could only have one set
of non-nested substrings active at once (there is no
such limitation in Bstrlib).
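
The '\0'-insertion trick being alluded to looks roughly like this in plain
C (a sketch with an invented helper name; it only works on a writable
buffer, and only one such segment can be live at a time):

#include <stdio.h>
#include <string.h>

/* Use characters [start, end) of a writable buffer as a temporary
   C string by saving and restoring the byte at 'end'. */
void print_segment(char *buf, size_t start, size_t end)
{
    char saved = buf[end];
    buf[end] = '\0';                 /* temporarily cut the string here */
    printf("%s\n", buf + start);     /* the segment now reads as a C string */
    buf[end] = saved;                /* undo the modification */
}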

Bstrlib also comes with functions like biseq(), which compares a pair of
strings but does not compute their ASCII ordering. This allows the
massive performance accelerator of first checking for length
difference. (It also checks for content aliasing; something that could
be done in standard C libraries, but generally is not). The
architecture of C's '\0' terminated strings means this performance
enhancement simply is not available to it.

Oh yeah, and it doesn't break the ANSI C standard -- it defines a new
string type, and a new complement of functions for it.

There are other features having to do with functionality, ease of use,
interoperability with char * strings, semantic portability, and safety
that make Bstrlib a compelling library for string manipulations as
well, which you can read about at the website.
 

websnarf

Dennis said:
While I know you're talking about 'C', one of the advantages of 'C++' is
the ability to create your own String object that has any of those
attributes you want. You would still have to implement your own
versions of any of the standard library functions you intended
to use with that object, but they could also be part of the object
itself.

C++ comes with its own general-usage performance problems (like the
fact that pre-increment is faster than post-increment in C++ -- there are
similar problems with constructs like a = b + a being slower than
a += b), which may be disappointing to the O.P., who seemed interested in
performance.

However there are only a few 'C' functions that I can think of that would see any speed
increase by knowing the String length in advance.

Uh, yeah, but they are amongst the very slowest functions, they are
commonly used, and the speed up is enormous. That's kind of like
saying improving C's malloc function would only improve one function
(even C++ users would care about that!).
 

jacob navia

pemo said:
I believe that in languages like Pascal, the length of a string is encoded
in some sort of 'header' data structure, and that if a programmer wants to
know the length of such a string, the resultant discovery is therefore very
fast.

In C, functions like strlen() have to traverse the string in order to find
the null terminator. This requires a comparison of each char in the string
to '\0' of course.

Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.
Which seems to go against a bit of C philosophy - speed is everything!

Of course, if a C implementation chose to perhaps keep track of, and store a
C string's length at perhaps a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1]; this would result in faster string
operations.

However, this would impose a limit on a string's length - but, surely, a
practical one!?

So, to the questions!
Are my assertions correct?

Would such an implementation break the C standard?

Does anyone know of an implementation that perhaps does something like this?

Yes, if you download lcc-win32 you will find an implementation of
strings using length prefixed strings. They are treated just like
normal strings, and a replacement for the string functions in the C
library is provided. The library source is distributed with lcc-win32.

http://www.cs.virginia.edu/~lcc-win32
 

jacob navia

Richard said:
Therefore, string functions (the majority I can think of all require some
'end searching') are necessarily slower than their Pascal counterparts.


Many string functions require traversing the string, or the relevant
part of it, anyway (strcmp, strchr, strcat). And in some cases the
length is already known (e.g. operations after mallocing()). So I
doubt that this inefficiency is as great as you think.

There are other disadvantages to counted strings. The overhead of
the count itself, of course, and the fact that strings can't share
or be modified in place (strtok).

Of course, if a C implementation chose to perhaps keep track of, and store a
C string's length at perhaps a negative offset, e.g. strlen of s =
((unsigned long int *)s)[-1]; this would result in faster string
operations.


*Some* faster operations. And it would break almost all existing
programs (any that change the string length by moving the nul manually
of course, but also any that use memcpy to copy them, or allocate space
for them with malloc, etc).

You can of course define your own structures of this kind (without
even the struct hack in C99) and functions to operate on them.

-- Richard

If you mean
struct mystring {
    size_t len;
    char data[];
};

This is NOT a "hack", it is STANDARD C.

jacob
 

Keith Thompson

jacob navia said:
pemo said:
I believe that in languages like Pascal, the length of a string is
encoded in some sort of 'header' data structure, and that if a
programmer wants to know the length of such a string, the resultant
discovery is therefore very fast. [...]
Does anyone know of an implementation that perhaps does something
like this?

Yes, if you download lcc-win32 you will find an implementation of
strings using length prefixed strings. They are treated just like
normal strings, and a replacement for the string functions in the C
library is provided. The library source is distributed with lcc-win32.

http://www.cs.virginia.edu/~lcc-win32

Can this implementation be used with compilers other than lcc-win32?
 

Keith Thompson

jacob navia said:
Richard Tobin wrote: [...]
You can of course define your own structures of this kind (without
even the struct hack in C99) and functions to operate on them.
-- Richard

If you mean
struct mystring {
    size_t len;
    char data[];
};

This is NOT a "hack", it is STANDARD C.

Yes, it's standard C99, based on the not-quite-standard "struct hack"
commonly used in C90. (The C99 standard's index even has an entry for
"struct hack", a cross-reference to "flexible array member".)

But I think what Richard meant is that in C99, you don't need to use
the struct hack, which doesn't contradict what you wrote.

(Just because it's standard doesn't necessarily imply it's not a hack,
but that's beside the point.)
 

Richard Tobin

If you mean
struct mystring {
    size_t len;
    char data[];
};

This is NOT a "hack", it is STANDARD C.

Er yes, that's why I said you don't need the struct hack in C99.

The struct hack would have something like "char data[1]".
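
For comparison, a rough sketch of the two forms and how each is typically
allocated (the names are invented; error checking omitted):

#include <stddef.h>
#include <stdlib.h>

/* Pre-C99 "struct hack": declare a 1-element array and over-allocate. */
struct hack_string {
    size_t len;
    char data[1];
};

/* C99 flexible array member: the standardized replacement. */
struct fam_string {
    size_t len;
    char data[];
};

/* Each allocates room for n characters plus a terminating '\0'. */
struct hack_string *hack_alloc(size_t n)
{
    return malloc(offsetof(struct hack_string, data) + n + 1);
}

struct fam_string *fam_alloc(size_t n)
{
    return malloc(sizeof(struct fam_string) + n + 1);
}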

-- Richard
 
