managed string library

D

Douglas A. Gwyn

James said:
And, to be fair to Paul, I've seen my share of C code that something
like:
strcpy(x, x + 1);
sprintf(x "%s%d", x, foo);
...under similar ignorance of the C API specifications.

Indeed, there is a much broader issue than the mere specification
of the one function "strcat". As another poster pointed out, in
C such "array" parameters are actually pointer parameters, so the
array objects are essentially passed "by reference", not by the
object values; *in any PL* this would raise the question of what
happens when the references are aliases. Many PLs simply outlaw
programs that contain such aliasing. C allows the aliasing,
except when the programmer has used "restrict" qualification or
when the pointed-to types are incompatible, and it is then up to
the API designer to decide what the semantics are going to be.
In the case of strcat, the official specification clearly
indicates that aliased arguments are not allowed, via "restrict"
qualifiers on the parameters in the function declaration.
 
S

SuperKoko

David said:
You're missing his point. Once you've been steeped in C lore, of course
you know what strcat() does. Any good C programmer does, once they've
learned the language. But learning the language, in this case, consists
of unlearning your intuition about what it means to concatenate strings.
The natural intuition about what strcat() should do would be that it
concatenates strings ("str" "cat", get it?). And there is a natural
definition of what it means to concatenate strings. Unfortunately,
C's strcat() does not use that natural definition; it does not match
the natural intuition you might have before you were exposed to C.
That's a problem, because it means that learning C requires unlearning
your intuition. Unlearning an old intuition and learning a new one
is twice as hard as learning something that you never had any prior
intuition about. Strcat() is just one example of this kind of phenomenom;
it occurs in many funny places in the C language. The folks here are
so steeped in C that they've probably forgotten what it was like to
originally learn C and be surprised by some of its oddball semantics.
Those oddball semantics were justified in the days of PDP-11 where CPU
performance was more important than making the programmer's life easier.
Today, those design choices are debateable.

What do you think that strcat should do?
Intuitively, for someone who has never used C but only other languages,
the unintuitive thing about strcat is not that strcat(p,p) doesn't
work, but that strcat("hello ", "world!") doesn't work!

Intuitively for a newbie, strcat(x,y) doesn't modify y nor x and
returns a new string...

Obviously, this semantic shall not change in next C standards (though,
higher level string functions might be added).

Now, more seriously.
Consider a C programmer who has just learnt that strcat doesn't do what
he expected (i.e. he now understands that there the dest string must be
large enough to contain new characters, etc.).
In order to learn that he read the manual... And I do think that any
decent manual should warn the reader than dest and src must not be
aliased.
I just read the man page (in French) and that's mentioned. Good.
This programmer should have no real problem.
If you read a C99 manual, that would even be more obvious (thanks to
the restrict keyword).
Moreover, C99 programmers are very acustomed to the "restrict" keyword,
and know that aliasing might fail anywhere if they don't check the
manual.


But I fear that there are many Bad Manuals that don't specify it... (I
would like a confirmation, since I'm not sure).

Now, a second question:
For an intermediate C programmer who knows C enough to understand the
design principles of the C library... Does the expression strcat(p,p)
seems dubious enough to him that he checks the manual as soon as he
wants to use it?
The answer is not obvious... I think that many programmer would.... But
perhaps some programmers would not... In the latter case, it's really
harmful.
If that latter case is frequent enough, it might be worth revisiting
the specification of strcat

Now, I think that there is a far more unintuitive semantic in the C
standard library, that may produce bugs in the code of advanced
programmers... I think it's unintuitive even for an experienced
programmer who think to know well the language.

That's the semantic of memcpy and memove if their third argument is 0.
It requires that pointer point to objects, implying that:
memcpy(dest, NULL, 0);
Has undefined behavior (though, it doesn't crash on implementations I
know).

1) I've read the man page of my Linux distrib, and that thing is NOT
specified in it! :(
I fear that most docs don't talk about that issue at all.
2) If the standard specified that this is valid, and do nothing,
implementations would have only either a zero-overhead on
non-obfuscated platforms, or a minor overhead on those obfuscated
platforms, namely, a conditional test:

if (n>0) {
/* do the stuff */
}
Instead of:
/* do the stuff */

This overhead is almost negligible on platforms where memcpy and memove
are not inlined (and, memcpy and memove are not often inlined, at least
with implementations I know).
This overhead might be non-negligible on platforms where memcpy and
memove are inlined, though it's not huge at all, and programmers are
acustomed to the fact that calling memcpy or memove on very small
memory blocks has an inertial non-negligible overhead and program in a
way that doesn't decrease greatly performances if memcpy or memove have
this inertial overhead.

My point is that, specifying that memcpy(dest, NULL, 0) is ok, would be
almost negligible on all existing platforms, and would have a
zero-overhead on many common platforms.

Now, allowing aliasing in strcat is probably far less benefitial, since
it's far less counter-intuitive for intermediate & advanced
programmers, very uncommon (code such as memcpy(dest,
ptr_that_might_be_null, size_that_might_be_zero) is not uncommon).

One might think that it's hard to write an efficient implementation of
strcat accepting pointer aliasing.
But there seems to be a portable, efficient implementation:

char * safestrcat (char * dst, const char * src) {
if (*src) {
char * dend = dst + strlen (dst);
strcpy (dend + 1, src + 1);
*dend = *src;
}
return dst;
}


Now tell me this cannot be translated to an optimzed solution on any platform.
It seems good to me. There seems to be no aliasing problem, and there
seems to be no overhead.

In that case, it might be worth considering either adding a safestrcat
function to the standard in the spirit of memcpy vs memove, or update
the specification of strcat (probably the simpliest solution) and
optionally provide a no_alias_strcat.
 
K

kuyper

SuperKoko wrote:
....
One might think that it's hard to write an efficient implementation of
strcat accepting pointer aliasing.
But there seems to be a portable, efficient implementation:


It seems good to me. There seems to be no aliasing problem, ...

Not quite. I strongly suspect that anyone who's surprised by the fact
that strcat(p,p) has undefined behavior is likely to be equally
surprised by the fact that the following code also has undefined
behavior:

char name[] = "Paul Hsieh";
char* space = strchr(name, ' ');
if(space)
{
*space = '\0'; // Split into two parts at space character.
safestrcat(name, space+1); // Merge first and second part of name.
}
 
D

David Wagner

SuperKoko said:
And I do think that any decent manual should warn the reader than
dest and src must not be aliased.

Of course, a good manual should mention all required preconditions.

But my suggestion is that it would have been even better still for
the semantics of strcat() to have been designed so that no such restriction
is needed. Whenever you introduce restrictions like that, they have costs.
They add some mental burden to the programmer, who has to remember whether
or not strcat() can be safely used in the presence of aliasing or not.
And they introduce an opportunity for error.

As Mr. Hsieh has pointed out, it is possible to have a cleaner semantics
for strcat(), to eliminate this opportunity for error, and all while
maintaining performance that is as good or better than current implementations
of strcat(). I consider that a pretty serious criticism.

This is not just about Good Manuals vs Bad Manuals. This is about
Good Design of library interfaces.
For an intermediate C programmer who knows C enough to understand the
design principles of the C library... Does the expression strcat(p,p)
seems dubious enough to him that he checks the manual as soon as he
wants to use it?
The answer is not obvious... I think that many programmer would.... But
perhaps some programmers would not... In the latter case, it's really
harmful.
If that latter case is frequent enough, it might be worth revisiting
the specification of strcat

Yes, of course some programmers may fail to double-check the manual,
perhaps because they don't realize the need or because they misremembered
what the manual said. As long as the number of programmers who fail
to check the manual is non-zero, there is some risk here. Why would we
want to take on unnecessary risks without any compensating benefits?
That's the semantic of memcpy and memove if their third argument is 0.
It requires that pointer point to objects, implying that:
memcpy(dest, NULL, 0);
Has undefined behavior (though, it doesn't crash on implementations I
know).

Interesting, I didn't know that. Thanks for pointing that out.
Now, allowing aliasing in strcat is probably far less benefitial, [...]

The claim is not that the design of strcat() is somehow catastrophic.
On its own, it's a minor irritant -- very minor. But this is just one
example. If this is one example of a larger phenomenom, then the sum
of many minor irritants can become a major problem.
One might think that it's hard to write an efficient implementation of
strcat accepting pointer aliasing.
But there seems to be a portable, efficient implementation: [...]

Yes, as Mr. Hsieh already stated.
 
A

av

Re: meaning of strcat(p,p)

char* strcat(char* a, char* b)
{char *p, *h;
if(a==0 || b==0) return 0;
for(p=a; *p; ++p) ;
h=p
if(a==b) {for(; b<h; ) *p++=*b++;
*p=0;
return a;
}
else while(b!=a && *p++=*b++);
return b==a ? 0: a;
}

not tested...
0 for error
and "strcat(p,p);" is ok (if p has right mem space)
don't know for strcat(p, p+4); or strcat(p+4, p);
 
D

Douglas A. Gwyn

SuperKoko said:
Intuitively, for someone who has never used C but only other languages,
the unintuitive thing about strcat is not that strcat(p,p) doesn't
work, but that strcat("hello ", "world!") doesn't work!
Intuitively for a newbie, strcat(x,y) doesn't modify y nor x and
returns a new string...

Good points.
... Does the expression strcat(p,p) seems dubious enough to him
that he checks the manual as soon as he wants to use it?

It would to a moderately experienced C programmer. Actually this
is something a C99 compiler could flag, since the prototype has
restrict qualifiers on its pointer parameters.
My point is that, specifying that memcpy(dest, NULL, 0) is ok, would be
almost negligible on all existing platforms, and would have a
zero-overhead on many common platforms.

The C standard does explicitly allow null pointer arguments for
a few functions, where a good case was made for programming
convenience. However, for a lot of cases such as memcpy, it is
not likely that a correct program would even try to invoke the
function with a null pointer, so even if one wanted to extend
the existing semantics to give meaning to that, it isn't clear
that one is doing the programmer any favor. Far better if the
function *traps* (assert fails, etc.) for a detectable invalid
argument, that that it silently provides some artificial
semantics that is probably not what the program was meant to do.

On the rare occasions when it makes sense for an algorithm to
have somehow created a null pointer value at that point in the
code, it is easy enough for the caller of memcpy to do his own
explicit test, rather than depending on memcpy to do it. That
is also more general, in that the programmer gets to pick the
appropriate semantics.

There are implementations of C library functions that do detect
many instances of improper pointer values, etc. The standard
does not require that, because even a single extra instruction
per function invocation can be considered unacceptable when
multiplied by all calls of every function for every program for
every platform, especially when there is *no value at all* for
correctly-written programs.
 
S

SuperKoko

SuperKoko wrote:
...
One might think that it's hard to write an efficient implementation of
strcat accepting pointer aliasing.
But there seems to be a portable, efficient implementation:


It seems good to me. There seems to be no aliasing problem, ...

Not quite. I strongly suspect that anyone who's surprised by the fact
that strcat(p,p) has undefined behavior is likely to be equally
surprised by the fact that the following code also has undefined
behavior:

char name[] = "Paul Hsieh";
char* space = strchr(name, ' ');
if(space)
{
*space = '\0'; // Split into two parts at space character.
safestrcat(name, space+1); // Merge first and second part of name.
}
Well, but this issue (I think), only concerns very beginner programmers
(it might be interesting to add a new safe & easy to use string API,
but it would be a distinct API, of course).
Moreover, such a programmer shall see that his programmer doesn't work
well, and shall have to check the spec.

Anyway, the main aim, here, is to prevent intermediate programmers from
accidentally aliasing pointers when they don't know well the strcat
specification.
Of course, a good manual should mention all required preconditions.

But my suggestion is that it would have been even better still for
the semantics of strcat() to have been designed so that no such restriction
is needed. Whenever you introduce restrictions like that, they have costs.
They add some mental burden to the programmer, who has to remember whether
or not strcat() can be safely used in the presence of aliasing or not.
And they introduce an opportunity for error.
I agree.


As Mr. Hsieh has pointed out, it is possible to have a cleaner semantics
for strcat(), to eliminate this opportunity for error, and all while
maintaining performance that is as good or better than current implementations
of strcat(). I consider that a pretty serious criticism.

Well, I'll look at the benefits/tradeoffs of specifying that:

Benefits:
1) Easier to remember the spec, especially for intermediate
programmers.
2) Less many bugs in intermediate programmer's code, because of their
ignorance.

Tradeoffs:
1) None I see.

Tradeoffs that are NOT present:
1) Compatibility problem:
It's perfectly compatible with the current standard
2) Performance problems:
No apparent performance tradeoff, even on obfuscated machines (the
solution given by Heisch is portable).

How the transition would be done:
Actually compilers effectively produce bad behaviors with aliasing, so,
in a first time, compilers would integrate the new behavior in their
libraries
In a second time, programmers would be able to effectively, reliably,
use strcat(p,p) on all platforms.
In an intermediate time, programmers would have to remember the old
specification of strcat, since new compliant implementations would not
yet be available on all platforms.
That's not a big deal especially because such advanced programmers
would avoid aliasing everywhere at least for a long time.
The transition will probably not be hard.

So, I guess you've entirely convinced me.
That's not a major modification, but a minor modification with non
negligible (though not huge) benefits and no noticeable tradeoff.

Yes, of course some programmers may fail to double-check the manual,
perhaps because they don't realize the need or because they misremembered
what the manual said. As long as the number of programmers who fail
to check the manual is non-zero, there is some risk here. Why would we
want to take on unnecessary risks without any compensating benefits?

Yes. My point was that, if tradeoffs where really non negligible (but I
think they're negligible or equal to zero), it would worth (but now,
it's no more necessary) to count effectively if this number of
programmers is equal to 1/1000000 or 1/10000 or 1/1000 or 1/100 or
1/10.. It's probably somewhere between a few percents and 1/1000. I
don't know the exact number.
 
W

websnarf

SuperKoko said:
SuperKoko wrote:
...
One might think that it's hard to write an efficient implementation of
strcat accepting pointer aliasing.
But there seems to be a portable, efficient implementation:

(e-mail address removed) wrote:
char * safestrcat (char * dst, const char * src) {
if (*src) {
char * dend = dst + strlen (dst);
strcpy (dend + 1, src + 1);
*dend = *src;
}
return dst;
}


Now tell me this cannot be translated to an optimzed solution on any platform.

It seems good to me. There seems to be no aliasing problem, ...

Not quite. I strongly suspect that anyone who's surprised by the fact
that strcat(p,p) has undefined behavior is likely to be equally
surprised by the fact that the following code also has undefined
behavior:

char name[] = "Paul Hsieh";
char* space = strchr(name, ' ');
if(space)
{
*space = '\0'; // Split into two parts at space character.
safestrcat(name, space+1); // Merge first and second part of name.
}
Well, but this issue (I think), only concerns very beginner programmers

Its a concern for programmers who do it, see that it works fine, and
are none the wiser.

Obviously I introduced strcpy into the equation, forgetting that that
guy has *technically* weak aliasing semantics too. Clearly I am
intending the "obvious implementation" of strcpy (noticing the irony
here) which would never have an aliasing problem in that situation. So
let me accept a correction and also require that it be written on top
of semisafestrcpy() which, however its implemented (but 99% of existing
implementations already satisfy this criteria), does not fail so long
as the destination pointer doesn't alias to the same object *after* the
source parameter (something that is basically true today, even if the
spec doesn't require it). In such circumstances, the safestrcat()
function would work as advertised.

Keep in mind, that even using block techniques, hardware string
instructions, etc, there is no issue. Its only really twisted
implementations with non-linear memory where there can be a real
screw-up here.
(it might be interesting to add a new safe & easy to use string API,
but it would be a distinct API, of course).

This was the point of TR 24731 and "Managed Strings". (I've responded
claiming that they both failed to meet the standard of either safe or
easy, by missing aliasing and other issues.) Bstrlib exists and
although I have not submitted it as a proposal to the ANSI folks (there
are technical and philosophical reasons for not doing so) it should at
the very least serve as a model by which any other similar extensions
are judged. By his own admission, Richard Seacord says he didn't even
look at Bstrlib. I'm sorry, but at this point its not just a matter of
my personal ego. Bstrlib, by now, has enough propogation that it would
*have* to be an obvious thing to at least check for ideas before going
off to write your own string library.
Moreover, such a programmer shall see that his programmer doesn't work
well, and shall have to check the spec.

Actually, it will work just fine on nearly any existing system. Its
not technically well defined, but the fact that it works anyways should
tell us something.
Anyway, the main aim, here, is to prevent intermediate programmers from
accidentally aliasing pointers when they don't know well the strcat
specification.

Well there are other aims -- it reduces what you need to understand to
be able to use the library properly, and expands the functionality to
include anything that is reasonably possible.
I agree.


Well, I'll look at the benefits/tradeoffs of specifying that:

Benefits:
1) Easier to remember the spec, especially for intermediate
programmers.
2) Less many bugs in intermediate programmer's code, because of their
ignorance.

You forgot:

3) Increases the actual functionality of the function (how else am I
supposed to concatenate a string or some tail of it to itself?)

The current standard simply takes functionality *away* from us.
Tradeoffs:
1) None I see.

Actually the ANSI C committee would have to swallow some pride or
something. That cost is apparently extremely high.
Tradeoffs that are NOT present:
1) Compatibility problem:
It's perfectly compatible with the current standard

Correct. Implementations today can *already* solve the problem if they
feel like it. But there is the ego/pride cost for the ANSI C
committee. So my main point is that the problem can and should be
addressed in TR 24731 and the "Managed String" proposals, where the
analysis is the same. (Notice that MS went on various excursions to
solve all sorts of multi-threading issues in TR 24731 which are
technically irrelevant to the scope of the ANSI standard -- and yet
they ignore aliasing (beyond sticking "restrict" everywhere) which is
an existing problem for all platforms.)
2) Performance problems:
No apparent performance tradeoff, even on obfuscated machines (the
solution given by Heisch is portable).

How the transition would be done:
Actually compilers effectively produce bad behaviors with aliasing, so,
in a first time, compilers would integrate the new behavior in their
libraries
In a second time, programmers would be able to effectively, reliably,
use strcat(p,p) on all platforms.
In an intermediate time, programmers would have to remember the old
specification of strcat, since new compliant implementations would not
yet be available on all platforms.
That's not a big deal especially because such advanced programmers
would avoid aliasing everywhere at least for a long time.
The transition will probably not be hard.

Oh no -- the damage in the real strcat() has been done. We can't
actually save that, since many compilers just will not be updated
(Microsoft's comments about C99 are very telling.) The new APIs (TR
24731 or Managed Strings) are really the only place where there is any
hope. Users of Bstrlib don't need to worry about any of this of
course.
So, I guess you've entirely convinced me.
That's not a major modification, but a minor modification with non
negligible (though not huge) benefits and no noticeable tradeoff.

The C library is filled with those, BTW.

Sure. Now think about more experienced programmers who simply would
like to concatenate the tail of a string to the end of it. Should we
just throw our hands up? Give up any hope of hardware specific
acceleration and do things by hand (if I care about portability, for
example) and write a char-by-char loop? I guess we should compute the
length, and do things with memmove -- funny that, perhaps Bstrlib might
be worth taking a look at afterall.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,226
Latest member
KristanTal

Latest Threads

Top