managed string library


Old Wolf

Just *SAYING* this is the ultimate indictment of the C language. If
the language doesn't match your intuition, then it just takes that
much more effort to program in it.

You're confusing 'intuition' with 'expectation'. Windows users who
try to use an X-windows desktop, find it 'unintuitive' that the active
window is the one with the mouse over it (rather than the one they
last clicked on).

But this is not a matter of intuition, it is just a matter of
them expecting what they are used to. A person who has
not used either system before, would be fine either way.

Regarding the string issue, you would only expect strcat(p, p)
to double p if you had some sort of mental image of p as a
string object. But it is no such thing. It's an array of char. If
you don't understand this then you are not going to get far
with C programming.
 

websnarf

Andrew said:
If you start from src+1, you still have to store the original value of
src, or use a counter variable. My points on code-intensivity and ease
of reading still stand.

You typically read the source for your compiler's standard runtime
library? You seriously intuitively expect the source for your C lib
functions to be readable? They may be, but if your vendor is serious
about performance (fast strlens, strcpys, memcpys, and even strstr
implementations are quite convoluted), chances are they aren't really.
It's clever, I admit. However, jumping through hoops to manage odd
inputs which could be fixed with an
assert (dest > src + strlen(src) + 1);
doesn't bode well to me.

How does this *fix* anything? If src happens to be on an earlier
address (and not overlap at all) then what exactly is that assert
doing?
There are many reasons why I don't qualify as a "normal person" or
"average programmer"; my intuition agrees with that of strcat().

But you don't have a good justification for this. This intuition can't
have existed before you learned the C language, as there is no other
language or string model with that kind of problem. Meaning, that is
not really intuition.
I suspect that memmove() memcpy()'s the data to a safe place and
memcpy()'s it back to the dest. This adds an intermediate step
which prevents problems with src and dest overlapping.

(I would throw away any compiler that did that. I'm pretty sure there
are basically no C compilers that do anything like that. Interesting
that your "intuition" seems to have failed you here. You can always
pick between forward and backward copy depending on the order
of the pointers and memmove will work fine -- you can even then
continue to implement block tricks as is commonly done with memcpy().)
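
The forward-or-backward choice described above can be sketched as follows. This is only an illustration, not any vendor's actual memmove: the name sketch_memmove is made up, the copy is byte-at-a-time with none of the block tricks, and ordering the pointers through uintptr_t is an assumption that holds on common flat-memory platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a byte-at-a-time sketch of an overlap-safe copy. */
void *sketch_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert(n == 0 || (dest != NULL && src != NULL));

    /* Assumption: converting to uintptr_t gives an ordering that matches
       address order. The C standard does not promise this. */
    if ((uintptr_t)d < (uintptr_t)s) {
        /* dest is below src: copy forward, so every source byte is read
           before it can be overwritten. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* dest is above src: copy backward for the same reason. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do. */
    return dest;
}
```

The only cost over a plain forward copy is one comparison before the loop, which is the point being made about the penalty being small.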

Your *intuition* starts looking a little more like my first "alternate"
suggestion for how strcat might be implemented. So it looks like
you've rushed right over the fence to my side because ...
That doesn't sound particularly efficient to me, though; I assume
that compiler/library writers have found much better ways to code
it.

.... that's right! It's *NOT* efficient, because your idea was just
driven by your *intuition*. So if I simply tell you that I've
implemented strcat_allow_alias() doesn't your intuition end up matching
my first alternative strcat implementation idea?

Yet in *BOTH* cases, it turns out that serious implementations of an
aliasing safe strcat, and memmove do not cost significantly more than
the aliasing unsafe strcat or memcpy functions. So the C standard
decides to pay the penalty of mismatching against people's intuition in
order to save some trivial almost unmeasurable efficiency savings (and
TR24731 is of no help since it takes the exact same position as the
original Clib functions).
 

websnarf

David said:
...because it's specified that way. And memcpy() is not specified
that way. strcpy() is not specified that way, either.

Are you arguing for ignoring the function specifications?

You, of course, see the word "intuition" constantly being cited in all
the sentences written above right?
 

Default User

Keith said:
(e-mail address removed) writes:

First you wrote "Yeah, sieg heil!" in a recent thread in comp.lang.c,
and now you bring terrorists into a discussion of C strings.

Let's just ignore the troll.



Brian
 

Andrew Poelstra

That assert doesn't bode well. It invokes undefined behavior if dest
and src don't point into the same object (or just past the end of it).

Let's see... What I meant was:
assert ((dest - src) > 0 && (dest - src) < strlen (src));

Is that right?
 

Andrew Poelstra

How does this *fix* anything? If src happens to be on an earlier
address (and not overlap at all) then what exactly is that assert
doing?

There are a lot of things wrong with that code. I stand corrected.
But you don't have a good justification for this. This intuition can't
have existed before you learned the C language, as there is no other
language or string model with that kind of problem. Meaning, that is
not really intuition.

Actually, in my assembler days I used C-style strings, and I was a /lot/
more concerned with how they were implemented, given that my assembler
wasn't going to help me at all with overruns, etc.

C++ allows C-style strings. 0-termination is actually a very simple and
easy-to-understand way to represent strings. Knowing that C represents
strings that way, I can think of intuitive behavior for strcat().

Given another language with a different string representation, I'd
assume other behaviors, yes. Given another language with the /same/
string representation, my point still stands. This wasn't caused by
learning C; it was caused by learning C-style strings.
Your *intuition* starts looking a little more like my first "alternate"
suggestion for how strcat might be implemented. So it looks like
you've rushed right over the fence to my side because ...

memmove() is specified to handle overlapping memory boundaries; memcpy()
and strcat() are not. That much is in the Standard, and memmove()'s
specification overrides my intuitive assumption that one shouldn't pass
overlapping memory to the function.
 

websnarf

Keith said:
(e-mail address removed) writes:
Jonathan Leffler wrote:
(e-mail address removed) wrote:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).
What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.

That's not an intuitive meaning. It's just an understanding of an
implementation anomaly. Perhaps for you, implementation details
change your intuition.
[...]

In this case, intuition is not necessary.

Just *SAYING* this is the ultimate indictment of the C language. If
the language doesn't match your intuition, then it just takes that
much more effort to program in it.

News flash: C is not the most intuitive and beginner-friendly language
ever invented. Does this come as a surprise to you?

Huh? No, but apparently it's a surprise to Jonathan Leffler and Andrew
Poelstra. They seem to be arguing the case that it *does* match their
intuition.

My *original* contention, is that any proposal such as Richard
Seacord's managed string library should go ahead and pay the basically
0 penalty of actually *matching* intuition (as compared to the enormous
penalty he seems willing to pay for automatic character set filtering).
Anything else so far is just people's false projections about what I
said *beyond* this, and my responses to them.
In a language where strings are first-class objects, and you can pass
them around as values, use them as operands in expressions, and so
forth, I'd expect something called "strcat" to behave in some
reasonable intuitive manner.

Yeah, that's nice. C is basically the only example of a language of
its kind, yet you feel not the slightest problem with making sweeping
generalizations about it based on its properties. That's kind of like
saying that Joseph Lieberman can't be elected president of the US
because he's jewish. I mean, that's nonsense (he's a right wing
democrat, which means he has no serious base of support outside his
state) in the same way your idea here is nonsense.

First-class or not, strcat *CAN* be implemented as aliasing safe, at
very little cost. The fact that it isn't is a choice that was made;
it's nothing more than that. It's certainly not a *property* of
low-level languages (in assembly language, for example, there is no
assertion or expectation of being unable to deal with aliased "objects"),
or a property of the fact that it's not a first class value (bstrlib is
the obvious counter-example of this.)
[...] I'd still need to see the declaration to
know how to use it, but it would probably be safe to assume that
something like

s1 = strcat(s2, s3)

would do the obvious thing.

You mean if it were a first class value? But you are making a false
association here. There is no reason you cannot perform in-place
mutation of first class values. So the API could still have the same
basic functionality as the current strcat (i.e., two operands, and
modifying the destination.)
C is not like that. Strings are not a data type, they're a data
format, "a contiguous sequence of characters terminated by and
including the first null character", subject to all of C's
complications regarding arrays and pointers. If you think you can
guess, with 99% certainty, how strcat() is going to behave based on
that, you're likely to be disappointed.

These *complications*, as you suggest, have nothing to do with it. It's
all down to pure choice at the specification level. The guesses are
only wrong because the standard chooses that they should be wrong.
[...] If you read the standard's
description of strcat(), you'll see:

... If copying takes place between objects that overlap, the
behavior is undefined.

Any decent description of strcat() (in a man page

The latest cygwin man page makes no mention of this and WATCOM C/C++'s
documentation omits this.

The Cygwin man page doesn't mention this, but it's not intended to be
complete:

strcat is part of the libc library. The full documentation for
libc is maintained as a Texinfo manual. If info and libc are
properly installed at your site, the command

info libc

will give you access to the complete manual.

I'm not convinced that's a good idea, but it's explicitly acknowledged
with a reference to the complete documentation.

"info libc" doesn't work for me under Cygwin (I don't know why, but
the reason is clearly irrelevant),

It works on my system. info libc does nothing more than document the
standard include contents. info strcat just re-echos the man page.
[...] but on another system the section
on strcat clearly says:

This function has undefined results if the strings overlap.

I don't know about Watcom.

Well I just told you about Watcom, so now you do (it reads
substantially similar to the man pages). It's all downloadable from the
open watcom site if you care.
[...] or text book, for
example) should have similar wording; if it doesn't, that's the fault
of the author of the documentation.

Here's the first hit on google:

http://www.cplusplus.com/ref/cstring/strcat.html

and the second:

http://www.mkssoftware.com/docs/man3/strcat.3.asp

Here's the wikipedia entry as of 07/28/2006:

http://en.wikipedia.org/wiki/Strcat

and here's the Open BSD documentation that it links to:

http://www.openbsd.org/cgi-bin/man.cgi?query=strcat

So I guess none of that counts as "decent documentation".

I agree. I don't know what cplusplus.com is, and I'm not too
surprised by an error like this in Wikipedia (possibly someone here
will correct it soon). I am surprised that the OpenBSD documentation
doesn't mention this.

So there we have it. The standard for "decent" documentation as you
suggest appears to be quite high, and is certainly different from what
is commonly available.
[...] That's a problem -- but not a problem with C itself.

If you could just *stop* with the false projection for one second. You
know there is a reason why I quote other text when I post responses.
Then by all means feel free to go and use those languages. Nobody
here will stop you.

It's always this false choice with you. I have to completely throw out
my investment in learning this language because it makes a number of
idiotic decisions through nothing other than poor choices.

We're not even talking about what language I *USE* for whatever I am
doing. Remember, this thread started as a discussion about improvements
to the standard, and as I understand it, has reached the level of
serious official proposal. Citing other languages ought to be a
standard part of such a discussion without you pulling out this tired
old canard all the time.
 

websnarf

SuperKoko said:
Jonathan said:
(e-mail address removed) wrote:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).

What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.

That's not an intuitive meaning. It's just an understanding of an
implementation anomaly. Perhaps for you, implementation details
change your intuition.

Intuitive or not, that was not obvious for a beginner in C89, but that
should be obvious in C99, even for a beginner:
char* strcat (char * restrict, const char * restrict);

Thanks to "restrict", the function has a better documentation.

Yes, I am aware of this. So the documentation has moved into the
platform's header files. Anyways, are you aware of any university
program teaching C programming based on the C99 standard? Somehow I
doubt any significant percentage of novices are learning C via the C99
standard.
Andrew Poelstra:

But there are other ways to implement it....
Borland C++ 5.0 and Digital Mars Compiler use alternative
implementations (and they behave weird too, but in another way).

Well same with the solaris compiler (without the weird behavior). But
who's counting actual modern stuff?
 

SuperKoko

First-class or not, strcat *CAN* be implemented as aliasing safe, at
very little cost. The fact that it isn't is a choice that was made;
it's nothing more than that. It's certainly not a *property* of
low-level languages (in assembly language, for example, there is no
assertion or expectation of being unable to deal with aliased "objects"),
or a property of the fact that it's not a first class value (bstrlib is
the obvious counter-example of this.)
I agree.
In fact, I deem that functions should tend to abstract details.
The implementation details of strcat should not change artificially the
interface.

Making strcat work with aliasing would not cost much, and would
increase safety of C.

IMHO, it would improve the C standard with a quasi-zero cost:
1) Breaks no code
2) That's a library issue: easy to implement in any actual C
implementation.

The only tradeoff would be efficiency... But I think that it can be
implemented efficiently.
 

Keith Thompson

Andrew Poelstra said:
Let's see... What I meant was:
assert ((dest - src) > 0 && (dest - src) < strlen (src));

Is that right?

No. If dest and src don't point into the same object, both "dest > src"
and "dest - src" invoke undefined behavior.

Of course an implementation of a library function is free to use these
constructs if it knows how the compiler is going to treat them.
 

Richard Tobin

Andrew Poelstra said:
Let's see... What I meant was:
assert ((dest - src) > 0 && (dest - src) < strlen (src));

Is that right?

If pointers are not known to point to within the same object, you can
only compare them for equality. So the only way to achieve what you
want (within the letter of the law of the C standard) is to loop
through the addresses in the two objects, testing whether they are
equal. You could optimise that a bit and avoid testing all pairs.
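
A rough sketch of that equality-only check, with the "avoid testing all pairs" optimisation, might look like this. The name ranges_overlap and its signature are invented for illustration. It relies on the observation that two non-empty ranges overlap exactly when one of them starts inside the other, so two linear scans replace the quadratic pairwise test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if [a, a+alen) and [b, b+blen) share
   at least one byte, using only == on pointers. */
int ranges_overlap(const char *a, size_t alen, const char *b, size_t blen)
{
    assert(a != NULL && b != NULL);

    if (alen == 0 || blen == 0)
        return 0;

    /* Does range a start somewhere inside range b? */
    for (size_t i = 0; i < blen; i++)
        if (b + i == a)
            return 1;

    /* Does range b start somewhere inside range a? */
    for (size_t i = 0; i < alen; i++)
        if (a + i == b)
            return 1;

    return 0;
}
```

For a strcat-style check, the destination range would presumably run from dest for strlen(dest) + strlen(src) + 1 bytes and the source range from src for strlen(src) + 1 bytes. The scan is linear in the string lengths, which is a real cost compared with the relational comparisons a library implementation is free to use.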

-- Richard
 

Joe Wright

Andrew said:
Jonathan said:
(e-mail address removed) wrote:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).
What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.
That's not an intuitive meaning. It's just an understanding of an
implementation anomaly. Perhaps for you, implementation details
change your intuition.

No, knowledge that C strings are null-terminated (which any C programmer
needs to know) suggests that intuitively. Either you calculate strlen(),
add a counter variable, and `for' your way through the string, or you
eliminate the counter and superfluous call to strlen(), and code it
efficiently.

It's more intuitive to use the more efficient, less code-intensive, and
easier-to-read version.
Most people would intuitively think of this as simply replacing the
string with a doubled version of itself -- i.e., it's analogous to the
C++ expression p += p for std::string's (and to be honest, I don't know
if that's legal or not), or just p = p + p in most other programming
languages.

"Most people" are not C programmers; if you know enough to use strcat(),
you should have an understanding of how C strings work. (And indeed, I've
never seen a C textbook that introduced strcat() prior to introducing C-
style strings.) (Although I've heard of some pretty terrible textbooks on
this group that I was fortunate enough to avoid!)
You only *know* that this is not the case, because you know that strcat
is implemented as some variation of { d += strlen(d); while (*d++ =
*s++); } instead of { size_t ld = strlen(d), ls = strlen(s); memmove
(d+ld, s, ls); d[ld+ls] = '\0'; }. You know this because the first
variation is going to be faster. This is not intuition -- it's just a
technical calculation.

IMHO, _intuitively_, there is no other way to implement strcat().
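#include <string.h>

/* Append src to dst; handles only the fully aliased case dst == src. */
char *catstr(char *dst, char *src) {
    if (dst == src) {
        size_t i, siz = strlen(dst);
        /* Copy the original siz characters to just past themselves. */
        for (i = 0; i < siz; i++) {
            dst[siz + i] = dst[i];
        }
        dst[2 * siz] = '\0';
    } else {
        strcat(dst, src);
    }
    return dst;
}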
 

mithra

SuperKoko said:
Jonathan Leffler wrote:
(e-mail address removed) wrote:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).

What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.
...

This is computer *science*, not "getting in Touch with Your Feelings
101". Intuition only serves to guess at translating pseudocode to a
first pass at real code, looking up the function defs. in the process.

I use Python and Perl a lot, and I *always* keep a copy of their
function library definitions up when I code them. I don't rely on
intuition, and I have been programming since the 'dark ages'.
Imagination, inspiration, experience, research, experimentation, and
perhaps intuition as filler in the pseudo code until I can look it up.

Back in 1980 I wrote my own wrap-around for strcat() that checks sizes
and realloc() as needed. Obviously, not intuitively, I used it with
pointers as destinations, and for the few microseconds in computing
time that it cost me 20+ years ago I traded the security that
something untoward wouldn't have a customer calling me in the wee hours
of the morning. It has been in my personal library, along with a lot of
other string manipulation code, since then.
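
For readers wondering what such a wrapper could look like, here is a minimal sketch. It is not the 1980 code described above, which is not shown in this thread; the name grow_strcat, the separate capacity argument, and the error convention are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper. *dstp must be NULL or a malloc'ed string, and
   *capp its allocated size in bytes (0 when *dstp is NULL). Returns
   *dstp, or NULL on allocation failure, leaving *dstp untouched. */
char *grow_strcat(char **dstp, size_t *capp, const char *src)
{
    assert(dstp != NULL && capp != NULL && src != NULL);

    size_t dstlen = *dstp ? strlen(*dstp) : 0;
    size_t srclen = strlen(src);
    size_t need = dstlen + srclen + 1;   /* overflow check omitted */

    if (need > *capp) {
        /* Assumption: src does not point into *dstp here, because
           realloc may move the buffer and leave src dangling. */
        char *p = realloc(*dstp, need);
        if (p == NULL)
            return NULL;
        *dstp = p;
        *capp = need;
    }

    /* memmove rather than memcpy, so an aliased src is still handled
       when no reallocation was needed. */
    memmove(*dstp + dstlen, src, srclen + 1);
    return *dstp;
}
```

A caller would start with a NULL pointer and a capacity of 0, append as often as needed, and free() the buffer at the end.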

C acts as a portable assembly language extension. Keep that in mind and
save intuition for VB.


BTW, FreeBSD man page for strcat() offers:
...
The strcat() and strncat() functions append a copy of the null-terminated
string append to the end of the null-terminated string s, then add a
terminating `\0'. The string s must have sufficient space to hold the
result.

The strncat() function appends not more than count characters from
append, and then adds a terminating `\0'.
....
The strcat() function is easily misused in a manner which enables
malicious users to arbitrarily change a running program's functionality
through a buffer overflow attack. (See the FSA.)

Avoid using strcat(). Instead, use strncat() or strlcat() and ensure
that no more characters are copied to the destination buffer than it can
hold.
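
The strncat() advice is easy to get wrong, because its count is the number of characters to append and not the size of the destination buffer. A small sketch of one way to apply it follows; bounded_append is a made-up helper name and its return convention is an assumption. Note that strncat() has the same undefined behaviour as strcat() when the strings overlap.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: append src to dst, where dst is a buffer of
   dstsize bytes already holding a null-terminated string. Returns 1 if
   all of src fit, 0 if it was truncated. */
int bounded_append(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);

    size_t used = strlen(dst);
    assert(used < dstsize);

    /* strncat always writes a terminating '\0' after the appended
       characters, so one byte of the remaining space is reserved. */
    size_t room = dstsize - used - 1;
    strncat(dst, src, room);
    return strlen(src) <= room;
}
```
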

Curtis W. Rendon
 

Chris Torek

[than the usual version that self-destructs when you do
char buf[100] = "apple";

strcat(buf, buf); /* desired: appleapple\0 */
or
strcat(buf, buf + 2); /* desired: appleple\0 */
]

char *catstr(char *dst, char *src) {
if (dst == src) {
[rest snipped]

This handles the:

catstr(buf, buf);

case, but not the:

catstr(buf, buf + 2);

case. To handle both, perhaps something like:

/* returns strlen(result) */
size_t catstr(char *dst, const char *src) {
    size_t dstlen = strlen(dst), srclen = strlen(src);

    memmove(dst + dstlen, src, srclen + 1);
    return dstlen + srclen;
}

would be better. (Untested.)
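
Since the version above is marked untested, here is a small self-check for the two aliased calls it is meant to handle. The catstr body is copied from the post with an added assert; catstr_selftest is a hypothetical name introduced only for this sketch.

```c
#include <assert.h>
#include <string.h>

/* Copy of the memmove-based catstr from the post above.
   Returns strlen(result). */
size_t catstr(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);

    /* Both lengths are taken before anything is written, so the
       original terminator of an aliased src is still in place. */
    size_t dstlen = strlen(dst), srclen = strlen(src);

    memmove(dst + dstlen, src, srclen + 1);
    return dstlen + srclen;
}

/* Hypothetical self-check: returns 1 if both aliased calls produce the
   desired strings and lengths. */
int catstr_selftest(void)
{
    char a[100] = "apple";
    char b[100] = "apple";
    size_t la = catstr(a, a);       /* desired: appleapple */
    size_t lb = catstr(b, b + 2);   /* desired: appleple */

    return la == 10 && strcmp(a, "appleapple") == 0
        && lb == 8 && strcmp(b, "appleple") == 0;
}
```

Both cases work because memmove behaves as if the source bytes were first copied to a temporary area, so the terminator that gets overwritten in the destination has already been read.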
 

Joe Wright

Chris said:
[than the usual version that self-destructs when you do
char buf[100] = "apple";

strcat(buf, buf); /* desired: appleapple\0 */
or
strcat(buf, buf + 2); /* desired: appleple\0 */
]

char *catstr(char *dst, char *src) {
if (dst == src) {
[rest snipped]

This handles the:

catstr(buf, buf);

case, but not the:

catstr(buf, buf + 2);

case. To handle both, perhaps something like:

/* returns strlen(result) */
size_t catstr(char *dst, const char *src) {
    size_t dstlen = strlen(dst), srclen = strlen(src);

    memmove(dst + dstlen, src, srclen + 1);
    return dstlen + srclen;
}

would be better. (Untested.)

I wasn't aware that 'catstr(buf, buf + 2)' was at issue.
 

David R Tribble

Paul said:
You, of course, see the word "intuition" constantly being cited in all
the sentences written above right?

Yeah. Okay, my intuition tells me that memcpy() is going to be
implemented in the most efficient way to copy a block of bytes
to another block and ignore any issues of overlapping blocks,
because that's the way the function is specified.

My intuition tells me that memmove() is going to be implemented
so that overlapping blocks are copied correctly with a possible
loss of efficiency, because that's the way the function is specified.

My intuition tells me that strcat() is going to be implemented
so that one string is appended to another, using the '\0' characters
to signal the end of the strings, and ignoring the possibility of
overlapping strings (e.g., strcat(p,p)), because that's the way
the function is specified:

7.21.3.1
[...]
The strcat function appends a copy of the string pointed to
by s2 (including the terminating null character) to the end of
the string pointed to by s1. The initial character of s2 overwrites
the null character at the end of s1. If copying takes place
between objects that overlap, the behavior is undefined.

The phrases "overwrites the null character at the end of s1"
and "if the objects overlap the behavior is undefined" are a
pretty clear indication that 'strcat(p,p)' will invoke u.b. and
probably corrupt the resulting string value of p. At least that's
what my intuition tells me.

-drt
 

Douglas A. Gwyn

David said:
Yeah. Okay, my intuition tells me that memcpy() is going to be
implemented in the most efficient way to copy a block of bytes
to another block and ignore any issues of overlapping blocks,
because that's the way the function is specified.
My intuition tells me that memmove() is going to be implemented
so that overlapping blocks are copied correctly with a possible
loss of efficiency, because that's the way the function is specified.
My intuition tells me that strcat() is going to be implemented
so that one string is appended to another, using the '\0' characters
to signal the end of the strings, and ignoring the possibility of
overlapping strings (e.g., strcat(p,p)), because that's the way
the function is specified:
...
The phrases "overwrites the null character at the end of s1"
and "if the objects overlap the behavior is undefined" are a
pretty clear indication that 'strcat(p,p)' will invoke u.b. and
probably corrupt the resulting string value of p. At least that's
what my intuition tells me.

It comes down to expectations. If the programmer expects "char"
to mean "character" and "strcat" to mean "concatenate character
strings", then he is simply wrong. There is a relationship
between the concepts and the corresponding implementations, but
not an exact mapping between them, and it is the differences that
can cause trouble when the programmer's mental model is wrong.

C, C++, and many other languages do provide the *means* for a
programmer to create his own object types and support functions
that more closely fit whatever model he has. If the standard
library facility (which met legacy requirements sufficiently
well) doesn't meet your current requirements, don't (mis)use it;
provide your own implementation. (Be sure to choose new names.)
 

Jun Woong

Jonathan said:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).

What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.

I am also not sure what intuitive meaning was intended, but
thinking about memcpy vs. memmove, if the string copying functions had
been defined to copy the source string into an intermediate buffer before
putting it into the destination, there would be no crash. I suspect
that's the compelling intuitive meaning he intended.
 

James Dennett

Jonathan said:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).

What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.

The simpler expectation from the interface is "append
a copy of the string *currently* pointed to by p to p",
i.e., append it to itself.

Other languages that support this via notation such
as s+=s; or s = s+s implement it this way.

If you think of strcat in terms of its implementation
then your expectation is natural.

-- James
 

kuyper

James said:
Jonathan said:
[...] (so strcat(p,p) leads
to UB even though it has a compelling intuitive meaning).

What's the compelling intuitive meaning? To me, it means copy
characters from the start of p over the null that used to mark the end
of p and keep going until you crash.

The simpler expectation from the interface is "append
a copy of the string *currently* pointed to by p to p",
i.e., append it to itself.

Other languages that support this via notation such
as s+=s; or s = s+s implement it this way.

If you think of strcat in terms of its implementation

or, in terms of its specification by the standard,
 
