Degenerate strcmp

F

fishpond

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp. But it seems to me that this could create
an inconsistency in the degenerate case when s1 points to memory that is
not null-terminated, i.e. by some freak chance, all of the memory from
s1 till the computer reaches the end of all its memory pages (however
that works) don't contain a single null byte. In this case, strcmp
should not say that s1 and s2 are "equal strings" since neither is
actually a string (because not null terminated).

Is my thinking correct?
 
P

pete

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp.
But it seems to me that this could create
an inconsistency in the degenerate case when s1 points
to memory that is
not null-terminated, i.e. by some freak chance, all of the memory from
s1 till the computer reaches the end of all its memory pages (however
that works) don't contain a single null byte. In this case, strcmp
should not say that s1 and s2 are "equal strings" since neither is
actually a string (because not null terminated).

Is my thinking correct?

No.
The behavior of strcmp is only defined for
cases when s1 and s2 both point to strings.
If it's not null terminated, it's not a string.

In cases where the behavior of the code is not defined,
the running program can do whatever it wants.
That's the rules of the C programming language.
 
F

fishpond

No.
The behavior of strcmp is only defined for
cases when s1 and s2 both point to strings.
If it's not null terminated, it's not a string.

Your right that a string has to be null terminated, but for random
memory maybe by chance there just isn't any null byte.

For example, for the program

main() { printf("%d\n",strlen(malloc(0))); }

this will print out a random number, but in principal the strlen call
might never terminate, if the memory at the pointer returned by malloc
doesn't have any null bytes. (OK, it's very unlikely, but it could
happen in theory...)
 
U

user923005

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Adding an if() test for that is not (in general) a good idea.
A missed branch prediction is expensive.
How often are you really going to do this:
if (strcmp(p,p)==0) call_captain_obvious();
A library function with a quirk like that would make me worry about
the quality of the implementation.
Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp. But it seems to me that this could create
an inconsistency in the degenerate case when s1 points to memory that is
not null-terminated, i.e. by some freak chance, all of the memory from
s1 till the computer reaches the end of all its memory pages (however
that works) don't contain a single null byte. In this case, strcmp
should not say that s1 and s2 are "equal strings" since neither is
actually a string (because not null terminated).

Is my thinking correct?

It is undefined behavior in any case to call strcmp() with addresses
that are not null terminated strings.

The actual joke goes:
Q: What does an agnostic, insomniac, dyslexic person do?
A: He lays awake at night, wondering if there is a dog.
 
P

pete

user923005 said:
Adding an if() test for that is not (in general) a good idea.
A missed branch prediction is expensive.
How often are you really going to do this:
if (strcmp(p,p)==0) call_captain_obvious();
A library function with a quirk like that would make me worry about
the quality of the implementation.

Anybody who writes code to compare string p with string p,
isn't in a rush.
And that's one of the reasons why I think that it's usually bad
to optimize the degenerate special case
at any cost at all to the general case.
 
E

Eric Sosman

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp. But it seems to me that this could create
an inconsistency in the degenerate case when s1 points to memory that is
not null-terminated, i.e. by some freak chance, all of the memory from
s1 till the computer reaches the end of all its memory pages (however
that works) don't contain a single null byte. In this case, strcmp
should not say that s1 and s2 are "equal strings" since neither is
actually a string (because not null terminated).

Is my thinking correct?

What you seem to have missed is that there is no "correct"
behavior in the case you describe: The behavior is undefined
because the arguments are not strings. Returning zero is one
possible behavior, a SIGSEGV is another, a graphic of a nasal
demon whistling "Dixie" while riding backwards on a bicycle
is yet another.
 
E

Eric Sosman

user923005 said:
Adding an if() test for that is not (in general) a good idea.
A missed branch prediction is expensive.
How often are you really going to do this:
if (strcmp(p,p)==0) call_captain_obvious();
A library function with a quirk like that would make me worry about
the quality of the implementation.

I tried to use strcmp(p, p) as a deliberate time-waster
once. I wanted to study the behavior of a sorting function
with "fast" and "slow" user-supplied comparators: the "slow"
one was strcmp(p, p) followed by a call to the "fast" one.
My program adjusted the length of the string at p until I got
a ten-to-one speed ratio.

This worked fine on several systems, but alas! I ran
across one where my setup code kept making p longer and longer
without slowing anything down -- and it turned out that strcmp
was returning zero immediately, as described.

BUT: Was this a stupid test? I don't think so. The
strcmp implementation made a bunch of (highly non-portable)
tests to decide whether it could replace a byte-by-byte loop
with a loop that took bigger, er, bites: two, four, or even
eight at a time. The decision was based on the alignments
of the two operands -- and the "both operands equal" case just
fell out of the alignment tests.

Eventually, I arranged for p to consist entirely of 'X'
and called strcmp(p, p+1), proceeding to call the "fast" method
if ("if") strcmp didn't return zero. Works like a charm.
 
C

CBFalconer

user923005 said:
.... snip ...

It is undefined behavior in any case to call strcmp() with addresses
that are not null terminated strings.

That is a bar to relying on any particular behaviour in such a
circumstance, not a bar to defining the result. Thus the following
should be a satisfactory system implementation of strcmp():

int strcmp(const char *_s1, const char *_s2) {
if (_s1 == _s2) return 0;
else
while (*_s1 && (*_s1 == *_s2)) {
_s1++; _s2++;
}
return (*_s1 > *_s2) - (*_s1 < *_s2);
} /* strcmp */
 
R

Richard Heathfield

user923005 said:
The actual joke goes:
Q: What does an agnostic, insomniac, dyslexic person do?
A: He lays awake at night, wondering if there is a dog.

Can we lay off the dyslexia jokes, please? They're not clever, and
they're not furry.
 
U

user923005

user923005 said:
On Aug 17, 3:47 pm, (e-mail address removed) wrote:
The actual joke goes:
Q: What does an agnostic, insomniac, dyslexic person do?
A: He lays awake at night, wondering if there is a dog.

Can we lay off the dyslexia jokes, please? They're not clever, and
they're not furry.

Being dyslexic myself, I can tell you that permutations are likely
errors, but substitution of r's for n's is improbable.
There are some substitutions that might occur, but they are likely to
be b, p, d, q which are (after all) the same letter.
In my case it has the strange effect that mirrored writing is very
easy for me to read -- in fact too easy.

I have had to train myself about glass doors to stores. If it says
'PULL' on it but it is written on the other side of the door, anyone
else would instantly see that it is reversed and would push instead of
pull. If I am not thinking hard about it, I am quite likely to read
"PULL" and give it a good yank.
 
U

user923005

I tried to use strcmp(p, p) as a deliberate time-waster
once. I wanted to study the behavior of a sorting function
with "fast" and "slow" user-supplied comparators: the "slow"
one was strcmp(p, p) followed by a call to the "fast" one.
My program adjusted the length of the string at p until I got
a ten-to-one speed ratio.

This worked fine on several systems, but alas! I ran
across one where my setup code kept making p longer and longer
without slowing anything down -- and it turned out that strcmp
was returning zero immediately, as described.

BUT: Was this a stupid test? I don't think so. The
strcmp implementation made a bunch of (highly non-portable)
tests to decide whether it could replace a byte-by-byte loop
with a loop that took bigger, er, bites: two, four, or even
eight at a time. The decision was based on the alignments
of the two operands -- and the "both operands equal" case just
fell out of the alignment tests.

Eventually, I arranged for p to consist entirely of 'X'
and called strcmp(p, p+1), proceeding to call the "fast" method
if ("if") strcmp didn't return zero. Works like a charm.

Turns out that the possible missed branch prediction is
inconsequential (after all, it happens only once):

int testing_strcmp(const char *_s1, const char *_s2)
{
if (_s1 == _s2)
return 0;
else
while (*_s1 && (*_s1 == *_s2)) {
_s1++;
_s2++;
}
return (*_s1 > *_s2) - (*_s1 < *_s2);
} /* strcmp */

int non_testing_strcmp(const char *_s1, const char *_s2)
{
while (*_s1 && (*_s1 == *_s2)) {
_s1++;
_s2++;
}
return (*_s1 > *_s2) - (*_s1 < *_s2);
} /* strcmp */

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

char s1[32767];
char s2[32767];
void cmptest1(FILE * f, int (*cmp) (const char *, const
char *))
{
while (fgets(s1, sizeof s1, f)) {
if (fgets(s2, sizeof s2, f)) {
(void) cmp(s1, s2);
} else
break;
}
}

int main(int argc, char **argv)
{
clock_t end, start;
if (argc <= 1) {
puts("USAGE: strcmptest <filename>");
exit(EXIT_FAILURE);
}
{
FILE *f = fopen(argv[1], "rt");
if (f == NULL) {
printf("ERROR: Failed to open file %s\n", argv[1]);
exit(EXIT_FAILURE);
}
start = clock();
cmptest1(f, testing_strcmp);
end = clock();
printf("With testing strcmp time is %f seconds\n", (end -
start)*1.0/CLOCKS_PER_SEC);
rewind(f);
start = clock();
cmptest1(f, non_testing_strcmp);
end = clock();
printf("With non-testing strcmp time is %f seconds\n", (end -
start)*1.0/CLOCKS_PER_SEC);
}
return 0;
}

/*
C:\tmp>dir b.txt
Volume in drive C has no label.
Volume Serial Number is 0890-87CA

Directory of C:\tmp

08/18/2007 12:34 AM 65,969,707 b.txt
1 File(s) 65,969,707 bytes
0 Dir(s) 1,602,658,304 bytes free

C:\tmp>strcmptest b.txt
With testing strcmp time is 0.859000 seconds
With non-testing strcmp time is 0.859000 seconds

C:\tmp>strcmptest b.txt
With testing strcmp time is 0.859000 seconds
With non-testing strcmp time is 0.875000 seconds

C:\tmp>strcmptest b.txt
With testing strcmp time is 0.859000 seconds
With non-testing strcmp time is 0.859000 seconds

C:\tmp>strcmptest b.txt
With testing strcmp time is 0.859000 seconds
With non-testing strcmp time is 0.859000 seconds

C:\tmp>strcmptest b.txt
With testing strcmp time is 0.875000 seconds
With non-testing strcmp time is 0.843000 seconds
*/
 
A

Alexei A. Frounze

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp. But it seems to me that this could create
an inconsistency in the degenerate case when s1 points to memory that is
not null-terminated, i.e. by some freak chance, all of the memory from
s1 till the computer reaches the end of all its memory pages (however
that works) don't contain a single null byte. In this case, strcmp
should not say that s1 and s2 are "equal strings" since neither is
actually a string (because not null terminated).

Is my thinking correct?

Indefinitely searching for '\0' will do no good. In the simplest case
it may either cause some sort of memory protection exception or hang
the system if the address wrap around is permitted. If the access goes
where memory mapped devices are, it may be worse and may even damage
the hardware. In these cases strcmp() may not return at all and so I'm
not sure if talking about consistency is any meaningful.

Alex
 
A

Antoninus Twink

What you seem to have missed is that there is no "correct"
behavior in the case you describe: The behavior is undefined
because the arguments are not strings. Returning zero is one
possible behavior, a SIGSEGV is another, a graphic of a nasal
demon whistling "Dixie" while riding backwards on a bicycle
is yet another.

I think the subtle point is the following: a char * isn't actually the
same thing as a string. A char * is a pointer to some bytes of memory,
but is s is a char * then for s to be a string, we need the sequence
*s, *(s+1), *(s+2), ..., *(s+i), ... to actually contain a 0 byte for
some i. In practice memory will have 0 bytes all over the place, but
there's still a theoretical possibility that there won't be zero byte
for any i until the memory space is completely exhausted.

Maybe the program I put in the other thread

main() { printf("%d\n",strlen(malloc(0))); }

illustrates this more simply than strcmp: malloc(0) returns a pointer to
some random place in memory, and there's no absolute guarantee that a
0-byte will occur later in memory, so what gets printed could be a
random number or in theory the program could just never terminate.

Part of the confusion seems to be the names: for example, strlen takes a
char * and returns an int. If the parameter is a string, then the
integer is the length of the string and that makes perfect sense. But
what strlen actually takes is a general char *, not necessarily a
string, and if you pass strlen a char * that isn't a string then you
need to think more carefully about how to interpret the return value of
strlen (or strlen might not terminate at all).
 
J

J. J. Farrell

I think the subtle point is the following: a char * isn't actually the
same thing as a string. A char * is a pointer to some bytes of memory,
but is s is a char * then for s to be a string, we need the sequence
*s, *(s+1), *(s+2), ..., *(s+i), ... to actually contain a 0 byte for
some i. In practice memory will have 0 bytes all over the place, but
there's still a theoretical possibility that there won't be zero byte
for any i until the memory space is completely exhausted.

What is "subtle" about this? It's just the definition of a string, and
it's very simple.
Maybe the program I put in the other thread

main() { printf("%d\n",strlen(malloc(0))); }

illustrates this more simply than strcmp: malloc(0) returns a pointer to
some random place in memory, and there's no absolute guarantee that a
0-byte will occur later in memory, so what gets printed could be a
random number or in theory the program could just never terminate.

I don't understand your point. You seem to be working hard to tell us
that a string is an array of chars up to and including the first zero-
valued character. Of course it is, since that's what it's defined to
be. If there is no zero-valued character in the array, then the array
doesn't contain a string.
Part of the confusion seems to be the names: for example, strlen takes a
char * and returns an int. If the parameter is a string, then the
integer is the length of the string and that makes perfect sense. But
what strlen actually takes is a general char *, not necessarily a
string, and if you pass strlen a char * that isn't a string then you
need to think more carefully about how to interpret the return value of
strlen (or strlen might not terminate at all).

You don't need to be careful about anything if you do this, since you
cannot predict how the system will behave. What you need to be careful
about is not doing this in the first place. strlen() takes a string;
it is your responsibility to ensure that you only ever give it a
string; if you give it anything else, there's no saying what might
happen.
 
S

stdazi

One way I've seen strcmp(char *s1, char *s2) implemented is: return
immediately if s1==s2 (equality of pointers); otherwise do the usual
thing of searching through the memory at s1 and s2.

Of course the reason for doing this is to save time in case equal
pointers are passed to strcmp.

It would be interesting to see how many times (passing two equal
pointers) to strcmp() happens.
 
F

Flash Gordon

user923005 wrote, On 18/08/07 08:02:
Except when maid by a Dyslexic. ;-)
Being dyslexic myself, I can tell you that permutations are likely
errors, but substitution of r's for n's is improbable.
There are some substitutions that might occur, but they are likely to
be b, p, d, q which are (after all) the same letter.

Agreed. Getting mixed up between 5 and 2 was causing me a lot of grief
yesterday, made worse by the fact I was dealing with IP addresses
starting both 85. and 82.

I could never get the hang of the fact the writing was meant to go from
left to right until I started to learn to read music. It was also a
problem for others when it was my turn to lay the table, whether the
knife was on the left or the right was entirely random, and I was use
them whichever way they were without even noticing.
In my case it has the strange effect that mirrored writing is very
easy for me to read -- in fact too easy.

I have had to train myself about glass doors to stores. If it says
'PULL' on it but it is written on the other side of the door, anyone
else would instantly see that it is reversed and would push instead of
pull. If I am not thinking hard about it, I am quite likely to read
"PULL" and give it a good yank.

I've done that on several occasions as well. With me I am most likely to
be caught by short words what I recognise as a single chunk.
 
E

Eric Sosman

CBFalconer said:
user923005 wrote:
... snip ...

That is a bar to relying on any particular behaviour in such a
circumstance, not a bar to defining the result. Thus the following
should be a satisfactory system implementation of strcmp():

int strcmp(const char *_s1, const char *_s2) {
if (_s1 == _s2) return 0;
else
while (*_s1 && (*_s1 == *_s2)) {
_s1++; _s2++;
}
return (*_s1 > *_s2) - (*_s1 < *_s2);
} /* strcmp */

A digression from the main point (with which I agree):
There's a potential bug in the `return' statement on systems
where `char' is signed. 7.2.1.4p1:

"The sign of a nonzero value returned by the
comparison functions [...] is determined by
the sign of the difference between the values of
the first pair of characters (both interpreted
as unsigned char) that differ [...]"

That is, "abc\200" must compare greater than "abc\177",
even when the final `char' value in the first string is
negative.

Suggested fix:

return ((unsigned char) *_s1 > (unsigned char) *_s2)
- ((unsigned char) *_s1 < (unsigned char) *_s2);
 
P

pete

Eric said:
user923005 wrote:
... snip ...

That is a bar to relying on any particular behaviour in such a
circumstance, not a bar to defining the result. Thus the following
should be a satisfactory system implementation of strcmp():

int strcmp(const char *_s1, const char *_s2) {
if (_s1 == _s2) return 0;
else
while (*_s1 && (*_s1 == *_s2)) {
_s1++; _s2++;
}
return (*_s1 > *_s2) - (*_s1 < *_s2);
} /* strcmp */

A digression from the main point (with which I agree):
There's a potential bug in the `return' statement on systems
where `char' is signed. 7.2.1.4p1:

"The sign of a nonzero value returned by the
comparison functions [...] is determined by
the sign of the difference between the values of
the first pair of characters (both interpreted
as unsigned char) that differ [...]"

That is, "abc\200" must compare greater than "abc\177",
even when the final `char' value in the first string is
negative.

Suggested fix:

return ((unsigned char) *_s1 > (unsigned char) *_s2)
- ((unsigned char) *_s1 < (unsigned char) *_s2);

My Turn!
/*
** Could not care less about the s1 == s2 case
*/
int str_cmp(const char *s1, const char *s2)
{
const unsigned char *p1 = (const unsigned char *)s1;
const unsigned char *p2 = (const unsigned char *)s2;

for (;;) {
if (*p1 != *p2) {
return *p2 > *p1 ? -1 : 1;
}
if (*p1 == '\0') {
return 0;
}
++p1;
++p2;
}
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,578
Members
45,052
Latest member
LucyCarper

Latest Threads

Top