comparing two strcasecmp (stricmp) implementations

T

Tim Rentsch

William Krick said:
Tim said:
William Krick said:
Tim Rentsch wrote:
One final revision. I've modified the return statement so that it
returns -1 / 0 / 1 to bring it in line with the behaviour of other
similar functions...

int str_ccmp( const char *s1, const char *s2 )
{
int c1, c2;
for(;;)
{
c1 = tolower( (unsigned char) *s1++ );
c2 = tolower( (unsigned char) *s2++ );
if (c1 == 0 || c1 != c2)
return c1 == c2 ? 0 : c1 > c2 ? 1 : -1;
}
}

If the tests are written differently, the return values
might be somewhat clearer:

int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;
if( c1 == 0 ) return 0;
}
}


Actually, when you cleaned up the return conditions, you left out some
of the conditions and broke the code. [snip]

If you look again I think you'll see that the posted function
does indeed work properly. Here is the same function with some
assertions added -- see if you agree.


int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;

assert( c1 == c2 );

if( c1 == 0 ) return 0;

assert( c1 == c2 && c1 != 0 );
}
}

Note that each of the 'return' statements is executed only if the
condition 'c1 != c2 || c1 == 0' is true, because of the 'if'
tests; it's not necessary to test for it separately.


I'll be damned. You're right. My bad.

Even though that is clearer, it would probably be a little slower since
there's 3 comparisons being done on each loop vs two in this version
[...]

Just a few quick reactions:

1. It's almost always better to put clarity concerns ahead of
concerns about micro-optimization.

2. We don't know that the generated code will have three
comparisons in the main path of the loop; it could be
optimized to only two (and not unreasonable to guess that
it might be).

3. Even if there are three conditional branches, they won't
necessarily run slower than two - depends on how the
branches are predicted, etc.

4. Performance decisions should be based on measurement.

5. Any differerence is highly unlikely to be on the critical
path. ("Premature optimization is the root of all evil.")
 
T

Tim Rentsch

pete said:
William said:
Tim said:
of the conditions and broke the code. [snip]

If you look again I think you'll see that the posted function
does indeed work properly. Here is the same function with some
assertions added -- see if you agree.


int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;

assert( c1 == c2 );

if( c1 == 0 ) return 0;

assert( c1 == c2 && c1 != 0 );
}
}

Note that each of the 'return' statements is executed only if the
condition 'c1 != c2 || c1 == 0' is true, because of the 'if'
tests; it's not necessary to test for it separately.

I'll be damned. You're right. My bad.

Even though that is clearer,

Is it really clearer?
That's not the first thing that would pop into my head
after getting an explanation of how I read the code wrong.

There is a subtlety around the test for c1 == 0. See if you like
this writing better:

int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;
if( c1 == 0 && c2 == 0 ) return 0;
}
}

This might be read as:

Get a character from each string

If the first character is bigger, the first string is bigger
If the first character is smaller, the first string is smaller
If we've reached the end of both strings, they are equal

Otherwise, we don't know the result yet; go around the
loop and try the next characters...


The earlier version omitted the '&& c2 == 0', which works because
at that point in the code it will always be true if 'c1 == 0' is
true.
 
P

pete

Tim said:
pete said:
William said:
Tim Rentsch wrote:
"William Krick" <[email protected]> writes:
of the conditions and broke the code. [snip]

If you look again I think you'll see that the posted function
does indeed work properly. Here is the same function with some
assertions added -- see if you agree.


int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;

assert( c1 == c2 );

if( c1 == 0 ) return 0;

assert( c1 == c2 && c1 != 0 );
}
}

Note that each of the 'return' statements is executed only if the
condition 'c1 != c2 || c1 == 0' is true, because of the 'if'
tests; it's not necessary to test for it separately.

I'll be damned. You're right. My bad.

Even though that is clearer,

Is it really clearer?
That's not the first thing that would pop into my head
after getting an explanation of how I read the code wrong.

There is a subtlety around the test for c1 == 0. See if you like
this writing better:

int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;
if( c1 == 0 && c2 == 0 ) return 0;
}
}

This might be read as:

Get a character from each string

If the first character is bigger, the first string is bigger
If the first character is smaller, the first string is smaller
If we've reached the end of both strings, they are equal

Otherwise, we don't know the result yet; go around the
loop and try the next characters...

The earlier version omitted the '&& c2 == 0', which works because
at that point in the code it will always be true if 'c1 == 0' is
true.

I like this loop best:

do {
c1 = tolower((unsigned char)*s1++);
c2 = tolower((unsigned char)*s2++);
} while (c1 == c2 && c1 != '\0');

.... just keep looping while they're equal and not at the end.

follwed by

return c2 > c1 ? -1 : c2 != c1;

.... which is my prefered idiom for comparing int types
in a qsort compar function.
 
T

Tim Rentsch

pete said:
Tim said:
pete said:
William Krick wrote:

Tim Rentsch wrote:

of the conditions and broke the code. [snip]

If you look again I think you'll see that the posted function
does indeed work properly. Here is the same function with some
assertions added -- see if you agree.


int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;

assert( c1 == c2 );

if( c1 == 0 ) return 0;

assert( c1 == c2 && c1 != 0 );
}
}

Note that each of the 'return' statements is executed only if the
condition 'c1 != c2 || c1 == 0' is true, because of the 'if'
tests; it's not necessary to test for it separately.

I'll be damned. You're right. My bad.

Even though that is clearer,

Is it really clearer?
That's not the first thing that would pop into my head
after getting an explanation of how I read the code wrong.

There is a subtlety around the test for c1 == 0. See if you like
this writing better:

int
str_ccmp( const char *s1, const char *s2 ){
int c1, c2;

for( ; ; s1++, s2++ ){
c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;
if( c1 == 0 && c2 == 0 ) return 0;
}
}

This might be read as:

Get a character from each string

If the first character is bigger, the first string is bigger
If the first character is smaller, the first string is smaller
If we've reached the end of both strings, they are equal

Otherwise, we don't know the result yet; go around the
loop and try the next characters...

The earlier version omitted the '&& c2 == 0', which works because
at that point in the code it will always be true if 'c1 == 0' is
true.

I like this loop best:

do {
c1 = tolower((unsigned char)*s1++);
c2 = tolower((unsigned char)*s2++);
} while (c1 == c2 && c1 != '\0');

... just keep looping while they're equal and not at the end.

follwed by

return c2 > c1 ? -1 : c2 != c1;

... which is my prefered idiom for comparing int types
in a qsort compar function.

The two loops correspond to two basically different approaches
to comparing the strings, namely

1. A finite loop that finds the terminal location, then
tests for appropriate result

2. An "infinite" loop thats tests to see if result is known
yet, and continues on if not.

The 'do{...}while' loop takes approach 1, and the 'for(;;s1++,s2++)'
loop takes approach 2. I don't think either approach is inherently
better than the other.

However, it's easier to turn a loop written using approach 2 into a
functional version that uses tail recursion and has no (explicit)
looping at all:

int
str_ccmp( const char *const s1, const char *const s2 ){
const int c1 = tolower( (unsigned char) *s1 );
const int c2 = tolower( (unsigned char) *s2 );

if( c1 > c2 ) return 1;
if( c1 < c2 ) return -1;
if( c1 == 0 ) return 0;

return str_ccmp( s1+1, s2+1 );
}

Most likely this version will run slower than the do...while approach,
because there are three compares done for each character. But it's
easy to rewrite it to use the same condition optimization that the
do...while loop does:

int
str_ccmp( const char *const s1, const char *const s2 ){
const int c1 = tolower( (unsigned char) *s1 );
const int c2 = tolower( (unsigned char) *s2 );

if( c1 != c2 ) return c1 < c2 ? -1 : 1;
if( c1 == 0 ) return 0;

return str_ccmp( s1+1, s2+1 );
}

Depending on the particulars of the compiler etc, a function written
like this one can run faster than an imperative do...while version,
especially in the case of comparing short strings that are equal.

Again, I don't claim either approach 1 or approach 2 is necessarily
better than the other. Each has its plusses and minusses. There
is something nice though about the functional version.
 
P

pete

Tim said:
The two loops correspond to two basically different approaches
to comparing the strings, namely

1. A finite loop that finds the terminal location, then
tests for appropriate result

2. An "infinite" loop thats tests to see if result is known
yet, and continues on if not.

The 'do{...}while' loop takes approach 1, and the 'for(;;s1++,s2++)'
loop takes approach 2. I don't think either approach is inherently
better than the other.

However, it's easier to turn a loop written using approach 2 into a
functional version that uses tail recursion and has no (explicit)
looping at all:

How about for turning it into str_cncmp?

int
str_cncmp( const char *const s1, const char *const s2, size_t n );
 
T

Tim Rentsch

pete said:
How about for turning it into str_cncmp?

int
str_cncmp( const char *const s1, const char *const s2, size_t n );

Here's the infinite loop version:

int
str_cncmp( const char *s1, const char *s2, size_t n ){
int c1, c2;

for( ; ; s1++, s2++, n-- ){
if( n == 0 ) return 0;

c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 != c2 ) return c1 < c2 ? -1 : 1;

if( c1 == 0 ) return 0;
}
}


It should be easy enough to turn that into a tail recursive functional
(single assignment) version. Actually I wrote a functional version
straight away. For aesthetic reasons I wanted the two 'return 0;'
statements combined into just one, and wanted to do that with no extra
compares, hence:

int
str_cncmp( const char *const a, const char *const b, const size_t n ){
unsigned char ua, ub;
int ca, cb;

if( n == 0 || ((ua = *a) | (ub = *b)) == 0 ) return 0;

ca = tolower( ua ), cb = tolower( ub );
if( ca != cb ) return ca < cb ? -1 : 1;

return str_cncmp( a+1, b+1, n-1 );
}


This functional version can "back produce" an iterative one:

int
str_cncmp( const char *a, const char *b, size_t n ){
unsigned char ua, ub;
int ca, cb;

while( n-- > 0 && (ua = *a++) | (ub = *b++) ){
ca = tolower( ua ), cb = tolower( ub );
if( ca != cb ) return ca < cb ? -1 : 1;
}

return 0;
}


All the above just typed in; no compiles done, nor any performance
tuning.
 
P

pete

Tim Rentsch wrote:
Here's the infinite loop version:

int
str_cncmp( const char *s1, const char *s2, size_t n ){
int c1, c2;

for( ; ; s1++, s2++, n-- ){
if( n == 0 ) return 0;

c1 = tolower( (unsigned char) *s1 );
c2 = tolower( (unsigned char) *s2 );

if( c1 != c2 ) return c1 < c2 ? -1 : 1;

if( c1 == 0 ) return 0;
}
}

I think the above one is easier to read.
 
T

Tim Rentsch

pete said:
I think the above one is easier to read.

Not too surprising, since both the style and the paradigm are probably
more familiar. However, if you really want to understand the other
paradigm, what you should do is take imperative/iterative code in your
own style and write it functionally in your own style. For a while of
course imperative versions will seem more natural; but after you get
used to writing functional versions I expect you'll find they provide
equal clarity in many cases, and better in some. The point really is
to have both paradigms available; neither one is better in all cases.
 
P

pete

Tim said:
Not too surprising, since both the style and the paradigm are probably
more familiar.

And you'll see me posting similar versions in the future
without attribution. ;)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top