strcmp implementation with unsigned conversion

jianhua

1. Why does the standard[*] require that the values of the pair of
characters be "both interpreted as unsigned char"?

2. Can they both be interpreted as other, larger types,
e.g. int, unsigned int, long, unsigned long?

3. Does "both interpreted as unsigned char" mean that

this is wrong:

int strcmp(const char *cs, const char *ct)
{
    while (1) {
        if (*cs != *ct)
            return *cs < *ct ? -1 : 1;
        if (!*cs)
            break;
        cs++, ct++;
    }
    return 0;
}

but this is right:

/*
Copyright (C) 1991, 1992 Linus Torvalds
*/
int strcmp(const char *cs, const char *ct)
{
    unsigned char c1, c2;

    while (1) {
        c1 = *cs++; /* gcc warning: -Wconversion */
        c2 = *ct++; /* gcc warning: -Wconversion */
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}


[*] 7.23.4 Comparison functions
1 The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters (both
interpreted as unsigned char) that differ in the objects being
compared.
 
Keith Thompson

jianhua said:
1. Why does the standard[*] require that the values of the pair of
characters be "both interpreted as unsigned char"?

So that it gives consistent results for characters outside the range
0..SCHAR_MAX (commonly 0..127).

ASCII, for example, is a strictly 7-bit character set, so any values
outside the range 0..127 are not valid characters. But most other
character sets, including EBCDIC and modern ASCII-based sets such as
Latin-1 and the various Unicode representations, do have meaningful
character values above 127.

For example, the copyright sign has the code 169 (0xa9) in Latin-1. If
plain char is signed, storing the value 169 in a char object will
probably cause it to be stored as -87; if plain char is unsigned, it's
just stored as 194. By interpreting the stored value *as if* it were an
unsigned char, strcmp() consistently treats the copyright sign as being
greater than, for example, the letter 'c'. Without this requirement,
collation sequences could differ depending on whether the compiler
chooses to make plain char signed or unsigned.
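
The effect can be sketched in C. The helper names below
(order_as_signed, order_as_unsigned) are made up for illustration, and
the value -87 used in the note after the code assumes an 8-bit, two's
complement signed char, where it shares its bit pattern with 0xa9.

```c
/* Hypothetical helpers for illustration; not part of any library. */

/* Order two characters by their signed values, as a naive comparison
   would on an implementation where plain char is signed. */
int order_as_signed(signed char a, signed char b)
{
    return (a > b) - (a < b);
}

/* Order the same two characters after converting both to unsigned char.
   The conversion is well defined: a negative value has UCHAR_MAX + 1
   added to it. */
int order_as_unsigned(signed char a, signed char b)
{
    unsigned char ua = (unsigned char)a;
    unsigned char ub = (unsigned char)b;
    return (ua > ub) - (ua < ub);
}
```

With a = -87 and b = 'c' in an ASCII-based character set,
order_as_signed puts a before b, while order_as_unsigned puts a after
b, which is the ordering strcmp() is required to report.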

In principle, I'm not sure that the semantics are entirely well defined.
In practice, it works.

2. Can they both be interpreted as other, larger types,
e.g. int, unsigned int, long, unsigned long?

I'm not even sure what that means. The phrase "interpreted as" means, I
think, that the representation of the char object is treated as if it
were an unsigned char object. I don't think it makes sense to treat a
char object as something bigger than one byte.

3. Does "both interpreted as unsigned char" mean that

this is wrong:

int strcmp(const char *cs, const char *ct)
{
    while (1) {
        if (*cs != *ct)
            return *cs < *ct ? -1 : 1;
        if (!*cs)
            break;
        cs++, ct++;
    }
    return 0;
}

but this is right:

/*
Copyright (C) 1991, 1992 Linus Torvalds
*/
int strcmp(const char *cs, const char *ct)
{
    unsigned char c1, c2;

    while (1) {
        c1 = *cs++; /* gcc warning: -Wconversion */
        c2 = *ct++; /* gcc warning: -Wconversion */
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}

Yes. More precisely, the first is not portable; it works just fine if
plain char is unsigned, but it gives incorrect results if plain char is
signed and some of the values being compared exceed SCHAR_MAX.
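
One way to get the unsigned comparison without assigning a plain char
to an unsigned char (the assignments flagged with -Wconversion above)
is to read both strings through unsigned char pointers. This is only a
sketch; the name strcmp_uchar is hypothetical, chosen to avoid clashing
with the library function, and it is not the kernel's code.

```c
/* Hypothetical name; a sketch, not the Linux or libc implementation. */
int strcmp_uchar(const char *cs, const char *ct)
{
    /* Access the bytes as unsigned char, whatever the signedness
       of plain char happens to be. */
    const unsigned char *p1 = (const unsigned char *)cs;
    const unsigned char *p2 = (const unsigned char *)ct;

    while (*p1 == *p2) {
        if (*p1 == '\0')
            return 0;   /* reached the terminator in both strings */
        p1++;
        p2++;
    }
    /* First differing pair, compared as unsigned char. */
    return *p1 < *p2 ? -1 : 1;
}
```

Because the bytes are read as unsigned char, a string starting with a
byte above SCHAR_MAX sorts after one starting with 'c' regardless of
how the compiler treats plain char.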

[*] 7.23.4 Comparison functions
1 The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters (both
interpreted as unsigned char) that differ in the objects being
compared.
 
Eric Sosman

jianhua said:
[...]
but this is right:

/*
Copyright (C) 1991, 1992 Linus Torvalds
*/
int strcmp(const char *cs, const char *ct)
{
    unsigned char c1, c2;

    while (1) {
        c1 = *cs++; /* gcc warning: -Wconversion */
        c2 = *ct++; /* gcc warning: -Wconversion */
        if (c1 != c2)
            return c1 < c2 ? -1 : 1;
        if (!c1)
            break;
    }
    return 0;
}

Yes. More precisely, the first is not portable; it works just fine if
plain char is unsigned, but it gives incorrect results if plain char is
signed and some of the values being compared exceed SCHAR_MAX.

I don't think the latter is perfectly portable, either (though
it's portable to all the machines Torvalds was concerned with). On
systems with signed char using ones' complement or signed magnitude
representation, both plain zero and minus zero would convert to zero
as unsigned char (if the latter conversion didn't trap), and would
then be indistinguishable. It's my belief that strcmp() et al.
should treat minus zero as greater than plain zero, because the
former has a 1-bit while the latter does not.

In short, I don't think the Standard's "interpreted as" can be
taken to have the same meaning as "converted to."
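
The two readings can be put side by side in a short sketch. The helper
names (as_converted, as_interpreted, readings_agree) are hypothetical.
On two's complement machines both readings give the same result for
every char value; the disagreement described above would need a
representation that has a negative zero.

```c
/* Hypothetical helpers illustrating the two readings. */

/* "Converted to": the value of c is converted to unsigned char
   (a negative value has UCHAR_MAX + 1 added to it). */
unsigned char as_converted(char c)
{
    return (unsigned char)c;
}

/* "Interpreted as": the byte holding c is read through an
   unsigned char lvalue, so its bit pattern is used unchanged. */
unsigned char as_interpreted(const char *p)
{
    return *(const unsigned char *)p;
}

/* Returns 1 when both readings give the same result for c. */
int readings_agree(char c)
{
    return as_converted(c) == as_interpreted(&c);
}
```

A strcmp() built on as_converted compares values, while one built on
as_interpreted compares representations; only the second could tell a
minus zero from a plain zero.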
 
