Is this a correct implementation of strstr?

Phil Carmody

Ian Collins said:
So it doesn't have a conforming C implementation...

Care to C&V that claim? In case you do, here's my counter in
advance:

The value of the hex escape \xA0 is 160 as an unsigned char.
The int value that unsigned char 160 maps onto is 160, so '\xA0' is an
int with value 160.
char has the same range as signed char on the DS-2010, and so the
maximum value of a char is 127. The conversion of out-of-range
values onto signed char is that of clipping the value at whichever
bound is exceeded. So int 160 is converted to 127.

Which bit(s) of that violated which paragraph(s) of the standard?

Phil
 
Phil Carmody

Kenneth Brody said:
Dann Corbit said:
I request your opinion about the following attempt to implement the
standard function strstr. Here is the code : [...]
while ((unsigned char)*s != (unsigned char)c) { [...]
The casts add nothing.

I don't think that's true, though the second cast is probably
unnecessary.

Actually, I think it is necessary in the above statement. What
happens if, rather than calling mystrchr(foo,160) you use
mystrchr(foo,'\xa0')?

'\xa0' has value 160.

Phil
 
Ian Collins

Care to C&V that claim? In case you do, here's my counter in
advance:

The value of the hex escape \xA0 is 160 as an unsigned char.
The int value that unsigned char 160 maps onto is 160, so '\xA0' is an
int with value 160.
char has the same range as signed char on the DS-2010, and so the
maximum value of a char is 127. The conversion of out-of-range
values onto signed char is that of clipping the value at whichever
bound is exceeded. So int 160 is converted to 127.

Which bit(s) of that violated which paragraph(s) of the standard?

Irrelevant; if CHAR_MAX == 127, you can't have a character value of 160, so
the hex escape '\xA0' can never appear in a string.
 
Ian Collins

Some platforms use signed chars, and (char)160 != (int)160.

This prints "no" on my system:

#include <stdio.h>

main()
{
    char c = 160;
    int i = 160;

    if ( c == i )
        printf("yes\n");
    else
        printf("no\n");
}

It upsets the compiler on mine...

Anyway, that isn't the issue. If the value 160 (-96 in signed char)
appeared in a character string, it would compare equal to (char)160. To
reuse your example:

#include <stdio.h>

int main(void)
{
    int i = 160;
    char c = 160;

    if ( c == (char)i )
        printf("yes\n");
    else
        printf("no\n");
    return 0;
}
 
Keith Thompson

Ian Collins said:
Irrelevant; if CHAR_MAX == 127, you can't have a character value of 160, so
the hex escape '\xA0' can never appear in a string.

Incorrect.

Assume CHAR_MAX==127 and UCHAR_MAX==255 (this is very common).

C99 6.4.4.4p9 says:

Constraints

9 The value of an octal or hexadecimal escape sequence shall be in the
range of representable values for the type *unsigned char* for an
integer character constant, or the unsigned type corresponding to
wchar_t for a wide character constant.

(emphasis added)

Note that UCHAR_MAX must be at least 255, so '\xA0' is always legal
(and is always of type int with the value 160).
 
Ian Collins

Incorrect.

Assume CHAR_MAX==127 and UCHAR_MAX==255 (this is very common).

Oops, I misinterpreted Phil's post. No, I didn't; he said

"The conversion of out-of-range values onto signed char is that of
clipping the value at whichever bound is exceeded. So int 160 is
converted to 127."

Which isn't what happens in the common case you quoted. My analysis stands.
 
Seebs

Irrelevant; if CHAR_MAX == 127, you can't have a character value of 160, so
the hex escape '\xA0' can never appear in a string.

So how, on this hypothetical system, would you indicate a character with
a negative value?

On many systems with signed char, you can represent -1 as \xff, because
it wraps around in the "expected" way.

-s
 
Keith Thompson

Ian Collins said:
Oops, I misinterpreted Phil's post. No i didn't, he said

"The conversion of out-of-range values onto signed char is that of
clipping the value at whichever bound is exceeded. So int 160 is
converted to 127."

Which isn't what happens in the common case you quoted. My analysis stands.

The standard requires the second argument to strchr() to be converted
from int to char (C99 7.21.5.2p2).

If plain char is signed, and the int value is outside the range
CHAR_MIN..CHAR_MAX, the result of the conversion is at best
implementation-defined (C99 6.3.1.3p3); the clipping Phil described
is legal.

The typical output of this program:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "\xff--\xa0";
    const char *result = strchr(s, '\xa0');

    if (result == NULL) {
        puts("result == NULL");
    }
    else {
        printf("result - s = %d\n", (int)(result - s));
    }
    return 0;
}

is "result - s = 3", but I think a conforming implementation on which
signed conversion saturates rather than wrapping around could print
"result - s = 0"; since both '\xff' and '\xa0' yield the same value
when converted to char, we get a false positive match. For the
implementation in question, strchr() doesn't do what we expect it to
*because* the implementation conforms to the standard.

If the standard said that each character of the string *and* the int
value are converted to unsigned char before the comparison, we
wouldn't have this potential problem.

I know of no real-world implementations that have this problem, though
non-2's-complement systems might introduce some interesting corner
cases.
 
Phil Carmody

Ian Collins said:
Irrelevant; if CHAR_MAX == 127, you can't have a character value of
160, so the hex escape '\xA0' can never appear in a string.

Who's putting hex escapes in strings? I'm certainly not.
I'm putting hex escapes in character constants, and getting
my strings from an outside source such as fgets.

Phil
 
Peter Nilsson

Phil Carmody said:
... on the DS-2010, (char)'\xA0' is 127.

I don't see how it can be. "The value of an integer character
constant containing a single character that maps to a single-
byte execution character is the numerical value of the
representation of the mapped character interpreted as an
integer."

Hence, the value of '\xA0' is as if...

({ unsigned char tmp = 0xA0; *(char*)&tmp; })

The representation 10100000 doesn't yield 127 on any
of the three number systems that might apply to signed
char.
 
Keith Thompson

Peter Nilsson said:
I don't see how it can be. "The value of an integer character
constant containing a single character that maps to a single-
byte execution character is the numerical value of the
representation of the mapped character interpreted as an
integer."

Hence, the value of '\xA0' is as if...

({ unsigned char tmp = 0xA0; *(char*)&tmp; })

The representation 10100000 doesn't yield 127 on any
of the three number systems that might apply to signed
char.

It's not the character constant that gives you 127, it's the
conversion specified by the cast.

'\xA0' is of type int with value 160. This is true regardless of the
range or signedness of plain char, or any other system-specific
consideration. The constant '\xA0' is the same as the constant 160
(unless you stringize it).

The cast causes the value 160 to be converted from int to char. If
plain char is signed and CHAR_MAX < 160, then the result of the
conversion is governed by C99 6.3.1.3p3:

Otherwise, the new type is signed and the value cannot be
represented in it; either the result is implementation-defined
or an implementation-defined signal is raised.

Most implementations behave "reasonably" by converting 160 to -96 (and
unfortunately the standard seems to implicitly assume this behavior, or
something very much like it, in the descriptions of some of the string
functions).
 
Keith Thompson

Kenneth Brody said:
Even if chars are signed?

On my system, this prints "-96":

#include <stdio.h>

main()
{
    int i = '\xa0';

    printf("%d\n", i);
}

Yeah, mine too.

I didn't read far enough in the standard. C99 6.4.4.4p9 says:

Constraints

9 The value of an octal or hexadecimal escape sequence shall
be in the range of representable values for the type unsigned
char for an integer character constant, or the unsigned type
corresponding to wchar_t for a wide character constant.

I mistakenly took that to be a specification of the value of
a character constant containing an octal or hexadecimal escape
sequence, but it isn't. It's just the value of the escape sequence.
The value of the character constant is defined in paragraph 10,
under Semantics:

If an integer character constant contains a single character
or escape sequence, its value is the one that results when
an object with type char whose value is that of the single
character or escape sequence is converted to type int.

I still find the wording a bit shaky. In '\xa0', the value of the
escape sequence, as defined by paragraph 9, is 160. Given that plain
char is signed and CHAR_BIT==8, an object with type char *cannot*
have the value 160.

The standard seems to be assuming that the values 160 and -96 are
interchangeable when stored in a char object. Either I've missed
something else obvious (which is quite possible), or the standard
is playing fast and loose with signed and unsigned values.

I think the standard's accuracy would be improved by changing a
lot of references to character values so they refer to the result
of converting those values to unsigned char. There seem to be a
lot of places that would be improved by this change, including the
description of strchr().

(Mandating that plain char is unsigned would also simplify things,
but that's probably not feasible.)

I've been posting on this thread asserting that '\xa0' is 160.
I apologize for the unintentional misinformation.
 
Peter Nilsson

Keith Thompson said:
It's not the character constant that gives you 127,
it's the conversion specified by the cast.

Fair enough, missed the (char).
'\xA0' is of type int with value 160. This is true
regardless of the range or signedness of plain char,
or any other system-specific consideration. The
constant '\xA0' is the same as the constant 160
(unless you stringize it).

Not necessarily, for reasons cited above.

% type schar.c
#include <limits.h>
#include <stdio.h>

int main(void)
{
    printf("CHAR_BIT is %d\n", CHAR_BIT);
    printf("char is %ssigned\n", (char) -1 < 0 ? "" : "un");
    printf("'\\xA0' is %d\n", '\xA0');
    return 0;
}

% acc schar.c -o schar.exe

% schar.exe
CHAR_BIT is 8
char is signed
'\xA0' is -96

%
 
Peter Nilsson

Fair enough, missed the (char).

No, I take that back; I was right the first time. The
cast to char is redundant as far as converting the value
of '\xA0' goes, because that value is necessarily already
in the range of char.

The only possible values for '\xA0' are:

160 - char is unsigned
160 - char is signed, CHAR_BIT > 8
-96 - char is signed, CHAR_BIT == 8, two's complement
-95 - char is signed, CHAR_BIT == 8, ones' complement
-32 - char is signed, CHAR_BIT == 8, sign magnitude
 
Michael Foukarakis

Yeah, mine too.

I didn't read far enough in the standard.  C99 6.4.4.4p9 says:

    Constraints

9   The value of an octal or hexadecimal escape sequence shall
    be in the range of representable values for the type unsigned
    char for an integer character constant, or the unsigned type
    corresponding to wchar_t for a wide character constant.

I mistakenly took that to be a specification of the value of
a character constant containing an octal or hexadecimal escape
sequence, but it isn't. It's just the value of the escape sequence.
The value of the character constant is defined in paragraph 10,
under Semantics:

    If an integer character constant contains a single character
    or escape sequence, its value is the one that results when
    an object with type char whose value is that of the single
    character or escape sequence is converted to type int.

I still find the wording a bit shaky.  In '\xa0', the value of the
escape sequence, as defined by paragraph 9, is 160.  Given that plain
char is signed and CHAR_BIT==8, an object with type char *cannot*
have the value 160.

The standard seems to be assuming that the values 160 and -96 are
interchangeable when stored in a char object.  Either I've missed
something else obvious (which is quite possible), or the standard
is playing fast and loose with signed and unsigned values.

If I'm not mistaken, the standard (since ANSI) mandates that char,
signed char and unsigned char are three different types. The whole
confusion in this thread stems from the fact that char has the same
values and representation as one of the other two types, and which
one it matches is implementation-defined. Personally, I wonder what
the rationale for such a decision was, although I can guess.
 
Keith Thompson

Keith Thompson said:
'\xA0' is of type int with value 160. This is true regardless of the
range or signedness of plain char, or any other system-specific
consideration. The constant '\xA0' is the same as the constant 160
(unless you stringize it).
[...]

For the record, I was wrong about this, as I explained in more
detail elsethread.
 
Keith Thompson

Peter Nilsson said:
No, I take that back, I was right the first time. The
cast to char is redundant as far as converting the value
of '\xA0' because it is necessarily already in the range
of char.
Right.

The only possible values for '\xA0' are:

160 - char is unsigned
160 - char is signed, CHAR_BIT > 8
-96 - char is signed, CHAR_BIT == 8, two's complement
-95 - char is signed, CHAR_BIT == 8, ones' complement
-32 - char is signed, CHAR_BIT == 8, sign magnitude

You're probably right as far as the intent is concerned, but I think
the wording of the standard is internally inconsistent.
 
Ian Collins

If I'm not mistaken, the standard (since ANSI) mandates that char,
signed char and unsigned char are three different types.

Distinct types rather than different; the difference in wording is significant.

See section 6.2.5 para 15.
 
candide

Ian Collins wrote:
That's the point I've been making all along. The representation of char
is irrelevant. So back to the original point, the two casts to unsigned
char are both superfluous and wrong.

I don't see how you can make the casts both "superfluous" and
"wrong": usually "superfluous" implies something harmless, which is
not the case for something _wrong_.

Do you mean the casts are always superfluous and sometimes wrong?
 
Phil Carmody

Kenneth Brody said:
Even if chars are signed?

The standard doesn't put a condition on the signedness of chars
when it specifies what the value should be, so presumably yes.
On my system, this prints "-96":

#include <stdio.h>

main()
{
    int i = '\xa0';

    printf("%d\n", i);
}

Mine too. I guess -96 is the value of a char holding 160 in
this instance.

Phil
 
