Keith Thompson said:
If your system has CHAR_BIT==8, then '\400' violates a constraint.
See C99 6.4.4.4p9:
Constraints
The value of an octal or hexadecimal escape sequence shall
be in the range of representable values for the type unsigned
char for an integer character constant, or the unsigned type
corresponding to wchar_t for a wide character constant.
Of course the compiler is free to issue a warning and then go on to
treat it as '\0' (which is what gcc does).
Hmm. Since a character constant is of type int, I would have expected
'\x82' to have type int and value +130. But gcc and Sun's C compiler
agree that its value is -126.
C99 6.4.4.4p6 specifies the meaning of a hexadecimal escape sequence:
The hexadecimal digits that follow the backslash and the letter
x in a hexadecimal escape sequence are taken to be part of the
construction of a single character for an integer character
constant or of a single wide character for a wide character
constant. The numerical value of the hexadecimal integer so
formed specifies the value of the desired character or wide
character.
Note that it says "character"; it doesn't refer to the type (plain)
char.
And, of course, the constraint I already quoted says that the value of
the hexadecimal escape sequence must be in the range of type unsigned
char. If '\x82' has the value -126, then it violates the constraint,
which I don't think is the intent.
My tentative conclusion is that the value of '\x82' is supposed to be
+130, not -126, and that both gcc and Sun's compiler get this wrong.
I'd be interested in any counterarguments.
I think the answer lies further on, in p10 under semantics:
10 An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined. If an integer character constant contains a
single character or escape sequence, its value is the one that
results when an object with type char whose value is that of the
single character or escape sequence is converted to type int.
The actual int value of '\x82' is the value you'd get from a char with
that "character value" after being converted to int. Since char is
probably signed in the implementation you are using, gcc can give
-126.
I think all the gyrations are to avoid the possibility of an
implementation-defined conversion of an out-of-range value. I think
that is why the standard talks about the value of the character rather
than using more concrete C terms. '\x82' can't just be 0x82 because
then, with a signed char type,
char c = '\x82';
would be governed by 6.3.1.3 p3:
"Otherwise, the new type is signed and the value cannot be
represented in it; either the result is implementation-defined or
an implementation-defined signal is raised."
Instead, '\x82' denotes some character (not char) value that is put in
a char object and that char is converted to int. I think -126 is
correct on signed char machines.
This issue isn't likely to cause problems in real-world code, since
character constants are usually used with objects of type char, signed
char, or unsigned char. There's no good reason to use '\x82' rather
than 0x82 if you want to store it in an int.
Agreed.
You do want to be able to assign '\x82' to a char, though, and that
requires that '\x82' (an int) be in the range of char. The hex part
must be in the range of unsigned char, and the value you finally get
is the result of putting a not entirely well-specified "character
value" into a char object and converting that to int.