Questions on character constants

Luca Forlizzi · Dec 12, 2010

There are two points in Sec. 6.4.4.4, which describes character
constants, that are not entirely clear to me. It may be that I am
misreading the text, or that I have not correctly understood the
issues of character encodings.

In p10 there is the sentence "The value of an integer character
constant containing a single character that maps to a single-byte
execution character is the numerical value of the representation of
the mapped character interpreted as an integer".
This confirms that a single character of the source set may be mapped
to multiple bytes in the execution character set (and this is
consistent with other parts of the standard). But p10 also contains
the sentence "If an integer character constant contains a single
character or escape sequence, its value is the one that results when
an object with type char whose value is that of the single character
or escape sequence is converted to type int". This sentence seems to
imply that the value corresponding to a single character (or escape
sequence) fits into a single object of type char, i.e., into a single
byte. Doesn't the latter sentence contradict the former (and other
parts of the standard)?

In p11 there is the sentence "The value of a wide character constant
containing a single multibyte character that maps to a member of the
extended execution character set is the wide character corresponding
to that multibyte character, as defined by the mbtowc function, with
an implementation-defined current locale."
This sentence suggests to me that the function mbtowc maps the
multibyte encoding of a character of the *source* character set to a
wide character.
I find this surprising for the following reasons:
1) the second parameter of mbtowc is a const char *, i.e. a pointer to
   bytes in the execution environment
2) wctomb operates at run time, so I think it converts a wide
   character to a multibyte encoding in the execution environment; I
   would expect wctomb and mbtowc to be inverses of each other

One more question: a byte is (sec. 3.3.6) a unit of data storage of
the execution environment. Isn't it possible that the host environment
has units of data storage with a different number of bits?

Eric Sosman · Dec 12, 2010

There are two points in Sec. 6.4.4.4, which describes character
constants, that are not entirely clear to me. It may be that I am
misreading the text, or that I have not correctly understood the
issues of character encodings.

In p10 there is the sentence "The value of an integer character
constant containing a single character that maps to a single-byte
execution character is the numerical value of the representation of
the mapped character interpreted as an integer".
This confirms that a single character of the source set may be mapped
to multiple bytes in the execution character set (and this is
consistent with other parts of the standard). But p10 also contains
the sentence "If an integer character constant contains a single
character or escape sequence, its value is the one that results when
an object with type char whose value is that of the single character
or escape sequence is converted to type int". This sentence seems to
imply that the value corresponding to a single character (or escape
sequence) fits into a single object of type char, i.e., into a single
byte. Doesn't the latter sentence contradict the former (and other
parts of the standard)?

"The escape sequence" refers to the source-code escape sequences,
multi-source-character sequences like '\n' or '\xFF'. When you write
the resulting character to a stream, the implementation might use an
encoding scheme like Shift JIS that employs "escape sequences" of its
own, but these "escape sequences" are not the source-level constructs
described in 6.4.4.4.

(I'll pass on your question about mbtowc() et al. because I have
used them only a few times, and even then without real understanding.)

One more question: a byte is (sec. 3.3.6) a unit of data storage of

ITYM 3.6.

the execution environment. Isn't it possible that the host environment
has units of data storage with a different number of bits?

Yes, and in fact it's quite common. Very many C platforms support
"units" of many sizes: bytes, halfwords, words, doublewords, pages, ...
The crucial requirement in 3.6 is not only that this unit exist, but
that it be an "addressable unit," because in C's view nearly every data
object can be treated as an array of bytes. Even if this treatment is
not "natural" for the underlying platform, the C implementation must
somehow make the array-of-bytes view work. For example, the original
DEC Alpha supported 32- and 64-bit units, and used shifts and masks
to simulate byte access within those larger blobs.

luser- -droog · Dec 13, 2010

There are two points in Sec. 6.4.4.4, which describes character
constants, that are not entirely clear to me. It may be that I am
misreading the text, or that I have not correctly understood the
issues of character encodings.

In p10 there is the sentence "The value of an integer character
constant containing a single character that maps to a single-byte
execution character is the numerical value of the representation of
the mapped character interpreted as an integer".
This confirms that a single character of the source set may be mapped
to multiple bytes in the execution character set (and this is
consistent with other parts of the standard).

[snip]

It may be helpful, in learning C, to shift your perspective from bytes
to words. One of C's original, primary purposes was to be fast. This
means that the compiler is less concerned with the spatial arrangement
of code and data in core memory, and more focused on the temporal
arrangement of instructions as they will be executed.

So when the standard talks about 'integer character constant's, it's
dead fucking serious. This object is not a char. It's an int. This is
because when it gets loaded into a register, it'll be an integer-sized
register. It doesn't matter if there are separately-addressable
byte-sized BH and BL registers, it's gonna use EBX.

It may also be helpful, when reading dense matter like standards,
to circle or highlight the nouns with all their adjectives and
adornments.

So in this case,

"The
value
[-- perhaps they could've said 'value yielded in an expression' --]

of an

integer character constant
[-- this is just the term being defined, so we don't get to assume
anything about this beast yet except those three words --]

containing a

single character that maps to a single-byte execution character
[-- so it's telling us absolutely nothing about characters that
do not map to a 'single-byte execution character', whatever that
may mean --]

is the

numerical value of the representation of the mapped character
interpreted as an integer"
[-- remember we're talking about the 'value' of this creature.
Its value is a number. It's whatever number it needs to be
to match the 'representation of the mapped character' if you
had to give the most obvious number to it. --]
