Questions on ISO C character constants

  • Thread starter Edward Rutherford

Edward Rutherford

There are 2 points in Sec. 6.4.4.4, describing character constants
that are not entirely clear to me. It may be that I have not understood
correctly the issues of character encodings.

In p10 there is the sentence "The value of an integer character
constant containing a single character that maps to a single-byte
execution character is the numerical value of the representation of the
mapped character interpreted as an integer".

This confirms that a single character of the source set may be mapped
to multiple bytes in the execution character set (and this is
consistent with other parts of the standard). But still on p10 there is
the sentence "If an integer character constant contains a single
character or escape sequence, its value is the one that results when an
object with type char whose value is that of the single character or
escape sequence is converted to type int". This sentence seems to imply
that the value corresponding to a single character (or escape sequence)
fits into a single object of type char, i.e., into a single byte.
Doesn't the latter sentence contradict the former (and other
parts of the standard)?

On p11 there is the sentence "The value of a wide character constant
containing a single multibyte character that maps to a member of the
extended execution character set is the wide character corresponding to
that multibyte character, as defined by the mbtowc function, with an
implementation-defined current locale."

This sentence suggests to me that the function mbtowc maps the
multibyte encoding of a character of the *source* character set to a wide
character.
I find this surprising for the following reasons:
1) the second parameter of mbtowc is a char *, so a pointer to bytes
in the execution environment
2) wctomb operates at runtime, so I think it converts a wide character
to a multibyte encoding in the execution environment; I would expect
wctomb and mbtowc to be inverses of each other

One more question: a byte is (sec. 3.3.6) a unit of data storage of
the execution environment. Isn't it possible that the host environment
has units of data storage with a different number of bits?
 

Kaz Kylheku

There are 2 points in Sec. 6.4.4.4, describing character constants
that are not entirely clear to me. It may be that I have not understood
correctly the issues of character encodings.

In p10 there is the sentence "The value of an integer character
constant containing a single character that maps to a single-byte
execution character is the numerical value of the representation of the
mapped character interpreted as an integer".

This confirms that a single character of the source set may be mapped
to multiple bytes in the execution character set (and this is
consistent with other parts of the standard).

However, this is not true of the source character set described in 5.2.1, whose
members are the only characters of which strictly conforming C programs can be
composed.

If you have a character constant like 'X' where X is some character that
doesn't fit into a byte, that is code which falls outside the scope of the above
sentence, since X is not "a single character that maps to a single-byte
execution character".

You didn't miss this text, in the same paragraph, right?

The value of an integer character constant containing more than one character
(e.g., 'ab'), or containing a character or escape sequence that does not map
to a single-byte execution character, is implementation-defined.

But still on p10 there is
the sentence "If an integer character constant contains a single
character or escape sequence, its value is the one that results when an
object with type char whose value is that of the single character or
escape sequence is converted to type int".

If the character has a value which doesn't fit into type char,
then it is shoehorned into that type anyway (leading to an
implementation-defined value). The resulting value converted
to int is the value of the character constant.
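
To make the "char object converted to int" wording concrete, here is a minimal sketch. The function names are made up for illustration, and the check assumes the compiler converts the out-of-range value 0xFF to char the same way it evaluates the escape sequence, which is the usual behavior but is implementation-defined when plain char is signed.

```c
/* Hypothetical helper: returns 1 when the constant '\xFF' has the same
   value as a char object holding 0xFF converted to int, which is how
   6.4.4.4p10 describes the value of the constant. */
int constant_matches_char_conversion(void)
{
    char c = (char)0xFF;   /* implementation-defined result if char is signed */
    return '\xFF' == (int)c;
}

/* Hypothetical helper: returns 1 if plain char is signed here. */
int plain_char_is_signed(void)
{
    return (char)-1 < 0;
}
```

The actual numeric value of '\xFF' is deliberately not asserted: it depends on whether plain char is signed, which plain_char_is_signed() reports.
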

On p11 there is the sentence "The value of a wide character constant
containing a single multibyte character that maps to a member of the
extended execution character set is the wide character corresponding to
that multibyte character, as defined by the mbtowc function, with an
implementation-defined current locale."

This sentence suggests to me that the function mbtowc maps the
multibyte encoding of a character of the *source* character set to a wide
character.

All mappings are from the source, to produce a value that will be
used in the execution environment. The multibyte character is a piece of
the program source code.

Which encoding is used is implementation-defined.

An implementation could use the encoding that is used on the target system
(execution environment), or it could use the one from the build machine
(translation environment), or something else entirely.

It looks like the standard simply does not specify a lot in this area.

There are reasons why one might want it done in any of these ways: source code
in character set/encoding X, the build machine in Y, and target machine in Z,
where these can all be different, or some of them are the same.

It's probably best not to constrain this sort of thing too much, because then
when someone wants a particular compilation scenario, it becomes needlessly
nonconforming.
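
As a small illustration of why the choice of encoding matters, the sketch below spells the same character (U+00E9) as bytes in two encodings, using hex escapes so the result does not depend on how the source file itself is encoded. The identifiers are hypothetical, and UTF-8 and Latin-1 are just two examples of what X, Y and Z above might be.

```c
#include <string.h>

/* U+00E9 (e with acute accent) as a UTF-8 byte sequence and as a
   Latin-1 byte, written with hex escapes to be independent of the
   encoding of this source file. */
static const char e_acute_utf8[]   = "\xC3\xA9";
static const char e_acute_latin1[] = "\xE9";

/* Number of bytes the character occupies in each encoding. */
size_t utf8_length(void)   { return strlen(e_acute_utf8); }
size_t latin1_length(void) { return strlen(e_acute_latin1); }
```

Under the first encoding the character is a multibyte character, so a character constant containing it has an implementation-defined value; under the second it maps to a single byte.
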

I find this surprising for the following reasons:
1) the second parameter of mbtowc is a char *, so a pointer to bytes
in the execution environment

A C compiler could contain a function exactly like mbtowc which it can use at
translation time.

That is to say, just because some semantic action in the translator
is described in terms of a C function doesn't mean that it has to be
delayed into the execution environment (which doesn't make sense).

6.4 Lexical elements has this in paragraph 4: ``If the input stream has been
parsed into preprocessing tokens up to a given character, the next
preprocessing token is the longest sequence of characters that could constitute
a preprocessing token.'' Since they wrote ``stream'' does that mean that
this is about a FILE * pointer in the execution environment?
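
A rough sketch of the mbtowc point: the translator has to arrive at the same wide character that a runtime call to mbtowc would produce. The function name below is invented for the example, and it assumes the program is still in the initial "C" locale and that basic characters have the same code as wide characters (i.e. __STDC_MB_MIGHT_NEQ_WC__ is not defined).

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if mbtowc, run in the execution
   environment, yields the same wide character that the compiler
   computed at translation time for L'A'. */
int runtime_matches_translation(void)
{
    wchar_t wc = 0;
    int n;

    (void)mbtowc(NULL, NULL, 0);   /* reset any shift state */
    n = mbtowc(&wc, "A", 1);       /* convert one multibyte character */
    return n == 1 && wc == L'A';
}
```

The compiler did its half of the comparison while translating L'A'; only the mbtowc call actually happens at run time.
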

One more question: a byte is (sec. 3.3.6) a unit of data storage of
the execution environment. Isn't it possible that the host environment
has units of data storage with a different number of bits?

You could have, say, a cross-compiler running on an 8-bit machine targeting a
9-bit machine. That cross-compiler would have to do all compile-time
calculations involving characters and bytes as 9-bit quantities, so that it
works out the correct values, as if the calculation were being done on the
target system. E.g. the constant expression ('\x100' >> 8) should reduce to 1
(assuming plain char is unsigned on the target), etc.
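
The kind of target-width arithmetic such a cross-compiler would need can be sketched like this. Both helpers are hypothetical, not taken from any real compiler; they assume the target char width is smaller than the host's unsigned long, and the signed case assumes two's complement wraparound, which the standard leaves implementation-defined.

```c
/* Hypothetical helper: does an escape value fit in the target's
   unsigned char, given the target's CHAR_BIT? */
int escape_in_range(unsigned long value, unsigned target_char_bit)
{
    unsigned long max = (1UL << target_char_bit) - 1UL;
    return value <= max;
}

/* Hypothetical helper: value of a single-escape character constant as
   the target would see it. If char_is_signed is nonzero, values with
   the top bit set wrap to negative (two's complement assumed). */
long target_char_constant(unsigned long value, unsigned target_char_bit,
                          int char_is_signed)
{
    unsigned long max = (1UL << target_char_bit) - 1UL;
    unsigned long sign_bit = 1UL << (target_char_bit - 1);

    value &= max;                            /* keep only target bits */
    if (char_is_signed && (value & sign_bit))
        return (long)value - (long)max - 1L; /* wrap into negative range */
    return (long)value;
}
```

With a 9-bit unsigned char, target_char_constant(0x100, 9, 0) >> 8 gives 1, matching the example above, whereas escape_in_range(0x100, 8) reports that the same escape does not fit on an 8-bit target at all.
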
 
