Lauri Alanko
I'm beginning to wonder if I should use the char type at all any
more.
An abstract textual character is nowadays a very complex
concept. Perhaps it is best represented as a Unicode code point,
perhaps as something else, but in any case a sensible
representation of an abstract encoding-independent character
cannot fit into a char (which is almost always eight bits wide),
but needs something else: wchar_t, uint32_t, a struct, or
something.
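Something like this is what I have in mind (a rough sketch, the
names purely illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* One Unicode code point needs at least 21 bits, so it
       cannot fit in an (almost always 8-bit) char: */
    typedef uint32_t codepoint;

    /* And an "abstract character" may even be a base code point
       plus combining marks, hence "a struct, or something": */
    struct abstract_char {
        codepoint base;
        codepoint combining[4];   /* illustrative fixed bound */
        size_t    ncombining;
    };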
On the other hand, if we are dealing with an encoding-specific
representation, e.g. an ASCII string or UTF-8 string or whatever,
then we'd better deal with it as pure binary data, and that is
more natural to represent as a sequence of unsigned char or
uint8_t.
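That is, roughly:

    #include <stddef.h>
    #include <stdint.h>

    /* The encoding is metadata about the bytes, not a property
       of the element type: */
    struct utf8_string {
        const uint8_t *bytes;
        size_t         len;   /* length in bytes, not characters */
    };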
Perhaps in the olden days it was at least conceptually (if not
practically) useful to have a type char for characters, distinct
from signed char and unsigned char, which were for small
integers. This made sense in a world where there were several
encodings but all of them were single-byte. The distinct char
type signalled: "this is meant to be a character, not just any
number; don't depend on the character's integer value if you
want to be portable".
But nowadays Unicode is everywhere, and the de facto standard
encoding is UTF-8. The char type won't cut it for characters any
more. And in those rare situations where one can still assume
that all the world is ASCII (or Latin-1, or even EBCDIC), there
is still no benefit to using char over unsigned char. Apart from
legacy library APIs, of course.
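And even there the cost is one cast at the boundary, something
like:

    #include <stdio.h>

    /* Keep the data as unsigned char internally; cast only where
       a legacy interface (here fputs) insists on char: */
    static void print_utf8(const unsigned char *s)
    {
        fputs((const char *)s, stdout);
    }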
So is there any situation where a modern C programmer, without
the baggage of legacy interfaces, should still use the char
type?
Lauri