Is char obsolete?


Lauri Alanko

I'm beginning to wonder if I should use the char type at all any
more.

An abstract textual character is nowadays a very complex
concept. Perhaps it is best represented as a Unicode code point,
perhaps as something else, but in any case a sensible
representation of an abstract encoding-independent character
cannot fit into a char (which is almost always eight bits wide),
but needs something else: wchar_t, uint32_t, a struct, or
something.

On the other hand, if we are dealing with an encoding-specific
representation, e.g. an ASCII string or UTF-8 string or whatever,
then we'd better deal with it as pure binary data, and that is
more natural to represent as a sequence of unsigned char or
uint8_t.
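
For concreteness, here is a minimal sketch of the distinction
(the choice of U+00E4 is arbitrary, and it assumes <stdint.h> is
available):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* One abstract character, U+00E4 ("a" with diaeresis), held as a
       Unicode code point: it does not fit in an 8-bit char. */
    uint32_t code_point = 0x00E4;

    /* The same character in one specific encoding (UTF-8): two bytes
       of plain binary data. */
    const uint8_t utf8[] = { 0xC3, 0xA4 };

    printf("U+%04X encodes to %zu UTF-8 bytes\n",
           (unsigned) code_point, sizeof utf8);
    return 0;
}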

Perhaps in the olden days it was at least conceptually (if not
practically) useful to have a type char for characters, distinct
from signed char and unsigned char, which were for small
integers. This made sense in a world where there were several
encodings but all of them were single-byte. The distinct char
type signalled: "this is meant to be a character, not just any
number; don't depend on the character's integer value if you
want to be portable".

But nowadays Unicode is everywhere, and the de facto standard
encoding is UTF-8. The char type won't cut it for characters any
more. And in those rare situations where one can still assume
that all the world is ASCII (or Latin-1, or even EBCDIC), there
is still no benefit to using char over unsigned char. Apart from
legacy library APIs, of course.

So is there any situation where a modern C programmer, without
the baggage of legacy interfaces, should still use the char
type?


Lauri
 

Chris H

Lauri Alanko said:
> I'm beginning to wonder if I should use the char type at all any
> more.

Yes. There are many 8-bit MCUs still in widespread use, and very many
systems that use ASCII for characters.

> Perhaps in the olden days it was at least conceptually

The olden days are still here.

> But nowadays Unicode is everywhere,

Not everywhere.

> The char type won't cut it for characters any

But it is still needed. The Hayes AT command set still uses ASCII.

> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

Depends on what they are doing.
 

Thomas Richter

> I'm beginning to wonder if I should use the char type at all any
> more.
>
> An abstract textual character is nowadays a very complex
> concept.

However, a "char" is not an "abstract textual character". A char is the
smallest addressable memory unit of the system you compile to.

> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

Yes, of course. But not necessarily for text strings. It's the smallest
available integer type.

Greetings,
Thomas
 

Ben Bacarisse

Lauri Alanko said:
> I'm beginning to wonder if I should use the char type at all any
> more.
> [...] if we are dealing with an encoding-specific
> representation, e.g. an ASCII string or UTF-8 string or whatever,
> then we'd better deal with it as pure binary data, and that is
> more natural to represent as a sequence of unsigned char or
> uint8_t.

For UTF-8, that is only true for code that pokes about in the
representation. Most code will function perfectly well treating UTF-8
encoded strings as char arrays.
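
A small sketch of what I mean; the UTF-8 literal is just an arbitrary
example, and nothing below needs to know the encoding:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A UTF-8 encoded string held in an ordinary char array.  As long
       as we only copy, concatenate and search for byte sequences,
       nothing needs to know about the encoding. */
    char greeting[32] = "gr\xC3\xBC\xC3\x9F";            /* "grüß" in UTF-8 */

    strcat(greeting, " dich");                           /* byte-wise append */
    printf("%zu bytes: %s\n", strlen(greeting), greeting);
    printf("found: %s\n", strstr(greeting, "\xC3\x9F")); /* search for "ß" */
    return 0;
}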

> But nowadays Unicode is everywhere, and the de facto standard
> encoding is UTF-8. The char type won't cut it for characters any
> more.

I feel this is a generalisation from a specific issue -- that of
manipulating the representation. Can you say why, in general, char
won't cut it for UTF-8 encoded strings?

> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

How can one avoid this baggage? char and char arrays are used for
multi-byte encoded strings throughout the standard library.
 

Chris H

Thomas Richter <[email protected]> said:
> However, a "char" is not an "abstract textual character". A char is the
> smallest addressable memory unit of the system you compile to.

Mostly... I think you are confusing it with a byte (which does not
necessarily mean 8 bits, hence "octet").

Char is the smallest unit used to hold a character, which may not be the
same thing.

> Yes, of course. But not necessarily for text strings. It's the smallest
> available integer type.

That is not correct. There are THREE char types.

Signed char
unsigned char
char

Signed and unsigned char are integer types.
[Plain] char is a character type, NOT an integer type. Whether it is
mapped to signed or unsigned is at the whim of the compiler implementor.
Some compilers give you the option.
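
A quick sketch of the three distinct types; this assumes a C11 compiler,
since it uses _Generic to report the type names:

#include <stdio.h>

#define type_name(x) _Generic((x),          \
    char:          "char",                  \
    signed char:   "signed char",           \
    unsigned char: "unsigned char")

int main(void)
{
    char c = 'A';
    signed char sc = 'A';
    unsigned char uc = 'A';

    /* Three distinct types, even though plain char has the same
       representation as one of the other two. */
    printf("%s / %s / %s\n", type_name(c), type_name(sc), type_name(uc));
    return 0;
}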
 

Ben Bacarisse

Chris H said:
> In message <[email protected]>, Thomas Richter <[email protected]> writes
>> [speaking of char] It's the smallest available integer type.
>
> That is not correct. There are THREE char types.
>
> Signed char
> unsigned char
> char
>
> Signed and unsigned char are integer types.
> [Plain] char is a character type NOT an integer type.

6.2.5 p17:

"The type char, the signed and unsigned integer types, and the
enumerated types are collectively called integer types. [...]"

You may be remembering the rather less helpful term "standard
integer types" defined by paragraphs 4, 6 and 7 of the same section.

<snip>
 

Stefan Ram

Lauri Alanko said:
> I'm beginning to wonder if I should use the char type at all any
> more.
> An abstract textual character is nowadays a very complex
> concept.

If someone believes that he should use »char« for
characters, then does he also believe that he should use
»float« for floating point numbers?

For characters, you use whatever representation is
appropriate for the project.

> So is there any situation where a modern C programmer, without
> the baggage of legacy interfaces, should still use the char
> type?

A char object is used to store a member of the basic
execution character set.

It can also be used for objects that should have the
size 1 or - via arrays of char - the size n.
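
A minimal sketch of that second use, with a char array standing
in for "an object of size n":

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* sizeof(char) is 1 by definition, so an array of char is a
       natural way to name "an object of size n". */
    double d = 3.25;
    char raw[sizeof d];

    memcpy(raw, &d, sizeof d);    /* copy the object representation */
    printf("double occupies %zu bytes\n", sizeof raw);
    return 0;
}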
 

Lauri Alanko

Some clarifications.

Firstly, I'm talking specifically about the type "char". The
types "signed char" and "unsigned char" are perfectly
useful (though idiosyncratically named) integer types for
operating on the smallest addressable memory
units (a.k.a. bytes).

The type "char" is distinct from these, and it is strictly less
useful as an integer (due to its implementation-specific
signedness). So the only justification for it that I can see is
that it serves as a semantic annotation: a char is a byte that is
intended to be interpreted as a character in the basic execution
character set.

But I'm saying that nowadays the basic execution character set no
longer suffices for general-purpose text manipulation. So
wherever you need to manipulate an individual character as a
character, you'd better use wchar_t or similar.
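
For example, a minimal sketch of the wchar_t route (assuming a C99
compiler and a locale that can represent the character):

#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_ALL, "");          /* pick up the user's locale */

    /* Manipulating an individual character as a character: a wide
       character can hold code points that do not fit in one byte. */
    wchar_t ch = L'\u00E4';         /* "a" with diaeresis */
    wprintf(L"iswalpha: %d\n", iswalpha(ch) != 0);
    return 0;
}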

The standard library uses the type "char *" for its
representation of strings: a zero-terminated sequence of bytes
where one or more adjacent bytes represent a single
character. This is fine (although the choice of representation is
questionable), but the type name is confusing: a "char *" is a
pointer to a variable-length string structure, not to
a "character" as such. If we have "char* s;" then the only thing
we know about the meaning of "s" is that it is a byte that is
part of the encoding of a string. This particular use of bytes
hardly seems to be worth a distinct primitive type.

In any case, we are not forced to use the standard library as
such. Yes, it is "baggage", but it is easy to throw away: the
library is finite and it is simple to rewrite it or wrap it to a
different API. Perhaps one where we just have a "struct string"
abstract type and string operations take a "struct string*" as an
argument. We can even support string literals:

#include <stddef.h>    /* for size_t */

struct string {
    size_t len;
    unsigned char *data;
};

#define string_lit(s) &(struct string) { \
    .len = sizeof(s) - 1, \
    .data = (unsigned char[sizeof(s) - 1]){ s } \
}
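
A brief usage sketch, assuming the struct string and string_lit
definitions above are in scope; the string_eq helper is hypothetical,
just for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: byte-wise equality; no NUL terminator needed. */
static bool string_eq(const struct string *a, const struct string *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}

int main(void)
{
    struct string *greeting = string_lit("hello, world");

    printf("len = %zu\n", greeting->len);   /* prints: len = 12 */
    printf("equal = %d\n", string_eq(greeting, string_lit("hello, world")));
    return 0;
}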

So, if we program in a modern style and don't use the standard library
string operations directly, I'd say we no longer have a very good
reason to use "char" anywhere in the application code.


Lauri
 

Nobody

> Firstly, I'm talking specifically about the type "char". The
> types "signed char" and "unsigned char" are perfectly
> useful (though idiosyncratically named) integer types for
> operating on the smallest addressable memory
> units (a.k.a. bytes).
>
> The type "char" is distinct from these, and it is strictly less
> useful as an integer (due to its implementation-specific
> signedness). So the only justification for it that I can see is
> that it serves as a semantic annotation: a char is a byte that is
> intended to be interpreted as a character in the basic execution
> character set.

It has another justification: efficiency. Most of the time, you don't
actually care whether char is signed or unsigned, or even integral for
that matter.

> The standard library uses the type "char *" for its
> representation of strings:
> In any case, we are not forced to use the standard library as
> such.

You are if you want to interface to the OS. Aside from the ANSI/ISO C
functions, the Unix API uses char* extensively (Windows, OTOH, uses wide
characters).
 

Keith Thompson

Ben Bacarisse said:
> Lauri Alanko said:
>> I'm beginning to wonder if I should use the char type at all any
>> more.
>> [...] if we are dealing with an encoding-specific
>> representation, e.g. an ASCII string or UTF-8 string or whatever,
>> then we'd better deal with it as pure binary data, and that is
>> more natural to represent as a sequence of unsigned char or
>> uint8_t.
>
> For UTF-8, that is only true for code that pokes about in the
> representation. Most code will function perfectly well treating UTF-8
> encoded strings as char arrays.
>
>> But nowadays Unicode is everywhere, and the de facto standard
>> encoding is UTF-8. The char type won't cut it for characters any
>> more.
>
> I feel this is a generalisation from a specific issue -- that of
> manipulating the representation. Can you say why, in general, char
> won't cut it for UTF-8 encoded strings?

In principle, if plain char is signed (and let's assume CHAR_BIT==8),
then the result of converting an octet with a value exceeding 127 is
implementation-defined, and interpreting such an octet as a plain char
may not give you the value you expect. It's even conceivable that
round-trip conversions might lose information (if plain char is signed
and has distinct representations for +0 and -0).

In practice, it all Just Works on any system you're likely to
encounter; the only plausible exceptions are embedded systems
that aren't likely to be dealing with this kind of data anyway.
If plain char is represented in 2's complement, and if conversions
between corresponding signed and unsigned types simply reinterpret
the representation, then things work as expected. And any vendor
who introduced a system that violated these assumptions (without
some overwhelming reason for doing so) would probably go out of
business while citing the section of the Standard that says their
implementation is conforming.

[...]

If I were designing C from scratch today, I'd probably at least require
plain char to be unsigned (and INT_MAX > UCHAR_MAX) just to avoid these
potential issues.
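
A small demonstration of the round trip in question; the result in the
comment assumes the common case of a signed, 8-bit, 2's-complement
plain char:

#include <stdio.h>

int main(void)
{
    unsigned char byte = 0xC3;              /* lead byte of a UTF-8 sequence */
    char c = (char) byte;                   /* implementation-defined if char is signed */
    unsigned char back = (unsigned char) c;

    /* Where plain char is a signed 8-bit 2's-complement type (the
       common case), the bit pattern survives the round trip and this
       prints "0xC3 -> -61 -> 0xC3". */
    printf("0x%02X -> %d -> 0x%02X\n", byte, c, back);
    return 0;
}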
 

BartC

Lauri Alanko said:
> Some clarifications.
>
> Firstly, I'm talking specifically about the type "char". The
> types "signed char" and "unsigned char" are perfectly
> useful (though idiosyncratically named) integer types for
> operating on the smallest addressable memory
> units (a.k.a. bytes).

If you mean don't use a 'char' type where its signedness can be unknown,
then I'd agree. Too many things will not work properly when the signedness
of char is opposite to what is assumed.

However, I don't believe the 'char' type in C is anything specifically to do
with characters; it should really have been signed and unsigned byte, i.e.
just a small integer (and the smallest addressable unit).

To represent characters, unsigned byte is most appropriate, and can even
work well for a lot of Unicode stuff, either by using UTF-8, or just
sticking to the first 256 codes.

> But I'm saying that nowadays the basic execution character set no
> longer suffices for general-purpose text manipulation. So
> wherever you need to manipulate an individual character as a
> character, you'd better use wchar_t or similar.

To do Unicode properly, I don't think it's just a question of using a
slightly wider type; you'll probably be using extra libraries anyway.

But I would guess that a huge amount of code in C works quite happily using
8-bit characters.
 

Florian Weimer

* Lauri Alanko:
> But nowadays Unicode is everywhere, and the de facto standard
> encoding is UTF-8.

It's not on Windows, where UTF-16 is pervasive, or with Java. Some
other platforms have been infected by that, too.

> And in those rare situations where one can still assume
> that all the world is ASCII (or Latin-1, or even EBCDIC), there
> is still no benefit to using char over unsigned char. Apart from
> legacy library APIs, of course.

Even when restricting yourself to GNU/Linux platforms, the signedness
of the char type is not consistent across architectures. So char
seems rather unusable indeed.

I use char * and especially const char * to denote zero-terminated
strings and unsigned char * for arbitrary binary blobs (usually
combined with an explicit length). But that's probably just an odd
personal preference.
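
A sketch of that convention; both functions are hypothetical and named
only for illustration:

#include <stddef.h>
#include <stdio.h>

/* char * for NUL-terminated text... */
static void log_message(const char *text)
{
    printf("%s\n", text);
}

/* ...unsigned char * plus an explicit length for arbitrary binary data. */
static void send_blob(const unsigned char *data, size_t length)
{
    for (size_t i = 0; i < length; i++)
        printf("%02X ", data[i]);
    putchar('\n');
}

int main(void)
{
    log_message("hello");
    send_blob((const unsigned char[]){ 0xDE, 0xAD, 0xBE, 0xEF }, 4);
    return 0;
}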
 

Ben Bacarisse

Keith Thompson said:
> Ben Bacarisse said:
>> Lauri Alanko said:
>>> I'm beginning to wonder if I should use the char type at all any
>>> more.
>>> [...] if we are dealing with an encoding-specific
>>> representation, e.g. an ASCII string or UTF-8 string or whatever,
>>> then we'd better deal with it as pure binary data, and that is
>>> more natural to represent as a sequence of unsigned char or
>>> uint8_t.
>>
>> For UTF-8, that is only true for code that pokes about in the
>> representation. Most code will function perfectly well treating UTF-8
>> encoded strings as char arrays.
>>
>>> But nowadays Unicode is everywhere, and the de facto standard
>>> encoding is UTF-8. The char type won't cut it for characters any
>>> more.
>>
>> I feel this is a generalisation from a specific issue -- that of
>> manipulating the representation. Can you say why, in general, char
>> won't cut it for UTF-8 encoded strings?
>
> In principle, if plain char is signed (and let's assume CHAR_BIT==8),
> then the result of converting an octet with a value exceeding 127 is
> implementation-defined, and interpreting such an octet as a plain char
> may not give you the value you expect.

To my mind, this comes under the category of "pok[ing] about in the
representation" -- if you care about how char bit patterns map to values
(or vice versa) you are, I agree, better off using unsigned char. But
normally one does not care what the actual values are.

<snip>
 

Ian Collins

> Some clarifications.
>
> Firstly, I'm talking specifically about the type "char". The
> types "signed char" and "unsigned char" are perfectly
> useful (though idiosyncratically named) integer types for
> operating on the smallest addressable memory
> units (a.k.a. bytes).
>
> The type "char" is distinct from these, and it is strictly less
> useful as an integer (due to its implementation-specific
> signedness). So the only justification for it that I can see is
> that it serves as a semantic annotation: a char is a byte that is
> intended to be interpreted as a character in the basic execution
> character set.
>
> But I'm saying that nowadays the basic execution character set no
> longer suffices for general-purpose text manipulation. So
> wherever you need to manipulate an individual character as a
> character, you'd better use wchar_t or similar.

That depends on where, and on what, you are working. Working in an
English-speaking country on Unix/Linux system programming or on embedded
controllers, wchar_t is almost unknown.
 

robertwessel2

> You are if you want to interface to the OS. Aside from the ANSI/ISO C
> functions, the Unix API uses char* extensively (Windows, OTOH, uses wide
> characters).

FWIW, Windows supports two parallel sets of APIs (for most functions),
ones ending with A for ASCII/ANSI ("CreateFileA") and ones ending with
W for (wide) Unicode ("CreateFileW"). Not unexpectedly, the vast
majority of ...A functions are thin wrappers for the Unicode
function. You can reference the character-set-specific functions
directly, or, much more commonly, you'll have "UNICODE" #defined (or
not), and the headers generate #defines to make one or the other the
common name ("CreateFile" - without the A or W suffix). So in most
cases, it's perfectly plausible to write an application either way,
although Unicode is usually preferred.
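
A sketch of the pattern (Windows-only; with UNICODE defined before
including <windows.h>, CreateFile expands to CreateFileW and takes wide
strings):

#define UNICODE
#define _UNICODE
#include <windows.h>

int main(void)
{
    /* Because UNICODE is defined, CreateFile is CreateFileW here,
       so the file name is a wide (UTF-16) string literal. */
    HANDLE h = CreateFile(L"example.txt", GENERIC_READ, 0, NULL,
                          OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}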
 

Lauri Alanko

> All wider types do is obscure the problem. The OP should sit down with a
> book on Unicode and learn the difference between abstract characters,
> combining characters, code points, graphemes, grapheme clusters,
> etc.

The OP is familiar with Unicode, thank you very much.

I happen to agree with you that the entire concept of "character" is
bogus, and there are quite a number of reasonable ways to decompose a
string.

But my original point was that char is useless. You are retorting by
saying that wchar_t is also useless. That may be, but I don't see how
that is any kind of a defense for char.

> Perl 6 comes the closest, AFAIU, to addressing the complexity of Unicode by
> inventing its own normalization form, "NFG". At runtime it builds a
> "grapheme table" mapping new code points to irreducible sets of Unicode code
> points so that strings can be indexed with sensible results.

How does it handle XML, then, since the meaning of an XML document can
be changed during normalization?


Lauri
 

BartC

Lauri Alanko said:
> The OP is familiar with Unicode, thank you very much.
>
> I happen to agree with you that the entire concept of "character" is
> bogus, and there are quite a number of reasonable ways to decompose a
> string.
> But my original point was that char is useless.

Not really. Either as 'char', where the signedness doesn't matter, or
qualified with 'signed' or 'unsigned', it is still jolly useful for most
low-level text processing.

Unicode seems to be more of a library thing, rather than a language thing.

> You are retorting by
> saying that wchar_t is also useless. That may be, but I don't see how
> that is any kind of a defense for char.
>
> How does it handle XML, then, since the meaning of an XML document can
> be changed during normalization?

Again, this is all outside the scope of the language. You don't need to use
any char type at all, nor wchar_t (which in any case is probably 16 bits and
won't quite represent every Unicode character without some extra fiddling);
probably just int will do, especially as you seem to want to use your own
string processing anyway.

But you shouldn't be calling for the abolition of 'char' just because it is
not useful for you.
 

Thomas Richter

> Mostly... I think you are confusing it with a byte (which does not
> necessarily mean 8 bits, hence "octet").

No, I'm not confusing it. In C, we have char = byte, even though the
latter need not be 8 bits wide, i.e. byte != octet.

> Char is the smallest unit used to hold a character, which may not be the
> same thing.
>
> That is not correct. There are THREE char types.

In what way does this contradict what I said? Yes, there are three
char types. All three are integer types.

> Signed char
> unsigned char
> char
>
> Signed and unsigned char are integer types.

Yes, and so is char.

> [Plain] char is a character type NOT an integer type.

Nope.

Greetings,
Thomas
 

James Kuyper

> On 08.04.2011 17:32, Chris H wrote: ...
>
> No, I'm not confusing it. In C, we have char = byte, even though the
> latter need not be 8 bits wide, i.e. byte != octet.

'char' is a data type; 'byte' is a unit used to measure memory. An
object of type char is stored in one byte of memory, by C's definition
of 'byte', but that object has a lot of attributes other than
memory storage. Those attributes include a representation, a (trivial)
alignment requirement, and an integer conversion rank. It belongs to
several different type categories: "basic types", "integer types",
"arithmetic types", and "scalar types". It belongs to the real type
domain. None of those concepts have any meaning with respect to 'byte'.
 

Nobody

> FWIW, Windows supports two parallel sets of APIs (for most functions),
> ones ending with A for ASCII/ANSI ("CreateFileA") and ones ending with
> W for (wide) Unicode ("CreateFileW"). Not unexpectedly, the vast
> majority of ...A functions are thin wrappers for the Unicode
> function.

Yep. The Unicode functions provide the "real" API; the "ANSI" versions are
just compatibility wrappers. Use of the ANSI versions in new code is
strongly discouraged, as the resulting code will fail when accessing
strings which can't be converted to the application's codepage.

On NT-based versions of Windows (NT, XP, Vista, 7), almost any string
which can be obtained from the OS is allowed to contain arbitrary Unicode
characters: filenames, registry keys (and their values), environment
variables, clipboard contents, etc.
 
