RFC: the state of charset support in C

Joshua Haberman

I've spent the last two days delving into the state of charset support
in C, and I wrote a blog post summarizing my findings.

http://blog.reverberate.org/2007/04/21/transcoding-adventures-with-c/

I'm new to this stuff, and I would very much appreciate gentle
corrections about any mistakes or misconceptions I've made!

I'd also appreciate hearing about the situation on Windows and more
obscure UNIXes. For example, is iconv() available to Windows
programmers?

Thanks,
Josh
 
Jack Klein

I've spent the last two days delving into the state of charset support
in C, and I wrote a blog post summarizing my findings.

http://blog.reverberate.org/2007/04/21/transcoding-adventures-with-c/

I'm new to this stuff, and I would very much appreciate gentle
corrections about any mistakes or misconceptions I've made!

To tell you the truth, the biggest misconception you have made is that
this is topical on comp.lang.c, because it is not.

I'd also appreciate hearing about the situation on Windows and more
obscure UNIXes. For example, is iconv() available to Windows
programmers?

A simple check of the C standard would have told you that it contains
no header named iconv.h or function named iconv(). Since it is a
non-standard (from a C point of view) extension, it is not topical
here, and what platforms might have such an extension, and what that
extension might do, is for platform specific groups.

C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.
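
A minimal sketch of how a program could check these guarantees through <limits.h>. The standard states them as minimums (at least 8 bits per byte, unsigned char covering at least 0 to 255); the helper name `char_limits_ok` is made up for this illustration.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if the minimum limits the C standard
   requires for character types hold on this implementation.
   A conforming implementation should always return 1. */
int char_limits_ok(void)
{
    return CHAR_BIT >= 8        /* a byte has at least 8 bits */
        && UCHAR_MAX >= 255     /* unsigned char covers 0..255 */
        && SCHAR_MAX >= 127     /* signed char reaches at least +127 */
        && SCHAR_MIN <= -127;   /* and at least -127 */
}
```

Anything beyond these minimums, such as which character set the values encode, is left to the implementation.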

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html
 
William Ahern

Joshua Haberman said:
I've spent the last two days delving into the state of charset support
in C, and I wrote a blog post summarizing my findings.

I'm new to this stuff, and I would very much appreciate gentle
corrections about any mistakes or misconceptions I've made!

Charset support is not comprehensive, and frankly broken IMHO given the
apparent intentions.

The wide-character interface is insufficient, because the routines available
pre-suppose qualities of a character set that many character sets are unable
to abide by. In many cases, you cannot make critical determinations (like
"isalpha") given solely a single wchar_t object (regardless of the width). I
suggest you spend some time over at unicode.org understanding the issues
yourself.
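
One case where a single wchar_t is not enough is a combining sequence; this is an assumed example, not necessarily the one the poster had in mind. The sketch below stores "e" followed by U+0301 (combining acute accent), which a reader sees as one character. The names `decomposed_e_acute`, `decomposed_length` and `first_element_is_alpha` are hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* "e" followed by U+0301 COMBINING ACUTE ACCENT: one character to a
   reader, but two wchar_t elements to the program. */
static const wchar_t decomposed_e_acute[] = L"e\u0301";

/* Number of wchar_t elements, not user-perceived characters. */
size_t decomposed_length(void)
{
    return wcslen(decomposed_e_acute);
}

/* Classifies only the first element; the accent that follows is
   never seen. Returns nonzero if that element is alphabetic. */
int first_element_is_alpha(void)
{
    return iswalpha((wint_t)decomposed_e_acute[0]);
}
```

Here `decomposed_length()` reports 2 even though a reader sees one letter, and `iswalpha` has no way to take the second element into account.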

For comprehensive, correct and arguably portable character set manipulation,
I suggest the ICU library. But that's off-topic here. That aside, you can
muddle through using standard C interfaces if you cut back on your
requirements; i.e. increase the level of opacity such that you don't need to
make certain distinctions (like isalpha/iswalpha), and/or employ UTF-8 in
such a way that it works within the confines of the "C" locale. I said
"muddle", but maybe that's an unnecessarily derogatory characterization
since adjusting scope is often the best way to address an issue. I simply
mean to dispossess those who believe standard C really can support
comprehensive I18N text manipulation of that notion.
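
A rough sketch of the "opaque UTF-8" approach described above: the string stays a plain char array, and the only knowledge of the encoding is that continuation bytes look like 10xxxxxx. It assumes the input is already valid UTF-8; the function name `utf8_code_points` is invented for this example.

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated string assumed to be valid
   UTF-8, by skipping continuation bytes (bit pattern 10xxxxxx).
   No locale-dependent functions are involved. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Convert to unsigned char so the mask works whether or not
           plain char is signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the bytes "h\xC3\xA9llo" this counts 5 code points, while strlen() would report 6 bytes. It deliberately makes no attempt at classification such as isalpha.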
 
Kenneth Brody

Jack Klein wrote:
[...]
C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.

<pedant>
Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,
as it doesn't impose a signedness on unadorned "char"?
</pedant>
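
A small sketch of how a program can find out which choice its implementation made for plain char. The helper name `plain_char_is_signed` is hypothetical; CHAR_MIN comes from <limits.h>.

```c
#include <limits.h>

/* Returns 1 if plain char is a signed type on this implementation,
   0 if it is unsigned. The choice is implementation-defined. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```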

--
+-------------------------+--------------------+-----------------------+
| Kenneth J. Brody | www.hvcomputer.com | #include |
| kenbrody/at\spamcop.net | www.fptech.com | <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------+
Don't e-mail me at: <mailto:[email protected]>
 
Richard Heathfield

Kenneth Brody said:
Jack Klein wrote:
[...]
C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.

<pedant>
Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,

ITYM -127

as it doesn't impose a signedness on unadorned "char"?

It guarantees the existence of unsigned char, however.
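
Because unsigned char is always available, byte values can be handled without caring how plain char behaves. A minimal sketch, with the helper names `byte_value` and `byte_is_alpha` made up for illustration:

```c
#include <ctype.h>

/* Converts a plain char to its value as an unsigned char, so the
   result is non-negative whether or not plain char is signed. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* The <ctype.h> functions require an argument representable as
   unsigned char (or EOF), hence the conversion before the call.
   Returns 1 if c is alphabetic in the current locale, else 0. */
int byte_is_alpha(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```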
 
Kenneth Brody

Richard said:
Kenneth Brody said:
Jack Klein wrote:
[...]
C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.

<pedant>
Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,

ITYM -127

So the standard says nothing about what 0x80 means as a signed char?

I suppose that's allowed to be a trap value?

It guarantees the existence of unsigned char, however.

True.

 
Richard Heathfield

Kenneth Brody said:
So the standard says nothing about what 0x80 means as a signed char?

Well, no, not really. Presumably on typical two's complement "default char
is signed" systems it'll evaluate to -128 (and this does seem to be
what happens in practice), on ones' complement it'll be - um - whatever
it is :) - and on sign-and-magnitude it'll be -0. In each case, it is
possible for the implementation to ascribe a meaning to 0x80, but it
needn't necessarily be the same meaning on each platform.
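
To make the three cases concrete, here is a sketch that computes what an 8-bit pattern would mean under each representation. It works arithmetically on an unsigned value, so it does not depend on how the host itself represents signed integers. The function names are hypothetical.

```c
/* Value of an 8-bit pattern under two's complement. */
int as_twos_complement(unsigned bits)
{
    bits &= 0xFFu;
    return (bits & 0x80u) ? (int)bits - 256 : (int)bits;
}

/* Value under ones' complement: a set sign bit means the value is
   the negation of the inverted pattern. */
int as_ones_complement(unsigned bits)
{
    bits &= 0xFFu;
    return (bits & 0x80u) ? -(int)(~bits & 0xFFu) : (int)bits;
}

/* Value under sign-and-magnitude: the low 7 bits are the magnitude.
   Negative zero comes out as plain 0, since int cannot show it. */
int as_sign_magnitude(unsigned bits)
{
    bits &= 0xFFu;
    return (bits & 0x80u) ? -(int)(bits & 0x7Fu) : (int)bits;
}
```

For the pattern 0x80 these give -128, -127 and (negative) zero respectively.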

<snip>
 
Keith Thompson

Kenneth Brody said:
Richard said:
Kenneth Brody said:
Jack Klein wrote:
[...]
C guarantees for 8-bit characters having numeric values in the range
of 0 to 255 inclusive. It allows, but does not require, support for
wider character types. Everything else is implementation-defined.

<pedant>
Doesn't the Standard guarantee 0 through 255, _or_ -128 through 127,

ITYM -127

So the standard says nothing about what 0x80 means as a signed char?

Correct. The standard allows signed integers to be represented in
two's complement, ones' complement, or sign-and-magnitude. (That's
in C99; I think C90 was more vague.)

I suppose that's allowed to be a trap value?

unsigned char is not allowed to have trap values; I *think* the
standard may make a similar statement about signed char. (If so, I'm
sure someone will provide chapter and verse shortly.)

But in general, yes, signed types are allowed to have trap values,
though if 0x8000 isn't simply -32768 it's more likely to be -0.
 
Peter Nilsson

Kenneth Brody said:
So the standard says nothing about what 0x80 means as a signed
char?

No.

I suppose that's allowed to be a trap value?

That's debatable. 6.2.6.1p5 says...

Certain object representations need not represent a value of
the object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. ...

There are two points of view on how to interpret this:

1) This implies that the value of such representation for
signed character types is merely unspecified. [Non-
trapping trap representations!]

2) Since the standard fails to define the behaviour for
character types, they too invoke undefined behaviour.
[Although you have to question why the standard was so
explicit.]

Popular view seems to be that the standard could be better
written in this regard, that point 2 applies, and that it's
therefore not worth 'fixing' since nothing is actually
'broken' under that view.
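
The uncontroversial part of that paragraph is that any object's representation can be read through unsigned char. A minimal sketch of that case only (it does not settle the signed char question above); the names `int_bytes` and `int_from_bytes` are invented for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Copies the object representation of an int into out, one byte at a
   time, reading through an unsigned char lvalue.
   out must have room for sizeof(int) bytes. */
void int_bytes(const int *value, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)value;
    size_t i;
    for (i = 0; i < sizeof *value; i++)
        out[i] = p[i];
}

/* Rebuilds an int from bytes previously produced by int_bytes. */
int int_from_bytes(const unsigned char *in)
{
    int v;
    memcpy(&v, in, sizeof v);
    return v;
}
```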
 
Old Wolf

So the standard says nothing about what 0x80 means as a signed char?

A signed char cannot have the value of 0x80 (assuming SCHAR_MAX
to be 0x7F).

Assigning the value of 0x80 to a signed char would cause
implementation-defined behaviour.

I suppose that's allowed to be a trap value?

There are no trap values. There are only trap representations.

I suppose you mean to ask, what happens when you interpret
the representation 0x80 as signed char? As other posters noted,
it could be a value of some sort or it could be a trap representation.

I don't see any text that describes whether the value is
implementation-defined or merely unspecified; nor any text describing
what happens if you evaluate that value. The section about reading trap
reps is quite clear that it only says the behaviour is undefined if the
type is not a character type.
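
One way to stay clear of the implementation-defined conversion is to check the range before storing. A small sketch; the helper name `fits_in_schar` is hypothetical.

```c
#include <limits.h>

/* Returns 1 if v can be stored in a signed char without an
   out-of-range conversion, 0 otherwise. */
int fits_in_schar(int v)
{
    return v >= SCHAR_MIN && v <= SCHAR_MAX;
}
```

With SCHAR_MAX equal to 0x7F, `fits_in_schar(0x80)` returns 0, so the caller knows the assignment would not preserve the value.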
 
Richard Heathfield

Old Wolf said:
A signed char cannot have the value of 0x80 (assuming SCHAR_MAX
to be 0x7F).

Portably speaking, you're right. But non-portably, implementations may
allow a signed char on an 8-bit-char system to be 0x80. Typical PC
implementations set SCHAR_MIN to -128. (For example, Turbo C, Borland C
(IIRC), Microsoft C, gcc...)
 
Keith Thompson

Richard Heathfield said:
Old Wolf said:

Portably speaking, you're right. But non-portably, implementations may
allow a signed char on an 8-bit-char system to be 0x80. Typical PC
implementations set SCHAR_MIN to -128. (For example, Turbo C, Borland C
(IIRC), Microsoft C, gcc...)

If you assume that "0x80" refers to a bit pattern (a representation),
that's true. But in C, 0x80 is an integer literal with the value
+128. If SCHAR_MAX is 0x7F (+127), then a signed char cannot have the
value of 0x80 (+128).
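
The distinction can be checked directly: the constant 0x80 has type int and value +128 regardless of how it is spelled. A minimal sketch, with the function name `literal_is_plus_128` made up for this example:

```c
/* 0x80 is an int constant with the value +128; writing it in hex
   says nothing about a bit pattern in any narrower type.
   Returns 1 if all three observations hold. */
int literal_is_plus_128(void)
{
    return 0x80 == 128
        && 0x80 > 0
        && sizeof(0x80) == sizeof(int);
}
```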
 
