Multi-byte chars

  • Thread starter Bill Cunningham

Bill Cunningham

I've been reading the C standard online and I'm puzzled as to what multibyte
chars are. Wide chars I believe would be characters for languages such as
Cantonese or Japanese. I know the ASCII character set specifies that each
character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
character?
Also how would you use the function parameter main (char argc, char
**argv) if that's correct?

Bill
 
Richard Heathfield

Bill said:
I've been reading the C standard online and I'm puzzled as to what
multibyte chars are.

A multibyte character is a "sequence of one or more bytes representing a
member of the extended character set of either the source or the execution
environment", if I have the quote from 3.7.2 right.
Wide chars I believe would be characters for
languages such as Cantonese or Japanese.

C isn't as specific as that. See 3.7.3.
I know the ASCII character set
specifies that each character such as 'b' or 'B' is an 8 bit character.

7 bits, not 8. ASCII is a 7-bit code.

<snip>
 
lawrence.jones

Bill Cunningham said:
I've been reading the C standard online and I'm puzzled as to what multibyte
chars are. Wide chars I believe would be characters for languages such as
Cantonese or Japanese. I know the ASCII character set specifies that each
character such as 'b' or 'B' is an 8 bit character. So what's a multibyte
character?

A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value. Other characters are encoded as multiple bytes,
each of which has the top bit set; the first byte is in the range \xc0
to \xfd and indicates the number of bytes that follow, and subsequent
bytes are in the range \x80 to \xbf. As originally defined, UTF-8 encoded
characters can be any length between one and six bytes (RFC 3629 later
limited this to four). So 'A' is encoded as \x41 but '©'
(the copyright sign) is encoded as \xc2\xa9.

Multibyte encodings can be very space efficient, but they are difficult
to process since different characters have different lengths. Wide
characters, on the other hand, are intended to be efficient for
processing, but not necessarily space efficient. Wide characters are
integers that are large enough so that every logical character can be
represented in just one wide character.

-Larry Jones

If I get a bad grade, it'll be YOUR fault for not doing the work for me!
-- Calvin
 
Jun Woong

A single logical character that requires more than one byte to express.
For example, consider the UTF-8 encoding format for ISO 10646: normal
ASCII characters (between \x00 and \x7f) are encoded as a single byte
with the same value.

My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set. Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where Unicode code values are used as those
of the extended set?
 
Dan Pop

My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set.

Non sequitur. The fact that A belongs to the basic character set has
no relevance on the value of L'A', AFAICT. All the standard has to say
on the issue is:

11 A wide character constant has type wchar_t, an integer type
defined in the <stddef.h> header. The value of a wide character
constant containing a single multibyte character that maps to
a member of the extended execution character set is the wide
character corresponding to that multibyte character, as defined
by the mbtowc function, with an implementation-defined current
locale.
Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where Unicode code values are used as those
of the extended set?

Nope, he was merely describing what happens on an implementation using
ASCII for normal characters and UCS for wide characters (therefore UTF-8
for multi-byte characters).

There is nothing preventing an implementation from using EBCDIC for
normal characters and UCS for wide characters, in which case it is foolish
to expect 'A' == L'A'.

Furthermore, there is nothing preventing an implementation from using
ASCII for normal characters and EBCDIC for wide characters (or vice
versa). The fact that C99 supports UCNs in source code means nothing WRT
the execution character set (whose extended version need not contain any
additional characters).

Dan
 
Jun Woong

Dan Pop said:
Non sequitur. The fact that A belongs to the basic character set has
no relevance on the value of L'A', AFAICT. All the standard has to say
on the issue is:

11 A wide character constant has type wchar_t, an integer type
defined in the <stddef.h> header. The value of a wide character
constant containing a single multibyte character that maps to
a member of the extended execution character set is the wide
character corresponding to that multibyte character, as defined
by the mbtowc function, with an implementation-defined current
locale.

And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.
 
lawrence.jones

Jun Woong said:
My understanding is that the standard requires 'A' == L'A' by the fact
that the basic character set must be a subset of the extended
character set. Do this and what you mentioned above mean that a
character set whose code values differ from ASCII's can't be the basic
set on an implementation where Unicode code values are used as those
of the extended set?

Yes, but. That requirement is a hold-over from the very earliest days of
extended character set support, before there were functions to convert
between wide and narrow characters. Now that those functions exist,
there is no longer any reason for the requirement, and the committee has
voted to remove it. See the committee's response to DR #279:

<http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/dr_279.htm>

-Larry Jones

Somebody's always running my life. I never get to do what I want to do.
-- Calvin
 
Dan Pop

And in 7.17p2:

wchar_t

which is an integer type whose range of values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales; the null character
shall have the code value zero and each member of the basic
character set shall have a code value equal to its value when used
as the lone character in an integer character constant.

This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.

Dan
 
Jun Woong

Dan Pop said:
This requirement, carried on from C89, is simply broken: implementations
that don't use ASCII for normal characters wouldn't be able to use *any*
of the ASCII extensions (UCS, most importantly) for wide characters.

Then the proper answer to my previous question would have been a mention
of the DR in progress, not a citation of irrelevant wording.
 
Jun Woong

Yes, but. That requirement is a hold-over from the very earliest days of
extended character set support, before there were functions to convert
between wide and narrow characters. Now that those functions exist,
there is no longer any reason for the requirement,

Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that we now have a more complete set of functions to
deal with wide and multibyte characters and so don't need the requirement
any more?
 
lawrence.jones

Jun Woong said:
Weren't there some conversion functions between wide and multibyte
characters in C90? Do you mean that the wording in question was
written before the C89 committee decided to put those functions into
the standard, or that we now have a more complete set of functions to
deal with wide and multibyte characters and so don't need the requirement
any more?

There were conversions between wide characters and multibyte *strings*,
but there weren't any conversions dealing with single byte characters
until btowc() and wctob() were added in NA1.

-Larry Jones

Oh yeah? You just wait! -- Calvin
 
Dan Pop

Then the proper answer to my previous question would have been a mention
of the DR in progress, not a citation of irrelevant wording.

I have quoted the *relevant* wording. The library clause has no business
defining the semantics of wide characters, which are a language issue.

Dan
 
Jun Woong

There were conversions between wide characters and multibyte *strings*,
but there weren't any conversions dealing with single byte characters
until btowc() and wctob() were added in NA1.

Oh, now I see your point, thank you. I was thinking from the viewpoint
of an implementer, who has full access to the internal state for the
conversion.
 
Jun Woong

Dan Pop said:
I have quoted the *relevant* wording. The library clause has no business
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
defining the semantics of wide characters, which are a language issue.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sorry, but this makes me feel that it's not worth discussing this
problem with you any more. Some implementations of the standard
library depended on '%' == L'%', as required by C90,
and it was a reliable choice in practice *at that time*.
 
Dan Pop

Sorry, but this makes me feel that it's not worth discussing this
problem with you any more.

As I've already told you, you're always welcome to ignore my posts.
The text you've underlined makes perfect sense to me (otherwise I
wouldn't have written it in the first place).
Some implementations of the standard
library depended on '%' == L'%', as required by C90,
and it was a reliable choice in practice *at that time*.

The implementor can depend on *anything* he wants, because he has full
control over the implementation, he doesn't need any guarantees from the
standard about the relationship between normal characters and wide
characters because he knows *exactly* what this relationship is on that
particular implementation.

I thought this was obvious to you...

Dan
 
Dan Pop

Dan Pop said:
I have quoted the *relevant* wording. The library clause has no business
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
defining the semantics of wide characters, which are a language issue.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[...]

The text you've underlined makes perfect sense to me (otherwise I
wouldn't have written it in the first place).

According to your logic, the following program is not strictly
conforming even in C90, which is perfectly incorrect though. Is this
what you are saying?

Don't invoke my logic, since you're obviously unable to understand it.

    #include <stdio.h>

    int main(void)
    {
        if ('a' == L'a') puts("okay");

        return 0;
    }

Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation? Remove the broken text from the library
clause and C90 becomes more sensible. Ditto about C99, which contains
the same text.
The story changes if the implementer wants to make as many parts of
his library conform to the standard as possible.

The standard contains no requirement that the standard library is
implemented in C in the first place. A library implementation conforms
to the standard if it follows the standard specification for the library,
no matter in what language it is written or how portable or non-portable
its code is. Ideally, all the parts of the library should conform to the
library specification, not only "as many parts as possible" ;-)

Assuming that you're talking about implementing the library in portable
C (which is definitely NOT what you wrote above), I fail to see how the
assumption 'a' == L'a' can make the code more portable.

Dan
 
lawrence.jones

Dan Pop said:
Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation?

I wouldn't call it broken, just overly restrictive. Until very
recently, no one with an EBCDIC implementation wanted the wchar_t
encoding to be anything other than IBM's DBCS (Double Byte Character
Set), which has the same relation to EBCDIC that Unicode/ISO 10646 has
to ASCII.

-Larry Jones

He doesn't complain, but his self-righteousness sure gets on my nerves.
-- Calvin
 
Jun Woong

Dan Pop said:
Don't invoke my logic, since you're obviously unable to understand it.

Sorry, your logic is too foolish for me to understand.
Nope, what I'm saying is that C90 is broken by making this program
strictly conforming: what are the choices for wide characters of an
EBCDIC-based implementation? Remove the broken text from the library
clause and C90 becomes more sensible.

This is purely your personal opinion, which is completely different
from what the text of C90 actually says; please don't force others
to follow your poor opinion as you did in the "return; in main()" discussion.

I've never thought that it was broken; considering that we didn't have
enough support for multibyte and wide characters in C90, it was rather
just very restrictive. The only problem I can see about this is that the
committee should have removed it when drafting C99, since we already
had lots of support for the characters then.

[...]
The standard contains no requirement that the standard library is
implemented in C in the first place. A library implementation conforms
to the standard if it follows the standard specification for the library,
no matter in what language it is written or how portable or non-portable
its code is. Ideally, all the parts of the library should conform to the
library specification, not only "as many parts as possible" ;-)

Sorry for my poor wording.
Assuming that you're talking about implementing the library in portable
C (which is definitely NOT what you wrote above), I fail to see how the
assumption 'a' == L'a' can make the code more portable.

Try to implement one of the printf() family in C90 (excluding NA1).
 
Jun Woong

Dan Pop said:
Rudeness works both ways ;-)

It's fortune that you know it.
Nope, it isn't, because it's my opinion about what C90 says.

Yes, it's just your opinion, not what C90 says, which is what I said.
So what?
I'm not
denying that it says what it says, merely claiming that what it says is
wrong. For reasons I have clearly explained.

I don't think so. It's very restrictive rather than broken at that
time; read Larry's posting on this.
Are you a complete idiot or what? I didn't force anyone to adopt any of
my opinions in any discussion (how could I do that, assuming that I wanted
to?).

You said it's broken. I said it's not broken, just very restrictive.
But what C90 says doesn't change regardless of whatever we think about
it. The standards as they currently stand, C90 and C99, explicitly
guarantee that 'a' == L'a'. What's the problem with this? What
justifies your saying:

The fact that A belongs to the basic character set has
no relevance on the value of L'A'

?

If you meant to say that the wording in the standard should be revised
or will be revised, then you should have done so (as Larry did), not
given me the poor explanation above.
Why wasn't the support enough? And if it wasn't enough, why didn't the
committee add the missing bits, instead of breaking the standard?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since both standards say the same thing, your argument about not enough
support in C90 is completely unsupported. Try something better.

Read the underlined wording.
Convert the format string to wide characters and use only wide character
constants in the implementation of printf. Generate the output as wide
characters and convert them to multibyte characters before actually
outputting them. Where is the portability problem? Which of these
conversions isn't supported by C89?

The thing I can't figure out is how to generate a multibyte format string
in C89, as a string literal. The only solution is to start with a wide
string literal and convert it to a multibyte character string.

The multibyte character sequence given to printf() by the user can have
redundant shift characters, which can make the multibyte characters
regenerated from the wide characters differ from the original. The
guarantee that '%' == L'%' can make it easy to write code that scans
for the conversion specifier in the multibyte character sequence,
despite the lack of support for conversion between characters; of
course, there was a more complicated way to do it that did not depend
on that fact.
 
Dan Pop

Dan Pop said:
[...]

It's fortune that you know it.

Could you please be a little more careful when writing English text?
Yes, it's just your opinion, not what C90 says, which is what I said.
So what?

I am perfectly entitled to my opinion. Just like anyone else.
I don't think so. It's very restrictive rather than broken at that
time; read Larry's posting on this.

I have: it didn't sound very convincing to someone inclined to use his
own judgement instead of blindly believing everything said by a committee
member.

A standard that prevents mixing, say, EBCDIC (characters) and UCS (wide
characters), for NO good reason, is downright broken in my book. And both
C89 and C99 do that.
You said it's broken. I said it's not broken, just very restrictive.
But what C90 says doesn't change regardless of whatever we think about
it. The standards as they currently stand, C90 and C99, explicitly
guarantee that 'a' == L'a'. What's the problem with this? What
justifies your saying:

The fact that A belongs to the basic character set has
no relevance on the value of L'A'

I have already explained what. And I agree that the standard provides
this guarantee. What's the problem with this? ;-)
Read the book, "The Standard C Library" by PJ Plauger, <locale.h>
section, IIRC.

Quote the relevant paragraphs.
Read the underlined wording.

Does it change the fact that both standards say the same thing? If not,
the underlined text doesn't prove anything at all.
The multibyte character sequence given to printf() by the user can have
redundant shift characters, which can make the multibyte characters
regenerated from the wide characters differ from the original.

Differ in what sense? Are the semantics of the text preserved or not?
The guarantee that
'%' == L'%' can make it easy to write code that scans for the conversion
specifier in the multibyte character sequence,

Nope, it cannot: you cannot process multibyte characters *before*
converting them to wide characters, because the standard does NOT
specify the encoding mechanism. Keep in mind that characters from the
base character set preserve their single byte values *only* in the initial
shift state (whatever that is):

While in the
initial shift state, all single-byte characters retain their usual
interpretation and do not alter the shift state. The interpretation
^^^^^^^^^^^^^^^^^^
for subsequent bytes in the sequence is a function of the current
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
shift state.
^^^^^^^^^^^^
despite the lack of support for
conversion between characters; of course, there was a more complicated
way to do it that did not depend on that fact.

There is no other way, without making assumptions about how mb characters
are encoded (see the quote above). And if you make such assumptions,
your code is no longer portable. There is no easy way to tell whether
a byte you read from the string corresponds to a single byte character
or is a shift state changer or is the first character of a multibyte
character.

Dan
 
