Multibyte VS. Wide

yazan jab · Nov 6, 2003

Is it true that

Multibyte characters are: char arrays (which represent a string from
the basic character set). In this case wide characters are the way
for encoding characters from the extended character set.

or

Multibyte characters are: characters from the extended character set
which need more than one byte to encode. And in this case wide
characters are a subset of the multibyte character encoding.

Both ISO/IEC 9899:1999 and the libc info page (the GNU C library
documentation) are a little bit vague in this area.

I tend to believe the second explanation but want to make sure.

Yazan jaber

Dan Pop · Nov 6, 2003

yazan jab said:
# Is it true that
#
# Multibyte characters are: char arrays (which represent a string from
# the basic character set). In this case wide characters are the way
# for encoding characters from the extended character set.
#
# or
#
# Multibyte characters are: characters from the extended character set
# which need more than one byte to encode. And in this case wide
# characters are a subset of the multibyte character encoding.

Neither is true, but the latter is closer to the truth. The definition
of the multibyte character is correct, but wide characters are not a
subset of the multibyte character encoding. They are wide enough to
represent *every* character from the extended character set.
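
The distinction above can be sketched in C: a multibyte string is a char array in which one character may occupy several bytes, while a wide string holds exactly one wchar_t per character. This is only an illustration using the standard mbsrtowcs function; the helper name mb_to_wide is hypothetical and not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (a char array in the
 * current locale's encoding) into wide characters, one wchar_t per
 * character. Returns the number of wide characters stored, not counting
 * the terminator, or (size_t)-1 if the input is not valid in this locale. */
size_t mb_to_wide(wchar_t *dst, size_t dst_len, const char *mb)
{
    mbstate_t state;
    const char *src = mb;
    size_t n;

    assert(dst != NULL && dst_len > 0 && mb != NULL);
    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    /* Convert at most dst_len - 1 characters so the terminator always fits. */
    n = mbsrtowcs(dst, &src, dst_len - 1, &state);
    if (n == (size_t)-1)
        return n;                      /* invalid multibyte sequence */
    dst[n] = L'\0';
    return n;
}
```

In the default "C" locale only the basic character set is sure to convert. After a call such as setlocale(LC_CTYPE, "") on a system whose locale uses a multibyte encoding, the same function would be expected to collapse each multi-byte sequence into a single wchar_t.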

Dan

Derk Gwen · Nov 7, 2003

(e-mail address removed) (yazan jab) wrote:
# Is it true that
#
# Multibyte characters are: char arrays (which represent a string from
# the basic character set). In this case wide characters are the way
# for encoding characters from the extended character set.

For something like Unicode, the character codes range from 0 to 65535 for the
16-bit Basic Multilingual Plane (or up to 0x10FFFF to cover the full code
space, including the rarer ideographs, as single characters). A wide character
is an integer large enough to hold the character code as a fixed-size
unit, either a 16-bit or a 32-bit integer (typically a short or a long). When you
use wchar_t for these codes, you have the same advantage that you have with
ASCII and char: an n-character string requires exactly n+1 storage
units.
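
A minimal sketch of the n+1 rule for wide strings, assuming every character fits in one wchar_t on the platform at hand. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Storage units (wchar_t elements) needed for a wide string:
 * one per character plus one for the terminating L'\0'. */
size_t wide_storage_units(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws) + 1;
}

/* The same quantity in bytes; sizeof(wchar_t) varies by platform. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    return wide_storage_units(ws) * sizeof(wchar_t);
}
```

The unit count depends only on the number of characters, not on which characters they are, so L"abc" and a three-letter word with accented characters both need four units.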

However, there are still many old and useful programs designed only for
char-width characters that would not be able to cope with wchar_t characters. Instead
of recoding and recompiling all that software, some clever and not so clever
ways have been invented to represent one large 16- or 32-bit character as a
sequence of one or more 8-bit units. UTF-8, for example, represents
16-bit Unicode as multibyte characters of 1 to 3 bytes each. UTF-8 has the additional
property that the ASCII subset of Unicode is encoded with exactly the same
bytes as in ASCII, and that a UTF-8 character of two or more bytes does not
include any bytes in the 0-127 range.
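
As a rough sketch of the 1-to-3-byte scheme described above, the following encodes one 16-bit code point as UTF-8. The function name utf8_encode_bmp is hypothetical, and code points above 0xFFFF (which need a fourth byte) are deliberately rejected to keep the example within the 16-bit case.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one 16-bit Unicode code point as UTF-8.
 * Returns the number of bytes written (1 to 3), or 0 if cp is a
 * surrogate or lies outside the 16-bit range handled here. */
size_t utf8_encode_bmp(unsigned long cp, unsigned char out[3])
{
    assert(out != NULL);

    if (cp > 0xFFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;

    if (cp < 0x80UL) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Every byte of a two- or three-byte result has its top bit set, which is why such a sequence never contains a byte in the 0-127 range.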

This means when old ASCII software is given a multibyte encoding like UTF, if
it simply passes through bytes 128-255 unchanged, it is upgraded without coding
changes to being new Unicode software as well.
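
To illustrate the pass-through idea, here is a sketch of an "old ASCII" routine that upper-cases only the bytes it understands and leaves bytes 128-255 alone. The name ascii_upcase is invented for this example, and the code assumes an ASCII-based execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Upper-case the ASCII letters in s in place and return how many bytes
 * were changed. Bytes 128-255 never match 'a'..'z', so any UTF-8
 * multibyte sequences in the string pass through untouched. */
size_t ascii_upcase(char *s)
{
    size_t changed = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned char b = (unsigned char)*s;
        if (b >= 'a' && b <= 'z') {
            *s = (char)(b - 'a' + 'A');
            ++changed;
        }
    }
    return changed;
}
```

Given the UTF-8 bytes for "café", the routine changes the first three letters and leaves the two-byte sequence 0xC3 0xA9 intact, so the output is still valid UTF-8 (though the accented letter itself is not upper-cased).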

The disadvantage of multibyte characters is that an n-character Unicode string
can take anywhere from n+1 through 3n+1 char storage units; you won't know
without examining the actual characters.
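
The point about having to examine the characters can be sketched as follows, assuming the input is valid UTF-8. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count characters in a UTF-8 string by counting every byte that is
 * not a continuation byte (continuation bytes look like 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* char storage units actually used, including the terminating '\0'. */
size_t utf8_storage_units(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

For three ASCII letters the storage is n+1 = 4 units; for three euro signs (3 bytes each) it is 3n+1 = 10 units. The character count is the same in both cases, and only a scan of the bytes tells them apart.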

Michael B Allen · Nov 8, 2003

# Is it true that
#
# Multibyte characters are: char arrays (which represent a string from
# the basic character set). In this case wide characters are the way for
# encoding characters from the extended character set.
#
# or
#
# Multibyte characters are: characters from the extended character set
# which need more than one byte to encode. And in this case wide
# characters are a subset of the multibyte character encoding.

It's important to distinguish between characters (or charsets) and
character encodings. They are two different things. A charset is a map
that defines which numeric value represents a particular glyph. A
character encoding defines how numeric values are serialized into a
stream of bytes. For example, Unicode can be encoded as UTF-8, which
is space efficient and byte-compatible with ASCII (the first 256 Unicode
values also match the ISO-8859-1 charset, although UTF-8 encodes values
128-255 as two bytes each). Or it could be encoded as UCS4-LE, which is not
space efficient but can make heavy text processing easier.

Here's a nice link about programming with extended charsets although it
is a little UTF-8/*nix centric:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Mike
