Multibyte VS. Wide

yazan jab

Is it true that

Multibyte characters are: char arrays (which represent a string from
the basic character set). In this case wide characters are the way
of encoding characters from the extended character set.

or

Multibyte characters are: characters from the extended character set
which need more than one byte to encode. And in this case wide
characters are a subset of the multibyte character encoding.

Both ISO/IEC 9899:1999 and the libc info page (the GNU C Library
documentation) are a little bit vague in this area.

I tend to believe the second explanation but want to make sure.

Yazan jaber
 
Dan Pop

yazan jab said:
Is it true that

Multibyte characters are: char arrays (which represent a string from
the basic character set). In this case wide characters are the way
of encoding characters from the extended character set.

or

Multibyte characters are: characters from the extended character set
which need more than one byte to encode. And in this case wide
characters are a subset of the multibyte character encoding.

Neither is true, but the latter is closer to the truth. Your definition
of a multibyte character is correct, but wide characters are not a
subset of the multibyte character encoding. A wide character is a value
of type wchar_t, an integer type wide enough to represent *every*
character of the extended character set as a single fixed-size object.
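
To make the distinction concrete, here is a minimal sketch in C, assuming a
hosted implementation that provides <wchar.h>. The helper name
count_wide_chars is made up for illustration; mbsrtowcs is the standard
function, and passing a null destination asks it only to count the wide
characters that the multibyte string would convert to under the current
locale.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): returns how many wide
 * characters the multibyte string s converts to under the current
 * LC_CTYPE locale, or (size_t)-1 if s holds an invalid sequence. */
size_t count_wide_chars(const char *s)
{
    mbstate_t state;
    const char *src = s;

    memset(&state, 0, sizeof state);          /* initial conversion state */
    return mbsrtowcs(NULL, &src, 0, &state);  /* NULL dst: count only */
}
```

In the default "C" locale every character of the basic set is a single
byte, so count_wide_chars("hello") equals strlen("hello"). After
setlocale(LC_CTYPE, "") in a locale with a multibyte encoding, a string
containing extended characters would typically report fewer wide characters
than bytes.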

Dan
 
Derk Gwen

(e-mail address removed) (yazan jab) wrote:
# Is it true that
#
# Multibyte characters are: char arrays (which represent a string from
# the basic character set). In this case wide characters are the way
# of encoding characters from the extended character set.

For something like Unicode, the character codes range from 0 to 65535 for
the Basic Multilingual Plane, or from 0 to 0x10FFFF (about 1.1 million) for
the full code space. A wide character is an integer large enough to hold the
character code as a fixed-size unit, typically a 16-bit or 32-bit integer
(in C, the type wchar_t). When you use wchars for these codes, and wchar_t
is wide enough for every code, you have the same advantage that you have
with ASCII and char: an n-character string requires exactly n+1 storage
units.

However, there are still many old and useful programs designed only for
char-width characters that cannot cope with wchar characters. Instead of
recoding and recompiling all that software, some clever and not so clever
ways have been invented to represent one large 16- or 32-bit character as a
sequence of one or more 8-bit bytes. UTF-8, for example, represents a
16-bit Unicode character as a multibyte character of 1 to 3 bytes (4 bytes
for codes above 65535). UTF-8 has the additional property that the ASCII
subset of Unicode is encoded with exactly the same bytes as ASCII, and that
a UTF-8 sequence of two or more bytes never includes a byte in the 0-127
range.
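
The byte layout described above can be sketched for UTF-8 specifically.
This is an illustrative encoder, not code from any particular library; the
function name utf8_encode is an assumption. Note that every byte of a
multi-byte sequence has its top bit set, which is why such a sequence never
collides with ASCII.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not encodable
 * (a surrogate, or a value above 0x10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* ASCII: one identical byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx plus three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, 'A' (0x41) comes out as the single byte 0x41, while the euro
sign (0x20AC) comes out as the three bytes E2 82 AC.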

This means that when old ASCII software is given a multibyte encoding like
UTF-8, if it simply passes bytes 128-255 through unchanged, it is upgraded
to Unicode software without coding changes.
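
As a small illustration of that pass-through behaviour, here is a sketch of
the kind of "old ASCII" routine in question. The name ascii_upper is
hypothetical, and the code assumes an ASCII-based execution character set
in which 'a' to 'z' are contiguous.

```c
/* Uppercase the ASCII letters a-z in place. Every other byte, including
 * bytes 128-255, is left exactly as it was, so UTF-8 multibyte sequences
 * in s survive unchanged. */
void ascii_upper(char *s)
{
    for (; *s != '\0'; ++s) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - 'a' + 'A');
    }
}
```

Given the UTF-8 bytes for "café" (63 61 66 C3 A9), the routine changes only
the first three bytes and leaves C3 A9 alone. It does not uppercase the é,
but it does not corrupt it either.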

The disadvantage of multibyte characters is that an n-character Unicode
string can take anywhere from n+1 through 3n+1 char storage units (4n+1
once codes above 65535 are included); you won't know without examining the
actual characters.
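
A short sketch of what "examining the actual characters" means for UTF-8:
the character count is found by walking the bytes and skipping continuation
bytes, which always have the bit pattern 10xxxxxx. The name utf8_length is
an assumption, and the function trusts its input to be valid UTF-8.

```c
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
 * Only bytes that start a character are counted; continuation bytes
 * (10xxxxxx) are skipped. No validation is attempted. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the five bytes 63 61 66 C3 A9 ("café"), strlen reports 5 while
utf8_length reports 4, so the storage needed is 6 chars for a 4-character
string.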
 
Michael B Allen

Is it true that

Multibyte characters are: char arrays (which represent a string from
the basic character set). In this case wide characters are the way of
encoding characters from the extended character set.

or

Multibyte characters are: characters from the extended character set
which need more than one byte to encode. And in this case wide
characters are a subset of the multibyte character encoding.

It's important to distinguish between characters (or charsets) and
character encodings. They are two different things. A charset is a map
that defines which numeric value represents a particular character. A
character encoding defines how those numeric values are serialized into a
stream of bytes. For example, Unicode can be encoded as UTF-8, which is
space efficient and byte-compatible with ASCII (the first 256 Unicode
values also match ISO-8859-1, although values 128-255 take two bytes in
UTF-8). Or it could be encoded as UCS-4LE, which is not space efficient
but can make heavy text processing easier.
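
To illustrate the charset/encoding split, the sketch below serializes one
and the same numeric value as UCS-4 in both byte orders. The charset fixes
the number; the encoding alone decides which bytes end up in the stream.
The function names ucs4le_put and ucs4be_put are made up for this example.

```c
#include <stdint.h>

/* Serialize one code point as UCS-4 little-endian: four bytes, least
 * significant byte first. */
void ucs4le_put(uint32_t cp, unsigned char *out)
{
    out[0] = (unsigned char)(cp & 0xFF);
    out[1] = (unsigned char)((cp >> 8) & 0xFF);
    out[2] = (unsigned char)((cp >> 16) & 0xFF);
    out[3] = (unsigned char)((cp >> 24) & 0xFF);
}

/* Serialize the same code point as UCS-4 big-endian: four bytes, most
 * significant byte first. */
void ucs4be_put(uint32_t cp, unsigned char *out)
{
    out[0] = (unsigned char)((cp >> 24) & 0xFF);
    out[1] = (unsigned char)((cp >> 16) & 0xFF);
    out[2] = (unsigned char)((cp >> 8) & 0xFF);
    out[3] = (unsigned char)(cp & 0xFF);
}
```

The euro sign is always the value 0x20AC in Unicode, yet it is written as
AC 20 00 00 in UCS-4LE, 00 00 20 AC in UCS-4BE, and E2 82 AC in UTF-8.
Every character costs four bytes in UCS-4, but the i-th character of a
buffer sits at byte offset 4*i, which is what makes processing simpler.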

Here's a nice link about programming with extended charsets although it
is a little UTF-8/*nix centric:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Mike
 
