Multibyte VS. Wide

Discussion in 'C Programming' started by yazan jab, Nov 6, 2003.

  1. yazan jab

    yazan jab Guest

    Is it true that

    Multibyte characters are: char arrays (which represent a string from
    the basic character set). In this case wide characters are the way
    of encoding characters from the extended character set.

    or

    Multibyte characters are: characters from the extended character set
    which need more than one byte to encode. And in this case wide
    characters are a subset of the multibyte character encoding.

    Both ISO/IEC 9899:1999 and the libc info page (the GNU C Library
    documentation) are a little bit vague in this area.

    I tend to believe the second explanation but want to make sure.

    Yazan jaber
    yazan jab, Nov 6, 2003
    #1

  2. Dan Pop

    Dan Pop Guest

    yazan jab writes:

    >Is it true that
    >
    >Multibyte characters are: char arrays (which represent a string from
    >the basic character set). In this case wide characters are the way
    >of encoding characters from the extended character set.
    >
    >or
    >
    >Multibyte characters are: characters from the extended character set
    >which need more than one byte to encode. And in this case wide
    >characters are a subset of the multibyte character encoding.


    Neither is true, but the latter is closer to the truth. The definition
    of the multibyte character is correct, but wide characters are not a
    subset of the multibyte character encoding. They are wide enough to
    represent *every* character from the extended character set.
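
    To make the distinction concrete, here is a minimal sketch using the
    standard mbsrtowcs function. The helper names mb_char_count and
    mb_to_wide are made up for illustration. The multibyte encoding is
    whatever the current locale says it is, so a real program would normally
    call setlocale(LC_ALL, "") first; in the default "C" locale only the
    basic characters are guaranteed to convert.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in a multibyte string as
 * interpreted by the current locale's encoding.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t mb_char_count(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;

    assert(mbs != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */

    /* With a null destination, mbsrtowcs stores nothing and only reports
     * how many wide characters the conversion would produce. */
    return mbsrtowcs(NULL, &src, 0, &state);
}

/* Hypothetical helper: convert a multibyte string into a wchar_t buffer
 * of cap elements, always null-terminating the result (truncating if
 * needed). Returns the number of wide characters stored, not counting
 * the terminator, or (size_t)-1 on an invalid sequence. */
size_t mb_to_wide(wchar_t *dst, size_t cap, const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;
    size_t n;

    assert(dst != NULL && mbs != NULL && cap > 0);
    memset(&state, 0, sizeof state);

    n = mbsrtowcs(dst, &src, cap - 1, &state);
    if (n == (size_t)-1)
        return n;
    dst[n] = L'\0';                     /* one wchar_t per character */
    return n;
}
```

    Each element of the resulting wchar_t array holds exactly one character,
    while the char array it came from may have used several bytes for some
    of them.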

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
    Dan Pop, Nov 6, 2003
    #2

  3. Derk Gwen

    Derk Gwen Guest

    (yazan jab) wrote:
    # Is it true that
    #
    # Multibyte characters are: char arrays (which represent a string from
    # the basic character set). In this case wide characters are the way
    # of encoding characters from the extended character set.

    For something like Unicode, the character codes range from 0 to 65535 for the
    original 16-bit repertoire (or from 0 to 0x10FFFF, which needs a 32-bit unit,
    to cover every character). A wide character is an integer large enough to hold
    the character code as a fixed-size unit, typically a 16- or 32-bit integer
    (the exact type behind wchar_t is implementation-defined). When you use wchar_t
    for these codes, you have the same advantage that you have with ASCII and char:
    an n-character string requires exactly n+1 storage units.
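
    The n+1 rule can be checked with a small sketch. The helper names below
    are made up for illustration, and the byte count depends on
    sizeof(wchar_t), which differs between implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of wchar_t storage units a wide string
 * occupies, including the terminating null wide character. */
size_t wide_storage_units(const wchar_t *ws)
{
    assert(ws != NULL);
    return wcslen(ws) + 1;
}

/* The same figure in bytes; the size of one unit is
 * implementation-defined (commonly 2 or 4). */
size_t wide_storage_bytes(const wchar_t *ws)
{
    return wide_storage_units(ws) * sizeof(wchar_t);
}
```

    One caveat: where wchar_t is only 16 bits wide, a character above 65535
    does not fit in a single unit, so the rule only holds for the 16-bit
    range there.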

    However, there are still many old and useful programs designed only for
    char-width characters that cannot cope with wchar_t characters. Instead of
    recoding and recompiling all that software, some clever and not so clever
    ways have been invented to represent one large 16- or 32-bit character as a
    sequence of one or more 8-bit bytes. UTF-8, for example, represents a 16-bit
    Unicode character as a multibyte character of 1 to 3 bytes (4 bytes for codes
    above 65535). UTF-8 has the additional property that the ASCII subset of
    Unicode is encoded with exactly the same bytes as ASCII itself, and that a
    multi-byte UTF-8 sequence never contains a byte in the 0-127 range.
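
    As a rough sketch of how that mapping works, here is a UTF-8 encoder for
    a single character code. The function name utf8_encode is made up for
    illustration; the bit layout is the standard UTF-8 one, including the
    4-byte form for codes above 65535.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                    /* ASCII: the byte is the code */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    Every byte of a multi-byte sequence has its top bit set, which is why
    none of them can be mistaken for an ASCII character.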

    This means that when old ASCII software is given a multibyte encoding like
    UTF-8, if it simply passes bytes 128-255 through unchanged, it can handle
    Unicode text as well, without any coding changes.
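
    A small sketch of such "old ASCII software": a byte-oriented uppercase
    routine that only touches a-z. The name ascii_upper is made up for
    illustration, and the code assumes an ASCII-compatible execution
    character set where the lowercase letters are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: uppercase the ASCII letters of a string in place.
 * Every other byte, including 128-255, is passed through unchanged,
 * so multi-byte UTF-8 sequences survive intact.
 * Assumes an ASCII-compatible character set. */
void ascii_upper(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (*s >= 'a' && *s <= 'z')
            *s = (char)(*s - 'a' + 'A');
    }
}
```

    Given the UTF-8 bytes for "naive" with a diaeresis on the i (6E 61 C3 AF
    76 65), only the four ASCII letters change; the C3 AF pair comes out
    exactly as it went in.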

    The disadvantage of multibyte characters is that an n-character Unicode string
    can take anywhere from n+1 through 3n+1 char storage units (4n+1 if codes
    above 65535 occur); you can't know without examining the actual characters.
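
    To illustrate, a minimal sketch that counts the characters in a UTF-8
    string by skipping continuation bytes (those of the form 10xxxxxx). The
    helper names are made up for illustration, and the input is assumed to
    be valid UTF-8; no validation is done.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters in a UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: char storage units the string occupies,
 * including the terminating null byte. */
size_t utf8_storage_units(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}
```

    For the three characters a, U+00E9 and U+20AC the string occupies 7 char
    units, somewhere between n+1 = 4 and 3n+1 = 10, and the only way to find
    the character count is to walk the bytes.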

    --
    Derk Gwen http://derkgwen.250free.com/html/index.html
    Where do you get those wonderful toys?
    Derk Gwen, Nov 7, 2003
    #3
  4. On Thu, 06 Nov 2003 11:55:13 -0500, yazan jab wrote:

    > Is it true that
    >
    > Multibyte characters are: char arrays (which represent a string from
    > the basic character set). In this case wide characters are the way of
    > encoding characters from the extended character set.
    >
    > or
    >
    > Multibyte characters are: characters from the extended character set
    > which need more than one byte to encode. And in this case wide


    It's important to distinguish between character sets (charsets) and
    character encodings. They are two different things. A charset is a map
    that defines which numeric value represents a particular character. A
    character encoding defines how those numeric values are serialized into a
    stream of bytes. For example, Unicode can be encoded as UTF-8, which is
    space efficient for mostly-ASCII text and is byte-compatible with ASCII
    (the first 256 Unicode values also match ISO-8859-1, although UTF-8 encodes
    128-255 as two bytes each). Or it could be encoded as UCS-4LE, which is not
    space efficient but can make heavy text processing easier.
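
    As a sketch of the "same charset, different encoding" point, here is a
    serializer for UCS-4 little-endian. The function name ucs4le_encode is
    made up for illustration. The euro sign, U+20AC, comes out as the four
    bytes AC 20 00 00 here, whereas UTF-8 serializes the same value as
    E2 82 AC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: serialize count code points as UCS-4
 * little-endian, i.e. always four bytes per character with the least
 * significant byte first. out must have room for 4 * count bytes.
 * Returns the number of bytes written. */
size_t ucs4le_encode(const uint32_t *cps, size_t count, unsigned char *out)
{
    size_t i;

    assert(cps != NULL && out != NULL);
    for (i = 0; i < count; ++i) {
        out[4 * i + 0] = (unsigned char)(cps[i] & 0xFF);
        out[4 * i + 1] = (unsigned char)((cps[i] >> 8) & 0xFF);
        out[4 * i + 2] = (unsigned char)((cps[i] >> 16) & 0xFF);
        out[4 * i + 3] = (unsigned char)((cps[i] >> 24) & 0xFF);
    }
    return 4 * count;
}
```

    The fixed width is what makes indexing and other heavy processing
    simple: character i always starts at byte 4*i.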

    Here's a nice link about programming with extended charsets although it
    is a little UTF-8/*nix centric:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

    Mike
    Michael B Allen, Nov 8, 2003
    #4