Multibyte string length

Discussion in 'C Programming' started by Zygmunt Krynicki, Oct 9, 2003.

  1. Hello
    I've browsed the FAQ but apparently it lacks any questions concerning wide
    character strings. I'd like to calculate the length of a multibyte string
    without converting the whole string.

    Zygmunt

    PS: The whole multibyte string vs wide character string concept is broken
    IMHO since it allows wchar_t not to be large enough to contain a full
    character (rendering both types virtually the same). What's the point of
    standardizing wide characters if the standard makes portable usage of such
    a mechanism a programming hell? Feel free to disagree.

    PS2: On my implementation wchar_t is 'big enough' so I might overcome the
    problem in some other way but I'd like to see some fully portable approach.
     
    Zygmunt Krynicki, Oct 9, 2003
    #1

  2. Zygmunt Krynicki

    Dan Pop Guest

    In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:

    >I've browsed the FAQ but apparently it lacks any questions concerning wide
    >character strings. I'd like to calculate the length of a multibyte string
    >without converting the whole string.


    Use the mblen function from the standard C library in a loop, until it
    returns 0. The number of mblen calls returning a positive value is the
    number of multibyte characters in that string.

    >PS: The whole multibyte string vs wide character string concept is broken
    >IMHO since it allows wchar_t not to be large enough to contain a full
    >character (rendering both types virtually the same). What's the point of
    >standardizing wide characters if the standard makes portable usage of such
    >a mechanism a programming hell? Feel free to disagree.


    The bit you're missing is that the standard doesn't impose one character
    set or another for wide characters. If the implementor decides to use
    ASCII as the character set for wide characters, wchar_t need not be any
    wider than char. But wchar_t is supposed to be wide enough for the
    character set chosen by the implementor for wide characters.

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 9, 2003
    #2

  3. On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:

    > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
    >
    >>PS: The whole multibyte string vs wide character string concept is broken
    >>IMHO since it allows wchar_t not to be large enough to contain a full
    >>character (rendering both types virtually the same). What's the point of
    >>standardizing wide characters if the standard makes portable usage of such
    >>a mechanism a programming hell? Feel free to disagree.

    >
    > The bit you're missing is that the standard doesn't impose one character
    > set or another for wide characters. If the implementor decides to use
    > ASCII as the character set for wide characters, wchar_t need not be any
    > wider than char. But wchar_t is supposed to be wide enough for the
    > character set chosen by the implementor for wide characters.


    I don't think he's missing that at all. He's simply pointing out that
    the standard makes it pretty much impossible to use wide characters
    portably (unless you only use wide characters with values between 0
    and 127, of course).

    Had the standard mandated, for instance, that wide characters be at
    least 32 bits wide, then each wide character would be wide enough for
    any character set and it would be possible to write portable code
    using wide characters as long as the code had no character set
    dependency.

    The OP also seems to be griping about certain implementations using
    unicode as a character set that have 16 bit wchar_t. Since it is
    impossible to represent every unicode character in 16 bits, wide
    character strings become 'multiwchar_t' encodings (UTF-16), which
    defeats the whole purpose of wide characters and wide character strings

    - Sheldon
     
    Sheldon Simms, Oct 9, 2003
    #3
  4. Zygmunt Krynicki

    NumLockOff Guest

    "Sheldon Simms" <> wrote in message
    news:p...
    > On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
    >
    > > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org>
    > > "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
    > >
    > >>PS: The whole multibyte string vs wide character string concept is broken
    > >>IMHO since it allows wchar_t not to be large enough to contain a full
    > >>character (rendering both types virtually the same). What's the point of
    > >>standardizing wide characters if the standard makes portable usage of such
    > >>a mechanism a programming hell? Feel free to disagree.

    > >
    > > The bit you're missing is that the standard doesn't impose one character
    > > set or another for wide characters. If the implementor decides to use
    > > ASCII as the character set for wide characters, wchar_t need not be any
    > > wider than char. But wchar_t is supposed to be wide enough for the
    > > character set chosen by the implementor for wide characters.

    >
    > I don't think he's missing that at all. He's simply pointing out that
    > the standard makes it pretty much impossible to use wide characters
    > portably (unless you only use wide characters with values between 0
    > and 127, of course).
    >
    > Had the standard mandated, for instance, that wide characters be at
    > least 32 bits wide, then each wide character would be wide enough for
    > any character set and it would be possible to write portable code
    > using wide characters as long as the code had no character set
    > dependency.
    >
    > The OP also seems to be griping about certain implementations using
    > unicode as a character set that have 16 bit wchar_t. Since it is
    > impossible to represent every unicode character in 16 bits, wide
    > character strings become 'multiwchar_t' encodings (UTF-16), which
    > defeats the whole purpose of wide characters and wide character strings
    >
    > - Sheldon
    >

    It is just the evolution of the Unicode standard. Surrogates were added at
    U+D800 to include more Far Eastern characters. It has now become similar to
    an MBCS mess. Could they have originally specified 32-bit characters? Maybe,
    but in the early 1990s, 16-bit characters were considered a major waste and
    opposed. UTF-8 was pretty much invented so that older 8-bit character
    systems could read vanilla English text without code changes. With memory
    and processing costs plummeting, 32 bits now seems to be enough! Who knows
    what will happen once we make the "first contact" :)
     
    NumLockOff, Oct 10, 2003
    #4
  5. On Thu, 09 Oct 2003 23:25:44 -0700, NumLockOff wrote:

    >
    > "Sheldon Simms" <> wrote in message
    > news:p...
    >> On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
    >>
    >> > In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org>
    >> > "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
    >> >
    >> >>PS: The whole multibyte string vs wide character string concept is broken
    >> >>IMHO since it allows wchar_t not to be large enough to contain a full
    >> >>character (rendering both types virtually the same). What's the point of
    >> >>standardizing wide characters if the standard makes portable usage of such
    >> >>a mechanism a programming hell? Feel free to disagree.
    >> >
    >> > The bit you're missing is that the standard doesn't impose one character
    >> > set or another for wide characters. If the implementor decides to use
    >> > ASCII as the character set for wide characters, wchar_t need not be any
    >> > wider than char. But wchar_t is supposed to be wide enough for the
    >> > character set chosen by the implementor for wide characters.

    >>
    >> I don't think he's missing that at all. He's simply pointing out that
    >> the standard makes it pretty much impossible to use wide characters
    >> portably (unless you only use wide characters with values between 0
    >> and 127, of course).
    >>
    >> Had the standard mandated, for instance, that wide characters be at
    >> least 32 bits wide, then each wide character would be wide enough for
    >> any character set and it would be possible to write portable code
    >> using wide characters as long as the code had no character set
    >> dependency.
    >>
    >> The OP also seems to be griping about certain implementations using
    >> unicode as a character set that have 16 bit wchar_t. Since it is
    >> impossible to represent every unicode character in 16 bits, wide
    >> character strings become 'multiwchar_t' encodings (UTF-16), which
    >> defeats the whole purpose of wide characters and wide character strings
    >>
    >> - Sheldon
    >>

    > It is just the evolution of the Unicode standard. Surrogates were added at
    > U+D800 to include more Far Eastern characters. It has now become similar to
    > an MBCS mess.


    Unicode is not the problem. 16 bit wchar_t is the problem.
     
    Sheldon Simms, Oct 10, 2003
    #5
  6. Zygmunt Krynicki

    Dan Pop Guest

    In <> Sheldon Simms <> writes:

    >On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
    >
    >> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
    >>
    >>>PS: The whole multibyte string vs wide character string concept is broken
    >>>IMHO since it allows wchar_t not to be large enough to contain a full
    >>>character (rendering both types virtually the same). What's the point of
    >>>standardizing wide characters if the standard makes portable usage of such
    >>>a mechanism a programming hell? Feel free to disagree.

    >>
    >> The bit you're missing is that the standard doesn't impose one character
    >> set or another for wide characters. If the implementor decides to use
    >> ASCII as the character set for wide characters, wchar_t need not be any
    >> wider than char. But wchar_t is supposed to be wide enough for the
    >> character set chosen by the implementor for wide characters.

    >
    >I don't think he's missing that at all. He's simply pointing out that
    >the standard makes it pretty much impossible to use wide characters
    >portably (unless you only use wide characters with values between 0
    >and 127, of course).
    >
    >Had the standard mandated, for instance, that wide characters be at
    >least 32 bits wide, then each wide character would be wide enough for
    >any character set and it would be possible to write portable code
    >using wide characters as long as the code had no character set
    >dependency.


    Nope, it wouldn't, as long as the standard doesn't specify a certain
    character set for the wide characters. Imagine that you need to output
    the character e with an acute accent. How do you do that *portably*, if
    you have the additional guarantee that wchar_t is at least 32-bit wide?

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 10, 2003
    #6
  7. On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

    > In <> Sheldon Simms <> writes:
    >
    >>On Thu, 09 Oct 2003 15:08:51 +0000, Dan Pop wrote:
    >>
    >>> In <pan.2003.10.09.12.50.01.320068@_CUT_2zyga.MEdyndns._OUT_org> "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org> writes:
    >>>
    >>>>PS: The whole multibyte string vs wide character string concept is broken
    >>>>IMHO since it allows wchar_t not to be large enough to contain a full
    >>>>character (rendering both types virtually the same). What's the point of
    >>>>standardizing wide characters if the standard makes portable usage of such
    >>>>a mechanism a programming hell? Feel free to disagree.
    >>>

    >>Had the standard mandated, for instance, that wide characters be at
    >>least 32 bits wide, then each wide character would be wide enough for
    >>any character set and it would be possible to write portable code
    >>using wide characters as long as the code had no character set
    >>dependency.

    >
    > Nope, it wouldn't, as long as the standard doesn't specify a certain
    > character set for the wide characters. Imagine that you need to output
    > the character e with an acute accent. How do you do that *portably*, if
    > you have the additional guarantee that wchar_t is at least 32-bit wide?


    I never meant to say that sort of thing could be done portably.

    I was going on the assumption that the OP's assertion "it allows wchar_t
    not to be large enough to contain a full character" was true, and thinking
    about two implementations using the same execution character set where
    one implementation used a wchar_t that was too small for the character
    set.

    It seems to me now, however, that an implementation in which wchar_t is
    not "large enough to contain a full character" would be non-conforming,
    since 7.17.2 states:

    wchar_t which is an integer type whose range of values can represent
    distinct codes for all members of the largest extended character set
    specified among the supported locales;

    In any case, my statement was based on the assumption of multiple
    implementations using a common (but arbitrary) character set, and that
    is an unportable assumption by itself, so I retract my assertion.

    -Sheldon
     
    Sheldon Simms, Oct 10, 2003
    #7
  8. On Fri, 10 Oct 2003 11:49:19 +0000, Dan Pop wrote:

    > Nope, it wouldn't, as long as the standard doesn't specify a certain
    > character set for the wide characters. Imagine that you need to output
    > the character e with an acute accent. How do you do that *portably*, if
    > you have the additional guarantee that wchar_t is at least 32-bit wide?
    >
    > Dan


    To clarify:

    Not my problem really, and not a real one either, as any specific program
    most probably knows its output encoding. However, imagine I wish to write
    portable code for wide character regular expressions. Now the whole purpose
    of wide characters is obvious: to be able to address all sorts of
    characters and encodings, not just plain ASCII, in a portable way.

    Not to mention that the INTERNAL encoding used inside
    program routines is commonly different from the EXTERNAL encoding used to
    store/transfer text.

    Now we know that many external encodings use multibyte sequences for
    various reasons which are not important here. We also know how inefficient
    or uncomfortable it is to develop algorithms for multibyte-sequence
    character strings. It is much easier to assume that any single character
    can fit into some data type. Whether it's wchar_t or foo_t is not
    important.

    Now if wchar_t is not forced to be able to contain a full character, then
    again we are stuck with our multibyte (multi-some-unit) character
    sequences with all of their inconveniences. This IMHO defeats the whole
    purpose of wchar_t.

    Of course it is not clear which character encoding is the best one (or
    rather, since there is no perfect encoding, which one should be made the
    standard). Unicode seems to help a lot, providing UTF-8 as the external
    and 32-bit Unicode as the internal encoding. This has all sorts of
    benefits and non-benefits that are not important here.

    Also, hardware need not have 32-bit-wide data types, so it could be
    problematic to create conforming implementations.

    BTW: Thank you all for participating in this discussion :)

    Regards
    Zygmunt Krynicki
     
    Zygmunt Krynicki, Oct 10, 2003
    #8
  9. in comp.lang.c i read:

    >Now if wchar_t is not forced to be able to contain a full character, then
    >again we are stuck with our multibyte (multi-some-unit) character
    >sequences with all of their inconveniences. This IMHO defeats the whole
    >purpose of wchar_t.


    wchar_t is required to have a range that can handle all the code points
    which can arise from the use of any locale supported by the implementation.
    c99 takes this further: the implementation can indicate to the programmer
    if iso-10646 is directly supported (though the encoding is *not* required
    to be ucs-4), and it adds the \U and \u escapes so that iso-10646
    code points can be used directly.

    >Also hardware doesn't need to have 32 bit wide data types so it
    >would be problematic to create conforming implementations


    hardware may not necessarily have a 32 bit wide integer type, but the
    standard mandates that long be at least 32 value bits wide (sign + 31 for
    signed long). so, there *is* always a 32 bit type available.

    --
    a signature
     
    those who know me have no need of my name, Oct 11, 2003
    #9
  10. On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    name wrote:

    > in comp.lang.c i read:
    >
    >>Now if wchar_t is not forced to be able to contain a full character, then
    >>again we are stuck with our multibyte (multi-some-unit) character
    >>sequences with all of their inconveniences. This IMHO defeats the whole
    >>purpose of wchar_t.

    >
    > wchar_t is required to have a range that can handle all the code points
    > which can arise from the use of any locale supported by the implementation.
    > c99 takes this further: the implementation can indicate to the programmer
    > if iso-10646 is directly supported (though the encoding is *not* required
    > to be ucs-4)


    I guess you're saying the encoding is not required to be ucs-4 because
    the standard doesn't explicitly say so:

    6.10.8.2
    ...
    __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    example, 199712L), intended to indicate that values of type wchar_t
    are the coded representations of the characters defined by ISO/IEC
    10646, along with all amendments and technical corrigenda as of the
    specified year and month.

    But if the encoding is not ucs-4, then what could it possibly be?
    7.17.2 says

    wchar_t which is an integer type whose range of values can represent
    distinct codes for all members of the largest extended character set
    specified among the supported locales;

    As I read this, it means that implementations implementing ISO 10646
    must have a wchar_t capable of representing over 1 million distinct
    values. Given this requirement, ucs-4 seems to be the only reasonable
    encoding to use for ISO 10646 wide character strings.

    Would an implementation that used utf-8 encoding in wide character
    strings composed of 32-bit wchar_t be conforming?

    -Sheldon
     
    Sheldon Simms, Oct 12, 2003
    #10
  11. Zygmunt Krynicki

    Micah Cowan Guest

    Sheldon Simms <> writes:

    > On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    > name wrote:
    >
    > > in comp.lang.c i read:
    > >
    > >>Now if wchar_t is not forced to be able to contain a full character, then
    > >>again we are stuck with our multibyte (multi-some-unit) character
    > >>sequences with all of their inconveniences. This IMHO defeats the whole
    > >>purpose of wchar_t.

    > >
    > > wchar_t is required to have a range that can handle all the code points
    > > which can arise from the use of any locale supported by the implementation.
    > > c99 takes this further: the implementation can indicate to the programmer
    > > if iso-10646 is directly supported (though the encoding is *not* required
    > > to be ucs-4)

    >
    > I guess you're saying the encoding is not required to be ucs-4 because
    > the standard doesn't explicitly say so:
    >
    > 6.10.8.2
    > ...
    > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    > example, 199712L), intended to indicate that values of type wchar_t
    > are the coded representations of the characters defined by ISO/IEC
    > 10646, along with all amendments and technical corrigenda as of the
    > specified year and month.
    >
    > But if the encoding is not ucs-4, then what could it possibly be?
    > 7.17.2 says
    >
    > wchar_t which is an integer type whose range of values can represent
    > distinct codes for all members of the largest extended character set
    > specified among the supported locales;
    >
    > As I read this, it means that implementations implementing ISO 10646
    > must have a wchar_t capable of representing over 1 million distinct
    > values. Given this requirement, ucs-4 seems to be the only reasonable
    > encoding to use for ISO 10646 wide character strings.


    No; the ISO 10646 and Unicode standards are 16-bit
    encodings. Some 16-bit codes work together (high/low surrogates)
    to produce the effect of a "single" character from two encoded
    characters; however, that does not change the fact that the
    standards themselves claim to present 16-bit encodings (Actually,
    for ISO 10646 I'm making some assumptions, as I've not read it;
    only Unicode). Not only this, but while support is in place for
    character codes 0x10000 and above, no character codes have
    actually been defined for these values, and so UCS-2/UTF-16 can
    safely be used to encode "all members of the largest extended
    character set".

    > Would an implementation that used utf-8 encoding in wide character
    > strings composed of 32-bit wchar_t be conforming?


    I don't think so, no.

    -Micah
     
    Micah Cowan, Oct 12, 2003
    #11
  12. On Sun, 12 Oct 2003 13:29:25 -0700, Micah Cowan wrote:

    > Sheldon Simms <> writes:
    >
    >> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    >> name wrote:
    >>
    >> > in comp.lang.c i read:
    >> >
    >> >>Now if wchar_t is not forced to be able to contain a full character, then
    >> >>again we are stuck with our multibyte (multi-some-unit) character
    >> >>sequences with all of their inconveniences. This IMHO defeats the whole
    >> >>purpose of wchar_t.
    >> >
    >> > wchar_t is required to have a range that can handle all the code points
    >> > which can arise from the use of any locale supported by the implementation.
    >> > c99 takes this further: the implementation can indicate to the programmer
    >> > if iso-10646 is directly supported (though the encoding is *not* required
    >> > to be ucs-4)

    >>
    >> I guess you're saying the encoding is not required to be ucs-4 because
    >> the standard doesn't explicitly say so:
    >>
    >> 6.10.8.2
    >> ...
    >> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    >> example, 199712L), intended to indicate that values of type wchar_t
    >> are the coded representations of the characters defined by ISO/IEC
    >> 10646, along with all amendments and technical corrigenda as of the
    >> specified year and month.
    >>
    >> But if the encoding is not ucs-4, then what could it possibly be?
    >> 7.17.2 says
    >>
    >> wchar_t which is an integer type whose range of values can represent
    >> distinct codes for all members of the largest extended character set
    >> specified among the supported locales;
    >>
    >> As I read this, it means that implementations implementing ISO 10646
    >> must have a wchar_t capable of representing over 1 million distinct
    >> values. Given this requirement, ucs-4 seems to be the only reasonable
    >> encoding to use for ISO 10646 wide character strings.

    >
    > No; the ISO 10646 and Unicode standards are 16-bit
    > encodings.


    Unicode 4.0 p.1:
    Unicode provides for three encoding forms: a 32-bit form (UTF-32),
    a 16-bit form (UTF-16), and an 8-bit form (UTF-8).

    > Some 16-bit codes work together (high/low surrogates)
    > to produce the effect of a "single" character from two encoded
    > characters; however, that does not change the fact that the
    > standards themselves claim to present 16-bit encodings.


    Unicode 4.0 p.1:
    The Unicode Standard specifies a numeric value (code point) and a
    name for each of its characters.
    ...
    The Unicode Standard provides 1,114,112 code points,

    Unicode 4.0 p.28:
    UTF-32 is the simplest Unicode encoding form. Each Unicode code
    point is represented directly by a single 32-bit code unit.
    Because of this, UTF-32 has a one-to-one relationship between
    encoded character and code unit;
    ...
    In the UTF-16 encoding form, ... code points in the supplementary
    planes, in the range U+10000..U+10FFFF, are instead represented
    as pairs of 16-bit code units.
    ...
    The distinction between characters represented with one versus
    two 16-bit code units means that formally UTF-16 is a
    variable-width encoding form.

    > Not only this, but while support is in place for
    > character codes 0x10000 and above, no character codes have
    > actually been defined for these values, and so UCS-2/UTF-16 can
    > safely be used to encode "all members of the largest extended
    > character set".


    Unicode 4.0 p.1:
    The Unicode Standard, Version 4.0, contains 96,382 characters
    from the world's scripts.
    ...
    The unified Han subset contains 70,207 ideographic characters

    Examples of characters at code points greater than or equal to
    0x10000 are "Musical Symbols", "Mathematical Alphanumeric Symbols",
    and "CJK Unified Ideographs Extension B"

    http://www.unicode.org/charts/

    My conclusion is that 16-bit values can NOT in fact encode "all
    members of the largest extended character set", if that character
    set is Unicode. This means that a 16-bit wchar_t is NOT conforming
    on implementations that claim to implement Unicode, and that
    the only acceptable encoding for wide character strings in such
    implementations is UCS-4.

    -Sheldon
     
    Sheldon Simms, Oct 13, 2003
    #12
  13. Zygmunt Krynicki

    Dan Pop Guest

    In <> Sheldon Simms <> writes:

    >On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    >name wrote:
    >
    >> in comp.lang.c i read:
    >>
    >>>Now if wchar_t is not forced to be able to contain a full character, then
    >>>again we are stuck with our multibyte (multi-some-unit) character
    >>>sequences with all of their inconveniences. This IMHO defeats the whole
    >>>purpose of wchar_t.

    >>
    >> wchar_t is required to have a range that can handle all the code points
    >> which can arise from the use of any locale supported by the implementation.
    >> c99 takes this further: the implementation can indicate to the programmer
    >> if iso-10646 is directly supported (though the encoding is *not* required
    >> to be ucs-4)

    >
    >I guess you're saying the encoding is not required to be ucs-4 because
    >the standard doesn't explicitly say so:
    >
    > 6.10.8.2
    > ...
    > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    > example, 199712L), intended to indicate that values of type wchar_t
    > are the coded representations of the characters defined by ISO/IEC
    > 10646, along with all amendments and technical corrigenda as of the
    > specified year and month.
      ^^^^^^^^^^^^^^^^^^^^^^^^
    >But if the encoding is not ucs-4, then what could it possibly be?
    >7.17.2 says
    >
    > wchar_t which is an integer type whose range of values can represent
    > distinct codes for all members of the largest extended character set
    > specified among the supported locales;


    Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
    as being "the largest extended character set specified among the
    supported locales" and, therefore, having wchar_t defined as char?

    >As I read this, it means that implementations implementing ISO 10646
    >must have a wchar_t capable of representing over 1 million distinct
    >values.


    It depends on the actual value of the __STDC_ISO_10646__, which could
    point to an earlier version of ISO 10646, or not be defined at all,
    as in my ASCII example above.

    >Given this requirement, ucs-4 seems to be the only reasonable
    >encoding to use for ISO 10646 wide character strings.


    If the implementation chooses to support a recent enough version of the
    ISO 10646. Which the standard allows but doesn't require. The first
    incarnation of ISO 10646 only specified 34203 characters, so a 16-bit
    wchar_t would be enough for an implementation defining __STDC_ISO_10646__.

    >Would an implementation that used utf-8 encoding in wide character
    >strings composed of 32-bit wchar_t be conforming?


    No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
    to six octets). They are clearly intended to be used in multibyte
    character strings, which are composed of plain chars (e.g. printf's
    format string).

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 13, 2003
    #13
  14. On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:

    > In <> Sheldon Simms <> writes:
    >
    >>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    >>name wrote:
    >>
    >> wchar_t which is an integer type whose range of values can represent
    >> distinct codes for all members of the largest extended character set
    >> specified among the supported locales;

    >
    > Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
    > as being "the largest extended character set specified among the
    > supported locales" and, therefore, having wchar_t defined as char?


    Nothing. However, I was only talking about cases where "the largest
    extended character set" is Unicode.

    >>As I read this, it means that implementations implementing ISO 10646
    >>must have a wchar_t capable of representing over 1 million distinct
    >>values.

    >
    > It depends on the actual value of the __STDC_ISO_10646__, which could
    > point to an earlier version of ISO 10646


    All right. It might suck to know that your preferred implementation
    is not capable of keeping up with ISO 10646 since it's stuck with a
    16-bit wchar_t, but I guess that's a problem for the implementors and
    users of such an implementation, and off topic here.

    >>Given this requirement, ucs-4 seems to be the only reasonable
    >>encoding to use for ISO 10646 wide character strings.

    >
    > If the implementation chooses to support a recent enough version of the
    > ISO 10646. Which the standard allows but doesn't require.


    That's what I thought.

    >>Would an implementation that used utf-8 encoding in wide character
    >>strings composed of 32-bit wchar_t be conforming?

    >
    > No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
    > to six octets). They are clearly intended to be used in multibyte
    > character strings, which are composed of plain char's (e.g. printf's
    > format string).


    My intention was to express that each of the 32-bit wide characters
    contains the value of one octet of the UTF-8 encoding. I didn't
    think that would be conforming.
     
    Sheldon Simms, Oct 13, 2003
    #14
  15. Zygmunt Krynicki

    Dan Pop Guest

    In <> Sheldon Simms <> writes:

    >On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:
    >
    >> In <> Sheldon Simms <> writes:
    >>
    >>>On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    >>>name wrote:
    >>>
    >>> wchar_t which is an integer type whose range of values can represent
    >>> distinct codes for all members of the largest extended character set
    >>> specified among the supported locales;

    >>
    >> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
    >> as being "the largest extended character set specified among the
    >> supported locales" and, therefore, having wchar_t defined as char?

    >
    >Nothing. However, I was only talking about cases where "the largest
    >extended character set" is Unicode.
    >
    >As I read this, it means that implementations implementing ISO 10646
    >>>must have a wchar_t capable of representing over 1 million distinct
    >>>values.

    >>
    >> It depends on the actual value of the __STDC_ISO_10646__, which could
    >> point to an earlier version of ISO 10646

    >
    >All right. It might suck to know that your preferred implementation
    >is not capable of keeping up with ISO 10646 since it's stuck with a
    >16-bit wchar_t, but I guess that's a problem for the implementors and
    >users of such an implementation, and off topic here.


    Once you're talking about cases where "the largest extended character
    set" is Unicode *only*, you're off-topic here, anyway.

    However, I can see no reason why a certain implementation would be stuck
    with a 16-bit wchar_t, once its intended market is asking for more. For
    the time being, there is little market pressure for a wider wchar_t,
    since 16-bit codes cover practically all locales of interest.

    Widening wchar_t to 32-bit is not a no-cost decision: think about
    programs manipulating huge amounts of wchar_t data.

    >>>Would an implementation that used utf-8 encoding in wide character
    >>>strings composed of 32-bit wchar_t be conforming?

    >>
    >> No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
    >> to six octets). They are clearly intended to be used in multibyte
    >> character strings, which are composed of plain char's (e.g. printf's
    >> format string).

    >
    >My intention was to express that each of the 32 bit wide characters
    >contain the value of one octet of the UTF-8 encoding. I didn't
    >think that would be conforming.


    Of course it wouldn't: wchar_t objects are supposed to contain character
    values, not *encoded* character values. Encoded character values can be
    stored in multibyte character strings only.

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 13, 2003
    #15
  16. Zygmunt Krynicki

    Micah Cowan Guest

    Sheldon Simms <> writes:

    > On Sun, 12 Oct 2003 13:29:25 -0700, Micah Cowan wrote:
    >
    > > Sheldon Simms <> writes:
    > >
    > >> On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    > >> name wrote:
    > >>
    > >> > in comp.lang.c i read:
    > >> >
    > >> >>Now if wchar_t is not forced to be able to contain a full character
    > >> >>then again we are stuck at our multibyte (multi-some-unit) character
    > >> >>sequence with all of its inconveniences. This IMHO defeats the whole
    > >> >>purpose of wchar_t.
    > >> >
    > >> > wchar_t is required to have a range that can handle all the code points
    > >> > which can arise from the use of any locale supported by the implementation.
    > >> > c99 takes this further: the implementation can indicate to the programmer
    > >> > if iso-10646 is directly supported (though the encoding is *not* required
    > >> > to be ucs-4)
    > >>
    > >> I guess you're saying the encoding is not required to be ucs-4 because
    > >> the standard doesn't explicitly say so:
    > >>
    > >> 6.10.8.2
    > >> ...
    > >> __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    > >> example, 199712L), intended to indicate that values of type wchar_t
    > >> are the coded representations of the characters defined by ISO/IEC
    > >> 10646, along with all amendments and technical corrigenda as of the
    > >> specified year and month.
    > >>
    > >> But if the encoding is not ucs-4, then what could it possibly be?
    > >> 7.17.2 says
    > >>
    > >> wchar_t which is an integer type whose range of values can represent
    > >> distinct codes for all members of the largest extended character set
    > >> specified among the supported locales;
    > >>
    > >> As I read this, it means that implementations implementing ISO 10646
    > >> must have a wchar_t capable of representing over 1 million distinct
    > >> values. Given this requirement, ucs-4 seems to be the only reasonable
    > >> encoding to use for ISO 10646 wide character strings.

    > >
    > > No; the ISO 10646 and Unicode standards are 16-bit
    > > encodings.

    >
    > Unicode 4.0 p.1:
    > Unicode provides for three encoding forms: a 32-bit form (UTF-32),
    > a 16-bit form (UTF- 16), and an 8-bit form (UTF-8).


    I didn't mean quite what I wrote: What I meant was "Unicode
    character codes have a width of 16 bits". This was true
    regardless of the number of encodings available (Unicode 3.0 plus
    addenda had UTF-32), yet sect. 2.2 still said "Unicode character
    codes have a width of 16 bits". This appears to have been removed
    from Unicode 4.0.

    > > Some 16-bit codes work together (high/low surrogates)
    > > to produce the effect of a "single" character from two encoded
    > > characters; however, that does not change the fact that the
    > > standards themselves claim to present 16-bit encodings.

    >
    > Unicode 4.0 p.1:
    > The Unicode Standard specifies a numeric value (code point) and a
    > name for each of its characters.
    > ...
    > The Unicode Standard provides 1,114,112 code points,


    Hm. The same area in Unicode 3.0 said "Using a 16-bit encoding
    means that code values are available for more than 65,000
    characters." They clearly supported more than that; sloppy
    wording on their part.

    > Unicode 4.0 p.28:
    > UTF-32 is the simplest Unicode encoding form. Each Unicode code
    > point is represented directly by a single 32-bit code unit.
    > Because of this, UTF-32 has a one-to-one relationship between
    > encoded character and code unit;
    > ...
    > In the UTF-16 encoding form, ... code points in the supplementary
    > planes, in the range U+10000..U+10FFFF, are instead represented
    > as pairs of 16-bit code units.
    > ...
    > The distinction between characters represented with one versus
    > two 16-bit code units means that formally UTF-16 is a variable-
    > width encoding form.


    Okay. Here's the chief difference then. In Unicode 3.0, UTF-16
    was formally considered the one-to-one representation (which was
    kind of sticky when you deal with surrogates; having to pretend
    that they're really two separate characters...).

    > My conclusion is that 16 bit values can NOT in fact encode "all
    > members of the largest extended character set", if that character
    > set is Unicode. This means that 16 bit wchar_t is NOT conforming
    > on implementations that claim to implement Unicode, and that
    > the only acceptable encoding for wide character strings in such
    > implementations is UCS-4.


    Alright, then: but it *is* conforming provided that they claim to
    conform to a Unicode standard preceding 4.0 whose entire character
    set could be represented in 16 bits.

    I hadn't gotten around to reading 4.0 yet; I'm pleased to see
    that they've eschewed all the "pay no attention to the man behind
    the curtain; Unicode *is* a 16-bit character set" posturing that
    seemed to be present in 3.0. Perhaps they had already remedied some
    of this in their addenda: I didn't read many of those except some
    of the new character codespaces.

    -Micah
     
    Micah Cowan, Oct 13, 2003
    #16
  17. On Mon, 13 Oct 2003 18:25:04 +0000, Dan Pop wrote:

    > In <> Sheldon Simms <> writes:
    >
    >>On Mon, 13 Oct 2003 14:18:31 +0000, Dan Pop wrote:
    >>
    >>> In <> Sheldon Simms <> writes:
    >>>
    >>>>Would an implementation that used utf-8 encoding in wide character
    >>>>strings composed of 32-bit wchar_t be conforming?
    >>>
    >>> No way. utf-8 encodings need not fit in a 32-bit wchar_t (they take one
    >>> to six octets). They are clearly intended to be used in multibyte
    >>> character strings, which are composed of plain char's (e.g. printf's
    >>> format string).

    >>
    >>My intention was to express that each of the 32 bit wide characters
    >>contain the value of one octet of the UTF-8 encoding. I didn't
    >>think that would be conforming.

    >
    > Of course it wouldn't: wchar_t objects are supposed to contain character
    > values, not *encoded* character values. Encoded character values can be
    > stored in multibyte character strings only.


    This gets back to the problem the original poster had. He seemed to
    be confronted with an implementation that used 16 bit wchar_t and
    encoded wide character strings (including characters outside of
    Unicode's Basic Multilingual Plane) in UTF-16, a variable length
    encoding.

    I expressed the view that such an implementation would be non-conforming.
     
    Sheldon Simms, Oct 13, 2003
    #17
  18. Zygmunt Krynicki

    Dingo Guest

    (Dan Pop) wrote in message news:<bmec7n$jsh$>...
    > In <> Sheldon Simms <> writes:
    >
    > >On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    > >name wrote:
    > >
    > >> in comp.lang.c i read:
    > >>
    > >>>Now if wchar_t is not forced to be able to contain a full character
    > >>>then again we are stuck at our multibyte (multi-some-unit) character
    > >>>sequence with all of its inconveniences. This IMHO defeats the whole
    > >>>purpose of wchar_t.
    > >>
    > >> wchar_t is required to have a range that can handle all the code points
    > >> which can arise from the use of any locale supported by the implementation.
    > >> c99 takes this further: the implementation can indicate to the programmer
    > >> if iso-10646 is directly supported (though the encoding is *not* required
    > >> to be ucs-4)

    > >
    > >I guess you're saying the encoding is not required to be ucs-4 because
    > >the standard doesn't explicitly say so:
    > >
    > > 6.10.8.2
    > > ...
    > > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    > > example, 199712L), intended to indicate that values of type wchar_t
    > > are the coded representations of the characters defined by ISO/IEC
    > > 10646, along with all amendments and technical corrigenda as of the
    > > specified year and month.

    > >But if the encoding is not ucs-4, then what could it possibly be?
    > >7.17.2 says
    > >
    > > wchar_t which is an integer type whose range of values can represent
    > > distinct codes for all members of the largest extended character set
    > > specified among the supported locales;

    >
    > Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
    > as being "the largest extended character set specified among the
    > supported locales" and, therefore, having wchar_t defined as char?
    >
    > >As I read this, it means that implementations implementing ISO 10646
    > >must have a wchar_t capable of representing over 1 million distinct
    > >values.

    >
    > It depends on the actual value of the __STDC_ISO_10646__, which could
    > point to an earlier version of ISO 10646, or not be defined at all,
    > as in my ASCII example above.


    The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
    version that defines the extended character set. It just states
    the version in which the wchar_t encodings may be found.

    A seven-bit ASCII implementation with wchar_t defined as char could
    define the most recent value for __STDC_ISO_10646__ and be conforming.
    ASCII encodings map directly to the most recent version of ISO 10646.
    And a char is wide enough to hold "the largest extended character set
    among the supported locales."
     
    Dingo, Oct 14, 2003
    #18
  19. Zygmunt Krynicki

    Dan Pop Guest

    In <> (Dingo) writes:

    > (Dan Pop) wrote in message news:<bmec7n$jsh$>...
    >> In <> Sheldon Simms <> writes:
    >>
    >> >On Sat, 11 Oct 2003 19:42:31 +0000, those who know me have no need of my
    >> >name wrote:
    >> >
    >> >> in comp.lang.c i read:
    >> >>
    >> >>>Now if wchar_t is not forced to be able to contain a full character
    >> >>>then again we are stuck at our multibyte (multi-some-unit) character
    >> >>>sequence with all of its inconveniences. This IMHO defeats the whole
    >> >>>purpose of wchar_t.
    >> >>
    >> >> wchar_t is required to have a range that can handle all the code points
    >> >> which can arise from the use of any locale supported by the implementation.
    >> >> c99 takes this further: the implementation can indicate to the programmer
    >> >> if iso-10646 is directly supported (though the encoding is *not* required
    >> >> to be ucs-4)
    >> >
    >> >I guess you're saying the encoding is not required to be ucs-4 because
    >> >the standard doesn't explicitly say so:
    >> >
    >> > 6.10.8.2
    >> > ...
    >> > __STDC_ISO_10646__ An integer constant of the form yyyymmL (for
    >> > example, 199712L), intended to indicate that values of type wchar_t
    >> > are the coded representations of the characters defined by ISO/IEC
    >> > 10646, along with all amendments and technical corrigenda as of the
    >> > specified year and month.

    >> >But if the encoding is not ucs-4, then what could it possibly be?
    >> >7.17.2 says
    >> >
    >> > wchar_t which is an integer type whose range of values can represent
    >> > distinct codes for all members of the largest extended character set
    >> > specified among the supported locales;

    >>
    >> Again, what part of the standard precludes ASCII, EBCDIC or ISO 8859-1
    >> as being "the largest extended character set specified among the
    >> supported locales" and, therefore, having wchar_t defined as char?
    >>
    >> >As I read this, it means that implementations implementing ISO 10646
    >> >must have a wchar_t capable of representing over 1 million distinct
    >> >values.

    >>
    >> It depends on the actual value of the __STDC_ISO_10646__, which could
    >> point to an earlier version of ISO 10646, or not be defined at all,
    >> as in my ASCII example above.

    >
    >The way I read it, __STDC_ISO_10646__ doesn't indicate the Unicode
    >version that defines the extended character set. It just states
    >the version in which the wchar_t encodings may be found.
    >
    >A seven-bit ASCII implementation with wchar_t defined as char could
    >define the most recent value for __STDC_ISO_10646__ and be conforming.
    >ASCII encodings map directly to the most recent version of ISO 10646.
    >And a char is wide enough to hold "the largest extended character set
    >among the supported locales."


    As I read it, it is the whole ISO/IEC 10646 specification that must be
    supported by wchar_t, once this macro is defined. The words "along
    with all amendments and technical corrigenda as of the specified year
    and month" clearly suggest this interpretation to me. Of course, only
    comp.std.c can say which interpretation is the intended one.

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 14, 2003
    #19
  20. Zygmunt Krynicki

    Dan Pop Guest

    In <> Sheldon Simms <> writes:

    >This gets back to the problem the original poster had. He seemed to
    >be confronted with an implementation that used 16 bit wchar_t and
    >encoded wide character strings (including characters outside of
    >Unicode's Basic Multilingual Plane) in UTF-16, a variable length
    >encoding.


    Couldn't find anything suggesting this in OP's post:

    From: "Zygmunt Krynicki" <zyga@_CUT_2zyga.MEdyndns._OUT_org>
    Organization: Customers chello Poland
    Date: Thu, 09 Oct 2003 12:54:00 GMT
    Subject: Multibyte string length

    Hello
    I've browsed the FAQ but apparently it lacks any questions concerning wide
    character strings. I'd like to calculate the length of a multibyte string
    without converting the whole string.

    Zygmunt

    PS: The whole multibyte string vs wide character string concept is broken
    IMHO since it allows wchar_t not to be large enough to contain a full
    character (rendering both types virtually the same). What's the point of
    standardizing wide characters if the standard makes portable usage of such
    mechanism a programming hell? Feel free to disagree.

    PS2: On my implementation wchar_t is 'big enough' so I might overcome the
    problem in some other way but I'd like to see some fully portable approach.

    He seemed to be worried about wchar_t not being wide enough for its
    intended purpose, but the C standard makes it quite clear that this cannot
    be the case, by definition, for the simple reason that it is the
    implementor who decides what the extended character set actually is.

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 14, 2003
    #20