wchar_t

Discussion in 'C Programming' started by James Brown, Nov 17, 2005.

  1. James Brown

    James Brown Guest

    could someone please tell me when the wchar_t type was introduced into
    the C language (and with what version)? perhaps it was introduced
    as an extension by a lot of compiler vendors before it became official?

    I am also interested in finding out what first prompted the introduction of
    this type -
    was it Unicode or did wchar_t happen before Unicode came into existence?

    thanks,
    James
     
    James Brown, Nov 17, 2005
    #1

  2. James Brown wrote:
    > could someone please tell me when the wchar_t type was introduced into
    > the C language (and with what version)


    Normative Addendum 1 (1995).

    --
    A+

    Emmanuel Delahaye
     
    Emmanuel Delahaye, Nov 17, 2005
    #2

  3. in comp.lang.c i read:

    >could someone please tell me when the wchar_t type was introduced into
    >the C language (and with what version)


    in the original standard, in 1989. though it was less than useful until
    amd1 was adopted in 1995, and some might say remains less than successful.
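
    A minimal sketch of what that amendment added around the C89 wchar_t
    type (an illustration, not part of the original post; it assumes a
    hosted C95-or-later implementation and a working locale -- mbstowcs is
    C89, the <wchar.h> calls are Amendment 1):

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <wchar.h>

        int main(void)
        {
            const char *mb = "hello";   /* multibyte string in the current locale */
            wchar_t wide[32];
            size_t n;

            setlocale(LC_ALL, "");      /* pick up the user's multibyte encoding */

            n = mbstowcs(wide, mb, sizeof wide / sizeof wide[0]);
            if (n == (size_t)-1)
                return EXIT_FAILURE;    /* invalid multibyte sequence */

            wprintf(L"%ls is %lu wide characters\n",
                    wide, (unsigned long)wcslen(wide));
            return 0;
        }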

    --
    a signature
     
    those who know me have no need of my name, Nov 18, 2005
    #3
  4. James Brown

    Larry Jones Guest

    James Brown <dont_bother> wrote:
    >
    > I am also interested in finding out what first prompted the introduction of
    > this type -
    > was it Unicode or did wchar_t happen before Unicode came into existence?


    It was large character sets in general. At the time, the prevalent
    large character sets and encodings were IBM's DBCS (the double-byte
    version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    begun on what would become ISO 10646, but it was caught up in political
    and technical turmoil between those who insisted that 32 bits were
    required and those who thought that 16 were more than enough and far
    more efficient. The latter camp had just broken away and started work
    on a competing standard, Unicode. (Fortunately for everyone, cooler
    heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    the point that most people now think they're the same thing.)

    -Larry Jones

    Let's pretend I already feel terrible about it, and that you
    don't need to rub it in any more. -- Calvin
     
    Larry Jones, Nov 18, 2005
    #4
  5. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    > James Brown <dont_bother> wrote:
    >>
    >> I am also interested in finding out what first prompted the introduction
    >> of
    >> this type -
    >> was it Unicode or did wchar_t happen before Unicode came into existence?

    >
    > It was large character sets in general. At the time, the prevalent
    > large character sets and encodings were IBM's DBCS (the double-byte
    > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > begun on what would become ISO 10646, but it was caught up in political
    > and technical turmoil between those who insisted that 32 bits were
    > required and those who thought that 16 were more than enough and far
    > more efficient. The latter camp had just broken away and started work
    > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > the point that most people now think they're the same thing.)


    Right. And the people who thought that 16 bits were more than enough
    and far more efficient are *now* convinced that 21 bits are more
    than enough and far more efficient. I give 'em five years, tops.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 18, 2005
    #5
  6. James Brown

    Paul Hsieh Guest

    P.J. Plauger wrote:
    > <> wrote in message
    > > James Brown <dont_bother> wrote:
    > >> I am also interested in finding out what first prompted the introduction
    > >> of this type - was it Unicode or did wchar_t happen before Unicode came into
    > >> existence?

    > >
    > > It was large character sets in general. At the time, the prevalent
    > > large character sets and encodings were IBM's DBCS (the double-byte
    > > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > > begun on what would become ISO 10646, but it was caught up in political
    > > and technical turmoil between those who insisted that 32 bits were
    > > required and those who thought that 16 were more than enough and far
    > > more efficient. The latter camp had just broken away and started work
    > > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > > the point that most people now think they're the same thing.)

    >
    > Right. And the people who thought that 16 bits were more than enough
    > and far more efficient are *now* convinced that 21 bits are more
    > than enough and far more efficient. I give 'em five years, tops.


    I'll take the other side of any bet you care to make based on that
    statement. (Certainly I'll be recording this message for the
    archives.)

    BTW, the people who thought 32 bits (actually 31 bits) was the way to
    go *also* agree that (almost) 21 bits are more than enough. Less than
    17 bits' worth of code points are in use today (dominated by the East
    Asian characters), and the growth rate appears to be no worse than a
    thousand new characters added per year. The kinds of things they are
    considering these days are invented character sets (like an
    accessibility alphabet called "Blissymbolics", or the script used for
    Klingon and Elvish in the Lord of the Rings series ...) and really
    obscure historical symbols (apparently "Old Hungarian" used an
    alphabet that survives today only among a few specialist historians
    in Hungary). The growth looks very asymptotic to me.

    The problem with the 16-bit people (i.e., Microsoft, Sun, and I think
    IBM) is that they were so stupid as to think that Asia wouldn't really
    need complete character sets. They were basically being passively
    racist. But, of course, money talks and there is a lot of commerce in
    and with East Asia, so they had to be accommodating. It turns out that
    17 bits appears to be the right answer to get them all, but leaving no
    room for expansion at all is clearly insane. Having 21 bits means they
    literally have more than 15 times as much space left over as what
    they are currently using (again, remembering that they're already
    covering the really "big" East Asian character sets).

    The only remaining controversy (that I can tell) is the aliasing of
    characters between the three major east asian languages. From what I'm
    told, people in those countries don't seem to care about the subtle
    problems that causes (you can't quote one language within another
    unless you use some meta data, like a font change), and have gone full
    steam ahead with dropping Big5 and adopting Unicode pretty pervasively.

    You think they'll run out in 5 years? Personally, I think they're
    done.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #6
  7. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    > The only remaining controversy (that I can tell) is the aliasing of
    > characters between the three major east asian languages. From what I'm
    > told, people in those countries don't seem to care about the subtle
    > problems that causes (you can't quote one language within another
    > unless you use some meta data, like a font change), and have gone full
    > steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    >
    > You think they'll run out in 5 years? Personally, I think they're
    > done.


    Here's a coarse scale or two, just from personal experience.

    -- Number of address bits required to address a "large" memory:

    1960 15 IBM 7090
    1970 20 IBM 360
    1980 25 VAX 11/780
    1990 30 various
    2000 35 various

    -- Number of bits required to represent a (commonly used)
    character set:

    1960 6 numerous vendor-specific codes
    1970 7 7-bit ASCII
    1980 8 extended ASCII
    1990 16 DBCS and others
    2000 21 Unicode

    I could make a similar table of "barely adequate" communication
    speeds, which also continue to expand exponentially.

    So long as you think in terms of linear increases in demand
    for bytes or characters, it's easy to believe at each stage
    that you're through expanding. After all, you currently have
    a bit of headroom, and what possible need can there be for
    much larger programs/character sets?

    I personally can't imagine that people will ever want to
    define common attribute bits for, say:

    -- roman, italic, bold, underscore
    -- red, green, blue
    -- point size
    -- font

    But if we did, each attribute bit would double the number
    of effective character codes, wouldn't it?

    Nor can I imagine that a large government like China might
    thumb its nose at an international standard and, say,
    require a parallel set of many ISO 10646 codes.

    For over 40 years I've been reading regular articles by
    pundits who explain why larger/faster hardware is a waste
    of time and will never sell. They've all been wrong. And
    the further back in time you look, the greater the redshift
    in the predictions.

    So, you may well be right that the need for larger
    character sets has finally come to an end. I'll wait
    and see. Meanwhile, I make sure that the code I write
    will work with 32- (not 31-) bit character sets. With
    any luck, the code will have adequate capacity until
    I retire...

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #7
  8. James Brown

    James Brown Guest

    <> wrote in message
    news:...
    > James Brown <dont_bother> wrote:
    >>
    >> I am also interested in finding out what first prompted the introduction
    >> of
    >> this type -
    >> was it Unicode or did wchar_t happen before Unicode came into existence?

    >
    > It was large character sets in general. At the time, the prevalent
    > large character sets and encodings were IBM's DBCS (the double-byte
    > version of EBCDIC), JIS 208, JIS 212, ISO 2022, SJIS, and EUC. Work had
    > begun on what would become ISO 10646, but it was caught up in political
    > and technical turmoil between those who insisted that 32 bits were
    > required and those who thought that 16 were more than enough and far
    > more efficient. The latter camp had just broken away and started work
    > on a competing standard, Unicode. (Fortunately for everyone, cooler
    > heads prevailed and ISO 10646 and Unicode were eventually harmonized to
    > the point that most people now think they're the same thing.)
    >
    > -Larry Jones
    >
    > Let's pretend I already feel terrible about it, and that you
    > don't need to rub it in any more. -- Calvin


    thanks! (to everyone) for the very informative answers.

    cheers,
    James
     
    James Brown, Nov 19, 2005
    #8
  9. James Brown

    Skarmander Guest

    [OT] Re: wchar_t

    I'll mark it OT, since we've left C behind quite a bit by now.

    P.J. Plauger wrote:
    > <> wrote in message
    > news:...
    >
    >
    >>The only remaining controversy (that I can tell) is the aliasing of
    >>characters between the three major east asian languages. From what I'm
    >>told, people in those countries don't seem to care about the subtle
    >>problems that causes (you can't quote one language within another
    >>unless you use some meta data, like a font change), and have gone full
    >>steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    >>
    >>You think they'll run out in 5 years? Personally, I think they're
    >>done.

    >
    >
    > Here's a coarse scale or two, just from personal experience.
    >
    > -- Number of address bits required to address a "large" memory:
    >
    > 1960 15 IBM 7090
    > 1970 20 IBM 360
    > 1980 25 VAX 11/780
    > 1990 30 various
    > 2000 35 various
    >

    Nice, but this misses a point: there is an upper limit. Address bits
    will not continue to grow indefinitely, because there is an upper limit
    to the amount of information that will fit in the universe. Or maybe
    there isn't, but then we're talking a radical shift in physics, which
    may happen but doesn't allow for fair comparison anymore.

    > -- Number of bits required to represent a (commonly used)
    > character set:
    >
    > 1960 6 numerous vendor-specific codes
    > 1970 7 7-bit ASCII
    > 1980 8 extended ASCII
    > 1990 16 DBCS and others
    > 2000 21 Unicode
    >
    > I could make a similar table of "barely adequate" communication
    > speeds, which also continue to expand exponentially.
    >

    But again: it can't go on forever. The question here, therefore, is
    whether we've reached the end of the line, not whether exponential
    expansion is happening.

    > So long as you think in terms of linear increases in demand
    > for bytes or characters, it's easy to believe at each stage
    > that you're through expanding. After all, you currently have
    > a bit of headroom, and what possible need can there be for
    > much larger programs/character sets?
    >

    Don't think this question hasn't been asked, unlike those people who
    asserted that "640K ought to be enough for anybody" (which Bill Gates
    famously never said) or "16 bits ought to be enough, since it's better
    than wasting 32 bits". Unicode doesn't say "21 bits ought to be enough
    for anybody". It can say "21 bits is enough for every character known to
    man", because it is. Unlike memory, communication speed and a host of
    other things that keep growing, there is a conceivable upper limit, and
    it is not that unreasonable to state we're close to it.

    > I personally can't imagine that people will ever want to
    > define common attribute bits for, say:
    >
    > -- roman, italic, bold, underscore
    > -- red, green, blue
    > -- point size
    > -- font
    >
    > But if we did, each attribute bit would double the number
    > of effective character codes, wouldn't it?
    >


    That's why Unicode doesn't work that way, and no character set ever has.
    They encode *characters*, not *glyphs*. A glyph is what you see on your
    screen, and it may have many nice properties by which it is affected,
    including the formatting characteristics you describe. But a Roman
    capital letter A is a Roman capital letter A, no matter what style,
    color, size or font it happens to be displayed in. Being able to leave
    these things unstated will always remain useful.

    Actually, "glyph sets" were (and probably still are) in common use for
    display on dumb terminals with hardwired character sets (and probably
    some applications for not so dumb terminals, too). Remember when the
    character set was 7-bit ASCII and the terminals extended this to an
    8-bit glyph set with the upper bit meaning "reverse video"? That's this.

    The point is, effective comparison stops being useful at this point,
    because you've shifted the way you look at what a code point represents.
    As the Unicode FAQ itself states:

    "Both Unicode and ISO 10646 have policies in place that formally limit
    future code assignment to the integer range that can be expressed with
    current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e.
    other UTFs) can represent larger integers, these policies mean that all
    encoding forms will always represent the same set of characters. Over a
    million possible codes is far more than enough for the goal of Unicode
    of encoding characters, not glyphs. Unicode is not designed to encode
    arbitrary data. If you wanted, for example, to give each 'instance of a
    character on paper throughout history' its own code, you might need
    trillions or quadrillions of such codes; noble as this effort might be,
    you would not use Unicode for such an encoding."

    Here's a more interesting thing to think about than adding "blink" bits:
    suppose we encounter extraterrestrial cultures one day, and we want to
    synch character sets eventually... *Then* Unicode may become
    insufficient. But I don't think it would be fair to blame the current
    standard for that.

    > Nor can I imagine that a large government like China might
    > thumb its nose at an international standard and, say,
    > require a parallel set of many ISO 10646 codes.
    >

    It already thumbs its nose at it to some extent. Unicode is still viewed with
    great suspicion in some parts of the Eastern world, and alternate
    character sets continue to be in use. But the Chinese government can
    require of ISO 10646 what it wants; it's not likely to get it if it
    can't be supported by technical requirements, as opposed to politics.
    Maybe you can slip in one character that's spurious that way, but not a
    few thousand. Maybe when the Chinese achieve global domination and
    abolish our preposterous 21-bit standards, but not before.

    > For over 40 years I've been reading regular articles by
    > pundits who explain why larger/faster hardware is a waste
    > of time and will never sell. They've all been wrong. And
    > the further back in time you look, the greater the redshift
    > in the predictions.
    >

    These arguments do not cleanly translate to character sets, your little
    tables notwithstanding. The upper limit may not be 21 bits, but if
    that's not the upper limit, it's pretty close to it in orders of
    magnitude. If people one day decide to abandon the concept of "character
    set" and go crazy stuffing all sorts of attributes in it (adopting
    "glyph sets"), that's a clear change in application, unlike increased
    hardware capacity. It will be fueled by the *ability* to use such sets
    efficiently, not the *need* to do this.

    > So, you may well be right that the need for larger
    > character sets has finally come to an end. I'll wait
    > and see. Meanwhile, I make sure that the code I write
    > will work with 32- (not 31-) bit character sets. With
    > any luck, the code will have adequate capacity until
    > I retire...
    >

    Fortunately for you, writing code that can handle both 21-bit and 32-bit
    character sets is hardly a challenge, given the current state of
    computer hardware. Even if Unicode had to grow someday (which would have
    to mean a new standard, of course), it wouldn't exactly be hard to
    implement, at least not as far as code point size is concerned.

    S.
     
    Skarmander, Nov 19, 2005
    #9
  10. James Brown

    Paul Hsieh Guest

    P.J. Plauger wrote:
    > <> wrote in message
    > > The only remaining controversy (that I can tell) is the aliasing of
    > > characters between the three major east asian languages. From what I'm
    > > told, people in those countries don't seem to care about the subtle
    > > problems that causes (you can't quote one language within another
    > > unless you use some meta data, like a font change), and have gone full
    > > steam ahead with dropping Big5 and adopting Unicode pretty pervasively.
    > >
    > > You think they'll run out in 5 years? Personally, I think they're
    > > done.

    >
    > Here's a coarse scale or two, just from personal experience.
    >
    > -- Number of address bits required to address a "large" memory:
    >
    > 1960 15 IBM 7090
    > 1970 20 IBM 360
    > 1980 25 VAX 11/780
    > 1990 30 various
    > 2000 35 various
    >
    > -- Number of bits required to represent a (commonly used)
    > character set:
    >
    > 1960 6 numerous vendor-specific codes


    Used only by computer scientists. (Commercial computing being
    non-existent.)

    > 1970 7 7-bit ASCII


    Used only in English-speaking countries.

    > 1980 8 extended ASCII


    Used only in English-speaking and *some* European countries.

    > 1990 16 DBCS and others


    A nonsensical hack.

    > 2000 21 Unicode


    Used in 100% of all computer-using countries (and built to scale to
    those that don't yet).

    The only potential for future growth here will come from the SETI
    project.

    > I could make a similar table of "barely adequate" communication
    > speeds, which also continue to expand exponentially.
    >
    > So long as you think in terms of linear increases in demand
    > for bytes or characters, it's easy to believe at each stage
    > that you're through expanding. After all, you currently have
    > a bit of headroom, and what possible need can there be for
    > much larger programs/character sets?


    There is nowhere left to scale to, and the headroom is overkill. We
    would have to add at least 16 languages of similar complexity to the
    East Asian ones before the encoding space was at risk.

    > I personally can't imagine that people will ever want to
    > define common attribute bits for, say:
    >
    > -- roman, italic, bold, underscore
    > -- red, green, blue
    > -- point size
    > -- font
    >
    > But if we did, each attribute bit would double the number
    > of effective character codes, wouldn't it?


    So you haven't read anything about Unicode at all, have you? Unicode
    does *not* specify meta-information. Those kinds of data will never be
    put into the Unicode standard, and are not considered part of the text
    data that Unicode specifies.

    This also betrays an ignorance of what Unicode is specifying. Do you
    think it makes sense to have the accent of one character in a different
    font or size than its base character? Even if you wanted to encode
    this (which I think the East Asians may need in some cases of
    multi-language applications), such metadata would obviously be encoded
    as escaped *modes*. This is easily done in the "private use area"
    ranges in application-specific ways. But most people use meta-display
    formatting languages, like HTML, the OpenDocument format, or MS Word,
    or something like that, to encode such things today.

    > Nor can I imagine that a large government like China might
    > thumb its nose at an international standard and, say,
    > require a parallel set of many ISO 10646 codes.


    Why would they do this? The closest thing to China setting policy on
    anything regarding computing standards is their adoption of Red Flag
    Linux. Linux uses Unicode as its internationalization mechanism. I
    don't think China wants to give up on the commerce that relies on this
    standardization (i.e., all of it.)

    > For over 40 years I've been reading regular articles by
    > pundits who explain why larger/faster hardware is a waste
    > of time and will never sell. They've all been wrong. And
    > the further back in time you look, the greater the redshift
    > in the predictions.


    That is because they always underestimate the scale and growth of the
    problem being solved. By analogy, you are suggesting that human
    languages and the character sets we use will keep growing over time
    in an exponential way similar to the growth of programming
    applications.

    > So, you may well be right that the need for larger
    > character sets has finally come to an end. I'll wait
    > and see. Meanwhile, I make sure that the code I write
    > will work with 32- (not 31-) bit character sets.


    Are you going to invent your own standard? UTF-32 encodes 31 bits (the
    top bit is assumed to be 0, otherwise an encoding error can be
    assumed). UTF-8 encodes at most 31 bits (this is a physical encoding
    limitation). And UTF-16 encodes a little under 21 bits (again, a
    physical encoding limitation). The only *valid* encodings are the
    intersection of these, which is essentially the UTF-16 range.
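
    For illustration (not from the original post), a sketch of where that
    "a little under 21 bits" comes from: code points above U+FFFF have to
    be split into a UTF-16 surrogate pair, which can reach at most
    U+10FFFF. The helper below is hypothetical, not any standard API.

        #include <stdio.h>

        /* encode one Unicode scalar value as UTF-16; returns the number of
           16-bit code units written (0 if the value cannot be encoded) */
        static int utf16_encode(unsigned long cp, unsigned short out[2])
        {
            if (cp > 0x10FFFFul || (cp >= 0xD800ul && cp <= 0xDFFFul))
                return 0;                    /* out of range, or a bare surrogate */
            if (cp < 0x10000ul) {
                out[0] = (unsigned short)cp; /* fits in one code unit (the BMP) */
                return 1;
            }
            cp -= 0x10000ul;                 /* 20 bits left to spread over a pair */
            out[0] = (unsigned short)(0xD800ul | (cp >> 10));     /* high surrogate */
            out[1] = (unsigned short)(0xDC00ul | (cp & 0x3FFul)); /* low surrogate */
            return 2;
        }

        int main(void)
        {
            unsigned short u[2];
            int n = utf16_encode(0x1D11Eul, u);  /* MUSICAL SYMBOL G CLEF */

            if (n == 2)
                printf("U+1D11E -> %04X %04X\n", (unsigned)u[0], (unsigned)u[1]);
            return 0;
        }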

    > [...] With any luck, the code will have adequate capacity until I retire...


    Also, just arbitrarily deciding that "characters are 32 bits" is less
    than useful to people who actually want to encode and use Unicode
    data. For example, string comparison and collation cannot be done
    with a simple byte comparison, and character counts do not correspond
    to the length of the encoded data. If you don't encode actual Unicode
    semantics (i.e., you use plain wchar_t instead), then "adequate" is
    not a word anyone is going to apply to your implementation.
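
    As a concrete (and again hypothetical) illustration of the gap between
    encoded length and character count, here is a UTF-8 code point
    counter; note that even this count still ignores combining sequences,
    which is part of why collation cannot be a simple byte compare.

        #include <stdio.h>
        #include <string.h>

        /* count code points in a UTF-8 string by skipping the continuation
           bytes, i.e. those of the form 10xxxxxx */
        static size_t utf8_codepoints(const char *s)
        {
            size_t count = 0;

            for (; *s != '\0'; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    count++;
            return count;
        }

        int main(void)
        {
            const char *s = "na\xC3\xAFve";  /* "naive" with a diaeresis: 6 bytes */

            printf("bytes: %lu, code points: %lu\n",
                   (unsigned long)strlen(s),
                   (unsigned long)utf8_codepoints(s));
            return 0;
        }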

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #10
  11. In article <>,
    <> wrote:
    >And UTF-16 encodes a little under 21 bits


    A little over 20 bits would be more accurate: 2^20 + 2^16.

    -- Richard
     
    Richard Tobin, Nov 19, 2005
    #11
  12. James Brown

    P.J. Plauger Guest

    <> wrote in message
    news:...

    >> But if we did, each attribute bit would double the number
    >> of effective character codes, wouldn't it?

    >
    > So you haven't read anything about Unicode at all have you?


    Actually, I have.

    > Unicode
    > does *not* specify meta-information. Those kinds of data will never be
    > put into the Unicode standard, and are not considered part of the text
    > data that Unicode specifies.


    What, never? You may very well be right.

    > This also belies an ignorance of what Unicode is specifying. Do you
    > think it makes sense to have the accent of one character in a different
    > font or size than its base character?


    Does it make sense to have several different ways to express the
    same "character", some involving multiple codes in arbitrary order?
    Particularly when there's a one-element version that does the job?
    Who would do a thing like that in an international standard?

    >> Nor can I imagine that a large government like China might
    >> thumb its nose at an international standard and, say,
    >> require a parallel set of many ISO 10646 codes.

    >
    > Why would they do this? The closest thing to China setting policy on
    > anything regarding computing standards is their adoption of Red Flag
    > Linux. Linux uses Unicode as its internationalization mechanism. I
    > don't think China wants to give up on the commerce that relies on this
    > standardization (i.e., all of it.)


    That's not what I've heard.

    >> For over 40 years I've been reading regular articles by
    >> pundits who explain why larger/faster hardware is a waste
    >> of time and will never sell. They've all been wrong. And
    >> the further back in time you look, the greater the redshift
    >> in the predictions.

    >
    > That is because they always underestimate the scale and growth in the
    > problem being solved.


    Uh huh.

    > By analogy you are suggesting that human
    > languages and the character sets we use will be increasing over time in
    > an increasing and exponential way similar to the growth of programming
    > applications.


    Yep.

    >> So, you may well be right that the need for larger
    >> character sets has finally come to an end. I'll wait
    >> and see. Meanwhile, I make sure that the code I write
    >> will work with 32- (not 31-) bit character sets.

    >
    > Are you going to invent your own standard?


    No. But I've already invented my own worst-case *machinery*
    for handling a variety of standards. Different thing.

    > UTF-32 encodes 31 bits (the
    > top bit is assumed to be 0, otherwise an encoding error can be
    > assumed). UTF-8 encodes at most 31 bits (this is a physical encoding
    > limitation). And UTF-16 encodes a little under 21 bits (again, a
    > physical encoding limitation). The only *valid* encodings are the
    > intersection of these which is essentially the UTF-16 encoding.


    At the moment, yes. A few years ago, it was UCS-2.

    >> [...] With any luck, the code will have adequate capacity until I
    >> retire...

    >
    > Also, just arbitrarily thinking "characters are 32 bits" are less that
    > useful to people who actually want to encode and use Unicode data.


    May be true. I didn't say I was.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #12
  13. James Brown

    P.J. Plauger Guest

    "Richard Tobin" <> wrote in message
    news:dlo79b$r11$...

    > In article <>,
    > <> wrote:
    >>And UTF-16 encodes a little under 21 bits

    >
    > A little over 20 bits would be more accurate: 2^20 + 2^16.


    Right. And the minimum number of real bits needed to express
    20.087463 bits is...?
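
    (Spelled out: ceil(log2(2^20 + 2^16)) = ceil(20.087463) = 21.)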

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #13
  14. James Brown

    Paul Hsieh Guest

    Richard Tobin wrote:
    > In article <>,
    > <> wrote:
    > >And UTF-16 encodes a little under 21 bits

    >
    > A little over 20 bits would be more accurate: 2^20 + 2^16.


    Ah yes, I misremembered this. But you forgot to subtract out the
    escape hole itself:

    2^20 + 2^16 - 2*2^(20/2).

    Then you can take into account that U+FFFF is always illegal, and
    that only one of the two encodings, (xFFFE) or (xFEFF), can be legal
    in any single given stream (what this means is that, once decoded,
    U+FEFF is legal (and a basically content-free code point), while
    U+FFFE is not):

    2^20 + 2^16 - 2^11 - 2.

    All these complications come from UTF-16, but they have to be adopted
    by the other encodings (except the 0xFEFF nonsense) just to make them
    all consistent.
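
    Working those numbers through (an illustrative snippet, not from the
    original post):

        #include <stdio.h>

        int main(void)
        {
            long code_space = (1L << 20) + (1L << 16); /* 2^20 + 2^16 = 1,114,112 */
            long surrogates = 2L * (1L << 10);         /* 2 * 2^10   =     2,048 */

            printf("code points reachable by UTF-16:  %ld\n", code_space);
            printf("minus the surrogate escape hole:  %ld\n", code_space - surrogates);
            printf("minus U+FFFE and U+FFFF as well:  %ld\n", code_space - surrogates - 2);
            return 0;
        }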

    Then I don't know how you want to count the unassigned code points
    that are clearly within the range of certain code point categories.
    You *know* that those values will never be assigned and never have
    meaning, but they are not explicitly marked as illegal.

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    Paul Hsieh, Nov 19, 2005
    #14
  15. In article <>,
    P.J. Plauger <> wrote:

    >>>And UTF-16 encodes a little under 21 bits

    >>
    >> A little over 20 bits would be more accurate: 2^20 + 2^16.

    >
    >Right. And the minimum number of real bits needed to express
    >20.087463 bits is...?


    What is your point?

    -- Richard
     
    Richard Tobin, Nov 19, 2005
    #15
  16. James Brown

    P.J. Plauger Guest

    Re: [OT] Re: wchar_t

    "Skarmander" <> wrote in message
    news:437f5677$0$11075$4all.nl...

    > I'll mark it OT, since we've left C behind quite a bit by now.


    Not entirely, since this discussion goes to the very heart of
    why we (X3J11) made wchar_t a flexible type, much to the dismay
    of the various jingoists who know what the *right* representation
    should be. (Hint: they don't all agree.)

    >> Here's a coarse scale or two, just from personal experience.
    >>
    >> -- Number of address bits required to address a "large" memory:
    >>
    >> 1960 15 IBM 7090
    >> 1970 20 IBM 360
    >> 1980 25 VAX 11/780
    >> 1990 30 various
    >> 2000 35 various
    >>

    > Nice, but this misses a point: there is an upper limit.


    Okay, what *is* that upper limit? That *was* the point.
    Does anybody dare freeze it now?

    > Address bits will
    > not continue to grow indefinitely, because there is an upper limit to the
    > amount of information that will fit in the universe. Or maybe there isn't,
    > but then we're talking a radical shift in physics, which may happen but
    > doesn't allow for fair comparison anymore.


    Good. Now tell me the practical upper limit that we can use
    to standardize the all-singing, all-dancing physical address
    for now and all future times.

    >> -- Number of bits required to represent a (commonly used)
    >> character set:
    >>
    >> 1960 6 numerous vendor-specific codes
    >> 1970 7 7-bit ASCII
    >> 1980 8 extended ASCII
    >> 1990 16 DBCS and others
    >> 2000 21 Unicode
    >>
    >> I could make a similar table of "barely adequate" communication
    >> speeds, which also continue to expand exponentially.
    >>

    > But again: it can't go on forever. The question here, therefore, is
    > whether we've reached the end of the line, not whether exponential
    > expansion is happening.


    Yes, that's *exactly* the question I raised.

    >> So long as you think in terms of linear increases in demand
    >> for bytes or characters, it's easy to believe at each stage
    >> that you're through expanding. After all, you currently have
    >> a bit of headroom, and what possible need can there be for
    >> much larger programs/character sets?
    >>

    > Don't think this question hasn't been asked,


    I indeed *don't* think that. In fact, I believe I said something
    quite along those lines.

    > unlike those people who
    > asserted that "640K ought to be enough for anybody" (which Bill Gates
    > famously never said) or "16 bits ought to be enough, since it's better
    > than wasting 32 bits". Unicode doesn't say "21 bits ought to be enough for
    > anybody". It can say "21 bits is enough for every character known to man",
    > because it is. Unlike memory, communication speed and a host of other
    > things that keep growing, there is a conceivable upper limit, and it is
    > not that unreasonable to state we're close to it.


    It may not be unreasonable, but I maintain that, on the basis of
    history, it's wildly optimistic. IIRC, SC2/WG2 (the ISO committee
    corresponding to the Unicode Consortium) even saw fit to pass
    a resolution that UTF-16 will forever more be adequate to express
    all expansions of ISO 10646 (the ISO standard corresponding to
    Unicode). I consider that either a) a mark of remarkable self
    confidence, or b) whistling in the dark. Take your pick.

    >> I personally can't imagine that people will ever want to
    >> define common attribute bits for, say:
    >>
    >> -- roman, italic, bold, underscore
    >> -- red, green, blue
    >> -- point size
    >> -- font
    >>
    >> But if we did, each attribute bit would double the number
    >> of effective character codes, wouldn't it?
    >>

    >
    > That's why Unicode doesn't work that way, and no character set ever has.
    > They encode *characters*, not *glyphs*.


    I do understand that. Admittedly, the example of one possible
    cause for exponential expansion was a lightning rod.

    > The point is, effective comparison stops being useful at this point,
    > because you've shifted the way you look at what a code point represents.
    > As the Unicode FAQ itself states:
    >
    > "Both Unicode and ISO 10646 have policies in place that formally limit
    > future code assignment to the integer range that can be expressed with
    > current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other
    > UTFs) can represent larger integers, these policies mean that all
    > encoding forms will always represent the same set of characters. Over a
    > million possible codes is far more than enough for the goal of Unicode of
    > encoding characters, not glyphs. Unicode is not designed to encode
    > arbitrary data. If you wanted, for example, to give each 'instance of a
    > character on paper throughout history' its own code, you might need
    > trillions or quadrillions of such codes; noble as this effort might be,
    > you would not use Unicode for such an encoding."


    So I did recall correctly. The question I raised, however, was whether Unicode can
    resist the inevitable pressures to grow beyond their currently
    self-imposed barrier of 1,114,112 codes. Again IIRC, the Unicode
    Consortium parted company with SC2/WG2 years ago because the former
    body was convinced that 65,536 codes would be enough and the latter
    was intent on leaving room for 2^31. Microsoft and Sun backed that
    play, with Windows and Java (among other products) and now they
    have to wrestle with the inconvenience of UTF-16. BTW, I haven't
    noticed anybody in the Unicode camp blushing at their earlier
    hubris.

    >> Nor can I imagine that a large government like China might
    >> thumb its nose at an international standard and, say,
    >> require a parallel set of many ISO 10646 codes.
    >>

    > It already thumbs its nose to some extent. Unicode is still viewed with
    > great suspicion in some parts of the Eastern world, and alternate
    > character sets continue to be in use. But the Chinese government can
    > require of ISO 10646 what it wants; it's not likely to get it if it can't
    > be supported by technical requirements, as opposed to politics.


    Oh, my, I think you really believe that. When "politics" is backed
    by the odd billion dollars worth of contracts, you'd be surprised
    what it can get.

    >> For over 40 years I've been reading regular articles by
    >> pundits who explain why larger/faster hardware is a waste
    >> of time and will never sell. They've all been wrong. And
    >> the further back in time you look, the greater the redshift
    >> in the predictions.
    >>

    > These arguments do not cleanly translate to character sets, your little
    > tables notwithstanding. The upper limit may not be 21 bits, but if that's
    > not the upper limit, it's pretty close to it in orders of magnitude.


    Okay. My "argument" was that 21 bits will not long prove to be enough.
    Just one order of magnitude will be enough to blow UTF-16 to kingdom
    come. And that was my point.

    >> So, you may well be right that the need for larger
    >> character sets has finally come to an end. I'll wait
    >> and see. Meanwhile, I make sure that the code I write
    >> will work with 32- (not 31-) bit character sets. With
    >> any luck, the code will have adequate capacity until
    >> I retire...
    >>

    > Fortunately for you, writing code that can handle both 21-bit and 32-bit
    > character sets is hardly a challenge, given the current state of computer
    > hardware. Even if Unicode had to grow someday (which would have to mean a
    > new standard, of course), it wouldn't exactly be hard to implement, at
    > least not as far as code point size is concerned.


    Also my point. Having just survived several years of UTF-16
    jingoism, however, I expect to be ungracious if Unicode does
    indeed have to issue a new standard that leaves UTF-16 in the
    same rest home as UCS-2. I also hope to remain intellectually
    honest enough to issue a mea culpa in five years if I prove
    to be wrong.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #16
  17. James Brown

    P.J. Plauger Guest

    "Richard Tobin" <> wrote in message
    news:dlocs3$sq6$...

    > In article <>,
    > P.J. Plauger <> wrote:
    >
    >>>>And UTF-16 encodes a little under 21 bits
    >>>
    >>> A little over 20 bits would be more accurate: 2^20 + 2^16.

    >>
    >>Right. And the minimum number of real bits needed to express
    >>20.087463 bits is...?

    >
    > What is your point?


    That saying "a little over 20 bits" rather than 21 bits
    is asinine nit picking.

    P.J. Plauger
    Dinkumware, Ltd.
    http://www.dinkumware.com
     
    P.J. Plauger, Nov 19, 2005
    #17
  18. In article <>,
    P.J. Plauger <> wrote:
    >That saying "a little over 20 bits" rather than 21 bits
    >is asinine nit picking.


    It wasn't "21 bits", it was "a little under 21 bits".
    I thought that might reflect a misunderstanding by the poster.

    My posting was intended to be helpful, yours was just rude.

    -- Richard
     
    Richard Tobin, Nov 20, 2005
    #18
  19. Re: [OT] Re: wchar_t

    In article <> "P.J. Plauger" <> writes:
    > "Skarmander" <> wrote in message
    > news:437f5677$0$11075$4all.nl...

    ....
    > >> -- Number of bits required to represent a (commonly used)
    > >> character set:
    > >>
    > >> 1960 6 numerous vendor-specific codes
    > >> 1970 7 7-bit ASCII
    > >> 1980 8 extended ASCII
    > >> 1990 16 DBCS and others
    > >> 2000 21 Unicode
    > >>
    > >> I could make a similar table of "barely adequate" communication
    > >> speeds, which also continue to expand exponentially.


    I may note that 6, 7 and 8 bits were barely adequate, even in those
    years. That is why there was a plethora of standards. With the
    current version of Unicode, under 100,000 positions are filled with
    characters, so it could be done in 17 bits (but actually three planes
    are used for it). The base plane has slightly over 50,000 code points
    used, plane 1 only something less than 2500 (used for archaic scripts)
    and plane 2 under 45,000 (used for archaic Chinese characters). It is
    likely that all current and past scripts will fit into those 21 bits,
    and it is unlikely that new scripts will be invented. So the need
    for more than 21 bits is unlikely to come up.

    > > Unlike memory, communication speed and a host of other
    > > things that keep growing, there is a conceivable upper limit, and it is
    > > not that unreasonable to state we're close to it.

    >
    > It may not be unreasonable, but I maintain that, on the basis of
    > history, it's wildly optimistic. IIRC, SC2/WG2 (the ISO committee
    > corresponding to the Unicode Consortium) even saw fit to pass
    > a resolution that UTF-16 will forever more be adequate to express
    > all expansions of ISO 10646 (the ISO standard corresponding to
    > Unicode). I consider that either a) a mark of remarkable self
    > confidence, or b) whistling in the dark. Take your pick.


    It is possible to estimate the number of symbols used in current and
    archaic scripts. And you can be pretty confident that that number will
    not grow very much in time.

    > >> Nor can I imagine that a large government like China might
    > >> thumb its nose at an international standard and, say,
    > >> require a parallel set of many ISO 10646 codes.
    > >>

    > > It already thumbs its nose to some extent. Unicode is still viewed with
    > > great suspicion in some parts of the Eastern world, and alternate
    > > character sets continue to be in use. But the Chinese government can
    > > require of ISO 10646 what it wants; it's not likely to get it if it can't
    > > be supported by technical requirements, as opposed to politics.

    >
    > Oh, my, I think you really believe that. When "politics" is backed
    > by the odd billion dollars worth of contracts, you'd be surprised
    > what it can get.


    Strangely enough, most Chinese code points are derived from Taiwanese
    standards (actually the vast majority).

    > > These arguments do not cleanly translate to character sets, your little
    > > tables notwithstanding. The upper limit may not be 21 bits, but if that's
    > > not the upper limit, it's pretty close to it in orders of magnitude.

    >
    > Okay. My "argument" was that 21 bits will not long prove to be enough.
    > Just one order of magnitude will be enough to blow UTF-16 to kingdom
    > come. And that was my point.


    Well, as 18 bits would currently encode all defined code points in
    Unicode with ease, we still have a long way to go. In the history of
    Unicode, the large increases were with 3.0 (an increase of 10,307
    code points) and with 3.1 (an increase of 44,978 code points, mostly
    plane 2). The initial set had 29,929 code points (1991). Other
    increases have been on the order of 4,000 to 5,000 (the initial
    years) and 1,000 to 1,500 (since 2001). It is really pretty certain
    that no script will be found that demands an increase as large as the
    increases of 1999 or 2001. With the current growth, Unicode will be
    filled by version 40.0 or something like that, in about 180 years.

    > Also my point. Having just survived several years of UTF-16
    > jingoism, however, I expect to be ungracious if Unicode does
    > indeed have to issue a new standard that leaves UTF-16 in the
    > same rest home as UCS-2. I also hope to remain intellectually
    > honest enough to issue a mea culpa in five years if I prove
    > to be wrong.


    The probability that UTF-16 is not enough in five years is 0. Even
    if the number of assigned code points doubles every five years (which
    is faster growth than in the first 15 years), it will be sufficient
    until 2020.
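
    (As a rough check with the figures above: just under 100,000 code
    points assigned in 2005, doubled three times by 2020, is still under
    800,000, comfortably below the roughly 1.1 million values UTF-16 can
    reach.)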
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Nov 20, 2005
    #19
  20. In article <> "P.J. Plauger" <> writes:
    ....
    > Does it make sense to have several different ways to express the
    > same "character", some involving multiple codes in arbitrary order?
    > Particularly when there's a one-element version that does the job?
    > Who would do a thing like that in an international standard?


    There is such a thing as round-trip compatibility with other
    standards, meaning that if there is another standard, each code in
    that other standard should translate to a single code point in
    Unicode and should not use multiple codes. For the CJK set these are
    the compatibility regions. It is a bit unlucky that, because of this,
    an a with acute can be encoded both as a single code and as the code
    for the letter a followed by the non-spacing acute. But actually the
    former corresponds to ISO-8859 and the latter to ASCII. We are lucky
    that there is no Indian standard that encoded the different ligatures
    used in Devanagari (or any of the other Indian scripts). Also there
    is no Arabic standard that had different encodings for the letters
    depending on position. Korean grew large in Unicode because, although
    it uses an alphabetic script, there *was* a coding that encoded
    syllables. So I think there are good reasons.
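
    A small sketch of the two spellings in question (a hypothetical
    example, not from the original post; it assumes wchar_t holds ISO
    10646 code points, as a compiler defining __STDC_ISO_10646__
    promises):

        #include <stdio.h>
        #include <wchar.h>

        int main(void)
        {
            wchar_t precomposed[] = { 0x00E1, 0 };         /* a with acute, one code */
            wchar_t decomposed[]  = { 0x0061, 0x0301, 0 }; /* 'a' + combining acute  */

            /* the same text to a reader, yet the arrays compare unequal */
            printf("wcscmp: %d, lengths: %lu and %lu\n",
                   wcscmp(precomposed, decomposed),
                   (unsigned long)wcslen(precomposed),
                   (unsigned long)wcslen(decomposed));
            return 0;
        }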
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Nov 20, 2005
    #20
