Why No Supplemental Characters In Character Literals?

Discussion in 'Java' started by Lawrence D'Oliveiro, Feb 4, 2011.

  1. Why was it decreed in the language spec that characters beyond U+FFFF are
    not allowed in character literals, when they are allowed everywhere else (in
    string literals, in the program text, in character and string values etc)?
    Lawrence D'Oliveiro, Feb 4, 2011
    #1

  2. Lew Guest

    On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    > Why was it decreed in the language spec that characters beyond U+FFFF are
    > not allowed in character literals, when they are allowed everywhere else (in
    > string literals, in the program text, in character and string values etc)?


    Because a 'char' type holds only 16 bits.

    --
    Lew
    Ceci n'est pas une fenêtre.
    [ASCII art of a window]
    Lew, Feb 4, 2011
    #2

  3. In message <iig6j2$dul$>, Lew wrote:

    > On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >
    >> Why was it decreed in the language spec that characters beyond U+FFFF are
    >> not allowed in character literals, when they are allowed everywhere else
    >> (in string literals, in the program text, in character and string values
    >> etc)?

    >
    > Because a 'char' type holds only 16 bits.


    No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    character and string values. Which you are.
    Lawrence D'Oliveiro, Feb 4, 2011
    #3
  4. "Lawrence D'Oliveiro" <_zealand> wrote in message
    news:iig84e$uqu$...
    > In message <iig6j2$dul$>, Lew wrote:
    >
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>
    >>> Why was it decreed in the language spec that characters beyond U+FFFF
    >>> are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters
    > in
    > character and string values. Which you are.


    Yes, it does (contain 16 bits). It was defined to do so before there were
    supplemental characters, and there was no way to extend it without breaking
    compatibility with some older programs.

    You can't put a supplementary character in a char. You can put them in
    strings, but only encoded as UTF-16, i.e. into two 16-bit chars.
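
    For instance (a minimal sketch, not from the original post; U+1D11E
    MUSICAL SYMBOL G CLEF stands in as an arbitrary supplementary
    character):

    public class SupplementaryDemo
    {
        public static void main( String[] args )
        {
            // One supplementary character, written as its surrogate pair:
            String s = "\uD834\uDD1E";
            System.out.println( s.length() );                        // 2 code units
            System.out.println( s.codePointCount( 0, s.length() ) ); // 1 code point
            // char c = '\uD834\uDD1E';  // won't compile: two code units, one char
        }
    }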
    Mike Schilling, Feb 4, 2011
    #4
  5. Lew Guest

    Lawrence D'Oliveiro wrote:
    >>>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>>> not allowed in character literals, when they are allowed everywhere else
    >>>> (in string literals, in the program text, in character and string values
    >>>> etc)?


    It takes TWO 'char' values to represent a supplemental character. 'char' !=
    "character".

    READ the documentation.

    Lew wrote:
    >>> Because a 'char' type holds only 16 bits.


    Lawrence D'Oliveiro wrote:
    >> No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    >> character and string values. Which you are.


    I have an idea for you to try - check the documentation.
    <http://java.sun.com/docs/books/jls/third_edition/html/typesValues.html#4.2.1>

    where you will see in §4.2: "... char, whose values are 16-bit unsigned integers ..."

    Mike Schilling wrote:
    > Yes, it does (contain 16 bits.) It was defined to do so before there were
    > supplemental characters, and there was no way to extend it without breaking
    > compatibility with some older programs.
    >
    > You can't put a supplementary character in a char. You can put them in
    > strings, but only encoded as UTF-16, i.e. into two 16-bit chars.


    As the tutorials and JLS tell you, should you deign to read the documentation.
    (It's not a bad idea to do so.)
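
    A two-line check of that distinction (a sketch, using U+1D504 as an
    arbitrary supplementary code point):

    public class CharCountDemo
    {
        public static void main( String[] args )
        {
            // How many chars does each character need?
            System.out.println( Character.charCount( 0x00E9 ) );  // 1 -- é is BMP
            System.out.println( Character.charCount( 0x1D504 ) ); // 2 -- surrogate pair
        }
    }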

    --
    Lew
    Ceci n'est pas une fenêtre.
    [ASCII art of a window]
    Lew, Feb 4, 2011
    #5
  6. On 02/04/2011 01:59 AM, Lawrence D'Oliveiro wrote:
    > In message<iig6j2$dul$>, Lew wrote:
    >
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.


    The JLS clearly states that a char is an unsigned 16-bit value. Non-BMP
    Unicode characters cannot fit in a single unsigned 16-bit value. In the
    other contexts you can use non-BMP characters because, e.g., a String is
    not a single 16-bit value but an array of them, and can therefore safely
    hold a surrogate pair.
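
    Concretely, at the literal level (a sketch; the commented-out line is
    the one the spec rejects):

    public class LiteralDemo
    {
        public static void main( String[] args )
        {
            String ok = "\uD835\uDD04"; // U+1D504 in a String: two code units, fine
            char hi = '\uD835';         // a lone surrogate is still a legal char
            // char bad = '\uD835\uDD04'; // rejected: two code units in one char
            System.out.println( ok + " " + (int) hi );
        }
    }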

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 4, 2011
    #6
  7. On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    > In message<iig6j2$dul$>, Lew wrote:
    >> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >>
    >> Because a 'char' type holds only 16 bits.

    >
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.


    It is very clearly specified that a Java char is 16 bit.

    You can't have the codepoints above U+FFFF in a char.

    You can have them in a string, but then they actually take
    two chars in that string.

    It is rather messy.

    If you look at the Java docs for String class you will see:

    charAt & codePointAt
    length & codePointCount

    which is not a nice API.

    But since codepoints above U+FFFF were added after the String
    class was defined, the options for how to handle it were
    pretty limited.
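
    A sketch of that split in action ('A', then U+1D504, then 'B'):

    public class CodePointApiDemo
    {
        public static void main( String[] args )
        {
            String s = "A\uD835\uDD04B";
            System.out.println( s.length() );                        // 4 code units
            System.out.println( s.codePointCount( 0, s.length() ) ); // 3 characters
            System.out.println( (int) s.charAt( 1 ) );  // 55349 (0xD835): a lone surrogate
            System.out.println( s.codePointAt( 1 ) );   // 120068 (0x1D504): the full code point
        }
    }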

    Arne
    Arne Vajhøj, Feb 4, 2011
    #7
  8. "Arne Vajhøj" <> wrote in message
    news:4d4c2019$0$23753$...
    > On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    >> In message<iig6j2$dul$>, Lew wrote:
    >>> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>>> Why was it decreed in the language spec that characters beyond U+FFFF
    >>>> are
    >>>> not allowed in character literals, when they are allowed everywhere
    >>>> else
    >>>> (in string literals, in the program text, in character and string
    >>>> values
    >>>> etc)?
    >>>
    >>> Because a 'char' type holds only 16 bits.

    >>
    >> No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters
    >> in
    >> character and string values. Which you are.

    >
    > It is very clearly specified that a Java char is 16 bit.
    >
    > You can't have the codepoints above U+FFFF in a char.
    >
    > You can have them in a string, but then they actually take
    > two chars in that string.
    >
    > It is rather messy.
    >
    > If you look at the Java docs for String class you will see:
    >
    > charAt & codePointAt
    > length & codePointCount
    >
    > which is not a nice API.
    >
    > But since codepoints above U+FFFF were added after the String
    > class was defined, the options for how to handle it were
    > pretty limited.


    The sticky issue is, I think, that chars were defined as 16-bit. If that
    had been left undefined, they could have been extended to 24 bits, which
    would make things nice and regular again.
    Mike Schilling, Feb 4, 2011
    #8
  9. On 04-02-2011 12:10, Mike Schilling wrote:
    >
    >
    > "Arne Vajhøj" <> wrote in message
    > news:4d4c2019$0$23753$...
    >> On 04-02-2011 01:59, Lawrence D'Oliveiro wrote:
    >>> In message<iig6j2$dul$>, Lew wrote:
    >>>> On 02/04/2011 12:59 AM, Lawrence D'Oliveiro wrote:
    >>>>> Why was it decreed in the language spec that characters beyond
    >>>>> U+FFFF are
    >>>>> not allowed in character literals, when they are allowed everywhere
    >>>>> else
    >>>>> (in string literals, in the program text, in character and string
    >>>>> values
    >>>>> etc)?
    >>>>
    >>>> Because a 'char' type holds only 16 bits.
    >>>
    >>> No it doesn’t. Otherwise you wouldn’t be allowed supplementary
    >>> characters in
    >>> character and string values. Which you are.

    >>
    >> It is very clearly specified that a Java char is 16 bit.
    >>
    >> You can't have the codepoints above U+FFFF in a char.
    >>
    >> You can have them in a string, but then they actually take
    >> two chars in that string.
    >>
    >> It is rather messy.
    >>
    >> If you look at the Java docs for String class you will see:
    >>
    >> charAt & codePointAt
    >> length & codePointCount
    >>
    >> which is not a nice API.
    >>
    >> But since codepoints above U+FFFF were added after the String
    >> class was defined, the options for how to handle it were
    >> pretty limited.

    >
    > The sticky issue is, I think, that chars were defined as 16-bit. If that
    > had been left undefined, they could have been extended to 24 bits, which
    > would make things nice and regular again.


    Yes.

    But having specific bit lengths for all types was a huge jump
    forward compared to C89 in the predictability of what code
    would do.

    Arne
    Arne Vajhøj, Feb 4, 2011
    #9
  10. On 04/02/2011 16:49, Arne Vajhøj allegedly wrote:
    > It is very clearly specified that a Java char is 16 bit.
    >
    > You can't have the codepoints above U+FFFF in a char.
    >
    > You can have them in a string, but then they actually take
    > two chars in that string.
    >
    > It is rather messy.
    >
    > If you look at the Java docs for String class you will see:
    >
    > charAt & codePointAt
    > length & codePointCount
    >
    > which is not a nice API.
    >
    > But since codepoints above U+FFFF were added after the String
    > class was defined, the options for how to handle it were
    > pretty limited.


    They've added supplementary character support to String, StringBuilder,
    StringBuffer.

    Pity they haven't touched upon java.lang.CharSequence. Probably out of
    concerns about compatibility.

    Anyone got an idea how supplementary character support could be
    integrated with CharSequence, or more generally, with an interface
    describing a sequence of code points? Creating a sub-interface, e.g.
    UnicodeSequence with int codePointAt(int), etc., doesn't seem like it'd
    do the trick, since a UnicodeSequence /is-not/ a CharSequence (char
    charAt(int) doesn't make sense for a UnicodeSequence). Adding a new
    interface would mean you don't get the interoperability with all the
    parts of the API that use CharSequences... The only option would seem
    to be to refactor CharSequence and all the classes that use or implement
    it. Which means no backwards compatibility.

    Bloody mess this is.
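
    For what it's worth, a hypothetical sketch of that shape (nothing like
    it exists in the JDK; the names beyond codePointAt are invented here):

    // Hypothetical only; deliberately NOT a CharSequence:
    // indices count code points, not chars.
    interface UnicodeSequence
    {
        int codePointAt( int index );  // index in code points
        int codePointCount();          // invented, by analogy with String
        CharSequence asCharSequence(); // explicit bridge to the legacy API
    }

    The explicit bridge method is the trade-off in miniature: you get a
    coherent code-point view, but every existing API that takes a
    CharSequence needs a conversion call.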

    --
    DF.
    Daniele Futtorovic, Feb 4, 2011
    #10
  11. Roedy Green Guest

    On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
    <_zealand> wrote, quoted or indirectly quoted
    someone who said :

    >Why was it decreed in the language spec that characters beyond U+FFFF are
    >not allowed in character literals, when they are allowed everywhere else (in
    >string literals, in the program text, in character and string values etc)?


    Because they did not exist at the time Java was invented. Extended
    literals were tacked on to the 16-bit internal scheme in a somewhat
    half-hearted way. To go to full 32-bit internally would gobble RAM
    hugely.

    Java does not have 32-bit string literals, i.e. C-style code points
    such as \U0001d504 (note the capital U vs. the usual \ud504). I wrote
    the SurrogatePair applet (see
    http://mindprod.com/applet/surrogatepair.html)
    to convert C-style code points to the arcane surrogate pairs, to let
    you use 32-bit Unicode glyphs in your programs.
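
    The standard library will do the same conversion (a minimal sketch,
    reusing U+1D504 from the example above):

    public class ToCharsDemo
    {
        public static void main( String[] args )
        {
            // Code point -> UTF-16 surrogate pair
            char[] pair = Character.toChars( 0x1D504 );
            System.out.printf( "\\u%04X\\u%04X%n",
                (int) pair[0], (int) pair[1] );  // prints \uD835\uDD04
        }
    }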


    Personally, I don’t see the point of any great rush to support 32-bit
    Unicode. The new symbols will be rarely used. Consider what’s there.
    The only ones I would conceivably use are musical symbols and
    Mathematical Alphanumeric Symbols (especially the German black letters
    so favoured in real analysis). The rest I can’t imagine ever using
    unless I took up a career in anthropology, e.g. the Linear B syllabary (I
    have not a clue what it is), Linear B ideograms (looks like symbols
    for categorising cave petroglyphs), Aegean Numbers (counting with
    stones and sticks), Old Italic (looks like Phoenician), Gothic
    (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
    (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
    syllabary, Byzantine music symbols (looks like Arabic), Musical
    Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
    extensions (Chinese/Japanese/Korean) and tags (letters with blank
    “price tags”).


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    To err is human, but to really foul things up requires a computer.
    ~ Farmer's Almanac
    It is breathtaking how a misplaced comma in a computer program can
    shred megabytes of data in seconds.
    Roedy Green, Feb 4, 2011
    #11
  12. Roedy Green Guest

    On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
    <> wrote, quoted or indirectly quoted someone
    who said :

    >The JLS clearly states that a char is an unsigned 16-bit value.


    Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
    echar type will be invented.

    It is an intractable problem. Consider the logic that uses indexOf and
    substring with character-index arithmetic. Most of it would go insane
    if you threw a few 32-bit chars in there. You need something that
    simulates an array of 32-bit chars to the programmer.
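
    A sketch of that breakage with today's 16-bit chars (U+1D11E again):

    public class IndexArithmeticDemo
    {
        public static void main( String[] args )
        {
            String s = "\uD834\uDD1E clef";            // one character, then " clef"
            System.out.println( s.indexOf( ' ' ) );    // 2 -- a code-unit index
            System.out.println( s.substring( 0, 1 ) ); // a lone surrogate: broken text
        }
    }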

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    To err is human, but to really foul things up requires a computer.
    ~ Farmer's Almanac
    It is breathtaking how a misplaced comma in a computer program can
    shred megabytes of data in seconds.
    Roedy Green, Feb 4, 2011
    #12
  13. On 02/04/2011 12:10 PM, Mike Schilling wrote:
    > "Arne Vajhøj" <> wrote in message
    >> But since codepoints above U+FFFF was added after the String
    >> class was defined, then the options on how to handle it were
    >> pretty limited.

    >
    > The sticky issue is, I think, that chars were defined as 16-bit. If that
    > had been left undefined, they could have been extended to 24 bits, which
    > would make things nice and regular again.


    Well, the real problem is that Unicode swore that 16 bits were enough
    for everybody, so people opted for the UTF-16 encoding in Unicode-aware
    platforms (e.g., Windows uses 16-bit char values for wchar_t). When they
    backtracked and expanded the range to 21 bits, every system that did
    UTF-16 was screwed, because UTF-16 "kind of" becomes a
    variable-width format like UTF-8... but not really. Instead you get a
    mess with surrogate characters, the distinction between UTF-16 and
    UCS-2, and, in short, anything not in the Basic Multilingual Plane is a
    recipe for disaster.

    Extending to 24 bits is problematic because 24 bits opens you up to
    unaligned memory access on most, if not all, platforms, so you'd have to
    go fully up to 32 bits (this is what the codePoint methods in String et
    al. do). But considering the sheer number of Strings in memory, going to
    32-bit storage doubles the size of that data... and can increase total
    memory consumption in some cases by 30-40%.

    To make a long story short: Unicode made a very, very big mistake, and
    everyone who designed their systems to be particularly i18n-aware before
    that is now really smarting as a result.

    It actually is possible to change the internal storage of String to a
    UTF-8 representation (while keeping UTF-16/UTF-32 API access) and still
    get good performance--people mostly use direct indexes into strings in
    largely consistent access patterns (e.g., str.substring(str.indexOf(":")
    + 1)), so you can cache index lookup tables for a few values. It's ugly
    as hell to code properly, taking into account proper multithreading,
    etc., but it is not impossible.
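
    A toy of that idea (all names invented; the cache is a single entry,
    and thread safety and validation are deliberately elided):

    import java.nio.charset.Charset;

    // Stores UTF-8 bytes and remembers the byte offset of the last code
    // point index it resolved, so forward scans stay cheap.
    final class Utf8Backed
    {
        private static final Charset UTF8 = Charset.forName( "UTF-8" );

        private final byte[] bytes;
        private int lastCp = 0, lastByte = 0;   // one-entry index cache

        Utf8Backed( String s )
        {
            bytes = s.getBytes( UTF8 );
        }

        int codePointAt( int cpIndex )
        {
            int cp = 0, b = 0;
            if ( cpIndex >= lastCp ) { cp = lastCp; b = lastByte; }  // reuse the cache
            while ( cp < cpIndex )
            {
                int lead = bytes[b] & 0xFF;  // sequence length from the UTF-8 lead byte
                b += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
                cp++;
            }
            lastCp = cpIndex;
            lastByte = b;
            // Decode just the sequence that starts at b:
            return new String( bytes, b, Math.min( 4, bytes.length - b ), UTF8 )
                .codePointAt( 0 );
        }
    }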

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Feb 4, 2011
    #13
  14. markspace Guest

    On 2/4/2011 10:36 AM, Roedy Green wrote:
    > On Fri, 04 Feb 2011 08:04:23 -0500, Joshua Cranmer
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> The JLS clearly states that a char is an unsigned 16-bit value.

    >
    > Perhaps char will be redefined as 32 bits, or a new unsigned 32-bit
    > echar type will be invented.



    An int is currently used for this purpose. For example,
    Character.codePointAt(CharSequence,int) returns an int.


    <http://download.oracle.com/javase/6/docs/api/java/lang/Character.html>


    Also, from that same page, this explains the whole story in one go:


    "Unicode Character Representations

    "The char data type (and therefore the value that a Character object
    encapsulates) are based on the original Unicode specification, which
    defined characters as fixed-width 16-bit entities. The Unicode standard
    has since been changed to allow for characters whose representation
    requires more than 16 bits. The range of legal code points is now U+0000
    to U+10FFFF, known as Unicode scalar value. (Refer to the definition of
    the U+n notation in the Unicode standard.)

    "The set of characters from U+0000 to U+FFFF is sometimes referred to as
    the Basic Multilingual Plane (BMP). Characters whose code points are
    greater than U+FFFF are called supplementary characters. The Java 2
    platform uses the UTF-16 representation in char arrays and in the String
    and StringBuffer classes. In this representation, supplementary
    characters are represented as a pair of char values, the first from the
    high-surrogates range, (\uD800-\uDBFF), the second from the
    low-surrogates range (\uDC00-\uDFFF).

    "A char value, therefore, represents Basic Multilingual Plane (BMP) code
    points, including the surrogate code points, or code units of the UTF-16
    encoding. An int value represents all Unicode code points, including
    supplementary code points. The lower (least significant) 21 bits of int
    are used to represent Unicode code points and the upper (most
    significant) 11 bits must be zero.


    ....etc....
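
    That quoted passage, condensed into code (a minimal sketch):

    public class CodePointRangeDemo
    {
        public static void main( String[] args )
        {
            System.out.printf( "U+%04X%n", Character.MAX_CODE_POINT ); // U+10FFFF
            // 0x10FFFF needs 21 bits; the upper 11 bits of an int stay zero:
            System.out.println(
                Integer.toBinaryString( Character.MAX_CODE_POINT ).length() ); // 21
            System.out.println( Character.isSupplementaryCodePoint( 0x1D11E ) ); // true
        }
    }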
    markspace, Feb 4, 2011
    #14
  15. markspace Guest

    On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:

    > Pity they haven't touched upon java.lang.CharSequence. Probably out of
    > concerns about compatibility.



    You know that Character has static methods for pulling code points out
    of a CharSequence, right?
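
    For example (a minimal sketch of those helpers):

    public class CharSequenceHelpersDemo
    {
        public static void main( String[] args )
        {
            CharSequence cs = "x\uD834\uDD1Ey";
            // Read the full code point whose pair starts at index 1:
            System.out.printf( "U+%04X%n", Character.codePointAt( cs, 1 ) ); // U+1D11E
            // Step over the whole pair, not just one char:
            System.out.println( Character.offsetByCodePoints( cs, 1, 1 ) );  // 3
        }
    }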
    markspace, Feb 4, 2011
    #15
  16. Lew Guest

    Lawrence D'Oliveiro wrote:
    >>> Why was it decreed in the language spec that characters beyond U+FFFF are
    >>> not allowed in character literals, when they are allowed everywhere else
    >>> (in string literals, in the program text, in character and string values
    >>> etc)?

    >


    Lew wrote:
    >> Because a 'char' type holds only 16 bits.

    >


    Lawrence D'Oliveiro wrote:
    > No it doesn’t. Otherwise you wouldn’t be allowed supplementary characters in
    > character and string values. Which you are.
    >


    /* DemoChar */
    package eg;

    public class DemoChar
    {
        public static void main( String [] args )
        {
            // char arithmetic promotes to int, so this prints 65536:
            System.out.println( "Character.MAX_VALUE + 1 = "
                + (Character.MAX_VALUE + 1) );

            char foo1, foo2;
            foo1 = (char) (Character.MAX_VALUE - 1);  // 65534
            foo2 = (char) (foo1 / 2);                 // 32767
            System.out.println( "foo1 = "+ (int) foo1 +", foo2 = "+ (int) foo2 );

            foo1 = '§';  // U+00A7, value 167
            foo2 = '@';  // U+0040, value 64
            char sum = (char) (foo1 + foo2);  // 231 = 'ç'
            System.out.println( "foo1 + foo2 = "+ sum );
        }
    }

    --
    Lew
    Lew, Feb 4, 2011
    #16
  17. On 04/02/2011 20:27, markspace allegedly wrote:
    > On 2/4/2011 9:37 AM, Daniele Futtorovic wrote:
    >
    >> Pity they haven't touched upon java.lang.CharSequence. Probably out of
    >> concerns about compatibility.

    >
    >
    > You know that Character has static methods for pulling code points out
    > of a CharSequence, right?


    Yeah. But that's not quite the same thing, is it? What with OOP and all.
    Daniele Futtorovic, Feb 4, 2011
    #17
  18. On 04/02/2011 19:26, Roedy Green allegedly wrote:
    > On Fri, 04 Feb 2011 18:59:30 +1300, Lawrence D'Oliveiro
    > <_zealand> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> Why was it decreed in the language spec that characters beyond U+FFFF are
    >> not allowed in character literals, when they are allowed everywhere else (in
    >> string literals, in the program text, in character and string values etc)?

    >
    > Because they did not exist at the time Java was invented. Extended
    > literals were tacked on to the 16-bit internal scheme in a somewhat
    > half-hearted way. To go to full 32-bit internally would gobble RAM
    > hugely.
    >
    > Java does not have 32-bit string literals, i.e. C-style code points
    > such as \U0001d504 (note the capital U vs. the usual \ud504). I wrote
    > the SurrogatePair applet (see
    > http://mindprod.com/applet/surrogatepair.html)
    > to convert C-style code points to the arcane surrogate pairs, to let
    > you use 32-bit Unicode glyphs in your programs.
    >
    >
    > Personally, I don’t see the point of any great rush to support 32-bit
    > Unicode. The new symbols will be rarely used. Consider what’s there.
    > The only ones I would conceivably use are musical symbols and
    > Mathematical Alphanumeric symbols (especially the German black letters
    > so favoured in real analysis). The rest I can’t imagine ever using
    > unless I took up a career in anthropology, e.g. the Linear B syllabary (I
    > have not a clue what it is), Linear B ideograms (looks like symbols
    > for categorising cave petroglyphs), Aegean Numbers (counting with
    > stones and sticks), Old Italic (looks like Phoenician), Gothic
    > (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
    > (George Bernard Shaw’s phonetic script), Osmanya (Somalian), Cypriot
    > syllabary, Byzantine music symbols (looks like Arabic), Musical
    > Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
    > extensions(Chinese Japanese Korean) and tags (letters with blank
    > “price tags”).


    And Klingon!

    --
    DF.
    Daniele Futtorovic, Feb 4, 2011
    #18
  19. Tom Anderson Guest

    Efficient unicode string implementation (was: Re: Why No Supplemental Characters In Character Literals?)

    On Fri, 4 Feb 2011, Joshua Cranmer wrote:

    >> "Arne Vajhøj" <> wrote in message
    >>
    >>> But since codepoints above U+FFFF were added after the String class was
    >>> defined, the options for how to handle it were pretty limited.

    >
    > Extending to 24 bits is problematic because 24 bits opens you up to
    > unaligned memory access on most, if not all, platforms, so you'd have to
    > go fully up to 32 bits (this is what the codePoint methods in String et
    > al. do). But considering the sheer number of Strings in memory, going to
    > 32-bit storage doubles the size of that data... and can increase total
    > memory consumption in some cases by 30-40%.


    This is something i ponder quite a lot.

    It's essential that computers be able to represent characters from any
    living human script. The astral planes include some such characters,
    notably in the CJK extensions, without which it is impossible to write
    some people's names correctly. The necessity of supporting more than 2**16
    codepoints is simply beyond question.

    The problem is how to do it efficiently.

    Going to strings of 24- or 32-bit characters would indeed be prohibitive
    in its effect on memory. But isn't 16-bit already an eye-watering waste?
    Most characters currently sitting in RAM around the world are, i would
    wager, in the ASCII range: the great majority of characters in almost any
    text in a latin script will be ASCII, in that they won't have diacritics
    [1] (and most text is still in latin script), and almost all characters in
    non-natural-language text (HTML and XML markup, configuration files,
    filesystem paths) will be ASCII. A sizeable fraction of non-latin text is
    still encodable in one byte per character, using a national character set.
    Forcing all users of programs written in Java (or any other platform which
    uses UCS-2 encoding) to spend two bytes on each of those characters to
    ease the lives of the minority of users who store a lot of CJK text seems
    wildly regressive.

    I am, however, at a loss to suggest a practical alternative!

    A question to the house, then: has anyone ever invented a data structure
    for strings which allows space-efficient storage for strings in different
    scripts, but also allows time-efficient implementation of the common
    string operations?

    Upthread, Joshua mentions the idea of using UTF-8 strings and caching
    codepoint-to-bytepoint mappings. That's certainly an approach that would
    work, although i worry about the performance effect of generating so many
    writes, the difficulty of making it correct in multithreaded systems, and
    the dependency on a good cache hit rate to make it pay off.

    Anyone else?

    For extra credit, give a representation which also makes it simple and
    efficient to do normalisation, reversal, and "find the first occurrence of
    this character, ignoring diacritics".

    tom

    [1] I would be interested to hear of a language (more properly, an
    orthography) using latin script in which a majority of characters, or even
    an unusually large fraction, do have diacritics. The pinyin romanisation
    of Mandarin uses a lot of accents. Hawaiian uses quite a lot. Some ways of
    writing ancient Greek use a lot of diacritics, for breathings and accents
    and in verse, for long and short syllables.

    --
    Understand the world we're living in
    Tom Anderson, Feb 4, 2011
    #19
  20. In message <iigcva$90q$-september.org>, Mike Schilling wrote:

    > Yes, it does (contain 16 bits).


    Yeah, I didn’t realize it was spelled out that way in the original language
    spec. What a short-sighted decision.

    > It was defined to do so before there were supplemental characters ...


    Why was there a need to define the size of a character at all? Even in the
    early days of the unification of Unicode and ISO-10646, there was already
    provision for UCS-4. Did they really think that could safely be ignored?
    Lawrence D'Oliveiro, Feb 4, 2011
    #20