Re: languages with full unicode support

Discussion in 'Python' started by Tim Roberts, Jun 28, 2006.

  1. Tim Roberts

    Tim Roberts Guest

    "Xah Lee" <> wrote:

    >Languages with Full Unicode Support
    >
    >As far as I know, Java and JavaScript are languages with full, complete
    >Unicode support. That is, they allow names to be defined using Unicode.
    >(The JavaScript engine used by Firefox supports this.)
    >
    >As far as I know, here's the status of a few other languages:
    >
    >C ? No.


    This is implementation-defined in C. A compiler is allowed to accept
    variable names with alphabetic Unicode characters outside of ASCII.
    --
    - Tim Roberts,
    Providenza & Boekelheide, Inc.
     
    Tim Roberts, Jun 28, 2006
    #1

  2. Tim Roberts schrieb:
    > "Xah Lee" <> wrote:
    >> C ? No.

    >
    > This is implementation-defined in C. A compiler is allowed to accept
    > variable names with alphabetic Unicode characters outside of ASCII.


    Hmm... that code would be nonportable, so C support for Unicode is
    half-baked at best.

    Regards,
    Jo
     
    Joachim Durchholz, Jun 28, 2006
    #2

  3. Tim Roberts wrote:
    > "Xah Lee" <> wrote:
    >
    >>Languages with Full Unicode Support
    >>
    >>As far as I know, Java and JavaScript are languages with full, complete
    >>Unicode support. That is, they allow names to be defined using Unicode.
    >>(The JavaScript engine used by Firefox supports this.)
    >>
    >>As far as I know, here's the status of a few other languages:
    >>
    >>C ? No.

    >
    > This is implementation-defined in C. A compiler is allowed to accept
    > variable names with alphabetic Unicode characters outside of ASCII.


    It is not implementation-defined in C99 whether Unicode characters are
    accepted in identifiers; what is implementation-defined is only which of them
    may be written directly in the source multibyte character set, and how.

    Characters escaped using \uHHHH or \U00HHHHHH (H is a hex digit), and that
    are in the sets of characters defined by Unicode for identifiers, are required
    to be supported, and should be mangled in some consistent way by a platform's
    linker. There are Unicode text editors which encode/decode \u and \U on the fly,
    so you can treat this essentially like a Unicode transformation format (it
    would have been nicer to require support for UTF-8, but never mind).
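    [A sketch, not part of the original thread: the universal-character-name
    escapes described above are hard to demonstrate portably in C, since compiler
    support varies. Java mandates the same backslash-u mechanism, so the idea can
    be illustrated there instead:]

```java
// Sketch: Java decodes backslash-u escapes in a prescan, before tokenization,
// so the declaration below creates a variable whose name is literally
// "caf" followed by U+00E9 -- exactly the mechanism C99 specifies for
// identifiers, but with guaranteed support.
public class EscapeDemo {
    public static void main(String[] args) {
        int caf\u00e9 = 42;            // identical to writing the e-acute directly
        System.out.println(caf\u00E9); // both spellings name the same variable
    }
}
```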


    C99 6.4.2.1:

    # 3 Each universal character name in an identifier shall designate a character
    # whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
    # annex D. 59) The initial character shall not be a universal character name
    # designating a digit. An implementation may allow multibyte characters that
    # are not part of the basic source character set to appear in identifiers;
    # which characters and their correspondence to universal character names is
    # implementation-defined.
    #
    # 59) On systems in which linkers cannot accept extended characters, an encoding
    # of the universal character name may be used in forming valid external
    # identifiers. For example, some otherwise unused character or sequence of
    # characters may be used to encode the \u in a universal character name.
    # Extended characters may produce a long external identifier.

    --
    David Hopwood <>
     
    David Hopwood, Jun 28, 2006
    #3
  4. Chris Uppal

    Chris Uppal Guest

    Joachim Durchholz wrote:

    > > This is implementation-defined in C. A compiler is allowed to accept
    > > variable names with alphabetic Unicode characters outside of ASCII.

    >
    > Hmm... that code would be nonportable, so C support for Unicode is
    > half-baked at best.


    Since the interpretation of characters which are yet to be added to
    Unicode is undefined (will they be digits, "letters", operators, symbols,
    punctuation...?), there doesn't seem to be any sane way that a language could
    allow an unrestricted choice of Unicode in identifiers. Hence, it must define
    a specific allowed sub-set. C certainly defines an allowed subset of Unicode
    characters -- so I don't think you could call its Unicode support "half-baked"
    (not in that respect, anyway). A case -- not entirely convincing, IMO -- could
    be made that it would be better to allow a wider range of characters.

    And no, I don't think Java's approach -- where there /is no defined set of
    allowed identifier characters/ -- makes any sense at all :-(

    -- chris
     
    Chris Uppal, Jun 28, 2006
    #4
  5. Java identifiers (was: languages with full unicode support)

    Note Followup-To: comp.lang.java.programmer

    Chris Uppal wrote:
    > Since the interpretation of characters which are yet to be added to
    > Unicode is undefined (will they be digits, "letters", operators, symbols,
    > punctuation...?), there doesn't seem to be any sane way that a language could
    > allow an unrestricted choice of Unicode in identifiers. Hence, it must define
    > a specific allowed sub-set. C certainly defines an allowed subset of Unicode
    > characters -- so I don't think you could call its Unicode support "half-baked"
    > (not in that respect, anyway). A case -- not entirely convincing, IMO -- could
    > be made that it would be better to allow a wider range of characters.
    >
    > And no, I don't think Java's approach -- where there /is no defined set of
    > allowed identifier characters/ -- makes any sense at all :-(


    Java does have a defined set of allowed identifier characters. However, you
    certainly have to go around the houses a bit to work out what that set is:


    <http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

    # An identifier is an unlimited-length sequence of Java letters and Java digits,
    # the first of which must be a Java letter. An identifier cannot have the same
    # spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
    # (§3.10.3), or the null literal (§3.10.7).
    [...]
    # A "Java letter" is a character for which the method
    # Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
    # is a character for which the method Character.isJavaIdentifierPart(int)
    # returns true.
    [...]
    # Two identifiers are the same only if they are identical, that is, have the
    # same Unicode character for each letter or digit.

    For Java 1.5.0:

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

    # Character information is based on the Unicode Standard, version 4.0.

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

    # A character may start a Java identifier if and only if one of the following
    # conditions is true:
    #
    # * isLetter(codePoint) returns true
    # * getType(codePoint) returns LETTER_NUMBER
    # * the referenced character is a currency symbol (such as "$")

    [This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
    General Category Sc.]

    # * the referenced character is a connecting punctuation character (such as "_").

    [This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
    General Category Pc.]

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

    # A character may be part of a Java identifier if any of the following are true:
    #
    # * it is a letter
    # * it is a currency symbol (such as '$')
    # * it is a connecting punctuation character (such as '_')
    # * it is a digit
    # * it is a numeric letter (such as a Roman numeral character)

    [General Category Nl.]

    # * it is a combining mark

    [General Category Mc (see <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

    # * it is a non-spacing mark

    [General Category Mn (ditto).]

    # * isIdentifierIgnorable(codePoint) returns true for the character

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

    # A character is a digit if its general category type, provided by
    # getType(codePoint), is DECIMAL_DIGIT_NUMBER.

    [General Category Nd.]

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

    # The following Unicode characters are ignorable in a Java identifier or a Unicode
    # identifier:
    #
    # * ISO control characters that are not whitespace
    # o '\u0000' through '\u0008'
    # o '\u000E' through '\u001B'
    # o '\u007F' through '\u009F'
    # * all characters that have the FORMAT general category value

    [FORMAT is General Category Cf.]

    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

    # A character is considered to be a letter if its general category type, provided
    # by getType(codePoint), is any of the following:
    #
    # * UPPERCASE_LETTER
    # * LOWERCASE_LETTER
    # * TITLECASE_LETTER
    # * MODIFIER_LETTER
    # * OTHER_LETTER

    ====

    To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

    Keyword ::= one of
    abstract continue for new switch
    assert default if package synchronized
    boolean do goto private this
    break double implements protected throw
    byte else import public throws
    case enum instanceof return transient
    catch extends int short try
    char final interface static void
    class finally long strictfp volatile
    const float native super while

    Identifier ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
    IdentifierChars ::= JavaLetter | IdentifierChars JavaLetterOrDigit
    JavaLetter ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
    JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
    U+0000..0008 | U+000E..001B | U+007F..009F | Cf

    where the two-letter terminals refer to General Categories in Unicode 4.0.0
    (exactly).
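    [Not part of the original post: the category tests behind this grammar can
    be probed directly via java.lang.Character; exact results track the Unicode
    version bundled with the runtime (4.0 for Java 1.5).]

```java
// Sketch: probing the JavaLetter / JavaLetterOrDigit classes derived above.
public class IdentCheck {
    public static void main(String[] args) {
        // Sc (currency) and Pc (connector punctuation) may START an identifier:
        System.out.println(Character.isJavaIdentifierStart('$'));      // true
        System.out.println(Character.isJavaIdentifierStart('_'));      // true
        // Nl (letter number), e.g. U+2160 ROMAN NUMERAL ONE, may also start one:
        System.out.println(Character.isJavaIdentifierStart('\u2160')); // true
        // Nd (decimal digit) may continue, but not start, an identifier:
        System.out.println(Character.isJavaIdentifierStart('9'));      // false
        System.out.println(Character.isJavaIdentifierPart('9'));       // true
        // Mn (non-spacing mark), e.g. COMBINING ACUTE ACCENT, is a valid part:
        System.out.println(Character.isJavaIdentifierPart('\u0301'));  // true
    }
}
```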

    Note that the so-called "ignorable" characters (for which
    isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
    treated like any other identifier character. This quote from the API spec:

    # The following Unicode characters are ignorable in a Java identifier [...]

    should be ignored (no pun intended). It is contradicted by:

    # Two identifiers are the same only if they are identical, that is, have the
    # same Unicode character for each letter or digit.

    in the language spec. Unicode does have a concept of ignorable characters in
    identifiers, which is probably where this documentation bug crept in.

    The inclusion of U+0000 and various control characters in the set of valid
    identifier characters is also a dubious decision, IMHO.
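    [Not part of the original post: the contradiction is easy to observe. A
    control character in the "ignorable" set is accepted as an identifier part,
    yet two names differing only by such a character are distinct identifiers,
    because comparison is by exact character sequence:]

```java
// Sketch: "ignorable" characters are identifier parts, but are NOT ignored
// when identifiers are compared.
public class IgnorableCheck {
    public static void main(String[] args) {
        char ctl = '\u0001'; // ISO control, not whitespace => "ignorable"
        System.out.println(Character.isIdentifierIgnorable(ctl)); // true
        System.out.println(Character.isJavaIdentifierPart(ctl));  // true
        // ...but "ab" and "a<U+0001>b" are different character sequences:
        System.out.println("ab".equals("a\u0001b"));              // false
    }
}
```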

    Note that I am not defending in any way the complexity of this definition; there's
    clearly no excuse for it (or for the "ignorable" documentation bug). The language
    spec should have been defined directly in terms of the Unicode General Categories,
    and then the API in terms of the language spec. The way it is done now is
    completely backwards.

    --
    David Hopwood <>
     
    David Hopwood, Jun 28, 2006
    #5
  6. Chris Uppal schrieb:
    > Joachim Durchholz wrote:
    >
    >>> This is implementation-defined in C. A compiler is allowed to accept
    >>> variable names with alphabetic Unicode characters outside of ASCII.

    >> Hmm... that code would be nonportable, so C support for Unicode is
    >> half-baked at best.

    >
    > Since the interpretation of characters which are yet to be added to
    > Unicode is undefined (will they be digits, "letters", operators, symbols,
    > punctuation...?), there doesn't seem to be any sane way that a language could
    > allow an unrestricted choice of Unicode in identifiers.


    I don't think this is a problem in practice. E.g. if a language uses the
    usual definition for identifiers (first letter, then letters/digits),
    you end up with a language that changes its definition on the whims of
    the Unicode consortium, but that's less of a problem than one might
    think at first.

    I'd expect two kinds of changes in character categorization: additions
    and corrections. (Any other?)

    Additions are relatively unproblematic. Existing code will remain valid
    and retain its semantics. The new characters will be available for new
    programs.
    There's a slight technological complication: the compiler needs to be
    able to look up the newest definition. In other words, for a compiler to
    run, it needs to be able to access http://unicode.org, or the language
    infrastructure needs a way to carry around various revisions of the
    Unicode tables and select the newest one.

    Corrections are technically more problematic, but then we can rely on
    the common sense of the programmers. If the Unicode consortium
    miscategorized a character as a letter, the programmers that use that
    character set will probably know it well enough to avoid its use. It
    will probably not even occur to them that that character could be a
    letter ;-)


    Actually I'm not sure that Unicode is important for long-lived code.
    Code tends to not survive very long unless it's written in English, in
    which case anything outside of strings is in 7-bit ASCII. So the
    majority of code won't ever be affected by Unicode problems - Unicode is
    more a way of lowering entry barriers.

    Regards,
    Jo
     
    Joachim Durchholz, Jul 1, 2006
    #6
  7. Dr.Ruud

    Dr.Ruud Guest

    Chris Uppal schreef:

    > Since the interpretation of characters which are yet to be added to
    > Unicode is undefined (will they be digits, "letters", operators,
    > symbols, punctuation...?), there doesn't seem to be any sane way
    > that a language could allow an unrestricted choice of Unicode in
    > identifiers.


    The Perl-code below prints:

    xdigit
    22 /194522 = 0.011% (lower: 6, upper: 6)
    ascii
    128 /194522 = 0.066% (lower: 26, upper: 26)
    \d
    268 /194522 = 0.138%
    digit
    268 /194522 = 0.138%
    IsNumber
    612 /194522 = 0.315%
    alpha
    91183 /194522 = 46.875% (lower: 1380, upper: 1160)
    alnum
    91451 /194522 = 47.013% (lower: 1380, upper: 1160)
    word
    91801 /194522 = 47.193% (lower: 1380, upper: 1160)
    graph
    102330 /194522 = 52.606% (lower: 1380, upper: 1160)
    print
    102349 /194522 = 52.616% (lower: 1380, upper: 1160)
    blank
    18 /194522 = 0.009%
    space
    24 /194522 = 0.012%
    punct
    374 /194522 = 0.192%
    cntrl
    6473 /194522 = 3.328%


    Especially look at 'word', the same as \w, which for ASCII is
    [0-9A-Za-z_].


    ==8<===================
    #!/usr/bin/perl
    # Program-Id: unicount.pl
    # Subject: show Unicode statistics

    use strict ;
    use warnings ;

    use Data::Alias ;

    binmode STDOUT, ':utf8' ;

    my @table =
    # +--Name------+---qRegexp--------+-C-+-L-+-U-+
    (
        [ 'xdigit'   , qr/[[:xdigit:]]/ , 0 , 0 , 0 ] ,
        [ 'ascii'    , qr/[[:ascii:]]/  , 0 , 0 , 0 ] ,
        [ '\\d'      , qr/\d/           , 0 , 0 , 0 ] ,
        [ 'digit'    , qr/[[:digit:]]/  , 0 , 0 , 0 ] ,
        [ 'IsNumber' , qr/\p{IsNumber}/ , 0 , 0 , 0 ] ,
        [ 'alpha'    , qr/[[:alpha:]]/  , 0 , 0 , 0 ] ,
        [ 'alnum'    , qr/[[:alnum:]]/  , 0 , 0 , 0 ] ,
        [ 'word'     , qr/[[:word:]]/   , 0 , 0 , 0 ] ,
        [ 'graph'    , qr/[[:graph:]]/  , 0 , 0 , 0 ] ,
        [ 'print'    , qr/[[:print:]]/  , 0 , 0 , 0 ] ,
        [ 'blank'    , qr/[[:blank:]]/  , 0 , 0 , 0 ] ,
        [ 'space'    , qr/[[:space:]]/  , 0 , 0 , 0 ] ,
        [ 'punct'    , qr/[[:punct:]]/  , 0 , 0 , 0 ] ,
        [ 'cntrl'    , qr/[[:cntrl:]]/  , 0 , 0 , 0 ] ,
    ) ;

    # The first planes, minus the surrogate range and the
    # U+FDD0..U+FDEF and U+xFFFE/U+xFFFF noncharacters:
    my @codepoints =
    (
        0x0000  .. 0xD7FF,
        0xE000  .. 0xFDCF,
        0xFDF0  .. 0xFFFD,
        0x10000 .. 0x1FFFD,
        0x20000 .. 0x2FFFD,
    #   0x30000 .. 0x3FFFD, # etc.
    ) ;

    for my $row ( @table )
    {
        alias my ($name, $qrx, $count, $lower, $upper) = @$row ;

        printf "\n%s\n", $name ;

        my $n = 0 ;

        for ( @codepoints )
        {
            local $_ = chr ; # int-to-char conversion
            $n++ ;

            if ( /$qrx/ )
            {
                $count++ ;
                $lower++ if / [[:lower:]] /x ;
                $upper++ if / [[:upper:]] /x ;
            }
        }

        my $show_lower_upper =
            ($lower || $upper)
            ? sprintf( " (lower:%6d, upper:%6d)"
                     , $lower
                     , $upper
                     )
            : '' ;

        printf "%6d /%6d =%7.3f%%%s\n"
             , $count
             , $n
             , 100 * $count / $n
             , $show_lower_upper
    }
    __END__

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 1, 2006
    #7
  8. Joachim Durchholz wrote:
    > Chris Uppal schrieb:
    >> Joachim Durchholz wrote:
    >>
    >>>> This is implementation-defined in C. A compiler is allowed to accept
    >>>> variable names with alphabetic Unicode characters outside of ASCII.
    >>>
    >>> Hmm... that code would be nonportable, so C support for Unicode is
    >>> half-baked at best.

    >>
    >> Since the interpretation of characters which are yet to be added to
    >> Unicode is undefined (will they be digits, "letters", operators, symbols,
    >> punctuation...?), there doesn't seem to be any sane way that a
    >> language could allow an unrestricted choice of Unicode in identifiers.

    >
    > I don't think this is a problem in practice. E.g. if a language uses the
    > usual definition for identifiers (first letter, then letters/digits),
    > you end up with a language that changes its definition on the whims of
    > the Unicode consortium, but that's less of a problem than one might
    > think at first.


    It is not a problem at all. See the stability policies in
    <http://www.unicode.org/reports/tr31/tr31-2.html>.

    > Actually I'm not sure that Unicode is important for long-lived code.
    > Code tends to not survive very long unless it's written in English, in
    > which case anything outside of strings is in 7-bit ASCII. So the
    > majority of code won't ever be affected by Unicode problems - Unicode is
    > more a way of lowering entry barriers.


    Unicode in identifiers has certainly been less important than some thought
    it would be -- and not at all important for open source projects, for example,
    which essentially have to use English to get the widest possible participation.

    --
    David Hopwood <>
     
    David Hopwood, Jul 1, 2006
    #8
  9. Dale King

    Dale King Guest

    Tim Roberts wrote:
    > "Xah Lee" <> wrote:
    >
    >> Languages with Full Unicode Support
    >>
    >> As far as I know, Java and JavaScript are languages with full, complete
    >> Unicode support. That is, they allow names to be defined using Unicode.
    >> (The JavaScript engine used by Firefox supports this.)
    >>
    >> As far as I know, here's the status of a few other languages:
    >>
    >> C ? No.

    >
    > This is implementation-defined in C. A compiler is allowed to accept
    > variable names with alphabetic Unicode characters outside of ASCII.


    I don't think it is implementation-defined. I believe it is actually
    required by the spec. The trouble is that so few compilers actually
    comply with the spec. A few years ago I asked for someone to point
    to a fully compliant compiler, and no one could.

    --
    Dale King
     
    Dale King, Jul 5, 2006
    #9
  10. Tim Roberts

    Tim Roberts Guest

    Dale King <> wrote:
    >Tim Roberts wrote:
    >> "Xah Lee" <> wrote:
    >>
    >>> Languages with Full Unicode Support
    >>>
    >>> As far as I know, Java and JavaScript are languages with full, complete
    >>> Unicode support. That is, they allow names to be defined using Unicode.
    >>> (The JavaScript engine used by Firefox supports this.)

    >>
    >> This is implementation-defined in C. A compiler is allowed to accept
    >> variable names with alphabetic Unicode characters outside of ASCII.

    >
    >I don't think it is implementation-defined. I believe it is actually
    >required by the spec.


    C99 does have a list of Unicode codepoints that are required to be accepted
    in identifiers, although implementations are free to accept other
    characters as well. For example, few people realize that Visual C++
    accepts the dollar sign $ in an identifier.
    --
    - Tim Roberts,
    Providenza & Boekelheide, Inc.
     
    Tim Roberts, Jul 6, 2006
    #10