Re: Isn't java.lang.Character.html#{ isLetterFromLang(int codePoint, String ISOLangDef) missing from the spec?

Discussion in 'Java' started by Joshua Cranmer, Dec 5, 2010.

  1. On 12/04/2010 07:16 PM, wrote:
    > One could possibly (and easily ;-)) check the ranges for each
    > language based on the Unicode code points, but I think it would be
    > very useful for people parsing text from different languages.


    Language is not so simple. First of all, code points don't necessarily
    map to a `character' in a language--you can represent `è' either as the
    single code point U+00E8 (Latin small letter e with grave) or as U+0065
    (Latin small letter e) followed by U+0300 (combining grave accent).
    Second of all, what would you say makes a character part of a language?
    For the most part, é does not exist in English, but, e.g., résumé is the
    proper spelling. Then you get complicated cases like Japanese, which can
    be written in hiragana, katakana, kanji, or rōmaji. Technically, rōmaji
    is merely a Latin transliteration of Japanese, so it's debatable how much
    it is or isn't Japanese.

    Finally, you run into the ambiguities of Unicode codepoints. Are
    fullwidth roman letters valid for en-US, even though English typography
    doesn't distinguish between fullwidth and halfwidth? English also
    borrows the characters of other languages for various purposes: remember
    that the abbreviation for micrometer is `μm', so is `μ' in en-US or not?
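
    To make the ambiguity concrete, here is a minimal sketch that asks the
    JDK which Unicode script a few code points belong to. It assumes Java 7
    or later, where Character.UnicodeScript and Character.getName exist; the
    class name ScriptProbe and the choice of sample code points are purely
    illustrative. Note that a script is not a language: this only reports
    what Unicode says about the code point, not whether en-US "contains" it.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper class, not part of any library.
final class ScriptProbe {

    /** Returns the name of the Unicode script of a code point, e.g. "LATIN". */
    static String scriptOf(int codePoint) {
        return UnicodeScript.of(codePoint).name();
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0041, // LATIN CAPITAL LETTER A
            0xFF21, // FULLWIDTH LATIN CAPITAL LETTER A
            0x00E9, // LATIN SMALL LETTER E WITH ACUTE
            0x03BC, // GREEK SMALL LETTER MU
            0x00B5, // MICRO SIGN
            0x3042  // HIRAGANA LETTER A
        };
        // Print code point, official character name, and script for each sample.
        for (int cp : samples) {
            System.out.printf("U+%04X  %-40s %s%n",
                    cp, Character.getName(cp), scriptOf(cp));
        }
    }
}
```

    The fullwidth A is reported as Latin just like the ordinary A, and the
    Greek mu as Greek, so a script lookup alone cannot decide whether
    either one is "an English letter".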

    In my opinion, this is not generally useful enough to be worth having in
    the standard library. Actually, I don't think Java even has Unicode
    normalization functions, which are much more useful than divining
    languages from code points.

    > Do you know of any java packages to address these NLP issues? or, if
    > you don't, is there something like that for text processing in ANSI C
    > or C++? ~ Thanks lbrtchx


    What are you really trying to do? If you are trying to detect languages
    based on codepoints, that is not going to work that well. You would be
    far better off trying to guess the language based on letter frequency, or
    even just parsing the text as each candidate language and seeing which
    one yields the fewest "misspelled" words.
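
    As a rough illustration of the letter-frequency idea, the sketch below
    builds a frequency profile of the letters a-z for each language from a
    sample text and picks the language whose profile is closest (by cosine
    similarity) to that of the input. Everything here is an assumption made
    for the example: the class name LetterFrequencyGuesser, the restriction
    to a-z (which throws away accented letters and every non-Latin script),
    and the expectation that the caller supplies the training samples. A
    real detector would need large corpora and probably n-grams rather than
    single letters.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of letter-frequency language guessing.
final class LetterFrequencyGuesser {

    // Language tag -> relative frequency of each letter a-z.
    private final Map<String, double[]> profiles = new LinkedHashMap<>();

    /** Relative frequency of a-z in the text; all other characters are ignored. */
    static double[] profile(String text) {
        double[] freq = new double[26];
        int total = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                freq[c - 'a']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Cosine similarity of two frequency vectors; 0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Registers a language using the letter profile of the given sample text. */
    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    /** Returns the trained language with the most similar profile, or null if none. */
    String guess(String text) {
        double[] p = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, double[]> e : profiles.entrySet()) {
            double score = cosine(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

    Usage would be along the lines of calling train("en", someEnglishText)
    and train("de", someGermanText) once, then guess(unknownText). With
    short inputs the result is little better than a coin toss, which is
    part of why this is hard to do well.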

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer, Dec 5, 2010
    #1

  2. Roedy Green Guest

    Re: Isn't java.lang.Character.html#{ isLetterFromLang(int codePoint, String ISOLangDef) missing from the spec?

    On Sat, 04 Dec 2010 21:00:37 -0500, Joshua Cranmer
    <> wrote, quoted or indirectly quoted someone
    who said :

    >Language is not so simple. First of all, code points don't necessarily
    >map to a `character' in a language--you can represen


    Then there is Arabic, where the Unicode is just a hint as to what needs
    to be rendered. The script was originally designed to be written
    cursively, so there are special forms for starting and ending, and bits
    can shift around. It is more like a 2D tessellation problem.

    From the little I learned about it, I am impressed anyone ever figured
    out how to use computers to typeset books. The results are
    aesthetically quite pleasing, though I can only read a few words.

    If anyone speaks Arabic, I would like to know how close what you see
    on computer screens when programming comes to the classical form used
    in books.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    In programming, and documenting programs, keep vocabulary consistent and precisely defined! Variation in vocabulary to relieve the tedium is for novels.
     
    Roedy Green, Dec 5, 2010
    #2

  3. Re: Isn't java.lang.Character.html#{ isLetterFromLang(int codePoint, String ISOLangDef) missing from the spec?

    On 2010-12-04 21:00:37 -0500, Joshua Cranmer said:

    > [...] I don't think Java even has Unicode normalization functions,
    > which are much more useful than divining languages from code points.


    java.text.Normalizer - Hope that helps.
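
    For anyone who has not used it, a minimal sketch of how
    java.text.Normalizer (in the JDK since Java 6) handles the two
    spellings of `è' mentioned earlier in the thread. The class name
    NormalizeDemo and the helper canonicallyEqual are made up for this
    example; Normalizer.normalize and Normalizer.Form are the real API.

```java
import java.text.Normalizer;

// Hypothetical demo class; only Normalizer itself is JDK API.
final class NormalizeDemo {

    /** True if the two strings are equal after normalizing both to NFC. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E8"; // single code point: e with grave
        String decomposed = "e\u0300"; // plain e followed by combining grave accent

        // Plain String.equals compares code units, so the two spellings differ.
        System.out.println(precomposed.equals(decomposed));            // false

        // After normalization they compare equal.
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true

        // NFD splits the precomposed form into base letter + combining mark.
        System.out.println(
                Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2
    }
}
```

    Normalizing both sides to the same form (NFC here, but NFD works as
    well) before comparing or hashing is the usual way around the
    one-character-two-encodings problem.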

    -o
     
    Owen Jacobson, Dec 5, 2010
    #3
  6. Tom Anderson Guest

    On Sat, 4 Dec 2010, Roedy Green wrote:

    > On Sat, 04 Dec 2010 21:00:37 -0500, Joshua Cranmer
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> Language is not so simple. First of all, code points don't necessarily
    >> map to a `character' in a language--you can represen

    >
    > Then there is Arabic where the Unicode is just a hint as to what needs
    > to be rendered. It was originally designed to be written cursively, so
    > there are special forms for starting and ending, and bits can shift
    > around. It is more like a 2D tessellation problem.


    I know that the Unicode consortium did cock up Arabic quite badly by
    starting off with a lot of precomposed characters, when they should have
    gone down a more base-plus-combining route, but my impression was that it
    was now possible to encode all Arabic text. The typesetting may not be
    easy (indeed, i understand it's still really rather hard), but it's very
    much a matter of typesetting rather than encoding.

    A good explanation of the situation is hard to come across, but most of it
    is in here:

    http://www.paktribune.com/pforums/posts.php?t=7389

    > From the little I learned about it, I am impressed anyone ever figured
    > out how to use computers to typeset books. The results are aesthetically
    > quite pleasing, though I can only read a few words.
    >
    > If anyone speaks Arabic, I would like to know how close what you see on
    > computer screens when programming comes to the classical form used in
    > books.


    Another - more leading! - question would be how computer-set text compares
    to the typewritten text that people have been using for day-to-day work
    for the immediately preceding decades. Another would be how it compares to
    newspaper typesetting, which again accounts for a large amount of the text
    people read, and i suspect is not set as carefully as book text. If modern
    computers are better than typewriters, that's a huge amount of utility
    right there; if they're better than manual newspaper typesetting, even
    better.

    tom

    --
    Now I am thoroughly confused. -- Colin Brace sums up RT3090 support
    in Linux
     
    Tom Anderson, Dec 5, 2010
    #6
