Re: Isn't java.lang.Character.html#{ isLetterFromLang(int codePoint,String ISOLangDef) missing from

Discussion in 'Java' started by Arne Vajhøj, Dec 5, 2010.

  1. Arne Vajhøj

    Arne Vajhøj Guest

    On 04-12-2010 20:34, wrote:
    >> The concept will be fundamentally broken if one language
    >> has more than one alphabet (I don't know if such case exist,
    >> but it could).

    > ~
    > Well, there are plenty of languages using more than one alphabet. Japanese comes to mind:
    > ~
    > http://en.wikipedia.org/wiki/Japanese_writing_system
    > ~
    > It actually uses 4 writing systems: Kanji, Hiragana, Katakana, Rōmaji


    Then the function is unimplementable.

    >> And the benefits are very limited given the practice
    >> of writing names as they are in their native language
    >> even though the letters are not used in the language
    >> of the text.

    > ~
    > If all you take into account are nominal entries (POS bearing a name) in language,
    > I don't think that the benefits are -very limited-, since in every language
    > those are a very small part of all it goes on


    Names and specialized terms and phrases are in a lot of text.

    Arne
    Arne Vajhøj, Dec 5, 2010
    #1

  2. Tom Anderson

    Tom Anderson Guest

    On Sat, 4 Dec 2010, Arne Vajhøj wrote:

    > On 04-12-2010 20:34, wrote:
    >>> The concept will be fundamentally broken if one language
    >>> has more than one alphabet (I don't know if such case exist,
    >>> but it could).

    >> ~
    >> Well, there are plenty of languages using more than one alphabet.
    >> Japanese comes to mind:
    >> ~
    >> http://en.wikipedia.org/wiki/Japanese_writing_system
    >> ~
    >> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >> Rōmaji

    >
    > Then the function is unimplementable.


    Why? Why doesn't the function simply return true for glyphs from any of
    those groups and "ja"?

    I note that in particular, it might return true for the fullwidth romaji,
    but not the standard width ones.
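
    A minimal sketch of the behaviour Tom describes, assuming Java 9 or later (for Map.of) and the java.lang.Character.UnicodeScript API that arrived in Java 7. The SCRIPTS_BY_LANG table and the fullwidth special case are illustrative assumptions, not an existing or proposed JDK method:

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class LangLetters {
    // Hypothetical hand-written table: the scripts each language is normally written in.
    private static final Map<String, Set<UnicodeScript>> SCRIPTS_BY_LANG = Map.of(
            "ja", EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA),
            "en", EnumSet.of(UnicodeScript.LATIN));

    // Returns true if the code point is a letter from any script listed for the language.
    static boolean isLetterFromLang(int codePoint, String isoLang) {
        Set<UnicodeScript> scripts = SCRIPTS_BY_LANG.get(isoLang);
        if (scripts == null) {
            throw new IllegalArgumentException("no script data for language: " + isoLang);
        }
        if (!Character.isLetter(codePoint)) {
            return false;
        }
        UnicodeScript script = UnicodeScript.of(codePoint);
        if (scripts.contains(script)) {
            return true;
        }
        // Special case from the post: accept fullwidth romaji for "ja",
        // but not the ordinary (standard width) Latin letters.
        return isoLang.equals("ja")
                && script == UnicodeScript.LATIN
                && UnicodeBlock.of(codePoint) == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }
}
```

    With this table, isLetterFromLang(0x3042, "ja") accepts HIRAGANA LETTER A, while isLetterFromLang('a', "ja") rejects the standard width Latin letter.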

    tom

    --
    Now I am thoroughly confused. -- Colin Brace sums up RT3090 support
    in Linux
    Tom Anderson, Dec 5, 2010
    #2

  3. Arne Vajhøj

    Arne Vajhøj Guest

    On 05-12-2010 07:07, Tom Anderson wrote:
    > On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >> On 04-12-2010 20:34, wrote:
    >>>> The concept will be fundamentally broken if one language
    >>>> has more than one alphabet (I don't know if such case exist,
    >>>> but it could).
    >>> ~
    >>> Well, there are plenty of languages using more than one alphabet.
    >>> Japanese comes to mind:
    >>> ~
    >>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>> ~
    >>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>> Rōmaji

    >>
    >> Then the function is unimplementable.

    >
    > Why? Why doesn't the function simply return true for glyphs from any of
    > those groups and "ja"?
    >
    > I note that in particular, it might return true for the fullwidth
    > romaji, but not the standard width ones.


    I guess it could.

    But then what does the result mean?

    It does not mean that the code point is valid in any
    text in that language.

    Arne
    Arne Vajhøj, Dec 5, 2010
    #3
  4. Tom Anderson

    Tom Anderson Guest

    On Sun, 5 Dec 2010, Arne Vajhøj wrote:

    > On 05-12-2010 07:07, Tom Anderson wrote:
    >> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>> On 04-12-2010 20:34, wrote:
    >>>>> The concept will be fundamentally broken if one language
    >>>>> has more than one alphabet (I don't know if such case exist,
    >>>>> but it could).
    >>>> ~
    >>>> Well, there are plenty of languages using more than one alphabet.
    >>>> Japanese comes to mind:
    >>>> ~
    >>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>> ~
    >>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>> Rōmaji
    >>>
    >>> Then the function is unimplementable.

    >>
    >> Why? Why doesn't the function simply return true for glyphs from any of
    >> those groups and "ja"?
    >>
    >> I note that in particular, it might return true for the fullwidth
    >> romaji, but not the standard width ones.

    >
    > I guess it could.
    >
    > But then what does the result mean?


    It means that the character could be part of a text in that language.

    > It does not mean that the code point is valid in any text in that
    > language.


    Surely that's exactly what it means?

    tom

    --
    The coolest thing to do with your data will be thought of by someone
    else. -- Rufus Pollock
    Tom Anderson, Dec 5, 2010
    #4
  5. Arne Vajhøj

    Arne Vajhøj Guest

    On 05-12-2010 10:10, Tom Anderson wrote:
    > On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >> On 05-12-2010 07:07, Tom Anderson wrote:
    >>> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>>> On 04-12-2010 20:34, wrote:
    >>>>>> The concept will be fundamentally broken if one language
    >>>>>> has more than one alphabet (I don't know if such case exist,
    >>>>>> but it could).
    >>>>> ~
    >>>>> Well, there are plenty of languages using more than one alphabet.
    >>>>> Japanese comes to mind:
    >>>>> ~
    >>>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>>> ~
    >>>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>>> Rōmaji
    >>>>
    >>>> Then the function is unimplementable.
    >>>
    >>> Why? Why doesn't the function simply return true for glyphs from any of
    >>> those groups and "ja"?
    >>>
    >>> I note that in particular, it might return true for the fullwidth
    >>> romaji, but not the standard width ones.

    >>
    >> I guess it could.
    >>
    >> But then what does the result mean?

    >
    > It means that the character could be part of a text in that language.
    >
    >> It does not mean that the code point is valid in any text in that
    >> language.

    >
    > Surely that's exactly what it means?


    No.

    The difference is between any and some.

    With that semantics I find the function useless.

    isLetterFromAlphabet may make more sense. If Alphabet is
    sufficiently well defined.
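
    One way to read this suggestion, as a sketch only: treat a Unicode script (java.lang.Character.UnicodeScript, Java 7 and later) as the "alphabet". That equivalence is an assumption, since a Unicode script is broader than any one language's alphabet, and the class name AlphabetCheck is made up for illustration:

```java
import java.lang.Character.UnicodeScript;

final class AlphabetCheck {
    // True if the code point is a letter whose Unicode script equals the given "alphabet".
    static boolean isLetterFromAlphabet(int codePoint, UnicodeScript alphabet) {
        return Character.isLetter(codePoint) && UnicodeScript.of(codePoint) == alphabet;
    }
}
```

    Here the answer depends only on the code point and the script, so the any/some ambiguity of the language-based version does not arise.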

    Arne
    Arne Vajhøj, Dec 5, 2010
    #5
  6. Tom Anderson

    Tom Anderson Guest

    On Sun, 5 Dec 2010, Arne Vajhøj wrote:

    > On 05-12-2010 10:10, Tom Anderson wrote:
    >> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>> On 05-12-2010 07:07, Tom Anderson wrote:
    >>>> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>>>> On 04-12-2010 20:34, wrote:
    >>>>>>> The concept will be fundamentally broken if one language
    >>>>>>> has more than one alphabet (I don't know if such case exist,
    >>>>>>> but it could).
    >>>>>> ~
    >>>>>> Well, there are plenty of languages using more than one alphabet.
    >>>>>> Japanese comes to mind:
    >>>>>> ~
    >>>>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>>>> ~
    >>>>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>>>> Rōmaji
    >>>>>
    >>>>> Then the function is unimplementable.
    >>>>
    >>>> Why? Why doesn't the function simply return true for glyphs from any of
    >>>> those groups and "ja"?
    >>>>
    >>>> I note that in particular, it might return true for the fullwidth
    >>>> romaji, but not the standard width ones.
    >>>
    >>> I guess it could.
    >>>
    >>> But then what does the result mean?

    >>
    >> It means that the character could be part of a text in that language.
    >>
    >>> It does not mean that the code point is valid in any text in that
    >>> language.

    >>
    >> Surely that's exactly what it means?

    >
    > No.
    >
    > The difference is between any and some.
    >
    > With that semantics I find the function useless.


    I'm sorry to hear that. I don't.

    Perhaps the function should return a result from an enum -
    NOT_IN_THIS_LANGUAGE, USED_IN_THIS_LANGUAGE,
    EXCLUSIVELY_USED_IN_THIS_LANGUAGE.
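
    A sketch of how such an enum result might look, assuming Java 9 or later and a small hypothetical script table. Note that "exclusively" here can only mean exclusive among the languages present in the table, which is an assumption of this example:

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class LanguageUsage {
    enum Usage { NOT_IN_THIS_LANGUAGE, USED_IN_THIS_LANGUAGE, EXCLUSIVELY_USED_IN_THIS_LANGUAGE }

    // Hypothetical table of the scripts each language is written in.
    private static final Map<String, Set<UnicodeScript>> SCRIPTS_BY_LANG = Map.of(
            "ja", EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA),
            "zh", EnumSet.of(UnicodeScript.HAN),
            "en", EnumSet.of(UnicodeScript.LATIN));

    static Usage classify(int codePoint, String isoLang) {
        Set<UnicodeScript> scripts = SCRIPTS_BY_LANG.get(isoLang);
        if (scripts == null) {
            throw new IllegalArgumentException("no script data for language: " + isoLang);
        }
        UnicodeScript script = UnicodeScript.of(codePoint);
        if (!Character.isLetter(codePoint) || !scripts.contains(script)) {
            return Usage.NOT_IN_THIS_LANGUAGE;
        }
        // If any other language in the table also uses this script, the letter is shared.
        for (Map.Entry<String, Set<UnicodeScript>> other : SCRIPTS_BY_LANG.entrySet()) {
            if (!other.getKey().equals(isoLang) && other.getValue().contains(script)) {
                return Usage.USED_IN_THIS_LANGUAGE;
            }
        }
        return Usage.EXCLUSIVELY_USED_IN_THIS_LANGUAGE;
    }
}
```

    With this table a hiragana letter is classified as exclusive to "ja", while a Han ideograph is merely used in "ja" because "zh" shares the script.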

    > isLetterFromAlphabet may make more sense. If Alphabet is sufficiently well
    > defined.


    That could certainly be handy too. Since you could fairly easily
    construct a many-to-many alphabet -> language mapping, you could implement
    the original function on top of it.

    Even better might be a function Set<Language>
    languagesWhichUseThisCharacter(). Or perhaps, applying your idea,
    Set<Script> scriptsWhichUseThisCharacter, with Script having a
    Set<Language> languagesWrittenInThisScript().
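
    A rough sketch of that two-step lookup, assuming Java 9 or later. Java's Character.UnicodeScript.of returns a single script per code point, so the Set<Script> in Tom's outline collapses to one value here. The language lists are hypothetical sample data and the class name ScriptLanguages is invented:

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

final class ScriptLanguages {
    // Hypothetical script -> languages mapping (one direction of the many-to-many relation).
    private static final Map<UnicodeScript, Set<String>> LANGS_BY_SCRIPT =
            new EnumMap<>(UnicodeScript.class);
    static {
        LANGS_BY_SCRIPT.put(UnicodeScript.HAN, Set.of("zh", "ja"));
        LANGS_BY_SCRIPT.put(UnicodeScript.HIRAGANA, Set.of("ja"));
        LANGS_BY_SCRIPT.put(UnicodeScript.KATAKANA, Set.of("ja"));
        LANGS_BY_SCRIPT.put(UnicodeScript.LATIN, Set.of("en", "da", "hu"));
        LANGS_BY_SCRIPT.put(UnicodeScript.ARABIC, Set.of("ar", "ur"));
        LANGS_BY_SCRIPT.put(UnicodeScript.DEVANAGARI, Set.of("hi"));
    }

    // Languages the table lists for a script; empty if the script is not in the table.
    static Set<String> languagesWrittenInThisScript(UnicodeScript script) {
        return LANGS_BY_SCRIPT.getOrDefault(script, Set.of());
    }

    // Composes code point -> script (from the JDK) with script -> languages (from the table).
    static Set<String> languagesWhichUseThisCharacter(int codePoint) {
        return languagesWrittenInThisScript(UnicodeScript.of(codePoint));
    }
}
```

    Keeping the script as the intermediate step means the only hand-maintained data is the script-to-language table; the code-point-to-script half comes from the Unicode data shipped with the JDK.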

    tom

    --
    non, scarecrow, forensics, rituals, bacteria, scientific instruments, ..
    Tom Anderson, Dec 6, 2010
    #6
  7. Arne Vajhøj

    Arne Vajhøj Guest

    On 06-12-2010 07:48, Tom Anderson wrote:
    > On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >
    >> On 05-12-2010 10:10, Tom Anderson wrote:
    >>> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>>> On 05-12-2010 07:07, Tom Anderson wrote:
    >>>>> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>>>>> On 04-12-2010 20:34, wrote:
    >>>>>>>> The concept will be fundamentally broken if one language
    >>>>>>>> has more than one alphabet (I don't know if such case exist,
    >>>>>>>> but it could).
    >>>>>>> ~
    >>>>>>> Well, there are plenty of languages using more than one alphabet.
    >>>>>>> Japanese comes to mind:
    >>>>>>> ~
    >>>>>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>>>>> ~
    >>>>>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>>>>> Rōmaji
    >>>>>>
    >>>>>> Then the function is unimplementable.
    >>>>>
    >>>>> Why? Why doesn't the function simply return true for glyphs from
    >>>>> any of
    >>>>> those groups and "ja"?
    >>>>>
    >>>>> I note that in particular, it might return true for the fullwidth
    >>>>> romaji, but not the standard width ones.
    >>>>
    >>>> I guess it could.
    >>>>
    >>>> But then what does the result mean?
    >>>
    >>> It means that the character could be part of a text in that language.
    >>>
    >>>> It does not mean that the code point is valid in any text in that
    >>>> language.
    >>>
    >>> Surely that's exactly what it means?

    >>
    >> No.
    >>
    >> The difference is between any and some.
    >>
    >> With that semantics I find the function useless.

    >
    > I'm sorry to hear that. I don't.
    >
    > Perhaps the function should return a result from an enum -
    > NOT_IN_THIS_LANGUAGE, USED_IN_THIS_LANGUAGE,
    > EXCLUSIVELY_USED_IN_THIS_LANGUAGE.


    Which is not related at all to the problem I am describing!?!?

    >> isLetterFromAlphabet may make more sense. If Alphabet is sufficiently
    >> well defined.

    >
    > That could certainly be handy too. Since you could fairly easily
    > construct a many-to-many alphabet -> language mapping, you could
    > implement the original function on top of it.


    You could.

    But I can still not see the value of it.

    Arne
    Arne Vajhøj, Dec 7, 2010
    #7
  8. Tom Anderson

    Tom Anderson Guest

    On Mon, 6 Dec 2010, Arne Vajhøj wrote:

    > On 06-12-2010 07:48, Tom Anderson wrote:
    >> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>
    >>> On 05-12-2010 10:10, Tom Anderson wrote:
    >>>> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>>>> On 05-12-2010 07:07, Tom Anderson wrote:
    >>>>>> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>>>>>> On 04-12-2010 20:34, wrote:
    >>>>>>>>> The concept will be fundamentally broken if one language
    >>>>>>>>> has more than one alphabet (I don't know if such case exist,
    >>>>>>>>> but it could).
    >>>>>>>> ~
    >>>>>>>> Well, there are plenty of languages using more than one alphabet.
    >>>>>>>> Japanese comes to mind:
    >>>>>>>> ~
    >>>>>>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>>>>>> ~
    >>>>>>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>>>>>> Rōmaji
    >>>>>>>
    >>>>>>> Then the function is unimplementable.
    >>>>>>
    >>>>>> Why? Why doesn't the function simply return true for glyphs from
    >>>>>> any of
    >>>>>> those groups and "ja"?
    >>>>>>
    >>>>>> I note that in particular, it might return true for the fullwidth
    >>>>>> romaji, but not the standard width ones.
    >>>>>
    >>>>> I guess it could.
    >>>>>
    >>>>> But then what does the result mean?
    >>>>
    >>>> It means that the character could be part of a text in that language.
    >>>>
    >>>>> It does not mean that the code point is valid in any text in that
    >>>>> language.
    >>>>
    >>>> Surely that's exactly what it means?
    >>>
    >>> No.
    >>>
    >>> The difference is between any and some.
    >>>
    >>> With that semantics I find the function useless.

    >>
    >> I'm sorry to hear that. I don't.
    >>
    >> Perhaps the function should return a result from an enum -
    >> NOT_IN_THIS_LANGUAGE, USED_IN_THIS_LANGUAGE,
    >> EXCLUSIVELY_USED_IN_THIS_LANGUAGE.

    >
    > Which is not related at all to the problem I am describing!?!?


    Then at least one of us has misunderstood the other. Could you restate
    your problem?

    >>> isLetterFromAlphabet may make more sense. If Alphabet is sufficiently
    >>> well defined.

    >>
    >> That could certainly be handy too. Since you could fairly easily
    >> construct a many-to-many alphabet -> language mapping, you could
    >> implement the original function on top of it.

    >
    > You could.
    >
    > But I can still not see the value of it.


    You could take some text and produce a set of languages it could possibly
    be from. You wouldn't be able to tell many European languages apart, but
    you could tell runs of typical Japanese, Chinese, Hindi, Arabic, Urdu, etc
    apart.
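
    The idea can be sketched by intersecting the candidate languages of every letter in the text. This assumes Java 9 or later and a hypothetical script-to-language table; because it works at script granularity, it cannot separate languages that share a script (Arabic and Urdu here), which would need per-character data that this example does not have:

```java
import java.lang.Character.UnicodeScript;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class LanguageGuess {
    // Hypothetical sample data: which languages are written in which script.
    private static final Map<UnicodeScript, Set<String>> LANGS_BY_SCRIPT = Map.of(
            UnicodeScript.HAN, Set.of("zh", "ja"),
            UnicodeScript.HIRAGANA, Set.of("ja"),
            UnicodeScript.KATAKANA, Set.of("ja"),
            UnicodeScript.LATIN, Set.of("en", "hu"),
            UnicodeScript.ARABIC, Set.of("ar", "ur"),
            UnicodeScript.DEVANAGARI, Set.of("hi"));

    // Languages that could account for every letter in the text.
    // Returns an empty set if no single language fits or the text has no letters.
    static Set<String> possibleLanguages(String text) {
        Set<String> candidates = null;
        for (int cp : text.codePoints().toArray()) {
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation, spaces and combining marks
            }
            Set<String> langs = LANGS_BY_SCRIPT.getOrDefault(UnicodeScript.of(cp), Set.of());
            if (candidates == null) {
                candidates = new TreeSet<>(langs);
            } else {
                candidates.retainAll(langs);
            }
        }
        return candidates == null ? new TreeSet<>() : candidates;
    }
}
```

    A run of Han ideographs alone leaves both "zh" and "ja" as candidates; adding a single hiragana letter narrows the result to "ja".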

    tom

    --
    a moratorium on the future
    Tom Anderson, Dec 7, 2010
    #8
  9. Lew

    Lew Guest

    Tom Anderson wrote:
    > You could take some text and produce a set of languages it could
    > possibly be from. You wouldn't be able to tell many European languages
    > apart, but you could tell runs of typical Japanese, Chinese, Hindi,
    > Arabic, Urdu, etc apart.


    You can tell about some things regarding runs of "typical" characters, but
    cannot reliably rate an entire document. Suppose this post were about Asian
    art (or má jiàng), and I mention the "four gentlemen", 四君子.
    <http://en.wikipedia.org/wiki/Four_Gentlemen>

    That run of ideograms is the same in Chinese, Japanese, Korean and Vietnamese.
    Which one is it? Is this post in English or one of those four languages?

    The run of characters "má jiàng" - what language is that?

    --
    Lew
    Lew, Dec 7, 2010
    #9
  10. Tom Anderson

    Tom Anderson Guest

    On Tue, 7 Dec 2010, Lew wrote:

    > Tom Anderson wrote:
    >> You could take some text and produce a set of languages it could
    >> possibly be from. You wouldn't be able to tell many European languages
    >> apart, but you could tell runs of typical Japanese, Chinese, Hindi,
    >> Arabic, Urdu, etc apart.

    >
    > You can tell about some things regarding runs of "typical" characters, but
    > cannot reliably rate an entire document. Suppose this post were about Asian
    > art (or má jiàng), and I mention the "four gentlemen", 四君子.
    > <http://en.wikipedia.org/wiki/Four_Gentlemen>
    >
    > That run of ideograms is the same in Chinese, Japanese, Korean and
    > Vietnamese. Which one is it? Is this post in English or one of those four
    > languages?


    I'd conclude that it couldn't be in any one language.

    If you were interested in mixed-language text, you could do set covering
    on the results (hoping that the number of different
    sets-of-possible-languages is small enough that nobody notices you're
    solving an NP-hard problem), to find out possible sets of languages that
    could be in the mix. I'd hope you'd end up with {English, Chinese},
    {English, Japanese}, and so on.
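
    A brute-force sketch of that covering step, assuming each run of text has already been reduced to its set of possible languages. It enumerates every subset of the languages seen, so it is exponential and only workable for the small inputs Tom is hoping for; the names LanguageMix and smallestMixes are invented for this example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class LanguageMix {
    // Returns every smallest set of languages that contains at least one
    // candidate language for each run (a minimum hitting set, found by brute force).
    static List<Set<String>> smallestMixes(List<Set<String>> runs) {
        TreeSet<String> all = new TreeSet<>();
        for (Set<String> run : runs) {
            all.addAll(run);
        }
        List<String> universe = new ArrayList<>(all);
        int n = universe.size();
        if (n > 20) {
            throw new IllegalArgumentException("too many languages for brute force: " + n);
        }
        List<Set<String>> best = new ArrayList<>();
        int bestSize = Integer.MAX_VALUE;
        for (int mask = 0; mask < (1 << n); mask++) {
            int size = Integer.bitCount(mask);
            if (size > bestSize) {
                continue; // cannot beat the best mix found so far
            }
            Set<String> pick = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    pick.add(universe.get(i));
                }
            }
            boolean coversAll = true;
            for (Set<String> run : runs) {
                if (Collections.disjoint(run, pick)) {
                    coversAll = false;
                    break;
                }
            }
            if (!coversAll) {
                continue;
            }
            if (size < bestSize) {
                best.clear();
                bestSize = size;
            }
            best.add(pick);
        }
        return best;
    }
}
```

    For one run that can only be English and one run that could be Chinese or Japanese, the result is the two mixes {en, ja} and {en, zh}, matching the outcome Tom hopes for.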

    > The run of characters "má jiàng" - what language is that?


    Looks like Hungarian to me.

    tom

    --
    09F911029D74E35BD84156C5635688C0 -- AACS Licensing Administrator
    Tom Anderson, Dec 7, 2010
    #10
  11. Arne Vajhøj

    Arne Vajhøj Guest

    On 07-12-2010 08:19, Tom Anderson wrote:
    > On Mon, 6 Dec 2010, Arne Vajhøj wrote:
    >
    >> On 06-12-2010 07:48, Tom Anderson wrote:
    >>> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>>
    >>>> On 05-12-2010 10:10, Tom Anderson wrote:
    >>>>> On Sun, 5 Dec 2010, Arne Vajhøj wrote:
    >>>>>> On 05-12-2010 07:07, Tom Anderson wrote:
    >>>>>>> On Sat, 4 Dec 2010, Arne Vajhøj wrote:
    >>>>>>>> On 04-12-2010 20:34, wrote:
    >>>>>>>>>> The concept will be fundamentally broken if one language
    >>>>>>>>>> has more than one alphabet (I don't know if such case exist,
    >>>>>>>>>> but it could).
    >>>>>>>>> ~
    >>>>>>>>> Well, there are plenty of languages using more than one alphabet.
    >>>>>>>>> Japanese comes to mind:
    >>>>>>>>> ~
    >>>>>>>>> http://en.wikipedia.org/wiki/Japanese_writing_system
    >>>>>>>>> ~
    >>>>>>>>> It actually uses 4 writing systems: Kanji, Hiragana, Katakana,
    >>>>>>>>> Rōmaji
    >>>>>>>>
    >>>>>>>> Then the function is unimplementable.
    >>>>>>>
    >>>>>>> Why? Why doesn't the function simply return true for glyphs from
    >>>>>>> any of
    >>>>>>> those groups and "ja"?
    >>>>>>>
    >>>>>>> I note that in particular, it might return true for the fullwidth
    >>>>>>> romaji, but not the standard width ones.
    >>>>>>
    >>>>>> I guess it could.
    >>>>>>
    >>>>>> But then what does the result mean?
    >>>>>
    >>>>> It means that the character could be part of a text in that language.
    >>>>>
    >>>>>> It does not mean that the code point is valid in any text in that
    >>>>>> language.
    >>>>>
    >>>>> Surely that's exactly what it means?
    >>>>
    >>>> No.
    >>>>
    >>>> The difference is between any and some.
    >>>>
    >>>> With that semantics I find the function useless.
    >>>
    >>> I'm sorry to hear that. I don't.
    >>>
    >>> Perhaps the function should return a result from an enum -
    >>> NOT_IN_THIS_LANGUAGE, USED_IN_THIS_LANGUAGE,
    >>> EXCLUSIVELY_USED_IN_THIS_LANGUAGE.

    >>
    >> Which is not related at all to the problem I am describing!?!?

    >
    > Then at least one of us has misunderstood the other. Could you restate
    > your problem?


    The problem is that a given code point can be both valid and
    invalid in text in a given language, depending on the alphabet
    used.

    NOT_IN_THIS_LANGUAGE is OK.
    USED_IN_THIS_LANGUAGE is useless, because it does not tell whether the
    code point is valid or not.
    EXCLUSIVELY_USED_IN_THIS_LANGUAGE is irrelevant to the problem.

    >>>> isLetterFromAlphabet may make more sense. If Alphabet is sufficiently
    >>>> well defined.
    >>>
    >>> That could certainly be handy too. Since you could fairly easily
    >>> construct a many-to-many alphabet -> language mapping, you could
    >>> implement the original function on top of it.

    >>
    >> You could.
    >>
    >> But I can still not see the value of it.

    >
    > You could take some text and produce a set of languages it could
    > possibly be from. You wouldn't be able to tell many European languages
    > apart, but you could tell runs of typical Japanese, Chinese, Hindi,
    > Arabic, Urdu, etc apart.


    1) Code that works only in some cases is rarely useful.

    2) What you are using is a fake language-codepoint
    relation, which in reality is two relationships:
    language-alphabet and alphabet-codepoint. Not a very
    good model of reality.

    Arne
    Arne Vajhøj, Dec 8, 2010
    #11