List of English Alphabet Letters Programmatically

Fabio

I have been looking for a way to list the letters of the English alphabet in
Java. It seems that they forgot to add such a feature.

I am working on an internationalized application that has a bar with
the letters of an alphabet (let's say just Western European ones). The
user clicks on one of those letters, and then I display a list of options
that start with that letter.

I have not found a way to extract the letters of a particular language.
Even worse, it seems that the methods in the Character class do not
work properly. The code below prints characters that are not letters or
digits to the screen. Even if I could get the characters, I have not seen
a constructor that takes a Locale and gives me a character set for a
particular language. I know there are character sets for different
encodings such as 8859-1, but that is not what I need, because that
encoding contains the letters of all the languages.

Has anybody done this before?

Thanks,
Fabio


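public class CharDemo2 {

    public static void main(String[] args) {
        int digitCount = 0;   // characters for which Character.isDigit is true
        int letterCount = 0;  // characters for which Character.isLetter is true
        Character c = null;
        // Walk the first 256 code points (the ISO 8859-1 range).
        for (int i = 0; i <= 0xff; i++) {
            c = Character.valueOf((char) i);
            if (Character.isDigit(c.charValue())) {
                digitCount++;
            }
            if (Character.isLetter(c.charValue())) {
                letterCount++;
            }
            // Print each character that is not a letter or digit.
            if(!Character.isLetterOrDigit(c.charValue()))
                System.out.print(" i="+i+" "+c.charValue());
            // Break the output line after every tenth code point.
            if (i % 10 == 0)
                System.out.println();
        }
        System.out.println();
        System.out.println("number of digits = " + digitCount);
        System.out.println("number of letters = " + letterCount);
    }

}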
 
Alexander Bryanson

I suggest you use the Unicode definitions for your character toolbar.
Most languages have a given range (\uxxxx to \uyyyy) in which their
characters appear. The Character class was never written to
distinguish between languages; I assume that is because the
Unicode standard could very well change.
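The range idea can be sketched with Character.UnicodeBlock, which reports the Unicode block a code point belongs to. This is only an illustration: the class name BlockLetters and the method lettersIn are made up for this example, and the start and end of each range are passed in by hand because UnicodeBlock itself does not expose them. Note that a block is a range of code points, not the alphabet of any one language.

```java
import java.util.ArrayList;
import java.util.List;

class BlockLetters {
    // Collects every letter in [first, last] that belongs to the given block.
    // The range bounds are supplied by the caller (an assumption of this sketch).
    static List<String> lettersIn(Character.UnicodeBlock block, int first, int last) {
        List<String> letters = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            if (Character.UnicodeBlock.of(cp) == block && Character.isLetter(cp)) {
                letters.add(new String(Character.toChars(cp)));
            }
        }
        return letters;
    }

    public static void main(String[] args) {
        // Basic Latin is U+0000..U+007F, Latin-1 Supplement is U+0080..U+00FF.
        System.out.println(lettersIn(Character.UnicodeBlock.BASIC_LATIN, 0x0000, 0x007F));
        System.out.println(lettersIn(Character.UnicodeBlock.LATIN_1_SUPPLEMENT, 0x0080, 0x00FF));
    }
}
```

The second call returns letters used by many Western European languages at once, which is why a block alone cannot answer "which letters does language X have".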
 
Roedy Green

Fabio said:
I am working on an internationalized application that has a bar with
the letters of an alphabet (let's say just Western European ones). The
user clicks on one of those letters, and then I display a list of options
that start with that letter.

For String sorting in various international orders see
java.text.Collator, java.text.RuleBasedCollator, and
java.text.CollationKey.

I'd think what you might do is create one of these collators then
figure out from that which characters it considered part of its normal
alphabet.
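As a starting point, here is a minimal sketch of locale-aware sorting with java.text.Collator. The class name CollatorSortDemo and the sample words are invented for illustration; only Collator.getInstance(Locale) and Collections.sort are real API.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {
    // Returns a copy of the words sorted by the collation rules of the locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collator collator = Collator.getInstance(locale);
        Collections.sort(sorted, collator); // Collator is a Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "\u00d6sterreich");

        // Locale-aware order for German.
        System.out.println(sortFor(Locale.GERMAN, words));

        // Plain code point order, for comparison.
        List<String> byCodePoint = new ArrayList<>(words);
        Collections.sort(byCodePoint);
        System.out.println(byCodePoint);
    }
}
```

If the instance returned for a locale happens to be a RuleBasedCollator, its getRules() method exposes the rule string, which may help when working out what that collator treats as its alphabet. That is not guaranteed for every locale, so treat it as something to check rather than rely on.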
 
VK

Java "speaks" Unicode only, so forget any iso-...; go to www.unicode.org and
get the current character tables.
Two things to mention:
1) The Unicode tables are published as PDFs with the "no save" flag set, so if
you just download them, the viewer will say "file damaged". You need to either
crack them or copy from the site through the clipboard. Are they nuts? I have
no idea, but it looks like it...
2) From the very beginning until now, Java has "spoken" Unicode as if it had
just learned it, with a bad accent and numerous mistakes. So be prepared for
a lot of fun with internationalization under Java. The printing issues alone
could fill a poem...

Good luck!
 
Niels Dybdahl

Fabio said:
I have not found a way to extract the letters of a particular language.

I am not sure that this is straightforward. In Danish we usually assume that
we have the letters æ, ø and å in addition to the English letters and that we
do not have the letter w. However, we do have Danish words containing w, á
and é, and there are personal names in Danish containing the letters ö and
ü, although they are not official Danish letters.
In Spanish, the letter sequences ch and ll are considered single characters
even though they are often written as two characters. How are you going to
handle that? Dutch has the same problem with the sequence ij, which is one
Dutch character, but that sequence can also be written with the special
character ÿ, which solves the problem.
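For the ch and ll case, one possible approach is a java.text.RuleBasedCollator with hand-written rules in which the two-character sequences are listed as letters of their own. The sketch below is an assumption-laden illustration: the class name TraditionalSpanishSketch and the rule string are made up for this example, the rules cover lowercase letters only, and they are not the JDK's own Spanish rules.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

class TraditionalSpanishSketch {
    // Hypothetical lowercase-only rules: "ch" is a letter between c and d,
    // "ll" is a letter between l and m.
    static final String RULES =
        "< a < b < c < ch < d < e < f < g < h < i < j < k < l < ll < m < n < \u00f1"
        + " < o < p < q < r < s < t < u < v < w < x < y < z";

    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // Only reachable if the rule string above is malformed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = collator();
        // With "ch" as one letter after "c", "cuna" should sort before "chico".
        System.out.println(collator.compare("cuna", "chico") < 0);
        // Plain String comparison goes character by character, so 'h' < 'u' wins.
        System.out.println("cuna".compareTo("chico") < 0);
    }
}
```

This only settles sorting. Deciding whether the letter bar should show a separate "CH" entry is still a per-language choice that someone has to make.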

Best regards
Niels Dybdahl
 
John C. Bollinger

Fabio said:
I have been looking for a way to list the letters of the English alphabet in
Java. It seems that they forgot to add such a feature.

I doubt whether it was an oversight. How would it work, exactly? Is
this something that one would obtain from a Locale? What an
extraordinary amount of extra work in developing -- and maintaining --
Locales.

Fabio said:
I am working on an internationalized application that has a bar with
the letters of an alphabet (let's say just Western European ones). The
user clicks on one of those letters, and then I display a list of options
that start with that letter.

Evidently, then, you will have a list of localized options. You can get
at least most of what you want by examining the list and compiling the
first characters of all the options to figure out which to display.

Do note that whether by that technique or by having a full list of the
characters used in any language, you still have potential problems. For
instance, are you going to distinguish accented characters from their
non-accented base characters?
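If you decide that accented characters should share a bar entry with their base characters, one way to fold them is java.text.Normalizer. This is a sketch of one possible policy, not a recommendation for every language: the class name IndexLetter and the method baseLetter are made up, and the option is assumed to be non-empty.

```java
import java.text.Normalizer;
import java.util.Locale;

class IndexLetter {
    // Maps the first character of an option to an unaccented, upper-case letter.
    // Assumes the option is not empty.
    static String baseLetter(String option, Locale locale) {
        int cp = option.codePointAt(0);
        String first = new String(Character.toChars(cp));
        // NFD splits a precomposed character into base letter + combining marks.
        String decomposed = Normalizer.normalize(first, Normalizer.Form.NFD);
        // Drop the combining marks (Unicode category M).
        String stripped = decomposed.replaceAll("\\p{M}", "");
        return stripped.toUpperCase(locale);
    }

    public static void main(String[] args) {
        System.out.println(baseLetter("\u00e9t\u00e9", Locale.FRENCH)); // e with acute accent
        System.out.println(baseLetter("apple", Locale.ENGLISH));
    }
}
```

Folding is wrong for languages where the accented form is a letter in its own right, so whether to apply it has to be decided per language.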

Oh, and I don't recommend trying to do anything like this in a written
language that uses ideographic characters.

Fabio said:
I have not found a way to extract the letters of a particular language.
Even worse, it seems that the methods in the Character class do not
work properly. The code below prints characters that are not letters or
digits to the screen.

Well, conceivably the character attribute tables are messed up, but more
likely your screen font displays different characters for some codes
than is specified by the Unicode code tables. Note also that
Character's isLetter(char), isDigit(char), and isLetterOrDigit(char) are
based solely on the attributes assigned to each character by Unicode,
and thus identify characters that are letters or digits in any locale.

(That's not as much of a muddle as it might sound. There are generally
no clashes because even when a letter in one writing system has the same
or a very similar (series of) glyph(s) as a number or other character in
a different writing system, Unicode considers them different characters.
A Unicode character is logically distinct from its representation in
any writing system.)
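A quick way to see that these methods are locale-independent is to feed them characters from several writing systems. The class name LetterCheck and the helper describe are invented for this sketch; the Character methods are the real ones.

```java
class LetterCheck {
    // Formats the Unicode classification of one char.
    static String describe(char c) {
        return String.format("U+%04X letter=%b digit=%b",
                (int) c, Character.isLetter(c), Character.isDigit(c));
    }

    public static void main(String[] args) {
        char[] samples = {
            'a',       // Latin small letter a
            '\u00e9',  // Latin small letter e with acute
            '\u03b1',  // Greek small letter alpha
            '\u0436',  // Cyrillic small letter zhe
            '7',       // ASCII digit
            '\u0663',  // Arabic-Indic digit three
            '\u00d7',  // multiplication sign
            '%'        // percent sign
        };
        for (char c : samples) {
            System.out.println(describe(c));
        }
    }
}
```

No Locale appears anywhere in the calls: the answers come from the Unicode properties of each character.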

Fabio said:
Even if I could get the characters, I have not seen a constructor that
takes a Locale and gives me a character set for a particular language.
I know there are character sets for different encodings such as 8859-1,
but that is not what I need, because that encoding contains the letters
of all the languages.

Now you are mixing concepts. A Charset is used to transform a series
of bytes into a series of characters, or vice versa. Charsets are not
separable from character encodings, and they really have nothing to do
with grouping characters together except inasmuch as a Charset may only
be applicable to a subset of the full Unicode table.
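To make the distinction concrete, here is a small sketch of what a Charset actually does. The class name CharsetDemo and its two helpers are made up; StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Characters -> bytes.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes -> characters.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent, one Unicode character

        // The same character becomes a different byte sequence per encoding.
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length);
        System.out.println(encode(text, StandardCharsets.UTF_8).length);

        // Decoding with the wrong charset yields different characters.
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Nothing here says which language a character belongs to. A charset only maps between characters and bytes.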

I apologize that this may not have been very helpful with regard to
solving your problem, but I think the best approach may be to change the
problem.


John Bollinger
 
Roedy Green

John C. Bollinger said:
Evidently, then, you will have a list of localized options. You can get
at least most of what you want by examining the list and compiling the
first characters of all the options to figure out which to display.

What you might do is just roll your own class that does a lookup on
Locale. That's what I did for currency.
 
Steven Coco

Fabio said:
I am working on an internationalized application that has a bar with
the letters of an alphabet (let's say just Western European ones). The
user clicks on one of those letters, and then I display a list of options
that start with that letter.

I have not found a way to extract the letters of a particular language.

(Please stick with me for a second.)

It _would_ actually take a bit of programming and a *lot* of
synchronization to implement that. In fact, your need is pretty much
the reason that _Unicode_ exists. Here you need to tap into the Unicode
resource to achieve what you're looking for.

The characters of any language supported by Unicode are mapped to "code
points" in the Unicode table: 0x0000-0xFFFF for the Basic Multilingual
Plane, which is the range a Java char covers. The contents of the code
points are described in the specifications and charts at Unicode.org.
What you'll need to *do* is:

1) Locate the characters within that table for each language you wish to
support.
2) Build a map of each alphabet by its letters' code points.
3) Load these maps through Resource Bundles, or your preferred
internationalization method, in Java.

You have to do it that way because some languages share characters and,
for various other compiling and maintenance reasons, there is no
guarantee that the 'alphabet' of every supported language appears in
consecutive order within the table. That is why it would be a bunch of
work to support such a thing in the core library. Hopefully you've stuck
with me up till now.
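The three steps above could look roughly like this. Everything named here is hypothetical: the class AlphabetBundleSketch, the key "alphabet", and the idea of a file called Alphabet_da.properties. To keep the sketch self-contained, the file contents are held in a string and read with PropertyResourceBundle; a real application would more likely call ResourceBundle.getBundle with a base name and a Locale.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class AlphabetBundleSketch {
    // Stand-in for the contents of a hypothetical Alphabet_da.properties file.
    // The non-ASCII letters are written as \\uXXXX escapes, which the
    // properties loader turns back into characters.
    static final String DANISH_PROPERTIES =
        "alphabet=abcdefghijklmnopqrstuvwxyz\\u00e6\\u00f8\\u00e5\n";

    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String alphabet = load(DANISH_PROPERTIES).getString("alphabet");
        // One bar entry per code point of the stored alphabet.
        alphabet.codePoints()
                .forEach(cp -> System.out.print(new String(Character.toChars(cp)) + " "));
        System.out.println();
    }
}
```

Which letters go into each list (for example, whether w belongs in the Danish one) remains a decision for whoever maintains the files.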

A very good place to start is probably this page:
<http://www.unicode.org/roadmaps/bmp/>, which gives a road map of the
Basic Multilingual Plane. And I can toss in a word of advice too:

If you read through the recent thread in this forum titled "char
math"?, you'll find that to maintain stable and reliable code, you want
to use Unicode escapes (char literals of the form '\uxxxx') rather than
literal characters wherever possible, to avoid bad encoding
translations that could break your code. In your situation, you'll be
encountering most of the doubly-mapped characters and the like, which can
easily become switched and invalidate your programming efforts.

Also, where you said:
Even worse, it seems that the methods in the Character class do not
work properly. The code below prints characters that are not letters or
digits to the screen.

Well, according to what you wrote there, that's what your code does.
In fact it ONLY prints characters that are not letters or digits; see
the "!" in this line:
if(!Character.isLetterOrDigit(c.charValue()))
    System.out.print(" i="+i+" "+c.charValue());


I have no magnified skill in this area. What I know I learned through
interpreting the language and JVM specification, the Unicode
specification, and the advice of those programmers who gave it on this
list. And that works.

Best of luck.

--

..Steven Coco.
.........................................................................
When you're not sure:
"Confess your heart" says the Lord, "and you'll be freed."
 
Wojtek

Fabio said:
I have been looking for a way to list the letters of the English alphabet in
Java. It seems that they forgot to add such a feature.

I am working on an internationalized application that has a bar with
the letters of an alphabet (let's say just Western European ones). The
user clicks on one of those letters, and then I display a list of options
that start with that letter.

Are you also displaying letters for which options do not exist?

If not, then why not scan your options, build a list of unique first
letters, sort it, then create your bar?
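A rough sketch of that idea, with the sorting done by a java.text.Collator rather than by code point. The class name LetterBar, the method barFor, and the sample options are all made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class LetterBar {
    // Builds the bar from the first letters that actually occur in the options.
    static List<String> barFor(List<String> options, Locale locale) {
        // The collator decides both the ordering and which entries count as equal.
        Set<String> letters = new TreeSet<>(Collator.getInstance(locale));
        for (String option : options) {
            if (option.isEmpty()) {
                continue;
            }
            int cp = option.codePointAt(0);
            if (Character.isLetter(cp)) {
                letters.add(new String(Character.toChars(cp)).toUpperCase(locale));
            }
        }
        return new ArrayList<>(letters);
    }

    public static void main(String[] args) {
        List<String> options =
            Arrays.asList("banana", "apple", "Avocado", "\u00e9clair", "egg", "zoo");
        System.out.println(barFor(options, Locale.FRENCH));
    }
}
```

With the collator's default strength, an accented first letter gets its own entry next to its base letter. Merging the two is a separate policy decision.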
 
Steven Coco

Wojtek said:
Are you also displaying letters for which options do not exist?

If not, then why not scan your options, build a list of unique first
letters, sort it, then create your bar?

I'm curious whether you are _sure_ that works. Are all alphabets
guaranteed to sort 'alphabetically', or just by code point? *I* could
only guess (and HAVE).

 
Fabio

John C. Bollinger said:
I doubt whether it was an oversight. How would it work, exactly? Is
this something that one would obtain from a Locale? What an
extraordinary amount of extra work in developing -- and maintaining --
Locales.

Somebody took the time to write java.text.CollationRules, which I guess
is 10 times more work than just listing the letters and numbers in each
alphabet. And by the way, the constructor of the collator uses a Locale!!!

John C. Bollinger said:
Evidently, then, you will have a list of localized options. You can get
at least most of what you want by examining the list and compiling the
first characters of all the options to figure out which to display.

This could be misleading: if there are no entries in the database for a
particular letter, then it would not show up. So at the beginning my menu
would not appear. And now I have to do an expensive database query just to
get a list of letters?

John C. Bollinger said:
Do note that whether by that technique or by having a full list of the
characters used in any language, you still have potential problems. For
instance, are you going to distinguish accented characters from their
non-accented base characters?

That's exactly my point: encapsulation, encapsulation, encapsulation.
Why should I, as a developer, know about all the little variants of
languages? I could write an application that works in any language without
knowing the details. As somebody mentioned before, Spanish considers CH a
single letter. Why not encapsulate the details once and let the developer
use them?

John C. Bollinger said:
Oh, and I don't recommend trying to do anything like this in a written
language that uses ideographic characters.

John C. Bollinger said:
Well, conceivably the character attribute tables are messed up, but more
likely your screen font displays different characters for some codes
than is specified by the Unicode code tables. Note also that
Character's isLetter(char), isDigit(char), and isLetterOrDigit(char) are
based solely on the attributes assigned to each character by Unicode,
and thus identify characters that are letters or digits in any locale.

I guess the purpose of Unicode was to include all the languages of the
world, which is a good idea, but you only use one at a time. Look at
NumberFormat and DateFormat: they use a Locale and encapsulate the
formatting; you only need to pass the locale and voila.
 
Michael Borgwardt

Fabio said:
I guess the purpose of Unicode was to include all the languages of the
world, which is a good idea, but you only use one at a time.

Um, no. What if you want to talk in one language *about* another language,
citing words and sentences from it? I do that frequently, talking in German
about Japanese, and before Unicode (UTF-8) it would have been impossible
to use both Japanese characters and German umlauts in the same email.
 
Bent C Dalager

Niels Dybdahl said:
In Spanish, the letter sequences ch and ll are considered single characters
even though they are often written as two characters. How are you going to
handle that? Dutch has the same problem with the sequence ij, which is one
Dutch character, but that sequence can also be written with the special
character ÿ, which solves the problem.

In Norwegian, you get the character sequence "aa", which is either to
be counted as one character (older form of the letter "å" - an "a"
with a circle on top for those who don't have that glyph) or as the
sequence of two letters "a" and "a". There is generally no way to tell
which of the two is correct for any given situation. You just have to
know :)

To make it more fun, an "a" is at the start of the alphabet while the
single-character interpretation of "aa" is at the end of it, which
gets interesting when you write sorting algorithms ...

Cheers
Bent D
 
John C. Bollinger

Fabio said:
I guess if somebody took the time to write the
java.text.CollationRules which is 10 times more work than just listing
the letters and numbers in each alphabet. And by the way the
constructor of the collator uses !!!Locale.

Do you mean java.text.RuleBasedCollator? There is no
java.text.CollationRules in the 1.4.2 API docs.

RuleBasedCollator only needed to be written once; it is highly
configurable and generally applicable to a wide variety of problems. A
facility that can provide the information you want for any and every
language might be possible if the required data were encoded into
Locales, but that means extra work in creating _every_ Locale (including
user-defined ones). A more restricted facility with a static list of
supported languages would be less of a burden, but it wouldn't achieve
the true generality that you seem to want.

[...]

Fabio said:
That's exactly my point: encapsulation, encapsulation, encapsulation.
Why should I, as a developer, know about all the little variants of
languages? I could write an application that works in any language without
knowing the details. As somebody mentioned before, Spanish considers CH a
single letter. Why not encapsulate the details once and let the developer
use them?

Again, who is going to provide a full suite of local language support
utilities for every (human) language under the sun? Should every Java
implementor be required to do so? That is what specifying it as part
of the platform API would do. Such software surely would be convenient
for internationalization projects, but I'm not persuaded that the
benefits would outweigh the initial and continuing costs of putting it in.

You will have particular requirements for doing generic language
processing in your app, so I suggest you define those requirements by
way of an interface, and provide an appropriate implementation for each
language you support. The appropriate class could then be loaded and
instantiated along with the rest of your localized data. You would
still have to work out the details of each concrete implementation, of
course, which is what you were trying to avoid, but someone somewhere
has to do that work. You may be able to get some of what you need from
the Unicode tables. For anything you need that is not found there, I
have trouble understanding why you seem outraged that Java doesn't have
it built-in.
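Such an interface might look something like the sketch below. All names in it (AlphabetSketch, Alphabet, ListAlphabet, indexLetter, traditionalSpanish) are invented for illustration, and the Spanish letter list is just an example of one concrete implementation, limited to lowercase matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AlphabetSketch {
    /** Hypothetical per-language contract; not part of the JDK. */
    interface Alphabet {
        List<String> letters();              // entries to show on the bar
        String indexLetter(String option);   // bar entry an option belongs to
    }

    /** Simple implementation backed by an explicit list of letters. */
    static final class ListAlphabet implements Alphabet {
        private final List<String> letters;

        ListAlphabet(String... letters) {
            this.letters = Arrays.asList(letters);
        }

        @Override
        public List<String> letters() {
            return letters;
        }

        @Override
        public String indexLetter(String option) {
            String lower = option.toLowerCase(Locale.ROOT);
            String best = null;
            // Prefer the longest matching letter, so "ch" wins over "c".
            for (String letter : letters) {
                if (lower.startsWith(letter)
                        && (best == null || letter.length() > best.length())) {
                    best = letter;
                }
            }
            return best; // null when no letter of this alphabet matches
        }
    }

    static Alphabet traditionalSpanish() {
        return new ListAlphabet("a", "b", "c", "ch", "d", "e", "f", "g", "h", "i",
                "j", "k", "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s",
                "t", "u", "v", "w", "x", "y", "z");
    }

    public static void main(String[] args) {
        Alphabet spanish = traditionalSpanish();
        System.out.println(spanish.indexLetter("Chocolate"));
        System.out.println(spanish.indexLetter("casa"));
    }
}
```

The rest of the application would depend only on the interface, and the implementation for each supported language would be chosen along with the other localized data.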

Fabio said:
I guess the purpose of Unicode was to include all the languages of the
world.

Not exactly. The purpose of Unicode is to encode all the characters of
all the writing systems of the world -- relationships to languages are
ancillary and not altogether straightforward. Some writing systems are
used by multiple languages, some languages have no associated writing
system, and some languages may have more than one associated writing system.

Fabio said:
Which is a good idea, but you only use one at a time.

That is generally true, but not universally so.

Fabio said:
Look at NumberFormat and DateFormat: they use a Locale and encapsulate
the formatting; you only need to pass the locale and voila.

And very convenient it is, too, provided that you want to use a
supported locale. I fully appreciate that you want to write generically
localized code, and I applaud you for it. However, I think that at
least some of what you will need to have to implement your plan as
currently constituted does not belong in the core API. I am therefore
not very receptive to complaints about its absence. If you disagree,
as you seem to do, then why don't you draft specifications for the
features you would like to see, and submit them to Sun as a request for
enhancement? That won't get it for you very quickly, and might not get
it for you at all, but it would be more productive than angry
complaints to this newsgroup. It will also help you directly in your
software design effort.


John Bollinger
 
Wojtek

Steven Coco said:
I'm curious whether you are _sure_ that works. Are all alphabets
guaranteed to sort 'alphabetically', or just by code point? *I* could
only guess (and HAVE).

See java.text.Collator. It uses locales for sorting.
 
