Function for removing Accents?

Luc The Perverse

Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I know that Java understands these characters because when I make a string
lowercase it will convert the capital AE to a lowercase ae.

All I could find was a reference to an obscure unimplemented API. I would
go for a publicly available function though - anything has got to be better
than the way I am doing it. (If worst comes to worst I will copy my ANSI
tables and make some "range" conversions)
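
For the record, a minimal sketch of the kind of chain I mean (the
characters shown here are illustrative, not my full list):

public static String stripAccents(String s) {
    // Crude hard-coded mapping; every character has to be listed by hand.
    return s.replaceAll("[àáâãäå]", "a")
            .replaceAll("[èéêë]", "e")
            .replaceAll("[ìíîï]", "i")
            .replaceAll("[òóôõö]", "o")
            .replaceAll("[ùúûü]", "u")
            .replaceAll("æ", "ae")   // the Latin AE combo
            .replaceAll("ß", "ss")   // the German double S
            .replaceAll("ñ", "n");
}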
 
Chris Uppal

Luc said:
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

I doubt if that is either meaningful or possible in general. It is certainly
not easy.

If you try it at all, and are not satisfied with a handful of hardwired
mappings, then you'll probably have to get deeply into Unicode. See:
http://www.unicode.org/reports/tr15/index.html
for one of the Unicode technical reports which discusses decomposition of
characters into base characters plus various kinds of diacritical marks. You
could presumably filter out characters representing diacritical marks leaving
the character which was qualified by the marks in place.
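
Something along these lines, assuming a normalizer that implements the
NFD form that report describes (java.text.Normalizer in Java 6 does, as
does the IBM normalizer mentioned elsewhere in this thread). Note that
characters with no canonical decomposition, such as æ and ß, pass
through untouched:

import java.text.Normalizer;

public class Decompose {
    // Decompose into base characters plus combining marks (NFD), then
    // drop the marks (U+0300..U+036F, the block matched by the regex).
    public static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("rückenwind"));  // ruckenwind
        System.out.println(stripMarks("æ and ß"));     // unchanged
    }
}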

-- chris
 
bugbear

Luc said:
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I know that Java understands these characters because when I make a string
lowercase it will convert the capital AE to a lowercase ae.

All I could find was a reference to an obscure unimplemented API. I would
go for a publicly available function though - anything has got to be better
than the way I am doing it. (If worst comes to worst I will copy my ANSI
tables and make some "range" conversions)

http://glaforge.free.fr/weblog/index.php?itemid=115

looks informative.

http://www.alphaworks.ibm.com/tech/unicodenormalizer

looks helpful.

BugBear
 
Mickey Segal

Luc The Perverse said:
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

When I brought up this issue a few months ago the consensus was that the
replacing approach was the way to go. I've chosen 37 commonly used
characters to replace, but I'm sure one could get fancier.
 
Thomas Weidenfeller

Luc said:
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

There is no such thing as "non accented equivalents" for accented
characters in general. Accents, diaereses, etc. on a character have a
meaning; they are not there for decorative reasons. A character without
its "decoration" is usually a completely different character from the
original.

In short, removing such information is a very bad idea.

All I could find was a reference to an obscure unimplemented API. I would
go for a publicly available function though - anything has got to be better
than the way I am doing it. (If worst comes to worst I will copy my ANSI
tables and make some "range" conversions)

If preserving the meaning of your input text data is not desirable, or,
in other words, you badly want to butcher your text, annoy native
speakers and apply an irreversible algorithm, then you could try the
following:

Get the Unicode Character Database (UCD), and build a particular
lookup-table from the information in the UCD:

Parse the UCD data to extract all characters which have explicit
decomposition information. Analyze the decomposition information (search
for the accents, diaeresis marks, etc. in the information). If you find
one, remove it from the decomposition map of the particular character.
If a single character remains in the map, consider that your result,
otherwise apply more decomposition.

Use that information to build a lookup table "original char -> butchered
char". Code that table into some Java class and provide a method to look
up characters in the map.
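
A rough sketch of that, assuming the UCD's UnicodeData.txt sits in the
working directory (field 0 is the code point, field 5 the decomposition
mapping). It applies only one level of decomposition, so the recursive
step described above is left out:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class BuildTable {
    public static void main(String[] args) throws IOException {
        // Code point -> canonical decomposition, read from UnicodeData.txt
        Map<Integer, String> decomp = new HashMap<Integer, String>();
        BufferedReader in = new BufferedReader(new FileReader("UnicodeData.txt"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split(";", -1);
            // Field 5; skip compatibility mappings, which start with "<tag>"
            if (!f[5].isEmpty() && !f[5].startsWith("<")) {
                decomp.put(Integer.parseInt(f[0], 16), f[5]);
            }
        }
        in.close();
        // Emit "original char -> butchered char" entries where removing
        // the combining marks (U+0300..U+036F) leaves one base character.
        for (Map.Entry<Integer, String> e : decomp.entrySet()) {
            StringBuilder base = new StringBuilder();
            for (String cp : e.getValue().split(" ")) {
                int c = Integer.parseInt(cp, 16);
                if (c < 0x300 || c > 0x36F) base.appendCodePoint(c);
            }
            if (base.length() == 1) {
                System.out.printf("map.put((char) 0x%04X, '%c');%n",
                        e.getKey(), base.charAt(0));
            }
        }
    }
}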

/Thomas
 
Chris Uppal

Thomas said:
Get the Unicode Character Database (UCD), and build a particular
lookup-table from the information in the UCD:

Parse the UCD data to extract all characters which have explicit
decomposition information. Analyze the decomposition information (search
for the accents, diaeresis marks, etc. in the information). If you find
one, remove it from the decomposition map of the particular character.
If a single character remains in the map, consider that your result,
otherwise apply more decomposition.

You could probably speed up this process considerably by using a pre-existing
Unicode package such as ICU:

http://icu.sourceforge.net/

That would only save code, though. You'd still have the problem of learning
what all the data means before you start to manipulate it.
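
For what it's worth, ICU4J ships a compound transliterator that chains
exactly those steps (decompose, remove the nonspacing marks, recompose):

import com.ibm.icu.text.Transliterator;

public class IcuDemo {
    public static void main(String[] args) {
        Transliterator t =
            Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove; NFC");
        System.out.println(t.transliterate("Piña colada"));  // Pina colada
    }
}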

Incidentally, why is ICU never mentioned around here? It looks like a very
solid and complete package -- although I admit I haven't actually /used/ it yet
(planning to do so fairly soon).

-- chris
 
Thomas Weidenfeller

Chris said:
You could probably speed up this process considerably by using a pre-existing
Unicode package such as ICU:

I am not sure :) My understanding of ICU after checking the
documentation is that it doesn't do the destructive thing the OP might
want to do. It looks more as if the authors of ICU tried very hard to
get every aspect of Unicode right. Mapping an accented character to a
single non-accented "equivalent" is certainly not right in the scope of
Unicode, and also not in the scope of non-ASCII languages.

The effort to invest in a solution also depends on how good the solution
has to be. Since the original text is supposed to be butchered anyhow, I
don't see a reason for 100% accuracy.

So, scripting the parsing of the UCD for finding the interesting values
should not take that much time. I would guess less than an hour. That
should include scripting the check of the decomposition values for these
"bad" accents (the combining marks, code points 0x300 up to 0x36F). The
result should be a map of a bunch of characters.

Some more scripting to get that output into a Java data structure, add a
lookup method, compile, and that's it.
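
The generated class might end up looking something like this (the two
entries shown are just placeholders for the script's output):

import java.util.HashMap;
import java.util.Map;

public final class AccentMap {
    private static final Map<Character, Character> MAP =
            new HashMap<Character, Character>();
    static {
        // Entries emitted by the UCD script; only two shown here.
        MAP.put('\u00E9', 'e');   // é -> e
        MAP.put('\u00FC', 'u');   // ü -> u
    }

    private AccentMap() {}

    // Unmapped characters pass through unchanged.
    public static char butcher(char c) {
        Character r = MAP.get(c);
        return r == null ? c : r.charValue();
    }
}
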
Incidentally, why is ICU never mentioned around here?

Probably because people don't know about it (I didn't). And probably
because it solves problems not many people have each day.

/Thomas
 
Hendrik Maryns

Luc The Perverse wrote:
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I think this is impossible, because the conversion is very
language-dependent: if you want to remove the ¨ from a German ä, it
should become ae, whereas in the rarer case where you encounter an ä in
Dutch, the best equivalent would be a. Unless you just want to remove
accents without trying to keep meaning, but then, why not just remove
the accented letters altogether?

H.
--
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
Mickey Segal

Thomas Weidenfeller said:
There is no such thing as "non accented equivalents" for accented
characters in general. Accents, diaereses, etc. on a character have a
meaning; they are not there for decorative reasons. A character without
its "decoration" is usually a completely different character from the
original.

In short, removing such information is a very bad idea.

If preserving the meaning of your input text data is not desirable, or, in
other words, you badly want to butcher your text, annoy native speakers
and apply an irreversible algorithm, then you could try the following:

There are times in which it makes sense to remove accented characters, for
example when carrying out a search. In our software we store names of
diseases in their fully accented forms. However, users of our software may
be familiar with the name of the disease as rendered with unaccented
characters, or they may not know how to type accented characters. We carry
out a user's search by removing accents from the search text and stripping
accents from the disease names for the purposes of the search. We do not
store the stripped-down strings.
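
In code, that amounts to folding both sides of the comparison on the fly
and never keeping the folded form; a sketch, reusing the NFD-based
stripping discussed earlier in the thread:

import java.text.Normalizer;

public class AccentInsensitiveSearch {
    // Fold for comparison only; the stored names stay fully accented.
    private static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                         .toLowerCase();
    }

    public static boolean matches(String storedName, String query) {
        return fold(storedName).contains(fold(query));
    }
}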
 
Real Gagnon

All I could find was a reference to an obscure unimplemented API. I
would go for a publicly available function though - anything has got
to be better than the way I am doing it. (If worst comes to worst I
will copy my ANSI tables and make some "range" conversions)

You can go for the "obscure API" (Sun or IBM), do a bunch of replaceAll()
or use an Array lookup.

An example of each can be found at
http://www.rgagnon.com/javadetails/java-0456.html
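
The array-lookup variant boils down to indexing the Latin-1 accented
range into a string of plain replacements. A compressed, from-memory
sketch (the table is illustrative, and one-to-many cases such as
Æ -> AE or ß -> ss still need separate string handling):

public class ArrayLookup {
    // One plain character per code point in U+00C0..U+00FF ('*' and '/'
    // stand in for × and ÷; Æ, Þ and ß are crudely approximated).
    private static final String PLAIN =
        "AAAAAAACEEEEIIIIDNOOOOO*OUUUUYPs" +
        "aaaaaaaceeeeiiiidnooooo/ouuuuypy";

    public static char fold(char c) {
        return (c >= '\u00C0' && c <= '\u00FF') ? PLAIN.charAt(c - '\u00C0') : c;
    }
}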

Bye.
 
Luc The Perverse

Hendrik Maryns said:
Luc The Perverse wrote:

I think this is impossible, because the conversion is very
language-dependent: if you want to remove the ¨ from a German ä, it
should become ae, whereas in the rarer case where you encounter an ä in
Dutch, the best equivalent would be a. Unless you just want to remove
accents without trying to keep meaning, but then, why not just remove
the accented letters altogether?

All I want to design is an accent-insensitive search dictionary. No
meaning needs to be preserved because the user will not see any converted
text.

I do not wish to remove the accent altogether, because I want the user to be
able to search with or without the accent.

For instance if I want to search for rückenwind, it will take me at least
several additional seconds to experiment with the ALT keys to find the right
character combination for ü. (Although learning the character codes is
easier than attempting to learn alternate German spellings.) It doesn't
mean I don't know that there is an accent there; it is just easier to type
ruckenwind.

And if you cringe at that, you might faint to see the approximate
transliterations that I have used to name my Russian MP3s!
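
Concretely, the dictionary can be indexed under folded keys so that
either spelling finds the entry; a sketch (the NFD-based fold is one
option, but any of the stripping methods from this thread would do):

import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SearchDictionary {
    private final Map<String, List<String>> index =
            new HashMap<String, List<String>>();

    // Fold to a lowercase, accent-stripped key (NFD + drop combining marks).
    private static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                         .toLowerCase();
    }

    public void add(String title) {
        String key = fold(title);
        List<String> hits = index.get(key);
        if (hits == null) index.put(key, hits = new ArrayList<String>());
        hits.add(title);
    }

    // lookup("ruckenwind") and lookup("rückenwind") both find "rückenwind".
    public List<String> lookup(String query) {
        List<String> hits = index.get(fold(query));
        return hits == null ? Collections.<String>emptyList() : hits;
    }
}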
 
Luc The Perverse

Chris Uppal said:
I doubt if that is either meaningful or possible in general. It is
certainly not easy.

My wife does not know to push ALT 0 2 4 1 for an ñ, so she would have a hard
time searching for Piña Coladas if she wanted to hear that Garth Brooks
song.
If you try it at all, and are not satisfied with a handful of hardwired
mappings, then you'll probably have to get deeply into Unicode. See:
http://www.unicode.org/reports/tr15/index.html
for one of the Unicode technical reports which discusses decomposition of
characters into base characters plus various kinds of diacritical marks.
You could presumably filter out characters representing diacritical marks
leaving the character which was qualified by the marks in place.

AH! I don't want to "learn" Unicode. Especially after looking at that
link!

Perhaps I should have explained in more detail the scope of my question.
 
Luc The Perverse

Thomas Weidenfeller said:
I am not sure :) My understanding of ICU after checking the documentation
is that it doesn't do the destructive thing the OP might want to do. It
looks more as if the authors of ICU tried very hard to get every aspect of
Unicode right. Mapping an accented character to a single non-accented
"equivalent" is certainly not right in the scope of Unicode, and also not
in the scope of non-ASCII languages.

I did say approximate representation.

The thing that annoys me the most is the song by Tool named ænima.

The effort to invest in a solution also depends on how good the solution
has to be. Since the original text is anyhow supposed to be butchered, I
don't see a reason for 100% accuracy.

To be honest, the solution that I have now works just fine. I have
hardcoded replaceAlls for every character appearing in my file-name scope.

Probably because people don't know about it (I didn't). And probably
because it solves problems not many people have each day.

You are not the only one who had not heard about it.

Unicode is such a beautiful thing - I wonder why it takes so long to catch
on?
 
Luc The Perverse

Mickey Segal said:
There are times in which it makes sense to remove accented characters, for
example when carrying out a search. In our software we store names of
diseases in their fully accented forms. However, users of our software
may be familiar with the name of the disease as rendered with unaccented
characters, or they may not know how to type accented characters. We
carry out a user's search by removing accents from the search text and
stripping accents from the disease names for the purposes of the search.
We do not store the stripped-down strings.

Thomas seemed to think I was removing accents from a document.

The easiest solution would have been to rename all the files/titles of the
media so that I didn't have to deal with accented characters at all. But
to me ça is not ca, muñeca is not muneca. So I would have to agree with
him.

My desire to get a "better solution" is just to make my functions more
future-ready (I hesitate to say future-proof).
 
Luc The Perverse

Real Gagnon said:
You can go for the "obscure API" (Sun or IBM), do a bunch of replaceAll()
or use an Array lookup.

An example of each can be found at
http://www.rgagnon.com/javadetails/java-0456.html

Awesome! That is exactly what I was looking for.

Now I can search for "Belanger" next time I want to listen to "La
Parapluie"! (But to be honest, é is one of the truly natural key sequences
that I have memorized.)
 
alexandre_paterson

Luc The Perverse wrote:
....
For instance if I want to search for rückenwind, it will take me at least
several additional seconds to experiment with the ALT keys to find the right
character combination for ü. (Although learning the character codes is
easier than attempting to learn alternate German spellings.) It doesn't
mean I don't know that there is an accent there; it is just easier to type
ruckenwind.

And if you cringe at that, you might faint to see the approximate
transliterations that I have used to name my Russian MP3s!

You decided to go for the "accent-removing" way, though there
are many different ways to achieve the functionality you're after.

To me this looks very similar to a spelling algorithm (even though
you're using several languages at once).

For every search word entered you could:
-check exact matches
-check "one typo" matches (eg "Pixes" instead of "Pixies")
-check "two typos" matches (eg "Spellshaker" instead of "Spellchecker")
-check all "sound-alike" words (using eg Soundex or Double-Metaphone)
-rank all the propositions found using these various techniques by
their "closeness" to the search term entered (using Levenhstein's
edit distance, implemented using a DP algorithm).

This should allow you to enter really badly written band names/song
names and still find the song(s) you're after.

For example, my home-made spellchecker's first proposition for
"paulitiquale" ("political" spelled in a strange French-phonetical
way) is, sure enough, "political" (note that Un*x' aspell/ispell
works the same way).

Now of course this is lots of work for a simple functionality, but
you could still use something similar but much, much simpler
(that could be implemented very easily):

- for every correct name you have in your database, you create
a hashmap entry with all vowels removed

Pixies -> Pxs
Pink Floyd -> Pnk Fld
rückenwind -> rckwnd

etc.

Then, when someone enters a search term, *if you don't find
an exact match*, you try removing every vowel
from the entered search term and see if this corresponds
to something in your hashmap.

You can get a little more fancy by checking if the entry (or
entries) in the hashmap really correspond to something close
to the entered search term by calculating the "edit distance"
between the two strings.
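
A bare-bones version of those two pieces (the vowel set, the lowercase
folding and the distance threshold are all arbitrary choices):

import java.util.HashMap;
import java.util.Map;

public class FuzzyLookup {
    // Correct name indexed by its vowel-stripped form: "Pixies" -> "pxs"
    // (last writer wins on key collisions in this simple sketch).
    private final Map<String, String> byConsonants = new HashMap<String, String>();

    public void add(String name) {
        byConsonants.put(stripVowels(name), name);
    }

    // Fall back to the vowel-stripped key, then confirm with edit distance.
    public String lookup(String term) {
        String hit = byConsonants.get(stripVowels(term));
        return (hit != null && editDistance(hit, term) <= 3) ? hit : null;
    }

    private static String stripVowels(String s) {
        return s.toLowerCase().replaceAll("[aeiouy]", "");
    }

    // Classic O(n*m) dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1]
                                + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }
}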

It really looks like what a spellchecker would do: the whole
point is not using a single technique, but a variety of
techniques, each one augmenting the probability that
the search gives back a meaningful result.

FWIW,

Alex

P.S.: this post was edited directly in groups.google.com, without
bothering to copy & paste it into a spellchecker ;)
 
Hendrik Maryns


alexandre_paterson wrote:
Luc The Perverse wrote:
...

You decided to go for the "accent-removing" way, though there
are many different ways to achieve the functionality you're after.

To me this looks very similar to a spelling algorithm (even though
you're using several languages at once).

For every search word entered you could:
-check exact matches
-check "one typo" matches (eg "Pixes" instead of "Pixies")
-check "two typos" matches (eg "Spellshaker" instead of "Spellchecker")
-check all "sound-alike" words (using eg Soundex or Double-Metaphone)
-rank all the propositions found using these various techniques by
their "closeness" to the search term entered (using Levenhstein's
edit distance, implemented using a DP algorithm).

There surely are packages that do ‘fuzzy’ string matching (if not in
Java, then certainly in Perl (CPAN)).

This should allow you to enter really badly written band names/song
names and still find the song(s) you're after.

For example, my home-made spellchecker's first proposition for
"paulitiquale" ("political" spelled in a strange French-phonetical
way) is, sure enough, "political" (note that Un*x' aspell/ispell
works the same way).

Now of course this is lots of work for a simple functionality, but
you could still use something similar but much much simpler
(that could be implemented very easily):

- for every correct name you have in your database, you create
a hashmap entry with all vowels removed

Pixies -> Pxs
Pink Floyd -> Pnk Fld
rückenwind -> rckwnd

etc.

Nice idea, but in the end you end up removing all letters: how about ñ
(Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), ś (Polish?), and
those with a v on top of them (the caron), I can't find them either...

And what do you do if you want to look for this song: the second one on
http://en.wikipedia.org/wiki/Windowlicker#Track_listing ? :p

H.
--
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
Domagoj Klepac

There is no such thing as "non accented equivalents" for accented
characters in general. Accents, diaereses, etc. on a character have a
meaning; they are not there for decorative reasons. A character without
its "decoration" is usually a completely different character from the
original.

In short, removing such information is a very bad idea.

It depends, really.

We in non-English parts of the world, whose alphabets have more letters
than English's, have lived with that since the PC came along (the Mac,
on the other hand, always had almost transparent internationalization
support). This has led to a wide adoption of "non-accented equivalents"
in specific areas - for example, in Usenet posts most people don't use
accented characters, although nowadays almost all Usenet readers support
them. And even Windows XP might not be able to access a file if the file
name (or folder name) contains non-English characters.

Mostly, those equivalents are simply the same character with the accent
stripped. Sometimes an accented character maps to several ASCII
characters. For example, Croatian "ž" maps to "z", but "Đ" maps to
"Dj". Or, in German, "ß" maps to "ss". But all vowels, which make up
most of the accented characters, can (I think) simply be stripped to
their non-accented counterparts.
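
Which also means a plain char-to-char table is not enough; the map has
to allow multi-character string replacements, e.g.:

import java.util.LinkedHashMap;
import java.util.Map;

public class MultiCharMap {
    private static final Map<String, String> MAP =
            new LinkedHashMap<String, String>();
    static {
        MAP.put("ž", "z");    // Croatian: one character -> one
        MAP.put("Đ", "Dj");   // Croatian: one character -> two
        MAP.put("ß", "ss");   // German
    }

    public static String fold(String s) {
        for (Map.Entry<String, String> e : MAP.entrySet())
            s = s.replace(e.getKey(), e.getValue());
        return s;
    }
}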

Google does this, BTW. It searches for both the accented and non-accented
versions of the word you entered. That's a good example of a case where
removing accents is good.

Domchi
 
Chris Uppal

Luc said:
My wife does not know to push ALT 0 2 4 1 for an ñ, so she would have a
hard time searching for Piña Coladas if she wanted to hear that Garth
Brooks song.

In that case a little properties file containing mappings should do the job.
Add new mappings as and when the need arises. You could allow multiple
candidate replacements, so that your aenima example (which I can't type into
this newsreader correctly) could be matched by any of:
the correct version
aenima
anima
enima
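
Loading such a file takes only a few lines; the file name and the
comma-separated-candidates format below are invented for the sketch:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class MappingFile {
    // accent-mappings.properties (hypothetical), one line per entry, e.g.
    //   aenima = \u00E6nima,aenima,anima,enima
    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        FileInputStream in = new FileInputStream("accent-mappings.properties");
        p.load(in);
        in.close();
        for (String key : p.stringPropertyNames()) {
            List<String> candidates =
                    Arrays.asList(p.getProperty(key).split(","));
            System.out.println(key + " matches any of " + candidates);
        }
    }
}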

I don't want to "learn" Unicode.

Odd that you should say that; quite a lot of programmers seem to feel the
same...

-- chris
 
Domagoj Klepac

Unicode is such a beautiful thing - I wonder why it takes so long to catch
on?

Because internationalization was always expensive, and the IT industry,
based in Western countries, had no incentive to adapt to new, small
markets. For example, I would guess that there is no Esperanto Windows
version, or Esperanto Office - not enough people would buy it. And if
even mighty Microsoft can't afford to do it, other software vendors
certainly can't.

Think of it - how many programs did you write which export all GUI
text and make it easy to translate? And which are designed so that
both the German translation and the Arabic translation fit into the
design? And mind you, the German translation might be much longer than
the English text - Germans are known for their excellent dubbing of
foreign TV shows, since German subtitles would take up half the screen
and are simply not an option. And Arabic is written from right to left.

Things regarding internationalization have been changing rapidly,
though, since India and China became huge markets. Now most of the
Western-centered companies are struggling to get into Asia, and they
have a huge disadvantage regarding internationalization compared to the
native companies. What goes around comes around. :)

But I digress - what I wanted to say is that Java, with its native
Unicode support, has probably done more for Unicode adoption than
all other languages combined. :) But Java is still new, and I suspect
that most of the world's codebase is in C/C++, where you're not
"forced" to use Unicode by default. Which is probably why Unicode
is taking so long to catch on.

Domchi
 
