Function for removing Accents?

Alex Buell

Because internationalization was always expensive, and the IT industry
was based in the Western countries, and it had no incentive to adapt
to the new, small markets. For example, I would guess that there is no
Esperanto Windows version. Or Esperanto Office. Not enough people
would buy it. And if even mighty Microsoft can't afford to do it,
other software vendors certainly can't.

Microsoft can afford to. They just aren't interested.
 
alexandre_paterson

Hi Hendrik!

Hendrik Maryns wrote:
....
Nice idea, but in the end you end up removing all letters: how about ñ
(Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), s (Czech?), and
those with a v on top of them, I can't find them either...

I know... Which is why I said that the idea of a spellchecker is not
to use a single technique but a variety of techniques.

You've got a "result List<SortableString>", each technique adds
zero or more "propositions" to this List.

- You first try all correct matches
- You then try all "one letter typos" (one wrong letter, one letter
not entered, one letter typed too much)
- You then try all "two letters typos" (still easily doable)
- You then try all close "no vowels" words
- etc.

Then you rank all those propositions according to their "edit
distance" to correct band names/song names.
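
The "one letter typos" step in the list above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: the class name TypoCandidates, the method oneLetterTypos, and the a-z alphabet are assumptions made for the example, and a plain Set<String> stands in for whatever dictionary structure is really used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: one "technique" that adds propositions to the result list.
final class TypoCandidates {
    // Assumption: lowercase ASCII words only.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    /** Returns dictionary words that differ from the typed word by one wrong, missing, or extra letter. */
    static Set<String> oneLetterTypos(String typed, Set<String> dictionary) {
        Set<String> propositions = new LinkedHashSet<>();
        for (int i = 0; i <= typed.length(); i++) {
            String head = typed.substring(0, i);
            String tail = typed.substring(i);
            if (!tail.isEmpty()) {
                // One letter typed too much: drop the character at position i.
                addIfKnown(head + tail.substring(1), dictionary, propositions);
            }
            for (char c : ALPHABET.toCharArray()) {
                // One letter not entered: insert c at position i.
                addIfKnown(head + c + tail, dictionary, propositions);
                if (!tail.isEmpty()) {
                    // One wrong letter: replace the character at position i with c.
                    addIfKnown(head + c + tail.substring(1), dictionary, propositions);
                }
            }
        }
        // Exact matches belong to the earlier "correct matches" step.
        propositions.remove(typed);
        return propositions;
    }

    private static void addIfKnown(String candidate, Set<String> dictionary, Set<String> out) {
        if (dictionary.contains(candidate)) {
            out.add(candidate);
        }
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("google", "goggle", "gigolo"));
        System.out.println(oneLetterTypos("gogle", dictionary));
    }
}
```

A "two letters typos" step could apply the same generation to each one-letter variant before the dictionary lookup, at the cost of many more candidates.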

It's not computationally intensive: at billions of operations per second
you can get "creative". (It can be memory-consuming, though: if
you've got hundreds of thousands of songs, you need to
bypass Java's standard objects [HashMap and String just won't cut it] and
use your own data structure. I know, for I got to 22 bits per
word on a spellchecker able to deal with dictionaries of
hundreds of thousands of words -- try to do that with Jazzy ;)

And there's no "bad technique" either. Each trick simply can
lead to propositions.

Imagine a really bad technique adds "gigolo" to the list of
proposition when the user entered "gogle" (looking for "google"),
well... The "one letter typo" check will have found "google".
So propositions will contain ?????, ?????, gigolo, ?????, google, ????

Then they get sorted by edit distance to the user's entered search
string: google is close to gogle (edit distance of 1) while gigolo is
not that close to gogle. It doesn't mean that "gigolo" isn't proposed
but simply that "google" is proposed first.
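
As a rough illustration of that ranking step, here is a minimal sketch using the classic Levenshtein edit distance. The class and method names (EditDistanceRanking, distance, rank) are made up for the example, and a real spellchecker might weight the operations differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranking step: sort propositions by edit distance to the typed string.
final class EditDistanceRanking {
    /** Levenshtein distance (insert, delete, substitute, each costing 1), computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j letters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i letters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Returns a copy of the propositions, closest to the typed string first. */
    static List<String> rank(final String typed, List<String> propositions) {
        List<String> sorted = new ArrayList<>(propositions);
        sorted.sort(Comparator.comparingInt((String p) -> distance(typed, p)));
        return sorted;
    }

    public static void main(String[] args) {
        // "google" is at distance 1 from "gogle", "gigolo" at distance 3.
        System.out.println(rank("gogle", Arrays.asList("gigolo", "google")));
    }
}
```

Because the sort is stable, propositions at the same distance keep the order in which the techniques added them.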

I'll re-write what I wrote in the previous post: the whole point is
not using a single technique, but a variety of techniques. Each one
augmenting the probability that the search gives back a meaningful
result.

And what to do if you want to look for this song: second on
http://en.wikipedia.org/wiki/Windowlicker#Track_listing ?? :p

Ah, Aphex Twin :)

Well, for this one I admit I'm not so sure... I'd add a "special case"
and I would have the only spellchecker/search engine in the world
able to deal with that case!

;)
 
Patricia Shanahan

Domagoj Klepac wrote:
....
Mostly, those equivalents are simply the same character with stripped
accent. Sometimes, accented character maps to several ASCII
characters. For example, Croatian "ž" maps to "z", but "Đ" maps to
"Dj". Or, in German, "ß" maps to "ss". But all vowels, which make most
of the accented characters, (I think) can be simply stripped to their
non-accented counterparts.
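
One way to sketch this mapping in Java is to handle the multi-letter cases with a small table and let Unicode decomposition strip the remaining accents. This assumes java.text.Normalizer is available (it was added in Java 6). The class name AccentStripper and the contents of the SPECIAL table are illustrative assumptions, not a complete list.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: strips accents and expands a few characters to several ASCII letters.
final class AccentStripper {
    // Characters without a useful canonical decomposition. Illustrative, not exhaustive.
    private static final Map<Character, String> SPECIAL = new HashMap<>();
    static {
        SPECIAL.put('\u00DF', "ss"); // German sharp s
        SPECIAL.put('\u0110', "Dj"); // Croatian capital D with stroke
        SPECIAL.put('\u0111', "dj"); // Croatian small d with stroke
        SPECIAL.put('\u00C6', "AE"); // capital AE ligature
        SPECIAL.put('\u00E6', "ae"); // small ae ligature
    }

    static String strip(String input) {
        StringBuilder expanded = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = SPECIAL.get(c);
            if (replacement != null) {
                expanded.append(replacement);
            } else {
                expanded.append(c);
            }
        }
        // NFD splits an accented letter into its base letter followed by combining marks.
        String decomposed = Normalizer.normalize(expanded, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category M), leaving the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Stra\u00DFe")); // word containing a sharp s
        System.out.println(strip("\u017Eaba"));   // word starting with z-caron
    }
}
```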

What about o-umlaut and oe? As in "Goedel".

Google can deal with this - I tried a search for "Goedel" and got a
page,
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that
spells the name with o-umlaut.
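
How Google does this is not known from the thread. One plausible approach, sketched below purely as an assumption, is to index each word under several keys, so that "Godel", "Goedel" and the o-umlaut spelling all meet on at least one of them. The class name UmlautVariants and the method searchKeys are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical indexing helper: one word yields both a "stripped" and an "expanded" key.
final class UmlautVariants {
    static Set<String> searchKeys(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        StringBuilder stripped = new StringBuilder();
        StringBuilder expanded = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                case '\u00E4': // a-umlaut
                    stripped.append('a');
                    expanded.append("ae");
                    break;
                case '\u00F6': // o-umlaut
                    stripped.append('o');
                    expanded.append("oe");
                    break;
                case '\u00FC': // u-umlaut
                    stripped.append('u');
                    expanded.append("ue");
                    break;
                case '\u00DF': // sharp s
                    stripped.append("ss");
                    expanded.append("ss");
                    break;
                default:
                    stripped.append(c);
                    expanded.append(c);
            }
        }
        Set<String> keys = new LinkedHashSet<>();
        keys.add(stripped.toString());
        keys.add(expanded.toString());
        return keys;
    }

    public static void main(String[] args) {
        // The umlaut spelling is indexed under both "godel" and "goedel".
        System.out.println(searchKeys("G\u00F6del"));
        System.out.println(searchKeys("Goedel"));
    }
}
```

A query for "Goedel" produces the key "goedel", which is one of the keys stored for the umlaut spelling, so the two match without any per-word special casing.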

Patricia
 
Hendrik Maryns

(e-mail address removed) wrote:
Hi Hendrik!

Hendrik Maryns wrote:
...

I know... Which is why I said that the idea of a spellchecker is not
to use a single technique but a variety of techniques.

You've got a "result List<SortableString>", each technique adds
zero or more "propositions" to this List.

- You first try all correct matches
- You then try all "one letter typos" (one wrong letter, one letter
not entered, one letter typed too much)
- You then try all "two letters typos" (still easily doable)
- You then try all close "no vowels" words
- etc.

<snip>

I agree that you’re on the right track here...
Ah, Aphex Twin :)

Well, for this one I admit I'm not so sure... I'd add a "special case"
and I would have the only spellchecker/search engine in the world
able to deal with that case!

It is commonly known as (A complex mathematical equation), with or
without brackets, or simply [mathematical equation]. The whole song is
a joke, though, it is just a picture of his face converted with some
pic-to-audio program. Nevertheless, I like it.

H.
--
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
Luc The Perverse

Alex Buell said:
Microsoft can afford to. They just aren't interested.

Microsoft is too "closed room development". If they had allowed people to
make their own translations and then paid someone a small fee to "check it"
people would have jumped at the opportunity.

But they are not completely immune to suggestions.

I made a suggestion for a modification to the MSDN library which literally
would have turned a 5 hour job into a 5 minute job. I didn't know what I
was looking for (a matter of semantics), and a simple "See Also" link at the
bottom would have saved me much headache. About 17 months after I
"emailed" (it was a web form) them I received an email back saying that they
had accepted my suggestion and it would be included in the next version of
MSDN.
 
Roedy Green

Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I know that Java understands these characters because when I make a string
lowercase it will convert the capital AE to a lowercase ae.

Boyer Moore has a version that searches for multiple strings at once.

You could do it with a translate table of bytes. You index by char and
get a classifying byte. You use the byte to index a delegate method
or switch method to deal with the class.

e.g.
0 = leave alone, append char to StringBuilder.

1 = check if next letter is e or o and convert to ligature ae or ao,
arrange so next letter is ignored.

2 = check if next letter is s, and convert to Eszett


Now you just loop through the string once, classifying each char and
calling the corresponding method.
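
A minimal sketch of such a classifying table, using a switch in place of delegate methods. The class name, the constant names, and the exact letter-to-class assignments are assumptions: here class 1 is read as "a or o followed by e becomes the corresponding ligature", which is one interpretation of the description above, and only lowercase input is handled.

```java
// Hypothetical single-pass converter driven by a table of classifying bytes.
final class ClassifyingTable {
    private static final byte LEAVE_ALONE = 0;
    private static final byte MAYBE_E_LIGATURE = 1;
    private static final byte MAYBE_ESZETT = 2;

    // One classifying byte per char value; entries default to LEAVE_ALONE (0).
    private static final byte[] CLASS_OF = new byte[Character.MAX_VALUE + 1];
    static {
        CLASS_OF['a'] = MAYBE_E_LIGATURE;
        CLASS_OF['o'] = MAYBE_E_LIGATURE;
        CLASS_OF['s'] = MAYBE_ESZETT;
    }

    static String convert(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // '\0' stands for "no next letter" at the end of the string.
            char next = i + 1 < input.length() ? input.charAt(i + 1) : '\0';
            switch (CLASS_OF[c]) {
                case MAYBE_E_LIGATURE:
                    if (next == 'e') {
                        // U+00E6 is the ae ligature, U+0153 the oe ligature.
                        out.append(c == 'a' ? '\u00E6' : '\u0153');
                        i++; // the 'e' has been consumed
                    } else {
                        out.append(c);
                    }
                    break;
                case MAYBE_ESZETT:
                    if (next == 's') {
                        out.append('\u00DF'); // Eszett
                        i++; // the second 's' has been consumed
                    } else {
                        out.append(c);
                    }
                    break;
                default:
                    out.append(c); // class 0: leave alone
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("strasse"));
        System.out.println(convert("aesthetic"));
    }
}
```

The table costs 64 KB but makes classification a single array lookup per character. The same loop could run in the opposite direction, with accented characters as the classified ones and plain ASCII as the output.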
 
Roedy Green

What about o-umlaut and oe? As in "Goedel".

Google can deal with this - I tried a search for "Goedel" and got a
page,
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that
spells the name with o-umlaut.

I have noticed that too, also finding variant spellings. I have not
yet experimented to see if it is Google collapsing or the existence of
multiple variants buried in the text, perhaps in the keywords.

On most typos though Google asks if I really meant something else, but
on a popular "variant" spelling it does not.

I suspect I would have better luck looking up "seperater" than
"separator" even though it is the wrong spelling.
 
Roedy Green

About 17 months after I
"emailed" (it was a web form) them I received an email back saying that they
had accepted my suggestion and it would be included in the next version of
MSDN.
I am unworthy.
 
Roedy Green

. . .

I can't tell - are you mocking me?

I am making a reference to pop culture. It is very unusual to get a
major company to actually implement an outside idea. It is quite a
feather in your cap.
 
Luc The Perverse

Roedy Green said:
I am making a reference to pop culture. It is very unusual to get a
major company to actually implement an outside idea. It is quite a
feather in your cap.

Well thank you.

Now if I could just get Taco Time to bring back (permanently) the Smokey
Southwest Chicken Burrito - the finest fast food item that has ever existed.
 
Mickey Segal

Roedy Green said:
I am making a reference to pop culture. It is very unusual to get a
major company to actually implement an outside idea. It is quite a
feather in your cap.

In my experience, Microsoft has implemented things I've suggested a
surprising number of times. In some cases I know it was not just an obvious
idea that lots of other people had suggested already.

I never figured out how much of this success was from asking for the
right things, reaching the right people or the similarity of my name to that
of a Seattle talk show host.
 
Roedy Green

In my experience, Microsoft has implemented things I've suggested a
surprising number of times. In some cases I know it was not just an obvious
idea that lots of other people had suggested already.

I wrote Walt Disney circa 1975 to explain how you could use computers
to automate the production of cartoons (an extremely labour intensive
process). They wrote back an unusually rude letter saying they weren't
interested in any ideas except those generated by Disney employees.

Around the same time, I wrote Polaroid a letter about how you might
incorporate image enhancement eventually in a home camera. They were
much more polite. They sent back an inch thick contract to protect
them from me later asking them for money. It was such a scary thing I
decided to let the matter drop.

Sun wrote me and asked me to serve on some sort of advisory board. I
simply have not had the energy to deal with it. That would probably
give me a lot more leverage than simply posting ideas.
 
Luc The Perverse

Roedy Green said:
I wrote Walt Disney circa 1975 to explain how you could use computers
to automate the production of cartoons (an extremely labour intensive
process). They wrote back an unusually rude letter saying they weren't
interested in any ideas except those generated by Disney employees.

OMG. Well they sure showed you. See how the cartoon industry is thriving
without computers.
Around the same time, I wrote Polaroid a letter about how you might
incorporate image enhancement eventually in a home camera. They were
much more polite. They sent back an inch thick contract to protect
them from me later asking them for money. It was such a scary thing I
decided to let the matter drop.
LOL

Sun wrote me and asked me to serve on some sort of advisory board. I
simply have not had the energy to deal with it. That would probably
give me a lot more leverage than simply posting ideas.

You should accept that one - assuming it doesn't require something which you
are physically incapable of.
 
Mickey Segal

Roedy Green said:
Around the same time, I wrote Polaroid a letter about how you might
incorporate image enhancement eventually in a home camera. They were
much more polite. They sent back an inch thick contract to protect
them from me later asking them for money. It was such a scary thing I
decided to let the matter drop.

It may be no coincidence that Polaroid is gone. There have, though, been
cases in which people who made suggestions have sued companies for a share
and won, so their concerns were not totally irrational, but there are less
bureaucratic ways of dealing with it.
Sun wrote me and asked me to serve on some sort of advisory board. I
simply have not had the energy to deal with it. That would probably
give me a lot more leverage than simply posting ideas.

I've had good experience with Sun too, but that may have something to do
with the seniority of the person I've contacted.
 
Dave Glasser

Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I know that Java understands these characters because when I make a string
lowercase it will convert the capital AE to a lowercase ae.

All I could find was a reference to an obscure unimplemented API. I would
go for a publicly available function though - anything has got to be better
than the way I am doing it. (If worse comes to worse I will copy my ANSI
tables and make some "range" conversions)

I realize I'm a latecomer to this thread, but has anyone suggested
using a java.text.Collator? You could probably use one to build your
lookup table (that maps accented characters to their non-accented
equivalents) one time and store it in memory. At least that way it
would deal with different character sets and locales without changing
the code. I don't know how you would handle the combination characters
with a Collator, though, without knowing in advance which ones a
character set contains.
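
A rough sketch of that idea: at PRIMARY strength a java.text.Collator ignores accent and case differences, so each accented character can be compared against the plain letters a-z once and the matches cached. The class name CollatorLookup, the method buildTable, and the scanned character range are assumptions for illustration. Characters that collate like several letters (such as the German double S) simply get no entry, which is the limitation mentioned above.

```java
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical one-time builder for an accented-to-plain lookup table.
final class CollatorLookup {
    static Map<Character, Character> buildTable(Locale locale, char first, char last) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore accent and case differences
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        Map<Character, Character> table = new HashMap<>();
        for (int code = first; code <= last; code++) {
            char c = (char) code;
            if (!Character.isLetter(c)) {
                continue;
            }
            for (char base = 'a'; base <= 'z'; base++) {
                if (collator.compare(String.valueOf(c), String.valueOf(base)) == 0) {
                    // Keep the case of the original character.
                    table.put(c, Character.isUpperCase(c) ? Character.toUpperCase(base) : base);
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Scan Latin-1 Supplement and Latin Extended-A (U+00C0 to U+017F).
        Map<Character, Character> table = buildTable(Locale.ENGLISH, '\u00C0', '\u017F');
        System.out.println(table.get('\u00E9')); // entry for e-acute
    }
}
```

The result depends on the locale passed in: a locale whose collation rules treat an accented letter as a separate letter of the alphabet will leave it unmapped.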


--
Check out QueryForm, a free, open source, Java/Swing-based
front end for relational databases.

http://qform.sourceforge.net

If you're a musician, check out RPitch Relative Pitch
Ear Training Software.

http://rpitch.sourceforge.net
 
Luc The Perverse

Dave Glasser said:
I realize I'm a latecomer to this thread, but has anyone suggested
using a java.text.Collator? You could probably use one to build your
lookup table (that maps accented characters to their non-accented
equivalents) one time and store it in memory. At least that way it
would deal with different character sets and locales without changing
the code. I don't know how you would handle the combination characters
with a Collator, though, without knowing in advance which ones a
character set contains.

I looked it up on the Java glossary and while I don't claim to fully
understand it, it looks a little complicated for my needs.
 
Domagoj Klepac

I wrote Walt Disney circa 1975 to explain how you could use computers
to automate the production of cartoons (an extremely labour intensive
process). They wrote back an unusually rude letter saying they weren't
interested in any ideas except those generated by Disney employees.

I think that at that time Disney had an official view that cartoons
need to have a heart, which no machine can supply. They were also
employing a lot of animators and artists, and feared that with the
advent of computers, they would have to let most of them go. You
probably struck a nerve. :)

Domchi
 
Dale King

Mickey said:
It may be no coincidence that Polaroid is gone. There have, though, been
cases in which people who made suggestions have sued companies for a share
and won, so their concerns were not totally irrational, but there are less
bureaucratic ways of dealing with it.

Apple's policy for dealing with it is to reject any outside suggestions,
which unfortunately led to some negative publicity:

http://www.macnn.com/articles/06/04/17/apple.makes.girl.cry
 
