Diacritic-insensitive search

Olive · May 17, 2013

One feature that seems to be missing from the re module (or any other text-searching tool I know of) is "diacritic-insensitive search". I would like to get a match for something like this:

re.match("franc", "français")

in much the same way that we can have a case-insensitive search:

re.match("(?i)fran", "Français").

Another related and more general problem (in the sense that it could easily be used to solve the first problem) would be to translate a string by removing any diacritical marks:

nodiac("Français") -> "Francais"

The algorithm for such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of every "a" with something on it, i.e. "à, á, ã", etc., and likewise for every other letter. Making such a list by hand would be tedious and would inevitably leave some symbols out.

Olive

Petite Abeille · May 17, 2013

Olive said:
The algorithm for such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of every "a" with something on it, i.e. "à, á, ã", etc., and likewise for every other letter. Making such a list by hand would be tedious and would inevitably leave some symbols out.

Perhaps of interest… Sean M. Burke Unidecode…

There appear to be several Python implementations, e.g.:

https://pypi.python.org/pypi/Unidecode

Peter Otten · May 17, 2013

Olive said:
One feature that seems to be missing from the re module (or any other
text-searching tool I know of) is "diacritic-insensitive search". I would
like to get a match for something like this:

re.match("franc", "français")

in much the same way that we can have a case-insensitive search:

re.match("(?i)fran", "Français").

Another related and more general problem (in the sense that it could
easily be used to solve the first problem) would be to translate a string
by removing any diacritical marks:

nodiac("Français") -> "Francais"

The algorithm for such a function is trivial, but there are a lot of
marks we can put on a letter. It would be necessary to have the list of
every "a" with something on it, i.e. "à, á, ã", etc., and likewise for
every other letter. Making such a list by hand would be tedious and would
inevitably leave some symbols out.
[Python 3.3]

>>> import unicodedata
>>> unicodedata.normalize("NFKD", "Français").encode("ascii", "ignore").decode()
'Francais'
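
Note that encoding to ASCII with "ignore" also throws away characters that have no ASCII base at all (Greek, Cyrillic, and so on). Here is a minimal sketch of the nodiac helper from the question that removes only the combining marks instead. The names nodiac and search_nodiac are made up for illustration and are not part of any library, and the match positions refer to the stripped string, not the original one:

```python
import re
import unicodedata


def nodiac(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFKD splits "ç" into "c" followed by U+0327 COMBINING CEDILLA.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() is non-zero for combining marks, zero for base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_nodiac(pattern, text, flags=0):
    """Search a diacritic-free pattern in the stripped text (illustrative only)."""
    return re.search(pattern, nodiac(text), flags)


print(nodiac("Français"))                              # Francais
print(bool(search_nodiac("franc", "français")))        # True
print(bool(search_nodiac("(?i)FRANC", "Français")))    # True
```

This only works when the pattern itself is written without diacritics; a pattern containing accented letters would have to be stripped in the same way first.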

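import sys
from collections import defaultdict
from unicodedata import normalize

# Group every code point under the ASCII letter that starts its NFKD form.
d = defaultdict(list)
for i in range(sys.maxunicode):
    c = chr(i)
    n = normalize("NFKD", c)[0]
    if ord(n) < 128 and n.isalpha():  # optional: keep ASCII letters only
        d[n].append(c)

# Print each base letter followed by all of its variants.
for k, v in d.items():
    if len(v) > 1:
        print(k, "".join(v))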
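
A table like this could be turned into a regular expression that matches any variant of each letter, which is one possible way to get the re.match("franc", "français") behaviour from the question. The following is only a sketch: build_table and loose_pattern are hypothetical names, and the approach only covers precomposed characters (text stored in decomposed form would need to be normalized to NFC first).

```python
import re
import sys
from collections import defaultdict
from unicodedata import normalize


def build_table(limit=sys.maxunicode + 1):
    """Map each ASCII letter to the code points whose NFKD form starts with it."""
    table = defaultdict(list)
    for i in range(limit):
        c = chr(i)
        n = normalize("NFKD", c)[0]
        if ord(n) < 128 and n.isalpha():
            table[n].append(c)
    return table


def loose_pattern(word, table):
    """Build a regex in which every letter of word matches all of its variants."""
    parts = []
    for ch in word:
        variants = table.get(ch)
        if variants:
            parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


# The scan is limited here to keep the example fast; omit the argument
# to cover the whole Unicode range.
table = build_table(0x2000)
pattern = loose_pattern("franc", table)
print(re.match(pattern, "français").group())   # franç
print(re.match(pattern, "frank"))              # None
```

Unlike stripping the text before searching, this keeps the match offsets valid for the original string.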

See also <http://effbot.org/zone/unicode-convert.htm>

PS: Be warned that experiments on the console may be misleading:
"'c\\u0327'"

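To illustrate the warning about the console: a precomposed "ç" and its decomposed form usually render identically in a terminal, yet they are different strings. A small sketch using only the standard library (the variable names are illustrative):

```python
import unicodedata

composed = "\u00e7"                                   # "ç" as a single code point
decomposed = unicodedata.normalize("NFKD", composed)  # "c" + U+0327

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False
print(ascii(decomposed))                # 'c\u0327'
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER C', 'COMBINING CEDILLA']
```

Printing with ascii() or looking at the code point names is a more reliable check than trusting what the terminal draws.
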
Olive · May 17, 2013

Thanks a lot!

jmfauth · May 17, 2013

The handling of diacriticals is especially a nice case
study. One can use it to toy with some specific features of
Unicode, normalisation, decomposition, ...

.... and also to show how Unicode can be badly implemented.

First and quick example that came to my mind (Py325 and Py332):
[2.929404406789672, 2.923327801150208, 2.923659417064755]
[3.8437222586746884, 3.829490737203514, 3.819266963414293]

jmf

Jorgen Grahn · May 20, 2013

Olive said:
One feature that seems to be missing from the re module (or any other
text-searching tool I know of) is "diacritic-insensitive search". I
would like to get a match for something like this:

re.match("franc", "français") ....

The algorithm for such a function is trivial, but there are a lot of
marks we can put on a letter. It would be necessary to have the list of
every "a" with something on it, i.e. "à, á, ã", etc., and likewise for
every other letter. Making such a list by hand would be tedious and
would inevitably leave some symbols out.

OK, but please remember that diacritics vary in importance.
The English "naïve" is easily recognizable when written as "naive".
The Swedish word "får" cannot be spelled "far" and still be understood.
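
One way to respect that difference might be to strip marks everywhere except on a caller-supplied set of letters that count as letters in their own right. This is just a sketch: nodiac_except and its keep parameter are hypothetical, and the default set below is an assumption chosen for Swedish, not a complete list for any language.

```python
import unicodedata


def nodiac_except(text, keep="åäöÅÄÖ"):
    """Strip combining marks except on the characters listed in keep."""
    out = []
    # NFC first, so that protected letters appear as single code points.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


print(nodiac_except("naïve får"))            # naive får
print(nodiac_except("naïve får", keep=""))   # naive far
```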

This is IMHO out of the scope of re, and perhaps case-insensitivity
should have been too. Perhaps it /would/ have been, if regular
expressions hadn't come from the ASCII world where these things are
easy.

/Jorgen
