Diacritic-insensitive search

Discussion in 'Python' started by Olive, May 17, 2013.

  1. Olive

    Olive Guest

    One feature that seems to be missing in the re module (or any tools that I know for searching text) is "diacritic-insensitive search". I would like to have a match for something like this:

    re.match("franc", "français")

    in about the same way we can have a case-insensitive search:

    re.match("(?i)fran", "Français").

    Another related and more general problem (in the sense that it could easily be used to solve the first problem) would be to translate a string, removing any diacritical marks:

    nodiac("Français") -> "Francais"

    The algorithm to write such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of "a"s with something on them, e.g. "à, á, ã", etc., and this for every letter. Trying to make such a list by hand would inevitably lead to some symbols being forgotten (and would be tedious).

    Olive
    Olive, May 17, 2013
    #1

  2. On May 17, 2013, at 8:57 AM, Olive wrote:

    > The algorithm to write such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of "a"s with something on them, e.g. "à, á, ã", etc., and this for every letter. Trying to make such a list by hand would inevitably lead to some symbols being forgotten (and would be tedious).


    Perhaps of interest… Sean M. Burke Unidecode…

    There appear to be several python implementations, e.g.:

    https://pypi.python.org/pypi/Unidecode
    Petite Abeille, May 17, 2013
    #2

  3. Peter Otten

    Peter Otten Guest

    Olive wrote:

    > One feature that seems to be missing in the re module (or any tools that I
    > know for searching text) is "diacritic-insensitive search". I would like
    > to have a match for something like this:
    >
    > re.match("franc", "français")
    >
    > in about the same way we can have a case-insensitive search:
    >
    > re.match("(?i)fran", "Français").
    >
    > Another related and more general problem (in the sense that it could
    > easily be used to solve the first problem) would be to translate a string,
    > removing any diacritical marks:
    >
    > nodiac("Français") -> "Francais"
    >
    > The algorithm to write such a function is trivial, but there are a lot of
    > marks we can put on a letter. It would be necessary to have the list of
    > "a"s with something on them, e.g. "à, á, ã", etc., and this for every letter.
    > Trying to make such a list by hand would inevitably lead to some symbols
    > being forgotten (and would be tedious).


    [Python3.3]

    >>> import unicodedata
    >>> unicodedata.normalize("NFKD", "Français").encode("ascii", "ignore").decode()
    'Francais'
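The one-liner above drops every non-ASCII character that has no ASCII decomposition. A minimal sketch of the nodiac() function asked for in the question, assuming it is enough to remove only the combining marks and keep everything else, could look like this. The name nodiac comes from the question; the approach (canonical decomposition plus a filter on unicodedata.combining) is one option, not the only one.

```python
import unicodedata


def nodiac(text):
    """Return text with combining marks removed.

    Sketch only: characters are decomposed (NFD), every code point with a
    non-zero canonical combining class is dropped, and the rest is
    recomposed (NFC).
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(nodiac("Français"))  # Francais
```

Letters such as "ß" or "ø" have no decomposition and pass through unchanged, which may or may not be what you want.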

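    import sys
    from collections import defaultdict
    from unicodedata import normalize

    # Group every code point by the first character of its NFKD decomposition.
    d = defaultdict(list)
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        n = normalize("NFKD", c)[0]
        if ord(n) < 128 and n.isalpha():  # optional: keep ASCII letters only
            d[n].append(c)

    # Print each base letter followed by all characters that decompose to it.
    for k, v in d.items():
        if len(v) > 1:
            print(k, "".join(v))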

    See also <http://effbot.org/zone/unicode-convert.htm>
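To get back to the re.match() example from the question: one possible approach is to strip the marks from both the pattern and the text before matching. The sketch below assumes that is acceptable for your data; strip_marks and match_nodiac are hypothetical helper names, not part of the re module.

```python
import re
import unicodedata


def strip_marks(text):
    """Decompose text (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_nodiac(pattern, text, flags=0):
    """Like re.match(), but applied to mark-stripped pattern and text.

    Note that the match object refers to the stripped text, so its
    offsets are not guaranteed to line up with the original string.
    """
    return re.match(strip_marks(pattern), strip_marks(text), flags)


m = match_nodiac("franc", "français")
print(m.group(0) if m else None)  # franc
```

The inline flag still works as usual, so match_nodiac("(?i)fran", "Français") combines case-insensitive and diacritic-insensitive matching.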

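    PS: Be warned that experiments on the console may be misleading. The decomposed string below still contains the combining cedilla (U+0327), whatever the console displays:

    >>> unicodedata.normalize("NFKD", "ç")
    'c'
    >>> ascii(_)
    "'c\\u0327'"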
    Peter Otten, May 17, 2013
    #3
  4. Olive

    Olive Guest

    Thanks a lot!
    Olive, May 17, 2013
    #4
  5. jmfauth

    jmfauth Guest



    The handling of diacriticals is an especially nice case
    study. One can use it to toy with some specific features of
    Unicode: normalisation, decomposition, ...
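As a small illustration of the decomposition and normalisation features mentioned above, the following sketch inspects what NFKD does to a single "ç" and checks that NFC puts it back together. It only uses the standard unicodedata module; the variable names are arbitrary.

```python
import unicodedata

composed = "\u00e7"  # "ç" stored as a single code point
decomposed = unicodedata.normalize("NFKD", composed)

# One code point before, two after: the base letter plus a combining mark.
print(len(composed), len(decomposed))  # 1 2
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER C', 'COMBINING CEDILLA']

# Recomposing with NFC gives back the original single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```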

    ... and also to show how Unicode can be badly implemented.

    First quick example that came to mind (Python 3.2.5 and Python 3.3.2):

    >>> import timeit
    >>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
    [2.929404406789672, 2.923327801150208, 2.923659417064755]

    >>> import timeit
    >>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
    [3.8437222586746884, 3.829490737203514, 3.819266963414293]

    jmf
    jmfauth, May 17, 2013
    #5
  6. Jorgen Grahn

    Jorgen Grahn Guest

    On Fri, 2013-05-17, Olive wrote:

    > One feature that seems to be missing in the re module (or any tools
    > that I know for searching text) is "diacritic-insensitive search". I
    > would like to have a match for something like this:


    > re.match("franc", "français")

    ....

    > The algorithm to write such a function is trivial, but there are a
    > lot of marks we can put on a letter. It would be necessary to have the
    > list of "a"s with something on them, e.g. "à, á, ã", etc., and this for
    > every letter. Trying to make such a list by hand would inevitably lead
    > to some symbols being forgotten (and would be tedious).


    OK, but please remember that diacriticals are of varying importance.
    The English "naïve" is easily recognizable when written as "naive".
    The Swedish word "får" cannot be spelled "far" and still be understood.
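If that distinction matters for your data, one option is to strip marks only from characters that are not on a language-specific keep list. This is a rough sketch under that assumption; nodiac_keep, its keep parameter and SWEDISH_LETTERS are hypothetical names, and which letters belong on the list is a per-language decision.

```python
import unicodedata

# Hypothetical keep list: letters treated as distinct in Swedish
# (å, ä, ö and their uppercase forms).
SWEDISH_LETTERS = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"


def nodiac_keep(text, keep=""):
    """Strip combining marks, except from characters listed in keep."""
    out = []
    # Normalise to NFC first so that kept letters are single code points.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(nodiac_keep("naïve får", keep=SWEDISH_LETTERS))  # naive får
```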

    This is IMHO out of the scope of re, and perhaps case-insensitivity
    should have been too. Perhaps it /would/ have been, if regular
    expressions hadn't come from the ASCII world where these things are
    easy.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, May 20, 2013
    #6