Re: [SOLVED] Least-lossy string.encode to us-ascii?

Discussion in 'Python' started by Vlastimil Brom, Sep 14, 2012.

  1. 2012/9/14 Tim Chase:
    > On 09/13/12 16:44, Vlastimil Brom wrote:
    >> >>> import unicodedata
    >> >>> unicodedata.normalize("NFD", u"serviço móvil").encode("ascii", "ignore").decode("ascii")
    >> u'servico movil'
    >
    > Works well for all the test-cases I threw at it. Thanks!
    >
    > -tkc


    Hi,
    I am glad it works, but I agree with the other comments that it
    would be preferable to keep the original accented text throughout
    the whole processing, if at all possible.
    The above works by decomposing the accented characters into "basic"
    characters and the bare accents (combining diacritics) using
    normalize(), and then simply stripping anything outside ASCII in
    encode("...", "ignore").
    This works for "combinable" accents, and most of the Portuguese
    characters outside of ASCII appear to fall into this category, but
    there are others as well.
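
    To make the decomposition step visible, here is a minimal sketch
    (written for Python 3; the helper name strip_combining_accents is
    only made up for this illustration). It lists the combining marks
    that NFD splits off before encode() drops them:

```python
import unicodedata


def strip_combining_accents(text):
    """Decompose with NFD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = u"serviço móvil"
decomposed = unicodedata.normalize("NFD", sample)

# After NFD each accented letter is a base letter followed by a
# combining mark; combining() is non-zero only for those marks.
marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]

print(marks)
print(strip_combining_accents(sample))
```

    The decomposed string is two characters longer than the input, one
    extra character for each accent that was split off.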
    E.g. according to
    http://tlt.its.psu.edu/suggestions/international/bylanguage/portuguese.html
    there are at least ºª«»€, which would be lost completely in such a conversion.
    ª (dec.: 170) (hex.: 0xaa) # FEMININE ORDINAL INDICATOR
    º (dec.: 186) (hex.: 0xba) # MASCULINE ORDINAL INDICATOR
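
    A quick check of that loss, as a sketch for Python 3 (nfd_ascii is
    just an illustrative name for the one-liner quoted above). NFD
    applies canonical decompositions only, and none of these characters
    has one, so each is dropped by encode():

```python
import unicodedata


def nfd_ascii(text):
    """The NFD + encode("ascii", "ignore") conversion quoted above."""
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")


# Map each character's Unicode name to what survives the conversion.
survivors = {unicodedata.name(ch): nfd_ascii(ch) for ch in u"ºª«»€"}
for name, result in survivors.items():
    print(name, "->", repr(result))

# The two ordinal indicators carry only a *compatibility* decomposition,
# so NFKD (not NFD) would map them to plain letters. The quotation marks
# and the euro sign have no decomposition under either form.
nfkd_ordinals = unicodedata.normalize("NFKD", u"ºª")
print(repr(nfkd_ordinals))
```

    Whether NFKD is acceptable depends on the data, since it also
    rewrites other compatibility characters; the explicit replacements
    shown next keep that under your control.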

    You can preprocess such cases as appropriate before doing the
    conversion, e.g. just:

    >>> u"ºª«»€".replace(u"º", u"o").replace(u"ª", u"a").replace(u"«", u'"').replace(u"»", u'"').replace(u"€", u"EUR")
    u'oa""EUR'

    or use a more elegant function with a replacement list (possibly
    handling other cases as well).
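
    One possible shape for such a function, as a rough sketch for
    Python 3. The names REPLACEMENTS and to_ascii are hypothetical, and
    the table only holds the five characters mentioned above; extend it
    for whatever else turns up in your data:

```python
import unicodedata

# Hypothetical replacement table for characters that NFD cannot decompose.
REPLACEMENTS = {
    u"º": u"o",
    u"ª": u"a",
    u"«": u'"',
    u"»": u'"',
    u"€": u"EUR",
}


def to_ascii(text, replacements=REPLACEMENTS):
    """Apply explicit replacements first, then strip combining accents."""
    for source, target in replacements.items():
        text = text.replace(source, target)
    decomposed = unicodedata.normalize("NFD", text)
    # Anything still outside ASCII at this point is silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"«serviço móvil» 10€"))
```

    The replacements run before normalize() so that nothing in the table
    is lost to the "ignore" handler; characters missing from the table
    are still dropped without warning.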

    regards,
    vbr
    Vlastimil Brom, Sep 14, 2012