unicode issue

Discussion in 'Python' started by gentlestone, Sep 30, 2009.

  1. gentlestone

    gentlestone Guest

    Why doesn't this code work on Python 2.6? Or how else can I do this job?

    _MAP = {
        # LATIN
        u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
        u'Æ': 'AE', u'Ç':'C',
        u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I',
        u'Î': 'I',
        u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O',
        u'Õ': 'O', u'Ö':'O',
        u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U',
        u'Ű': 'U',
        u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a',
        u'ã': 'a', u'ä':'a',
        u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e',
        u'ë': 'e',
        u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n',
        u'ò': 'o', u'ó':'o',
        u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u',
        u'ú': 'u',
        u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
        # LATIN_SYMBOLS
        u'©':'(c)',
        # GREEK
        u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z',
        u'η':'h', u'θ':'8',
        u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3',
        u'ο':'o', u'π':'p',
        u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x',
        u'ψ':'ps', u'ω':'w',
        u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h',
        u'ώ':'w', u'ς':'s',
        u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
        u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z',
        u'Η':'H', u'Θ':'8',
        u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3',
        u'Ο':'O', u'Π':'P',
        u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X',
        u'Ψ':'PS', u'Ω':'W',
        u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H',
        u'Ώ':'W', u'Ϊ':'I', u'Ϋ':'Y',
        # TURKISH
        u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C',
        u'ü':'u', u'Ü':'U',
        u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G',
        # RUSSIAN
        u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e',
        u'ё':'yo', u'ж':'zh',
        u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m',
        u'н':'n', u'о':'o',
        u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f',
        u'х':'h', u'ц':'c',
        u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'',
        u'э':'e', u'ю':'yu', u'я':'ya',
        u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E',
        u'Ё':'Yo', u'Ж':'Zh',
        u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M',
        u'Н':'N', u'О':'O',
        u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F',
        u'Х':'H', u'Ц':'C',
        u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'',
        u'Э':'E', u'Ю':'Yu', u'Я':'Ya',
        # UKRAINIAN
        u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i',
        u'ї':'yi', u'ґ':'g',
        # CZECH
        u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s',
        u'ť':'t', u'ů':'u',
        u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R',
        u'Š':'S', u'Ť':'T', u'Ů':'U', u'Ž':'Z',
        # POLISH
        u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o',
        u'ś':'s', u'ź':'z',
        u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N',
        u'Ó':'o', u'Ś':'S',
        u'Ź':'Z', u'Ż':'Z',
        # LATVIAN
        u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k',
        u'ļ':'l', u'ņ':'n',
        u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E',
        u'Ģ':'G', u'Ī':'i',
        u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z'
    }

    def downcode(name):
        """
        >>> downcode(u"Žabovitá zmiešaná kaša")
        u'Zabovita zmiesana kasa'
        """
        for key, value in _MAP.iteritems():
            name = name.replace(key, value)
        return name
    gentlestone, Sep 30, 2009
    #1
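For the "how can I do this job?" half of the question, a minimal sketch of a different technique than the lookup table in the post: Unicode decomposition with the standard `unicodedata` module. It is written in Python 3 syntax (the thread uses Python 2.6), and `strip_accents` is a hypothetical name. It only handles letters that decompose into a base letter plus combining marks, so Greek or Cyrillic transliteration and letters such as `ł` or `ß` would still need a table like `_MAP`.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Žabovitá zmiešaná kaša"))  # Zabovita zmiesana kasa
```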

  2. gentlestone

    Andre Engels Guest

    On Wed, Sep 30, 2009 at 9:34 AM, gentlestone <> wrote:
    > Why doesn't this code work on Python 2.6? Or how else can I do this job?

    Please be more specific than "it doesn't work":
    * What exactly are you doing
    * What were you expecting the result of that to be
    * What is the actual result?

    --
    André Engels,
    Andre Engels, Sep 30, 2009
    #2

  3. gentlestone

    gentlestone Guest

    On 30. Sep., 09:41 h., Andre Engels <> wrote:
    > On Wed, Sep 30, 2009 at 9:34 AM, gentlestone <> wrote:
    > > Why doesn't this code work on Python 2.6? Or how else can I do this job?
    >
    > Please be more specific than "it doesn't work":
    > * What exactly are you doing
    > * What were you expecting the result of that to be
    > * What is the actual result?
    >
    > --
    > André Engels,


    * What exactly are you doing
    replace non-ASCII characters - see the doctest in the docstring

    * What were you expecting the result of that to be
    see the doctest in the docstring

    * What is the actual result?
    the actual result is the unchanged name
    gentlestone, Sep 30, 2009
    #3
  4. gentlestone

    Andre Engels Guest

    I get the feeling that the problem is with the Python interactive
    mode. It does not have full unicode support, so u"Žabovitá zmiešaná
    kaša" is changed to u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'. If you
    call your code from another program, it might work correctly.


    --
    André Engels,
    Andre Engels, Sep 30, 2009
    #4
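The escaped string quoted in the post above is consistent with one particular mechanism, sketched here in Python 3 syntax. The assumption (not stated anywhere in the thread) is that the console delivered the typed text as Windows code page 1250 bytes and that each byte was then taken as a code point of the same value, which is what latin-1 decoding does.

```python
text = "Žabovitá zmiešaná kaša"

# Assumption: the console hands over the typed characters as cp1250 bytes.
console_bytes = text.encode("cp1250")

# Reading each byte as the code point with the same number (latin-1)
# gives control characters instead of the intended letters.
misread = console_bytes.decode("latin-1")
print(ascii(misread))
```

Under that assumption `Ž` (byte 0x8E) and `š` (byte 0x9A) turn into `\x8e` and `\x9a`, which no key in `_MAP` matches, so `replace` leaves them alone.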
  5. gentlestone

    gentlestone Guest

    On 30. Sep., 10:35 h., Andre Engels <> wrote:
    > I get the feeling that the problem is with the Python interactive
    > mode. It does not have full unicode support, so u"Žabovitá zmiešaná
    > kaša" is changed to u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'. If you
    > call your code from another program, it might work correctly.
    >
    > --
    > André Engels,


    thx a lot

    I spent 2 days of my life because of this

    so doctests are unusable for non-English users in python - seems to be
    gentlestone, Sep 30, 2009
    #5
  6. gentlestone

    gentlestone Guest

    On 30. Sep., 10:43 h., gentlestone <> wrote:
    > On 30. Sep., 10:35 h., Andre Engels <> wrote:
    >
    > > I get the feeling that the problem is with the Python interactive
    > > mode. It does not have full unicode support, so u"Žabovitá zmiešaná
    > > kaša" is changed to u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'. If you
    > > call your code from another program, it might work correctly.
    > >
    > > --
    > > André Engels,
    >
    > thx a lot
    >
    > I spent 2 days of my life because of this
    >
    > so doctests are unusable for non-English users in python - seems to be


    yes, you are right, now it works:

    from django.template import defaultfilters  # assumed import; not shown in the post

    def slugify(name):
        """
        >>> slugify(u'\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a s.r.o')
        u'zabovita-zmiesana-kasa-sro'
        """
        for key, value in _MAP.iteritems():
            name = name.replace(key, value)
        return defaultfilters.slugify(name)
    gentlestone, Sep 30, 2009
    #6
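Typing the escaped form by hand is error-prone. A minimal sketch, in Python 3 syntax, of generating it with the standard `unicode_escape` codec; `ascii_literal` is a hypothetical helper name, not something from the thread.

```python
def ascii_literal(text):
    """Return text with every non-ASCII character written as a backslash escape."""
    return text.encode("unicode_escape").decode("ascii")

print(ascii_literal("Žabovitá zmiešaná kaša s.r.o"))
```

The printed string can be pasted between the quotes of a doctest example like the one in the post above. The codec also escapes backslashes and newlines, so it is best used on plain one-line test strings.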
  7. gentlestone

    Dave Angel Guest

    gentlestone wrote:
    > Why doesn't this code work on Python 2.6? Or how else can I do this job?
    >
    > _MAP = {
    > [snip: table identical to post #1]
    > }
    >
    > def downcode(name):
    >     """
    >     >>> downcode(u"Žabovitá zmiešaná kaša")
    >     u'Zabovita zmiesana kasa'
    >     """
    >     for key, value in _MAP.iteritems():
    >         name = name.replace(key, value)
    >     return name
    >
    >

    Works for me:

    rrr = downcode(u"Žabovitá zmiešaná kaša")
    print repr(rrr)
    print rrr

    prints out:

    u'Zabovita zmiesana kasa'
    Zabovita zmiesana kasa

    I did have to add an encoding declaration as line 2 of the file:

    #-*- coding: latin-1 -*-

    and I had to convince my editor (Komodo) to save the file in utf-8.

    DaveA
    Dave Angel, Sep 30, 2009
    #7
  8. gentlestone

    gentlestone Guest

    On 30. Sep., 11:45 h., Dave Angel <> wrote:
    > gentlestone wrote:
    > > Why doesn't this code work on Python 2.6? Or how else can I do this job?
    > > [snip: _MAP and downcode() as in post #1]
    >
    > Works for me:
    >
    > rrr = downcode(u"Žabovitá zmiešaná kaša")
    > print repr(rrr)
    > print rrr
    >
    > prints out:
    >
    > u'Zabovita zmiesana kasa'
    > Zabovita zmiesana kasa
    >
    > I did have to add an encoding declaration as line 2 of the file:
    >
    > #-*- coding: latin-1 -*-
    >
    > and I had to convince my editor (Komodo) to save the file in utf-8.
    >
    > DaveA


    great, thank you all, I changed utf-8 to latin-1 in the header and it
    works for me too

    how much time I could have saved by just asking in this forum
    gentlestone, Sep 30, 2009
    #8
  9. gentlestone

    saeed.gnu Guest

    I recommend using UTF-8 encoding (especially on GNU/Linux) and then
    writing this as the second line:
    #-*- coding: utf-8 -*-
    saeed.gnu, Sep 30, 2009
    #9
  10. gentlestone

    Mark Tolonen Guest

    "Dave Angel" <> wrote in message
    news:...
    > gentlestone wrote:
    >> Why doesn't this code work on Python 2.6? Or how else can I do this job?
    >> [snip: _MAP and downcode() as in post #1]
    >
    > Works for me:
    >
    > rrr = downcode(u"Žabovitá zmiešaná kaša")
    > print repr(rrr)
    > print rrr
    >
    > prints out:
    >
    > u'Zabovita zmiesana kasa'
    > Zabovita zmiesana kasa
    >
    > I did have to add an encoding declaration as line 2 of the file:
    >
    > #-*- coding: latin-1 -*-
    >
    > and I had to convince my editor (Komodo) to save the file in utf-8.


    Why declare latin-1 and save in utf-8? I'm not sure how you got that to work
    because those encodings aren't equivalent. I get:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "testit.py", line 1
    SyntaxError: encoding problem: utf-8

    In fact, some of the characters in the above code don't map to latin-1.

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0150' in position 309: ordinal not in range(256)
    >>> import unicodedata as ud
    >>> ud.name(u'\u0150')
    'LATIN CAPITAL LETTER O WITH DOUBLE ACUTE'

    -Mark
    Mark Tolonen, Sep 30, 2009
    #10
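The point that parts of the table cannot be stored as latin-1 can be checked directly. A minimal sketch in Python 3 syntax; `unencodable` and the sample string are hypothetical, chosen only to mix characters that latin-1 has with ones it lacks.

```python
import unicodedata

def unencodable(text, encoding):
    """Return the characters of text the codec cannot represent, in first-seen order."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

sample = "ÀÉß Őő Žš αβ"
for ch in unencodable(sample, "latin-1"):
    # Print the code point and official Unicode name of each problem character.
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
```

Running the same check with `"utf-8"` returns an empty list, because UTF-8 can encode every character in the table.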
  11. >>>>> Dave Angel <> (DA) wrote:

    >DA> Works for me:

    >DA> rrr = downcode(u"Žabovitá zmiešaná kaša")
    >DA> print repr(rrr)
    >DA> print rrr

    >DA> prints out:

    >DA> u'Zabovita zmiesana kasa'
    >DA> Zabovita zmiesana kasa

    >DA> I did have to add an encoding declaration as line 2 of the file:

    >DA> #-*- coding: latin-1 -*-

    >DA> and I had to convince my editor (Komodo) to save the file in utf-8.

    *Seems to work*.
    If you save in utf-8 the coding declaration also has to be utf-8.
    Besides, many of these characters aren't representable in latin-1.
    The reason it worked is that these characters were read as sequences
    of two or more bytes each, and replace did work with these. But it's
    dangerous, as they are then no longer the unicode characters they were
    intended to be.
    --
    Piet van Oostrum <>
    WWW: http://pietvanoostrum.com/
    PGP key: [8DAE142BE17999C4]
    Piet van Oostrum, Sep 30, 2009
    #11
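The "seems to work" effect described above can be reproduced in a few lines. This is a sketch in Python 3 syntax that simulates the mismatch rather than running a mis-declared file; `as_seen_with_wrong_declaration` is a hypothetical helper.

```python
def as_seen_with_wrong_declaration(text):
    """Simulate a literal saved as UTF-8 but read under a latin-1 declaration."""
    return text.encode("utf-8").decode("latin-1")

# Both the table key and the test input live in the same file,
# so both are misread in the same way.
key = as_seen_with_wrong_declaration("Ž")
name = as_seen_with_wrong_declaration("Žabovitá")

print(len("Ž"), len(key))  # the one intended character has become two
result = name.replace(key, "Z")
print(ascii(result))
```

The replacement still matches because key and input are garbled identically, but neither string contains `Ž` any more, and the unmapped `á` comes out as two stray characters.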
  12. gentlestone

    Dave Angel Guest

    Piet van Oostrum wrote:
    >>>>>> Dave Angel <> (DA) wrote:
    >> DA> Works for me:
    >>
    >> DA> rrr = downcode(u"Žabovitá zmiešaná kaša")
    >> DA> print repr(rrr)
    >> DA> print rrr
    >>
    >> DA> prints out:
    >>
    >> DA> u'Zabovita zmiesana kasa'
    >> DA> Zabovita zmiesana kasa
    >>
    >> DA> I did have to add an encoding declaration as line 2 of the file:
    >>
    >> DA> #-*- coding: latin-1 -*-
    >>
    >> DA> and I had to convince my editor (Komodo) to save the file in utf-8.
    >
    > *Seems to work*.
    > If you save in utf-8 the coding declaration also has to be utf-8.
    > Besides, many of these characters aren't representable in latin-1.
    > The reason it worked is that these characters were read as sequences
    > of two or more bytes each, and replace did work with these. But it's
    > dangerous, as they are then no longer the unicode characters they were
    > intended to be.
    >

    Thanks for the correction. What I meant by "works for me" is that the
    single example in the docstring translated okay. But I do have a lot to
    learn about using Unicode in sources, and I want to learn.

    So tell me, how were we supposed to guess what encoding the original
    message used? I originally had the mailing list message (in Thunderbird
    email). When I copied (copy/paste) to Komodo IDE (text editor), it
    wouldn't let me save because the file type was ASCII. So I randomly
    chose latin-1 for file type, and it seemed to like it.

    At that point I expected and got errors from Python because I had no
    coding declaration. I used latin-1, and still had problems, though I
    forget what they were. Only when I changed the file encoding type again,
    to utf-8, did the errors go away. I agree that they should agree, but I
    don't know how to reconcile the copy/paste boundary, the file type
    (without BOM, which is another variable), the coding declaration, and
    the stdout implicit ASCII encoding. I understand a bunch of it, but not
    enough to be able to safely walk through the choices.

    Is this all written up in one place, to where an experienced programmer
    can make sense of it? I've nibbled at the edges (even wrote a UTF-8
    encoder/decoder a dozen years ago).

    DaveA
    Dave Angel, Sep 30, 2009
    #12
  13. >>>>> Dave Angel <> (DA) wrote:
    [snip]
    >DA> Thanks for the correction. What I meant by "works for me" is that the
    >DA> single example in the docstring translated okay. But I do have a lot to
    >DA> learn about using Unicode in sources, and I want to learn.

    >DA> So tell me, how were we supposed to guess what encoding the original
    >DA> message used? I originally had the mailing list message (in Thunderbird
    >DA> email). When I copied (copy/paste) to Komodo IDE (text editor), it wouldn't
    >DA> let me save because the file type was ASCII. So I randomly chose latin-1
    >DA> for file type, and it seemed to like it.


    You can see the encoding of the message in its headers. But it is not
    important, as the Unicode characters you see are what matters. You
    just copy and paste them into your Python file. The Python file does not
    have to use the same encoding as the message from which you pasted. The
    editor will do the proper conversion. (If it doesn't, throw it away
    immediately.) Only for the Python file you must choose an encoding that
    can encode all the characters that are in the file. In this case utf-8
    is the only reasonable choice, but if there are only latin-1 characters
    in the file then of course latin-1 (iso-8859-1) will also be good.

    Any decent editor will only allow you to save in an encoding that can
    encode all the characters in the file, otherwise you will lose some
    characters.

    Because Python must also know which encoding you used and this is not in
    itself deducible from the file contents, you need the coding
    declaration. And it must be the same as the encoding in which the file
    is saved, otherwise Python will see something different than you saw in
    your editor. Sooner or later this will give you a big headache.

    >DA> At that point I expected and got errors from Python because I had no coding
    >DA> declaration. I used latin-1, and still had problems, though I forget what
    >DA> they were. Only when I changed the file encoding type again, to utf-8, did
    >DA> the errors go away. I agree that they should agree, but I don't know how to
    >DA> reconcile the copy/paste boundary, the file type (without BOM, which is
    >DA> another variable), the coding declaration, and the stdout implicit ASCII
    >DA> encoding. I understand a bunch of it, but not enough to be able to safely
    >DA> walk through the choices.


    >DA> Is this all written up in one place, to where an experienced programmer can
    >DA> make sense of it? I've nibbled at the edges (even wrote a UTF-8
    >DA> encoder/decoder a dozen years ago).


    I don't know a place. Usually utf-8 is a safe bet but in some cases can
    be overkill. And then in your Python input/output (read/write) you may
    have to use a different encoding if the programs that you have to
    communicate with expect something different.
    --
    Piet van Oostrum <>
    WWW: http://pietvanoostrum.com/
    PGP key: [8DAE142BE17999C4]
    Piet van Oostrum, Sep 30, 2009
    #13
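The rule that the declaration must match the saved encoding can be illustrated with the standard library's own cookie reader. This sketch uses `tokenize.detect_encoding`, which is the Python 3 API (the thread is about Python 2.6, which follows the same declaration rule but does not have this function). The two-line source is a made-up example.

```python
import io
import tokenize

# A source file saved as UTF-8 but declared as latin-1 (the mismatch from the thread).
saved = "#-*- coding: latin-1 -*-\nname = 'Žabovitá'\n".encode("utf-8")

# detect_encoding reads the declaration, not the actual bytes of the literals.
declared, _ = tokenize.detect_encoding(io.BytesIO(saved).readline)
print(declared)

as_python_reads_it = saved.decode(declared)
as_editor_saved_it = saved.decode("utf-8")
print(as_python_reads_it == as_editor_saved_it)
```

The two decoded texts differ, which is the "Python will see something different than you saw in your editor" situation described above.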
  14. gentlestone

    Dave Angel Guest

    Piet van Oostrum wrote:
    >>>>>> Dave Angel <> (DA) wrote:
    > [snip]
    >
    > You can see the encoding of the message in its headers. But it is not
    > important, as the Unicode characters you see are what matters. You
    > just copy and paste them into your Python file. The Python file does not
    > have to use the same encoding as the message from which you pasted. The
    > editor will do the proper conversion. (If it doesn't, throw it away
    > immediately.) Only for the Python file you must choose an encoding that
    > can encode all the characters that are in the file. In this case utf-8
    > is the only reasonable choice, but if there are only latin-1 characters
    > in the file then of course latin-1 (iso-8859-1) will also be good.
    >
    > Any decent editor will only allow you to save in an encoding that can
    > encode all the characters in the file, otherwise you will lose some
    > characters.
    >
    > Because Python must also know which encoding you used and this is not in
    > itself deducible from the file contents, you need the coding
    > declaration. And it must be the same as the encoding in which the file
    > is saved, otherwise Python will see something different than you saw in
    > your editor. Sooner or later this will give you a big headache.
    >
    > [snip]
    >
    > I don't know a place. Usually utf-8 is a safe bet but in some cases can
    > be overkill. And then in your Python input/output (read/write) you may
    > have to use a different encoding if the programs that you have to
    > communicate with expect something different.
    >
    >


    I know what I was missing. The copy/paste must be doing it in pure
    Unicode. And the in-memory version of the source text is in Unicode.
    So the text editor's encoding affects how that Unicode is encoded into 8
    bit bytes for the file (and how it will be reloaded next time). OK,
    that seems to make sense.

    I know that the clipboard has type tags, but I haven't looked at them in
    so long that I forget what they look like. For text, is it just ASCII
    and Unicode? Or are there other possible encodings that the source and
    sink negotiate?

    Thanks for the clear explanation.

    DaveA
    Dave Angel, Oct 1, 2009
    #14
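The distinction drawn above between the text in memory and the bytes in the file can be seen by encoding one string several ways. A minimal sketch in Python 3 syntax; the three codecs are arbitrary examples, not ones the thread prescribes.

```python
text = "Žabovitá kaša"

for encoding in ("utf-8", "cp1250", "utf-16-le"):
    data = text.encode(encoding)
    # Different bytes on disk, identical text once decoded with the same codec.
    assert data.decode(encoding) == text
    print(encoding, len(data), data[:4])
```

The byte counts differ per codec while the decoded text is always the same, which is why only the file encoding and the matching declaration matter, not the encoding of the message the text was copied from.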
  15. gentlestone

    gentlestone Guest

    >save in utf-8 the coding declaration also has to be utf-8

    ok, I understand, but what's the problem? Unfortunately it seems that
    the Python interactive mode doesn't have unicode support. It recognizes
    the latin-1 encoding only.

    So I have 2 options for how to write the doctest:
    1. Replace native characters with their escaped representation like
    u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
    zmiešaná kaša"
    2. Use a latin-1 coding declaration while the file is saved in utf-8

    The first is bad because doctest is a great documentation tool and it
    is probably the main reason I use python. And something like
    u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
    documentation style. But the tests work.

    The second is bad because the declaration is incorrect, and if I use
    it in a Django model declaration, for example, I get bad data in the
    application.

    So what is the solution? Back to Java? :)
    gentlestone, Oct 1, 2009
    #15
  16. Dave Angel

    Dave Angel Guest

    gentlestone wrote:
    >> save in utf-8 the coding declaration also has to be utf-8
    >>

    >
    > OK, I understand, but what's the problem? Unfortunately, it seems that
    > the Python interactive mode doesn't have Unicode support. It recognizes
    > only the latin-1 encoding.
    >
    > So I have 2 options for writing the doctest:
    > 1. Replace the native characters with their escaped representation, like
    > u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
    > zmiešaná kaša"
    > 2. Declare the latin-1 encoding while the file is saved in utf-8
    >
    > The first is bad because doctest is a great documentation tool and it
    > is probably the main reason I use Python. And something like
    > u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
    > documentation style. But the tests work.
    >
    > The second is bad because the declaration is incorrect, and if I use
    > it in a Django model declaration, for example, I get bad data in the
    > application.
    >
    > So what is the solution? Back to Java? :)
    >
    >

    Wait -- don't give up yet. Since I'm one of the ones who (partially)
    steered you wrong, let me try to help.

    Key variable here is how your text editor behaves. Since I've never
    taken my (programming) text editor out of ASCII mode before this week,
    it took some experimenting (and more importantly a message from Piet on
    this thread) to make sense of things. I think I now know how to make my
    own editor (Komodo IDE) behave in this environment, and you probably can
    do as well or better. In fact, judging from your messages, you probably
    are doing much better on the editor front.

    When I tried this morning to re-open that test file from yesterday, many
    of the characters were all messed up. I was okay as long as the project
    was still open, but not today. The editor itself apparently looks to
    that encoding declaration when it's deciding how to interpret the bytes
    on disk.

    So I did the following, using Komodo IDE. I created a new file in the
    project. Before saving it, I used
    Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8.
    *NOW* I pasted the stuff from your email message. And added the
    #-*- coding: utf-8 -*-

    as the second line of the file. Notice it's *NOT* latin-1.

    At this point I save and run the file, and it seems to work fine.

    My guess is that I could set these as default settings in Komodo, if I
    were doing UTF-8 very often, and it would become painless. I know I
    have certain stuff in my python template, and could add that encoding
    line as well.


    Anyway, that gets us to the step of running the doctest. The trick here
    seems to be that we need to define the docstring as a Unicode docstring
    to have it interpreted correctly. Try adding the u in front of the
    triple quote as follows:

    def downcode(name):
        u"""
        >>> downcode(u"Žabovitá zmiešaná kaša")
        u'Zabovita zmiesana kasa'
        """
        for key, value in _MAP.iteritems():
            name = name.replace(key, value)
        return name

    Now, if the doctest passes, we seem to be in good shape.
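
For comparison, a minimal sketch of the same doctest under Python 3 (an assumption; the thread targets Python 2.6). There every string literal is already Unicode, so neither the docstring nor the expected output needs a u prefix. `_SMALL_MAP` and `run_docstring_test` are hypothetical names, and the map is only a three-entry stand-in for the full `_MAP`.

```python
import doctest

# Hypothetical three-entry subset of the thread's _MAP, enough for the example.
_SMALL_MAP = {"Ž": "Z", "á": "a", "š": "s"}

def downcode(name):
    """
    >>> downcode("Žabovitá zmiešaná kaša")
    'Zabovita zmiesana kasa'
    """
    return "".join(_SMALL_MAP.get(c, c) for c in name)

def run_docstring_test():
    # Parse the docstring directly so the check does not rely on module lookup.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(downcode.__doc__, {"downcode": downcode},
                              "downcode", "<sketch>", 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)   # TestResults(failed, attempted)

results = run_docstring_test()
print(results.failed, results.attempted)
```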

    There's another problem, that hopefully somebody else can help with.
    That's if doctest needs to report an error. When I deliberately changed
    the "expect" string I get an error like the following.

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in
    position 150: ordinal not in range(128)

    I get a similar error if running the -v option on doctest. (Note that
    I do *NOT* get the error when running inside Komodo. And what I've read
    implies that the same would be true if running inside IDLE.) The
    problem is similar to the one you'd have doing a simple:

    print u"\u017d"

    I think these are avoided if sys.stdout.encoding (and maybe
    sys.stderr.encoding) are set to utf-8. On my system they're set to
    None, which says to use "the system default encoding." On my system
    that would be ASCII, so I get the error. But perhaps yours is already
    something better.

    I found links:
    http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/
    http://wiki.python.org/moin/PrintFails

    http://lists.macromates.com/textmate/2008-June/025735.html
    which indicate you may want to try:

    set LC_CTYPE=en_GB.utf-8 python

    at the command prompt before running python. This could be system specific; it didn't work for me on XP.

    The workaround that works for me (so far) is:

    if __name__ == "__main__":
        import sys, codecs
        sys.stdout = codecs.getwriter('utf8')(sys.stdout)

        print u"Žabovitá zmiešaná kaša"
        import doctest
        doctest.testmod()

    The codecs line tells python that stdout should use utf-8. That doesn't make the characters look good on my console, but at least it avoids the errors. I'm guessing that on my system I should use latin1 here instead of utf8. But I don't want to confuse things.
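
The effect of the codecs wrapper can be seen without touching the real console. A minimal sketch, assuming Python 3 and using an in-memory `io.BytesIO` as a stand-in for a byte-oriented stdout; `write_encoded` is a made-up helper name, not part of any library.

```python
import codecs
import io

def write_encoded(text, encoding):
    """Write text through a codecs writer and return the bytes produced."""
    raw = io.BytesIO()                        # stands in for a byte stream like stdout
    writer = codecs.getwriter(encoding)(raw)  # same wrapping pattern as in the post
    writer.write(text)
    return raw.getvalue()

# With a utf8 writer the non-ASCII characters become multi-byte sequences.
utf8_bytes = write_encoded("Žabovitá", "utf8")

# With an ascii writer the same write fails, as in the error shown above.
try:
    write_encoded("Žabovitá", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes)
print(ascii_failed)
```

Whether the resulting bytes then display correctly still depends on what the console itself expects, which matches the observation that the errors go away but the characters may not look right.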


    HTH

    DaveA
    Dave Angel, Oct 1, 2009
    #16
  17. Hyuga

    Hyuga Guest

    On Sep 30, 3:34 am, gentlestone <> wrote:
    > Why don't work this code on Python 2.6? Or how can I do this job?
    >
    > _MAP = {
    >     # LATIN
    >     u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
    >     u'Æ': 'AE', u'Ç':'C',
    >     u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I',
    >     u'Î': 'I',
    >     u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O',
    >     u'Õ': 'O', u'Ö':'O',
    >     u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U',
    >     u'Ű': 'U',
    >     u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a',
    >     u'ã': 'a', u'ä':'a',
    >     u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e',
    >     u'ë': 'e',
    >     u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n',
    >     u'ò': 'o', u'ó':'o',
    >     u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u',
    >     u'ú': 'u',
    >     u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
    >     # LATIN_SYMBOLS
    >     u'©':'(c)',
    >     # GREEK
    >     u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z',
    >     u'η':'h', u'θ':'8',
    >     u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3',
    >     u'ο':'o', u'π':'p',
    >     u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x',
    >     u'ψ':'ps', u'ω':'w',
    >     u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h',
    >     u'ώ':'w', u'ς':'s',
    >     u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
    >     u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z',
    >     u'Η':'H', u'Θ':'8',
    >     u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3',
    >     u'Ο':'O', u'Π':'P',
    >     u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X',
    >     u'Ψ':'PS', u'Ω':'W',
    >     u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H',
    >     u'Ώ':'W', u'Ϊ':'I', u'Ϋ':'Y',
    >     # TURKISH
    >     u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C',
    >     u'ü':'u', u'Ü':'U',
    >     u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G',
    >     # RUSSIAN
    >     u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e',
    >     u'ё':'yo', u'ж':'zh',
    >     u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m',
    >     u'н':'n', u'о':'o',
    >     u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f',
    >     u'х':'h', u'ц':'c',
    >     u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'',
    >     u'э':'e', u'ю':'yu', u'я':'ya',
    >     u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E',
    >     u'Ё':'Yo', u'Ж':'Zh',
    >     u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M',
    >     u'Н':'N', u'О':'O',
    >     u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F',
    >     u'Х':'H', u'Ц':'C',
    >     u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'',
    >     u'Э':'E', u'Ю':'Yu', u'Я':'Ya',
    >     # UKRAINIAN
    >     u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i',
    >     u'ї':'yi', u'ґ':'g',
    >     # CZECH
    >     u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s',
    >     u'ť':'t', u'ů':'u',
    >     u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R',
    >     u'Š':'S', u'Ť':'T', u'Ů':'U', u'Ž':'Z',
    >     # POLISH
    >     u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o',
    >     u'ś':'s', u'ź':'z',
    >     u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N',
    >     u'Ó':'o', u'Ś':'S',
    >     u'Ź':'Z', u'Ż':'Z',
    >     # LATVIAN
    >     u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k',
    >     u'ļ':'l', u'ņ':'n',
    >     u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E',
    >     u'Ģ':'G', u'Ī':'i',
    >     u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z'
    >
    > }
    >
    > def downcode(name):
    >     """
    >     >>> downcode(u"Žabovitá zmiešaná kaša")
    >     u'Zabovita zmiesana kasa'
    >     """
    >     for key, value in _MAP.iteritems():
    >         name = name.replace(key, value)
    >     return name


    Though CPython is pretty optimized under the hood for this sort of
    single-character replacement, this still seems pretty inefficient
    since you're calling replace for every character you want to map. I
    think that a better approach might be something like:

    def downcode(name):
        return ''.join(_MAP.get(c, c) for c in name)

    Or using string.translate:

    import string
    def downcode(name):
        table = string.maketrans(
            'ÀÁÂÃÄÅ...',
            'AAAAAA...')
        return name.translate(table)
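
One caveat on the translate route: in Python 2, string.maketrans builds a 256-character table meant for 8-bit strings, while translate on Unicode text expects a mapping keyed by code point. A hedged sketch of the same idea in Python 3 (an assumption; the thread uses 2.6), where `str.maketrans` accepts a dict with single-character keys and replacement strings of any length. `_SMALL_MAP` is an illustrative subset, not the full `_MAP`.

```python
# Illustrative subset of the thread's _MAP; keys must be single characters.
_SMALL_MAP = {
    "À": "A", "Á": "A", "Æ": "AE", "ß": "ss",
    "Ž": "Z", "á": "a", "š": "s",
}

# Build the code-point table once, outside the function.
_TABLE = str.maketrans(_SMALL_MAP)

def downcode(name):
    # One pass over the string; unmapped characters are left unchanged.
    return name.translate(_TABLE)

print(downcode("Žabovitá zmiešaná kaša"))
print(downcode("Æther straße"))
```

Building the table once keeps the per-call cost to a single pass over the input, which is the point of the suggestion above.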
    Hyuga, Oct 1, 2009
    #17
  18. On 01.10.09 16:09, Hyuga wrote:
    > On Sep 30, 3:34 am, gentlestone <> wrote:
    >> Why don't work this code on Python 2.6? Or how can I do this job?
    >>
    >> [snip _MAP]
    >>
    >> def downcode(name):
    >>     """
    >>     >>> downcode(u"Žabovitá zmiešaná kaša")
    >>     u'Zabovita zmiesana kasa'
    >>     """
    >>     for key, value in _MAP.iteritems():
    >>         name = name.replace(key, value)
    >>     return name
    >
    > Though CPython is pretty optimized under the hood for this sort of
    > single-character replacement, this still seems pretty inefficient
    > since you're calling replace for every character you want to map. I
    > think that a better approach might be something like:
    >
    > def downcode(name):
    >     return ''.join(_MAP.get(c, c) for c in name)
    >
    > Or using string.translate:
    >
    > import string
    > def downcode(name):
    >     table = string.maketrans(
    >         'ÀÁÂÃÄÅ...',
    >         'AAAAAA...')
    >     return name.translate(table)


    Or even simpler:

    import unicodedata

    def downcode(name):
        return unicodedata.normalize("NFD", name)\
            .encode("ascii", "ignore")\
            .decode("ascii")

    Servus,
    Walter
    Walter Dörwald, Oct 1, 2009
    #18
  19. On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald <>
    wrote:

    > On 01.10.09 16:09, Hyuga wrote:
    >> On Sep 30, 3:34 am, gentlestone <> wrote:
    >>> Why don't work this code on Python 2.6? Or how can I do this job?
    >>>
    >>> [snip _MAP]
    >>>
    >>> def downcode(name):
    >>>     """
    >>>     >>> downcode(u"Žabovitá zmiešaná kaša")
    >>>     u'Zabovita zmiesana kasa'
    >>>     """
    >>>     for key, value in _MAP.iteritems():
    >>>         name = name.replace(key, value)
    >>>     return name
    >>
    >> Though CPython is pretty optimized under the hood for this sort of
    >> single-character replacement, this still seems pretty inefficient
    >> since you're calling replace for every character you want to map. I
    >> think that a better approach might be something like:
    >>
    >> def downcode(name):
    >>     return ''.join(_MAP.get(c, c) for c in name)
    >>
    >> Or using string.translate:
    >>
    >> import string
    >> def downcode(name):
    >>     table = string.maketrans(
    >>         'ÀÁÂÃÄÅ...',
    >>         'AAAAAA...')
    >>     return name.translate(table)
    >
    > Or even simpler:
    >
    > import unicodedata
    >
    > def downcode(name):
    >     return unicodedata.normalize("NFD", name)\
    >         .encode("ascii", "ignore")\
    >         .decode("ascii")
    >
    > Servus,
    > Walter


    As I understand it, the "ignore" argument to str.encode *removes* the
    characters that can't be encoded, rather than replacing them with an
    ASCII approximation. Is that correct? If so, wouldn't that rather defeat
    the purpose?

    --
    Rami Chowdhury
    "Never attribute to malice that which can be attributed to stupidity" --
    Hanlon's Razor
    408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)
    Rami Chowdhury, Oct 1, 2009
    #19
  20. Peter Otten

    Peter Otten Guest

    Rami Chowdhury wrote:

    > On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald <>
    > wrote:
    >
    >> On 01.10.09 16:09, Hyuga wrote:
    >>> On Sep 30, 3:34 am, gentlestone <> wrote:
    >>>> Why don't work this code on Python 2.6? Or how can I do this job?
    >>>>
    >>>> [snip _MAP]
    >>>>
    >>>> def downcode(name):
    >>>>     """
    >>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
    >>>>     u'Zabovita zmiesana kasa'
    >>>>     """
    >>>>     for key, value in _MAP.iteritems():
    >>>>         name = name.replace(key, value)
    >>>>     return name
    >>>
    >>> Though CPython is pretty optimized under the hood for this sort of
    >>> single-character replacement, this still seems pretty inefficient
    >>> since you're calling replace for every character you want to map. I
    >>> think that a better approach might be something like:
    >>>
    >>> def downcode(name):
    >>>     return ''.join(_MAP.get(c, c) for c in name)
    >>>
    >>> Or using string.translate:
    >>>
    >>> import string
    >>> def downcode(name):
    >>>     table = string.maketrans(
    >>>         'ÀÁÂÃÄÅ...',
    >>>         'AAAAAA...')
    >>>     return name.translate(table)
    >>
    >> Or even simpler:
    >>
    >> import unicodedata
    >>
    >> def downcode(name):
    >>     return unicodedata.normalize("NFD", name)\
    >>         .encode("ascii", "ignore")\
    >>         .decode("ascii")
    >>
    >> Servus,
    >> Walter

    >
    > As I understand it, the "ignore" argument to str.encode *removes* the
    > characters that can't be encoded, rather than replacing them with an
    > ASCII approximation. Is that correct? If so, wouldn't that rather defeat
    > the purpose?


    You didn't take the normalization step into consideration. Example:

    >>> import unicodedata
    >>> s = u"Ä"
    >>> unicodedata.normalize("NFD", s)
    u'A\u0308'
    >>> _.encode("ascii", "ignore")
    'A'
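
The same steps also show where the shortcut stops helping: a character with no canonical decomposition (for example Ł or ß, or any Greek or Cyrillic letter) has no ASCII base letter to fall back on, so "ignore" drops it entirely. A minimal sketch, assuming Python 3; `strip_accents` is a hypothetical name for the function proposed above.

```python
import unicodedata

def strip_accents(name):
    # NFD splits "Ä" into "A" + U+0308; the combining mark is then dropped
    # because it cannot be encoded as ASCII.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Žabovitá zmiešaná kaša"))  # accents removed, letters kept
print(strip_accents("Łódź"))                    # Ł has no decomposition and vanishes
print(strip_accents("straße"))                  # ß vanishes as well
```

So normalization covers accented Latin letters well, while multi-letter or cross-script replacements such as ß to ss still need an explicit table like `_MAP`.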
    Peter Otten, Oct 1, 2009
    #20