Least-lossy string.encode to us-ascii?

Discussion in 'Python' started by Tim Chase, Sep 13, 2012.

  1. Tim Chase

    Tim Chase Guest

    I've got a bunch of text in Portuguese and to transmit them, need to
    have them in us-ascii (7-bit). I'd like to keep as much information
    as possible, just stripping accents, cedillas, tildes, etc. So
    "serviço móvil" becomes "servico movil". Is there anything stock
    that I've missed? I can do mystring.encode('us-ascii', 'replace')
    but that doesn't keep as much information as I'd hope.
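
For reference, a small sketch of what the two stock error handlers do with the example phrase. The variable names `replaced` and `ignored` are only for illustration.

```python
# "serviço móvil", written with escapes so the accented letters are single code points
mystring = "servi\u00e7o m\u00f3vil"

# 'replace' substitutes "?" for every character that has no ASCII equivalent
replaced = mystring.encode("us-ascii", "replace")

# 'ignore' drops those characters entirely
ignored = mystring.encode("us-ascii", "ignore")

print(replaced)  # b'servi?o m?vil'
print(ignored)   # b'servio mvil'
```

Neither handler keeps the base letter, which is the information the question is trying to preserve.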

    -tkc
     
    Tim Chase, Sep 13, 2012
    #1

  2. On Thu, 13 Sep 2012 16:26:07 -0500, Tim Chase wrote:

    > I've got a bunch of text in Portuguese and to transmit them, need to
    > have them in us-ascii (7-bit).


    That could mean two things:

    1) "The receiver is incapable of dealing with Unicode in 2012, which is
    frankly appalling, but what can I do about it?"

    2) "The transport mechanism I use to transmit the data is only capable of
    dealing with 7-bit ASCII strings, which is sad but pretty much standard."

    In the case of 1), I suggest you look at the Unicode Hammer, a.k.a. "The
    Stupid American":

    http://code.activestate.com/recipes/251871

    and especially the very many useful comments.
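
One approach that often comes up for case 1) is to decompose each character with `unicodedata.normalize` and then drop whatever is not ASCII. This is a minimal sketch of that idea, not the code of the recipe linked above. The function name `strip_accents` is made up for illustration, and characters with no decomposition (such as ß or æ) are silently lost.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: best-effort reduction of text to 7-bit ASCII."""
    # NFKD splits "ç" into "c" + COMBINING CEDILLA, "ó" into "o" + COMBINING ACUTE ACCENT, etc.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so "ignore" drops them and keeps the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("serviço móvil"))  # servico movil
```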


    In the case of 2), just binhex or uuencode your data for transport.
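
For case 2) nothing needs to be lost: the text can be encoded to bytes and then wrapped in an ASCII-only encoding for transport. The sketch below uses Base64 from the standard library rather than binhex or uuencode, purely as an illustration of the same idea. It assumes the receiver is able to reverse both steps.

```python
import base64

text = "serviço móvil"

# Sender: text -> UTF-8 bytes -> Base64, which uses only 7-bit ASCII characters.
armored = base64.b64encode(text.encode("utf-8")).decode("ascii")

# Receiver: reverse both steps to get the original text back, accents included.
restored = base64.b64decode(armored).decode("utf-8")

# Check that the transported form is pure ASCII and that the round trip is exact.
armored.encode("ascii")
print(restored == text)  # True
```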



    --
    Steven
     
    Steven D'Aprano, Sep 14, 2012
    #2

  3. Tim Chase

    Guest

    On Thursday, 13 September 2012 23:25:27 UTC+2, Tim Chase wrote:
    > I've got a bunch of text in Portuguese and to transmit them, need to
    > have them in us-ascii (7-bit). I'd like to keep as much information
    > as possible, just stripping accents, cedillas, tildes, etc. So
    > "serviço móvil" becomes "servico movil". Is there anything stock
    > that I've missed? I can do mystring.encode('us-ascii', 'replace')
    > but that doesn't keep as much information as I'd hope.


    Interesting case. It's where the coding of characters
    meets character usage, scripts, typography, and linguistic
    features.

    I can't discuss the Portuguese case, but in French
    and in German one way to achieve the task is to
    convert the text to uppercase. The result is still
    correctly spelled text.

    >>> s = 'Lætitia cœur éléphant français LUŸ Stoß Erklärung stören'
    >>> libfrancais.SpecMajuscules(s)
    'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
    >>> r = 'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
    >>> r.encode('ascii', 'strict').decode('ascii', 'strict') == r
    True
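
`libfrancais` does not appear to be a publicly available module, so the session above cannot be rerun as is. A rough stand-in using only the standard library might look like the following. The name `ascii_upper` and the digraph table are assumptions for illustration, and the German-style rule of writing an umlaut as vowel + E is applied to every word, which will be wrong for some languages.

```python
import unicodedata

# Assumed expansions for letters that NFKD cannot reduce to plain ASCII,
# plus the German convention of writing umlauts as vowel + E.
DIGRAPHS = str.maketrans({"Æ": "AE", "Œ": "OE", "Ä": "AE", "Ö": "OE", "Ü": "UE"})

def ascii_upper(text):
    """Hypothetical helper: uppercase, expand digraphs, then strip remaining accents."""
    # Compose first so that "A" + combining diaeresis is seen as the single letter "Ä".
    composed = unicodedata.normalize("NFC", text)
    upper = composed.upper()  # str.upper() already turns "ß" into "SS"
    expanded = upper.translate(DIGRAPHS)
    # Decompose what is left and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return decomposed.encode("ascii", "ignore").decode("ascii")

s = "Lætitia cœur éléphant français LUŸ Stoß Erklärung stören"
print(ascii_upper(s))
# LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN
```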

    PS Avoid Py3.3 :)

    jmf
     
    , Sep 14, 2012
    #3
  5. Tim Chase

    Terry Reedy Guest

    On 9/14/2012 12:15 PM, wrote:

    > PS Avoid Py3.3 :)


    pps Start using 3.3 as soon as possible. It has Python's first fully
    portable non-buggy Unicode implementation. The second release candidate
    is already out.

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 14, 2012
    #5
  6. Tim Chase

    Guest

    On Friday, 14 September 2012 22:45:05 UTC+2, Terry Reedy wrote:
    > On 9/14/2012 12:15 PM, wrote:
    >
    > > PS Avoid Py3.3 :)
    >
    > pps Start using 3.3 as soon as possible. It has Python's first fully
    > portable non-buggy Unicode implementation. The second release candidate
    > is already out.
    >
    > --
    > Terry Jan Reedy


    - I will drop Python.
    - No complaints.
    - (OT, luckily one of the two Unicode TeX engines is called LuaTeX.)

    jmf
     
    , Sep 15, 2012
    #6