Re: Least-lossy string.encode to us-ascii?

Discussion in 'Python' started by Tim Chase, Sep 14, 2012.

  1. Tim Chase

    Tim Chase Guest

    On 09/13/12 18:36, Terry Reedy wrote:
    > On 9/13/2012 5:26 PM, Tim Chase wrote:
    >> I've got a bunch of text in Portuguese and to transmit them, need to
    >> have them in us-ascii (7-bit). I'd like to keep as much information
    >> as possible, just stripping accents, cedillas, tildes, etc.
    >
    > 'keep as much information as possible' would mean an effectively
    > lossless transliteration, which you could do with a dict.
    > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
    > never occur in normal text of the sort you are transmitting), ...}
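
Terry's dict idea could be sketched roughly as follows in Python 3. The mapping and the brace markers are illustrative assumptions, not something given in the thread, and the scheme only stays reversible if `{` and `}` never occur in the text being sent.

```python
# Hypothetical mapping: base letter plus a marker wrapped in braces.
# Assumption: "{" and "}" never appear in the text being transmitted.
ENCODE_MAP = {
    "ç": "c{,}",
    "ó": "o{'}",
    "é": "e{'}",
    "ã": "a{~}",
}


def to_ascii(text):
    """Replace each mapped character with its ASCII stand-in."""
    return text.translate(str.maketrans(ENCODE_MAP))


def from_ascii(text):
    """Undo to_ascii() by restoring each stand-in to its original character."""
    for char, stand_in in ENCODE_MAP.items():
        text = text.replace(stand_in, char)
    return text


encoded = to_ascii("serviço móvil")
print(encoded)              # servic{,}o mo{'}vil
print(from_ascii(encoded))  # serviço móvil
```

Characters missing from the mapping pass through unchanged, so a real table would need an entry for every non-ASCII character expected in the input.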


    Vlastimil's solution kept the characters but stripped them of their
    accents/tildes/cedillas/etc, doing just what I wanted, all using the
    stdlib. Hard to do better than that :)
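
Vlastimil's code is not quoted in this thread. One common stdlib way to strip accents, shown here in Python 3 as an assumption about the general technique rather than as his exact solution, is to decompose each character with `unicodedata.normalize` and then drop whatever does not fit in ASCII. The helper name `strip_accents` is made up for this sketch.

```python
import unicodedata


def strip_accents(text):
    """Return an ASCII-only approximation of text (lossy)."""
    # NFKD splits "ç" into "c" plus a combining cedilla, "ó" into "o" plus
    # a combining acute accent, and so on.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so errors="ignore" discards them.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("serviço móvil"))  # servico movil
```

Characters with no decomposition, such as "ø" or "ß", are dropped entirely by this approach, so it suits Portuguese text better than arbitrary Unicode input.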

    -tkc
     
    Tim Chase, Sep 14, 2012
    #1

  2. Tim Chase

    Mark Tolonen Guest

    On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
    > On 09/13/12 18:36, Terry Reedy wrote:
    > > On 9/13/2012 5:26 PM, Tim Chase wrote:
    > >> I've got a bunch of text in Portuguese and to transmit them, need to
    > >> have them in us-ascii (7-bit). I'd like to keep as much information
    > >> as possible, just stripping accents, cedillas, tildes, etc.
    > >
    > > 'keep as much information as possible' would mean an effectively
    > > lossless transliteration, which you could do with a dict.
    > > {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would
    > > never occur in normal text of the sort you are transmitting), ...}
    >
    > Vlastimil's solution kept the characters but stripped them of their
    > accents/tildes/cedillas/etc, doing just what I wanted, all using the
    > stdlib. Hard to do better than that :)
    >
    > -tkc


    How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.

    >>> s = u"serviço móvil".encode('utf-7')
    >>> print s
    servi+AOc-o m+APM-vil
    >>> print s.decode('utf-7')
    serviço móvil
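
The session above is Python 2. A rough Python 3 equivalent of the same round trip, written as a sketch rather than as Mark's original code, might look like this:

```python
text = "serviço móvil"

# In Python 3, str.encode returns bytes; every UTF-7 byte is 7-bit.
encoded = text.encode("utf-7")
print(encoded)  # b'servi+AOc-o m+APM-vil'

# The receiving end must know to decode UTF-7 to recover the original.
decoded = encoded.decode("utf-7")
print(decoded)  # serviço móvil
```

Nothing is lost, but a receiver that does not decode UTF-7 will display the raw `+AOc-` sequences.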

    -Mark
     
    Mark Tolonen, Sep 14, 2012
    #2

  4. Tim Chase

    Tim Chase Guest

    On 09/13/12 21:09, Mark Tolonen wrote:
    > On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
    >> Vlastimil's solution kept the characters but stripped them of their
    >> accents/tildes/cedillas/etc, doing just what I wanted, all using the
    >> stdlib. Hard to do better than that :)
    >
    > How about using UTF-7 for transmission and decode on the other end? This keeps the transmission all 7-bit, and no loss.
    >
    > >>> s=u"serviço móvil".encode('utf-7')
    > >>> print s
    > servi+AOc-o m+APM-vil
    > >>> print s.decode('utf-7')
    > serviço móvil

    Nice if I control both ends of the pipe. Unfortunately, I only
    control what goes in, and I want it to be as un-screw-uppable as
    possible when it comes out the other end (may be web, CSV files,
    PDFs, FTP'ed file dumps, spreadsheets, word-processing documents,
    etc), and us-ascii is the lowest-common-denominator of
    unscrewuppableness while requiring nothing of the other end. :)

    -tkc
     
    Tim Chase, Sep 14, 2012
    #4
  5. On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:

    > On 09/13/12 21:09, Mark Tolonen wrote:
    >> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
    >>> Vlastimil's solution kept the characters but stripped them of their
    >>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
    >>> stdlib. Hard to do better than that :)
    >>
    >> How about using UTF-7 for transmission and decode on the other end?
    >> This keeps the transmission all 7-bit, and no loss.
    >>
    >> >>> s=u"serviço móvil".encode('utf-7')
    >> >>> print s
    >> servi+AOc-o m+APM-vil
    >> >>> print s.decode('utf-7')
    >> serviço móvil
    >
    > Nice if I control both ends of the pipe. Unfortunately, I only control
    > what goes in, and I want it to be as un-screw-uppable as possible when
    > it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
    > dumps, spreadsheets, word-processing documents, etc), and us-ascii is
    > the lowest-common-denominator of unscrewuppableness while requiring
    > nothing of the other end. :)


    Wrong. It requires support for US-ASCII. What if the other end is an IBM
    mainframe using EBCDIC?
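
To make the EBCDIC point concrete, here is a small Python 3 sketch comparing the bytes for the same text in ASCII and in `cp037`, one of the EBCDIC code pages shipped with Python. Picking `cp037` is an assumption made for illustration; a real mainframe may use a different code page.

```python
text = "servico movil"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumed EBCDIC code page

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# The same text yields different bytes, so even "plain ASCII" requires the
# receiver to agree on the encoding.
try:
    ebcdic_bytes.decode("ascii")
    readable_as_ascii = True
except UnicodeDecodeError:
    readable_as_ascii = False

print(readable_as_ascii)
```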

    Frankly, I am appalled that you are intentionally perpetuating the
    ignorance of US-ASCII-only applications, not because you have no choice
    about inter-operating with some ancient, brain-dead application, but
    because you artificially choose to follow an obsolete *and incorrect*
    standard.

    It is *incorrect* because you can change the meaning of text by stripping
    accents and deleting characters. Consequences can include murder and suicide:

    http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
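
The same risk exists in Portuguese. As a rough Python 3 illustration (the helper name and the word list are chosen for this sketch, not taken from the thread), distinct words can collapse to the same ASCII string once accents are stripped:

```python
import unicodedata


def strip_accents(text):
    """Lossy ASCII approximation: decompose, then drop non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Word pairs with different meanings:
#   avó = grandmother, avô = grandfather
#   país = country,    pais = parents
pairs = [("avó", "avô"), ("país", "pais")]

collisions = [
    (a, b, strip_accents(a))
    for a, b in pairs
    if strip_accents(a) == strip_accents(b)
]

for a, b, merged in collisions:
    print(a, "and", b, "both become", merged)
```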

    At least tell me that "ASCII only" is merely an *option* for your
    application, not the only choice, and that it defaults to UTF-8 which is
    the right standard to use for text.



    --
    Steven
     
    Steven D'Aprano, Sep 14, 2012
    #5
  6. Tim Chase

    Terry Reedy Guest

    On 9/13/2012 10:09 PM, Mark Tolonen wrote:
    > On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
    >> On 09/13/12 18:36, Terry Reedy wrote:


    >>> 'keep as much information as possible' would mean an effectively
    >>> lossless transliteration, which you could do with a dict.
    >>> {<o-with-accent>: 'o', <c-cedilla>: 'c,' (or pick something that would


    >> Vlastimil's solution kept the characters but stripped them of their
    >> accents/tildes/cedillas/etc, doing just what I wanted, all using the
    >> stdlib. Hard to do better than that :)


    You mean, hard to do better than what you think you want, as opposed to
    what you said you wanted in both the subject line and the text line I
    quoted. What you need depends on why you need ascii only text and what
    the recipient will do with the ascii only text. Print it on an
    ascii-only printer? Or something similar? If so, a lossy encoding may be
    sufficient, but why not let the recipient decide to toss info?

    > How about using UTF-7 for transmission and decode on the other end?
    > This keeps the transmission all 7-bit, and no loss.
    >
    > >>> s=u"serviço móvil".encode('utf-7')
    > >>> print s
    > servi+AOc-o m+APM-vil
    > >>> print s.decode('utf-7')
    > serviço móvil


    Nice. I was barely aware of that option and had forgotten it. This and
    similar suggestions to use existing methods are much better than my
    hackish approach.

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 14, 2012
    #6
