Least-lossy string.encode to us-ascii?

Tim Chase

I've got a bunch of text in Portuguese and to transmit them, need to
have them in us-ascii (7-bit). I'd like to keep as much information
as possible, just stripping accents, cedillas, tildes, etc. So
"serviço móvil" becomes "servico movil". Is there anything stock
that I've missed? I can do mystring.encode('us-ascii', 'replace')
but that doesn't keep as much information as I'd hope.
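
For reference, a minimal sketch of what the two stock error handlers do with the example phrase. The variable names are only illustrative, and the accented letters are written as escapes so the snippet does not depend on the file's encoding.

```python
# "serviço móvil", with the accented letters written as escapes
text = "servi\u00e7o m\u00f3vil"

# 'replace' swaps every non-ASCII character for a question mark
replaced = text.encode("us-ascii", "replace")

# 'ignore' drops the character entirely
ignored = text.encode("us-ascii", "ignore")

print(replaced)  # b'servi?o m?vil'
print(ignored)   # b'servio mvil'
```

Either way the base letter is lost, which is the information loss described above.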

-tkc
 
Steven D'Aprano

I've got a bunch of text in Portuguese and to transmit them, need to
have them in us-ascii (7-bit).

That could mean two things:

1) "The receiver is incapable of dealing with Unicode in 2012, which is
frankly appalling, but what can I do about it?"

2) "The transport mechanism I use to transmit the data is only capable of
dealing with 7-bit ASCII strings, which is sad but pretty much standard."

In the case of 1), I suggest you look at the Unicode Hammer, a.k.a. "The
Stupid American":

http://code.activestate.com/recipes/251871

and especially the very many useful comments.
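
One standard-library approach often suggested for this kind of accent stripping is to decompose the text with unicodedata.normalize and then drop whatever is not ASCII. The sketch below is not necessarily what the linked recipe does, and strip_accents is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    """Return an ASCII approximation of text (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus
    # separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so 'ignore' drops them and
    # keeps the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("servi\u00e7o m\u00f3vil"))  # servico movil
```

Characters that have no decomposition, such as "ß" or "ø", are dropped silently rather than transliterated, so a hand-written mapping may still be needed for them.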


In the case of 2), just binhex or uuencode your data for transport.
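
A minimal sketch of that idea, using base64 in place of binhex or uuencode (the principle is the same, and the binhex and uu modules have been removed from recent Python releases). It assumes the receiving end can decode the payload and agrees on UTF-8.

```python
import base64

text = "servi\u00e7o m\u00f3vil"

# Sender: encode to UTF-8 bytes, then wrap them in a 7-bit-safe form.
wire = base64.b64encode(text.encode("utf-8"))

# Receiver: undo both steps to get the original text back, accents included.
restored = base64.b64decode(wire).decode("utf-8")

print(restored == text)  # True
```

Nothing is lost this way, but the payload is no longer human-readable on the wire.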
 
wxjmfauth

On Thursday, 13 September 2012 at 23:25:27 UTC+2, Tim Chase wrote:
I've got a bunch of text in Portuguese and to transmit them, need to
have them in us-ascii (7-bit). I'd like to keep as much information
as possible, just stripping accents, cedillas, tildes, etc. So
"serviço móvil" becomes "servico movil". Is there anything stock
that I've missed? I can do mystring.encode('us-ascii', 'replace')
but that doesn't keep as much information as I'd hope.

Interesting case. It's where character encoding meets
character usage, scripts, typography, and linguistic
features.

I can't discuss the Portuguese case, but in French
and in German one way to achieve the task is to
convert the text to uppercase. The result is still
correct text.
'LAETITIA COEUR ELEPHANT FRANCAIS LUY STOSS ERKLAERUNG STOEREN'
True
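
A rough sketch of the uppercase idea, not necessarily the code that produced the output above. The EXPANSIONS table and the upper_ascii name are assumptions made for illustration, and the table mixes German and French conventions, so it would mishandle a French diaeresis such as the one in "Saül".

```python
import unicodedata

# Hypothetical table of uppercase letters conventionally written as digraphs.
EXPANSIONS = str.maketrans({
    "\u00c4": "AE",  # A with diaeresis (German)
    "\u00d6": "OE",  # O with diaeresis (German)
    "\u00dc": "UE",  # U with diaeresis (German)
    "\u00c6": "AE",  # AE ligature
    "\u0152": "OE",  # OE ligature (French)
})

def upper_ascii(text):
    """Uppercase text and reduce it to ASCII (hypothetical helper)."""
    # str.upper() already turns the German sharp s into 'SS' in Python 3.
    upper = unicodedata.normalize("NFC", text.upper())
    expanded = upper.translate(EXPANSIONS)
    # Remaining accents are split off by NFKD and dropped by 'ignore'.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(upper_ascii("c\u0153ur fran\u00e7ais sto\u00df Erkl\u00e4rung"))
# COEUR FRANCAIS STOSS ERKLAERUNG
```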

PS Avoid Py3.3 :)

jmf
 
Terry Reedy

PS Avoid Py3.3 :)

pps Start using 3.3 as soon as possible. It has Python's first fully
portable non-buggy Unicode implementation. The second release candidate
is already out.
 
wxjmfauth

On Friday, 14 September 2012 at 22:45:05 UTC+2, Terry Reedy wrote:
pps Start using 3.3 as soon as possible. It has Python's first fully
portable non-buggy Unicode implementation. The second release candidate
is already out.

- I will drop Python.
- No complaints.
- (OT, luckily one of the two Unicode TeX engines is called LuaTeX.)

jmf
 
