Replacement in unicodestrings?

KvS

Dear all,

could somebody please just put an end to the unicode misery I'm in...
The situation is that I have a Tkinter program that lets the user
enter data in some Entries, and this data needs to be transformed to
an encoding compatible with an .rtf file. In fact I only need to
handle some of the usual symbols like ë etc.

Here's the function that I am using:

    def pythonUnicodeToRTFAscii(self,s):
        if isinstance(s,str):
            return s
        s_str=repr(s.encode('UTF-8'))
        replDic={'\xc3\xa0':"\\'e0",'\xc3\xa4':"\\'e4",'\xc3\xa1':"\\'e1",
                 '\xc3\xa8':"\\'e8",'\xc3\xab':"\\'eb",'\xc3\xa9':"\\'e9",
                 '\xc3\xb2':"\\'f2",'\xc3\xb6':"\\'f6",'\xc3\xb3':"\\'f3",
                 '\xe2\x82\xac':"\\'80"}
        for k in replDic.keys():
            if repr(k) in s_str:
                s_str=s_str.replace(repr(k),replDic[k])
        return s_str

So replDic represents the mapping from one encoding to the other. Now,
if I enter e.g. 'Arjën' in the Entry, then s_str in the above function
becomes 'Arj\xc3\xabn' and since replDic contains the key \xc3\xab I
would expect the replacement in the final lines of the function to
kick in. This however doesn't happen, there's no match.

However, the same test in the interactive interpreter gives:
True

I just don't get it, what's the difference? And is the above even the
best way to attack such a problem?

Thanks & best wishes, Kees
 
Martin v. Löwis

s_str=repr(s.encode('UTF-8'))

It would be easier to encode this in cp1252 here, as this is apparently
the encoding that you want to use in the RTF file, too. You could then
loop over the string, replacing all bytes >= 128 with \\'%.2x

As yet another alternative, you could create a Unicode error handler
(call it 'rtf'), and then do

return s.encode('ascii', errors='rtf')
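One way such an error handler might look, sketched for Python 3 with `codecs.register_error`. The handler name `rtf_error_handler` and the wrapper `to_rtf_ascii` are hypothetical, and using cp1252 for the hex values is an assumption carried over from the suggestion above:

```python
import codecs


def rtf_error_handler(exc):
    """Replace unencodable characters with RTF \\'xx escapes (cp1252 values)."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]          # the run that ASCII rejected
    replacement = "".join("\\'%.2x" % b for b in bad.encode("cp1252"))
    return replacement, exc.end                   # resume encoding after the run


codecs.register_error("rtf", rtf_error_handler)


def to_rtf_ascii(text):
    # encode() returns bytes in Python 3; decode back to get a str
    return text.encode("ascii", errors="rtf").decode("ascii")


print(to_rtf_ascii("Arjën"))   # Arj\'ebn
```

The handler is only called for characters outside ASCII, so plain text passes through untouched. A character that cp1252 cannot represent still raises UnicodeEncodeError from inside the handler.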
        replDic={'\xc3\xa0':"\\'e0",'\xc3\xa4':"\\'e4",'\xc3\xa1':"\\'e1",
                 '\xc3\xa8':"\\'e8",'\xc3\xab':"\\'eb",'\xc3\xa9':"\\'e9",
                 '\xc3\xb2':"\\'f2",'\xc3\xb6':"\\'f6",'\xc3\xb3':"\\'f3",
                 '\xe2\x82\xac':"\\'80"}
        for k in replDic.keys():
            if repr(k) in s_str:
                s_str=s_str.replace(repr(k),replDic[k])
        return s_str

However, the same test in the interactive interpreter gives:
True

I just don't get it, what's the difference?

It's the repr():

py> '\xc3\xab' in 'Arj\xc3\xabn'
True
py> repr('\xc3\xab') in repr('Arj\xc3\xabn')
False
py> repr('\xc3\xab')
"'\\xc3\\xab'"
py> repr('Arj\xc3\xabn')
"'Arj\\xc3\\xabn'"

repr('\xc3\xab') starts with an apostrophe, which doesn't
appear before the \\xc3 in repr('Arj\xc3\xabn').
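If the dictionary approach is kept, the simplest fix is probably to drop repr() altogether and replace on the encoded bytes directly. A rough Python 3 sketch follows; the names `REPLACEMENTS` and `utf8_to_rtf` are invented here, and the table is only a subset of the one in the question:

```python
# UTF-8 byte sequences mapped to their RTF escapes (subset, for illustration)
REPLACEMENTS = {
    b"\xc3\xab": b"\\'eb",      # ë
    b"\xc3\xa9": b"\\'e9",      # é
    b"\xc3\xb6": b"\\'f6",      # ö
    b"\xe2\x82\xac": b"\\'80",  # € (cp1252 value)
}


def utf8_to_rtf(text):
    data = text.encode("utf-8")
    for utf8_bytes, rtf_escape in REPLACEMENTS.items():
        # Compare raw bytes with raw bytes; no repr() on either side
        data = data.replace(utf8_bytes, rtf_escape)
    # Raises UnicodeDecodeError if a character missing from the table remains
    return data.decode("ascii")


print(utf8_to_rtf("Arjën"))   # Arj\'ebn
```

Because both sides of the comparison are raw bytes, the quote characters that repr() adds never enter the picture.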

HTH,
Martin
 
