Replacement in unicodestrings?

KvS · Oct 5, 2008

Dear all,

Could somebody please put an end to the Unicode misery I'm in?
The situation is that I have a Tkinter program that lets the
user enter data in some Entry widgets, and this data needs to be
transformed to an encoding compatible with an .rtf file. In fact I only
need to handle some of the usual symbols like ë etc.

Here's the function that I am using:

def pythonUnicodeToRTFAscii(self,s):
    if isinstance(s,str):
        return s
    s_str=repr(s.encode('UTF-8'))
    replDic={'\xc3\xa0':"\\'e0",'\xc3\xa4':"\\'e4",'\xc3\xa1':"\\'e1",
             '\xc3\xa8':"\\'e8",'\xc3\xab':"\\'eb",'\xc3\xa9':"\\'e9",
             '\xc3\xb2':"\\'f2",'\xc3\xb6':"\\'f6",'\xc3\xb3':"\\'f3",
             '\xe2\x82\xac':"\\'80"}
    for k in replDic.keys():
        if repr(k) in s_str:
            s_str=s_str.replace(repr(k),replDic[k])
    return s_str

So replDic represents the mapping from one encoding to the other. Now,
if I enter e.g. 'Arjën' in the Entry, then s_str in the above function
becomes 'Arj\xc3\xabn' and since replDic contains the key \xc3\xab I
would expect the replacement in the final lines of the function to
kick in. This however doesn't happen, there's no match.

However interactive:
True

I just don't get it; what's the difference? And is the above even the
best way to attack such a problem?

Thanks & best wishes, Kees

Martin v. Löwis · Oct 5, 2008

s_str=repr(s.encode('UTF-8'))

It would be easier to encode this in cp1252 here, as this is apparently
the encoding that you want to use in the RTF file, too. You could then
loop over the string, replacing all bytes >= 128 with \\'%.2x
applied to the byte value.
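
A minimal sketch of that loop, written in Python 3 syntax (an assumption;
the thread itself uses Python 2 byte strings). The function name
to_rtf_ascii is hypothetical, and a character outside cp1252 would raise
UnicodeEncodeError here rather than being escaped:

```python
def to_rtf_ascii(text):
    # Encode to cp1252, the code page the RTF \'xx escapes refer to here.
    raw = text.encode("cp1252")
    parts = []
    for byte in raw:  # iterating over bytes yields ints in Python 3
        if byte >= 128:
            # Non-ASCII byte: emit backslash, apostrophe, two hex digits.
            parts.append("\\'%.2x" % byte)
        else:
            parts.append(chr(byte))
    return "".join(parts)


rtf_name = to_rtf_ascii("Arj\u00ebn")  # "Arjën"
```

Because the escape is derived from the byte value, no lookup table of
individual characters is needed.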

As yet another alternative, you could create a Unicode error handler
(call it 'rtf'), and then do

return s.encode('ascii', errors='rtf')
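
One way such a handler might look, again as a sketch in Python 3 syntax.
codecs.register_error is the standard-library hook for this; the function
name rtf_errors and the decision to re-raise for characters that cp1252
cannot represent are assumptions made for illustration:

```python
import codecs


def rtf_errors(exc):
    # Only encoding errors are handled; anything else is raised again.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        try:
            byte = ch.encode("cp1252")[0]
        except UnicodeEncodeError:
            # Assumption: characters outside cp1252 are left as an error.
            raise exc
        pieces.append("\\'%.2x" % byte)
    # Return the replacement text and the position to resume encoding at.
    return "".join(pieces), exc.end


codecs.register_error("rtf", rtf_errors)

encoded = "Arj\u00ebn".encode("ascii", errors="rtf")  # "Arjën"
```

The handler is only called for the characters that ASCII cannot encode,
so plain text passes through untouched.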

    replDic={'\xc3\xa0':"\\'e0",'\xc3\xa4':"\\'e4",'\xc3\xa1':"\\'e1",
             '\xc3\xa8':"\\'e8",'\xc3\xab':"\\'eb",'\xc3\xa9':"\\'e9",
             '\xc3\xb2':"\\'f2",'\xc3\xb6':"\\'f6",'\xc3\xb3':"\\'f3",
             '\xe2\x82\xac':"\\'80"}
    for k in replDic.keys():
        if repr(k) in s_str:
            s_str=s_str.replace(repr(k),replDic[k])
    return s_str

However interactive:
True

I just don't get it, what's the difference?

It's the repr():

py> '\xc3\xab' in 'Arj\xc3\xabn'
True
py> repr('\xc3\xab') in repr('Arj\xc3\xabn')
False
py> repr('\xc3\xab')
"'\\xc3\\xab'"
py> repr('Arj\xc3\xabn')
"'Arj\\xc3\\xabn'"

repr('\xc3\xab') starts with an apostrophe, which doesn't
appear before the \\xc3 in repr('Arj\xc3\xabn').
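
A sketch of the same table-driven replacement with the repr() calls
dropped, so the strings themselves are compared. It uses Python 3 syntax
and keys the table by character rather than by UTF-8 bytes (both are
assumptions; the names RTF_ESCAPES and replace_accents are hypothetical):

```python
# Escapes taken from the replDic in the question, keyed by the character.
RTF_ESCAPES = {
    "\u00e0": "\\'e0",  # à
    "\u00e4": "\\'e4",  # ä
    "\u00e1": "\\'e1",  # á
    "\u00e8": "\\'e8",  # è
    "\u00eb": "\\'eb",  # ë
    "\u00e9": "\\'e9",  # é
    "\u00f2": "\\'f2",  # ò
    "\u00f6": "\\'f6",  # ö
    "\u00f3": "\\'f3",  # ó
    "\u20ac": "\\'80",  # euro sign
}


def replace_accents(text):
    # Replace on the string itself; no repr() on either side.
    for char, escape in RTF_ESCAPES.items():
        text = text.replace(char, escape)
    return text


# The direct containment test matches; the repr()-based one does not,
# because repr() wraps each string in its own pair of quotes.
direct_match = "\u00eb" in "Arj\u00ebn"
repr_match = repr("\u00eb") in repr("Arj\u00ebn")
```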

HTH,
Martin
