unicode string alteration

BAvant Garde · Aug 12, 2010

HELP!!!
I need help with a Unicode issue that has me stumped. I must be doing something wrong, because I don't believe this condition would have slipped through testing.

Wherever the string u'\udbff\udc00' occurs, u'\U0010fc00' (that is, unichr(1113088)) is substituted and the file loses one character, so all trailing characters are shifted out of position. No other corrupted strings have been detected.
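For reference, the substituted value is exactly what the standard UTF-16 surrogate-pair formula gives for this pair of code units. A minimal sketch of that arithmetic follows; the helper name combine_surrogates is made up for illustration and is not part of Python.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("high surrogate out of range")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("low surrogate out of range")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xDBFF, 0xDC00)
print(hex(code_point))  # 0x10fc00
print(code_point)       # 1113088
```

So u'\udbff\udc00' and u'\U0010fc00' name the same code point; what differs is whether it is held as two 16-bit units or as one character.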

The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04, where the maximum ord() value is 1114111 (wide Python build).
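The build type can be checked at run time through sys.maxunicode, which is the "maximum ord" figure quoted here. A small sketch; is_wide_build is a hypothetical helper name, and note that on Python 3.3 and later the value is always 1114111.

```python
import sys


def is_wide_build():
    """True when a single character can hold code points above U+FFFF."""
    return sys.maxunicode > 0xFFFF


print(sys.maxunicode)   # 1114111 on a wide build, 65535 on a narrow build
print(is_wide_build())
```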

Using Python 2.5.4 on Windows ME, where the maximum ord() value is 65535 (narrow Python build), the string u'\U0010fc00' also occurs and it "seems" that the substitution takes place, but no characters are lost and file sizes are OK. Note that ord(u'\U0010fc00') causes the following error:
"TypeError: ord() expected a character, but string of length 2 found"
The condition is otherwise invisible in 2.5.4 and is handled internally without any apparent effect on processing, with the characters u'\udbff' and u'\udc00' each being separately accessible.
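The difference in length between the two builds can be illustrated without depending on either one by working on plain integers. This is only a sketch, assuming the narrow build keeps the pair as two 16-bit units while the wide build counts it as one character; count_code_points is a made-up name.

```python
def count_code_points(units):
    """Count code points in a list of UTF-16 code units (ints)."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A valid high+low pair is one code point; anything else counts singly.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = [0x41, 0xDBFF, 0xDC00, 0x42]  # "A", the surrogate pair, "B"
print(len(units))                # 4, the length seen when units are counted
print(count_code_points(units))  # 3, the length seen when the pair is merged
```

Every offset after the pair differs by one between the two views, which matches the one-character shift seen on the wide build.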

The first part of the attachment repeats this email but also has examples and illustrates other related oddities.

Any help would be greatly appreciated.
Bruce

