xsl and unicode surrogate characters

Sakcee · Jan 5, 2006

Hi

In one of the data files that I have, I am seeing these characters:
\xed\xa0\xa0. They seem to break the XSL.

---------------------------------------------------------------
Extra content at the end of the document
XML/XSL Error: </data><data ><![CDATA[ í Pls advice
----------------------------------------------------------------

This seems to break libxml2/libxslt.

Is this a Unicode UTF-16 surrogate pair?
For displaying it in XML/XSL, should I extract only \xa0?
Since this is higher than the 00-7f range, can I just strip it?
Under what conditions would the encoding software put this string in?

thanks for help,

Martin v. Löwis · Jan 5, 2006

Sakcee said:
Hi

In one of the data files that I have, I am seeing these characters:
\xed\xa0\xa0. They seem to break the XSL. [...]
Is this a Unicode UTF-16 surrogate pair?

Yes and no. This is the UTF-8 encoding of U+D820, which is a high
surrogate code point. So yes. It's not yet a pair; there would have to
be a second such code point. So no.
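
To make the arithmetic concrete, here is a minimal sketch that unpacks the three bytes by hand. The helper names are made up for illustration, and the decoder deliberately performs no validity checks:

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8-style sequence by hand (no validity checks)."""
    b0, b1, b2 = seq
    # Layout: 1110xxxx 10yyyyyy 10zzzzzz -> xxxx yyyyyy zzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def surrogate_kind(cp: int) -> str:
    """Classify a code point as a high surrogate, low surrogate, or neither."""
    if 0xD800 <= cp <= 0xDBFF:
        return "high"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low"
    return "none"


cp = decode_three_byte(b"\xed\xa0\xa0")
print(f"U+{cp:04X}", surrogate_kind(cp))  # U+D820 high
```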

Furthermore, in UTF-8, you should never ever have encoded surrogate
codes; instead, whoever generated the UTF-8 should have combined the
two surrogate code points into a single coded character, and should
have encoded *that* character. So no - this byte sequence isn't
even valid UTF-8.
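
As a rough sketch of what the generator should have done: combine the pair into one code point, then encode that. The low surrogate U+DC00 below is an invented partner purely for illustration, since the data in question only shows the high half. The behaviour shown assumes Python 3, whose UTF-8 codec is strict about surrogates.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+DC00 is a hypothetical partner; the real one is unknown.
cp = combine_surrogates(0xD820, 0xDC00)
good = chr(cp).encode("utf-8")
print(f"U+{cp:04X}", good)  # U+18000 b'\xf0\x98\x80\x80'

# Encoding each surrogate separately yields bytes a strict decoder rejects.
bad = b"\xed\xa0\xa0\xed\xb0\x80"
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```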

For displaying it in XML/XSL, should I extract only \xa0?

You should tell your parser to reject the file as ill-formed.
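
If the parser cannot be configured that way, one option is a pre-check before the data reaches it. This is only a sketch, assuming Python 3 (whose strict UTF-8 decoder refuses encoded surrogates); the function name is made up:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first ill-formed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


sample = b"<data>\xed\xa0\xa0</data>"
offset = first_invalid_utf8_offset(sample)
if offset is not None:
    print(f"ill-formed UTF-8 at byte offset {offset}; rejecting input")
```

A caller could then refuse the file, or log the offset, instead of handing it to libxml2.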

Since this is higher than the 00-7f range, can I just strip it?

Depending on what you want to achieve: sure! It will modify
the meaning of the bytes, of course.
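
If you do decide to strip, match the whole three-byte sequence rather than individual bytes: 0xED and 0xA0 also occur inside perfectly valid multi-byte characters. A possible sketch on raw bytes follows; the pattern and function names are illustrative, not taken from any library:

```python
import re

# 0xED, then 0xA0-0xBF, then one continuation byte: this is exactly the
# UTF-8-style encoding of the surrogate range U+D800..U+DFFF.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def strip_encoded_surrogates(data: bytes) -> bytes:
    """Drop every encoded surrogate and leave all other bytes untouched."""
    return ENCODED_SURROGATE.sub(b"", data)


print(strip_encoded_surrogates(b"<data>\xed\xa0\xa0 Pls advice</data>"))
# b'<data> Pls advice</data>'
```

Valid sequences that merely start with 0xED, such as U+D7FF (ed 9f bf), are left alone because their second byte falls outside a0-bf.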

Under what conditions would the encoding software put this string in?

If it has a bug.

Regards,
Martin

Sakcee · Jan 5, 2006

Thanks very much for the info, it really helped.

We are using the text from the file to display on a webpage, and we have a
method for converting the parsed data to UTF-8 and then displaying it. All
the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?

Diez B. Roggisch · Jan 5, 2006

Sakcee said:
Thanks very much for the info, it really helped.

We are using the text from the file to display on a webpage, and we have a
method for converting the parsed data to UTF-8 and then displaying it. All
the data looks fine after parsing except the surrogate pair.
Since I cannot guess what it was supposed to be, is it OK to strip it
using the regex re.compile(' [\xed|\xa0] ')?

As Martin said: that alters the meaning of the bytes. Whether that has to
bother you or not is yours to decide. If, for example, you stripped all vowels
from a text, it still might be comprehensible to most people, so if vowels
bother you for whatever reason, remove them.

Bt myb y bttr try nd fx th prblm n th frst plc.

Regards,

Diez
