xsl and unicode surrogate characters

Discussion in 'Python' started by Sakcee, Jan 5, 2006.

  1. Sakcee

    Sakcee Guest

    Hi

    In one of the data files that I have , I am seeing these characters
    \xed\xa0\xa0 . They seem to break the xsl.

    ---------------------------------------------------------------
    Extra content at the end of the document
    XML/XSL Error: </data><data ><![CDATA[ í Pls advice
    ----------------------------------------------------------------


    this seems to break the libxml2/libxslt

    is this a unicode utf-16 surrogate pair ?
    for displaying it on xml/xsl, should I extract only \xa0?
    since this is hingher than 00-7f range can i just strip it?
    under what condition the encoding software put this string in?


    thanks for help,
    Sakcee, Jan 5, 2006
    #1
    1. Advertising

  2. Sakcee wrote:
    > Hi
    >
    > In one of the data files that I have , I am seeing these characters
    > \xed\xa0\xa0 . They seem to break the xsl.

    [...]
    > is this a unicode utf-16 surrogate pair ?


    Yes and no. This is the UTF-8 encoding of U+D820, which is a high
    surrogate code point. So yes. It's not yet a pair; there would have to
    be a second such code point. So no.

    Furthermore, in UTF-8, you should never ever have encoded surrogate
    codes; instead, whoever generated the UTF-8 should have combined the
    two surrogate code point into a single coded character, and should
    have encoded *that* character. So no - this byte sequence isn't
    even valid UTF-8.

    > for displaying it on xml/xsl, should I extract only \xa0?


    You should tell your parser to reject the file as ill-formed.

    > since this is hingher than 00-7f range can i just strip it?


    Depending an what you want to achieve: sure! It will modify
    the meaning of the bytes, of course.

    > under what condition the encoding software put this string in?


    If it has a bug.

    Regards,
    Martin
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jan 5, 2006
    #2
    1. Advertising

  3. Sakcee

    Sakcee Guest

    thanks very much for the info, it really helped

    we are using the text from file to display on webpage and we have a
    method for conversion the parsed data to utf-8 and then displaying, all
    the data looks fine after parsing except the
    surrogate pair,
    since i can not guess what it was supposed to be , is it ok to strip it
    using regex re.complie(' [\xed|\xa0] ')?




    Martin v. Löwis wrote:
    > Sakcee wrote:
    > > Hi
    > >
    > > In one of the data files that I have , I am seeing these characters
    > > \xed\xa0\xa0 . They seem to break the xsl.

    > [...]
    > > is this a unicode utf-16 surrogate pair ?

    >
    > Yes and no. This is the UTF-8 encoding of U+D820, which is a high
    > surrogate code point. So yes. It's not yet a pair; there would have to
    > be a second such code point. So no.
    >
    > Furthermore, in UTF-8, you should never ever have encoded surrogate
    > codes; instead, whoever generated the UTF-8 should have combined the
    > two surrogate code point into a single coded character, and should
    > have encoded *that* character. So no - this byte sequence isn't
    > even valid UTF-8.
    >
    > > for displaying it on xml/xsl, should I extract only \xa0?

    >
    > You should tell your parser to reject the file as ill-formed.
    >
    > > since this is hingher than 00-7f range can i just strip it?

    >
    > Depending an what you want to achieve: sure! It will modify
    > the meaning of the bytes, of course.
    >
    > > under what condition the encoding software put this string in?

    >
    > If it has a bug.
    >
    > Regards,
    > Martin
    Sakcee, Jan 5, 2006
    #3
  4. Sakcee wrote:

    > thanks very much for the info, it really helped
    >
    > we are using the text from file to display on webpage and we have a
    > method for conversion the parsed data to utf-8 and then displaying, all
    > the data looks fine after parsing except the
    > surrogate pair,
    > since i can not guess what it was supposed to be , is it ok to strip it
    > using regex re.complie(' [\xed|\xa0] ')?


    As martin said: that alters the meaning of the bytes. If that has to bother
    you or not, that's yours to decide. If for example you stripped all vocals
    from a text, it still might be comprehensible for most people, so if vocals
    bother you for whatever reason, remove them.

    Bt myb y bttr try nd fx th prblm n th frst plc.

    Regards,

    Diez
    Diez B. Roggisch, Jan 5, 2006
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?bmVpbHQxMDA=?=

    COM Surrogate Errors

    =?Utf-8?B?bmVpbHQxMDA=?=, Jul 21, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    8,678
    lhardwick69
    Mar 24, 2007
  2. rao.kamran
    Replies:
    0
    Views:
    1,881
    rao.kamran
    May 21, 2008
  3. Grzegorz ¦liwiñski
    Replies:
    2
    Views:
    930
    Grzegorz ¦liwiñski
    Jan 19, 2011
  4. neilt100

    COM Surrogate Errors

    neilt100, Jul 21, 2005, in forum: ASP General
    Replies:
    0
    Views:
    140
    neilt100
    Jul 21, 2005
  5. J Krugman

    "Surrogate infinity" in Perl?

    J Krugman, Sep 8, 2003, in forum: Perl Misc
    Replies:
    7
    Views:
    130
    Eric Wilhelm
    Nov 3, 2003
Loading...

Share This Page