Re: [ANN] pyxser-1.2r --- Python-Object to XML serialization module

Discussion in 'Python' started by Stefan Behnel, Aug 24, 2009.

  1. Daniel Molina Wegener wrote:
    > * Added encoded serialization of Unicode strings, using the
    > user-defined encoding passed to the serialization functions
    > as the enc parameter
    >
    > As you can see, Unicode strings are now serialized as encoded byte
    > strings, using the encoding that the user passes as the enc parameter
    > to the serialization function. This means that Unicode strings are
    > serialized in a human-readable form, which gives better
    > interoperability with other platforms.


    You mean, the whole XML document is serialised with that encoding, right?

    Stefan
    Stefan Behnel, Aug 24, 2009
    #1

  2. Daniel Molina Wegener wrote:
    > unicode objects are encoded into the XML document's encoding and,
    > as you say, the whole XML document has one encoding. There is no
    > mixing of byte-encoded strings with different encodings in the
    > output document.


    Ok, that's what I hoped anyway. It just wasn't clear from your description.


    > When the object is restored, by using pyxser.unserialize:
    >
    > pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")


    But this is XML, right? What do you need to pass the encoding for at this
    point?


    > Another issue is that if you have mixed encodings in the byte
    > string objects in your object tree, such as iso-8859-1 and utf-8, and
    > you try to serialize that object, pyxser will print serialization
    > errors to stdout when it tries to handle those mixed encodings, which
    > do not match the document encoding.


    There shouldn't be any serialisation errors (unless you try to recode byte
    strings on the way out, which is a no-no for arbitrary user input). All you
    have to do is properly escape the byte string so that it passes the XML
    encoding step.

    One trick to do that is to decode the byte string as ISO-8859-1 and
    serialise the result as a normal Unicode string. Then you can re-encode the
    unicode string on input back to ISO-8859-1.

    I choose ISO-8859-1 here because it has the well-defined side-effect of
    mapping byte values directly to Unicode characters with an identical code
    point value. So you do not risk any failures or data loss.
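
    The mapping described above can be checked directly. This is a minimal
    sketch, assuming Python 3, where `bytes` and `str` stand in for the byte
    strings and unicode strings discussed in this thread; the variable names
    are made up for illustration.

```python
# Every possible byte value, including ones that are invalid in UTF-8.
raw = bytes(range(256))

# ISO-8859-1 maps byte N to the Unicode code point with the same value,
# so decoding never fails, whatever the bytes originally meant.
text = raw.decode("iso-8859-1")
same_code_points = all(ord(ch) == b for ch, b in zip(text, raw))

# Encoding back yields exactly the original bytes.
restored = text.encode("iso-8859-1")

print(same_code_points, restored == raw)
```

    Note that this only shows the byte-to-character mapping is lossless. It
    says nothing about whether every resulting character is allowed in an
    XML document.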

    Stefan
    Stefan Behnel, Aug 24, 2009
    #2

  3. Daniel Molina Wegener wrote:
    > Stefan Behnel wrote:
    >> Daniel Molina Wegener wrote:
    >>> When the object is restored, by using pyxser.unserialize:
    >>>
    >>> pyobj = pyxser.unserialize(obj = xmldocstr, enc = "utf-8")

    >> But this is XML, right? What do you need to pass the encoding for at this
    >> point?

    >
    > The user may want an encoding other than utf-8; it can be any
    > encoding supported by libxml2.


    I really meant what I wrote: this is XML. The encoding is well defined in
    the XML declaration at the start of the document (and will default to UTF-8
    if not provided). Passing it externally will allow users to override that,
    which doesn't make any sense at all.
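
    As a rough illustration of why no external encoding is needed, here is
    a sketch that uses the standard library's `xml.etree.ElementTree` as a
    stand-in parser (it is not pyxser or libxml2, and the `<s>` element is
    made up). The parser is handed raw bytes and works out the encoding
    from the document itself.

```python
import xml.etree.ElementTree as ET

# The same element content stored in two different encodings. Each
# document names its own encoding in the XML declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><s>caf\u00e9</s>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><s>caf\u00e9</s>'
).encode("utf-8")

# No declaration at all: the parser falls back to UTF-8.
bare_doc = "<s>caf\u00e9</s>".encode("utf-8")

# No encoding argument is passed anywhere; the parser reads it from the bytes.
texts = [ET.fromstring(doc).text for doc in (latin1_doc, utf8_doc, bare_doc)]
print(texts)
```

    All three documents parse to the same unicode text, although their
    byte representations differ.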


    > If the encodings are mixed inside Python byte strings, I think
    > there is no way to know which encoding each of them uses.


    Correct.

    > This may cause XML serialization errors


    Yes, but only if you try to recode the strings (which, as I said, is a no-no).


    >> One trick to do that is to decode the byte string as ISO-8859-1 and
    >> serialise the result as a normal Unicode string. Then you can re-encode
    >> the unicode string on input back to ISO-8859-1.

    >
    >> I choose ISO-8859-1 here because it has the well-defined side-effect of
    >> mapping byte values directly to Unicode characters with an identical code
    >> point value. So you do not risk any failures or data loss.

    >
    > Sure, but if there are Python byte strings (not Unicode strings), some
    > encoded in big5 and others in iso-8859-1 inside the object tree, the
    > XML serialization would throw errors during the encoding conversion
    > when those bytes are placed inside the document...


    No, I really meant: decoding from ISO-8859-1 to Unicode, for all byte
    strings, regardless of their encoding (since you can't even know if they
    represent encoded text at all). So you get a unicode string that you can
    serialise to the target encoding, although it may result in numeric
    character references (of the form &#NNN;) being output. But you won't
    get any errors, at least.

    On the way in, you get a unicode string again, which you can encode to
    ISO-8859-1 to get the original byte string back.
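
    A sketch of that round trip, assuming Python 3 and the standard
    library's `xml.etree.ElementTree` rather than pyxser itself. The helper
    names and the `<bytes>` element are hypothetical. The serializer is
    asked for a US-ASCII document so that the character references become
    visible.

```python
import xml.etree.ElementTree as ET


def bytes_to_xml(data: bytes) -> bytes:
    """Serialise arbitrary bytes as text by decoding them as ISO-8859-1."""
    elem = ET.Element("bytes")  # hypothetical element name
    elem.text = data.decode("iso-8859-1")
    # Characters outside the target encoding are written as &#NNN; references.
    return ET.tostring(elem, encoding="us-ascii")


def xml_to_bytes(doc: bytes) -> bytes:
    """Recover the original bytes by re-encoding the parsed text."""
    return ET.fromstring(doc).text.encode("iso-8859-1")


# Byte strings in two different encodings, as in the object tree above.
big5_bytes = "\u4e2d\u6587".encode("big5")
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

docs = [bytes_to_xml(b) for b in (big5_bytes, latin1_bytes)]
round_trips = [xml_to_bytes(d) for d in docs]

for doc in docs:
    print(doc)
print(round_trips == [big5_bytes, latin1_bytes])
```

    The serializer never needs to know which encoding each byte string
    originally used. The sample bytes here contain no control characters,
    so this sketch does not cover that case.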

    Stefan
    Stefan Behnel, Aug 25, 2009
    #3
  4. Stefan Behnel wrote:
    > for all byte
    > strings, regardless of their encoding (since you can't even know if they
    > represent encoded text at all).


    Hmm, having written that, I guess it's actually best to encode byte strings
    as base64 instead. Otherwise, null bytes and other special byte values
    won't pass.
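
    A small sketch of the difference, again using `xml.etree.ElementTree`
    as a stand-in. The `<bytes>` element and its `encoding="base64"`
    attribute are made up for illustration and are not pyxser's actual
    output format.

```python
import base64
import xml.etree.ElementTree as ET

payload = b"\x00\x01binary\xff\x00"

# ISO-8859-1 trick: the null byte ends up in the document as-is, and an
# XML parser rejects it.
naive = ET.Element("bytes")
naive.text = payload.decode("iso-8859-1")
try:
    ET.fromstring(ET.tostring(naive))
    null_byte_rejected = False
except ET.ParseError:
    null_byte_rejected = True

# Base64: the element content is plain ASCII, whatever the payload is.
elem = ET.Element("bytes", encoding="base64")  # hypothetical marker attribute
elem.text = base64.b64encode(payload).decode("ascii")
doc = ET.tostring(elem)

parsed = ET.fromstring(doc)
restored = base64.b64decode(parsed.text)

print(null_byte_rejected, restored == payload)
```

    The price is that the serialised value is no longer human readable,
    which is why readable output is better reserved for real text.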

    I also think that if the user wants readable output for text strings, it's
    reasonable to require Unicode input instead of byte strings. Handling text
    in byte strings is just too error prone.

    Still, you may have to sanitize text input to make sure it doesn't contain
    special characters either. Take a look at the way lxml does it in the
    apihelpers.pxi source file, or read the XML spec on character content.
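
    For reference, a check along those lines might look like the sketch
    below. It is based on the `Char` production of the XML 1.0 spec and is
    not lxml's implementation; the function names are hypothetical.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text: str) -> bool:
    """Return True if every character may appear in XML 1.0 content."""
    return _INVALID_XML_CHARS.search(text) is None


def strip_invalid_xml_chars(text: str) -> str:
    """Drop characters that XML 1.0 does not allow (a lossy clean-up)."""
    return _INVALID_XML_CHARS.sub("", text)


print(is_xml_safe("tab\tand newline\n"))
print(is_xml_safe("null\x00byte"))
print(strip_invalid_xml_chars("a\x00b\x0bc"))
```

    Whether to reject such input with an error or to strip it silently is a
    design choice. Rejecting is safer when the data must round-trip exactly.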

    Stefan
    Stefan Behnel, Aug 25, 2009
    #4
