encoding ascii data for xml

Discussion in 'Python' started by harrelson, Oct 3, 2008.

  1. harrelson

    harrelson Guest

    I have a large amount of data in a postgresql database with the
    encoding of SQL_ASCII. Most recent data is UTF-8 but data from
    several years ago could be of some unknown other data type. Being
    honest with myself, I am not even sure that the most recent data is
    always UTF-8-- data is entered on web forms and I wouldn't be
    surprised if data of other encodings is slipping in.
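
    A quick way to test that suspicion is to try decoding each stored value as UTF-8 and collect the ones that fail. This is a minimal sketch in Python 3 syntax; `is_valid_utf8` and the sample `rows` are hypothetical stand-ins for values read from the database.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample rows standing in for values read from the database.
rows = [
    b"plain ascii",
    "caf\u00e9".encode("utf-8"),    # valid UTF-8
    "caf\u00e9".encode("latin-1"),  # a lone 0xE9 byte, not valid UTF-8
]

# Collect the rows that are definitely not UTF-8.
suspect = [row for row in rows if not is_valid_utf8(row)]
print(suspect)
```

    A value that passes this check is not guaranteed to have been entered as UTF-8, but a value that fails it certainly was not.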

    Up to this point I have just ignored the problem-- on the web side of
    things everything works well enough. But now I am required to stuff
    this data into xml datasets and I am, of course, having problems. My
    preference would be to force the data into UTF-8 even if it is
    ultimately an incorrect encoding translation but this isn't working.
    The below code represents my most recent problem:

    import xml.dom.minidom
    print chr(3).encode('utf-8')
    dom = xml.dom.minidom.parseString( "<test>%s</test>" %
    chr(3).encode('utf-8') )

    chr(3) is the ascii character for "end of line". I would think that
    trying to encode this to utf-8 would fail but it doesn't-- I don't get
    a failure till we get into xml land and the parser complains. My
    question is why doesn't encode() blow up? It seems to me that
    encode() shouldn't output anything that parseString() can't handle.

    Sorry in advance if this post is ugly-- it is through the google
    groups interface and google mangles the entry sometimes.
     
    harrelson, Oct 3, 2008
    #1

  2. On Friday 03 October 2008, harrelson wrote:
    > import xml.dom.minidom
    > print chr(3).encode('utf-8')
    > dom = xml.dom.minidom.parseString( "<test>%s</test>" %
    > chr(3).encode('utf-8') )
    >
    > chr(3) is the ascii character for "end of line". [...] My
    > question is why doesn't encode() blow up?


    You just answered your question. 0x03 may not be a printing character, but it
    is a valid character in the ascii character set and therefore is not a
    problem. For xml, however, it is an illegal character so that's why the
    parser is throwing an error.
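
    To see where the failure actually happens, the two steps can be separated. This is a sketch in Python 3 syntax (the original snippet is Python 2); the variable names are only illustrative.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Step 1: encoding succeeds, because U+0003 is a valid character.
payload = chr(3).encode("utf-8")
document = b"<test>" + payload + b"</test>"

# Step 2: the XML parser is the part that refuses the control character.
try:
    xml.dom.minidom.parseString(document)
    parsed = True
except ExpatError as exc:
    parsed = False
    print("parser rejected the document:", exc)
```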
     
    Dillon Collins, Oct 4, 2008
    #2

  3. On Fri, 03 Oct 2008 14:41:13 -0700, harrelson wrote:

    > import xml.dom.minidom
    > print chr(3).encode('utf-8')
    > dom = xml.dom.minidom.parseString( "<test>%s</test>" %
    > chr(3).encode('utf-8') )
    >
    > chr(3) is the ascii character for "end of line". I would think that
    > trying to encode this to utf-8 would fail but it doesn't-- I don't get a
    > failure till we get into xml land and the parser complains. My question
    > is why doesn't encode() blow up? It seems to me that encode() shouldn't
    > output anything that parseString() can't handle.


    It's not a problem with encode IMHO but with XML because XML can't handle
    all ASCII characters. XML parsers choke on every code below 32 that is
    not whitespace. BTW `chr(3)` isn't "end of line" but "end of text" (ETX).
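
    If dropping such characters is acceptable for the data at hand (that is an assumption, not something the original post settles), they can be filtered out before building the document. The sketch below, in Python 3 syntax, follows the `Char` production of the XML 1.0 specification; `strip_illegal_xml_chars` is a hypothetical helper name.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text):
    """Remove characters that an XML 1.0 document may not contain."""
    return _ILLEGAL_XML_CHARS.sub("", text)


print(repr(strip_illegal_xml_chars("a\x03b")))
```

    Tab, line feed and carriage return are left alone, since those are the control codes XML does allow.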

    If you want to be sure that an arbitrary string can be embedded into XML
    you'll have to encode it as base64 or something similar.
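
    A rough sketch of the base64 route, in Python 3 syntax, using only the standard `base64` and `xml.dom.minidom` modules. The sample bytes are made up for illustration.

```python
import base64
import xml.dom.minidom

# Arbitrary bytes, including control characters XML would reject as-is.
raw = b"\x03harlie \x03haplin"

# base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=', all legal in XML text.
encoded = base64.b64encode(raw).decode("ascii")
dom = xml.dom.minidom.parseString("<test>%s</test>" % encoded)

# Reading the value back: take the text node and reverse the encoding.
text = dom.documentElement.firstChild.data
restored = base64.b64decode(text)
```

    The cost is that the element content is no longer human-readable, and every consumer of the XML has to know to decode it.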

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 4, 2008
    #3
  4. John Machin

    John Machin Guest

    On Oct 4, 7:41 am, harrelson <> wrote:
    > I have a large amount of data in a postgresql database with the
    > encoding of SQL_ASCII.  Most recent data is UTF-8 but data from
    > several years ago could be of some unknown other data type.  Being
    > honest with myself, I am not even sure that the most recent data is
    > always UTF-8-- data is entered on web forms and I wouldn't be
    > surprised if data of other encodings is slipping in.
    >
    > Up to this point I have just ignored the problem-- on the web side of
    > things everything works well enough.  But now I am required to stuff
    > this data into xml datasets and I am, of course, having problems.  My
    > preference would be to force the data into UTF-8 even if it is
    > ultimately an incorrect encoding translation but this isn't working.
    > The below code represents my most recent problem:
    >
    > import xml.dom.minidom
    > print chr(3).encode('utf-8')
    > dom = xml.dom.minidom.parseString( "<test>%s</test>" %
    > chr(3).encode('utf-8') )
    >
    > chr(3) is the ascii character for "end of line".  I would think that
    > trying to encode this to utf-8 would fail but it doesn't-- I don't get
    > a failure till we get into xml land and the parser complains.  My
    > question is why doesn't encode() blow up?  It seems to me that
    > encode() shouldn't output anything that parseString() can't handle.


    The encode method is doing its job, which is to encode ANY and EVERY
    unicode character as utf-8, so that it can be transported reliably
    over an 8-bit-wide channel. encode is *not* supposed to guess what you
    are going to do with the output.

    Perhaps instead of "forcing the data into utf-8", you should be
    thinking about what is actually in your data. What is the context that
    chr(3) appears in? Perhaps when you get around to print
    repr(some_data), you might see things like "\x03harlie \x03haplin" --
    it's a common enough keyboarding error to hit the Ctrl key instead of
    the Shift key and unfortunately a common-enough design error for there
    to be no checking at all.
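
    For example, something along these lines (Python 3 syntax; `find_control_chars` is a made-up helper name) will show where the stray control characters sit in a value:

```python
def find_control_chars(text):
    """Return (index, repr) pairs for control characters other than tab, LF and CR."""
    return [
        (i, repr(ch))
        for i, ch in enumerate(text)
        if ord(ch) < 32 and ch not in "\t\n\r"
    ]


some_data = "\x03harlie \x03haplin"
print(repr(some_data))
print(find_control_chars(some_data))
```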

    BTW, there's no forcing involved -- chr(3) is *already* utf-8.
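
    This holds for the whole ASCII range, not just chr(3): every code point below 128 encodes to the same single byte in ASCII and in UTF-8. A small check in Python 3 syntax:

```python
# Compare the ASCII and UTF-8 encodings of every code point below 128.
all_match = all(
    chr(code).encode("utf-8") == chr(code).encode("ascii") == bytes([code])
    for code in range(128)
)
print(all_match)
```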

    HTH,
    John
     
    John Machin, Oct 4, 2008
    #4
  5. On Sat, 04 Oct 2008 12:18:13 -0700
    Dennis Lee Bieber <> wrote:
    > On 4 Oct 2008 06:59:20 GMT, Marc 'BlackJack' Rintsch <>
    > declaimed the following in comp.lang.python:
    >
    > > not whitespace. BTW `chr(3)` isn't "end of line" but "end of text" (ETX).
    > >

    > Hmm, think I'll need to look up an ASCII chart -- I seem to recall
    > ETX as "end of transmission"


    Nope, Marc is correct. EOT, chr(4), is "end of transmission."

    --
    D'Arcy J.M. Cain <> | Democracy is three wolves
    http://www.druid.net/darcy/ | and a sheep voting on
    +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
     
    D'Arcy J.M. Cain, Oct 4, 2008
    #5