encoding ascii data for xml

harrelson · Oct 3, 2008

I have a large amount of data in a PostgreSQL database with the
encoding of SQL_ASCII. Most recent data is UTF-8 but data from
several years ago could be in some other, unknown encoding. Being
honest with myself, I am not even sure that the most recent data is
always UTF-8-- data is entered on web forms and I wouldn't be
surprised if data of other encodings is slipping in.

Up to this point I have just ignored the problem-- on the web side of
things everything works well enough. But now I am required to stuff
this data into xml datasets and I am, of course, having problems. My
preference would be to force the data into UTF-8 even if it is
ultimately an incorrect encoding translation but this isn't working.
The below code represents my most recent problem:

import xml.dom.minidom
print chr(3).encode('utf-8')
dom = xml.dom.minidom.parseString( "<test>%s</test>" %
chr(3).encode('utf-8') )

chr(3) is the ascii character for "end of line". I would think that
trying to encode this to utf-8 would fail but it doesn't-- I don't get
a failure till we get into xml land and the parser complains. My
question is why doesn't encode() blow up? It seems to me that
encode() shouldn't output anything that parseString() can't handle.

Sorry in advance if this post is ugly-- it is through the google
groups interface and google mangles the entry sometimes.

Dillon Collins · Oct 4, 2008

import xml.dom.minidom
print chr(3).encode('utf-8')
dom = xml.dom.minidom.parseString( "<test>%s</test>" %
chr(3).encode('utf-8') )

chr(3) is the ascii character for "end of line". [...] My
question is why doesn't encode() blow up?

You just answered your question. 0x03 may not be a printing character, but it
is a valid character in the ascii character set and therefore is not a
problem. For xml, however, it is an illegal character so that's why the
parser is throwing an error.
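
A minimal sketch of that difference, written for Python 3 (the code quoted above is Python 2). The helper name `parses_as_xml` is made up for illustration; it only reports whether minidom accepts the text.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parses_as_xml(text):
    """Return True if text, wrapped in a <test> element, parses as XML."""
    try:
        xml.dom.minidom.parseString("<test>%s</test>" % text)
        return True
    except ExpatError:
        return False


etx = chr(3)

# Encoding succeeds: U+0003 is a perfectly valid character to encode.
encoded = etx.encode("utf-8")
print(repr(encoded))

# The XML parser is the component that rejects it.
print(parses_as_xml("hello"))
print(parses_as_xml(etx))
```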

Marc 'BlackJack' Rintsch · Oct 4, 2008

import xml.dom.minidom
print chr(3).encode('utf-8')
dom = xml.dom.minidom.parseString( "<test>%s</test>" %
chr(3).encode('utf-8') )

chr(3) is the ascii character for "end of line". I would think that
trying to encode this to utf-8 would fail but it doesn't-- I don't get a
failure till we get into xml land and the parser complains. My question
is why doesn't encode() blow up? It seems to me that encode() shouldn't
output anything that parseString() can't handle.

It's not a problem with encode IMHO but with XML because XML can't handle
all ASCII characters. XML parsers choke on every code below 32 that is
not whitespace. BTW `chr(3)` isn't "end of line" but "end of text" (ETX).
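
If dropping the offending characters is acceptable, one possible approach (not something proposed in this thread) is to filter out everything the XML 1.0 specification does not allow as a character before building the document. This is a Python 3 sketch; the function name `strip_illegal_xml_chars` and its `replacement` parameter are hypothetical.

```python
import re

# Anything outside the XML 1.0 "Char" production: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are allowed.
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=""):
    """Replace characters that XML 1.0 cannot carry (default: drop them)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


print(repr(strip_illegal_xml_chars("\x03harlie \x03haplin")))
print(repr(strip_illegal_xml_chars("x\x03y", replacement="?")))
```

This loses information, and markup characters such as `<` and `&` still need escaping separately, for example with `xml.sax.saxutils.escape`.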

If you want to be sure that an arbitrary string can be embedded into XML
you'll have to encode it as base64 or something similar.
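
A rough Python 3 sketch of the base64 route, assuming the data is available as raw bytes. The function names and the `encoding='base64'` attribute are invented for illustration; they are not part of any standard, so a consumer would have to agree on them.

```python
import base64
import xml.dom.minidom


def wrap_as_base64(raw_bytes):
    """Embed arbitrary bytes in a <test> element as base64 text."""
    payload = base64.b64encode(raw_bytes).decode("ascii")
    return "<test encoding='base64'>%s</test>" % payload


def unwrap_base64(xml_text):
    """Parse the element back and recover the original bytes."""
    dom = xml.dom.minidom.parseString(xml_text)
    text = "".join(node.data for node in dom.documentElement.childNodes)
    return base64.b64decode(text)


original = b"\x03harlie \xff"
document = wrap_as_base64(original)
print(document)
print(unwrap_base64(document) == original)
```

The base64 alphabet contains only letters, digits, `+`, `/` and `=`, none of which need escaping in XML text, so the parser never sees the control characters.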

Ciao,
Marc 'BlackJack' Rintsch

John Machin · Oct 4, 2008

I have a large amount of data in a PostgreSQL database with the
encoding of SQL_ASCII. Most recent data is UTF-8 but data from
several years ago could be in some other, unknown encoding. Being
honest with myself, I am not even sure that the most recent data is
always UTF-8-- data is entered on web forms and I wouldn't be
surprised if data of other encodings is slipping in.

Up to this point I have just ignored the problem-- on the web side of
things everything works well enough. But now I am required to stuff
this data into xml datasets and I am, of course, having problems. My
preference would be to force the data into UTF-8 even if it is
ultimately an incorrect encoding translation but this isn't working.
The below code represents my most recent problem:

import xml.dom.minidom
print chr(3).encode('utf-8')
dom = xml.dom.minidom.parseString( "<test>%s</test>" %
chr(3).encode('utf-8') )

chr(3) is the ascii character for "end of line". I would think that
trying to encode this to utf-8 would fail but it doesn't-- I don't get
a failure till we get into xml land and the parser complains. My
question is why doesn't encode() blow up? It seems to me that
encode() shouldn't output anything that parseString() can't handle.

The encode method is doing its job, which is to encode ANY and EVERY
unicode character as utf-8, so that it can be transported reliably
over an 8-bit-wide channel. encode is *not* supposed to guess what you
are going to do with the output.

Perhaps instead of "forcing the data into utf-8", you should be
thinking about what is actually in your data. What is the context that
chr(3) appears in? Perhaps when you get around to print
repr(some_data), you might see things like "\x03harlie \x03haplin" --
it's a common enough keyboarding error to hit the Ctrl key instead of
the Shift key and unfortunately a common-enough design error for there
to be no checking at all.
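
One way to look at what is actually in the data is to list the control characters and where they occur. The following is a Python 3 sketch with a made-up helper name, `find_control_chars`. The "intended letter" column is only a guess based on the Ctrl-instead-of-Shift theory: Ctrl+C produces 0x03, so adding 0x40 gives back "C".

```python
def find_control_chars(text):
    """List control characters below 32 (other than tab, LF, CR) in text.

    Each entry is (index, repr of the character, guessed intended letter).
    """
    hits = []
    for index, ch in enumerate(text):
        if ord(ch) < 32 and ch not in "\t\n\r":
            hits.append((index, repr(ch), chr(ord(ch) + 0x40)))
    return hits


some_data = "\x03harlie \x03haplin"
print(repr(some_data))
for hit in find_control_chars(some_data):
    print(hit)
```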

BTW, there's no forcing involved -- chr(3) is *already* utf-8.
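
To illustrate: every ASCII character encodes to the same single byte in ASCII and in UTF-8, which is why encoding chr(3) changes nothing. This is a small Python 3 check, and `encodes_identically` is just an illustrative name.

```python
def encodes_identically(code_point):
    """True if the character is the same single byte in ASCII and UTF-8."""
    ch = chr(code_point)
    return ch.encode("ascii") == ch.encode("utf-8") == bytes([code_point])


# Holds for the whole ASCII range, control characters included.
print(all(encodes_identically(cp) for cp in range(128)))

# Only characters outside ASCII become multi-byte sequences in UTF-8.
print("\u00e9".encode("utf-8"))
```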

HTH,
John

D'Arcy J.M. Cain · Oct 4, 2008

Hmm, think I'll need to look up an ASCII chart -- I seem to recall
ETX as "end of transmission"

Nope, Marc is correct. EOT, chr(4), is "end of transmission."
