Preventing control characters from entering an XML file

Frank Niessink

Hi list,

First of all, I wish you all a happy 2006. I have a small question that
googling didn't turn up an answer for, so hopefully you'll be kind
enough to point me in the right direction.

I'm developing a desktop application, called Task Coach, that saves its
domain objects (tasks, mostly :) in an XML file. Users have reported
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a
task description. Looking at the 'corrupted' file showed that control
characters ended up in the XML file (Control-K for example). Task Coach
uses xml.dom to create an XML document and save it, like this:

class XMLWriter:
    ...

    def write(self, taskList):
        domImplementation = xml.dom.getDOMImplementation()
        self.document = domImplementation.createDocument(None, 'tasks',
                                                         None)
        ...
        for task in taskList.rootTasks():
            self.document.documentElement.appendChild(self.taskNode(task))
        self.document.writexml(self.__fd)  # __fd is a file open for writing

    ...

Apparently, the writexml method of xml.dom (which comes from
xml.dom.minidom if PyXML is not installed, I think) does not consider
writing control characters to an XML file an error, but the parser does:

Traceback (most recent call last):
  ....
  File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77, column 147
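
The mismatch described above can be reproduced with a small round trip. This is only a sketch in present-day Python 3 (the thread itself is about Python 2.4), and the element name 'task' and the sample text are made up for illustration; it assumes minidom still serializes text nodes without checking for control characters, which is the behaviour reported in this thread.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Build a tiny document the same way the XMLWriter above does.
dom_implementation = xml.dom.minidom.getDOMImplementation()
document = dom_implementation.createDocument(None, 'tasks', None)

# Hypothetical task node whose text contains Control-K (U+000B).
task = document.createElement('task')
task.appendChild(document.createTextNode('pasted text \x0b with Control-K'))
document.documentElement.appendChild(task)

# Serializing succeeds and the control character is written out as-is.
serialized = document.toxml()

# Parsing the result back fails, because U+000B is not a legal XML 1.0 character.
try:
    xml.dom.minidom.parseString(serialized)
    parsed_ok = True
except ExpatError:
    parsed_ok = False

print(repr(serialized))
print('parsed back successfully:', parsed_ok)
```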

Rightfully so, because ^K (U+000B) is not a legal character in XML 1.0,
according to http://www.w3.org/TR/REC-xml/:

"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
processors MUST accept any character in the range specified for Char.

Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"
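
The Char production quoted above translates fairly directly into a negated character class. The following is a sketch in Python 3, not something from the standard library's XML modules; the names ILLEGAL_XML_CHARS and strip_illegal_xml_chars are made up for this example.

```python
import re

# Everything that is NOT matched by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def strip_illegal_xml_chars(text):
    """Return text with every character outside the Char range removed."""
    return ILLEGAL_XML_CHARS.sub('', text)


def has_illegal_xml_chars(text):
    """Return True if text contains a character outside the Char range."""
    return ILLEGAL_XML_CHARS.search(text) is not None
```

Unlike a table covering only the first 0x20 code points, this also drops characters such as U+FFFE and U+FFFF, which the production excludes as well.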

So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object
returned by getDOMImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?

Thanks, Frank
 
Scott David Daniels

Frank said:
> ...
> Character Range
> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]"
>
> - What is the easiest/most pythonic (preferably built-in) way of
> checking a unicode string for control characters and weeding those
> characters out?

drop_controls = [None] * 0x20
for c in '\t\r\n':
    drop_controls[c] = unichr(c)
...
some_unicode_string = some_unicode_string.translate(drop_controls)

--Scott David Daniels
(e-mail address removed)
 
Frank Niessink

Scott said:
> Frank said:
>> - What is the easiest/most pythonic (preferably built-in) way of
>> checking a unicode string for control characters and weeding those
>> characters out?
>
> drop_controls = [None] * 0x20
> for c in '\t\r\n':
>     drop_controls[c] = unichr(c)
> ...
> some_unicode_string = some_unicode_string.translate(drop_controls)

Hi Scott,

Your code gave me a "TypeError: an integer is required". Anyway, it was
sufficient to push me in the right direction. This is my version:

UNICODE_CONTROL_CHARACTERS_TO_WEED = {}
for ordinal in range(0x20):
    if chr(ordinal) not in '\t\r\n':
        UNICODE_CONTROL_CHARACTERS_TO_WEED[ordinal] = None

Which lets you do:
u'Test\t'
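
For readers on Python 3, where every str is already Unicode and the u'' prefix is optional, the same dictionary-based table might look like the sketch below. The sample input string is an assumption chosen for illustration; it is not the input from the original session.

```python
# Map every C0 control character except tab, CR and LF to None,
# which tells str.translate() to delete it.
UNICODE_CONTROL_CHARACTERS_TO_WEED = {
    ordinal: None
    for ordinal in range(0x20)
    if chr(ordinal) not in '\t\r\n'
}

# Hypothetical input: a tab (kept) followed by Control-K (removed).
sample = 'Test\t\x0b'
cleaned = sample.translate(UNICODE_CONTROL_CHARACTERS_TO_WEED)
print(repr(cleaned))
```

A dict works here because translate() leaves any character whose ordinal is missing from the table untouched.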


Thanks, Frank
 
Scott David Daniels

Frank said:
> Scott said:
>> Frank said:
>>> - What is the easiest/most pythonic (preferably built-in) way of
>>> checking a unicode string for control characters and weeding those
>>> characters out?
>> drop_controls = [None] * 0x20
>> for c in '\t\r\n':
>>     drop_controls[c] = unichr(c)
>> ...
>> some_unicode_string = some_unicode_string.translate(drop_controls)
>
> Your code gave me a "TypeError: an integer is required"....

Sorry about that.
    drop_controls[c] = unichr(c)
should have been:
    drop_controls[ord(c)] = unichr(c)
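
With that correction applied, the list-based table could be written as follows in Python 3. This is a sketch, assuming unichr() is replaced by plain characters (Python 3 has no unichr) and using a made-up sample string. It relies on str.translate() accepting any indexable object and treating a LookupError, such as the IndexError a 32-item list raises for ordinals of 0x20 and above, as "leave the character alone".

```python
# Start with "delete" (None) for all 32 C0 control characters.
drop_controls = [None] * 0x20

# Re-enable the three control characters XML 1.0 allows by mapping
# each one to itself; note the ord() that the first version lacked.
for c in '\t\r\n':
    drop_controls[ord(c)] = c

# Hypothetical input containing Control-K and NUL.
some_unicode_string = 'Test\t\x0bmore\x00\n'
some_unicode_string = some_unicode_string.translate(drop_controls)
print(repr(some_unicode_string))
```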
 
