suppressing bad characters in output PCDATA (converting JSON to XML)

Adam Funk

I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules. I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save the
document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA to clean out the
control characters before creating the text node:

text = doc.createTextNode(j)
root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings? I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?
 
Steven D'Aprano

I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules. I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()

Then I recurse through the contents of big_json to build an instance of
xml.dom.minidom.Document (the recursion includes some code to rewrite
dict keys as valid element names if necessary),

How are you doing that? What do you consider valid?

and I save the document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

It will force the output to be valid UTF-8 encoded to bytes, not
necessarily valid XML.

PCDATA invalid Char value 7
PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It doesn't
appear to be in the standard library.


I guess I need to process each piece of PCDATA to clean out the control
characters before creating the text node:

text = doc.createTextNode(j)
root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then all
bets are off: even if the decoding step works, and returns a string, it
may contain the wrong characters. This might explain why you are getting
unexpected control characters in the output: they've come from a badly
decoded input.

Another possibility is that your data actually does contain control
characters where there shouldn't be any.
 
Stefan Behnel

Adam Funk, 25.11.2011 14:50:
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules. I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')

It doesn't make sense to use codecs.open() with a "b" mode.

big_json = json.load(input_source)

You shouldn't decode the input before passing it into json.load(), just
open the file in binary mode. Serialised JSON is defined as being UTF-8
encoded (or BOM-prefixed), not decoded Unicode.

input_source.close()

In case of a failure, the file will not be closed safely. All in all, use
this instead:

with open(input_file, 'rb') as f:
    big_json = json.load(f)
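
A minimal sketch of that pattern, assuming Python 3.6 or later (where json.load() accepts a binary file and works out the encoding itself). The temporary file and the sample data are invented purely for illustration:

```python
import json
import os
import tempfile

# Hypothetical sample input: UTF-8 encoded JSON with a non-ASCII character.
raw = '[{"name": "caf\u00e9"}]'.encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
    tmp.write(raw)
    input_file = tmp.name

try:
    # Binary mode: json.load() decodes the bytes itself (Python 3.6+),
    # and the with-block closes the file even if parsing fails.
    with open(input_file, "rb") as f:
        big_json = json.load(f)
finally:
    os.remove(input_file)
```

On Python 2.7 the same with-block should also work, since json.load() there reads a byte string and decodes it as UTF-8 by default.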

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary)

If the name "big_json" is supposed to hint at a large set of data, you may
want to use something other than minidom. Take a look at the
xml.etree.cElementTree module instead, which is substantially more memory
efficient.

and I save the document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()

Same mistakes as above. Especially the double encoding is both unnecessary
and likely to fail. This is also most likely the source of your problems.
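
One way to encode exactly once on output, sketched under the assumption of Python 3, where Document.toxml(encoding=...) returns bytes that can be written to a file opened in binary mode. The element name and the temporary path are made up for the example:

```python
import os
import tempfile
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("item")  # hypothetical element name
root.appendChild(doc.createTextNode("caf\u00e9"))
doc.appendChild(root)

# toxml(encoding=...) does the single encoding step and returns bytes.
xml_bytes = doc.toxml(encoding="UTF-8")

fd, output_fullpath = tempfile.mkstemp(suffix=".xml")
os.close(fd)
try:
    # Binary mode, so nothing encodes the data a second time.
    with open(output_fullpath, "wb") as xml_file:
        xml_file.write(xml_bytes)
    # Read the file back to confirm that it round-trips.
    with open(output_fullpath, "rb") as check:
        reparsed = minidom.parse(check)
finally:
    os.remove(output_fullpath)

round_trip_text = reparsed.documentElement.firstChild.data
```

Note that this only gets the encoding right; it does nothing about characters that XML itself forbids.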

I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

This strongly hints at a broken encoding, which can easily be triggered by
your erroneous encode-and-encode cycles above.

Also, the kind of problem you present here makes it pretty clear that you
are using Python 2.x. In Python 3, you'd get the appropriate exceptions
when trying to write binary data to a Unicode file.

Stefan
 
Adam Funk

How are you doing that? What do you consider valid?

Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.
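
Roughly, that rewriting rule can be sketched as follows (Python 3 syntax; make_element_name is a hypothetical helper name, not the code actually used):

```python
import re


def make_element_name(key):
    """Turn a dict key into a usable element name (hypothetical helper).

    Runs of whitespace become '_', and 'a_' is prepended when the result
    does not start with a letter.
    """
    name = re.sub(r"\s+", "_", key)
    if not name or not name[0].isalpha():
        name = "a_" + name
    return name
```

This is only as strict as the description above: keys containing characters such as '&' or '<' would still pass through unchanged, so it does not guarantee a valid XML name in general.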

It will force the output to be valid UTF-8 encoded to bytes, not
necessarily valid XML.

Yes!


What's xmlstarlet, and at what point does it give this error? It doesn't
appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML,
nothing to do with python.

http://xmlstar.sourceforge.net/

Are you mixing unicode and byte strings?

I don't think I am.

Are you sure that the input source is actually UTF-8? If not, then all
bets are off: even if the decoding step works, and returns a string, it
may contain the wrong characters. This might explain why you are getting
unexpected control characters in the output: they've come from a badly
decoded input.

I'm pretty sure that the input is really UTF-8, but has a few control
characters (fairly rare).

Another possibility is that your data actually does contain control
characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.
 
Adam Funk

Adam Funk, 25.11.2011 14:50:

It doesn't make sense to use codecs.open() with a "b" mode.

OK, thanks.

You shouldn't decode the input before passing it into json.load(), just
open the file in binary mode. Serialised JSON is defined as being UTF-8
encoded (or BOM-prefixed), not decoded Unicode.

So just do
input_source = open(input_file, 'rb')
big_json = json.load(input_source)
?

In case of a failure, the file will not be closed safely. All in all, use
this instead:

with open(input_file, 'rb') as f:
    big_json = json.load(f)

OK, thanks.

If the name "big_json" is supposed to hint at a large set of data, you may
want to use something other than minidom. Take a look at the
xml.etree.cElementTree module instead, which is substantially more memory
efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file. The output files range from 600 to 6000 bytes.

Same mistakes as above. Especially the double encoding is both unnecessary
and likely to fail. This is also most likely the source of your problems.

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).

This strongly hints at a broken encoding, which can easily be triggered by
your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters are
there too. I want to suppress them (delete or replace with spaces).

Also, the kind of problem you present here makes it pretty clear that you
are using Python 2.x. In Python 3, you'd get the appropriate exceptions
when trying to write binary data to a Unicode file.

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".
 
Stefan Behnel

Adam Funk, 29.11.2011 13:57:
Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file. The output files range from 600 to 6000 bytes.

cElementTree is also substantially easier to use, but if your XML writing
code already works, why change it?

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).

You should read up on Unicode a bit.

No, I've checked the JSON input and those exact control characters are
there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.
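
A small sketch of that behaviour, assuming Python 3's xml.dom.minidom (the element name and sample text are invented): the BEL character is serialized without complaint, and the problem only shows up when a parser reads the result back.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.Document()
root = doc.createElement("item")  # hypothetical element name
root.appendChild(doc.createTextNode("bell\x07here"))  # U+0007 is not legal in XML 1.0
doc.appendChild(root)

# Serializing raises nothing; the control character is written as-is.
serialized = doc.toxml()

# Parsing the result fails, because expat rejects the character.
try:
    minidom.parseString(serialized)
    reparse_ok = True
except ExpatError:
    reparse_ok = False
```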

I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML from
it. If replacing is enough, take a look at string.maketrans() in the string
module and str.translate(), a method on strings. Or maybe just use a
regular expression that matches any control character and replace it
with a space. Or whatever suits your data best.
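
A minimal sketch of the translate approach, assuming Python 3, where str.translate() takes a dict keyed by code point (string.maketrans() in Python 2 builds tables for byte strings only). The names CONTROL_TO_SPACE and replace_controls are made up for the example:

```python
# Map every C0 control character except tab, LF and CR to a space.
# Tab (0x09), LF (0x0A) and CR (0x0D) are the only ones XML 1.0 allows.
CONTROL_TO_SPACE = {
    code: " " for code in range(0x20) if code not in (0x09, 0x0A, 0x0D)
}


def replace_controls(text):
    """Return text with forbidden C0 control characters replaced by spaces."""
    return text.translate(CONTROL_TO_SPACE)
```

It would be applied to each string just before creating the text node, for example doc.createTextNode(replace_controls(j)). Mapping the code points to None instead of " " deletes the characters rather than replacing them.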

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.

Stefan
 
Adam Funk

Adam Funk, 29.11.2011 13:57:

cElementTree is also substantially easier to use, but if your XML writing
code already works, why change it?

That module looks useful --- thanks for the tip. (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)

You should read up on Unicode a bit.

It wouldn't do me any harm. :)

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.

Yes, it is! I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html
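
For reference, here is one Unicode-aware sketch that follows the Char production in the XML 1.0 spec linked above. It is not necessarily what the linked post does, it assumes Python 3, and the names are invented for the example:

```python
import re

# Negation of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Delete (or replace) every character that XML 1.0 does not allow."""
    return INVALID_XML_CHARS.sub(replacement, text)
```

Because the pattern works on code points rather than bytes, multibyte characters are left alone; passing replacement=" " replaces the bad characters with spaces instead of deleting them.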
 
