Python 3 - xml - crlf handling problem

durumdara

Hi!

As far as I can see, XML parsing is "wrong" in Python.

I must use predefined XML files: parse them, extend them, and
produce some result.

But as far as I can see, this works wrongly on Windows.

When the predefined XML files are "formatted" (prettied) with CRLFs, the
parser keeps these extra line-ending characters (it does not apply the rule
that CR = LF = CRLF), and they appear in the new result too.

from xml.dom.minidom import parse

xo = parse('test_original.xml')
de = xo.documentElement
de.setAttribute('b', "2")
b = xo.toxml('utf-8')  # bytes, because an encoding is given
with open('test_original2.xml', 'wb') as f:
    f.write(b)

And: if I used text elements, this can extend the information with
plus characters and make wrong xml...

Because of this problem I can use only XML that I generate myself, and
not prettied XML files!

Is this normal?

Thanks for reading,
dd
 
Stefan Behnel

durumdara, 30.11.2011 13:08:
> As far as I can see, XML parsing is "wrong" in Python.

You didn't say what you are using for parsing, but from your example, it
appears likely that you are using the xml.dom.minidom module.

> I must use predefined XML files: parse them, extend them, and
> produce some result.
>
> But as far as I can see, this works wrongly on Windows.
>
> When the predefined XML files are "formatted" (prettied) with CRLFs, the
> parser keeps these extra line-ending characters (it does not apply the rule
> that CR = LF = CRLF), and they appear in the new result too.

I assume that you are referring to XML's newline normalisation algorithm?
That should normally be handled by the parser, which, in the case of
minidom, is usually expat. And I seriously doubt that expat has a problem
with something as basic as newline normalisation.

Did you verify that the newlines are really not being converted by the
parser? From your example, I can only see that you are serialising the XML
tree back into a file, which may or may not alter the line endings by
itself. Instead, take a look at the text content in the tree right after
parsing to see how line endings look at that level.
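
A quick way to check this, sketched here with an inline sample rather than the original file (the sample bytes and variable names are assumptions made for illustration): parse a document that contains a CRLF, a lone CR, and a CR written as a character reference, then look at the text node with repr().

```python
from xml.dom.minidom import parseString

# Hypothetical sample: CRLF after "x", a lone CR after "y",
# and an explicit character reference for CR after "z".
data = b'<a>x\r\ny\rz&#13;</a>'

doc = parseString(data)
text = doc.documentElement.childNodes[0].nodeValue

# repr() makes any carriage return that survived parsing visible.
print(repr(text))
```

With an expat-based minidom, the literal CRLF and CR should both come back as a plain '\n', while the CR written as a character reference is kept as '\r'.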

> xo = parse('test_original.xml')
> de = xo.documentElement
> de.setAttribute('b', "2")
> b = xo.toxml('utf-8')
> f = open('test_original2.xml', 'wb')
> f.write(b)
> f.close()

This doesn't do any pretty printing, though, in case that's what you were
really after (which appears likely according to your comments).
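
For comparison, a minimal sketch of the two minidom serialisers (the inline document is made up for illustration): toxml() writes the tree as it is, while toprettyxml() adds newlines and indentation of its own.

```python
from xml.dom.minidom import parseString

# Hypothetical one-line document, so that every newline in the
# output is known to come from the serialiser.
doc = parseString('<doc a="1"><element a="1">AnyText</element></doc>')

compact = doc.toxml()                  # no whitespace is added
pretty = doc.toprettyxml(indent='  ')  # newlines and indentation are added

print(compact)
print(pretty)
```

The exact whitespace in the pretty output differs between Python versions, so it is safer to compare newline counts than to rely on a particular layout.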

> And: if I used text elements, this can extend the information with
> plus characters and make wrong xml...

Sorry, I don't understand this sentence.

> Because of this problem I can use only XML that I generate myself, and
> not prettied XML files!

Then what is the actual problem? Do you get an error somewhere? And if so,
could you paste the exact error message and describe what you do to produce
it? The mere fact that the line endings use the normal platform specific
representation doesn't seem like a problem to me.

Stefan
 
durumdara

Dear Stefan!


So: maybe I don't understand things well, but I thought that the parser
drops the "nondata" CRLFs and other such characters (does not preserve them).

Then it would not matter whether I read the XML from a file or create it
from code, because both would generate the SAME RESULT.
But Python doesn't do that.
If I make the XML from code, the output is without plus characters.
But Python preserves the parsed CRLF characters somewhere, and they are
also flushed into the result.

Example:

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
<element a="1">
AnyText
</element>
</doc>
'''

If I parse this and write it with toxml, the CRLFs remain in the
output, but if I create this document node by node, there are no CRLFs;
toxml writes single-line XML.

This also means that if I call toprettyxml to prettify the XML,
the file size grows.

If there is a multi-step processing queue, for example two Python programs
communicating through XML files, the size can grow every time.

Py1 - reads Py2's file, processes it, and writes a result file
Py2 - reads Py1's result file, processes it, and passes it back to Py1
This can grow the file with each call, because the "pretty" CRLFs are not
normalized out of the data.

original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
<element a="1">
AnyText
</element>
</doc>
'''

import sys
from xml.dom.minidom import parse

def main():
    f = open('test.0.xml', 'w')
    f.write(original.strip())
    f.close()

    for i in range(1, 10 + 1):
        xo = parse('test.%d.xml' % (i - 1))
        de = xo.documentElement
        de.setAttribute('c', str(i))
        t = de.getElementsByTagName('element')[0]
        tn = t.childNodes[0]
        print(dir(t))
        print(tn)
        print(tn.nodeValue)
        tn.nodeValue = str(i) + '\t' + '\n'
        #s = xo.toxml()
        s = xo.toprettyxml()
        f = open('test.%d.xml' % i, 'w')
        f.write(s)
        f.close()

    sys.exit()

if __name__ == '__main__':
    main()

And: because Python does not convert CRLF to &#13;, I cannot distinguish
between the CRLFs of a prettied source (loaded from a template file),
the CRLFs of my own pretty-printing (my own toprettyxml call), and CRLFs
that are really part of the data (for example a memo field's value).

In my case the processor application (to which I pass the XML
from Python) is sensitive to "plus CRLF"s in text nodes, so I must do
something about these "plus" items to avoid errors in the external program.

I get these templates and input files in prettied format (with
CRLFs), but I must "eat" the CRLFs to make single-line XML if
possible.

I hope you understand my problem with it.

Thanks,
dd
 
Stefan Behnel

durumdara, 02.12.2011 09:13:
> So: maybe I don't understand things well, but I thought that the parser
> drops the "nondata" CRLFs and other such characters (does not preserve them).

Well, it does that, at least on my side (which is not under Windows):

===================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
<element a="1">
AnyText
</element>
</doc>
'''

from xml.dom.minidom import parse

def main():
    f = open('test.0.xml', 'wb')
    f.write(original.strip().replace('\n', '\r\n').encode('utf8'))
    f.close()

    xo = parse('test.0.xml')
    de = xo.documentElement
    print(repr(de.childNodes[0].nodeValue))
    print(repr(de.childNodes[1].childNodes[0].nodeValue))

if __name__ == '__main__':
    main()
===================

This prints '\n ' and '\n AnyText\n ' on my side, so the
whitespace normalisation in the parser properly did its work.

> Then it would not matter whether I read the XML from a file or create it
> from code, because both would generate the SAME RESULT.
> But Python doesn't do that.
> If I make the XML from code, the output is without plus characters.

What do you mean by "plus characters"? It's not the "+" character that you
are referring to, right? Do you mean additional characters? Such as the
additional '\r'?

> But Python preserves the parsed CRLF characters somewhere, and they are
> also flushed into the result.
>
> Example:
>
> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
> <element a="1">
> AnyText
> </element>
> </doc>
> '''
>
> If I parse this and write it with toxml, the CRLFs remain in the
> output, but if I create this document node by node, there are no CRLFs;
> toxml writes single-line XML.
>
> This also means that if I call toprettyxml to prettify the XML,
> the file size grows.
>
> If there is a multi-step processing queue, for example two Python programs
> communicating through XML files, the size can grow every time.
>
> Py1 - reads Py2's file, processes it, and writes a result file
> Py2 - reads Py1's result file, processes it, and passes it back to Py1
> This can grow the file with each call, because the "pretty" CRLFs are not
> normalized out of the data.

> original='''
> <?xml version="1.0" encoding="utf-8"?>
> <doc a="1">
> <element a="1">
> AnyText
> </element>
> </doc>
> '''
>
> def main():
>     f = open('test.0.xml','w')
>     f.write(original.strip())
>     f.close()
>
>     for i in range(1, 10 + 1):
>         xo = parse('test.%d.xml' % (i - 1))
>         de = xo.documentElement
>         de.setAttribute('c', str(i))
>         t = de.getElementsByTagName('element')[0]
>         tn = t.childNodes[0]
>         print (dir(t))
>         print (tn)
>         print (tn.nodeValue)
>         tn.nodeValue = str(i) + '\t' + '\n'
>         #s = xo.toxml()
>         s = xo.toprettyxml()
>         f = open('test.%d.xml' % i,'w')
>         f.write(s)
>         f.close()
>
>     sys.exit()

> And: because Python does not convert CRLF to &#13;, I cannot distinguish
> between the CRLFs of a prettied source (loaded from a template file),
> the CRLFs of my own pretty-printing (my own toprettyxml call), and CRLFs
> that are really part of the data (for example a memo field's value).
>
> In my case the processor application (to which I pass the XML
> from Python) is sensitive to "plus CRLF"s in text nodes, so I must do
> something about these "plus" items to avoid errors in the external program.
>
> I get these templates and input files in prettied format (with
> CRLFs), but I must "eat" the CRLFs to make single-line XML if
> possible.
>
> I hope you understand my problem with it.

Still not quite, but never mind. May or may not be a problem in minidom or
your code. For example, you shouldn't open the output file in text mode but
in binary mode (i.e. "wb") because you are writing bytes into it.
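
A small sketch of that point, assuming a throwaway temporary file and an inline document (both hypothetical): toxml() called with an encoding returns bytes, which belong in a file opened with 'wb'. Text mode on Windows would additionally translate every '\n' into '\r\n' on write.

```python
import os
import tempfile
from xml.dom.minidom import parseString

# Hypothetical document; any parsed minidom tree would do.
doc = parseString('<doc a="1"><element a="1">AnyText</element></doc>')

# Passing an encoding makes toxml() return bytes instead of str.
payload = doc.toxml('utf-8')

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# Binary mode writes the bytes unchanged on every platform.
with open(path, 'wb') as f:
    f.write(payload)

with open(path, 'rb') as f:
    written = f.read()

print(written == payload)
```

When the serialiser is called without an encoding it returns str. In that case the file can be opened in text mode with newline='' so that Python does not translate line endings while writing.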

Here's what I tried with ElementTree, and it seems to do what your code
above wants. The indent() function is taken from Fredrik's element lib page:

http://effbot.org/zone/element-lib.htm

========================
original='''
<?xml version="1.0" encoding="utf-8"?>
<doc a="1">
<element a="1">
AnyText
</element>
</doc>
'''

def indent(elem, level=0):
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def main():
    f = open('test.0.xml','w', encoding='utf8')
    f.write(original.strip())
    f.close()

    # cElementTree was removed in Python 3.9; ElementTree picks up the
    # C accelerator automatically when it is available.
    from xml.etree.ElementTree import parse

    for i in range(10):
        tree = parse('test.%d.xml' % i)
        root = tree.getroot()
        root.set('c', str(i+1))
        t = root.find('.//element')
        t.text = '%d\t\n' % (i+1)
        indent(root)
        tree.write('test.%d.xml' % (i+1), encoding='utf8')

if __name__ == '__main__':
    main()
========================
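
If staying with minidom is preferred, one possible approach (a sketch, not something tested against the original files) is to drop whitespace-only text nodes before calling toprettyxml(), so that indentation from an earlier run is not written out again. The helper name strip_whitespace_nodes and the sample document are assumptions made for illustration. Note that this also discards whitespace-only text that might be meaningful to the consumer.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)

def roundtrip(text):
    """Parse, drop layout-only whitespace, and pretty-print again."""
    doc = parseString(text)
    strip_whitespace_nodes(doc)
    return doc.toprettyxml(indent='  ')

# Hypothetical prettied input with Windows line endings.
source = '<doc a="1">\r\n  <element a="1">AnyText</element>\r\n</doc>'

first = roundtrip(source)
second = roundtrip(first)

# If the approach works, a second pass does not grow the output.
print(len(first), len(second))
```

Text nodes that contain real data, such as "AnyText" here, are left alone. Only nodes made up entirely of whitespace are removed.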

Hope that helps,

Stefan
 
