Problem with processing XML

  • Thread starter John Carlyle-Clarke

John Carlyle-Clarke

Hi.

I'm new to Python and trying to use it to solve a specific problem. I
have an XML file in which I need to locate a specific text node and
replace the contents with some other text. The text in question is
actually about 70k of base64 encoded data.

I wrote some code that works on my Linux box using xml.dom.minidom, but
it will not run on the windows box that I really need it on. Python
2.5.1 on both.

On the windows machine, it's a clean install of the Python .msi from
python.org. The linux box is Ubuntu 7.10, which has some Python XML
packages installed which can't easily be removed (namely python-libxml2
and python-xml).

I have boiled the code down to its simplest form which shows the problem:-

import xml.dom.minidom
import sys

input_file = sys.argv[1];
output_file = sys.argv[2];

doc = xml.dom.minidom.parse(input_file)
file = open(output_file, "w")
doc.writexml(file)

The error is:-

$ python test2.py input2.xml output.xml
Traceback (most recent call last):
File "test2.py", line 9, in <module>
doc.writexml(file)
File "c:\Python25\lib\xml\dom\minidom.py", line 1744, in writexml
node.writexml(writer, indent, addindent, newl)
File "c:\Python25\lib\xml\dom\minidom.py", line 814, in writexml
node.writexml(writer,indent+addindent,addindent,newl)
File "c:\Python25\lib\xml\dom\minidom.py", line 809, in writexml
_write_data(writer, attrs[a_name].value)
File "c:\Python25\lib\xml\dom\minidom.py", line 299, in _write_data
data = data.replace("&", "&amp;").replace("<", "&lt;")
AttributeError: 'NoneType' object has no attribute 'replace'

As I said, this code runs fine on the Ubuntu box. If I could work out
why the code runs on this box, that would help because then I can set
up the windows box the same way.

The input file contains an <xsd:schema> block which is what actually
causes the problem. If you remove that node and subnodes, it works
fine. For a while at least, you can view the input file at
http://rafb.net/p/5R1JlW12.html

Someone suggested that I should try xml.etree.ElementTree, however
writing the same type of simple code to import and then write the file
mangles the xsd:schema stuff because ElementTree does not understand
namespaces.
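
To illustrate the prefix issue: ElementTree does keep track of namespace URIs, but when serialising it chooses its own prefixes rather than reusing the ones in the input, which breaks attribute values such as type="xsd:string" that rely on the original prefix. A minimal sketch of one possible workaround, assuming a newer Python than 2.5.1 (register_namespace only exists from 2.7 / 3.2 onward, and encoding="unicode" needs Python 3); the sample document is made up for illustration and is not the real input file:

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Ask ElementTree to use the "xsd" prefix whenever it serialises this URI.
ET.register_namespace("xsd", XSD_NS)

# Hypothetical stand-in for the real input file.
source = (
    '<root>'
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:element name="value" type="xsd:string"/>'
    '</xsd:schema>'
    '</root>'
)

root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")
print(output)
```

ElementTree still does not inspect prefixes used inside attribute values, so this only helps when the registered prefix matches the one the document already uses.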

By the way, is pyxml a live project or not? Should it still be used?
It's odd that if you go to http://www.python.org/ and click the link
"Using python for..." XML, it leads you to
http://pyxml.sourceforge.net/topics/

If you then follow the download links to
http://sourceforge.net/project/showfiles.php?group_id=6473 you see that
the latest file is 2004, and there are no versions for newer pythons.
It also says "PyXML is no longer maintained". Shouldn't the link be
removed from python.org?

Thanks in advance!
 
Paul McGuire

Hi.

I'm new to Python and trying to use it to solve a specific problem.  I
have an XML file in which I need to locate a specific text node and
replace the contents with some other text.  The text in question is
actually about 70k of base64 encoded data.

Here is a pyparsing hack for your problem. I normally advise against
using literal strings like "<value>" to match XML or HTML tags in a
parser, since this doesn't cover variations in case, embedded
whitespace, or unforeseen attributes, but your example was too simple
to haul in the extra machinery of an expression created by pyparsing's
makeXMLTags.

Also, I don't generally recommend pyparsing for working on XML, since
there are so many better and faster XML-specific modules available.
But if this does the trick for you for your specific base64-removal
task, great.

-- Paul

# requires pyparsing 1.4.8 or later
from pyparsing import makeXMLTags, withAttribute, keepOriginalText, SkipTo

xml = """
... long XML string goes here ...
"""

# define a filter that will key off of the <data> tag with the
# attribute 'name="PctShow.Image"', and then use suppress to filter
# the body of the following <value> tag
dataTag = makeXMLTags("data")[0]
dataTag.setParseAction(withAttribute(name="PctShow.Image"),
                       keepOriginalText)

filter = dataTag + "<value>" + SkipTo("</value>").suppress() + "</value>"

xmlWithoutBase64Block = filter.transformString(xml)
print(xmlWithoutBase64Block)
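
For comparison, the replacement itself (as opposed to removal) could be sketched with xml.dom.minidom along these lines. This is only an illustration under assumptions: the <data name="PctShow.Image"> / <value> layout is taken from the filter above, the sample string and base64 payloads are invented, and replace_value is a hypothetical helper name.

```python
import xml.dom.minidom

# Invented sample; the real file holds about 70k of base64 in <value>.
SAMPLE = (
    '<root>'
    '<data name="PctShow.Image"><value>T0xEIEJBU0U2NA==</value></data>'
    '<data name="Other"><value>keep me</value></data>'
    '</root>'
)


def replace_value(doc, data_name, new_text):
    """Replace the text of <value> inside the <data> element named data_name."""
    for data in doc.getElementsByTagName("data"):
        if data.getAttribute("name") != data_name:
            continue
        for value in data.getElementsByTagName("value"):
            # Drop the existing children, then add a single fresh text node.
            while value.firstChild is not None:
                value.removeChild(value.firstChild)
            value.appendChild(doc.createTextNode(new_text))


doc = xml.dom.minidom.parseString(SAMPLE)
replace_value(doc, "PctShow.Image", "TkVXIEJBU0U2NA==")
result = doc.documentElement.toxml()
print(result)
```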
 
Alnilam

By the way, is pyxml a live project or not?  Should it still be used?
It's odd that if you go to http://www.python.org/ and click the link
"Using python for..." XML, it leads you to http://pyxml.sourceforge.net/topics/

If you then follow the download links to
http://sourceforge.net/project/showfiles.php?group_id=6473 you see that
the latest file is 2004, and there are no versions for newer pythons.
It also says "PyXML is no longer maintained".  Shouldn't the link be
removed from python.org?

I was wondering that myself. Any answer yet?
 
Paul Boddie

I wrote some code that works on my Linux box using xml.dom.minidom, but
it will not run on the windows box that I really need it on. Python
2.5.1 on both.

On the windows machine, it's a clean install of the Python .msi from
python.org. The linux box is Ubuntu 7.10, which has some Python XML
packages installed which can't easily be removed (namely python-libxml2
and python-xml).

I don't think you're straying into libxml2 or PyXML territory here...

I have boiled the code down to its simplest form which shows the problem:-

import xml.dom.minidom
import sys

input_file = sys.argv[1];
output_file = sys.argv[2];

doc = xml.dom.minidom.parse(input_file)
file = open(output_file, "w")

On Windows, shouldn't this be the following...?

file = open(output_file, "wb")
doc.writexml(file)

The error is:-

$ python test2.py input2.xml output.xml
Traceback (most recent call last):
File "test2.py", line 9, in <module>
doc.writexml(file)
File "c:\Python25\lib\xml\dom\minidom.py", line 1744, in writexml
node.writexml(writer, indent, addindent, newl)
File "c:\Python25\lib\xml\dom\minidom.py", line 814, in writexml
node.writexml(writer,indent+addindent,addindent,newl)
File "c:\Python25\lib\xml\dom\minidom.py", line 809, in writexml
_write_data(writer, attrs[a_name].value)
File "c:\Python25\lib\xml\dom\minidom.py", line 299, in _write_data
data = data.replace("&", "&amp;").replace("<", "&lt;")
AttributeError: 'NoneType' object has no attribute 'replace'

As I said, this code runs fine on the Ubuntu box. If I could work out
why the code runs on this box, that would help because then I can set
up the windows box the same way.

If I encountered the same issue, I'd have to inspect the goings-on
inside minidom, possibly using judicious trace statements in the
minidom.py file. Either way, the above looks like an attribute node
produces a value of None rather than any kind of character string.
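
One way to narrow this down without editing minidom.py is to walk the parsed tree and list every attribute whose value is None. A minimal sketch, assuming the document parses at all; find_none_attributes is a hypothetical helper name and the sample string merely stands in for the real input file:

```python
import xml.dom.minidom


def find_none_attributes(node, found=None):
    """Collect (element name, attribute name) pairs whose value is None."""
    if found is None:
        found = []
    if node.nodeType == node.ELEMENT_NODE and node.attributes is not None:
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            if attr.value is None:
                found.append((node.tagName, attr.name))
    # Recurse into children; text and comment nodes simply have none.
    for child in node.childNodes:
        find_none_attributes(child, found)
    return found


# Stand-in for xml.dom.minidom.parse(input_file).
doc = xml.dom.minidom.parseString(
    '<root xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:schema id="s"/>'
    '</root>'
)
suspects = find_none_attributes(doc)
print(suspects)
```

An empty list means no attribute carries None on the interpreter you run it with; any pair that does show up points at the element to inspect before calling writexml.
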
The input file contains an <xsd:schema> block which is what actually
causes the problem. If you remove that node and subnodes, it works
fine. For a while at least, you can view the input file at
http://rafb.net/p/5R1JlW12.html

The horror! ;-)

Someone suggested that I should try xml.etree.ElementTree, however
writing the same type of simple code to import and then write the file
mangles the xsd:schema stuff because ElementTree does not understand
namespaces.

I'll leave this to others: I don't use ElementTree.

By the way, is pyxml a live project or not? Should it still be used?
It's odd that if you go to http://www.python.org/ and click the link
"Using python for..." XML, it leads you to http://pyxml.sourceforge.net/topics/

If you then follow the download links to
http://sourceforge.net/project/showfiles.php?group_id=6473 you see that
the latest file is 2004, and there are no versions for newer pythons.
It also says "PyXML is no longer maintained". Shouldn't the link be
removed from python.org?

The XML situation in Python's standard library is controversial and
can be probably inaccurately summarised by the following chronology:

1. XML is born, various efforts start up (see the qp_xml and xmllib
modules).
2. Various people organise themselves, contributing software to the
PyXML project (4Suite, xmlproc).
3. The XML backlash begins: we should all apparently be using stuff
like YAML (but don't worry if you haven't heard of it).
4. ElementTree is released, people tell you that you shouldn't be
using SAX or DOM any more, "pull" parsers are all the rage
(although proponents overlook the presence of xml.dom.pulldom in
the Python standard library).
5. ElementTree enters the standard library as xml.etree; PyXML falls
into apparent disuse (see remarks about SAX and DOM above).

I think I looked seriously at wrapping libxml2 (with libxml2dom [1])
when I experienced issues with both PyXML and 4Suite when used
together with mod_python, since each project used its own Expat
libraries and the resulting mis-linked software produced very bizarre
results. Moreover, only cDomlette from 4Suite seemed remotely fast,
and yet did not seem to be an adequate replacement for the usual PyXML
functionality.

People will, of course, tell you that you shouldn't use a DOM for
anything and that the "consensus" is to use ElementTree or lxml (see
above), but I can't help feeling that this has a damaging effect on
the XML situation for Python: some newcomers would actually benefit
from the traditional APIs, may already be familiar with them from
other contexts, and may consider Python lacking if the support for
them is in apparent decay. It requires a degree of motivation to
actually attempt to maintain software providing such APIs (which was
my solution to the problem), but if someone isn't totally bound to
Python then they might easily start looking at other languages and
tools in order to get the job done.

Meanwhile, here are some resources:

http://wiki.python.org/moin/PythonXml

Paul

[1] http://www.python.org/pypi/libxml2dom
 
John Carlyle-Clarke

Paul said:
Here is a pyparsing hack for your problem.

Thanks Paul! This looks like an interesting approach, and once I get my
head around the syntax, I'll give it a proper whirl.
 
Stefan Behnel

Hi,

Paul said:
People will, of course, tell you that you shouldn't use a DOM for
anything and that the "consensus" is to use ElementTree or lxml (see
above), but I can't help feeling that this has a damaging effect on
the XML situation for Python: some newcomers would actually benefit
from the traditional APIs, may already be familiar with them from
other contexts, and may consider Python lacking if the support for
them is in apparent decay. It requires a degree of motivation to
actually attempt to maintain software providing such APIs (which was
my solution to the problem), but if someone isn't totally bound to
Python then they might easily start looking at other languages and
tools in order to get the job done.

I had a discussion with Java people lately and they were all for Ruby, Groovy
and similar languages, "because they have curly braces and are easy to learn
when you know Java".

My take on that is: Python is easy to learn, full-stop.

It's the same for DOM: when you know DOM from (usually) the Java world, having
a DOM-API in Python keeps you from having to learn too many new things. But
when you get your nose kicked into ElementTree, having to learn new things
will actually help you in understanding that what you knew before did not
support your way of thinking.

http://www.python.org/about/success/esr/

So, there is a learning curve, but it's much shorter than what you already
invested to learn 'the wrong thing'. It's what people on this list tend to
call their "unlearning curve".

Stefan
 
Paul Boddie

I had a discussion with Java people lately and they were all for Ruby, Groovy
and similar languages, "because they have curly braces and are easy to learn
when you know Java".

My take on that is: Python is easy to learn, full-stop.

Well, that may be so, but it's somewhat beside the point in question.

It's the same for DOM: when you know DOM from (usually) the Java world, having
a DOM-API in Python keeps you from having to learn too many new things. But
when you get your nose kicked into ElementTree, having to learn new things
will actually help you in understanding that what you knew before did not
support your way of thinking.

I'm not disputing the benefits of the ElementTree approach, but one
has to recall that the DOM is probably the most widely used XML API
out there (being the one most client-side developers are using) and
together with the other standards (XPath and so on) isn't as bad as
most people like to make out. Furthermore, I don't think it does
Python much good to have people "acting all Ruby on Rails" and telling
people to throw out everything they ever did in order to suck up the
benefits, regardless of the magnitude of those benefits; it comes
across as saying that "your experience counts for nothing compared to
our superior skills". Not exactly the best way to keep people around.

As I noted in my chronology, the kind of attitude projected by various
people in the Python community at various times (and probably still
perpetuated in the Ruby community) is that stuff originating from the
W3C is bad like, for example, XSLT because "it's like Lisp but all in
XML (yuck!)", and yet for many tasks the most elegant solution is
actually XSLT because it's specifically designed for those very tasks.
Fortunately or unfortunately, XSLT didn't make it into the standard
library and thus isn't provided in a way which may or may not seem
broken but, like the DOM stuff, if the support for standardised/
recognised technologies is perceived as deficient, and given the point
above about glossing over what people themselves bring with them to
solve a particular problem, then people are more likely to gloss over
Python than hear anyone's sermon about how great Python's other XML
technologies are.

http://www.python.org/about/success/esr/

So, there is a learning curve, but it's much shorter than what you already
invested to learn 'the wrong thing'. It's what people on this list tend to
call their "unlearning curve".

Well, maybe if someone helped the inquirer with his namespace problem
he'd be getting along quite nicely with his "unlearning curve".

Paul
 
Stefan Behnel

Hi,

Paul said:
I'm not disputing the benefits of the ElementTree approach, but one
has to recall that the DOM is probably the most widely used XML API
out there (being the one most client-side developers are using) and
together with the other standards (XPath and so on) isn't as bad as
most people like to make out.

I didn't deny that it works in general. However, it does not fit into the
standard ways things work in Python.

Furthermore, I don't think it does
Python much good to have people "acting all Ruby on Rails" and telling
people to throw out everything they ever did in order to suck up the
benefits, regardless of the magnitude of those benefits; it comes
across as saying that "your experience counts for nothing compared to
our superior skills". Not exactly the best way to keep people around.

I would have formulated it a bit differently from my experience, which usually
is: people complain on the list that they can't manage to get X to work for
them. Others tell them: "don't use X, use Y", implicitly suggesting that you
may have to learn it, but it will help you get your problem done in a way that
you can /understand/ (i.e. that will fix your code for you, by enabling you to
fix it yourself).

From my experience, this works in most (although admittedly not all) cases.
But in any case, this reduction of complexity is an important step towards
making people ask fewer questions.

As I noted in my chronology, the kind of attitude projected by various
people in the Python community at various times (and probably still
perpetuated in the Ruby community) is that stuff originating from the
W3C is bad

The W3C is good in defining standards for portability and interoperability.
APIs rarely fall into that bag. They should be language specific as they are
made for use in a programming language, and therefore must match the way this
language works.

However, programming languages themselves are sometimes made for
interoperability, and this is definitely true for XSLT and XQuery. I am a big
fan of domain specific languages, because they (usually) are great in what
they are designed for, and nothing more.

like the DOM stuff, if the support for standardised/
recognised technologies is perceived as deficient, and given the point
above about glossing over what people themselves bring with them to
solve a particular problem, then people are more likely to gloss over
Python than hear anyone's sermon about how great Python's other XML
technologies are.

It's not about "other XML technologies", it's only about making the standard
XML technologies accessible and usable. It's about designing interfaces in a
way that matches the tool people are using anyway, which in this case is Python.

Stefan
 
Paul Boddie

I didn't deny that it works in general. However, it does not fit into the
standard ways things work in Python.

You're only one step away from using the magic word.

I agree that writing getAttribute all the time instead of, say, using
magic attributes (provided the characters employed are lexically
compatible with Python - another thing that people tend to overlook)
can be distressing for some people, but as usual the language comes to
the rescue: you can assign the method to a shorter name, amongst other
things. If you want to stray from the standards then with some APIs
(as you know), you can override various classes and provide your own
convenience attributes and methods, but the interoperability remains
beneath.
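
As a small illustration of the "shorter name" point, here is a sketch using xml.dom.minidom from the standard library; the element and attribute names are invented for the example:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<data name="PctShow.Image" type="base64"/>'
)
elem = doc.documentElement

# Bind the bound method to a short local name; the DOM call is unchanged.
attr = elem.getAttribute

name = attr("name")
kind = attr("type")
print(name, kind)
```

The standard getAttribute call is still what runs underneath, so the interoperability mentioned above is untouched.
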
I would have formulated it a bit differently from my experience, which usually
is: people complain on the list that they can't manage to get X to work for
them. Others tell them: "don't use X, use Y", implicitly suggesting that you
may have to learn it, but it will help you get your problem done in a way that
you can /understand/ (i.e. that will fix your code for you, by enabling you to
fix it yourself).

If people feel that they've solved 90% of the problem using tools
they've become familiar with, I think it's somewhat infuriating to be
told to forget about the last 10% and to use something else. We don't
know how nasty the code is in the case of this particular inquirer,
but I've seen nothing recently where the DOM specifically was
obstructing anyone's comprehension.

In one case, had PyXML or minidom been up-to-date, the solution would
have been within easy reach (the textContent property), but with
everyone being waved off to greener pastures, there's probably little
gratitude to be had in doing the legwork to fix and enhance those
implementations.
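
For readers stuck without that property, a rough stand-in can be written on top of minidom. This is only a sketch: text_content is a hypothetical helper, not part of any library, and it approximates the DOM Level 3 behaviour for elements by concatenating descendant text and CDATA while skipping comments and processing instructions.

```python
import xml.dom.minidom


def text_content(node):
    """Approximate DOM textContent for an element node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_content(child))
        # Comments and processing instructions are skipped on purpose.
    return "".join(parts)


doc = xml.dom.minidom.parseString(
    "<data><value>abc<b>def</b><!-- note -->ghi</value></data>"
)
value = doc.getElementsByTagName("value")[0]
result = text_content(value)
print(result)
```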

[...]

It's not about "other XML technologies", it's only about making the standard
XML technologies accessible and usable. It's about designing interfaces in a
way that matches the tool people are using anyway, which in this case is Python.

Well, the standard XML technologies include those covered by PyXML,
like it or not, and whilst some APIs may be nicer than the variants of
the standard APIs provided by PyXML, there's a lot of potential in
those standards that hasn't been exploited in Python. Consider why the
last Web browser of note written in Python was Grail, circa 1996, for
example.

Paul
 
