request for advice - possible ElementTree nexus

mirandacascade · Jul 4, 2006

Situation is this:
1) I have inherited some python code that accepts a string object, the
contents of which is an XML document, and produces a data structure
that represents some of the content of the XML document
2) The inherited code is somewhat 'brittle' in that some well-formed
XML documents are not correctly processed by the code; the brittleness
is caused by how the parser portion of the code handles whitespace.
3) I would like to change the code to make it less brittle. Whatever
changes I make must continue to produce the same data structure that is
currently being produced.
4) Rather than attempt to fix the parser portion of the code, I would
prefer to use ElementTree. ElementTree handles parsing XML documents
flawlessly, so the brittle portion of the code goes away. In addition,
the ElementTree model is very sweet to work with, so it is a relatively
easy task using the information in ElementTree to produce the same data
structure that is currently being produced.
5) The existing data structure--the structure that must be
maintained--that gets produced does NOT include any {xmlns=<whatever>}
information that may appear in the source XML document.
6) Based on a review of several posts in this group, I understand why
ElementTree handles xmlns=<whatever> information the way it does. This
is an oversimplification, but one of the things it does is to
incorporate the {whatever} within the tag property of the element and
of any descendant elements.
7) One of the pieces of information in the data structure that gets
produced by this code is the tag...the tag in the data structure should
not have any xmlns=<whatever> information.
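
For readers who have not seen the behavior described in point 6, here is a minimal illustration. The sample document and its namespace URI are made up for this sketch; only ElementTree.XML() and the tag property come from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the namespace URI is invented for illustration.
doc = '<root xmlns="http://example.com/ns"><child>text</child></root>'

root = ET.XML(doc)

# The default namespace is folded into the tag of the element that declares it
# and of every descendant that inherits it.
print(root.tag)     # {http://example.com/ns}root
print(root[0].tag)  # {http://example.com/ns}child
```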

So, given that the goal is to produce the same data structure and given
that I really want to use ElementTree, I need to find a way to remove
the xmlns=<whatever> information. It seems like there are 2 general
methods for accomplishing this:
1) before feeding the string object to the ElementTree.XML() method,
remove the xmlns=<whatever> information from the string.
2) keep the xmlns=<whatever> information in the string that feeds
ElementTree.XML(), but when building the data structure, ensure that
the {whatever} information in the tag property of the element is
NOT included in the data structure.

My requests for advice are:
a) What are the pros/cons of each of the 2 general methods described
above?
b) If I want to remove the xmlns information before feeding it to the
ElementTree.XML() method, and I don't want to be aware of what is to
the right of the equal sign, what is the best way to remove all the
substrings that are of the form xmlns=<whatever>? Would this require
learning the nuances of regular expressions?
c) If I want to leave the xmlns information in the string that gets fed
to ElementTree.XML, and I want to remove the {whatever} from the tag
before building the data structure, what is the best way to find
{whatever} from the tag property...is this another case where one
should be using regular expressions?
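
As a rough sketch of what question (b) might look like in practice, the following removes default namespace declarations with a regular expression before parsing. The pattern and the helper name strip_default_xmlns are assumptions made for this example, not an established recipe. It only handles the plain xmlns="..." form; prefixed declarations such as xmlns:foo="..." are left alone, because removing them would leave any foo: prefixed tags unbound and the parser would reject the document. It also assumes the pattern never appears inside text content, comments, or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Matches a default namespace declaration: xmlns="..." or xmlns='...'.
# Assumption: the input never contains this pattern outside of a start tag.
XMLNS_DEFAULT = re.compile(r'''\s+xmlns\s*=\s*(?:"[^"]*"|'[^']*')''')

def strip_default_xmlns(xml_text):
    """Return xml_text with default xmlns declarations removed (hypothetical helper)."""
    return XMLNS_DEFAULT.sub("", xml_text)

# Hypothetical sample document; the namespace URI is invented for illustration.
doc = '<root xmlns="http://example.com/ns"><child>text</child></root>'

root = ET.XML(strip_default_xmlns(doc))
print(root.tag)  # root
```

The caveats above are the main drawback of this method: it edits the document as text, so it has to guess at structure that the parser would otherwise work out for it.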

Gerard Flanagan · Jul 5, 2006

maybe transform the document with XSLT before processing?

google: xslt remove namespaces

eg. http://www.tei-c.org/wiki/index.php/Remove-Namespaces.xsl

eg. http://www.thescripts.com/forum/thread86057.html

hth

Gerard

Fredrik Lundh · Jul 5, 2006

c) If I want to leave the xmlns information in the string that gets fed
to ElementTree.XML, and I want to remove the {whatever} from the tag
before building the data structure, what is the best way to find
{whatever} from the tag property...is this another case where one
should be using regular expressions?

if the "whatever" in {whatever} is known in advance, you can use the
approach described here:

http://effbot.org/zone/element-tidylib.htm#converting-xhtml-to-html

if the "whatever" is not known, you can do e.g.

    if elem.tag.startswith("{"):
        elem.tag = elem.tag.split("}")[1]

</F>
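
The two-line test above works on a single element. A minimal sketch of applying it to a whole tree might look like the following; the function name strip_namespaces and the sample document are assumptions for illustration, and this is the modern xml.etree.ElementTree spelling rather than the 2006 standalone package.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove the {uri} part from every element tag, in place (hypothetical helper)."""
    for elem in root.iter():
        # Defensive check: comment and processing-instruction nodes, when
        # present, have non-string tags and must be skipped.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root

# Hypothetical sample document; the namespace URI is invented for illustration.
doc = '<root xmlns="http://example.com/ns"><child a="1">text</child></root>'

root = strip_namespaces(ET.XML(doc))
print([elem.tag for elem in root.iter()])  # ['root', 'child']
```

This sketch rewrites tags only. Namespace-qualified attribute names are stored in the same {uri}name form as keys of elem.attrib and would need similar treatment if the target data structure includes attributes.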
