Processing XML File

jakecjacobson · Jan 29, 2010

I need to take an XML web resource and split it up into smaller XML
files. I am able to retrieve the web resource, but I can't find any
good XML examples. I am just learning Python, so forgive me if this
question has been answered many times in the past.

My resource is like:

<document>
...
...
</document>
<document>
...
...
</document>

So in this example, I would need to output 2 files, each containing
what is between one pair of opening and closing document tags.

Adam Tauno Williams · Jan 29, 2010

Do you want to parse the whole document into a tree, or process it as
a stream with SAX?

I have a SAX example at
<http://coils.hg.sourceforge.net/hgw...27b08f7f/src/coils/logic/workflow/xml/bpml.py>
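For readers who cannot follow the link, here is a minimal sketch of what SAX-style processing looks like with the standard library's xml.sax module. It is not the code from the linked file. The DocumentCollector class and the sample input are made up for illustration, and the sample assumes a single <documents> root so that it is well-formed.

```python
import xml.sax


class DocumentCollector(xml.sax.ContentHandler):
    """Collects the text found inside each <document> element."""

    def __init__(self):
        super().__init__()
        self.texts = []        # one string per <document> seen so far
        self._current = None   # text pieces of the open <document>, or None

    def startElement(self, name, attrs):
        if name == "document":
            self._current = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._current is not None:
            self._current.append(content)

    def endElement(self, name):
        if name == "document":
            self.texts.append("".join(self._current).strip())
            self._current = None


# Hypothetical sample input with a made-up <documents> root element.
sample = b"<documents><document>first</document><document>second</document></documents>"

handler = DocumentCollector()
xml.sax.parseString(sample, handler)
print(handler.texts)
```

The handler never holds the whole tree in memory, which is the main reason to pick SAX over a tree parser for very large inputs.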

jakecjacobson · Jan 29, 2010

Thanks, but I am in way over my head with XML and Python. I am working
with DDMS and need to output the individual resource nodes to their
own files. I hope that this helps. I need a good example and an
explanation of how to use it.

Here is what a resource node looks like:
<ddms:Resource
    xsi:schemaLocation="https://metadata.dod.mil/mdr/ns/DDMS/1.4/
                        https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ICISM="urn:us:gov:ic:ism:v2">
  <ddms:identifier ddms:qualifier="URL"
      ddms:value="https://metadata.dod.mil/mdr/ns/TBD/1.0/SampleTaxonomy.owl"/>
  <ddms:identifier
      ddms:qualifier="https://metadata.dod.mil/mdr/ns/MDR/1.0/MDR.owl#GovernanceNamespace"
      ddms:value="TBD"/>
  <ddms:identifier ddms:qualifier="Version" ddms:value="1.0"/>
  <ddms:title ICISM:ownerProducer="USA"
      ICISM:classification="U">Sample Taxonomy</ddms:title>
  <ddms:description ICISM:ownerProducer="USA"
      ICISM:classification="U">
    This is a sample taxonomy created for the Help page.
  </ddms:description>
  <ddms:dates ddms:posted="2007-11-24"/>
  <ddms:creator ICISM:ownerProducer="USA"
      ICISM:classification="U">
    <ddms:Person>
      <ddms:name>Sample</ddms:name>
      <ddms:surname>Developer</ddms:surname>
      <ddms:affiliation>FGM, Inc.</ddms:affiliation>
      <ddms:phone>703-885-1000</ddms:phone>
      <ddms:email>[email protected]</ddms:email>
    </ddms:Person>
  </ddms:creator>
  <ddms:security ICISM:ownerProducer="USA"
      ICISM:classification="U" ICISM:nonICmarkings="DIST_STMT_A" />
</ddms:Resource>

You can see the DDMS site at https://metadata.dod.mil/.

Adam Tauno Williams · Jan 29, 2010

If that is all you need, XPath will split it apart for you, like
<http://coils.hg.sourceforge.net/hgw...src/coils/logic/workflow/actions/xml/xpath.py>

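from lxml import etree

doc = etree.parse(self._rfile)    # self._rfile: file-like object holding the XML
results = doc.xpath(xpath)        # xpath: the XPath expression, as a string
for result in results:
    # str(result) only shows "<Element ...>"; tostring() gives the markup
    print(etree.tostring(result))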

For example, if your XML has an outermost element of ResultSet with inner row elements, just do:

for record in doc.xpath(u'/ResultSet/row'):
    print(etree.tostring(record))

Implied import for these examples is "from lxml import etree"
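If lxml is not installed, the standard library's xml.etree.ElementTree supports a limited subset of XPath through findall(). The following is a rough sketch of the same ResultSet/row idea under that assumption. The sample string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the ResultSet/row layout described above.
sample = "<ResultSet><row id='1'>a</row><row id='2'>b</row></ResultSet>"

root = ET.fromstring(sample)      # root is the <ResultSet> element itself
rows = root.findall("row")        # the path is relative to the root element

# Serialize each matching element back to markup.
serialized = [ET.tostring(row, encoding="unicode") for row in rows]
for text in serialized:
    print(text)
```

Note that findall() paths start from the element you call it on, so the leading /ResultSet of the full XPath expression is dropped here.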

Stefan Behnel · Jan 29, 2010

jakecjacobson, 29.01.2010 18:25:

I need to take a XML web resource and split it up into smaller XML
files. I am able to retrieve the web resource but I can't find any
good XML examples. I am just learning Python so forgive me if this
question has been answered many times in the past.

My resource is like:

<document>
...
...
</document>
<document>
...
...
</document>

Is this what you get as a document or is this just /contained/ in the document?

Note that XML does not allow more than one root element, so the above is
not XML. Each of the two <document>...</document> parts forms an XML
document by itself, though.
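A quick way to see this, sketched with the standard library parser: feeding it two root elements raises a ParseError, while either part alone parses fine. The sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

two_roots = "<document>a</document><document>b</document>"

try:
    ET.fromstring(two_roots)
    parsed = True
except ET.ParseError as exc:
    # The parser rejects anything after the first root element closes.
    parsed = False
    print("not well-formed:", exc)

# Each part on its own is a complete XML document.
single = ET.fromstring("<document>a</document>")
print(single.tag, single.text)
```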

So in this example, I would need to output 2 files with the contents
of each file what is between the open and close document tag.

Are the two files formatted as you show above? In that case, you can simply
iterate over the lines and cut the document when you see "<document>". Or,
if you are sure that "<document>" only appears as top-most elements and not
inside of the documents, you can search for "<document>" in the content (a
string, I guess) and split it there.

As was pointed out before, once you have these two documents, use the
xml.etree package to work with them.

Something like this might work:

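import urllib.request             # Python 3; on Python 2 this was urllib2
import xml.etree.ElementTree as ET

# url: address of the web resource; UTF-8 encoding is assumed here
data = urllib.request.urlopen(url).read().decode('utf-8')

for part in data.split('<document>'):
    if not part.strip():
        continue                  # skip the empty piece before the first <document>
    document = ET.fromstring('<document>' + part)
    print(document.tag)
    # ... do other stuff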
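An alternative to splitting the string, sketched here as one possible approach: wrap the whole text in a made-up root element so that it becomes well-formed, parse it once, and write each <document> child to its own file. The split_documents function, the <wrapper> element, and the file names are all hypothetical, and the sketch assumes the data carries no XML declaration or DOCTYPE at the top.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_documents(data, out_dir):
    """Write each top-level <document> in data to its own file; return the paths."""
    # A made-up <wrapper> root turns the multi-root text into well-formed XML.
    root = ET.fromstring("<wrapper>" + data + "</wrapper>")
    paths = []
    for index, document in enumerate(root.findall("document"), start=1):
        document.tail = None  # drop whitespace that followed the closing tag
        path = os.path.join(out_dir, "document_%d.xml" % index)
        ET.ElementTree(document).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


# Hypothetical sample shaped like the resource in the question.
sample = "<document><a>1</a></document>\n<document><a>2</a></document>\n"
out_dir = tempfile.mkdtemp()
paths = split_documents(sample, out_dir)
print(paths)
```

Unlike the string split, this does not break if the text "<document>" happens to appear inside a comment or CDATA section, though it does load the whole resource into memory.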

Stefan

Sells, Fred · Jan 29, 2010

Google is your friend. ElementTree is one of the better documented
modules, IMHO, but there are many modules to do this.

Stefan Behnel · Jan 29, 2010

Sells, Fred, 29.01.2010 20:31:

Google is your friend. Elementtree is one of the better documented
IMHO, but there are many modules to do this.

Unless the OP provides some more information, "do this" is rather
underdefined. And sending someone off to Google who is just learning the
basics of Python and XML and trying to solve a very specific problem with
them is not exactly the spirit I'm used to in this newsgroup.

Stefan

jakecjacobson · Feb 1, 2010

Just want to thank everyone for their posts. I got it working after I
discovered a namespace issue. This is the code:

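import libxml2

xmlDoc = libxml2.parseDoc(guts)   # guts: the XML text retrieved earlier
# Ignore namespace and just get the Resource
resourceNodes = xmlDoc.xpathEval('//*[local-name()="Resource"]')
for rNode in resourceNodes:
    print(rNode)
xmlDoc.freeDoc()                  # libxml2 documents must be freed explicitly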
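For comparison, here is a sketch of the same lookup using only the standard library, matching on the full namespace instead of ignoring it, and writing each Resource node to its own file. The <results> root element, the trimmed sample, and the file names are made up for illustration. Only the DDMS namespace URI comes from the sample posted earlier in the thread.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

DDMS_NS = "https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
ET.register_namespace("ddms", DDMS_NS)   # keep the ddms: prefix on output

# Hypothetical, heavily trimmed sample: a made-up <results> root with two resources.
sample = (
    '<results xmlns:ddms="%s">'
    '<ddms:Resource><ddms:title>One</ddms:title></ddms:Resource>'
    '<ddms:Resource><ddms:title>Two</ddms:title></ddms:Resource>'
    '</results>' % DDMS_NS
)

root = ET.fromstring(sample)
# ElementTree spells a qualified name as {namespace-uri}localname.
resources = list(root.iter("{%s}Resource" % DDMS_NS))

out_dir = tempfile.mkdtemp()
paths = []
titles = []
for number, resource in enumerate(resources, start=1):
    titles.append(resource.find("{%s}title" % DDMS_NS).text)
    path = os.path.join(out_dir, "resource_%d.xml" % number)
    ET.ElementTree(resource).write(path, encoding="utf-8", xml_declaration=True)
    paths.append(path)

print(titles)
```

Matching on the namespace URI is stricter than local-name(): it will not pick up a Resource element from some other vocabulary that happens to share the name.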
