Processing XML File

jakecjacobson

I need to take an XML web resource and split it up into smaller XML
files. I am able to retrieve the web resource, but I can't find any
good XML examples. I am just learning Python, so forgive me if this
question has been answered many times in the past.

My resource looks like this:

<document>
...
...
</document>
<document>
...
...
</document>

So in this example, I would need to output two files, each containing
what is between the opening and closing document tags.
 
Adam Tauno Williams

Do you want to parse the whole document or process it with SAX?

I have a SAX example at
<http://coils.hg.sourceforge.net/hgw...27b08f7f/src/coils/logic/workflow/xml/bpml.py>
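
For readers unfamiliar with SAX, here is a minimal sketch of the streaming style using the standard library's xml.sax module. It is not taken from the linked file. The DocumentCollector class and the <feed> wrapper element are hypothetical names chosen for illustration; only the <document> element comes from the question.

```python
import xml.sax


class DocumentCollector(xml.sax.ContentHandler):
    """Collect the character data found inside each <document> element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.documents = []
        self._buffer = None  # None means "not inside a <document>"

    def startElement(self, name, attrs):
        if name == "document":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "document":
            self.documents.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical input: a single <feed> root wrapping the <document> elements.
sample = b"<feed><document>one</document><document>two</document></feed>"
handler = DocumentCollector()
xml.sax.parseString(sample, handler)
print(handler.documents)
```

SAX never builds a tree, so it suits very large inputs, but the handler has to track its own state, which is why the tree-based approaches later in the thread are usually easier for a beginner.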
 
jakecjacobson

Do you want to parse the whole document or process it with SAX?

I have a SAX example at
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/s...>

Thanks, but I am in way over my head with XML and Python. I am working
with DDMS and need to output the individual resource nodes to their own
files. I hope this helps; what I need is a good example and an
explanation of how to use it.

Here is what a resource node looks like:
<ddms:Resource
    xsi:schemaLocation="https://metadata.dod.mil/mdr/ns/DDMS/1.4/
                        https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ICISM="urn:us:gov:ic:ism:v2">
  <ddms:identifier ddms:qualifier="URL"
      ddms:value="https://metadata.dod.mil/mdr/ns/TBD/1.0/SampleTaxonomy.owl"/>
  <ddms:identifier
      ddms:qualifier="https://metadata.dod.mil/mdr/ns/MDR/1.0/MDR.owl#GovernanceNamespace"
      ddms:value="TBD"/>
  <ddms:identifier ddms:qualifier="Version" ddms:value="1.0"/>
  <ddms:title ICISM:ownerProducer="USA"
      ICISM:classification="U">Sample Taxonomy</ddms:title>
  <ddms:description ICISM:ownerProducer="USA"
      ICISM:classification="U">
    This is a sample taxonomy created for the Help page.
  </ddms:description>
  <ddms:dates ddms:posted="2007-11-24"/>
  <ddms:creator ICISM:ownerProducer="USA"
      ICISM:classification="U">
    <ddms:person>
      <ddms:name>Sample</ddms:name>
      <ddms:surname>Developer</ddms:surname>
      <ddms:affiliation>FGM, Inc.</ddms:affiliation>
      <ddms:phone>703-885-1000</ddms:phone>
      <ddms:email>[email protected]</ddms:email>
    </ddms:person>
  </ddms:creator>
  <ddms:security ICISM:ownerProducer="USA"
      ICISM:classification="U" ICISM:nonICmarkings="DIST_STMT_A"/>
  <!-- Other DDMS elements may appear here. -->
</ddms:Resource>

You can see the DDMS site at https://metadata.dod.mil/.
 
Adam Tauno Williams

If that is all you need, XPath will split it apart for you, like
<http://coils.hg.sourceforge.net/hgw...src/coils/logic/workflow/actions/xml/xpath.py>


doc = etree.parse(self._rfile)  # self._rfile: an open file holding the XML
results = doc.xpath(xpath)      # xpath: the XPath expression, as a string
for result in results:
    print(str(result))

For example, if your XML has an outermost element of ResultSet with inner row elements, just do:

for record in doc.xpath(u'/ResultSet/row'):
    pass  # handle each row element here

The implied import for these examples is "from lxml import etree".
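
The same ResultSet/row idea can be sketched without installing lxml by using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This swaps in ElementTree for lxml, so absolute paths such as /ResultSet/row are written relative to the root element instead. The sample data and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the ResultSet/row example.
xml_text = """<ResultSet>
  <row id="1"><name>alpha</name></row>
  <row id="2"><name>beta</name></row>
</ResultSet>"""

root = ET.fromstring(xml_text)  # root is the <ResultSet> element

# "./row" selects the row elements that are direct children of the root.
rows = root.findall("./row")

# Serialize each row back to text; strip() drops the whitespace that
# followed the element in the source.
serialized = [ET.tostring(row, encoding="unicode").strip() for row in rows]
for text in serialized:
    print(text)
```

With lxml, doc.xpath() accepts full XPath 1.0, so ElementTree is only a drop-in choice when the expressions stay this simple.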
 
Stefan Behnel

jakecjacobson, 29.01.2010 18:25:
I need to take a XML web resource and split it up into smaller XML
files. I am able to retrieve the web resource but I can't find any
good XML examples. I am just learning Python so forgive me if this
question has been answered many times in the past.

My resource is like:

<document>
...
...
</document>
<document>
...
...
</document>

Is this what you get as a document or is this just /contained/ in the document?

Note that XML does not allow more than one root element, so the above is
not a well-formed XML document. Each of the two <document>...</document>
parts forms an XML document by itself, though.
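
The single-root rule is easy to check with a small sketch. It assumes the data arrives as one string containing only the sibling elements; the sample string, the parse_fragments helper and the "wrapper" element name are all hypothetical. Wrapping the text in a synthetic root is one possible workaround, and it would not work if the text started with an XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two top-level <document> elements, as in the question.
fragment = "<document><a>1</a></document>\n<document><a>2</a></document>\n"


def parse_fragments(text, wrapper="wrapper"):
    """Parse text with several top-level elements by adding a synthetic root."""
    root = ET.fromstring("<{0}>{1}</{0}>".format(wrapper, text))
    return list(root)  # the original top-level elements, in order


try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    # The parser rejects everything after the first root element.
    parsed_directly = False

documents = parse_fragments(fragment)
print(parsed_directly, [d.tag for d in documents])
```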

So in this example, I would need to output 2 files with the contents
of each file what is between the open and close document tag.

Are the two files formatted as you show above? In that case, you can simply
iterate over the lines and cut the document when you see "<document>". Or,
if you are sure that "<document>" only appears as top-most elements and not
inside of the documents, you can search for "<document>" in the content (a
string, I guess) and split it there.

As was pointed out before, once you have these two documents, use the
xml.etree package to work with them.

Something like this might work:

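import xml.etree.ElementTree as ET
from urllib.request import urlopen  # urllib2.urlopen on Python 2

data = urlopen(url).read()  # raw bytes of the web resource

for part in data.split(b'<document>'):
    if not part.strip():
        continue  # skip the empty chunk before the first <document>
    document = ET.fromstring(b'<document>' + part)
    print(document.tag)
    # ... do other stuff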

Stefan
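
To get from the split-and-parse idea above to one output file per document, something along these lines might work. It is a sketch under the same assumptions Stefan states: the start tag is written exactly as "<document>" (no attributes) and never appears nested. The split_documents function, the file naming scheme and the sample string are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_documents(data, out_dir, tag="document"):
    """Write each top-level <tag>...</tag> chunk of data to its own file."""
    start = "<%s>" % tag
    paths = []
    # Drop the empty chunk that precedes the first start tag.
    chunks = [chunk for chunk in data.split(start) if chunk.strip()]
    for index, chunk in enumerate(chunks, 1):
        # Parsing validates the chunk; ParseError is raised if it is malformed.
        element = ET.fromstring(start + chunk)
        path = os.path.join(out_dir, "%s_%03d.xml" % (tag, index))
        ET.ElementTree(element).write(path, encoding="utf-8",
                                      xml_declaration=True)
        paths.append(path)
    return paths


sample = ("<document><title>one</title></document>\n"
          "<document><title>two</title></document>\n")
out_dir = tempfile.mkdtemp()
paths = split_documents(sample, out_dir)
print(paths)
```

Parsing each chunk before writing costs a little time but means a malformed chunk fails loudly instead of producing a broken file.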
 
Sells, Fred

Google is your friend. ElementTree is one of the better-documented
modules, IMHO, but there are many modules that can do this.
 
Stefan Behnel

Sells, Fred, 29.01.2010 20:31:
Google is your friend. Elementtree is one of the better documented
IMHO, but there are many modules to do this.

Unless the OP provides some more information, "do this" is rather
underdefined. And sending someone off to Google who is just learning the
basics of Python and XML and trying to solve a very specific problem with
them is not exactly the spirit I'm used to in this newsgroup.

Stefan
 
jakecjacobson

Just want to thank everyone for their posts. I got it working with the
code below after I discovered a namespace issue.

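import libxml2

xmlDoc = libxml2.parseDoc(guts)  # guts: presumably the XML text retrieved earlier
# Ignore the namespace and just get the Resource nodes
resourceNodes = xmlDoc.xpathEval('//*[local-name()="Resource"]')
for rNode in resourceNodes:
    print(rNode)
xmlDoc.freeDoc()  # libxml2 documents must be freed explicitly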
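
For comparison, here is a rough standard-library equivalent of the namespace handling, using xml.etree.ElementTree instead of libxml2. The DDMS namespace URI is copied from the sample resource earlier in the thread; the <results> wrapper element, the shortened sample data and the local_name helper are hypothetical. It shows two options: naming the namespace explicitly, or ignoring it the way local-name() does.

```python
import xml.etree.ElementTree as ET

DDMS_NS = "https://metadata.dod.mil/mdr/ns/DDMS/1.4/"  # URI from the thread's sample

# Hypothetical wrapper element; only the ddms:Resource nodes mirror the thread.
xml_text = """<results xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/">
  <ddms:Resource><ddms:title>First</ddms:title></ddms:Resource>
  <ddms:Resource><ddms:title>Second</ddms:title></ddms:Resource>
</results>"""

root = ET.fromstring(xml_text)

# Option 1: name the namespace explicitly through a prefix map.
by_namespace = root.findall(".//ddms:Resource", {"ddms": DDMS_NS})


def local_name(element):
    """Return the tag without its '{namespace}' part."""
    return element.tag.rsplit("}", 1)[-1]


# Option 2: ignore the namespace, similar to local-name() in XPath.
by_local_name = [el for el in root.iter() if local_name(el) == "Resource"]

titles = [el.find("{%s}title" % DDMS_NS).text for el in by_local_name]
print(len(by_namespace), titles)
```

Option 1 is stricter, since it will not match a Resource element from some other vocabulary; Option 2 keeps working if the namespace URI changes between DDMS versions.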
 
