Processing XML File

Discussion in 'Python' started by jakecjacobson, Jan 29, 2010.

  1. I need to take an XML web resource and split it up into smaller XML
    files. I am able to retrieve the web resource but I can't find any
    good XML examples. I am just learning Python so forgive me if this
    question has been answered many times in the past.

    My resource is like:

    <document>
    ...
    ...
    </document>
    <document>
    ...
    ...
    </document>

    So in this example, I would need to output 2 files, each containing
    what is between one pair of opening and closing document tags.
    jakecjacobson, Jan 29, 2010
    #1

  2. On Fri, 2010-01-29 at 09:25 -0800, jakecjacobson wrote:
    > I need to take a XML web resource and split it up into smaller XML
    > files. I am able to retrieve the web resource but I can't find any
    > good XML examples. I am just learning Python so forgive me if this
    > question has been answered many times in the past.
    > My resource is like:
    > <document>
    > ...
    > ...
    > </document>
    > <document>
    > ...
    > ...
    > </document>
    > So in this example, I would need to output 2 files with the contents
    > of each file what is between the open and close document tag.


    Do you want to parse the whole document into a tree, or process it as a stream with SAX?

    I have a SAX example at
    <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/src/coils/logic/workflow/xml/bpml.py>
    Adam Tauno Williams, Jan 29, 2010
    #2
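
For readers unfamiliar with the SAX option mentioned above, here is a minimal sketch of event-driven parsing using the standard library's xml.sax module. It is not the code behind the linked example; the DocumentCounter class, the SAMPLE input, and the root wrapper element are hypothetical names chosen for illustration.

```python
import xml.sax


class DocumentCounter(xml.sax.ContentHandler):
    """Count <document> elements as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag, in document order.
        if name == "document":
            self.count += 1


# Hypothetical sample input; a single <root> wrapper keeps it well-formed.
SAMPLE = b"<root><document>a</document><document>b</document></root>"

handler = DocumentCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.count)
```

A SAX handler never holds the whole tree in memory, which matters for very large inputs; for a small resource, a tree-based parser is usually simpler to work with.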

  3. On Jan 29, 1:04 pm, Adam Tauno Williams wrote:
    > On Fri, 2010-01-29 at 09:25 -0800, jakecjacobson wrote:
    > > I need to take a XML web resource and split it up into smaller XML
    > > files.  I am able to retrieve the web resource but I can't find any
    > > good XML examples.  I am just learning Python so forgive me if this
    > > question has been answered many times in the past.
    > > My resource is like:
    > > <document>
    > >      ...
    > >      ...
    > > </document>
    > > <document>
    > >      ...
    > >      ...
    > > </document>
    > > So in this example, I would need to output 2 files with the contents
    > > of each file what is between the open and close document tag.

    >
    > Do you want to parse the document or SaX?
    >
    > I have a SaX example at
    > <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/s...>


    Thanks, but I am in way over my head with XML and Python. I am working
    with DDMS and need to output the individual resource nodes to their own
    files. I hope that this helps. I need a good example and an explanation
    of how to use it.

    Here is what a resource node looks like:
    <ddms:Resource
    xsi:schemaLocation="https://metadata.dod.mil/mdr/ns/DDMS/1.4/
    https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ICISM="urn:us:gov:ic:ism:v2">
    <ddms:identifier ddms:qualifier="URL" ddms:value="https://metadata.dod.mil/mdr/ns/TBD/1.0/SampleTaxonomy.owl"/>
    <ddms:identifier ddms:qualifier="https://metadata.dod.mil/mdr/ns/MDR/1.0/MDR.owl#GovernanceNamespace" ddms:value="TBD"/>
    <ddms:identifier ddms:qualifier="Version" ddms:value="1.0"/>
    <ddms:title ICISM:ownerProducer="USA"
    ICISM:classification="U">Sample Taxonomy</ddms:title>
    <ddms:description ICISM:ownerProducer="USA"
    ICISM:classification="U">
    This is a sample taxonomy created for the Help page.
    </ddms:description>
    <ddms:dates ddms:posted="2007-11-24"/>
    <ddms:creator ICISM:ownerProducer="USA"
    ICISM:classification="U">
    <ddms:person>
    <ddms:name>Sample</ddms:name>
    <ddms:surname>Developer</ddms:surname>
    <ddms:affiliation>FGM, Inc.</ddms:affiliation>
    <ddms:phone>703-885-1000</ddms:phone>
    <ddms:email></ddms:email>
    </ddms:person>
    </ddms:creator>
    <ddms:security ICISM:ownerProducer="USA"
    ICISM:classification="U" ICISM:nonICmarkings="DIST_STMT_A" />
    <!-- Other DDMS elements may appear here. -->
    </ddms:Resource>

    You can see the DDMS site at https://metadata.dod.mil/.
    jakecjacobson, Jan 29, 2010
    #3
  4. On Fri, 2010-01-29 at 10:34 -0800, jakecjacobson wrote:
    > On Jan 29, 1:04 pm, Adam Tauno Williams wrote:
    > > On Fri, 2010-01-29 at 09:25 -0800, jakecjacobson wrote:
    > > > I need to take a XML web resource and split it up into smaller XML
    > > > files. I am able to retrieve the web resource but I can't find any
    > > > good XML examples. I am just learning Python so forgive me if this
    > > > question has been answered many times in the past.
    > > > My resource is like:
    > > > <document>
    > > > ...
    > > > ...
    > > > </document>
    > > > <document>
    > > > </document>
    > > > So in this example, I would need to output 2 files with the contents
    > > > of each file what is between the open and close document tag.

    > > Do you want to parse the document or SaX?
    > > I have a SaX example at
    > > <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/s...>

    > Thanks but I am way over my head with XML, Python. I am working with
    > DDMS and need to output the individual resource nodes to their own
    > file. I hope that this helps and I need a good example and how to use
    > it.



    If that is all you need, XPath will split it apart for you, as in
    <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/99b227b08f7f/src/coils/logic/workflow/actions/xml/xpath.py>


    doc = etree.parse(self._rfile)  # self._rfile: the XML source (a filename or file object)
    results = doc.xpath(xpath)      # xpath: the XPath expression, as a string
    for result in results:
        print(str(result))

    For example, if your XML has an outermost element of ResultSet with inner row elements, just do:
        for record in doc.xpath(u'/ResultSet/row'):

    The implied import for these examples is "from lxml import etree".


    > Here is what a resource node looks like:
    > <ddms:Resource
    > xsi:schemaLocation="https://metadata.dod.mil/mdr/ns/DDMS/1.4/
    > https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    > xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/"
    > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    > xmlns:ICISM="urn:us:gov:ic:ism:v2">
    > <ddms:identifier ddms:qualifier="URL" ddms:value="https://metadata.dod.mil/mdr/ns/TBD/1.0/SampleTaxonomy.owl"/>
    > <ddms:identifier ddms:qualifier="https://metadata.dod.mil/mdr/ns/MDR/1.0/MDR.owl#GovernanceNamespace" ddms:value="TBD"/>
    > <ddms:identifier ddms:qualifier="Version" ddms:value="1.0"/>
    > <ddms:title ICISM:ownerProducer="USA"
    > ICISM:classification="U">Sample Taxonomy</ddms:title>
    > <ddms:description ICISM:ownerProducer="USA"
    > ICISM:classification="U">
    > This is a sample taxonomy created for the Help page.
    > </ddms:description>
    > <ddms:dates ddms:posted="2007-11-24"/>
    > <ddms:creator ICISM:ownerProducer="USA"
    > ICISM:classification="U">
    > <ddms:person>
    > <ddms:name>Sample</ddms:name>
    > <ddms:surname>Developer</ddms:surname>
    > <ddms:affiliation>FGM, Inc.</ddms:affiliation>
    > <ddms:phone>703-885-1000</ddms:phone>
    > <ddms:email></ddms:email>
    > </ddms:person>
    > </ddms:creator>
    > <ddms:security ICISM:ownerProducer="USA"
    > ICISM:classification="U" ICISM:nonICmarkings="DIST_STMT_A" />
    > <!-- Other DDMS elements may appear here. -->
    > </ddms:Resource>
    >
    > You can see the DDMS site at https://metadata.dod.mil/.



    --
    OpenGroupware developer:
    <http://whitemiceconsulting.blogspot.com/>
    OpenGroupware & Cyrus IMAPd documentation @
    <http://docs.opengroupware.org/Members/whitemice/wmogag/file_view>
    Adam Tauno Williams, Jan 29, 2010
    #4
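
The DDMS sample in this thread uses namespace prefixes, which the /ResultSet/row style of path above does not cover. Below is a small sketch of a namespace-aware lookup, swapping in the standard library's xml.etree.ElementTree for lxml so that it runs without extra packages. The namespace URI is copied from the sample resource node; the results wrapper element and the trimmed-down SAMPLE text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample resource node in this thread.
NS = {"ddms": "https://metadata.dod.mil/mdr/ns/DDMS/1.4/"}

# Hypothetical wrapper document holding two trimmed-down Resource nodes.
SAMPLE = """\
<results xmlns:ddms="https://metadata.dod.mil/mdr/ns/DDMS/1.4/">
  <ddms:Resource><ddms:title>First</ddms:title></ddms:Resource>
  <ddms:Resource><ddms:title>Second</ddms:title></ddms:Resource>
</results>
"""

root = ET.fromstring(SAMPLE)

# The prefix in the path is resolved through the NS mapping.
resources = root.findall(".//ddms:Resource", NS)
titles = [r.findtext("ddms:title", namespaces=NS) for r in resources]
print(titles)
```

The prefix used in the path only has to match a key in the NS mapping; it does not have to be the prefix used in the document itself.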
  5. jakecjacobson, 29.01.2010 18:25:
    > I need to take a XML web resource and split it up into smaller XML
    > files. I am able to retrieve the web resource but I can't find any
    > good XML examples. I am just learning Python so forgive me if this
    > question has been answered many times in the past.
    >
    > My resource is like:
    >
    > <document>
    > ...
    > ...
    > </document>
    > <document>
    > ...
    > ...
    > </document>


    Is this what you get as a document or is this just /contained/ in the document?

    Note that XML does not allow more than one root element, so the above is
    not well-formed XML. Each of the two <document>...</document> parts forms
    an XML document by itself, though.


    > So in this example, I would need to output 2 files with the contents
    > of each file what is between the open and close document tag.


    Are the two files formatted as you show above? In that case, you can simply
    iterate over the lines and cut the document when you see "<document>". Or,
    if you are sure that "<document>" only appears as top-most elements and not
    inside of the documents, you can search for "<document>" in the content (a
    string, I guess) and split it there.

    As was pointed out before, once you have these two documents, use the
    xml.etree package to work with them.

    Something like this might work:

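    import urllib2  # Python 2; use urllib.request in Python 3
    import xml.etree.ElementTree as ET

    data = urllib2.urlopen(url).read()

    for part in data.split('<document>'):
        if not part.strip():
            continue  # skip the empty piece before the first <document>
        document = ET.fromstring('<document>' + part)
        print(document.tag)
        # ... do other stuff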

    Stefan
    Stefan Behnel, Jan 29, 2010
    #5
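
The original question also asks for one output file per document. One way to get there, sketched here under stated assumptions, is to wrap the concatenated fragments in a synthetic root element so the text parses as a single document, then write each child out separately. The split_documents function, the wrapper element, the SAMPLE text, and the document_N.xml file names are all hypothetical; the approach also assumes the retrieved text contains no XML declaration or DOCTYPE of its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_documents(text, out_dir):
    """Write each top-level <document> in text to its own file."""
    # Wrapping the fragments in one synthetic root makes the text well-formed.
    # Assumes the text carries no XML declaration or DOCTYPE of its own.
    root = ET.fromstring("<wrapper>" + text + "</wrapper>")
    paths = []
    for index, element in enumerate(root.findall("document"), start=1):
        path = os.path.join(out_dir, "document_%d.xml" % index)
        ET.ElementTree(element).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


# Hypothetical sample standing in for the retrieved resource.
SAMPLE = "<document><a>1</a></document>\n<document><a>2</a></document>\n"

out_dir = tempfile.mkdtemp()
paths = split_documents(SAMPLE, out_dir)
print(len(paths))
```

Compared with splitting on the literal string "<document>", this lets the parser decide where each element ends, so nested or attribute-bearing document tags do not confuse it.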
  6. Google is your friend. ElementTree is one of the better documented
    modules, IMHO, but there are many modules to do this.

    Sells, Fred, Jan 29, 2010
    #6
  7. Sells, Fred, 29.01.2010 20:31:
    > Google is your friend. Elementtree is one of the better documented
    > IMHO, but there are many modules to do this.


    Unless the OP provides some more information, "do this" is rather
    underdefined. And sending someone off to Google who is just learning the
    basics of Python and XML and trying to solve a very specific problem with
    them is not exactly the spirit I'm used to in this newsgroup.

    Stefan
    Stefan Behnel, Jan 29, 2010
    #7
  8. On Jan 29, 2:41 pm, Stefan Behnel wrote:
    > Sells, Fred, 29.01.2010 20:31:
    >
    > > Google is your friend.  Elementtree is one of the better documented
    > > IMHO, but there are many modules to do this.

    >
    > Unless the OP provides some more information, "do this" is rather
    > underdefined. And sending someone off to Google who is just learning the
    > basics of Python and XML and trying to solve a very specific problem with
    > them is not exactly the spirit I'm used to in this newsgroup.
    >
    > Stefan


    Just want to thank everyone for their posts. I got it working with this
    code after I discovered a namespace issue.

    import libxml2

    xmlDoc = libxml2.parseDoc(guts)  # guts: the XML text to parse
    # Ignore namespace and just get the Resource
    resourceNodes = xmlDoc.xpathEval('//*[local-name()="Resource"]')
    for rNode in resourceNodes:
        print(rNode)
    jakecjacobson, Feb 1, 2010
    #8
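
For anyone without libxml2 installed, the same "ignore the namespace" idea can be approximated with the standard library. This is only a sketch, not the poster's code: ElementTree stores qualified tags as "{namespace-uri}localname", so stripping everything up to the closing brace gives the local name. The local_name helper, the SAMPLE text, and the urn:example namespace URIs are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


# Hypothetical input: Resource nodes under two different namespace URIs.
SAMPLE = """\
<results>
  <a:Resource xmlns:a="urn:example:one"><a:title>First</a:title></a:Resource>
  <b:Resource xmlns:b="urn:example:two"><b:title>Second</b:title></b:Resource>
</results>
"""

root = ET.fromstring(SAMPLE)

# Keep every element whose local name is Resource, whatever its namespace.
resources = [el for el in root.iter() if local_name(el.tag) == "Resource"]
for number, resource in enumerate(resources, start=1):
    print(number, resource.tag)
```

Matching on the local name alone is convenient when the namespace URI varies between versions of a schema, at the cost of also matching unrelated elements that happen to share the name.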