Parsing an XML file using Python

chad

Hello, all,

I am new to Python.

I need to read an XML document, ignore all the XML tags, and write only
the text between the tags to a text file. In other words, if I have an
XML document like so:

<tag1>This</tag1>
<tag2>is</tag2>
<tag3>a</tag3>
<tag1>test</tag1>

I need to write "This is a test" to a text file. How do I achieve
this? Thanks.
 
Peter Hansen

chad said:
| I am new to Python.

And XML?

| I need to read an XML document and ignore all XML tags and write only
| those between the tags to a text file. In other words, if I have an
| XML document like so:
|
| <tag1>This</tag1>
| <tag2>is</tag2>
| <tag3>a</tag3>
| <tag1>test</tag1>
|
| I need to write "This is a test" to a text file. How do I achieve
| this? Thanks.

Note that what you have above is _not_ an XML file. That is, it looks
like XML but it's not well-formed, as it doesn't have a single enclosing
element.

You probably meant that just as a quickie example, but in case that's
like what your actual data format looks like, you'll have trouble using
any of the Python XML parsers. (The fix is pretty trivial though.)

-Peter
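
A minimal sketch of the kind of fix Peter hints at, assuming the data really is a bare sequence of elements with no XML declaration: wrap it in a single enclosing element before parsing. The `<root>` name and the `text_from_fragment` helper are made up for illustration, and `xml.etree.ElementTree` is just one parser that could be used here.

```python
import xml.etree.ElementTree as ET


def text_from_fragment(fragment):
    """Return the text of a rootless XML fragment, joined by single spaces."""
    # Wrap the fragment in a hypothetical <root> element so it is well-formed.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # itertext() yields every text chunk in document order, including the
    # newlines between elements; split() then collapses that whitespace.
    return " ".join("".join(root.itertext()).split())


fragment = """<tag1>This</tag1>
<tag2>is</tag2>
<tag3>a</tag3>
<tag1>test</tag1>"""

print(text_from_fragment(fragment))
```

This would not work as written if the fragment started with an XML declaration, since the declaration must come first in the document.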
 
Georgy

There are standard XML modules with a built-in SAX2 parser.

http://pyxml.sourceforge.net/topics/howto/xml-howto.html
http://www.python.org/doc/current/lib/markup.html
http://www.devarticles.com/c/a/Python/Parsing-XML-with-SAX-and-Python/2/

For more samples: http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=xml+python+sample&btnG=Google+Search

And the code you're looking for will be like this (not tested):

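from xml.sax import make_parser, handler

class BodyOnly(handler.ContentHandler):
    # Called by the parser for each chunk of character data between tags
    # (this includes the whitespace between elements).
    def characters(self, content):
        print(content, end=" ")

parser = make_parser()
parser.setContentHandler(BodyOnly())
# input.xml must be well-formed, i.e. have a single enclosing element.
parser.parse("input.xml")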
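
Since the question asks for the text to end up in a file rather than on the screen, here is a rough variation on the same SAX idea that collects the character data and writes it out afterwards. The names `TextCollector`, `extract_text`, and `write_text` are invented for this sketch, and the sample input assumes an enclosing `<doc>` element has been added.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data seen between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver one text run in several pieces, so just accumulate.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Parse XML given as bytes and return its text with whitespace collapsed."""
    collector = TextCollector()
    xml.sax.parseString(xml_bytes, collector)
    return " ".join("".join(collector.chunks).split())


def write_text(xml_bytes, out_path):
    """Write the extracted text to out_path as a single line."""
    with open(out_path, "w") as out:
        out.write(extract_text(xml_bytes) + "\n")


sample = (b"<doc><tag1>This</tag1>\n<tag2>is</tag2>\n"
          b"<tag3>a</tag3>\n<tag1>test</tag1></doc>")
print(extract_text(sample))
```

Calling `write_text(sample, "output.txt")` would then put the same line into a file; the file name is only an example.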



| Hello, all,
|
| I am new to Python.
|
| I need to read an XML document and ignore all XML tags and write only
| those between the tags to a text file. In other words, if I have an
| XML document like so:
|
| <tag1>This</tag1>
| <tag2>is</tag2>
| <tag3>a</tag3>
| <tag1>test</tag1>
|
| I need to write "This is a test" to a text file. How do I achieve
| this? Thanks.
 
Andrew Clover

| <tag1>This</tag1>
| <tag2>is</tag2>
| <tag3>a</tag3>
| <tag1>test</tag1>
| I need to write "This is a test"

Assuming no nested tags (in which case you'd have to specify the problem
more completely), and no entity reference issues, any DOM Level 1
implementation can do this, e.g. with minidom:

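from xml.dom import minidom

# inputFilename and outputFilename are paths supplied by the caller.
doc = minidom.parse(inputFilename)

# Keep only the element children of the root, skipping whitespace text nodes.
parent = doc.documentElement
children = [child for child in parent.childNodes
            if child.nodeType == child.ELEMENT_NODE]
content = ' '.join(child.firstChild.nodeValue for child in children)

# Open in text mode, since content is a string.
with open(outputFilename, 'w') as fp:
    fp.write(content)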

For more complicated structures, the 'textContent' property in DOM Level 3
might be of use. (Insert standard pxdom plug here.)
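
As far as I know, minidom itself does not provide `textContent`, so for nested tags something like the following recursive helper could stand in for it. This is only a sketch: `text_content` is a made-up name, the sample document is invented, and it approximates the DOM Level 3 property rather than implementing it exactly.

```python
from xml.dom import minidom


def text_content(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # Elements recurse into their children; comments and processing
    # instructions have no children, so they contribute an empty string.
    return "".join(text_content(child) for child in node.childNodes)


doc = minidom.parseString(
    "<doc><a>This <b>is</b></a> <c>a <d>test</d></c></doc>")

# Collapse whatever whitespace the document had into single spaces.
result = " ".join(text_content(doc.documentElement).split())
print(result)
```

Unlike the `firstChild.nodeValue` approach, this does not fail on an empty element, since an element without children simply yields an empty string.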
 
