XML Parsing

Discussion in 'Python' started by Alok Kothari, Apr 1, 2008.

  1. Alok Kothari

    Alok Kothari Guest

    Hello,
    I am new to XML parsing. Could you kindly tell me what the
    problem is with the following code:

    import xml.dom.minidom
    import xml.parsers.expat
    document = """<token pos="nn">Letterman</token><token pos="bez">is</
    token><token pos="jjr">better</token><token pos="cs">than</
    token><token pos="np">Jay</token><token pos="np">Leno</token>"""



    # 3 handler functions
    def start_element(name, attrs):
        print 'Start element:', name, attrs
    def end_element(name):
        print 'End element:', name
    def char_data(data):
        print 'Character data:', repr(data)

    p = xml.parsers.expat.ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    p.Parse(document, 1)

    OUTPUT:

    Start element: token {u'pos': u'nn'}
    Character data: u'Letterman'
    End element: token

    Traceback (most recent call last):
      File "C:/Python25/Programs/eg.py", line 20, in <module>
        p.Parse(document, 1)
    ExpatError: junk after document element: line 1, column 33
    Alok Kothari, Apr 1, 2008
    #1

  2. On Apr 1, 12:42 pm, Alok Kothari <> wrote:
    > [...]
    > ExpatError: junk after document element: line 1, column 33


    Your XML is wrong. Don't put line breaks between </ and token>.
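
    The column reported in the error (33) is the point just after the
    first </token>, which suggests a second problem besides the broken
    closing tags: an XML document may have only one top-level element,
    and the string in the question has six. A minimal sketch of one
    possible fix, assuming Python 3 and a hypothetical <tokens> wrapper
    element (neither the wrapper name nor the parse() helper comes from
    the original post):

```python
import xml.parsers.expat

# The six tokens from the question, with every closing tag kept on one line.
fragment = (
    '<token pos="nn">Letterman</token><token pos="bez">is</token>'
    '<token pos="jjr">better</token><token pos="cs">than</token>'
    '<token pos="np">Jay</token><token pos="np">Leno</token>'
)

# Assumption: "tokens" is an illustrative root name; any single root works.
document = "<tokens>" + fragment + "</tokens>"


def parse(text):
    """Run expat over text and return the events it reports, in order."""
    events = []

    def start_element(name, attrs):
        events.append(("start", name, dict(attrs)))

    def end_element(name):
        events.append(("end", name))

    def char_data(data):
        events.append(("text", data))

    p = xml.parsers.expat.ParserCreate()
    p.buffer_text = True  # deliver each run of character data in one call
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    p.Parse(text, True)  # True marks this as the final chunk of input
    return events


for event in parse(document):
    print(event)
```

    Passing the unwrapped fragment to the same helper should reproduce
    the "junk after document element" error from the question.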
    Jason Scheirer, Apr 1, 2008
    #2

  3. 7stud

    7stud Guest

    On Apr 1, 1:42 pm, Alok Kothari <> wrote:
    > [...]
    > ExpatError: junk after document element: line 1, column 33



    I don't know if you are aware of the BeautifulSoup module:


    import BeautifulSoup as bs

    xml = """<token pos="nn">Letterman</token><token pos="bez">is</
    token><token pos="jjr">better</token><token pos="cs">than</
    token><token pos="np">Jay</token><token pos="np">Leno</token>"""

    doc = bs.BeautifulStoneSoup(xml)

    tokens = doc.findAll("token")
    for token in tokens:
        for attr in token.attrs:
            print "%s : %s" % attr
        print token.string

    --output:--
    pos : nn
    Letterman
    pos : bez
    is
    pos : jjr
    better
    pos : cs
    than
    pos : np
    Jay
    pos : np
    Leno
    7stud, Apr 2, 2008
    #3
  4. On Tue, 01 Apr 2008 20:44:41 -0300, 7stud <> wrote:

    >> I am new to XML parsing. Could you kindly tell me what the
    >> problem is with the following code:
    >>
    >> import xml.dom.minidom
    >> import xml.parsers.expat

    >
    > I don't know if you are aware of the BeautifulSoup module:
    >

    Or ElementTree:

    import xml.etree.ElementTree as ET

    doctext = """<tokens><token pos="nn">Letterman</token><token
    pos="bez">is</token><token pos="jjr">better</token><token
    pos="cs">than</token><token pos="np">Jay</token><token
    pos="np">Leno</token></tokens>"""

    doc = ET.fromstring(doctext)
    for token in doc.findall("token"):
        print 'pos:', token.get('pos')
        print 'text:', token.text
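
    Note that the string above adds a <tokens> root around the original
    elements. When the input arrives as a rootless fragment like the one
    in the question, the same idea can be sketched for Python 3 as
    follows. The helper name parse_tokens and its root_tag parameter are
    illustrative assumptions, not part of ElementTree:

```python
import xml.etree.ElementTree as ET


def parse_tokens(fragment, root_tag="tokens"):
    """Wrap a rootless run of <token> elements in one root element
    and return a list of (pos, text) pairs."""
    wrapped = "<{0}>{1}</{0}>".format(root_tag, fragment)
    doc = ET.fromstring(wrapped)
    return [(token.get("pos"), token.text) for token in doc.findall("token")]


# Same data as the question, with no root element of its own.
fragment = (
    '<token pos="nn">Letterman</token><token pos="bez">is</token>'
    '<token pos="jjr">better</token><token pos="cs">than</token>'
    '<token pos="np">Jay</token><token pos="np">Leno</token>'
)

for pos, text in parse_tokens(fragment):
    print("pos:", pos, "text:", text)
```

    Calling ET.fromstring() on the unwrapped fragment raises
    ET.ParseError instead, for the same single-root reason.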

    --
    Gabriel Genellina
    Gabriel Genellina, Apr 2, 2008
    #4