Problem with xml.dom parser and xmlns attribute

Discussion in 'Python' started by Peter Maas, Apr 22, 2004.

  1. Peter Maas

    Peter Maas Guest

    Hi,

    I have a problem parsing HTML text with xml.dom (PyXML). The following
    code runs well:

    --------------------------------------------
    from xml.dom.ext.reader import HtmlLib
    from xml.dom.ext import PrettyPrint

    r = HtmlLib.Reader()
    doc = r.fromString(
    '''
    <html>
    <head>
    </head>
    <body>
    <p>hallo welt
    </body>
    </html>
    ''')
    PrettyPrint(doc)
    --------------------------------------------

    but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">
    I get the error

    Traceback (most recent call last):
      File "xhtml.py", line 5, in ?
        doc = r.fromString(
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
        return self.fromStream(stream, ownerDoc, charset)
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
        self.parser.parse(stream)
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
        self._parser.parse(stream.read())
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
        unicode(value, self._charset))
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Element.py", line 177, in setAttributeNS
        attr = self.ownerDocument.createAttributeNS(namespaceURI, qualifiedName)
      File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Document.py", line 139, in createAttributeNS
        raise NamespaceErr()
    xml.dom.NamespaceErr: Invalid or illegal namespace operation
    >Exit code: 1


    A lot of HTML documents on the Internet have this xmlns=.... Are
    they wrong, or is this a PyXML bug?

    Kind regards,

    Peter Maas

    --
    -------------------------------------------------------------------
    Peter Maas, M+R Infosysteme, D-52070 Aachen, Hubert-Wienen-Str. 24
    Tel +49-241-93878-0 Fax +49-241-93878-20 eMail
    -------------------------------------------------------------------
     
    Peter Maas, Apr 22, 2004
    #1

  2. "Peter Maas" <> wrote in message news:c682uu$sco$...

    > but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">


    > A lot of HTML documents on Internet have this xmlns=.... Are
    > they wrong or is this a PyXML bug?


    If they are genuine XHTML documents, they should be well-formed XML,
    so you should be able to use an XML rather than an SGML parser.

    from xml.dom.ext.reader import Sax2
    r = Sax2.Reader()
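
    A minimal sketch of the same idea, assuming a current Python where PyXML's Sax2 reader is not available and the standard library's xml.dom.minidom stands in for it. The sample document is the one from the first post with the <p> element closed, since an XML parser requires well-formed input.

```python
from xml.dom import minidom

# Sample document from the first post, made well-formed by closing <p>.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<p>hallo welt</p>
</body>
</html>"""

doc = minidom.parseString(XHTML)
root = doc.documentElement

# The default namespace declared via xmlns is exposed on the element.
print(root.tagName, root.namespaceURI)

# Serialize the tree back to text.
print(doc.toxml())
```

    With an XML parser the xmlns declaration is ordinary namespace syntax, so it does not trigger a namespace error here.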
     
    Richard Brodie, Apr 22, 2004
    #2

  3. Peter Maas

    Peter Maas Guest

    Richard Brodie wrote:
    > "Peter Maas" <> wrote in message news:c682uu$sco$...

    [...]
    >>but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">

    [...]
    >>A lot of HTML documents on Internet have this xmlns=.... Are
    >>they wrong or is this a PyXML bug?

    >
    >
    > If they are genuine XHTML documents, they should be well-formed XML,
    > so you should be able to use an XML rather than an SGML parser.
    >
    > from xml.dom.ext.reader import Sax2
    > r = Sax2.Reader()


    Thanks, Richard. But on the Internet, most of the time I don't know
    what kind of document I'm dealing with when I start parsing. I guess
    I should use HTMLParser (?).
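
    A rough sketch of what that could look like, assuming Python 3, where the HTMLParser class lives in html.parser. The TagCollector class and its attribute names are made up for illustration. This parser is event-based and does not build a DOM, but it accepts both the xmlns attribute and the unclosed <p>.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags, their attributes, and text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag, [(attr_name, attr_value), ...])
        self.text = []        # non-blank text chunks, stripped

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


collector = TagCollector()
collector.feed("""
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
<p>hallo welt
</body>
</html>
""")
collector.close()

for tag, attrs in collector.start_tags:
    print(tag, attrs)
print(collector.text)
```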

    Kind regards,

    Peter Maas

     
    Peter Maas, Apr 22, 2004
    #3
  4. "Peter Maas" <> wrote in message news:c68jai$g85$...

    > Thanks, Richard. But in the Internet most of the time I don't know
    > what kind of document I'm dealing with when I start parsing. I guess
    > I should use HTMLParser (?).


    If you're dealing with a wide range of web pages, chances are they
    will have all manner of rubbish in them. I would probably feed the
    stuff through Tidy (or uTidyLib) first, to convert to cleanish XHTML,
    then use an XML parser.
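
    One way this pipeline could be wired up, as a sketch only: it assumes the HTML Tidy command-line program is installed as `tidy` on the PATH and calls it through subprocess rather than through uTidyLib. The function name parse_lenient is invented here. Input that is already well-formed XML skips Tidy entirely.

```python
import shutil
import subprocess
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_lenient(markup):
    """Hypothetical helper: parse as XML, cleaning up with Tidy if needed."""
    try:
        return minidom.parseString(markup)
    except ExpatError:
        pass  # not well-formed XML; fall through to the Tidy step

    tidy = shutil.which("tidy")  # assumption: the HTML Tidy CLI is installed
    if tidy is None:
        raise RuntimeError("input is not well-formed XML and tidy was not found")

    # -asxhtml: emit XHTML; -numeric: numeric entities an XML parser accepts;
    # -quiet: suppress the summary output.
    result = subprocess.run(
        [tidy, "-asxhtml", "-numeric", "-quiet"],
        input=markup, capture_output=True, text=True,
    )
    # Tidy exits with 1 for warnings only and 2 for errors.
    if result.returncode >= 2:
        raise RuntimeError("tidy could not repair the document")
    return minidom.parseString(result.stdout)


doc = parse_lenient(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>hallo welt</p></body></html>"
)
print(doc.documentElement.namespaceURI)
```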
     
    Richard Brodie, Apr 23, 2004
    #4
  5. Uche Ogbuji

    Uche Ogbuji Guest

    Peter Maas <> wrote in message news:<c682uu$sco$>...
    > [...]
    >
    > A lot of HTML documents on Internet have this xmlns=.... Are
    > they wrong or is this a PyXML bug?


    This looks like a 4DOM bug. What are you hoping to do once you've
    parsed these documents? If we know, we can either suggest an
    alternative tool or perhaps a workaround.

    --Uche
     
    Uche Ogbuji, May 10, 2004
    #5