Problem with xml.dom parser and xmlns attribute

Peter Maas · Apr 22, 2004

Hi,

I have a problem parsing HTML text with xml.dom. The following code
runs well:

--------------------------------------------
from xml.dom.ext.reader import HtmlLib
from xml.dom.ext import PrettyPrint

r = HtmlLib.Reader()
doc = r.fromString(
'''
<html>
<head>
</head>
<body>
<p>hallo welt
</body>
</html>
''')
PrettyPrint(doc)
--------------------------------------------

but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">
I get the error

Traceback (most recent call last):
  File "xhtml.py", line 5, in ?
    doc = r.fromString(
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 69, in fromString
    return self.fromStream(stream, ownerDoc, charset)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\HtmlLib.py", line 27, in fromStream
    self.parser.parse(stream)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\ext\reader\Sgmlop.py", line 160, in finish_starttag
    unicode(value, self._charset))
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Element.py", line 177, in setAttributeNS
    attr = self.ownerDocument.createAttributeNS(namespaceURI, qualifiedName)
  File "C:\PROGRA~1\Python23\lib\site-packages\_xmlplus\dom\Document.py", line 139, in createAttributeNS
    raise NamespaceErr()
xml.dom.NamespaceErr: Invalid or illegal namespace operation

>Exit code: 1

A lot of HTML documents on the Internet have this xmlns=.... Are
they wrong or is this a PyXML bug?

Kind regards,

Peter Maas

Richard Brodie · Apr 22, 2004

Peter Maas said:
but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml">

A lot of HTML documents on the Internet have this xmlns=.... Are
they wrong or is this a PyXML bug?

If they are genuine XHTML documents, they should be well-formed XML,
so you should be able to use an XML rather than an SGML parser.

from xml.dom.ext.reader import Sax2
r = Sax2.Reader()
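
The same idea can be sketched with the standard library's xml.dom.minidom instead of PyXML's Sax2 reader (a substitution made here so the example runs without PyXML; it is not the reader named above). The sample document is an assumption: it is the markup from the first post with the <p> element closed, because an XML parser rejects input that is not well-formed.

```python
from xml.dom import minidom

# Well-formed XHTML: unlike the original sample, <p> is explicitly closed.
XHTML = '''<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>test</title></head>
<body><p>hallo welt</p></body>
</html>'''

doc = minidom.parseString(XHTML)
root = doc.documentElement

# The default namespace declaration is accepted and recorded on the element.
namespace = root.namespaceURI
paragraph_text = root.getElementsByTagName("p")[0].firstChild.data

print(namespace)
print(paragraph_text)
```

With the unclosed <p> from the original sample, this parser would raise an error instead, which is the trade-off of treating the input as XML.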

Peter Maas · Apr 22, 2004

Richard said:
but if I replace <html> by <html xmlns="http://www.w3.org/1999/xhtml"> [...]
A lot of HTML documents on the Internet have this xmlns=.... Are
they wrong or is this a PyXML bug?

If they are genuine XHTML documents, they should be well-formed XML,
so you should be able to use an XML rather than an SGML parser.

from xml.dom.ext.reader import Sax2
r = Sax2.Reader()

Thanks, Richard. But on the Internet, most of the time I don't know
what kind of document I'm dealing with when I start parsing. I guess
I should use HTMLParser (?).

Kind regards,

Peter Maas
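
For reference, a minimal sketch of the HTMLParser route, assuming the modern standard-library location html.parser (in 2004 the module was named HTMLParser). The TagCollector class and its attribute names are hypothetical, chosen only for illustration. An event-based HTML parser of this kind treats xmlns as an ordinary attribute and tolerates the unclosed <p>, so it does not care whether the input is HTML or XHTML.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags with attributes, and non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; xmlns arrives like any other attribute.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


collector = TagCollector()
collector.feed('''<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<p>hallo welt
</body>
</html>''')
collector.close()

print(collector.tags)
print(collector.text)
```

Note that this yields a stream of events rather than a DOM tree, so any tree structure has to be built by the handler methods.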

Richard Brodie · Apr 23, 2004

Peter Maas said:
Thanks, Richard. But on the Internet, most of the time I don't know
what kind of document I'm dealing with when I start parsing. I guess
I should use HTMLParser (?).

If you're dealing with a wide range of web pages, chances are they
will have all manner of rubbish in them. I would probably feed the
stuff through Tidy (or uTidyLib) first, to convert to cleanish XHTML,
then use an XML parser.
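
A rough sketch of that two-step pipeline, assuming the HTML Tidy command-line program is installed and reachable as "tidy" (it calls the executable through subprocess rather than using uTidyLib, whose API is not shown in this thread). The function name tidy_to_xhtml is hypothetical, and xml.dom.minidom stands in for whichever XML parser is preferred.

```python
import shutil
import subprocess
from xml.dom import minidom


def tidy_to_xhtml(html):
    """Return cleaned-up XHTML text, or None if tidy is missing or reports errors."""
    exe = shutil.which("tidy")
    if exe is None:
        return None
    proc = subprocess.run(
        # -asxhtml: emit XHTML; -numeric: numeric entities; -quiet: less chatter
        [exe, "-asxhtml", "-numeric", "-quiet", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits with 0 on success, 1 when it only issued warnings, 2 on errors.
    if proc.returncode not in (0, 1):
        return None
    return proc.stdout.decode("utf-8")


xhtml = tidy_to_xhtml("<html><body><p>hallo welt</body></html>")
if xhtml is not None:
    # The cleaned output is well-formed, so an XML parser can take over from here.
    doc = minidom.parseString(xhtml.encode("utf-8"))
    print(doc.documentElement.namespaceURI)
else:
    print("tidy not available; skipping")
```

Requesting numeric entities avoids named entities such as &nbsp;, which an XML parser cannot resolve without reading the XHTML DTD.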

Uche Ogbuji · May 10, 2004

Peter Maas said:
A lot of HTML documents on the Internet have this xmlns=.... Are
they wrong or is this a PyXML bug?

This looks like a 4DOM bug. What are you hoping to do once you've
parsed these documents? If we know, we can suggest either an
alternative tool or perhaps a workaround.

--Uche
