Simple allowing of HTML elements/attributes?

Leif K-Brooks · Feb 11, 2004

I'm writing a site with mod_python which will have, among other things,
forums. I want to allow users to use a small set of harmless HTML
elements on the forums, but I don't want to allow bad elements and
attributes (onclick, <script>, etc.). I would also like to do basic
validation (no overlapping elements, no missing end tags). I'm not
asking anyone to write a script for me, but does anyone have general
ideas about how to do this quickly on an active forum?

David M. Cooke · Feb 11, 2004

You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).
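
A minimal sketch of the well-formedness half of this idea, written for Python 3. Note the assumptions: the standard library parser only checks that the markup is well-formed (it does not validate against a DTD), and the `is_well_formed` name and the `<post>` wrapper element are made up for illustration.

```python
import xml.sax


def is_well_formed(fragment):
    """Return True if `fragment` parses as well-formed XML inside a wrapper element."""
    # Wrap the post so that plain text and several sibling elements are accepted.
    wrapped = "<post>%s</post>" % fragment
    try:
        xml.sax.parseString(wrapped.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


print(is_well_formed("<b><i>foo</i></b>"))   # properly nested
print(is_well_formed("<b><i>foo</b></i>"))   # overlapping elements
print(is_well_formed("<p>no end tag"))       # missing end tag
```

A check like this rejects overlapping elements and missing end tags, but it will also reject HTML-only entities such as `&nbsp;`, so it is probably only practical if posts are normalised first.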

Graham Fawcett · Feb 12, 2004

You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.

Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)

There's a Python wrapper for tidylib at
http://utidylib.sourceforge.net/ .

-- Graham

Alan Kennedy · Feb 12, 2004

[Leif K-Brooks]
"Quickly" being an important consideration for you, I'm presuming.

[David M. Cooke]
Hmmm, I'd imagine that the average forum user isn't going to know what
well-formed XML is. Also, validating-XML support is one of the areas
where Python is lacking. Lastly, wrapping HTML tags in a CDATA block
won't deliver much benefit. You still have to send that HTML to the
browser, which will probably render the contents of the CDATA block
anyway.

[Graham Fawcett]

> You could use Tidy (or tidylib) to convert error-ridden input into
> valid HTML or XHTML, and then grab the BODY contents via an XML
> parser, as David suggested. I imagine that the library version of tidy
> is quick enough to meet your needs.

This is a good idea. Tidy is always a good way to get easily
processable XML from badly-formed HTML. There are multiple ways to run
Tidy from Python: use MAL's utidy library, use the command line
executable and pipes, or in Jython use JTidy.

http://sourceforge.net/projects/jtidy
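
For the "command line executable and pipes" route, something along these lines might do (Python 3). This is only a sketch under assumptions: it presumes a `tidy` binary is on the PATH, and the function names `tidy_command` and `tidy_to_xhtml` are invented here, not part of any library.

```python
import subprocess


def tidy_command(executable="tidy"):
    """Build the argument list for converting HTML on stdin to XHTML on stdout."""
    return [
        executable,
        "-asxhtml",                    # emit well-formed XHTML
        "-quiet",                      # suppress the summary banner
        "-utf8",                       # read and write UTF-8
        "--numeric-entities", "yes",   # &#160; rather than &nbsp;, which a plain XML parser rejects
        "--show-warnings", "no",
    ]


def tidy_to_xhtml(html, executable="tidy"):
    """Pipe `html` through tidy and return the XHTML document as a string."""
    result = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors.
    if result.returncode > 1:
        raise ValueError("tidy could not repair the input: %s"
                         % result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

Spawning a process per post costs more than calling the library in-process, so on a busy forum the library wrapper is likely the better fit.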

[Graham Fawcett]

> Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
> XHTML. (Not something I'm familiar with, but someone must have done
> this before.)

However, this is not a good idea. XSLT requires an Object Model of the
document, meaning that you're going to use a lot of CPU time and
memory. In extreme cases, e.g. where some black-hat attempts to upload
a 20 Mbyte HTML file, you're opening yourself up to a
Denial-Of-Service attack when your server tries to build up a [D]OM
of that document.

The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import xml.sax.handler
import cStringIO as StringIO

# Whitelists: anything not listed here is left out of the output.
permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        # Write the start tag only for permitted elements.
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    attrstr = "%s " % "%s='%s'" % (a, attrs[a])
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        # Character data is copied through as-is.
        self.outbuf.write("%s" % (s,))

testdoc = """
<html>
<body>
<p>This paragraph contains only permitted elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<a href="http://www.jscript-attack.com/">a potential script
attack</a></p>
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    parser.close()
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Tidying the HTML to XML is left as an exercise to the reader ;-)

HTH,

Alan Kennedy · Feb 12, 2004

[Alan Kennedy]

> The optimal solution, IMHO, is to tidy the HTML into XML, and then use
> SAX to filter out the stuff you don't want. Here is some code that
> does the latter. This should be nice and fast, and use a lot less
> memory than object-model based approaches.

Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it:

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '&lt;' and
'>' to '&gt;') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave it to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
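
One way problems 2 and 3 can bite, sketched in Python 3. The `TextCollector` class is a hypothetical helper made up for this illustration. The point is that the parser decodes entities before calling `characters()`, so markup that arrives escaped turns back into live markup if it is written out unescaped.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical helper: records character data exactly as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


collector = TextCollector()
# The attacker escapes the markup, so the parser never sees a <script> element.
xml.sax.parseString(b"<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>", collector)
text = "".join(collector.chunks)

print(text)          # decoded by the parser: live markup if written out as-is
print(escape(text))  # re-escaped: safe to write into the page

# Attribute values need the same care: quoteattr() picks the quote
# character and escapes any quote that would end the value early.
print(quoteattr("x\" onclick=\"alert('hi')"))
```

Wrapping attribute values in hand-written single quotes, as the earlier listing does, offers no such protection against a value that itself contains a single quote.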

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

# Whitelists: anything not listed here is left out of the output.
permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    # Accumulate every permitted attribute, safely quoted.
                    attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                        quoteattr(attrs[a])))
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        # Re-escape character data so it cannot become markup again.
        self.outbuf.write("%s" % (escape(s),))

testdoc = """
<html>
<body>
<p>This paragraph contains only permitted
elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<script src="blackhat.js"/>a potential script
attack</p>
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    parser.close()
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
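
The listing above is Python 2 (cStringIO, the print statement). For anyone on Python 3, a rough equivalent might look like the following. It is a sketch rather than a drop-in replacement: the `Cleaner` and `clean` names and the sample input are assumptions of this example, and it still expects well-formed XML as input.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr

PERMITTED_ELEMENTS = {"html", "body", "b", "i", "p"}
PERMITTED_ATTRS = {"class", "id"}


class Cleaner(xml.sax.ContentHandler):
    """Copy whitelisted elements and attributes to a buffer, escaping all text."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()

    def startElement(self, name, attrs):
        if name in PERMITTED_ELEMENTS:
            kept = "".join(
                " %s=%s" % (key, quoteattr(attrs[key]))
                for key in attrs.keys()
                if key in PERMITTED_ATTRS
            )
            self.outbuf.write("<%s%s>" % (name, kept))

    def endElement(self, name):
        if name in PERMITTED_ELEMENTS:
            self.outbuf.write("</%s>" % name)

    def characters(self, content):
        self.outbuf.write(escape(content))


def clean(xhtml):
    """Return `xhtml` (a well-formed XML string) with disallowed markup removed."""
    handler = Cleaner()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.outbuf.getvalue()


sample = (
    '<body><p class="intro" onclick="evil()">Hi <i>there</i></p>'
    '<script src="blackhat.js"/><img src="x.gif"/></body>'
)
print(clean(sample))
```

As in the original, only the tags of a disallowed element are dropped; any text inside it is still written out, in escaped form.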

regards,
