Trimming X/HTML files

Thomas SMETS


Dear,

I need to parse XHTML/HTML files in various ways:
- Removing comments and JavaScript is the first issue.
- Retrieving the list of fields to submit is my next item (todo).

Any idea where I could find this already done?

T,



Walter Dörwald

Thomas said:

Dear,

I need to parse XHTML/HTML files in various ways:
- Removing comments and JavaScript is the first issue.
- Retrieving the list of fields to submit is my next item (todo).

Any idea where I could find this already done?

You could try XIST (http://www.livinglogic.de/Python/xist).

Removing comments and JavaScript works like this:

---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Replace comments and JavaScript <script> elements with xsc.Null,
    # which drops them from the resulting tree.
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and (
        unicode(node["type"]) == u"text/javascript" or
        unicode(node["language"]) == u"Javascript"
    ):
        node = xsc.Null
    return node

e = e.mapped(removestuff)

print e.asBytes()
---

Retrieving the list of fields from all forms on a page might look like this:

---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

# For each form on the page, print its action URL followed by the id
# (or, if there is no id, the name) of every input/textarea field.
for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---

This prints:

Fields for http://www.google.com/search
    q
    domains
    sitesearch
    sourceid
    submit

Hope that helps!

Bye,
Walter Dörwald
 

Thomas SMETS


The regular expression to remove scripts from an HTML/XHTML file is simple
enough to write, but it raises a major performance issue...

The following regular expression:
r'(<script(\s*\S+\s*)+</script>)'
takes ages to complete in Python on a simple HTML file: more than 3 minutes
of CPU time on a 150-line HTML file. In Jython it never completes at all and
instead raises a painful RuntimeException: maximum number of ??? reached.

Is the only way out dealing with plain string searches and "match" instead of
regular expressions?
Moreover, Jython is not yet 2.3 compliant, so the advanced regular-expression
features of 2.3 are not available!
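
(A minimal sketch of a workaround, assuming the scripts never contain a
literal "</script>" inside a string: the nested quantifier (\s*\S+\s*)+
gives the regex engine exponentially many ways to split the script body, so
it backtracks catastrophically, whereas a single non-greedy .*? that stops
at the first closing tag does not. The strip_scripts helper name below is
only illustrative.)

---
import re

# Non-greedy: consume up to the first closing tag instead of backtracking
# through every possible split of the script body. re.DOTALL lets "." span
# newlines; re.IGNORECASE also matches <SCRIPT>.
script_re = re.compile(r'<script\b[^>]*>.*?</script>',
                       re.DOTALL | re.IGNORECASE)
comment_re = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_scripts(markup):
    return comment_re.sub('', script_re.sub('', markup))

print strip_scripts('<p>Hi</p><script type="text/javascript">var x = 1;</script>')
---

Non-greedy quantifiers, re.DOTALL and re.IGNORECASE all predate Python 2.3,
so this should not depend on any 2.3-only regular-expression features.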

T,




Thomas SMETS wrote:
|
| Dear,
|
| I need to parse XHTML/HTML files in various ways:
| - Removing comments and JavaScript is the first issue.
| - Retrieving the list of fields to submit is my next item (todo).
|
| Any idea where I could find this already done?
|
| T,
|
|

--
Thomas SMETS
Bruxelles
@: (e-mail address removed)
 
