Trimming X/HTML files

Thomas SMETS


Dear,

I need to parse XHTML/HTML files in various ways:
- Removing comments and JavaScript is the first issue.
- Retrieving the list of fields to submit is my next item (todo).

Any idea where I could find this already done?

T,



Walter Dörwald

Thomas said:

Dear,

I need to parse XHTML/HTML files in various ways:
- Removing comments and JavaScript is the first issue.
- Retrieving the list of fields to submit is my next item (todo).

Any idea where I could find this already done?

You could try XIST (http://www.livinglogic.de/Python/xist).

Removing comments and JavaScript works like this:

---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    # Replace comments and JavaScript <script> elements with xsc.Null,
    # which drops them from the resulting tree.
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and (
        unicode(node["type"]) == u"text/javascript" or
        unicode(node["language"]) == u"Javascript"
    ):
        node = xsc.Null
    return node

e = e.mapped(removestuff)

print e.asBytes()
---

Retrieving the list of fields from all forms on a page might look like this:

---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

# For each form on the page, print its action URL followed by the id
# (or, if there is no id, the name) of every input/textarea field.
for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---

This prints:

Fields for http://www.google.com/search
    q
    domains
    sitesearch
    sourceid
    submit

Hope that helps!

Bye,
Walter Dörwald
 

Thomas SMETS


The regular expression to remove scripts from an HTML/XHTML file is simple
enough to write, but it raises a major performance issue...

The following regular expression:
r'(<script(\s*\S+\s*)+</script>)'
takes ages to complete in Python on a simple HTML file: more than 3 minutes
of CPU time on a 150-line HTML file. In Jython it never completes at all and
instead raises a painful RuntimeException: maximum number of ??? reached.

Is the only way out dealing with plain string searches and "match" instead of
regular expressions?
Moreover, Jython is not yet 2.3 compliant, so the advanced regular-expression
features of 2.3 are not available!
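
(A minimal sketch of a workaround, assuming the scripts never contain a
literal "</script>" inside a string: the nested quantifier (\s*\S+\s*)+
gives the regex engine exponentially many ways to split the script body, so
it backtracks catastrophically, whereas a single non-greedy .*? that stops
at the first closing tag does not. The strip_scripts helper name below is
only illustrative.)

---
import re

# Non-greedy: consume up to the first closing tag instead of backtracking
# through every possible split of the script body. re.DOTALL lets "." span
# newlines; re.IGNORECASE also matches <SCRIPT>.
script_re = re.compile(r'<script\b[^>]*>.*?</script>',
                       re.DOTALL | re.IGNORECASE)
comment_re = re.compile(r'<!--.*?-->', re.DOTALL)

def strip_scripts(markup):
    return comment_re.sub('', script_re.sub('', markup))

print strip_scripts('<p>Hi</p><script type="text/javascript">var x = 1;</script>')
---

Non-greedy quantifiers, re.DOTALL and re.IGNORECASE all predate Python 2.3,
so this should not depend on any 2.3-only regular-expression features.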

T,




Thomas SMETS wrote:
|
| Dear,
|
| I need to parse XHTML/HTML files in various ways:
| - Removing comments and JavaScript is the first issue.
| - Retrieving the list of fields to submit is my next item (todo).
|
| Any idea where I could find this already done?
|
| T,
|
|

--
Thomas SMETS
Bruxelles
@: (e-mail address removed)
 
