I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>m

Stéphane Klein · Mar 29, 2010

Hi,

I am working on an HTML cleaner.

I export OpenOffice.org documents to HTML.
Next, I would like to clean these exported HTML files:

* remove comments
* remove styles
* remove dispensable tags
* ...

Some difficulties:

* convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
* convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

To do this, I use lxml and pyquery.

Question:

* Are there any XML helper tools in Python for this kind of processing? I have
searched PyPI and found nothing.

If you confirm that such tools don't exist, I may publish a helper
package to do this "clean" processing.

Thanks for your help,
Stephane

Harishankar · Mar 29, 2010

Take a look at htmllib and HTMLParser (two different modules) in the
Python standard library.

In Python 3.x, HTMLParser is called html.parser (htmllib was removed).

You can use this to parse out specific tags from HTML documents. If you
want something more advanced, consider using XML.
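
As a rough illustration of the html.parser route, here is a minimal sketch for Python 3 that handles the two examples from the question. The names Cleaner, clean, UNWRAP, DROP_CONTENT and KEEP_ATTRS are made up for this example, and the choice of which tags to unwrap and which attributes to keep is an assumption, not something the original post specifies.

```python
from html import escape
from html.parser import HTMLParser

# Assumed policy for this sketch (adjust to taste):
UNWRAP = {"span", "font"}            # drop the tag, keep its text
DROP_CONTENT = {"style", "script"}   # drop the tag and everything inside it
KEEP_ATTRS = {"href", "src", "alt"}  # every other attribute is discarded


class Cleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0  # > 0 while inside a <style> or <script> block

    def _open_tag(self, tag, attrs):
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skipping += 1
        elif not self.skipping and tag not in UNWRAP:
            self._open_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit once, with no end tag.
        if not self.skipping and tag not in UNWRAP and tag not in DROP_CONTENT:
            self._open_tag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping and tag not in UNWRAP:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def clean(html_text):
    parser = Cleaner()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(clean("<h1><span><font>my title</font></span></h1>"))
# prints: <h1>my title</h1>
```

This works on a stream of tags rather than a tree, so it cannot decide anything based on an element's children (for example, removing a paragraph only when it is empty).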

John Nagle · Mar 30, 2010

Try parsing with the HTML5 parser html5lib (http://code.google.com/p/html5lib/),
which is the closest thing to a good parser available for Python. It's basically
a reference implementation of HTML5, including all the handling of bad HTML.

Once you have a tree, write something to walk the tree and remove
empty elements whose tags are on a list of tags that do nothing when
empty. Then regenerate HTML from the tree.
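
The tree walk could look roughly like the sketch below. To keep it runnable with only the standard library, xml.etree.ElementTree.fromstring stands in for the parser, so the input here must be well-formed; with real-world HTML the tree would come from html5lib instead. REMOVE_IF_EMPTY, is_empty, prune_empty and clean are hypothetical names, and the tag list is only a guess at what "does nothing when empty".

```python
import xml.etree.ElementTree as ET

# Assumed list of tags that have no visible effect when empty.
REMOVE_IF_EMPTY = {"span", "font", "b", "i", "em", "strong", "p", "div"}


def is_empty(el):
    # No child elements and no text other than whitespace.
    return len(el) == 0 and not (el.text or "").strip()


def prune_empty(parent):
    for child in list(parent):
        prune_empty(child)  # bottom-up, so nested empties collapse too
        if child.tag in REMOVE_IF_EMPTY and is_empty(child):
            # The text after the removed element lives in child.tail;
            # move it to the previous sibling or the parent so it survives.
            tail = child.tail or ""
            idx = list(parent).index(child)
            if idx == 0:
                parent.text = (parent.text or "") + tail
            else:
                prev = parent[idx - 1]
                prev.tail = (prev.tail or "") + tail
            parent.remove(child)


def clean(fragment):
    root = ET.fromstring(fragment)  # stand-in for a real HTML parser
    prune_empty(root)
    return ET.tostring(root, encoding="unicode", method="html")


print(clean("<div><p>keep</p><p><span></span></p><b> </b>tail</div>"))
# prints: <div><p>keep</p>tail</div>
```

Note that whitespace-only elements are treated as empty here, so a lone space inside <b> is lost; whether that is acceptable depends on the documents being cleaned.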

Or just use HTML Tidy: http://www.w3.org/People/Raggett/tidy/

John Nagle
