I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>m

Discussion in 'Python' started by Stéphane Klein, Mar 29, 2010.

  1. Hi,

    I'm working on an HTML cleaner.

    I export OpenOffice.org documents to HTML.
    Next, I would like to clean up the exported HTML files:

    * remove comments
    * remove styles
    * remove dispensable tags
    * ...

    Some difficulties:

    * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
    * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>

    To do this, I use lxml and pyquery.

    Question:

    * Are there any XML helper tools in Python for this kind of processing? I
    looked on PyPI and found nothing.

    If you confirm that such tools don't exist, I may publish a helper
    package to do this "clean" processing.

    Thanks for your help,
    Stephane
     
    Stéphane Klein, Mar 29, 2010
    #1

  2. Stéphane Klein

    Harishankar Guest

    Re: I'm looking for html cleaner. Example : convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>…

    On Mon, 29 Mar 2010 10:12:09 +0200, Stéphane Klein wrote:

    > Hi,
    >
    > I'm working on an HTML cleaner.
    >
    > I export OpenOffice.org documents to HTML. Next, I would like to clean
    > up the exported HTML files:
    >
    > * remove comments
    > * remove styles
    > * remove dispensable tags
    > * ...
    >
    > Some difficulties:
    >
    > * convert <p>my text <span>foo</span> bar</p> => <p>my text foo bar</p>
    > * convert <h1><span><font>my title</font></span></h1> => <h1>my title</h1>
    >
    > To do this, I use lxml and pyquery.
    >
    > Question:
    >
    > * Are there any XML helper tools in Python for this kind of processing? I
    > looked on PyPI and found nothing.
    >
    > If you confirm that such tools don't exist, I may publish a helper
    > package to do this "clean" processing.
    >
    > Thanks for your help,
    > Stephane



    Take a look at htmllib and HTMLParser (two different modules) in the
    Python 2 standard library.

    In Python 3.x the equivalent is html.parser (htmllib was removed).

    You can use these to pick specific tags out of HTML documents. If you
    want something more advanced, consider using an XML toolkit.
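
As a rough illustration of the html.parser route, here is a minimal sketch for Python 3. The names TagStripper, clean_html, UNWRAP_TAGS and DROP_TAGS are made up for this example, and the tag sets are only an assumption about what counts as "dispensable" in an OpenOffice.org export. It drops comments and style blocks and unwraps span and font while keeping their text. As a simplification it also discards all attributes.

```python
from html.parser import HTMLParser

# Hypothetical tag sets, chosen only for this example.
UNWRAP_TAGS = {"span", "font"}   # tag is dropped, inner content is kept
DROP_TAGS = {"style", "script"}  # tag and its whole content are dropped


class TagStripper(HTMLParser):
    """Re-emit HTML while dropping comments, styles and wrapper tags."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.drop_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.drop_depth += 1
        elif self.drop_depth == 0 and tag not in UNWRAP_TAGS:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif self.drop_depth == 0 and tag not in UNWRAP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept unless listed above.
        if self.drop_depth == 0 and tag not in UNWRAP_TAGS | DROP_TAGS:
            self.out.append(f"<{tag}/>")

    def handle_data(self, data):
        if self.drop_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.drop_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.drop_depth == 0:
            self.out.append(f"&#{name};")

    # handle_comment is not overridden: the base class ignores comments,
    # so they never reach the output.


def clean_html(markup):
    """Return `markup` with comments, styles and wrapper tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling clean_html('<h1><span><font>my title</font></span></h1>') should give back '<h1>my title</h1>'. Because this works on a stream of events and not on a tree, it cannot easily decide things like "remove this tag only if it ended up empty".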





    --
    Harishankar (http://harishankar.org http://literaryforums.org)
     
    Harishankar, Mar 29, 2010
    #2

  3. Stéphane Klein

    John Nagle Guest

    Stéphane Klein wrote:
    > Hi,
    >
    > I'm working on an HTML cleaner.
    >
    > I export OpenOffice.org documents to HTML.
    > Next, I would like to clean up the exported HTML files:
    >
    > * remove comments
    > * remove styles
    > * remove dispensable tags
    > * ...


    Try parsing with html5lib ("http://code.google.com/p/html5lib/"), which
    is the closest thing to a good parser available for Python. It's basically
    a reference implementation of HTML5, including all the handling of bad HTML.

    Once you have a tree, write something that walks the tree and removes
    any empty element whose tag is on a list of tags which do nothing when
    empty. Then regenerate HTML from the tree.

    Or just use HTML Tidy: "http://www.w3.org/People/Raggett/tidy/"
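
The tree-walking suggestion above could look roughly like the sketch below. To stay within the standard library it uses xml.etree.ElementTree in place of html5lib, so it assumes the input is already well-formed XML, which real exported HTML often is not. The names clean_tree, UNWRAP_TAGS and REMOVE_IF_EMPTY are hypothetical, and both tag sets are guesses to adapt. Unwrapping span and font is an extra step beyond removing empty tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag sets, chosen only for this example.
UNWRAP_TAGS = {"span", "font"}            # wrappers to dissolve
REMOVE_IF_EMPTY = {"p", "b", "i", "div"}  # tags that do nothing when empty


def _append_text(parent, index, text):
    """Attach `text` just before the position `index` inside `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def clean_tree(elem):
    """Clean the descendants of `elem` in place; `elem` itself is kept."""
    i = 0
    while i < len(elem):
        child = elem[i]
        clean_tree(child)  # depth first, so children are already clean
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if child.tag in UNWRAP_TAGS:
            # Replace the wrapper by its text, its children, then its tail.
            grandchildren = list(child)
            elem.remove(child)
            _append_text(elem, i, child.text)
            for offset, grandchild in enumerate(grandchildren):
                elem.insert(i + offset, grandchild)
            _append_text(elem, i + len(grandchildren), child.tail)
            i += len(grandchildren)
        elif child.tag in REMOVE_IF_EMPTY and is_empty:
            # Drop the element but keep any text that followed it.
            elem.remove(child)
            _append_text(elem, i, child.tail)
        else:
            i += 1
    return elem
```

For example, parse a fragment with ET.fromstring(...), pass the root to clean_tree, then regenerate markup with ET.tostring(root, encoding="unicode"). The fiddly part is the text handling: ElementTree stores text before the first child in .text and text after an element in that element's .tail, so removing an element means moving both to a neighbour.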

    John Nagle
     
    John Nagle, Mar 30, 2010
    #3
