"Fixing html files"

Discussion in 'XML' started by John Resler, Mar 16, 2005.

  1. John Resler

    John Resler Guest

    Hi all,
    First I want to say I am fully aware of the huge scope of the problem
    of parsing and correcting files of any sort. I have been using the jTidy
    libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
    I use and convert it to xhtml if possible. Not to complain about Tidy,
    it is the only application I'm aware of that does what it does... I am
    just curious if there are any other applications/libraries that perform
    the same function, more completely?
    John Resler, Mar 16, 2005
    #1

  2. John Resler writes:

    > Hi all,
    > First I want to say I am fully aware of the huge scope of the problem
    > of parsing and correcting files of any sort. I have been using the jTidy
    > libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
    > I use and convert it to xhtml if possible. Not to complain about Tidy,
    > it is the only application I'm aware of that does what it does... I am
    > just curious if there are any other applications/libraries that perform
    > the same function, more completely?



    It's hard to quantify "more completely"; tidy does a better job than most.
    An alternative route might be, for example, John Cowan's TagSoup
    http://mercury.ccil.org/~cowan/XML/tagsoup/
    which will let you parse most HTML into an XML processing
    pipeline. It doesn't really do any cleaning up, but once you have it as
    XML you just hit it with enough XSLT of your choice and it should all
    come out looking lovely, er, in theory....
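
    To make the "parse it into an XML pipeline" idea concrete, here is a
    minimal sketch in Python. It is not TagSoup (which is a Java SAX parser);
    it assumes the standard library's html.parser as a stand-in, and the
    names SoupToTree, soup_to_xml and the <fragment> wrapper element are
    made up for illustration. It only knows one repair rule (a new <p>
    closes an open <p>), so treat it as a toy.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take content or a closing tag (illustrative subset).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class SoupToTree(HTMLParser):
    """Feed parser events into an ElementTree, closing tags as needed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.builder.start("fragment", {})  # hypothetical wrapper root
        self.open_tags = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and "p" in self.open_tags:
            self._close_through("p")
        # Valueless attributes (e.g. <input disabled>) get their name as value.
        fixed = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, fixed)
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags that match nothing open are ignored.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.builder.data(data)

    def _close_through(self, tag):
        # Close every open element up to and including `tag`.
        while self.open_tags:
            name = self.open_tags.pop()
            self.builder.end(name)
            if name == tag:
                break

    def finish(self):
        self.close()  # flush any text the parser is still holding
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("fragment")
        return self.builder.close()


def soup_to_xml(html):
    """Return an ElementTree element for a piece of sloppy HTML."""
    parser = SoupToTree()
    parser.feed(html)
    return parser.finish()
```

    The result is an ordinary element tree, so
    ET.tostring(soup_to_xml(text), encoding="unicode") gives well-formed XML
    that can be handed to whatever transformation step comes next. The
    Python standard library has no XSLT processor, so that step would need
    a separate tool.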

    If you are feeling really brave, there's my htmlparse XSLT 2 stylesheet,
    but this is decidedly unsupported.
    http://www.dcarlisle.demon.co.uk/htmlparse.xsl

    David
    David Carlisle, Mar 16, 2005
    #2

  3. Nick Kew

    Nick Kew Guest

    John Resler wrote:
    > Hi all,
    > First I want to say I am fully aware of the huge scope of the
    > problem of parsing and correcting files of any sort. I have been using
    > the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up


    Dave Raggett wrote the original tidy, but it's been some years since
    he was in charge of it.

    > the html I use and convert it to xhtml if possible. Not to complain
    > about Tidy, it is the only application I'm aware of that does what it
    > does... I am just curious if there are any other applications/libraries
    > that perform the same function, more completely?


    libxml2 parses html, including tagsoup html, and gives you SAX or DOM
    APIs on it. You can then serialise that to better HTML or XHTML.
    It's a different approach to tidy, and shares the same fundamental
    problem of having to guess blindly when presented with heavy-duty
    gibberish.
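
    As a rough illustration of what a SAX-style API over tag soup looks
    like, here is a small Python sketch. It does not use libxml2; it assumes
    the standard library's html.parser as a stand-in, and EventLogger and
    sax_events are hypothetical names. The parser reports events in document
    order and the caller decides what to do with them.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record parse events in document order, SAX style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def sax_events(html):
    """Return the list of events seen while parsing `html`."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()  # flush any text still buffered
    return logger.events
```

    Note that this stand-in only reports what is literally in the input:
    for two unclosed <p> elements it emits two start events and no end
    events, so the consumer is left to guess where each one ends. A
    recovering parser such as libxml2's makes that guess for you.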

    A higher-level application based on libxml2 is AccessValet. Its
    real purpose is (X)HTML accessibility analysis and reporting, but it
    will also clean up (x)html. It takes a more brutal approach than
    tidy: instead of attempting to substitute for crap, it strips it.
    So if you take the default - which is strict output - it'll remove
    everything that's deprecated in HTML4/XHTML1, and
    <p align=center><font color=black>some text here<p>some more text
    becomes
    <p>some text here</p><p>some more text</p>

    I wouldn't recommend it over tidy for that particular purpose, but it's
    an option:)
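
    The strip-rather-than-substitute approach can be sketched in a few
    lines of Python. This is not AccessValet's code or algorithm; it is a
    toy built on the standard library's html.parser, the names
    StrictCleaner and clean_strict are made up, and the two tables of
    deprecated markup are small illustrative subsets rather than the full
    HTML 4 lists.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only, not the complete HTML 4.01 deprecation lists.
DEPRECATED_ELEMENTS = {"font", "center", "basefont", "u", "s", "strike"}
DEPRECATED_ATTRS = {"p": {"align"}, "div": {"align"}, "body": {"bgcolor"}}
VOID = {"br", "hr", "img", "input", "link", "meta"}


class StrictCleaner(HTMLParser):
    """Drop deprecated markup and close whatever was left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_ELEMENTS:
            return  # drop the tag itself, keep the text inside it
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and "p" in self.open_tags:
            self._close_through("p")
        dropped = DEPRECATED_ATTRS.get(tag, set())
        kept = "".join(
            f' {k}="{escape(v if v is not None else k)}"'
            for k, v in attrs
            if k not in dropped
        )
        if tag in VOID:
            self.out.append(f"<{tag}{kept} />")
        else:
            self.out.append(f"<{tag}{kept}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # End tags for dropped or never-opened elements are ignored.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close_through(self, tag):
        # Close every open element up to and including `tag`.
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append(f"</{name}>")
            if name == tag:
                break

    def result(self):
        self.close()  # flush any text still buffered
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def clean_strict(html):
    """Return `html` with deprecated markup stripped and tags balanced."""
    cleaner = StrictCleaner()
    cleaner.feed(html)
    return cleaner.result()
```

    Fed the example input above, clean_strict returns
    <p>some text here</p><p>some more text</p>, the same result as quoted
    for the strict default. Anything outside the two small tables passes
    through untouched, which is where a real tool earns its keep.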

    You can also fix markup on the fly when serving it. The state of the
    art there is mod_publisher, at
    http://apache.webthing.com/mod_publisher/
    which is far better than any of the tidy-in-a-webserver options.

    --
    Nick Kew
    Nick Kew, Mar 16, 2005
    #3
