HTML 4 BDTD?

Discussion in 'Java' started by John W. Kennedy, Jan 30, 2007.

  1. I'm in the process of de-frame-ing a website with a couple thousand
    pages of static HTML, and I've been building a tool that works pretty
    well, based on javax.swing.text.html.parser technology, which I've never
    used before. Large parts of the website are HTML 3.2, and everything's
    just ducky. But there are a good many pages that are HTML 4.0, and my
    program goes completely ca-ca on them, because I'm stuck with only the
    built-in html32.bdtd file.

    A) Is there any good reason that Sun didn't make up an html401.bdtd file
    yonks ago?

    B) Has anyone an html401.bdtd file to share?

    C) Is there any other solution available? (No XML-based tool is going to
    come close to handling this stuff. It's all hand-written, not by me,
    and it was painful enough doing various text-based global fixes to make
    it parse properly as 3.2: lots of <b><i>blah</b></i> and that sort of
    thing.)

    --
    John W. Kennedy
    "The blind rulers of Logres
    Nourished the land on a fallacy of rational virtue."
    -- Charles Williams. "Taliessin through Logres: Prelude"
    John W. Kennedy, Jan 30, 2007
    #1
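For readers who have not used the javax.swing.text.html.parser approach described in post #1, a minimal sketch of driving it might look like the following. It relies on whatever default DTD `ParserDelegator` loads in the JDK (the built-in HTML 3.2 one the post refers to). The class name `SwingParseSketch` and the method `tagNames` are invented for illustration and are not the poster's tool.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists the tags the JDK's built-in HTML parser reports.
class SwingParseSketch {

    /** Returns the name of every start tag and simple (empty) tag seen. */
    static List<String> tagNames(String html) {
        final List<String> names = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                names.add(tag.toString());
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore any charset
            // declared inside the document itself.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tagNames("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

A real tool would build a tree from these callbacks (plus `handleEndTag` and `handleText`) rather than a flat list; the flat list only keeps the sketch short.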

  2. Daniel Pitts (Guest)

    Check out JTidy (or just tidy). It'll clean up your HTML. It might
    even be able to translate it to XHTML, and THEN you can use XML
    parsing no problem :)

    The standard Java HTML parsing is very lacking (as you have discovered).
    At the very worst, you may want to work with regex instead.

    Oh, and see if Apache has anything (maybe in Jakarta?); they tend to
    have useful utilities of the most surprising type :)

    Hope this helps,
    Daniel.
    Daniel Pitts, Jan 30, 2007
    #2
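As a rough illustration of the "at the very worst" regex fallback mentioned in post #2, here is a sketch that strips `target` attributes from tags, which is one plausible step when de-framing links. That the site's links actually carry `target` attributes is an assumption, and `TargetStripper` and `stripTargets` are hypothetical names. The pattern is deliberately crude: it also matches text that merely looks like an attribute outside a tag.

```java
import java.util.regex.Pattern;

// Hypothetical regex fallback: removes target=... attributes from markup.
class TargetStripper {

    // Leading whitespace, then target=value where the value is
    // double-quoted, single-quoted, or unquoted.
    private static final Pattern TARGET_ATTR = Pattern.compile(
            "\\s+target\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the input with every target attribute removed. */
    static String stripTargets(String html) {
        return TARGET_ATTR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTargets("<a href=\"x.html\" target=\"main\">X</a>"));
    }
}
```

This is fine for one-off global fixes of the kind described in post #1, but it has no notion of document structure, so a parser-based tool remains the safer choice for major surgery.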

  3. Daniel Pitts wrote:
    > Check out JTidy (or just tidy). It'll clean up your HTML.


    Yes, but I'm not trying to tidy it (though my current code does that as
    a side effect, since I'm slurping each page into a tree and re-emitting
    it in clean HTML4); I'm trying to do major surgery on the content of
    every page, so that I can de-frame the whole website, which, although
    elegant-looking to the user, has become a nightmare of frame-juggling
    whenever I have to link from one page to another that is not a notional
    child, parent, or sibling. The last thing I want to do is degrade the
    existing HTML 4.0 pages (the majority of which are semantically
    marked-up, thoroughly CSSed, and W3C verified) to HTML 3.2. I also want
    a stable tool for future use, so that I can revise link menus in the
    event of a new branch on the site's conceptual tree; otherwise, I'll
    have to use SHTML for every single page.

    > It might
    > even be able to translate it to XHTML, and THEN you can use XML
    > parsing no problem :)


    Maybe I'll have to do that, but I'm annoyed that I won't be able to use
    real XHTML, but only XHTML-like HTML, thanks to Microsoft stabbing the
    W3C in the back. (It's a public-oriented website, so I can't say "Use
    Firefox", however much I'd like to.) I suppose I could make up the site
    in XHTML and then XSLT it to an HTML4 equivalent.

    Damn Microsoft! (And damn Apple for their cowardly acquiescence!)

    John W. Kennedy, Jan 30, 2007
    #3
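The idea floated in post #3, authoring in XHTML and transforming to an HTML 4 equivalent with XSLT, could be sketched with the JDK's own `javax.xml.transform` API roughly as follows. The stylesheet, the class name `XhtmlToHtml4`, and the choice of the HTML 4.01 Strict doctype are assumptions made for illustration. The stylesheet only drops the XHTML namespace and relies on the `html` output method; comments and processing instructions are not copied.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical converter: re-emits namespaced XHTML as plain HTML.
class XhtmlToHtml4 {

    // Rebuilds each element without its namespace so that the "html"
    // output method treats it as an ordinary HTML element.
    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='html' indent='no'"
          + " doctype-public='-//W3C//DTD HTML 4.01//EN'"
          + " doctype-system='http://www.w3.org/TR/html4/strict.dtd'/>"
          + "<xsl:template match='*'>"
          + "<xsl:element name='{local-name()}'>"
          + "<xsl:copy-of select='@*'/>"
          + "<xsl:apply-templates/>"
          + "</xsl:element>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Transforms a well-formed XHTML string into HTML-syntax output. */
    static String toHtml4(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml4(
                "<html xmlns='http://www.w3.org/1999/xhtml'>"
              + "<body><p>one<br/>two</p></body></html>"));
    }
}
```

Input that carries an XHTML DOCTYPE may cause the XML parser to try to fetch the external DTD, so a real pipeline would probably want an entity resolver or a local catalog.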
  4. Daniel Pitts wrote:
    > Check out JTidy (or just tidy).


    On investigation, it appears that it can be used as a library to read
    HTML into a DOM. I'm more or less doing that now, so it should be
    relatively straightforward to slot it in where I am now using
    javax.swing.text.html, etc.

    John W. Kennedy, Jan 30, 2007
    #4
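Once a page has been read into a DOM, as post #4 describes, the surgery itself can be written against the standard `org.w3c.dom` interfaces. The sketch below removes `target` attributes from links as an example edit. To stay self-contained it builds the `Document` with the JDK's XML parser from well-formed markup; in the setup discussed in the thread the document would instead come from JTidy (its `Tidy.parseDOM` method returns an `org.w3c.dom.Document`). Whether JTidy's DOM implementation supports every call used here is an assumption, and `DomSurgerySketch` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical example of editing a parsed page through the W3C DOM API.
class DomSurgerySketch {

    /** Removes the target attribute from every <a>; returns how many changed. */
    static int removeLinkTargets(Document doc) {
        NodeList anchors = doc.getElementsByTagName("a");
        int changed = 0;
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("target")) {
                anchor.removeAttribute("target");
                changed++;
            }
        }
        return changed;
    }

    /** Stand-in document source for this sketch: parses well-formed markup. */
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><a href='a.html' target='main'>A</a></body></html>");
        System.out.println(removeLinkTargets(doc) + " link(s) changed");
    }
}
```

Keeping the edit logic dependent only on `org.w3c.dom` means the parser in front of it can be swapped without touching the surgery code.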
  5. Rogan Dawes (Guest)

    John W. Kennedy wrote:
    > Daniel Pitts wrote:
    >> Check out JTidy (or just tidy).
    >
    > On investigation, it appears to be able to be used as a library to read
    > HTML into a DOM. I'm more or less doing that now, so it should be
    > relatively straightforward to slot it in where I am using import
    > javax.swing.text.html, etc.

    Also consider htmlparser (htmlparser.sourceforge.net)

    Rogan
    Rogan Dawes, Feb 1, 2007
    #5
  6. Rogan Dawes wrote:
    > John W. Kennedy wrote:
    >> Daniel Pitts wrote:
    >>> Check out JTidy (or just tidy).
    >>
    >> On investigation, it appears to be able to be used as a library to
    >> read HTML into a DOM. I'm more or less doing that now, so it should be
    >> relatively straightforward to slot it in where I am using import
    >> javax.swing.text.html, etc.
    >
    > Also consider htmlparser (htmlparser.sourceforge.net)

    I looked at it, but liked the feel of JTidy better.

    In practice, JTidy (as an in-program DOM-building tool, not as a
    standalone application) has worked fine. I plugged it into my program,
    replacing the javax.swing.text.html tools, in a few hours, and I can now
    read HTML 4 and HTML 3.2 equally well. The end of the project to
    de-frame the website and get all the pages 4.01-clean is now in sight.

    I do wish the Javadoc were a little more complete. In a few places, I had
    to look at the source.

    John W. Kennedy, Feb 2, 2007
    #6