Loading a simple XHTML transitional document into an org.w3c.dom.Document

Ion Freeman

Hi!
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, I do

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(input);

Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error. I tried the
EntityResolver from http://forums.sun.com/thread.jspa?threadID=5244492,
but that just gives me a MalformedURLException. Either way, my parse
fails.

I'm sure that at least tens of thousands of people have written code
to do this, but I can't find a (working) reference online. I think
most of my XML parsing happened when the W3C would just give the DTDs
out -- I understand that they found that unworkable, but I still need
to parse my document.

How should I be doing this?

Thanks!

Ion
 
Ion Freeman

Thanks, markspace. I did try Axiom, but it looks like I have to figure
out how to do everything all over again -- like finding an element by id
and replacing it, which is all I really want to accomplish. I'd really just
like to get the Xerces parser to load my DTDs locally, as opposed to
erroring out on the W3C site.
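
For the "find an element by id and replace it" step, here is a minimal sketch using only the DOM and XPath APIs that ship with the JDK. It assumes the document has already been parsed into an org.w3c.dom.Document, and it looks the element up with an XPath expression on the id attribute rather than Document.getElementById, because getElementById only sees attributes the parser knows to be of type ID (which normally requires the DTD to have been read). The class name ReplaceById and both helper methods are hypothetical names chosen for illustration, and the id is assumed not to contain a single quote.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ReplaceById {
    // Parse a string with the default (non-namespace-aware) JAXP parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Replace the first element whose id attribute equals the given id with a
    // new element of the given tag name and text. Returns false if no element
    // matched. Assumption: id contains no single quote.
    static boolean replaceById(Document doc, String id, String newTag, String text)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Element old = (Element) xpath.evaluate(
                "//*[@id='" + id + "']", doc, XPathConstants.NODE);
        if (old == null) {
            return false;
        }
        Element replacement = doc.createElement(newTag);
        replacement.setAttribute("id", id);
        replacement.setTextContent(text);
        old.getParentNode().replaceChild(replacement, old);
        return true;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><div id=\"target\">old</div></body></html>");
        replaceById(doc, "target", "p", "new");
        Element body = (Element) doc.getElementsByTagName("body").item(0);
        System.out.println(body.getFirstChild().getNodeName());
    }
}
```

The XPath lookup walks the whole tree, which is fine for a single page but would be worth revisiting for very large documents.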

Ion said:
Hi!
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, [snip .....]
Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error.

There might be some clues here:

http://www.javalobby.org/java/forums/t105916.html
 
markspace

Ion said:
Thanks, markspace. I did try Axiom, but it looks like I have to figure
out how to do everything all over again -- like finding an element by id
and replacing it, which is all I really want to accomplish. I'd really just
like to get the Xerces parser to load my DTDs locally, as opposed to
erroring out on the W3C site.

Ion said:
Hi!
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, [snip .....]
Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error.
There might be some clues here:

http://www.javalobby.org/java/forums/t105916.html


I tried a quick little program of my own, which had a different problem
than yours did, although mine still threw a fatal error. My takeaway
from that error was that the Xerces parser just isn't going to parse the
looser syntax of a transitional HTML document. You'll have to use a
special one. The parsers built into Java all seem to be XML and nothing
else; they don't allow for HTML's funky syntax. I'm guessing, but in
the small amount of work I did that seemed to be the case.
 
Mike Schilling

Ion said:
Hi!
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, I do

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(input);

Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error. I tried the
EntityResolver from
http://forums.sun.com/thread.jspa?threadID=5244492, but that just
gives me a MalformedURLException. Either way, my parse fails.

I'm sure that at least tens of thousands of people have written code
to do this, but I can't find a (working) reference online. I think
most of my XML parsing happened when the W3C would just give the DTDs
out -- I understand that they found that unworkable, but I still need
to parse my document.

How should I be doing this?

You should be able to solve this with an entity resolver that returns an
input source containing the right DTD text. They're not that difficult to
construct; just recognize the URL and return an InputSource wrapping a
StringReader or ByteArrayInputStream. Return null for any URL you don't
recognize.
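
A minimal sketch of that kind of resolver, under stated assumptions: the DTD text is held in a string (here a hypothetical one-line stub, STUB_DTD, that declares only the nbsp entity; a real program would hold the full XHTML DTD text and would also need to recognize the three .ent files it pulls in), and the class name InMemoryDtdResolver is made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class InMemoryDtdResolver {
    static final String XHTML_URL =
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Hypothetical stand-in for the real DTD text; declares one entity only.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Recognize the one URL we know and answer from memory; returning null
    // tells the parser to fall back to its default resolution.
    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        if (XHTML_URL.equals(systemId)) {
            InputSource source = new InputSource(new StringReader(STUB_DTD));
            source.setSystemId(systemId);
            return source;
        }
        return null;
    };

    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setEntityResolver(RESOLVER);
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \""
                + XHTML_URL + "\">"
                + "<html><body><p>a&nbsp;b</p></body></html>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Because the resolver answers for the DTD URL itself, the parser has no reason to contact the W3C server for it.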

If you know for a fact that the parser is Xerces (it's the default in Java
1.5 and later), you could try setting the Xerces-specific feature to ignore
DTDs. http://xml.org/sax/features/external-parameter-entities suggests that
you set http://xml.org/sax/features/external-parameter-entities to
"false", though we set
"http://apache.org/xml/features/nonvalidating/load-dtd-grammar" and
"http://apache.org/xml/features/nonvalidating/load-external-dtd" to false.
Be sure to call setValidating(false) too, though I'm pretty sure that's the
default anyway.
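
A short sketch of the feature-based approach, assuming the parser behind DocumentBuilderFactory is Xerces or the Xerces-derived parser bundled with the JDK (other parsers may reject these feature names with a ParserConfigurationException). The class name SkipExternalDtd is hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SkipExternalDtd {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        // Xerces-specific features: do not build a DTD grammar and do not
        // fetch the external DTD named in the DOCTYPE.
        dbf.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        dbf.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><head><title>simple document</title></head>"
                + "<body><p>a simple paragraph</p></body></html>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

One trade-off to keep in mind: with the DTD skipped, named entities that are defined only in the DTD, such as &nbsp;, are not available to the parser.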
 
Mike Schilling

markspace said:
I tried a quick little program of my own, which had a different
problem than yours did, although mine still threw a fatal error. My
takeaway from that error was that the Xerces parser just isn't going
to parse the looser syntax of a transitional HTML document. You'll
have to use a special one. The parsers built into Java all seem to be
XML and nothing else; they don't allow for HTML's funky syntax. I'm
guessing, but in the small amount of work I did that seemed to be the
case.

The original poster did say he's parsing xhtml, which is an XML-compatible
version of html. And DTDs (which is what's causing his problems) are a
standard and supported XML feature.
 
markspace

Mike said:
The original poster did say he's parsing xhtml, which is an XML-compatible
version of html. And DTDs (which is what's causing his problems) are a
standard and supported XML feature.


Theoretically, yes, but he said he was parsing a transitional document,
and I assume that means "web page." For my test, I used the home page
of http://cnn.com. It has 42 errors, according to the validator at
w3c.org. And Xerces barfed on stuff that the W3C validator passed.

My takeaway: transitional documents aren't. The OP will need a parser
specially built to deal with common errors that appear on web pages.
 
Arne Vajhøj

Ion said:
I'm just trying to do the simplest thing in the world. Where input
is a java.io.File that contains a transitional XHTML 1.0 file, I do

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(input);

Unfortunately, this tries to pull the DTD from the W3C, and they
didn't like that. So, they give me a 503 error. I tried the
EntityResolver from http://forums.sun.com/thread.jspa?threadID=5244492,
but that just gives me a MalformedURLException. Either way, my parse
fails.

I'm sure that at least tens of thousands of people have written code
to do this, but I can't find a (working) reference online. I think
most of my XML parsing happened when the W3C would just give the DTDs
out -- I understand that they found that unworkable, but I still need
to parse my document.

How should I be doing this?

Download the DTD and the 3 ENT files to your hard drive and tell
the parser to use those.

See code below.

Arne

=======================================================

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XhtmlParse {
    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n"
                + "<html>\r\n<head>\r\n<title>simple document</title>\r\n</head>\r\n"
                + "<body>\r\n<p>a simple paragraph</p>\r\n</body>\r\n</html>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Redirect DTD and entity lookups to local copies.
        db.setEntityResolver(new DTDHandler());
        Document doc = db.parse(new InputSource(new StringReader(xml)));
    }
}

class DTDHandler implements EntityResolver {
    private static final String BASE = "http://www.w3.org/TR/xhtml1/DTD/";

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        // Constant-first equals() avoids a NullPointerException on a null systemId.
        if ((BASE + "xhtml1-transitional.dtd").equals(systemId)) {
            return new InputSource("C:\\xhtml1-transitional.dtd");
        } else if ((BASE + "xhtml-lat1.ent").equals(systemId)) {
            return new InputSource("C:\\xhtml-lat1.ent");
        } else if ((BASE + "xhtml-symbol.ent").equals(systemId)) {
            return new InputSource("C:\\xhtml-symbol.ent");
        } else if ((BASE + "xhtml-special.ent").equals(systemId)) {
            return new InputSource("C:\\xhtml-special.ent");
        } else {
            // Unknown entity: let the parser use its default resolution.
            return null;
        }
    }
}
 
Arne Vajhøj

markspace said:
Theoretically, yes, but he said he was parsing a transitional document,
and I assume that means "web page." For my test, I used the home page
of http://cnn.com. It has 42 errors, according to the validator at
w3c.org. And Xerces barfed on stuff that the W3C validator passed.

My takeaway: transitional documents aren't. The OP will need a parser
specially built to deal with common errors that appear on web pages.

CNN does not claim to be XHTML 1.0 Transitional.

CNN claims to be HTML 4.01 Transitional.

Difference.

There are web pages and there are web pages.

If something is valid XHTML, then it can be parsed
by an XML parser.

If something claims to be XHTML but is actually not
valid XHTML, then it may not be parseable by an
XML parser.
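
To illustrate the distinction with a small sketch (not code posted in this thread): the JDK's XML parser accepts a fragment written with XML syntax and rejects the same fragment written with HTML-style unclosed tags. The class name WellFormedCheck and the helper parses are hypothetical, and the fragments carry no DOCTYPE so that no DTD lookup is involved. This only checks well-formedness, which is a weaker condition than validity against the XHTML DTD.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    // Returns true if the text parses as XML, false if the parser reports
    // a fatal (well-formedness) error.
    static boolean parses(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // XML-style empty element: accepted.
        System.out.println(parses("<p>line<br/>break</p>"));
        // HTML-style unclosed br: rejected by an XML parser.
        System.out.println(parses("<p>line<br>break</p>"));
    }
}
```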

Arne
 
markspace

Arne said:
CNN does not claim to be XHTML 1.0 Transitional.

CNN claims to be HTML 4.01 Transitional.

Difference.


Hmm, Wikipedia said they were the same. Care to elaborate?

"XHTML 1.0 Transitional is the equivalent of HTML 4.01 Transitional, and
includes the presentational elements (such as center, font and strike)
excluded from the strict version."

http://en.wikipedia.org/wiki/XHTML
 
Tom Anderson

markspace said:
Ah, OK, thanks for that link. I thought the two had converged. Not
completely I see.

They haven't, and they won't.

HTML is SGML (kinda), and XHTML is XML, and despite what some have
claimed, XML is not a subset of SGML. As a minor but concrete example, the
form "<br/>" means something completely different in XML and SGML - in
XML, it's an empty br element, and in SGML, it's a 'null-end-tag-enabling
start tag' - the slash acts as the end of the tag, with the element
containing all the text until the next slash. So this:

<br/>I am a huge fan of AC/DC/

means a br element containing the text ">I am a huge fan of AC" in SGML,
followed by the text "DC/".

It is possible to write text which is both valid XHTML and valid HTML, but
it takes quite a lot of effort, and i think it means you can't use certain
constructs at all. Are you allowed to write <a></a> for an element which
is declared as being EMPTY in XML? Happily, since nobody on the planet
actually cares if HTML is syntactically valid, you can just write XHTML
and feed it to browsers with an HTML content-type and they'll mostly
happily scarf it down.

tom
 
Arne Vajhøj

markspace said:
Ah, OK, thanks for that link. I thought the two had converged. Not
completely I see.

They did their best.

But HTML 4.01 had to be HTML compatible, and that also means
non-XML compatible in some cases.

Arne
 
Arne Vajhøj

Tom said:
They haven't, and they won't.

HTML is SGML (kinda),

It is. Valid HTML can be parsed by an SGML parser.

Tom said:
and XHTML is XML, and despite what some have
claimed, XML is not a subset of SGML.

some ?

You mean like in the first few lines of the XML specification ?

http://www.w3.org/TR/2008/REC-xml-20081126/

<quote>
Abstract

The Extensible Markup Language (XML) is a subset of SGML that is
completely described in this document. Its goal is to
</quote>

Tom said:
As a minor but concrete example,
the form "<br/>" means something completely different in XML and SGML -
in XML, it's an empty br element, and in SGML, it's a
'null-end-tag-enabling start tag' - the slash acts as the end of the
tag, with the element containing all the text until the next slash.

It has one meaning in HTML and another meaning in XML.

I believe that different SGML applications can enable/disable
various SGML features.

Tom said:
It is possible to write text which is both valid XHTML and valid HTML,
but it takes quite a lot of effort, and i think it means you can't use
certain constructs at all.

Yep.

Yep.

Arne
 
Tom Anderson

Arne said:
It is. Valid HTML can be parsed by an SGML parser.

Oh, well if you're talking about *valid* HTML, that's an entirely
different beast!

It's worth noting that HTML 5 will not be SGML:

http://dev.w3.org/html5/spec/Overview.html

Arne said:
some ?

You mean like in the first few lines of the XML specification ?

http://www.w3.org/TR/2008/REC-xml-20081126/

<quote>
Abstract

The Extensible Markup Language (XML) is a subset of SGML that is completely
described in this document. Its goal is to
</quote>

A very good example. Despite being in the spec, this is a lie.

Arne said:
It has one meaning in HTML and another meaning in XML.

I believe that different SGML applications can enable/disable various
SGML features.

True. I am by no means an SGML expert, but i think HTML leaves the
SHORTTAG features on:

http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html

And that means the NET-enabling start tags are, formally, operational in
HTML, unless disabled by an internal subset in the document, which i've
never seen.

tom
 
Tom Anderson

Interesting.

HTML 5 parsers will be from scratch then.

No, since no current browser parses HTML using an SGML parser. They're all
handwritten anyway. AIUI, the only SGML-based HTML parsers in production
are the online validators!

The XML specification lying about what XML is ????

Correct.

Unless <foo/> can be a legal way of writing an empty foo element
(including when foo is declared with a content model other than EMPTY) in
SGML, which i don't believe it can.

I think SGML also doesn't allow colons in names, which XML does. BICBW.

There is a thing called Web SGML, which is a slightly modified version of
SGML which i think *is* a superset of XML. But basically, that was
invented so that XML could be retrofitted into the SGML framework; it's
not 'proper' SGML.

I find this stuff hard to get my head round because SGML is far
more customisable than XML - as well as the DTD, there's an 'SGML
declaration', which can do things like define what character is used to
mark the start of tags (hardwired to < in XML) and so on. This is very
powerful, but ludicrously complex. It can in fact be used to alter SGML to
the point that it gets very close to XML - and Web SGML enables it to go
the remainder of the distance.

tom
 
