Root element specified by DTD ?

Andy Dingley

What specifies the permitted root element(s) for a document ? HTML,
SGML, XHTML or XML ?


Valid HTML documents need to have a well-known DTD and a doctypedecl in
each document like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

The document's root element is "HTML", and is specified by the
doctypedecl. For HTML and XHTML it's possible that the prose of their
recommendation restricts it too.


My question is, is there any way to author a non-HTML DTD (SGML or XML)
so as to restrict valid documents to only allow a certain subset of
their elements to be used as the root element? Can this restriction be
expressed _entirely_ within a DTD? Is this used within the HTML DTDs ?
(i.e. not just in the doctypedecl)

Is this fragment a valid HTML document ? If not, why isn't it? Just
which part of its definition is forbidding this fragmentary use?
<!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<div>
<p>Foo</p>
</div>


Good tutorial refs on DTDs are also welcome. I don't know anything like
enough on DTD innards.

Thanks
 
Lachlan Hunt

Andy Dingley said:
What specifies the permitted root element(s) for a document ? HTML,
SGML, XHTML or XML ?

Any element may be the root element. There is nothing in the DTD that
says which elements may or may not be the root element. The element
used as the root element is specified by the DOCTYPE, just like in the
example you gave.
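
For the XML case, this corresponds to the validity constraint that the name in the document type declaration must match the root element type. A minimal sketch of that one check, using only Python's standard library, might look like the following. The function name `root_matches_doctype` and the two-element DTD are made up for illustration, and the code does not validate content models at all.

```python
from xml.dom import minidom


def root_matches_doctype(xml_text):
    """Return True if the DOCTYPE name equals the root element's name.

    This is only the root-element-type check; it is not DTD validation.
    """
    doc = minidom.parseString(xml_text)
    if doc.doctype is None:
        return False  # no document type declaration at all
    return doc.doctype.name == doc.documentElement.tagName


# Hypothetical two-element DTD, supplied as an internal subset.
DTD = "<!ELEMENT div (p+)> <!ELEMENT p (#PCDATA)>"

as_div = f"<!DOCTYPE div [{DTD}]><div><p>Foo</p></div>"
as_p = f"<!DOCTYPE p [{DTD}]><p>Foo</p>"
mismatch = f"<!DOCTYPE p [{DTD}]><div><p>Foo</p></div>"

for sample in (as_div, as_p, mismatch):
    print(root_matches_doctype(sample))
```

Both `div` and `p` are accepted as roots under the same DTD; only the declaration in each document picks one.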
> Is this fragment a valid HTML document ?...
> <!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
> <div>
> <p>Foo</p>
> </div>

Yes, it's valid. The validator would have told you that.
 
Chris Morris

Lachlan Hunt said:
Yes, it's valid. The validator would have told you that.

It's valid, but is it a valid *HTML* document? I think not, since
http://www.w3.org/TR/html4/struct/global.html
requires HTML documents to have title elements
"Every HTML document *must* have a TITLE element in the HEAD section."

Those requirements can't be fully enforced at the DTD level, but are
in the specification. It's clearly a valid SGML document, but I think
describing it as HTML is dubious.
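
A requirement stated in the spec's prose can still be checked by an application. As a rough sketch, assuming Python's standard `html.parser` and a hypothetical helper named `has_title_in_head`, a TITLE-in-HEAD check might look like this. Note that this parser does not infer omitted HEAD tags the way an SGML parser with the HTML DTD would, so it only recognises an explicit `<head>`.

```python
from html.parser import HTMLParser


class TitleInHeadChecker(HTMLParser):
    """Record whether a <title> start tag occurs inside an explicit <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.title_in_head = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.title_in_head = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def has_title_in_head(html_text):
    checker = TitleInHeadChecker()
    checker.feed(html_text)
    checker.close()
    return checker.title_in_head


FRAGMENT = """<!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<div>
<p>Foo</p>
</div>"""

print(has_title_in_head(FRAGMENT))
```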
 
Andy Dingley

Lachlan said:
Yes, it's valid. The validator would have told you that.

I don't know _what_ the validator is telling me. As an example (from
Tidy) it gives a warning
"inserting missing 'title' element"

Now to my mind, this suggests that it's seen as a valid serialisation
of a HTML document, but that after parsing it the HTML-specific tool
has implied the <html>, <head>, <title> and presumably <body> elements.
Now that's quite a different behaviour to "These documents are valid
as fragments based on any root element".

I also don't have a generic SGML parser to hand, just HTML ones. My
real interest here is in the XML or SGML cases, not anything
HTML-specific that is being implied by the context or HTTP headers.
 
Joe Kesselman

> I don't know _what_ the validator is telling me. As an example (from
> Tidy) it gives a warning
> "inserting missing 'title' element"

Tidy isn't a validator. It's a tool for repairing broken documents.
 
Joe Kesselman

Chris said:
It's valid, but is it a valid *HTML* document?

Please note: HTML is not an XML language; it's based on SGML, and its
DTDs follow somewhat different rules.

If you're talking about XML-validity and HTML in the same sentence, you
want to move to XHTML (and hope the tools you and your customers are
using support it). Or, work in XML at the source level, and then render
into HTML at the end for output to the user; XSLT can be used to do that.
 
Peter Flynn

Andy Dingley said:
What specifies the permitted root element(s) for a document ? HTML,
SGML, XHTML or XML ?

When using a DTD, any declared element type can be the root element.
It must be specified in the Document Type Declaration in the XML file.
The same is true for SGML, HTML, XHTML eg

<!DOCTYPE table PUBLIC "-//W3C//DTD HTML 4.01//EN">

specifies a document starting with <table> and containing anything
valid in HTML 4.01 tables.

Warning: *browsers* are not SGML conforming applications, so they won't
understand this. They *will* understand if you use XML or XHTML, but
I don't know what their reaction to a XHTML fragment would be.
> My question is, is there any way to author a non-HTML DTD (SGML or XML)
> so as to restrict valid documents to only allow a certain subset of
> their elements to be used as the root element?

Yep, just use the element type name of your choice in the Document
Type Declaration. This is required to be supported by all conforming
editors using a DTD. If you use a Schema, all bets are off, as the
specification of a root element type is done quite differently there.
> Can this restriction be
> expressed _entirely_ within a DTD?

No, not at all. *Any* element type of a DTD can be used as the root
element type.

But conforming applications (eg editors) usually make a good guess
if they are worth anything, when they parse the DTD -- it's not
hard for them to spot that at least one element type is never used
in the content model of any other element type, and is therefore a
good choice for a default root element type. Oddly, some otherwise
very good editors fail to do this, possibly because their programmers
simply didn't grok XML markup.
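
The guess described above can be sketched in a few lines. This is only an illustration, not how any particular editor does it: it assumes an XML document whose DTD sits in the internal subset (expat does not fetch external DTDs by default), and the function name `candidate_roots` and the sample DTD are invented for the example.

```python
import xml.parsers.expat


def candidate_roots(xml_text):
    """List declared element types that no *other* element's content model uses."""
    declared = []  # element type names, in declaration order
    used_by = {}   # child name -> set of element types whose model mentions it

    def walk(parent, model):
        # expat reports a content model as (type, quantifier, name, children)
        _type, _quant, name, children = model
        if name is not None:
            used_by.setdefault(name, set()).add(parent)
        for child in children:
            walk(parent, child)

    def on_element_decl(name, model):
        declared.append(name)
        walk(name, model)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)

    # A type that only refers to itself still counts as a candidate.
    return [n for n in declared if not (used_by.get(n, set()) - {n})]


SAMPLE = """<!DOCTYPE doc [
<!ELEMENT doc (title, section+)>
<!ELEMENT section (title, (p | section)*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT p (#PCDATA | em)*>
<!ELEMENT em (#PCDATA)>
]>
<doc><title>t</title><section><title>s</title></section></doc>"""

print(candidate_roots(SAMPLE))
```

The result is only a default suggestion; the document type declaration can still name any declared element type.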
> Is this used within the HTML DTDs ?
> (i.e. not just in the doctypedecl)

Not explicitly.
> Is this fragment a valid HTML document ?

Yes, perfectly.
> If not, why isn't it? Just
> which part of its definition is forbidding this fragmentary use?
> <!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
> <div>
> <p>Foo</p>
> </div>

You can test this by running it through any SGML validating parser
(eg nsgmls).
> Good tutorial refs on DTDs are also welcome. I don't know anything like
> enough on DTD innards.

The best by far is still Eve Maler and Jeanne El Andaloussi, "Developing
SGML DTDs -- from text to model to markup", Prentice Hall, 1996. You
just have to skip the bits which refer to those parts of SGML which were
dropped in the XML Specification (see the list in the FAQ on converting
DTDs to XML at http://xml.silmaril.ie/developers/dtdconv/).

But you should also bone up on Relax NG, which is a schema language with
a short (DTD-like) syntax as well as a verbose syntax, from which you
can generate DTDs, W3C Schemas, and more. This may be an easier way into
document modelling.

///Peter
 
Peter Flynn

Chris said:
It's valid, but is it a valid *HTML* document? I think not, since
http://www.w3.org/TR/html4/struct/global.html
requires HTML documents to have title elements
"Every HTML document *must* have a TITLE element in the HEAD section."

Those requirements can't be fully enforced at the DTD level, but are
in the specification. It's clearly a valid SGML document, but I think
describing it as HTML is dubious.

It's a HTML *fragment*. Browsers may gag on it. Properly conformant
software won't.

///Peter
 
Jukka K. Korpela

Peter Flynn said:
Yes, perfectly.

No, it is a valid SGML document, but it is not an HTML document, as defined
in HTML specifications. (Of course, most "HTML documents" on the Web are not
HTML documents in that sense, but the question is meaningful only if
interpreted as relating to specifications. "HTML document" in the loose
sense - as well as "XML document" when well-formedness is not required - is
far too fuzzy a concept to be argued about.)
> You can test this by running it through any SGML validating parser
> (eg nsgmls).

That would indicate the validity, but the HTML 4.01 specification requires
that one of three specific DOCTYPE declarations be used - not just that one
of three DTDs be used. And this isn't one of them. Moreover, the
specification explicitly says:
"After document type declaration, the remainder of an HTML document is
contained by the HTML element."
http://www.w3.org/TR/REC-html40/struct/global.html#h-7.3
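
Read this way, the extra requirement is still purely syntactic and easy to test. The sketch below is one possible reading, not anything taken from the specification: it checks only that the declared name is `html` and that the public identifier is one of the three HTML 4.01 ones, and it ignores the system identifier. The names `DoctypeReader` and `doctype_fits_html401` are hypothetical.

```python
import shlex
from html.parser import HTMLParser

# Public identifiers of the HTML 4.01 Strict, Transitional and Frameset DTDs.
HTML401_PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.01//EN",
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01 Frameset//EN",
}


class DoctypeReader(HTMLParser):
    """Capture the text of the first DOCTYPE declaration."""

    def __init__(self):
        super().__init__()
        self.decl = None

    def handle_decl(self, decl):
        if self.decl is None and decl.upper().startswith("DOCTYPE"):
            self.decl = decl


def doctype_fits_html401(html_text):
    reader = DoctypeReader()
    reader.feed(html_text)
    reader.close()
    if reader.decl is None:
        return False
    parts = shlex.split(reader.decl)
    # Expected shape: DOCTYPE <name> PUBLIC "<public id>" ["<system id>"]
    if len(parts) < 4 or parts[2].upper() != "PUBLIC":
        return False
    name, public_id = parts[1], parts[3]
    return name.lower() == "html" and public_id in HTML401_PUBLIC_IDS


FRAGMENT = """<!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<div>
<p>Foo</p>
</div>"""

print(doctype_fits_html401(FRAGMENT))
```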
 
Joe Kesselman

In other words: As always, a DTD -- or a schema -- is only a partial
description of what makes a document correct and meaningful. Think of
these as "higher-level syntax checking"; the application is always going
to impose semantic constraints as well.

Having the schema or DTD describes the document's structure in a
machine-readable form that tools can take advantage of, so they don't
have to do *all* the checking themselves. That's valuable. But don't
expect it to be complete.
 
Jukka K. Korpela

Joe Kesselman said:
In other words:

In future, please quote or paraphrase the message that you are commenting
on.
> As always, a DTD -- or a schema -- is only a partial
> description of what makes a document correct and meaningful.

That depends. There's no law that requires additional rules, though pure
syntax as such _is_ somewhat boring.
> Think of
> these as "higher-level syntax checking"; the application is always
> going to impose semantic constraints as well.

What's "higher-level" here? Anyway, in the issue discussed in this thread,
it is the additional _syntactic_ constraints that imply that a certain kind
of document is not an HTML document. There's nothing semantic in the
requirement that a document contain a specific DOCTYPE declaration or that a
document contain a <title> element. (Requiring that the <title> element
contain text that is a descriptive name for the document, especially for use
as a title for it in different contexts, would be a semantic requirement.
Whether HTML specifications make such a requirement is debatable; the prose
in the specs is a mixture of normative-looking prose, comments, hints,
wishful thinking, etc.)
 
Henri Sivonen

> My question is, is there any way to author a non-HTML DTD (SGML or XML)
> so as to restrict valid documents to only allow a certain subset of
> their elements to be used as the root element? Can this restriction be
> expressed _entirely_ within a DTD?

No and no.

RELAX NG can restrict the allowed roots and does not allow the document
to override.
> Is this fragment a valid HTML document ?
> <!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
> <div>
> <p>Foo</p>
> </div>

Valid in the SGML sense but not conforming to the HTML 4.01 spec.
Validity is overrated. DTD-validity is especially overrated.
> Good tutorial refs on DTDs are also welcome. I don't know anything like
> enough on DTD innards.

Since you haven't invested in learning DTDs, unless you have a
non-negotiable requirement to use them, I suggest learning RELAX NG
Compact Syntax instead:
http://relaxng.org/compact-tutorial-20030326.html
 
Alan J. Flavell

> In other words:

Who and what are you trying to restate? Your header says it's
<[email protected]> by Jukka, but readers have
no idea which part(s) of that posting you are trying to comment on,
contradict, misquote, or whatever. Please observe customary usenet
courtesies.
> As always, a DTD -- or a schema -- is only a partial
> description of what makes a document correct and meaningful.

The W3C HTML specification requires the document root to be the <html>
element. That seems to me to be a syntactic constraint on anything
which lays claim to being an "HTML document" (as opposed to a
fragment). Which is part of what Jukka said, and which you appear to
be trying to obfuscate.
> Think of these as "higher-level syntax checking"; the application is
> always going to impose semantic constraints as well.

Of course; but your comment, far from being a restatement "in other
words" of the article you were following-up to, appears to be some
quite unrelated issue, that throws little or no light on what Jukka
said. By failing to quote the relevant parts on which you are
commenting, you give the unfortunate impression that you are making it
harder for readers to see just how the reasoning is being de-railed.
> Having the schema or DTD describes the document's structure in a
> machine-readable form that tools can take advantage of, so they
> don't have to do *all* the checking themselves. That's valuable. But
> don't expect it to be complete.

It seems to me that you could do well to distinguish between an "HTML
document", and an HTML fragment. The kind of HTML fragment under
discussion here is not (IMO) an "HTML document" within the meaning of
the applicable specifications, and that is on syntactic grounds.

Jukka is going a bit far at the point where he says:

|the HTML 4.01 specification requires that one of three specific
|DOCTYPE declarations be used ...

- since this would appear to rule out ISO HTML as being a bona fide
kind of HTML, quite apart from the various custom DTDs which are
around, and which I think most folks would accept as *kinds* of HTML
document, albeit not approved by the W3C.

But the main argument does not hinge on that detail, as far as I can
tell. Their root element (express or implied) needs to be <html>
before they can be an "HTML document".

h t h
 
Henri Sivonen

Alan J. Flavell said:
Jukka is going a bit far at the point where he says:

|the HTML 4.01 specification requires that one of three specific
|DOCTYPE declarations be used ...

- since this would appear to rule out ISO HTML as being a bona fide
kind of HTML,

I think it is quite appropriate to claim that ISO HTML is not conforming
HTML *4.01*.
 
Alan J. Flavell

> I think it is quite appropriate to claim that ISO HTML is not
> conforming HTML *4.01*.

Oh, indeed. What Jukka said was entirely reasonable within its own
terms, but what light did it throw on a generic definition of the term
"HTML document"? I suppose I was griping more about what he didn't
say, than about what he did. Sorry.

Maybe we're losing sight of where this discussion came from:

|> > Just
|> > which part of its definition is forbidding this fragmentary use?
|> > <!DOCTYPE div PUBLIC "-//W3C//DTD HTML 4.01//EN"
|> > "http://www.w3.org/TR/html4/strict.dtd">
|> > <div>
|> > <p>Foo</p>
|> > </div>

It seems entirely plausible to test *that* particular question against
the HTML/4.01 specification, since it calls-out the HTML/4.01 DTD [1]

But then we have to differentiate the question 'what defines an "HTML
document" according to this or that specific flavour of HTML?' from
the more general question of 'who is entitled to define the term "HTML
document" without reference to any specific flavour of HTML, and where
would we find such a definition?'.

I'm saying that - no matter which specific HTML DTD were to be called
out from the above DOCTYPE - the result could be an HTML fragment, but
it would be unreasonable to claim it as an "HTML document". But I'm
not sure that I would be able to give you chapter and verse to settle
that argument authoritatively. And no review of definitions of each
/individual version of HTML/ could suffice to define the term "HTML"
generically.

regards

[1] Yes, I've reviewed the historic arguments about an SGML DTD not
defining what we all had thought it did. But they relied on doing
things which HTML rules out, but which SGML does not allow to be ruled
out. Taken to its logical conclusion, that would result in HTML
disappearing entirely in a puff of logic. I didn't want to go there.
 
Jack

Henri said:
I think it is quite appropriate to claim that ISO HTML is not
conforming HTML *4.01*.

Would you care to expand on this apparently rather odd statement?

As far as I am aware, ISO HTML is essentially a restatement of W3C HTML
4.01, with certain recommendations transformed into requirements, and
certain deprecations transformed into exclusions. Apart from that, the
recommended DTD declaration is different; but the exact DTD to be
declared is not a requirement of W3C HTML 4.01 anyway.

Please explain whatever I may have misunderstood!
 
Henri Sivonen

Jack said:
Would you care to expand on this apparently rather odd statement?

The specs make incompatible requirements about the doctype, which means
conformance to the specs is mutually exclusive.
> As far as I am aware, ISO HTML is essentially a restatement of W3C HTML
> 4.01, with certain recommendations transformed into requirements, and
> certain deprecations transformed into exclusions. Apart from that, the
> recommended DTD declaration is different; but the exact DTD to be
> declared is not a requirement of W3C HTML 4.01 anyway.

But Jukka Korpela pointed out in the quoted part that W3C HTML 4.01 does
have a requirement of particular doctypes.

(Whether these requirements should be considered bogus or not is another
matter.)
 
VK

Alan said:
I'm saying that - no matter which specific HTML DTD were to be called
out from the above DOCTYPE - the result could be an HTML fragment, but
it would be unreasonable to claim it as an "HTML document".

You have no choice but to claim it as an "HTML document". It is served
with "Content-Type: text/html"; local files get the same type by
association (.html, .htm ... --> text/html).

So before any DTD comes into play you /have/ to explicitly declare what
kind of document you are serving - this is the only way to make an
application react to it. So however you twist the HTML code around, it
is always an /HTML document/ for the recipient; whether it is correctly
formatted or badly broken is another issue. Out of curiosity, you can
serve a page from your server such as:

Content-Type: text/html\n\n
!@#$%&*


P.S. I'm really glad to see that the discussion at
<http://groups.google.com/group/comp...oring.html/browse_frm/thread/4fd4218808cd53ce>

triggered your curiosity and the thinking process in whole.

Just try to not put your frustration on Mr.Kesselman - he has nothing
to do with it.
 
