Benjamin Niemann said:
> The problem is probably that the SGML specification is not freely (as
> in beer) available - in contrast to the XML specification.
I would expect someone to spend a few bucks on the very basic documents, if
he intends to write software like a validator.
> And the SGML spec is really a monstrosity,
No doubt about that. That's one reason why people shouldn't try to write
software like validators light-heartedly.
> But I must admit that the answer is pretty obvious for everyone with
> some understanding of HTML and SGML - even without reading the spec.
In fact, it can only be known from the SGML standard - either by actually
reading the standard, or consulting someone you trust to have read it,
understood it, and to be willing to help you. As far as I remember, the HTML
specifications, for example, don't mention this issue at all - in
principle, the reference to SGML or XML is sufficient, and the authors of
HTML specs didn't bother mentioning this particular detail. It is true that
e.g. the HTML 4.01 specification explicitly mentions that element names are
case insensitive. But it does not say that the "html" in a DOCTYPE
declaration is an element name (still less that it is a generic identifier,
which is the SGML jargon for element name).
Reading the HTML spec can actually make you doubt what it wants to say.
The hilariously titled "HTML version information" section,
http://www.w3.org/TR/html4/struct/global.html#h-7.2
says: "HTML 4.01 specifies three DTDs, so authors must include one of the
following document type declarations in their documents." This sentence
seems to present a logical implication (with "so"), but it's a non
sequitur. The use of one of three document type definitions does not imply
that you need to use one of three document type declarations. (Anyone who
does not see this should not dream of writing a validator, or feel
competent to discuss what is a validator and what is not, before spending
quite some time in a study room with good books.)
From the SGML viewpoint, the wording is best understood to mean that the
three document type declarations are just _examples_. But we know that
browsers have based "doctype sniffing" on the document type declarations,
so that e.g. the presence or absence of a URL can be decisive.
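To make the "presence or absence of a URL" point concrete, here is a rough
sketch in Python. It is not any browser's actual sniffing table; the function
names and the mode labels are made up for illustration, and the one rule shown
(HTML 4.01 Transitional without a URL triggering quirks mode) is only the
commonly documented behaviour, simplified.

```python
import re

# Simplified pattern: DOCTYPE keyword, document type name, PUBLIC keyword,
# quoted public identifier, optional quoted system identifier (the URL).
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(\S+)\s+PUBLIC\s+"([^"]*)"(?:\s+"([^"]*)")?\s*>',
    re.IGNORECASE,
)

TRANSITIONAL_PREFIX = "-//W3C//DTD HTML 4.01 Transitional//"


def has_url(declaration):
    """Return True if the declaration carries a system identifier (URL)."""
    m = DOCTYPE_RE.match(declaration.strip())
    if m is None:
        raise ValueError("not a recognised document type declaration")
    return m.group(3) is not None


def sniff_transitional(declaration):
    """Hypothetical sniffing rule for HTML 4.01 Transitional only.

    Returns "quirks" when the URL is absent, "standards-like" when it is
    present, and None for declarations this sketch does not cover.
    """
    m = DOCTYPE_RE.match(declaration.strip())
    if m is None or not m.group(2).startswith(TRANSITIONAL_PREFIX):
        return None
    return "quirks" if m.group(3) is None else "standards-like"
```

The two Transitional declarations differ only in the URL, yet a sniffer along
these lines puts them in different modes, which SGML itself would never do.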
If the words "authors must include one of the following document type
declarations in their documents" are to be read as an independent
requirement imposed on conforming documents (and not as a logical
implication from something else), the next question is: how literally shall
this be interpreted? When presenting
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
as one of the three permitted options, does the specification mandate this
exact format with
- the redundant URL included
- uppercase in keywords DOCTYPE, HTML, and PUBLIC
- that particular use of white space?
Even the last point is not self-evident. By SGML rules, the amount and kind
of white space between the two quoted strings here is not significant; but
it all seems as if the spec wants to impose an _additional_ requirement,
requiring a very specific document type declaration.
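The difference between the two readings can be sketched in Python. This is
only an illustration under simplifying assumptions: the regular expression is
nowhere near a real SGML parser, the name parse_doctype is made up, and it
takes for granted that the keywords and the document type name are case
insensitive (as they are under the SGML declaration used for HTML) while the
quoted identifiers are compared as written.

```python
import re

# Simplified pattern for a PUBLIC document type declaration:
# name, quoted public identifier, optional quoted system identifier.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+([A-Za-z][A-Za-z0-9.-]*)\s+PUBLIC\s+'
    r'"([^"]*)"(?:\s+"([^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(declaration):
    """Reduce a declaration to (name, public identifier, system identifier).

    White space between the parts is discarded and the name is folded to
    upper case, so declarations that differ only in those respects compare
    equal. The system identifier is None when no URL is given.
    """
    m = DOCTYPE_RE.fullmatch(declaration.strip())
    if m is None:
        raise ValueError("not a recognised document type declaration")
    name, public_id, system_id = m.groups()
    return (name.upper(), public_id, system_id)


as_in_spec = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
              '"http://www.w3.org/TR/html4/strict.dtd">')
variant = ('<!doctype html public "-//W3C//DTD HTML 4.01//EN"    '
           '"http://www.w3.org/TR/html4/strict.dtd">')

same_by_sgml_rules = parse_doctype(as_in_spec) == parse_doctype(variant)
same_literally = as_in_spec == variant
```

Here same_by_sgml_rules comes out true and same_literally false; the question
is which of the two comparisons the specification has in mind.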
Not yet sufficiently confused? Then please read how the online version of
the HTML specification itself starts at source level:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
(I'm not referring to the Transitional vs. Strict issue, in which the
specification actually practices something other than what it teaches; here I'm
referring to the use of a document type declaration without a URL.)