most XHTML on the web is invalid?

John Salerno

What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

To me it sounds like he is saying that *any* document written in XHTML
and then served as text/html is invalid. But is that really the case? Or
is he saying that the document *could* be invalid because it could still
be prone to the methods of HTML (e.g., no closing tags, etc.)?

I assume if you validate your XHTML, then simply serving it as text/html
doesn't harm it, right? It doesn't suddenly make it "invalid," does it?
(Perhaps in a strict sense it does, because it's not truly XHTML, but as
far as the actual words in the document themselves, they are still
valid, right? And if it was then served as application/xhtml+xml, it
would be valid, correct?)
 
John Salerno

John said:
What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

To me it sounds like he is saying that *any* document written in XHTML
and then served as text/html is invalid. But is that really the case? Or
is he saying that the document *could* be invalid because it could still
be prone to the methods of HTML (e.g., no closing tags, etc.)?

I assume if you validate your XHTML, then simply serving it as text/html
doesn't harm it, right? It doesn't suddenly make it "invalid," does it?
(Perhaps in a strict sense it does, because it's not truly XHTML, but as
far as the actual words in the document themselves, they are still
valid, right? And if it was then served as application/xhtml+xml, it
would be valid, correct?)

Here's a related quote:

"If you ever switch your documents that claim to be XHTML from
text/html to application/xhtml+xml, then you will in all likelihood
end up with a considerable number of XML errors, meaning your
content won't be readable by users. (See above: most of these
documents do not validate.)"

To me, this argument seems valid only because it carries the
presupposition that most authors are still writing as if they are using
HTML instead of XHTML. If you actually know what you are doing (i.e.,
know how an XML language needs to be structured) and you do that, then
this point is moot.
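For what it's worth, the difference is easy to see in miniature. Here is my own small sketch in Python (not anything from the thread): an XML parser, of the kind used for application/xhtml+xml, rejects outright the tag omissions that tag-soup text/html handling quietly repairs.

```python
# Sketch: the same markup habits that slide through tag-soup HTML
# handling are hard errors to an XML parser. The sample strings are
# invented for illustration.
import xml.etree.ElementTree as ET

tag_soup = "<p>An unclosed paragraph<br>and a bare line break"
well_formed = "<p>A closed paragraph<br/>with a self-closed break</p>"

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(tag_soup))     # False: XML rejects the unclosed tags
print(is_well_formed(well_formed))  # True: this survives an XML parse
```

An author who habitually writes the second form loses nothing by the switch; one who writes the first form gets a parse error instead of a page.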
 
John Salerno

John said:
What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

To me it sounds like he is saying that *any* document written in XHTML
and then served as text/html is invalid. But is that really the case? Or
is he saying that the document *could* be invalid because it could still
be prone to the methods of HTML (e.g., no closing tags, etc.)?

I assume if you validate your XHTML, then simply serving it as text/html
doesn't harm it, right? It doesn't suddenly make it "invalid," does it?
(Perhaps in a strict sense it does, because it's not truly XHTML, but as
far as the actual words in the document themselves, they are still
valid, right? And if it was then served as application/xhtml+xml, it
would be valid, correct?)

One more quote, this time from Andy from this newsgroup:

"However (the bad news) this is an XML technique and so only works with
XHTML documents that are XML documents, not the Appendix C XHTML non-XML
documents we've already mentioned as being the only ones that are yet
ready for use on the web. You can still use these techniques, but it's
not simple, much of your audience may have problems with them, and
compatibility issues are significant."

So basically these last three posts of mine are asking the same
question: why is "Appendix C" XHTML considered invalid? What more is
there to writing real XHTML than simply starting with the basic rules
(lower-case, proper nesting, closing tags, etc.)?

Obviously you can eventually include other XML namespaces, but for now,
why is it said that documents written in Appendix C XHTML will
eventually be invalid when actually served as application/xhtml+xml?

Thanks!
 
Jukka K. Korpela

John Salerno said:
What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most
UAs. This means that authors are not checking for validity, and
thus most XHTML documents on the web now are invalid. Therefore the
main advantage of using XHTML, that it has to be valid, is lost if
the document is then sent as text/html."

Looks pretty clear to me, though the phrase "it has to be valid" is
actually just a successful meme (in a particular environment), not a
correct statement of facts. It is apparently meant to say that browsers
_must_ check for validity (for documents, HTML specifications have
always required validity), but what the XHTML 1.0 document actually
says is something very different. It only requires that well-formedness
(i.e., being XML in the first place) is "evaluated" (with no
requirement on reporting the result of the evaluation), and there is
clearly no requirement on checking validity:
http://www.w3.org/TR/xhtml1/#uaconf
In fact, XHTML 1.0 even _requires_ certain processing of unrecognized
elements and attributes, which means rules for processing _invalid_
documents. In classic HTML, this was just common practice (and a
_suggestion_ in the specs).
To me it sounds like he is saying that *any* document written in
XHTML and then served as text/html is invalid.

No, that's not at all what it says.

The general idea is that authors who think they are using XHTML do not,
in fact, use XHTML (but violate validity requirements, prose
requirements, and perhaps even well-formedness requirements) and do not
observe this, since browsers don't report the errors. The idea seems to
be that browsers _would_ report errors if application/xhtml+xml were
used, but as I explained, there is no such requirement - and browsers
are even _required_ to process invalid documents in a particular manner
(though we can perhaps deduce that they _may_ also flag errors).
I assume if you validate your XHTML, then simply serving it as
text/html doesn't harm it, right?

Right. But you gain nothing either.
It doesn't suddenly make it
"invalid," does it? (Perhaps in a strict sense it does, because
it's not truly XHTML, but as far as the actual words in the
document themselves, - -

Validity has nothing to do with the Internet media type used to serve a
document. Validity is an inherent property of a document.
 
Neredbojias

With neither quill nor qualm, John Salerno quothed:
What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

To me it sounds like he is saying that *any* document written in XHTML
and then served as text/html is invalid. But is that really the case? Or
is he saying that the document *could* be invalid because it could still
be prone to the methods of HTML (e.g., no closing tags, etc.)?

I assume if you validate your XHTML, then simply serving it as text/html
doesn't harm it, right? It doesn't suddenly make it "invalid," does it?
(Perhaps in a strict sense it does, because it's not truly XHTML, but as
far as the actual words in the document themselves, they are still
valid, right? And if it was then served as application/xhtml+xml, it
would be valid, correct?)

Using xhtml "today" without a special and rather esoteric need for it is
just plain silly. There's nothing profound about the format or markup,
nothing really to be learned by doing xhtml pages except the most
simplistic tactical techniques one could imagine. Perhaps someday when
the so-called experts figure out what they're supposed to be doing and
are actually able to do it, this will change, but that day is not going
to be soon. Far better a valid and structurally-correct html-strict
opus than anything in xhtml at all.
 
Spartanicus

John Salerno said:
"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

IIRC Hixie makes this point about validating parsers. Imo he pushes the
boat out too far with his arguments against XHTML. The paragraph above
is an example: validating parsers are and will continue to be a rarity
in UAs that are not validators.

At the same time Hixie doesn't address many of the myths that cause
authors to believe that XHTML is preferred over HTML. Have a look at the
previously mentioned http://www.spartanicus.utvinternet.ie/no-xhtml.htm
 
Alan J. Flavell

What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus most
XHTML documents on the web now are invalid.

I interpreted it as meaning that *most* authors who jumped on the
XHTML bandwagon - because it was sexy, rather than because they knew
what they were doing - are still writing tag soup - just that now
they're writing XHTML-flavoured tag soup whereas previously they were
writing HTML-flavoured tag soup.
Therefore the main advantage of using XHTML, that it has to be
valid, is lost if the document is then sent as text/html."

To me it sounds like he is saying that *any* document written in
XHTML and then served as text/html is invalid.

I don't think he meant that.

One of the claimed benefits for XHTML was that it would put an end to
tag soup, and would produce only documents which were valid, thus
putting an end to the problem of browsers having to guess
heuristically what they were supposed to do with invalid markup. We
were told by its proponents that a new generation of XML-based
browsers would be able to get rid of all that ballast of error fixup
code, and just parse the valid XML-based markups that they would be
given. Which of course could be far more elaborate than mere HTML -
containing additional XML-based markups including SVG and MathML, and
so on.

What he's alerting us to, AIUI, is that in reality, many/most of those
who *imagine* they are producing XHTML are producing no such thing -
they are producing XHTML-flavoured tag soup, sending it out as
text/html, and continuing to rely on old error-correcting browsers
which were designed for parsing HTML tag soup (courtesy of the W3C's
misguided provisions of "Appendix C" to do so).

Here we know better, of course, since we not only know how to use a
validator (or, even better, use an authoring process which is designed
such that it can only generate valid output); we also know what's
meant by semantic markup (even if we have lesser disagreements about
exactly what it means). But "we" are in a tiny minority compared with
the billions of pages that are out there on the WWW.
I assume if you validate your XHTML, then simply serving it as
text/html doesn't harm it, right? It doesn't suddenly make it
"invalid," does it?

Well, text/html used to mean in theory "this is HTML" - in practice it
meant "this is almost certainly HTML-like tag soup, although
occasionally it will be HTML"; whereas under the provisions of
Appendix C, it now means "this is almost certainly one or other
flavour of tag soup, although occasionally it will be either HTML or
Appendix-C XHTML/1.0".

No, valid XHTML/1.0 Appendix C isn't actually *invalid* as HTML; it
just (per the SHORTTAG problem) *means* something different, and
relies on a widespread browser bug to get itself rendered as intended
- rather than as specified by SGML.

Remember, the "SGML Declaration" for HTML is non-negotiable. It's
published in the HTML specification(s), e.g
http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html , and it forms an
implied part of every HTML transaction. Unlike the contents of the
DTD (which are referenced from the DOCTYPE), the SGML Declaration
forms no part of the negotiation between the client and the server -
there is no URL from which the client could, even in principle,
retrieve the "SGML Declaration". And there's no doubt that the SGML
Declaration for HTML says "SHORTTAG YES", whether the supporters of
Appendix C care to hear it or not.

Appendix C relies upon the fact that client agents don't take the SGML
Declaration seriously. And in order to cope with this, the so-called
HTML validators need a mode switch, which takes a sneaky look at the
DOCTYPE and decides whether to switch from an SGML mode into an XML
mode for the validation. This is all very heuristic - it's not based
on any well-founded theoretical model at all.
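Alan's SHORTTAG point can be made concrete with a small illustration (my own sketch, not taken from the specs): under the HTML SGML Declaration's "SHORTTAG YES", the "/" in an XHTML-style empty-element tag is a null end tag (NET), so a strict SGML parser reads the markup quite differently from a browser that ignores the Declaration.

```
<!-- What an Appendix C author writes, relying on browsers
     that never take the SGML Declaration seriously: -->
<p>first line<br/>second line</p>

<!-- What a conforming SGML parser, with SHORTTAG YES in force,
     is entitled to understand by it: "<br/" is a null-end-tag
     construct, so the markup is equivalent to: -->
<p>first line<br>>second line</p>
```

In other words, a strict HTML parse of valid Appendix C markup would leave a stray ">" after every such tag; the pages only look right because clients don't honour the Declaration.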

AIUI, Hixie would like application/xhtml+xml when sent from a server
to mean "I warrant this to be XHTML", with no parachute provided for
cases where that turns out to be false.

No, I don't think he's demanding that every browser must be a
validating parser, guaranteeing an error report instead of rendering
documents which prove to be invalid[1]. (XML does, however, mandate
reporting an error for well-formedness errors.) He's only saying that
serving XHTML *should* represent a warranty of validity, with the
*sender* accepting any consequences of the warranty being broken, and
removing the implied requirement on every recipient to perform the QA
corrections which the author failed to do.

That's my interpretation of it, anyway. I don't know Hixie personally
and can only base my understanding of his position on what I've read.
YMMV and all that.

best

[1] As long as so many authors continue to use their favourite browser
as the sole arbiter of correctness, however, it really would be a good
idea if their browser would do precisely that. But I *know* it isn't
going to happen, so I'm not losing any sleep over it.
 
cwdjrxyz

John said:
What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid. Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

To me it sounds like he is saying that *any* document written in XHTML
and then served as text/html is invalid. But is that really the case? Or
is he saying that the document *could* be invalid because it could still
be prone to the methods of HTML (e.g., no closing tags, etc.)?

I assume if you validate your XHTML, then simply serving it as text/html
doesn't harm it, right? It doesn't suddenly make it "invalid," does it?
(Perhaps in a strict sense it does, because it's not truly XHTML, but as
far as the actual words in the document themselves, they are still
valid, right? And if it was then served as application/xhtml+xml, it
would be valid, correct?)

You have had many answers with useful information. I just want to add a
practical consideration or two.

No matter whether you write your page in perfect html or perfect xhtml,
if you label the page as something.html, it gets served as html by most
servers. If you want to serve an xhtml page as true xhtml, you need to
set the mime type application/xhtml+xml, associated with some extension
such as .xhtml, on the server, and serve the page with that extension.
A good xhtml page then gets served correctly. And it will not be viewed
by browsers such as IE6 that do not understand the mentioned mime type.

If you want IE6 and such to view the page, you either have to serve a
separate html page for it, or use a trick such as writing a php include
at the top of the something.php page that will serve the page as xhtml
only if the server/browser header exchange indicates that the browser
will accept application/xhtml+xml. If not, then the page gets served as
html. A few browsers will not tell whether they accept this mime type
even when they will accept true xhtml. In that case you would serve the
page as html anyway to be on the safe side, or you are back to writing
separate pages for html and xhtml. Then you need to check the page on a
few browsers to make certain the automatic conversion to an html page
for IE6 and such is working.

When you use an xhtml-aware browser (Opera, Firefox, etc.) and serve the
page as application/xhtml+xml, the browser then parses the page as xml.
Little errors that might not cause much of a problem on an html page
then sometimes result in the page not showing at all, and you also often
get an xml parse error.

Concerning the W3C validator: if you serve a page (written in html or
xhtml code) as something.html, the validator views the page as html.
If you serve it as application/xhtml+xml, the validator views the page
as xhtml. If you use the extended interface at the validator, it will
tell you the mime type of the page it views. However, the validator
validates the code of the page as indicated in the Doctype. Thus if you
serve an xhtml page as html, the validator will still validate the code
as xhtml, griping about unclosed img and br tags, etc. However, it does
not determine whether the page is being served correctly (other than by
the mime type displayed using the extended interface). I suspect this
is one reason some think they are serving xhtml when they actually are
not.
 
Andy Dingley

What exactly does this mean:

"Documents sent as text/html are handled as tag soup [1] by most UAs.
This means that authors are not checking for validity, and thus
most XHTML documents on the web now are invalid.

This is a fallacious conclusion to draw from those initial conditions.
Empirically the evidence supports the same conclusion, but the logic
still doesn't follow.

Most XHTML out there is bogus and badly formed. But that has nothing to
do with its being sent as text/html.
Therefore the main
advantage of using XHTML, that it has to be valid, is lost if the
document is then sent as text/html."

This has never been an advantage of XHTML, at any incarnation beyond the
whiteboard stage. The web has _always_ taken a best-guess approach to
error recovery of any format, and XHTML never seriously attempted to
reverse that.

Maybe it should. Maybe it would have been better if all badly-formed XML
was rejected out of hand (as indeed it is in the applications world).
But XHTML crept onto the web gradually, as an evolution of HTML by the
designers, not as a move of real coders from the desktop onto the web.
It inherited HTML's sloppy approaches and we have to work from that as
our starting point - anything else is just pointless theorising.
 
