XML 1.x: URIs' and IRIs' impact on well-formedness

Nicolai Stange · Dec 13, 2009

Hello all,

in the following I'm referring to
XML11 http://www.w3.org/TR/xml11/
RFC3986 http://www.rfc-editor.org/rfc/rfc3986.txt
RFC3987 http://www.rfc-editor.org/rfc/rfc3987.txt

I'm a little bit confused about Section 4.2.2 "External Entities"
(http://www.w3.org/TR/xml11/#sec-external-ent) in XML11 and URI usage.

First of all note that every character is allowed in a SystemLiteral
(production [11] in XML11) as long as no RestrictedChar (production [2a]
in XML11) is contained verbatim (although it could be contained
indirectly through the usage of parameter entities).
So after all, there's no restriction.

,----
| 1.) %-escaping
`----
Now, reading the XML11 Section 4.2.2, the following is being stated
there: "Since escaping is not always a fully reversible process, it MUST
be performed only when absolutely necessary and as late as possible in a
processing chain. In particular, neither the process of converting a
relative URI to an absolute one nor the process of passing a URI
reference to a process or software component responsible for
dereferencing it SHOULD trigger escaping."

So, why should any XML processor (being meant not to include "process or
software component responsible for dereferencing it") start escaping? I
cannot think of any reason.

Another point: Is an XML document's author part of the "processing
chain" mentioned above? He/she should as the retrieving software
component has no means to determine whether the URI given has already been
percent-escaped, and percent-escaping a URI twice must not occur
(RFC3986, Section 2.4 "When to Encode or Decode").
An example:
If an author writes
http://example.org/%99_tax

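does he/she mean
http://example.org/%99_tax (already escaped, to be used as is)
or
http://example.org/%2599_tax (a literal "%99_tax" whose percent sign still has to be escaped)?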
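
To make the two readings concrete, here is a minimal Python sketch. It uses the standard library's urllib.parse purely for illustration (XML11 does not prescribe any tool), and it only looks at the last path segment of the example above.

```python
from urllib.parse import quote, unquote_to_bytes

written = "%99_tax"  # path segment exactly as the author typed it

# Reading 1: the author already escaped, so "%99" stands for the octet 0x99.
as_escaped = unquote_to_bytes(written)

# Reading 2: the author meant the literal characters '%', '9', '9',
# so a consumer that escapes now has to turn '%' into "%25".
as_literal = quote(written, safe="")

# Escaping an already escaped string once more changes its meaning again.
double_escaped = quote(as_literal, safe="")

print(as_escaped)      # b'\x99_tax'
print(as_literal)      # %2599_tax
print(double_escaped)  # %252599_tax
```

Nothing in the string itself tells a consumer which of the two readings applies.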

On the other hand: What if the data of some URI-component contains some
reserved (that is delimiter) character (production "reserved", RFC3986)?
Then this character has to be percent-escaped by the author because
otherwise this character would be mistaken as being a delimiter by the
whole processing chain.
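
A small Python sketch of that effect, again using urllib.parse only as an illustration; the file name "report#1.xml" is a hypothetical example of data containing a reserved character.

```python
from urllib.parse import quote, urlsplit

file_name = "report#1.xml"  # hypothetical file name containing a reserved '#'

# Unescaped: the '#' is taken as the fragment delimiter.
raw = urlsplit("http://example.org/" + file_name)

# Escaped by the author: the '#' stays part of the path data.
escaped = urlsplit("http://example.org/" + quote(file_name, safe=""))

print(raw.path, raw.fragment)  # /report 1.xml
print(escaped.path)            # /report%231.xml
print(escaped.fragment == "")  # True
```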

I think the XML11-percent-escaping rule is ambiguous; a much better way
would have been to require
1.) The processing chain (including the author) should percent-escape
only characters being a member of the reserved-production of RFC3986 and
the percent-sign if occurring within the data of some component, thus
ending up with a partly percent-escaped URI.
2.) The percent sign should not be escaped any further in the processing
chain (including the retrieving component). Only the non-delimiter
characters that have to be escaped should be escaped by the retrieving
component.

Or am I missing some point and that XML11 section is unambiguous?
And if so, how should a retrieving component decide whether a URI has
already been percent-escaped as a whole or not? Heuristics?

,----
| 2.) What they refer to as URIs, aren't that in fact IRIs?
`----
OK, suppose an XML document's author needs to percent-escape a URI as a
whole because he's using a '#' in some of its data components (maybe in
a file name residing on a web server) and XML11 only allows for whole
escaping or no escaping of URIs at all (am I correct?). They tell us how to
escape characters above 0x7F, namely by encoding them as UTF-8 and then
%-encoding the resulting octets. Isn't this in fact the IRI-to-URI mapping
as defined in RFC3987? Why then aren't they naming that
SystemLiteral-thing an IRI and requiring that a dereferencing component must
take IRIs?
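
The escaping step described above (UTF-8 encode, then %-encode each octet) can be sketched in a few lines of Python. This is a simplified illustration that handles only characters above 0x7F; the function name escape_non_ascii and the example URL are made up for this sketch.

```python
def escape_non_ascii(reference: str) -> str:
    """Percent-encode every character above 0x7F via its UTF-8 octets."""
    out = []
    for ch in reference:
        if ord(ch) > 0x7F:
            # One "%HH" triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        else:
            # ASCII characters, including '%' and '#', are left untouched.
            out.append(ch)
    return "".join(out)

print(escape_non_ascii("http://example.org/caf\u00e9.dtd"))
# http://example.org/caf%C3%A9.dtd
```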

,----
| 3.) Impact on well-formedness
`----
Suppose an XML processor can decide whether a URI has already
been %-encoded by the author, and it turns out that this is true in a
particular case. Now suppose that this URI contains some character
that's not allowed in percent-encoded URIs by RFC3986: should the XML
processor reject that document as not being well-formed?

Or neglecting percent encoding, what if a URI doesn't match the URI
production? Maybe there's no "scheme:" at the beginning. Is the document
not well-formed then?

Thank you for clearing up these questions for me!

Best wishes

Nicolai Stange

Peter Flynn · Dec 13, 2009

Nicolai said:
Hello all,

in the following I'm referring to
XML11 http://www.w3.org/TR/xml11/
RFC3986 http://www.rfc-editor.org/rfc/rfc3986.txt
RFC3987 http://www.rfc-editor.org/rfc/rfc3987.txt

I'm a little bit confused about Section 4.2.2 "External Entities"
(http://www.w3.org/TR/xml11/#sec-external-ent) in XML11 and URI usage.

First of all note that every character is allowed in a SystemLiteral
(production [11] in XML11) as long as no RestrictedChar (production [2a]
in XML11) is contained verbatim (although it could be contained
indirectly through the usage of parameter entities).
So after all, there's no restriction.

Nearly. A SystemLiteral may contain non-XML characters (in order to
allow an External Entity Reference to resolve to a filename, for
example). But a SystemLiteral is not parsed for markup, so you cannot
use Parameter Entity References in a SystemLiteral and expect them to be
resolved.
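
One way to observe this is with Python's bundled expat bindings. Note the assumptions: expat is an XML 1.0 parser, not XML 1.1, and the tiny document and entity names below are invented for the sketch.

```python
import xml.parsers.expat

# Hypothetical document: a parameter entity "p" and an external entity "e"
# whose system identifier happens to contain the text "%p;".
DOC = '<!DOCTYPE d [<!ENTITY % p "chapter"><!ENTITY e SYSTEM "%p;.xml">]><d/>'

system_ids = {}

def on_entity_decl(name, is_parameter_entity, value, base,
                   system_id, public_id, notation_name):
    # Only external entities carry a system identifier.
    if system_id is not None:
        system_ids[name] = system_id

parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.Parse(DOC, True)

print(system_ids)  # {'e': '%p;.xml'}
```

The "%p;" comes through as plain text in the reported system identifier.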

The restriction imposed in [2a] applies to the content of all XML
documents, and a SystemLiteral can only practicably occur in a DOCTYPE
declaration, or in an internal or external DTD subset, and then only in
the SYSTEM identifier of a declared ENTITY or NOTATION; my understanding
is that a RestrictedChar cannot occur even in a SystemLiteral despite
the implication of the RE [^"] in Production 11; but that in any case,
the RestrictedChar characters do not occur in filenames in any
conventional computing system that I am aware of (except in error), so
the apparent conflict should not be one which will arise in normal use.

Basically, if you want to use control characters or "unwise" characters
in filenames, and reference them in SYSTEM identifiers, don't expect any
XML parser to accept them.

Now, reading the XML11 Section 4.2.2, the following is being stated
there: "Since escaping is not always a fully reversible process, it MUST
be performed only when absolutely necessary and as late as possible in a
processing chain. In particular, neither the process of converting a
relative URI to an absolute one nor the process of passing a URI
reference to a process or software component responsible for
dereferencing it SHOULD trigger escaping."

So, why should any XML processor (being meant not to include "process or
software component responsible for dereferencing it") start escaping? I
cannot think of any reason.

It shouldn't. It isn't any business of an XML processor to escape
characters for you. It *might* be the business of some earlier or later
process, one that handles the dereferencing of URIs, but like you I
cannot see any good reason for an XML processor to do this. This kind of
network-related dereferencing is the business of other levels of the
operating system.

It's also good working practice never to use SystemLiterals which
require this, so that the problem never arises. If you are dealing with
ill-defined business methods which rely on poorly-designed systems that
use filenames which mean this might occur, it's best to change the
character set for filenames earlier in the process, not to use XML
software to compensate for other people's thoughtlessness.

Another point: Is an XML document's author part of the "processing
chain" mentioned above?

You could probably make a case for it, but "processing chain"
conventionally means what happens to the document *after* it leaves the
author.

He/she should [?be?] as the retrieving software
component has no means to determine whether the URI given has already been
percent-escaped, and percent-escaping a URI twice must not occur
(RFC3986, Section 2.4 "When to Encode or Decode").
An example:
If an author writes
http://example.org/%99_tax

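does he/she mean
http://example.org/%99_tax (already escaped, to be used as is)
or
http://example.org/%2599_tax (a literal "%99_tax" whose percent sign still has to be escaped)?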

On the other hand: What if the data of some URI-component contains some
reserved (that is delimiter) character (production "reserved", RFC3986)?
Then this character has to be percent-escaped by the author because
otherwise this character would be mistaken as being a delimiter by the
whole processing chain.

Correct. This is an example of the need for care that I mentioned above.
I'm not sure that this qualifies the author to be part of the meaning of
"processing chain", but it's certainly arguable.

I think the XML11-percent-escaping rule is ambiguous; a much better way
would have been to require
1.) The processing chain (including the author) should percent-escape
only characters being a member of the reserved-production of RFC3986 and
the percent-sign if occurring within the data of some component, thus
ending up with a partly percent-escaped URI.

You cannot realistically expect authors to know, understand, or even
realise this.

It should be the business of editors (I mean the programs, not the
humans) to check things like this before committing them to the
document. For example, by testing the URI for retrieval.

2.) The percent sign should not be escaped any further in the processing
chain (including the retrieving component). Only the non-delimiter
characters that have to be escaped should be escaped by the retrieving
component.

Or am I missing some point and that XML11 section is unambiguous?
And if so, how should a retrieving component decide whether a URI has
already been percent-escaped as a whole or not? Heuristics?

I think it's currently a non-question. If people find that supplying a
URI in a pre-escaped form doesn't work in their processing chain,
they'll just edit it and note the fact for later observation.

OK, suppose an XML document's author needs to percent-escape a URI as a
whole because he's using a '#' in some of its data components (maybe in
a file name residing on a web server) and XML11 only allows for whole
escaping or no escaping of URIs at all (am I correct?). They tell us how to
escape characters above 0x7F, namely by encoding them as UTF-8 and then
%-encoding the resulting octets. Isn't this in fact the IRI-to-URI mapping
as defined in RFC3987? Why then aren't they naming that
SystemLiteral-thing an IRI and requiring that a dereferencing component must
take IRIs?

I can't answer that.

Suppose an XML processor can decide whether a URI has already
been %-encoded by the author, and it turns out that this is true in a
particular case. Now suppose that this URI contains some character
that's not allowed in percent-encoded URIs by RFC3986: should the XML
processor reject that document as not being well-formed?

No, only if the character is forbidden by XML. It's no business of XML
processors to have to check other standards, only their own.

Or neglecting percent encoding, what if a URI doesn't match the URI
production? Maybe there's no "scheme:" at the beginning. Is the document
not well-formed then?

A URI/IRI is an opaque string as far as the XML document is concerned,
the same as an ISBN or DOI. Provided it doesn't contain any forbidden
characters, it's not the parser's business to object.
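
As a rough illustration, Python's xml.etree.ElementTree (built on expat, so XML 1.0 rather than 1.1, and not fetching the external subset) accepts a system identifier that is clearly not a valid URI. The identifier string is made up for this sketch.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Hypothetical system identifier: no scheme, spaces, "unwise" characters.
system_id = "no scheme here {at all}|^"
doc = f'<!DOCTYPE d SYSTEM "{system_id}"><d/>'

# Parsing succeeds; the parser never inspects the identifier as a URI.
root = ET.fromstring(doc)
print(root.tag)  # d

# A URI library, by contrast, sees that there is no scheme component.
print(urlsplit(system_id).scheme == "")  # True
```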

///Peter

Nicolai Stange · Dec 13, 2009

Hi again,

First of all, Peter, thank you for your exhaustive reply!

First of all note that every character is allowed in a SystemLiteral
(production [11] in XML11) as long as no RestrictedChar (production [2a]
in XML11) is contained verbatim (although it could be contained
indirectly through the usage of parameter entities).
So after all, there's no restriction.

Nearly. A SystemLiteral may contain non-XML characters (in order to
allow an External Entity Reference to resolve to a filename, for
example). But a SystemLiteral is not parsed for markup, so you cannot
use Parameter Entity References in a SystemLiteral and expect them to
be resolved.

Good point.

The restriction imposed in [2a] applies to the content of all XML
documents, and a SystemLiteral can only practicably occur in a DOCTYPE
declaration, or in an internal or external DTD subset, and then only
in the SYSTEM identifier of a declared ENTITY or NOTATION; my
understanding is that a RestrictedChar cannot occur even in a
SystemLiteral despite the implication of the RE [^"] in Production 11;

I totally agree: Production [1] of XML11 explicitly forbids a
RestrictedChar anywhere.
So, in the end a SystemLiteral may contain anything except a
RestrictedChar and the opening quote.
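
That rule can be sketched as a small Python check. The character ranges are those of production [2a] in XML11; the function name is invented, and the sketch deliberately checks only the two conditions stated here (it does not cover the other exclusions of the Char production, such as #x0 or surrogates).

```python
import re

# RestrictedChar, production [2a] of XML 1.1:
# [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
RESTRICTED_CHAR = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")

def is_plausible_system_literal(text: str, quote: str = '"') -> bool:
    """True if text contains neither the opening quote nor a RestrictedChar."""
    return quote not in text and RESTRICTED_CHAR.search(text) is None

print(is_plausible_system_literal("http://example.org/a b {c}.dtd"))  # True
print(is_plausible_system_literal("bad\x01name.dtd"))                 # False
print(is_plausible_system_literal('say "hi".dtd'))                    # False
print(is_plausible_system_literal("say 'hi'.dtd", quote='"'))         # True
```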

Basically, if you want to use control characters or "unwise"
characters in filenames, and reference them in SYSTEM identifiers,
don't expect any XML parser to accept them.

So in the end, it's implementation-dependent whether a URI is actually
fetched and, if so, which one? And the XML processor's URI-dereferencing
backend should do its best to interpret the URI and retrieve something?

Good night

Nicolai
