Nicolai Stange
Hello all,
in the following I'm referring to
XML11 http://www.w3.org/TR/xml11/
RFC3986 http://www.rfc-editor.org/rfc/rfc3986.txt
RFC3987 http://www.rfc-editor.org/rfc/rfc3987.txt
I'm a little bit confused about Section 4.2.2 "External Entities"
(http://www.w3.org/TR/xml11/#sec-external-ent) in XML11 and URI usage.
First of all, note that every character is allowed in a SystemLiteral
(production [11] in XML11) as long as no RestrictedChar (production [2a]
in XML11) appears literally (although one could be introduced
indirectly through the use of parameter entities).
So in effect there is no restriction.
,----
| 1.) %-escaping
`----
Now, XML11 Section 4.2.2 states the following:
"Since escaping is not always a fully reversible process, it MUST
be performed only when absolutely necessary and as late as possible in a
processing chain. In particular, neither the process of converting a
relative URI to an absolute one nor the process of passing a URI
reference to a process or software component responsible for
dereferencing it SHOULD trigger escaping."
So why should any XML processor (which I take not to include the "process or
software component responsible for dereferencing it") start escaping at all? I
cannot think of any reason.
Another point: is an XML document's author part of the "processing
chain" mentioned above? He/she should be, since the retrieving software
component has no means of determining whether the URI given has already been
percent-escaped, and percent-escaping a URI twice must not occur
(RFC3986, Section 2.4 "When to Encode or Decode").
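An example:
If an author writes
http://example.org/%99_tax
does he/she mean
http://example.org/%99_tax
(already escaped, so "%99" stands for the single octet 0x99) or
http://example.org/%2599_tax
(a literal percent sign that still has to be escaped)?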
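To make the two readings concrete, here is a minimal sketch using Python's standard urllib.parse module. The variable names are my own, and the choice of ":" and "/" as the only characters left unescaped is an assumption made just for this example.

```python
from urllib.parse import quote, unquote_to_bytes

# The SystemLiteral exactly as the author typed it.
literal = "http://example.org/%99_tax"

# Reading A: the author already percent-escaped the URI,
# so "%99" stands for the single octet 0x99.
as_escaped = unquote_to_bytes(literal)

# Reading B: the author typed raw data, so '%' is an ordinary
# character that still has to be escaped as "%25".
# (safe=":/" is an assumption for this example only.)
as_raw = quote(literal, safe=":/")

print(as_escaped)
print(as_raw)
```

Both readings are valid from the retrieving component's point of view, and nothing in the string itself says which one the author intended.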
On the other hand: what if the data of some URI component contains a
reserved (that is, delimiter) character (production "reserved", RFC3986)?
Then the author has to percent-escape this character, because
otherwise the whole processing chain would mistake it for a
delimiter.
I think the XML11 percent-escaping rule is ambiguous. A much better way
would have been to require:
1.) The processing chain (including the author) should percent-escape
only characters that are members of the "reserved" production of RFC3986,
plus the percent sign, when they occur within the data of some component,
thus ending up with a partly percent-escaped URI.
2.) The percent sign should not be escaped any further along the processing
chain (including the retrieving component). The retrieving component
should escape only the non-delimiter characters that have to be
escaped.
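A rough sketch of how the two-step rule proposed above might look in Python. The function names author_escape and retriever_escape and the sample file name are hypothetical, and the RESERVED string is my transcription of gen-delims plus sub-delims from RFC3986; this is only an illustration of the proposal, not anything XML11 prescribes.

```python
from urllib.parse import quote

# gen-delims and sub-delims of RFC3986 (the "reserved" production).
RESERVED = ":/?#[]@!$&'()*+,;="

def author_escape(component_data: str) -> str:
    """Step 1: escape only reserved characters and '%' inside component data."""
    out = []
    for ch in component_data:
        if ch in RESERVED or ch == "%":
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

def retriever_escape(uri: str) -> str:
    """Step 2: leave '%' and delimiters alone, escape whatever else needs it."""
    return quote(uri, safe=RESERVED + "%")

# Hypothetical file name containing '%', '#', spaces and a non-ASCII character.
name = author_escape("99%_tax #1 é.xml")
uri = "http://example.org/" + name
final = retriever_escape(uri)
print(name)
print(final)
```

With this split, the retrieving component never has to guess whether a '%' is data or the start of an escape, because under the proposed rule a literal '%' would always arrive as "%25".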
Or am I missing something, and that XML11 section is unambiguous?
And if so, how should a retrieving component decide whether a URI has
already been percent-escaped as a whole or not? Heuristics?
,----
| 2.) What they refer to as URIs, aren't that in fact IRIs?
`----
OK, suppose an XML document's author needs to percent-escape a URI as a
whole because he/she is using a '#' in one of its data components (maybe in
a file name residing on a web server), and XML11 only allows for escaping
a URI as a whole or not at all (am I correct?). XML11 tells us how to
escape characters above 0x7F, namely by encoding them as UTF-8 and then
%-encoding the resulting octets. Isn't this in fact the IRI-to-URI mapping
defined in RFC3987? Why, then, doesn't the spec call that
SystemLiteral thing an IRI and require that a dereferencing component
accept IRIs?
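For illustration, the "encode as UTF-8, then %-encode the octets" step for characters above 0x7F could be sketched like this in Python. The function name escape_non_ascii and the sample URI are hypothetical, and the sketch deliberately covers only the above-0x7F case mentioned here, not any ASCII characters that may also need escaping.

```python
def escape_non_ascii(system_literal: str) -> str:
    """Percent-escape every character above 0x7F via its UTF-8 octets."""
    out = []
    for ch in system_literal:
        if ord(ch) > 0x7F:
            # One character may become several octets, each written as %HH.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical system identifier containing non-ASCII characters.
escaped = escape_non_ascii("http://example.org/résumé.xml")
print(escaped)
```

As far as I can tell, this is the same per-character transformation that RFC3987 describes for mapping an IRI to a URI, which is what prompts the question.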
,----
| 3.) Impact on well-formedness
`----
Suppose an XML processor can decide whether a URI has already
been %-encoded by the author, and it turns out that this is true in a
particular case. Now suppose this URI contains some character
that RFC3986 does not allow in percent-encoded URIs. Should the XML
processor reject the document as not being well-formed?
Or, leaving percent-encoding aside, what if a URI doesn't match the URI
production? Maybe there's no "scheme:" at the beginning. Is the document
not well-formed then?
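As a small illustration of the "no scheme" case, Python's urllib.parse.urlsplit can show whether a system identifier carries a scheme. The helper name has_scheme and both sample identifiers are hypothetical; the check only distinguishes the URI production of RFC3986 (scheme required) from a relative reference, and says nothing about what XML11 requires for well-formedness.

```python
from urllib.parse import urlsplit

def has_scheme(system_literal: str) -> bool:
    """True if the identifier starts with a 'scheme:' part."""
    return urlsplit(system_literal).scheme != ""

# Hypothetical system identifiers.
absolute = "http://example.org/doc.dtd"
relative = "chapters/intro.xml"

print(has_scheme(absolute))
print(has_scheme(relative))
```

A relative reference like the second one does not match the URI production, yet it is exactly what the quoted passage about "converting a relative URI to an absolute one" seems to expect.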
Thank you for clearing up these questions for me!
Best wishes
Nicolai Stange