URL encoding API in Java 1.4.2

angrybaldguy

Fantastic explanation.
To clarify, when you see a URL like:

http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search

There are *two* *different* layers of syntax here. First is the URI/URL
syntax, which breaks the string down to:

Scheme: http
Authority: www.google.co.uk
Path: /search
Query: hl=en&safe=off&q=my+query&btnG=Search
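For anyone who wants to check this first layer from Java, here is a minimal sketch using java.net.URI (available since 1.4). The class and method names (UriParts, split) are made up for illustration. Note that java.net.URI reports the path with its leading slash.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriParts {
    // Returns { scheme, authority, path, raw query } for the given URL.
    // getRawQuery() is used so the query comes back exactly as written,
    // with no unescaping applied.
    static String[] split(String url) throws URISyntaxException {
        URI uri = new URI(url);
        return new String[] {
            uri.getScheme(), uri.getAuthority(), uri.getPath(), uri.getRawQuery()
        };
    }

    public static void main(String[] args) throws URISyntaxException {
        String[] parts = split(
            "http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search");
        System.out.println("Scheme:    " + parts[0]); // http
        System.out.println("Authority: " + parts[1]); // www.google.co.uk
        System.out.println("Path:      " + parts[2]); // /search
        System.out.println("Query:     " + parts[3]); // hl=en&safe=off&q=my+query&btnG=Search
    }
}
```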

Second is x-www-form-urlencoding of the query part, which breaks it down
to:

hl: en
safe: off
q: my query
btnG: Search

Note that it is permitted to have raw + signs in the query part: they're
reserved characters in URI syntax, but in the lesser 'subcomponent
delimiter' set, rather than the greater 'generic delimiter' set, and that
means that they can be used unescaped in a part, provided that the syntax
for that part permits it. I can't find anything in a specification of the
http URL scheme that forbids + from the query part, and thus, applying
ancient Anglo-Saxon legal principles, it's permitted. If you don't like
it, you can always escape them:

http://www.google.co.uk/search?hl=en&safe=off&q=my%2Bquery&btnG=Search

I sincerely believe that that URL is exactly equivalent to the one above.
Although i note that Google doesn't think so. Hmm.

Google's right. The query part of that URL expands to

hl: en
safe: off
q: my+query <-- Note the plus.
btnG: Search
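A quick way to see the difference from Java is to run both spellings through java.net.URLDecoder, which applies the form-urlencoded rules. This is only a sketch; the class name is made up, and "UTF-8" is an assumed charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PlusVersusEscapedPlus {
    // Decodes one form-urlencoded value: + becomes a space,
    // %hh becomes the character it stands for.
    static String decode(String raw) throws UnsupportedEncodingException {
        return URLDecoder.decode(raw, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(decode("my+query"));   // my query
        System.out.println(decode("my%2Bquery")); // my+query
    }
}
```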

In practice, a web app which treats

GET /foo?a+b HTTP/1.1
and
GET /foo?a%20b HTTP/1.1

differently is going to break one way or another. However, when
comparing URLs, those are two distinct URLs. Off the top of my head,
this comes up for caching (which is normally keyed by URL or URL
fragment) and for XML namespaces (namespaces are URIs, and the rules
for when two namespace URIs are equivalent are fairly strict).

-o
 
Tom Anderson

Fantastic explanation.

I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy! :)
Google's right. The query part of that URL expands to

hl: en
safe: off
q: my+query <-- Note the plus.
btnG: Search

The query part is encoded using x-www-form-urlencoding. Or rather, it's
encoded using the encoding specified in the form's enctype attribute,
which has a default value of application/x-www-form-urlencoded. The
specification for that says that spaces are escaped as pluses:

http://www.w3.org/TR/html4/interact/forms.html#h-17.13

So that plus *is* a space.

Clearly, it's not handled that way in practice. Furthermore, i'm a bit
dubious about the interaction between x-www-form-urlencoding and URI
escaping. For instance, what happens when a form value has a & or a = in
it? Those are used as delimiters in the x-www-form-urlencoding syntax, so
they have to be escaped. But what the spec says to do is to encode them
using %hh, which is the URI escape notation. Under my model for the
interaction of x-www-form-urlencoding and URI escaping, that would mean
that a form dataset like this:

text: this is &lt;html&gt;

Would be x-www-form-urlencoded as:

text=this+is+%26lt%3bhtml%26gt%3b

And using that as a query part would make a URI like:

http://example.com/search?text=this+is+%2526lt%253bhtml%2526gt%253b

That is, with the %s escaped again!

What actually happens is that the %-escaped characters in the query part
are not %-escaped again - and nor are the +s.
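Java happens to let you reproduce both outcomes, so here is a hedged sketch. URLEncoder does the form encoding; plain string concatenation gives the singly-escaped URL that browsers produce, while the multi-argument java.net.URI constructor always quotes % and so yields the doubly-escaped form. The class and method names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class DoubleEscaping {
    // x-www-form-urlencode a single value (UTF-8 is an assumption).
    static String formEncode(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // What browsers do: append the encoded data after "?" as-is.
    static String appendedOnce(String value) throws UnsupportedEncodingException {
        return "http://example.com/search?text=" + formEncode(value);
    }

    // The "two layers" model: hand the already-encoded query to a
    // constructor that escapes again. The multi-argument URI
    // constructors always quote '%', so %26 turns into %2526.
    static String escapedTwice(String value)
            throws UnsupportedEncodingException, URISyntaxException {
        String query = "text=" + formEncode(value);
        return new URI("http", "example.com", "/search", query, null).toString();
    }

    public static void main(String[] args) throws Exception {
        String value = "this is &lt;html&gt;";
        System.out.println(appendedOnce(value));
        // http://example.com/search?text=this+is+%26lt%3Bhtml%26gt%3B
        System.out.println(escapedTwice(value));
        // http://example.com/search?text=this+is+%2526lt%253Bhtml%2526gt%253B
    }
}
```

URLEncoder writes its hex digits in upper case, which is why the output shows %3B rather than %3b.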

I assume that what happened is that URI encoding and
x-www-form-urlencoding have been treated as a single process, with the
structure of the query part being considered part of the structure of the
URI, despite what the specifications say. Or rather, that they really are
part of the same process, and my reading of the specifications is wrong.
This may be covered under the slightly handwavey bits in the URI spec that
talk about scheme-specific syntax; if we consider x-www-form-urlencoding
part of the http scheme's syntax, rather than a separate layer of encoding
on top, i think it makes sense. Even though that's not what the HTML spec
says.

Although in fact, what the HTML spec says about form submission is:

If the method is "get" and the action is an HTTP URI, the user agent
takes the value of action, appends a `?' to it, then appends the form
data set, encoded using the "application/x-www-form-urlencoded" content
type. The user agent then traverses the link to this URI.

Note that it *doesn't* say that the encoded string is used as a query part
- it says it's appended directly to the action URL. Which sort of means
that URLs carrying form data are not strictly URLs at all ...

Anyway, i've gone mad now, so i'll leave it at that.
In practice, a web app which treats

GET /foo?a+b HTTP/1.1
and
GET /foo?a%20b HTTP/1.1

differently is going to break one way or another. However, when
comparing URLs, those are two distinct URLs.

No, those URLs are equivalent. From RFC 3986, in the section about how to
compare URIs for equality:

6.2.2.2. Percent-Encoding Normalization

The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode octets
that do not require percent-encoding, resulting in URIs that are
equivalent to their non-encoded counterparts. These URIs should be
normalized by decoding any percent-encoded octet that corresponds to an
unreserved character, as described in Section 2.3.

And on this, for once, Google at least agrees with me - try these:

http://www.google.co.uk/search?q=a+b
http://www.google.co.uk/search?q=a%20b

tom
 
John B. Matthews

Fantastic explanation.

I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy! :)

I would reiterate it as a compliment, rather than as a completion,
angular counterpart or immune protein.
 
Mark Space

John said:
I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy! :)

I would reiterate it as a compliment, rather than as a completion,
angular counterpart or immune protein.

Oh. Here I was hoping it should be subtracted from 180 degrees.
 
angrybaldguy

I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy! :)

It was. :)
The query part is encoded using x-www-form-urlencoding. Or rather, it's
encoded using the encoding specified in the form's enctype attribute,
which has a default value of application/x-www-form-urlencoded. The
specification for that says that spaces are escaped as pluses:

http://www.w3.org/TR/html4/interact/forms.html#h-17.13

So that plus *is* a space.

You are in a maze of twisty little standards, all different.

I can see how that reading is supported by the HTML spec, which is
*disgustingly* vague, but that's not what happens. The spec also
supports my reading.

You could accurately describe the encoding process that's actually
used in most browsers:

1. Convert all the reserved characters *except spaces* to their
%-encoded equivalents. (This converts +s in form fields to %2b, &s to
%26, and ?s to %3f, among others.)
2. Convert spaces to +s.
3. Join each key to its value using =.
4. Join each key-value pair in the order they appear in the form using
& as a separator.
5(get). Append the resulting string to the submission URL, offset by
an unescaped ?.
5(post). Submit the resulting string as the request body.

Yes, this sucks. Removing the exception for spaces in step 1 and
removing step 2 entirely gives a simpler encoding process with
equivalent power.
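The encoding steps above can be sketched in Java roughly as follows. java.net.URLEncoder covers steps 1 and 2 in a single call, so the code only has to do the joining. The class name, the String[][] representation of the form fields, and the UTF-8 charset are assumptions made for the example, not anything a browser or the spec mandates.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncoder {
    // fields[i][0] is a key, fields[i][1] its value, in form order.
    static String encode(String[][] fields) throws UnsupportedEncodingException {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append('&');                                  // step 4
            }
            out.append(URLEncoder.encode(fields[i][0], "UTF-8")); // steps 1-2
            out.append('=');                                      // step 3
            out.append(URLEncoder.encode(fields[i][1], "UTF-8")); // steps 1-2
        }
        return out.toString();
    }

    // Step 5(get): append to the action URL after an unescaped '?'.
    static String getUrl(String action, String[][] fields)
            throws UnsupportedEncodingException {
        return action + "?" + encode(fields);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String[][] fields = { { "q", "a+b c" }, { "x", "1&2?" } };
        System.out.println(getUrl("http://example.com/search", fields));
        // http://example.com/search?q=a%2Bb+c&x=1%262%3F
    }
}
```

For step 5(post), the string returned by encode would be written as the request body instead.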

I'm fairly sure the process I just described is a result of
incremental growth, and it's probably Mosaic's fault. The initial
encoding was probably "convert spaces to +s," before someone noticed
that sometimes people want to enter strings with +s (and ?s and &s) on
forms.
Clearly, it's not handled that way in practice. Furthermore, i'm a bit
dubious about the interaction between x-www-form-urlencoding and URI
escaping.

The x-www-form-urlencoding process is done instead of URI escaping,
rather than as well as. The result is a valid URI and can be correctly
converted back to the form data mostly via the URI unescaping rules:

1. Split the query into key-value pairs at every &.
2. Split the key-value pairs into keys and values at the first =.
3. Convert +s to %20.
4. URI-unescape the keys and values.
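Going the other way, the four decoding steps can be sketched like this. java.net.URLDecoder handles steps 3 and 4 together (it turns + into a space and unescapes %hh). The class name and the String[][] result shape are invented for the example, and UTF-8 is assumed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecoder {
    // Returns { key, value } pairs in the order they appear in the query.
    static String[][] decode(String query) throws UnsupportedEncodingException {
        String[] pairs = query.split("&");                // step 1
        String[][] result = new String[pairs.length][2];
        for (int i = 0; i < pairs.length; i++) {
            int eq = pairs[i].indexOf('=');               // step 2: first '=' only
            String rawKey = eq < 0 ? pairs[i] : pairs[i].substring(0, eq);
            String rawValue = eq < 0 ? "" : pairs[i].substring(eq + 1);
            result[i][0] = URLDecoder.decode(rawKey, "UTF-8");   // steps 3-4
            result[i][1] = URLDecoder.decode(rawValue, "UTF-8"); // steps 3-4
        }
        return result;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String[][] fields = decode("hl=en&safe=off&q=my+query&btnG=Search");
        for (int i = 0; i < fields.length; i++) {
            System.out.println(fields[i][0] + ": " + fields[i][1]);
        }
    }
}
```

Splitting on the delimiters before unescaping matters: an escaped %26 or %3D inside a value must not be mistaken for a separator.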
Although in fact, what the HTML spec says about form submission is:

  If the method is "get" and the action is an HTTP URI, the user agent
  takes the value of action, appends a `?' to it, then appends the form
  data set, encoded using the "application/x-www-form-urlencoded" content
  type. The user agent then traverses the link to this URI.

Note that it *doesn't* say that the encoded string is used as a query part
- it says it's appended directly to the action URL. Which sort of means
that URLs carrying form data are not strictly URLs at all ...

They are: the resulting strings fit the syntax requirements for URLs
and URIs; they also fit the structural requirements for HTTP URLs.
No, those URLs are equivalent. From RFC 3986, in the section about how to
compare URIs for equality:

  6.2.2.2. Percent-Encoding Normalization

  The percent-encoding mechanism (Section 2.1) is a frequent source of
  variance among otherwise identical URIs. In addition to the case
  normalization issue noted above, some URI producers percent-encode octets
  that do not require percent-encoding, resulting in URIs that are
  equivalent to their non-encoded counterparts. These URIs should be
  normalized by decoding any percent-encoded octet that corresponds to an
  unreserved character, as described in Section 2.3.

6.2.2 (Syntax-based Normalization) begins with "Implementations may",
not "Implementations must". RFC 3986 doesn't specify which way HTTP
implementations fall on this one. RFC 2616 (HTTP 1.1) 3.2.3 URI
Comparison states:
When comparing two URIs to decide if they match or not, a client SHOULD
use a case-sensitive octet-by-octet comparison of the entire URIs, with
these exceptions:

- A port that is empty or not given is equivalent to the default
port for that URI-reference;
- Comparisons of host names MUST be case-insensitive;
- Comparisons of scheme names MUST be case-insensitive;
- An empty abs_path is equivalent to an abs_path of "/".

Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

The "reserved" set is section 2.2 of RFC 3986:

reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="

So, ...?q=a%2bb and ...?q=a+b are *not* equivalent, since + is in the
"reserved" set. Neither are ...?q=a%20b and ...?q=a+b, since %20 is
not a hex encoding of +. (RFC 3986 does not require that + be used for
spaces; it only says that + is a delimiter.)

And since RFC 2616 only says "a client SHOULD" regarding the
exceptions, you can only reliably trust byte-for-byte identical URIs
to be treated as equivalent. Byte-for-byte comparison is also the rule
in some places where http: URIs show up outside of HTTP, like XML
namespaces (see http://www.w3.org/TR/xml-names/#NSNameComparison).
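As one data point, java.net.URI.equals behaves the same way for these cases. The sketch below assumes nothing beyond the JDK; the class and method names are made up. Bear in mind that java.net.URI implements its own comparison rules, so this illustrates one implementation's choice, not what the HTTP spec requires.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriComparison {
    // Compares two URI strings with java.net.URI.equals, which compares
    // raw components and never unescapes %hh before comparing.
    static boolean sameUri(String a, String b) throws URISyntaxException {
        return new URI(a).equals(new URI(b));
    }

    public static void main(String[] args) throws URISyntaxException {
        String base = "http://example.com/search?q=";
        System.out.println(sameUri(base + "a+b", base + "a%20b"));     // false
        System.out.println(sameUri(base + "a+b", base + "a%2bb"));     // false
        // Only the case of the hex digits differs here.
        System.out.println(sameUri(base + "a%2bb", base + "a%2Bb"));   // true
        // Scheme and host are compared without regard to case.
        System.out.println(sameUri("HTTP://EXAMPLE.com/search?q=a+b",
                                   "http://example.com/search?q=a+b")); // true
    }
}
```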
And on this, for once, Google at least agrees with me - try these:

http://www.google.co.uk/search?q=a+b
http://www.google.co.uk/search?q=a%20b

Different URLs, logically-different resources (cached separately,
except Google results are not cacheable), identical content - which is
what I was trying to say was the Right Thing.

The real-world implementations of these rules are usually what I
described. The various RFCs and W3C recommendations are woefully
loose, but in this case it's pretty cut and dried.

-o
 
