Lew
Mark said: Yes well, the internet is a series of tubes.
Twisty little tubes, all different.
Aligned in a xyzzygy.
Mark said: Yes well, the internet is a series of tubes.
Twisty little tubes, all different.
To clarify, when you see a URL like:
http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search
There are *two* *different* layers of syntax here. First is the URI/URL
syntax, which breaks the string down to:
Scheme: http
Authority: www.google.co.uk
Path: /search
Query: hl=en&safe=off&q=my+query&btnG=Search
Second is x-www-form-urlencoding of the query part, which breaks it down
to:
hl: en
safe: off
q: my query
btnG: Search
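For anyone who wants to poke at the two layers themselves, here is a minimal sketch using Python's standard urllib.parse module. The choice of Python is mine, not something from the thread; urlsplit handles the URI layer and parse_qsl handles the form-decoding layer.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search"

# Layer 1: generic URI syntax splits the string into components.
parts = urlsplit(url)
print(parts.scheme)  # http
print(parts.netloc)  # www.google.co.uk
print(parts.path)    # /search
print(parts.query)   # hl=en&safe=off&q=my+query&btnG=Search

# Layer 2: form decoding splits the query into name/value pairs,
# turning '+' into a space along the way.
fields = parse_qsl(parts.query)
print(fields)
# [('hl', 'en'), ('safe', 'off'), ('q', 'my query'), ('btnG', 'Search')]
```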
Note that it is permitted to have raw + signs in the query part: they're
reserved characters in URI syntax, but in the lesser 'subcomponent
delimiter' set, rather than the greater 'generic delimiter' set, and that
means that they can be used unescaped in a part, provided that the syntax
for that part permits it. I can't find anything in a specification of the
http URL scheme that forbids + from the query part, and thus, applying
ancient Anglo-Saxon legal principles, it's permitted. If you don't like
it, you can always escape them:
http://www.google.co.uk/search?hl=en&safe=off&q=my%2Bquery&btnG=Search
I sincerely believe that that URL is exactly equivalent to the one above.
Although i note that Google doesn't think so. Hmm.
Fantastic explanation.
Google's right. The query part of that URL expands to
hl: en
safe: off
q: my+query <-- Note the plus.
btnG: Search
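As a rough check of that expansion, this sketch runs a raw + and an escaped %2B through Python's parse_qsl. It only shows how one form decoder behaves, on the assumption that a typical server-side decoder works the same way; it says nothing about what Google actually runs.

```python
from urllib.parse import parse_qsl

# Same query twice: once with a raw '+', once with the '+' escaped as %2B.
raw_plus = dict(parse_qsl("hl=en&safe=off&q=my+query&btnG=Search"))
escaped_plus = dict(parse_qsl("hl=en&safe=off&q=my%2Bquery&btnG=Search"))

print(raw_plus["q"])      # my query   (the '+' was read as a space)
print(escaped_plus["q"])  # my+query   (the %2B was read as a literal plus)
```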
In practice, a web app which treats
GET /foo?a+b HTTP/1.1
and
GET /foo?a%20b HTTP/1.1
differently is going to break one way or another. However, when
comparing URLs, those are two distinct URLs.
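A small sketch of that distinction, again assuming Python's urllib.parse as a stand-in for whatever the web app uses: the two query strings differ as raw strings, agree once form-decoded, and disagree again if only plain percent-decoding is applied.

```python
from urllib.parse import unquote, unquote_plus

a = "a+b"
b = "a%20b"

# Compared as raw strings, the two queries are different.
print(a == b)  # False

# Form decoding ('+' means space) makes them the same value.
print(unquote_plus(a) == unquote_plus(b))  # True

# Plain percent-decoding leaves the '+' alone, so they differ again.
print(unquote(a))  # a+b
print(unquote(b))  # a b
```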
Fantastic explanation.
John said: I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy!
John said: Oh. Here I was hoping it should be subtracted from 180 degrees.
You sure have a cute angle on that.
Lew said: You sure have a cute angle on that.
Hence the expression cutie-pie, in radians.
Lew said: You sure have a cute angle on that.
I'm hoping you mean that as a complement, rather than as an assertion that
it's a fantasy!
The query part is encoded using x-www-form-urlencoding. Or rather, it's
encoded using the encoding specified in the form's enctype attribute,
which has a default value of application/x-www-form-urlencoded. The
specification for that says that spaces are escaped as pluses:
http://www.w3.org/TR/html4/interact/forms.html#h-17.13
So that plus *is* a space.
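To illustrate the encoding direction, here is a hedged sketch with Python's quote_plus (form-style escaping) next to quote (plain URI percent-escaping). Using these two functions as stand-ins for the two specifications is my assumption.

```python
from urllib.parse import quote, quote_plus

# Form-style escaping: a space becomes '+'.
form_encoded = quote_plus("my query")
print(form_encoded)  # my+query

# Plain URI percent-escaping: a space becomes %20.
uri_escaped = quote("my query")
print(uri_escaped)  # my%20query

# A literal plus in the data has to be escaped, or it would read back as a space.
literal_plus = quote_plus("my+query")
print(literal_plus)  # my%2Bquery
```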
Clearly, it's not handled that way in practice. Furthermore, i'm a bit
dubious about the interaction between x-www-form-urlencoding and URI
escaping.
Although in fact, what the HTML spec says about form submission is:
If the method is "get" and the action is an HTTP URI, the user agent
takes the value of action, appends a `?' to it, then appends the form
data set, encoded using the "application/x-www-form-urlencoded" content
type. The user agent then traverses the link to this URI.
Note that it *doesn't* say that the encoded string is used as a query part
- it says it's appended directly to the action URL. Which sort of means
that URLs carrying form data are not strictly URLs at all ...
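The quoted procedure can be sketched roughly as follows. The variable names (action, form_data) are made up for illustration, and urlencode is assumed to be a fair model of the application/x-www-form-urlencoded step.

```python
from urllib.parse import urlencode

# Hypothetical form: the action URL and the form data set, in document order.
action = "http://www.google.co.uk/search"
form_data = [("hl", "en"), ("safe", "off"), ("q", "my query"), ("btnG", "Search")]

# "Takes the value of action, appends a `?' to it, then appends the form data set."
url = action + "?" + urlencode(form_data)
print(url)
# http://www.google.co.uk/search?hl=en&safe=off&q=my+query&btnG=Search
```

Note that this is plain string concatenation, just as the spec text describes; nothing here checks that the result is a well-formed URI.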
No, those URLs are equivalent. From RFC 3986, in the section about how to
compare URIs for equality:
6.2.2.2. Percent-Encoding Normalization
The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode octets
that do not require percent-encoding, resulting in URIs that are
equivalent to their non-encoded counterparts. These URIs should be
normalized by decoding any percent-encoded octet that corresponds to an
unreserved character, as described in Section 2.3.
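A minimal sketch of that normalization step, assuming the unreserved set from RFC 3986 section 2.3 (letters, digits, and - . _ ~). The function name is made up, and the uppercasing of the remaining hex digits comes from the case normalization rule rather than from the paragraph quoted above.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def normalize_percent_encoding(uri: str) -> str:
    """Decode percent-encoded octets that map to unreserved characters."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # needlessly encoded, so decode it
        return "%" + match.group(1).upper()  # keep encoded, uppercase the hex
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)

print(normalize_percent_encoding("/%7Euser/%41bc"))  # /~user/Abc
print(normalize_percent_encoding("q=my%2bquery"))    # q=my%2Bquery
print(normalize_percent_encoding("q=a%20b"))         # q=a%20b
```

Under this rule alone, %2B and %20 stay encoded, because + is a reserved character and a space is not in the unreserved set.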
And from RFC 2616, section 3.2.3:
When comparing two URIs to decide if they match or not, a client SHOULD
use a case-sensitive octet-by-octet comparison of the entire URIs, with
these exceptions:
- A port that is empty or not given is equivalent to the default
port for that URI-reference;
- Comparisons of host names MUST be case-insensitive;
- Comparisons of scheme names MUST be case-insensitive;
- An empty abs_path is equivalent to an abs_path of "/".
Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
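The four listed exceptions can be sketched like this. It is a partial illustration only: the final rule about percent-encoded characters is deliberately left out, the function names are made up, and the default-port table is my assumption.

```python
from urllib.parse import urlsplit

# Assumed default ports; only http is needed for the examples here.
DEFAULT_PORTS = {"http": 80}

def comparison_key(url: str):
    """Reduce an http URL to a tuple that applies the four listed exceptions."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                # scheme names: case-insensitive
    host = (parts.hostname or "").lower()        # host names: case-insensitive
    port = parts.port or DEFAULT_PORTS.get(scheme)  # missing port: use default
    path = parts.path or "/"                     # empty abs_path equals "/"
    return (scheme, host, port, path, parts.query)

def same_http_url(a: str, b: str) -> bool:
    # Everything else is compared exactly, including the query string.
    return comparison_key(a) == comparison_key(b)

print(same_http_url("HTTP://WWW.Google.co.uk:80", "http://www.google.co.uk/"))  # True
print(same_http_url("http://www.google.co.uk/Search",
                    "http://www.google.co.uk/search"))  # False
```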
And on this, for once, Google at least agrees with me - try these:
http://www.google.co.uk/search?q=a+b
http://www.google.co.uk/search?q=a%20b