urllib interpretation of URL with ".."

Martin v. Löwis

John said:
Here's a URL, found in a link, which gives us trouble
when we try to follow the link:

http://sportsbra.co.uk/../acatalog/shop.html

Browsers immediately turn this into

http://sportsbra.co.uk/acatalog/shop.html

and go from there, but urllib tries to open it explicitly, which
results in an HTTP error 400.

Is "urllib" wrong?

I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Regards,
Martin
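
The point about abs_path can be checked with a short sketch. It uses Python 3's urllib.parse (the successor to the urlparse module discussed in this thread), and the name request_target is only illustrative; it is not taken from urllib's internals.

```python
from urllib.parse import urlsplit

url = "http://sportsbra.co.uk/../acatalog/shop.html"
parts = urlsplit(url)

# urlsplit only splits the URL into components; it does not
# interpret "." or ".." segments in the path.
request_target = parts.path or "/"
if parts.query:
    request_target += "?" + parts.query

# This is the path a client would send if it forwards the URL unchanged.
print(request_target)
```

The ".." segment survives splitting, so a client that forwards the path as-is asks the server for a path that starts with "/../".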
 

John Nagle

Martin said:
I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Regards,
Martin

I think you're right. The problem is that there is apparently a de-facto
standard in browsers that any number of "../" sequences at the beginning of
the path part of a URL have no effect. Even Google seems to use that
interpretation; not only does it follow that link, it lists it in Google
without the "..".

John Nagle
 

Duncan Booth

Martin v. Löwis said:
I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Section 5.2 is also relevant here. In particular:
g) If the resulting buffer string still begins with one or more
complete path segments of "..", then the reference is
considered to be in error. Implementations may handle this
error by retaining these components in the resolved path (i.e.,
treating them as part of the final URI), by removing them from
the resolved path (i.e., discarding relative levels above the
root), or by avoiding traversal of the reference.

The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.
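
The three options in step g) can be sketched as a small helper. This is an illustration only: the function name handle_leading_dotdot and the policy names "retain", "remove" and "reject" are made up here, and the helper assumes it is given an absolute path that has already been through relative resolution.

```python
def handle_leading_dotdot(path, policy="retain"):
    """Apply one of the three options for leading ".." segments.

    Assumes `path` is an absolute path such as "/../a/b".
    """
    if policy not in ("retain", "remove", "reject"):
        raise ValueError("unknown policy: %r" % policy)

    segments = path.split("/")  # "/../a" -> ["", "..", "a"]

    # Count the ".." segments that directly follow the leading "/".
    count = 0
    while 1 + count < len(segments) and segments[1 + count] == "..":
        count += 1

    if count == 0 or policy == "retain":
        return path  # option 1: keep them as part of the final path
    if policy == "remove":
        # option 2: discard the levels above the root
        return "/" + "/".join(segments[1 + count:])
    # option 3: refuse to follow the reference at all
    raise ValueError("path climbs above the root: %r" % path)


print(handle_leading_dotdot("/../acatalog/shop.html", "retain"))
print(handle_leading_dotdot("/../acatalog/shop.html", "remove"))
```

With "retain" the path comes back unchanged, which matches what urllib sends; with "remove" it becomes the path the browsers request.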
 

John Nagle

Section 5.2 is also relevant here. In particular:




The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.

That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)

John Nagle
 

John J. Lee

John Nagle said:
That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)

Note that RFC 3986 obsoletes RFC 2396, and attempts to codify current
good practice re generic URL syntax (URI and relative reference
syntax, to use the precise terminology of the RFC). It discusses
normalisation at length, quite sensibly and pragmatically. And very
readable and useful it is too.

Somebody submitted a module implementing the URL splitting / joining
algorithms specified in RFC 3986 for inclusion in Python 2.6 -- I
haven't looked at that recently...

See also RFC 3987.


John
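
For reference, the normalisation step that RFC 3986 describes in section 5.2.4 ("Remove Dot Segments") can be sketched as below. This is a hand-written reading of that algorithm, not the module mentioned above and not code from the standard library.

```python
def remove_dot_segments(path):
    """Sketch of the RFC 3986 section 5.2.4 algorithm."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" -> "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" -> "/x"
            if output:
                output.pop()         # drop the previous segment, if any
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if present)
            # from the input to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/../acatalog/shop.html"))
print(remove_dot_segments("/a/b/c/./../../g"))
```

Under this algorithm a ".." that would climb above the root is simply dropped, which is the behaviour the browsers in this thread already show.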
 

sergio

John said:
In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's
being used by a client or a server, so it, reasonably enough, takes
option 1.

'http://somesite.com/../page.html'

For me this is a bug, and a very annoying one, because I can't simply
strip ../ from the path: the base could have a level.
 

Gabriel Genellina

'http://somesite.com/../page.html'

For me this is a bug, and a very annoying one, because I can't simply
strip ../ from the path: the base could have a level.

I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior. Since all "meaningful" .. and . should have
been already processed by urljoin, a simple

url = url.replace("/../","/").replace("/./","/")

may be enough.
 

sergio

Gabriel said:
I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior - since all "meaningful" .. and . should have
been already processed by urljoin, a simple url =
url.replace("/../","/").replace("/./","/") may be enough.

I had exactly the same thought; the solution is simply this:

urlparse.urljoin(base,path).replace("/../","/")


Many thanks,
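
One caveat about the replace() approach: str.replace does not rescan overlapping matches, so two consecutive "../" segments leave one behind, and a "/../" inside the query string would be rewritten too. A minimal sketch of an alternative, assuming Python 3's urllib.parse and a made-up helper name strip_leading_dotdot, touches only the leading ".." segments of the path:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_leading_dotdot(url):
    """Drop ".." segments at the start of the path; leave the rest alone."""
    parts = urlsplit(url)
    path = parts.path
    while path.startswith("/../"):
        path = path[3:]              # "/../x" -> "/x"
    if path == "/..":
        path = "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


url = "http://somesite.com/../../page.html"
print(url.replace("/../", "/"))      # one "../" is left behind
print(strip_leading_dotdot(url))     # both are removed
```

Because the helper works on the split path, the host, query and fragment are passed through untouched.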
 
