urllib interpretation of URL with ".."

John Nagle · Jun 23, 2007

Here's a URL, found in a link, which gives us trouble
when we try to follow the link:

http://sportsbra.co.uk/../acatalog/shop.html

Browsers immediately turn this into

http://sportsbra.co.uk/acatalog/shop.html

and go from there, but urllib tries to open it explicitly, which
results in an HTTP error 400.

Is "urllib" wrong?

John Nagle

Martin v. Löwis · Jun 23, 2007

John said:
Here's a URL, found in a link, which gives us trouble
when we try to follow the link:

http://sportsbra.co.uk/../acatalog/shop.html

Browsers immediately turn this into

http://sportsbra.co.uk/acatalog/shop.html

and go from there, but urllib tries to open it explicitly, which
results in an HTTP error 400.

Is "urllib" wrong?

I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Regards,
Martin

John Nagle · Jun 23, 2007

Martin said:
I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Regards,
Martin

I think you're right. The problem is that there is apparently a de-facto
standard in browsers that any number of "../" sequences at the beginning of
the path part of a URL have no effect. Even Google seems to use that
interpretation; not only does it follow that link, it lists it in Google
without the "..".

John Nagle

Duncan Booth · Jun 25, 2007

Martin v. Löwis said:
I can't see how. HTTP 1.1 says that the parameter to the GET
request should be an abs_path; RFC 2396 says that
/../acatalog/shop.html is indeed an abs_path, as .. is a valid
segment. That RFC also has a section on relative identifiers
and normalization; it defines what .. means *in a relative path*.

Section 4 is explicit about .. in absolute URIs:
# The syntax for relative URI is a shortened form of that for absolute
# URI, where some prefix of the URI is missing and certain path
# components ("." and "..") have a special meaning when, and only when,
# interpreting a relative path.

Notice the "and only when": browsers that modify the above
URL before sending it seem to be in clear violation of
RFC 2396.

Section 5.2 is also relevant here. In particular:

g) If the resulting buffer string still begins with one or more
complete path segments of "..", then the reference is
considered to be in error. Implementations may handle this
error by retaining these components in the resolved path (i.e.,
treating them as part of the final URI), by removing them from
the resolved path (i.e., discarding relative levels above the
root), or by avoiding traversal of the reference.

The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.
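
A minimal sketch of the three options in step (g), applied to an absolute path. The function name `resolve_leading_dotdot` and the policy names "retain", "remove" and "reject" are made up for this illustration; they are not part of urllib or of the RFC.

```python
def resolve_leading_dotdot(path, policy="remove"):
    """Handle leading ".." segments of an absolute path (hypothetical helper)."""
    segments = path.split("/")  # "/../a/b" -> ["", "..", "a", "b"]

    # Count the complete ".." segments that directly follow the leading "/".
    count = 0
    while count + 1 < len(segments) and segments[count + 1] == "..":
        count += 1

    if count == 0 or policy == "retain":
        # Option 1: keep the components as part of the final path.
        return path
    if policy == "remove":
        # Option 2: discard the relative levels above the root.
        return "/" + "/".join(segments[count + 1:])
    if policy == "reject":
        # Option 3: avoid traversing the reference at all.
        raise ValueError("path climbs above the root: %r" % path)
    raise ValueError("unknown policy: %r" % policy)


print(resolve_leading_dotdot("/../acatalog/shop.html", "retain"))
print(resolve_leading_dotdot("/../acatalog/shop.html", "remove"))
```

With "remove" the path from the original post becomes /acatalog/shop.html, which matches what the browsers appear to send; with "retain" it is passed through unchanged, which matches what urllib is described as doing.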

John Nagle · Jun 25, 2007

Duncan Booth said:
Section 5.2 is also relevant here. In particular:

The common practice seems to be for client-side implementations to handle
this using option 2 (removing them) and servers to use option 3 (avoiding
traversal of the reference). urllib uses option 1 which is also correct but
not as useful as it might be.

That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.
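
A quick way to see this, assuming Python 3, where the `urlparse` module lives on as `urllib.parse`: splitting the URL leaves the ".." segment in the path exactly as written.

```python
from urllib.parse import urlsplit

# Split the URL from the original post; no normalisation is applied.
parts = urlsplit("http://sportsbra.co.uk/../acatalog/shop.html")

print(parts.netloc)  # sportsbra.co.uk
print(parts.path)    # /../acatalog/shop.html
```

Whatever consumes `parts.path` has to decide for itself what to do with the leading "..".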

(Yet another hassle in processing real-world HTML.)

John Nagle

John J. Lee · Jun 25, 2007

John Nagle said:
That's helpful. Thanks.

In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's being
used by a client or a server, so it, reasonably enough, takes option 1.

(Yet another hassle in processing real-world HTML.)

Note that RFC 3986 obsoletes RFC 2396, and attempts to codify current
good practice re generic URL syntax (URI and relative reference
syntax, to use the precise terminology of the RFC). It discusses
normalisation at length, quite sensibly and pragmatically. And very
readable and useful it is too.

Somebody submitted a module implementing the URL splitting / joining
algorithms specified in RFC 3986 for inclusion in Python 2.6 -- I
haven't looked at that recently...

See also RFC 3987.
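
For reference, a sketch of the "remove_dot_segments" routine that RFC 3986 describes in section 5.2.4, written directly from the steps in the RFC. This is an illustration in plain Python, not the submitted module mentioned above, and it has not been checked against that module.

```python
def remove_dot_segments(path):
    """Sketch of RFC 3986 section 5.2.4; returns the path without "." and ".."."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]              # A: drop a leading "../"
        elif path.startswith("./"):
            path = path[2:]              # A: drop a leading "./"
        elif path.startswith("/./"):
            path = path[2:]              # B: "/./" becomes "/"
        elif path == "/.":
            path = "/"                   # B: a trailing "/." becomes "/"
        elif path.startswith("/../"):
            path = path[3:]              # C: "/../" becomes "/" ...
            if output:
                output.pop()             # ... and the last output segment goes
        elif path == "/..":
            path = "/"                   # C: same for a trailing "/.."
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""                    # D: a lone "." or ".." disappears
        else:
            # E: move the first segment (with its leading "/", if any)
            # from the input to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))        # /a/g
print(remove_dot_segments("/../acatalog/shop.html"))  # /acatalog/shop.html
```

Note that under this routine a ".." that would climb above the root is simply dropped, which is the behaviour the browsers in this thread appear to follow.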

John

sergio · Jun 26, 2007

John said:
In Python, of course, "urlparse.urlparse", which is
the main function used to disassemble a URL, has no idea whether it's
being used by a client or a server, so it, reasonably enough, takes option
1.

'http://somesite.com/../page.html'

For me this is a bug, and a very annoying one, because I can't simply strip ../
from the path: the base could have a directory level of its own.

Gabriel Genellina · Jun 26, 2007

sergio said:
'http://somesite.com/../page.html'

For me this is a bug, and a very annoying one, because I can't simply strip ../
from the path: the base could have a directory level of its own.

I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior - since all "meaningful" .. and . should have
been already processed by urljoin, a simple url =
url.replace("/../","/").replace("/./","/") may be enough.
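
One caveat with a single pass, sketched below: `str.replace` does not rescan overlapping matches, so two ".." segments in a row leave one behind. The helper name `strip_dot_segments` is hypothetical, and like the one-liner it assumes every meaningful ".." has already been resolved; it would also touch "/../" inside a query string.

```python
def strip_dot_segments(url):
    """Repeat the replace until the URL stops changing (hypothetical helper)."""
    previous = None
    while previous != url:
        previous = url
        url = url.replace("/../", "/").replace("/./", "/")
    return url


doubled = "http://somesite.com/../../page.html"

# A single pass removes only the first of the two adjacent ".." segments.
print(doubled.replace("/../", "/"))  # http://somesite.com/../page.html

# Looping until stable removes both.
print(strip_dot_segments(doubled))   # http://somesite.com/page.html
```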

sergio · Jun 27, 2007

Gabriel said:
I'd say it's an annoyance, not a bug. Write your own urljoin function with
your exact desired behavior - since all "meaningful" .. and . should have
been already processed by urljoin, a simple url =
url.replace("/../","/").replace("/./","/") may be enough.

I had exactly the same thought; the solution is simply this:

urlparse.urljoin(base,path).replace("/../","/")

Many thanks,
