Martin v. Löwis said:
and thanks from me too.
Unfortunately, this isn't normative, but "we recommend". In
addition, it talks about URIs found HTML only. If somebody writes
a user agent written in Python, they are certainly free to follow
this recommendation - but I think this is a case where Python should
refuse the temptation to guess.
So you believe that because something is only recommended by a standard
Python should refuse to implement it? This is the kind of thinking that in
the 1980's gave us a version of gcc where any attempt to use #pragma (which
according to the standard invokes undefined behaviour) would spawn a copy
of nethack or rogue.
You don't seem to have realised yet, but my objection to the behaviour of
urllib.unquote is precisely that it does guess, and it guesses wrongly. In
fact it guesses latin1 instead of utf8. If it threw an exception for non-
ascii values, then it would match the standard (in the sense of not
following a recommendation because it doesn't have to) and it would be
purely a quality of implementation issue.
If you don't believe me that it guesses latin1, try it. For all valid URIs
(i.e. ignoring those with non-ascii characters already in them) in the
current implementation where u is a unicode object:
unquote(u)==unquote(u.encode('ascii')).decode('latin1')
I generally agree that Python should avoid guessing, so I wouldn't really
object if it threw an exception or always returned a byte string even
though the html standard recommends using utf8 and the uri rfc requires it
for all new uri schemes. However, in this case I think it would be useful
behaviour: e.g. a decent xml parser is going to give me back the attributes
including encoded uris in unicode. To handle those correctly you must
encode to ascii before unquoting. This is an avoidable pitfall in the
standard library.
On second thoughts, perhaps the current behaviour is actually closer to:
unquote(u)==unquote(u.encode('latin1')).decode('latin1')
as that also matches the current behaviour for uris which contain non-ascii
characters when the characters have a latin1 encoding. To fully conform
with the html standard's recommendation it should actually be equivalent
to:
unquote(u)==unquote(u.encode('utf8')).decode('utf8')
The catch with the current behaviour is that it doesn't exactly mimic any
sensible behaviour at all. It decodes the escaped octets as though they
were latin1 encoded, but it mixes them into a unicode string so there is no
way to correct its bad guess. In other words the current behaviour is
actively harmful.