urllib.unquote and unicode

Discussion in 'Python' started by George Sakkis, Dec 19, 2006.

  1. The following snippet results in a different outcome for (at least) the
    last three major releases:

    >>> import urllib
    >>> urllib.unquote(u'%94')


    # Python 2.3.4
    u'%94'

    # Python 2.4.2
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
    ordinal not in range(128)

    # Python 2.5
    u'\x94'

    Is the current version the "right" one or is this function supposed to
    change every other week?

    George
     
    George Sakkis, Dec 19, 2006
    #1

  2. George Sakkis

    Leo Kislov Guest

    George Sakkis wrote:
    > The following snippet results in a different outcome for (at least) the
    > last three major releases:
    >
    > >>> import urllib
    > >>> urllib.unquote(u'%94')

    >
    > # Python 2.3.4
    > u'%94'
    >
    > # Python 2.4.2
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
    > ordinal not in range(128)
    >
    > # Python 2.5
    > u'\x94'
    >
    > Is the current version the "right" one or is this function supposed to
    > change every other week?


    IMHO, none of the results is right. Either the unicode string should be
    rejected by raising ValueError, or it should be encoded with the ascii
    encoding so that the result is the same as
    urllib.unquote(u'%94'.encode('ascii')), that is '\x94'. You can consider
    the current behaviour undefined: just as when you pass a random object
    into some function, you can get a different outcome in different python
    versions.

    -- Leo
     
    Leo Kislov, Dec 19, 2006
    #2

  3. George Sakkis

    Peter Otten Guest

    George Sakkis wrote:

    > The following snippet results in a different outcome for (at least) the
    > last three major releases:
    >
    >>>> import urllib
    >>>> urllib.unquote(u'%94')


    > # Python 2.4.2
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
    > ordinal not in range(128)


    Python 2.4.3 (#3, Aug 23 2006, 09:40:15)
    [GCC 3.3.3 (SuSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib
    >>> urllib.unquote(u"%94")

    u'\x94'
    >>>


    From the above I infer that the 2.4.2 behaviour was considered a bug.

    Peter
     
    Peter Otten, Dec 19, 2006
    #3
  4. George Sakkis wrote:

    > The following snippet results in a different outcome for (at least) the
    > last three major releases:
    >
    >>>> import urllib
    >>>> urllib.unquote(u'%94')

    >
    > # Python 2.3.4
    > u'%94'
    >
    > # Python 2.4.2
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
    > ordinal not in range(128)
    >
    > # Python 2.5
    > u'\x94'
    >
    > Is the current version the "right" one or is this function supposed to
    > change every other week?


    why are you passing non-ASCII Unicode strings to a function designed for
    fixing up 8-bit strings in the first place? if you do proper encoding
    before you quote things, it'll work the same way in all Python releases.

    </F>
     
    Fredrik Lundh, Dec 19, 2006
    #4
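Fredrik's encode-before-you-quote advice can be sketched with the modern urllib.parse API (a later spelling the thread predates; in Python 2 the same idea is urllib.quote/unquote with explicit .encode()/.decode() calls). The point is that both sides of the % escaping work on bytes, so no implicit codec guess is involved:

```python
# Sketch of "do proper encoding before you quote": keep bytes on both
# sides of the % escaping so the result never depends on an implicit
# codec choice. Uses Python 3's urllib.parse.
from urllib.parse import quote, unquote_to_bytes

text = u"caf\u00e9"

# explicit encode, then quote: each non-ASCII byte becomes %XX
quoted = quote(text.encode("utf-8"))
print(quoted)  # caf%C3%A9

# unquote back to raw octets, then explicit decode
roundtrip = unquote_to_bytes(quoted).decode("utf-8")
assert roundtrip == text
```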
  5. George Sakkis

    Duncan Booth Guest

    "Leo Kislov" <> wrote:

    > George Sakkis wrote:
    >> The following snippet results in a different outcome for (at least) the
    >> last three major releases:
    >>
    >> >>> import urllib
    >> >>> urllib.unquote(u'%94')

    >>
    >> # Python 2.3.4
    >> u'%94'
    >>
    >> # Python 2.4.2
    >> UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position
    >> 0: ordinal not in range(128)
    >>
    >> # Python 2.5
    >> u'\x94'
    >>
    >> Is the current version the "right" one or is this function supposed
    >> to change every other week?

    >
    > IMHO, none of the results is right. Either the unicode string should
    > be rejected by raising ValueError, or it should be encoded with the
    > ascii encoding so that the result is the same as
    > urllib.unquote(u'%94'.encode('ascii')), that is '\x94'. You can
    > consider the current behaviour undefined: just as when you pass a
    > random object into some function, you can get a different outcome in
    > different python versions.


    I agree with you that none of the results is right, but not that the
    behaviour should be undefined.

    The way that uri encoding is supposed to work is that first the input
    string in unicode is encoded to UTF-8 and then each byte which is not in
    the permitted range for characters is encoded as % followed by two hex
    characters.

    That means that the string u'\x94' should be encoded as %c2%94. The
    string %94 should generate a unicode decode error, but it should be the
    utf-8 codec raising the error not the ascii codec.

    Unfortunately RFC3986 isn't entirely clear-cut on this issue:

    > When a new URI scheme defines a component that represents textual
    > data consisting of characters from the Universal Character Set [UCS],
    > the data should first be encoded as octets according to the UTF-8
    > character encoding [STD63]; then only those octets that do not
    > correspond to characters in the unreserved set should be percent-
    > encoded. For example, the character A would be represented as "A",
    > the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
    > as "%C3%80", and the character KATAKANA LETTER A would be represented
    > as "%E3%82%A2".


    I think it leaves open the possibility that existing URI schemes which do
    not support unicode characters can use other encodings, but given that the
    original posting started by decoding a unicode string I think that utf-8
    should definitely be assumed in this case.

    Also, urllib.quote() should encode into utf-8 instead of throwing KeyError
    for a unicode string.
     
    Duncan Booth, Dec 19, 2006
    #5
  6. Fredrik Lundh wrote:
    > George Sakkis wrote:
    >
    > > The following snippet results in a different outcome for (at least) the
    > > last three major releases:
    > >
    > >>>> import urllib
    > >>>> urllib.unquote(u'%94')

    > >
    > > # Python 2.3.4
    > > u'%94'
    > >
    > > # Python 2.4.2
    > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 0:
    > > ordinal not in range(128)
    > >
    > > # Python 2.5
    > > u'\x94'
    > >
    > > Is the current version the "right" one or is this function supposed to
    > > change every other week?

    >
    > why are you passing non-ASCII Unicode strings to a function designed for
    > fixing up 8-bit strings in the first place? if you do proper encoding
    > before you quote things, it'll work the same way in all Python releases.


    I'm using BeautifulSoup, which from version 3 returns Unicode only, and
    I stumbled on a page with such bogus char encodings; I have the
    impression that whatever generated it used ord() to encode reserved
    characters instead of the proper hex representation in latin-1. If
    that's the case, unquote() won't do anyway and I'd have to go with
    chr() on the number part.

    George
     
    George Sakkis, Dec 19, 2006
    #6
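The chr()-on-the-number-part repair George mentions might look like the sketch below. It assumes the page's generator really did escape raw byte ordinals of latin-1-ish text inside unicode, which is a guess about broken input, not any standard behaviour:

```python
import re

def unquote_bogus(u):
    # Replace each %XX with the character whose ordinal is 0xXX,
    # i.e. chr() on the number part, bypassing urllib entirely.
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        u,
    )

print(repr(unquote_bogus(u"foo%94bar")))  # 'foo\x94bar'
```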
  7. Duncan Booth schrieb:
    > The way that uri encoding is supposed to work is that first the input
    > string in unicode is encoded to UTF-8 and then each byte which is not in
    > the permitted range for characters is encoded as % followed by two hex
    > characters.


    Can you back up this claim ("is supposed to work") by reference to
    a specification (ideally, chapter and verse)?

    In URIs, it is entirely unspecified what the encoding is of non-ASCII
    characters, and whether % escapes denote characters in the first place.

    > Unfortunately RFC3986 isn't entirely clear-cut on this issue:
    >
    >> When a new URI scheme defines a component that represents textual
    >> data consisting of characters from the Universal Character Set [UCS],
    >> the data should first be encoded as octets according to the UTF-8
    >> character encoding [STD63]; then only those octets that do not
    >> correspond to characters in the unreserved set should be percent-
    >> encoded. For example, the character A would be represented as "A",
    >> the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
    >> as "%C3%80", and the character KATAKANA LETTER A would be represented
    >> as "%E3%82%A2".


    This is irrelevant, it talks about new URI schemes only.

    > I think it leaves open the possibility that existing URI schemes which do
    > not support unicode characters can use other encodings, but given that the
    > original posting started by decoding a unicode string I think that utf-8
    > should definitely be assumed in this case.


    No, the http scheme is defined by RFC 2616 instead. It doesn't really
    talk about encodings, but hints at an interpretation in section 3.2.3:

    # When comparing two URIs to decide if they match or not, a client
    # SHOULD use a case-sensitive octet-by-octet comparison of the entire
    # URIs, [...]
    # Characters other than those in the "reserved" and "unsafe" sets (see
    # RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

    Now, RFC 2396 already says that URIs are sequences of characters,
    not sequences of octets, yet RFC 2616 fails to recognize that issue
    and refuses to specify a character set for its scheme (which
    RFC 2396 says that it could).

    The conventional wisdom is that the choice of URI encoding for HTTP
    is a server-side decision; for that reason, IRIs were introduced.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 19, 2006
    #7
  8. George Sakkis

    Duncan Booth Guest

    "Martin v. Löwis" <> wrote:

    > Duncan Booth schrieb:
    >> The way that uri encoding is supposed to work is that first the input
    >> string in unicode is encoded to UTF-8 and then each byte which is not
    >> in the permitted range for characters is encoded as % followed by two
    >> hex characters.

    >
    > Can you back up this claim ("is supposed to work") by reference to
    > a specification (ideally, chapter and verse)?


    I'm not sure I have time to read the various RFCs in depth right now,
    so I may have to come back on this thread later. The one thing I'm
    convinced of is that the current implementations of urllib.quote and
    urllib.unquote are broken in respect to their handling of unicode. In
    particular, % encoding is defined in terms of octets, so when given a
    unicode string urllib.quote should either encode it or throw a suitable
    exception (not KeyError, which is what it seems to throw now).

    My objection to urllib.unquote is that urllib.unquote(u'%a3') returns
    u'\xa3', which is a character, not an octet. I think it should always
    return a byte string, or it should calculate a byte string and then
    decode it according to some suitable encoding, or it should throw an
    exception [choose any of the above].

    Adding an optional encoding parameter to quote/unquote would be one
    option, although since you can encode/decode the parameter yourself it
    doesn't add much.
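For what it's worth, later Pythons ended up offering both of the behaviours described above in urllib.parse (the thread predates this): unquote_to_bytes returns the raw octets, and unquote grew an explicit encoding parameter:

```python
from urllib.parse import unquote, unquote_to_bytes

# % escapes are defined over octets, so this returns a byte string:
print(unquote_to_bytes("%a3"))  # b'\xa3'

# unquote decodes those octets with an explicit codec (utf-8 default):
print(unquote("%a3", encoding="latin-1"))  # '\xa3' (POUND SIGN)
print(unquote("%c2%a3"))                   # '\xa3' via the utf-8 default
```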

    > No, the http scheme is defined by RFC 2616 instead. It doesn't really
    > talk about encodings, but hints an interpretation in 3.2.3:


    The applicable RFC is 3986. See RFC 2616 section 3.2.1:
    > For definitive information on URL syntax and semantics, see "Uniform
    > Resource Identifiers (URI):
    > Generic Syntax and Semantics," RFC 2396 [42] (which replaces RFCs
    > 1738 [4] and RFC 1808 [11]).


    and RFC 2396:
    > Obsoleted by: 3986



    > Now, RFC 2396 already says that URIs are sequences of characters,
    > not sequences of octets, yet RFC 2616 fails to recognize that issue
    > and refuses to specify a character set for its scheme (which
    > RFC 2396 says that it could).


    and RFC 2277, section 3.1, says that it MUST identify which charset is
    used (although that's just a best-practices document, not a standard).
    (The block capitals are the RFC's, not mine.)

    > The conventional wisdom is that the choice of URI encoding for HTTP
    > is a server-side decision; for that reason, IRIs were introduced.


    Yes, I know that in practice some systems use other character sets.
     
    Duncan Booth, Dec 20, 2006
    #8
  9. Martin v. Löwis wrote:
    > Duncan Booth schrieb:
    >> The way that uri encoding is supposed to work is that first the input
    >> string in unicode is encoded to UTF-8 and then each byte which is not in
    >> the permitted range for characters is encoded as % followed by two hex
    >> characters.

    >
    > Can you back up this claim ("is supposed to work") by reference to
    > a specification (ideally, chapter and verse)?
    >
    > In URIs, it is entirely unspecified what the encoding is of non-ASCII
    > characters, and whether % escapes denote characters in the first place.


    http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

    Servus,
    Walter
     
    Walter Dörwald, Dec 21, 2006
    #9
  10. >>> The way that uri encoding is supposed to work is that first the input
    >>> string in unicode is encoded to UTF-8 and then each byte which is not in
    >>> the permitted range for characters is encoded as % followed by two hex
    >>> characters.

    >> Can you back up this claim ("is supposed to work") by reference to
    >> a specification (ideally, chapter and verse)?

    > http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1


    Thanks. Unfortunately, this isn't normative, but "we recommend". In
    addition, it talks about URIs found in HTML only. If somebody writes
    a user agent in Python, they are certainly free to follow this
    recommendation - but I think this is a case where Python should
    refuse the temptation to guess.

    If somebody implemented IRIs, that would be an entirely different
    matter.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 21, 2006
    #10
  11. George Sakkis

    Duncan Booth Guest

    "Martin v. Löwis" <> wrote:

    >>>> The way that uri encoding is supposed to work is that first the
    >>>> input string in unicode is encoded to UTF-8 and then each byte
    >>>> which is not in the permitted range for characters is encoded as %
    >>>> followed by two hex characters.
    >>> Can you back up this claim ("is supposed to work") by reference to
    >>> a specification (ideally, chapter and verse)?

    >> http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

    >
    > Thanks.

    and thanks from me too.

    > Unfortunately, this isn't normative, but "we recommend". In
    > addition, it talks about URIs found in HTML only. If somebody writes
    > a user agent in Python, they are certainly free to follow this
    > recommendation - but I think this is a case where Python should
    > refuse the temptation to guess.


    So you believe that because something is only recommended by a standard,
    Python should refuse to implement it? This is the kind of thinking that
    in the 1980s gave us a version of gcc where any attempt to use #pragma
    (which according to the standard invokes undefined behaviour) would
    spawn a copy of nethack or rogue.

    You don't seem to have realised yet, but my objection to the behaviour
    of urllib.unquote is precisely that it does guess, and it guesses
    wrongly. In fact it guesses latin1 instead of utf8. If it threw an
    exception for non-ascii values, then it would match the standard (in
    the sense of not following a recommendation because it doesn't have to)
    and it would be purely a quality-of-implementation issue.

    If you don't believe me that it guesses latin1, try it. For all valid URIs
    (i.e. ignoring those with non-ascii characters already in them) in the
    current implementation where u is a unicode object:

    unquote(u)==unquote(u.encode('ascii')).decode('latin1')

    I generally agree that Python should avoid guessing, so I wouldn't really
    object if it threw an exception or always returned a byte string even
    though the html standard recommends using utf8 and the uri rfc requires it
    for all new uri schemes. However, in this case I think it would be useful
    behaviour: e.g. a decent xml parser is going to give me back the attributes
    including encoded uris in unicode. To handle those correctly you must
    encode to ascii before unquoting. This is an avoidable pitfall in the
    standard library.

    On second thoughts, perhaps the current behaviour is actually closer to:

    unquote(u)==unquote(u.encode('latin1')).decode('latin1')

    as that also matches the current behaviour for uris which contain non-ascii
    characters when the characters have a latin1 encoding. To fully conform
    with the html standard's recommendation it should actually be equivalent
    to:

    unquote(u)==unquote(u.encode('utf8')).decode('utf8')

    The catch with the current behaviour is that it doesn't exactly mimic any
    sensible behaviour at all. It decodes the escaped octets as though they
    were latin1 encoded, but it mixes them into a unicode string so there is no
    way to correct its bad guess. In other words the current behaviour is
    actively harmful.
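In Python 3's urllib.parse (an addition this thread predates), the competing guesses can be compared side by side through unquote's encoding parameter; the latin-1 line corresponds to the behaviour described above:

```python
from urllib.parse import unquote

u = "%C2%94"  # utf-8 percent-encoding of U+0094

# utf-8 interpretation (Python 3's default): one character back
print(repr(unquote(u)))  # '\x94'

# the latin1 guess: the two octets come back as two separate
# characters, and the utf-8 structure of the input is discarded
print(repr(unquote(u, encoding="latin-1")))
```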
     
    Duncan Booth, Dec 22, 2006
    #11
  12. Duncan Booth schrieb:
    > So you believe that because something is only recommended by a standard
    > Python should refuse to implement it?


    Yes. In the face of ambiguity, refuse the temptation to guess.

    This is *deeply* ambiguous; people have been using all kinds of
    encodings in http URLs.

    > You don't seem to have realised yet, but my objection to the behaviour of
    > urllib.unquote is precisely that it does guess, and it guesses wrongly.


    Yes, it seems that this was a bad move.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 22, 2006
    #12
