urlencode with high characters

Jim

Hello,

I'm trying to do urllib.urlencode() with unicode correctly, and I
wonder if some kind person could set me straight?

My understanding is that I am supposed to be able to urlencode anything
up to the top half of latin-1 -- decimal 128-255.

I can't just send urlencode a unicode character:

Python 2.3.5 (#2, May 4 2005, 08:51:39)
[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.3/urllib.py", line 1206, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in
position 3: ordinal not in range(128)

Is it instead Right that I should send a unicode string to urlencode by
first encoding it to 'latin-1' ?
'x=abc%F6def'
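
For readers who want to reproduce that result, here is a minimal sketch of the encode-first approach. It assumes Python 3's `urllib.parse` rather than the Python 2.3 `urllib` from the session above, and `urlencode_latin1` is a hypothetical helper name, not part of the library.

```python
from urllib.parse import urlencode


def urlencode_latin1(params):
    """Hypothetical helper: encode each text value to Latin-1 bytes first.

    urlencode() percent-encodes bytes values as they are, so the choice
    of character encoding is made by the caller, not by the library.
    """
    encoded = {key: value.encode("latin-1") for key, value in params.items()}
    return urlencode(encoded)


query = urlencode_latin1({"x": "abc\xf6def"})
print(query)  # x=abc%F6def
```

Whether Latin-1 is the right choice still depends on what the receiving server expects.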

If it is Right, I'm puzzled as to why urlencode doesn't do it. Or am I
missing something? urllib.urlencode() contains the lines:

    elif _is_unicode(v):
        # is there a reasonable way to convert to ASCII?
        # encode generates a string, but "replace" or "ignore"
        # lose information and "strict" can raise UnicodeError
        v = quote_plus(v.encode("ASCII","replace"))
        l.append(k + '=' + v)

so I think that it is *not* liking latin-1.
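
The effect of that "replace" call can be seen in isolation. This is a small sketch assuming Python 3's `urllib.parse.quote_plus`; it only mirrors the encoding step from the lines quoted above, not the whole of urlencode().

```python
from urllib.parse import quote_plus

value = "abc\xf6def"

# Encoding to ASCII with "replace" swaps the unencodable character for "?",
# so the original character cannot be recovered afterwards.
lossy = value.encode("ascii", "replace")
print(lossy)  # b'abc?def'

# The "?" is then percent-encoded like any other reserved character.
quoted = quote_plus(lossy)
print(quoted)  # abc%3Fdef
```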

Thank you,
Jim
 

Martin v. Löwis

Jim said:
My understanding is that I am supposed to be able to urlencode anything
up to the top half of latin-1 -- decimal 128-255.

I believe your understanding is incorrect. Without being able to quote
RFCs precisely, I think your understanding should be this:

- the URL literal syntax only allows for ASCII characters
- bytes with no meaning in ASCII can be quoted through %hh in URLs
- the precise meaning of such bytes in the URL is defined in the
URL scheme, and may vary from URL scheme to URL scheme
- the http scheme does not specify any interpretation of the bytes,
but apparently assumes that they denote characters, and follow
some encoding - which encoding is something that the web server
defines, when mapping URLs to resources.
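
The points above can be illustrated with a short sketch, assuming Python 3's `urllib.parse`: the escape `%F6` only names a byte, and which character it stands for depends on the encoding the reader applies.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "abc%F6def"

# %hh denotes a raw byte; nothing in the URL itself names an encoding.
raw = unquote_to_bytes(escaped)

# Read as Latin-1, the byte 0xF6 is the character o-umlaut.
as_latin1 = unquote(escaped, encoding="latin-1")

# Read as UTF-8, the lone byte 0xF6 is invalid and gets replaced by U+FFFD.
as_utf8 = unquote(escaped, encoding="utf-8", errors="replace")

print(repr(raw), repr(as_latin1), repr(as_utf8))
```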

If you get the impression that this is underspecified: your impression
is correct; it is underspecified indeed.

There is a recent attempt to tighten the specification through IRIs.
The IRI RFC defines a mapping between IRIs and URIs, and it uses
UTF-8 as the encoding, not latin-1.
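
As a rough sketch of the UTF-8 step only (not the full IRI-to-URI mapping), and assuming Python 3's `urllib.parse`, the same character comes out as two escaped bytes instead of one:

```python
from urllib.parse import quote, urlencode

# Percent-encode the UTF-8 bytes of the character (0xC3 0xB6 for o-umlaut).
single = quote("\xf6", encoding="utf-8")
print(single)  # %C3%B6

# urlencode() accepts the same encoding argument for text values.
query = urlencode({"x": "abc\xf6def"}, encoding="utf-8")
print(query)  # x=abc%C3%B6def
```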

Regards,
Martin
 
