urllib.quote fails on Unicode URL

John Nagle

The code in urllib.quote fails on Unicode input, when
called by robotparser.

That bit of code needs some attention.
- It still assumes ASCII goes up to 255, which hasn't been true in Python
for a while now.
- The initialization may not be thread-safe; a table is being initialized
on first use. The code is too clever and uncommented.
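
To make the two points above concrete, here is a minimal sketch of a commented quoting table whose first-use initialization is guarded by a lock. This is not the code in urllib; the names `ALWAYS_SAFE`, `get_table` and `quote_bytes` are hypothetical, and the table deliberately covers byte values 0..255 rather than Unicode characters.

```python
import threading

# Characters that are never escaped (an assumption for this sketch).
ALWAYS_SAFE = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789_.-")

_table_cache = {}               # one table per distinct `safe` string
_table_lock = threading.Lock()  # guards the lazy initialization below


def _build_table(safe):
    """Map each byte value 0..255 to itself (if safe) or to '%XX'."""
    keep = set(ALWAYS_SAFE + safe)
    return [chr(i) if chr(i) in keep else "%%%02X" % i for i in range(256)]


def get_table(safe="/"):
    """Return the table for `safe`, building it at most once."""
    with _table_lock:
        table = _table_cache.get(safe)
        if table is None:
            table = _table_cache[safe] = _build_table(safe)
        return table


def quote_bytes(data, safe="/"):
    """Percent-quote a byte string; the caller must encode text first."""
    table = get_table(safe)
    return "".join(table[b] for b in bytearray(data))


print(quote_bytes(b"/a b\xe2"))
```

Taking the lock on every call is the simplest way to rule out a race on first use; building the table once at import time would avoid the lock entirely.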

"robotparser" was trying to check if a URL,
"http://www.highbeam.com/DynamicContent/”/mysaved/privacyPref.asp""
could be accessed, and there are some weird characters in there. Unicode
URLs are legal, so this is a real bug.

Logged in as Bug #1712522.

John Nagle
 
Peter Otten

John said:
The code in urllib.quote fails on Unicode input, when
called by robotparser.

That bit of code needs some attention.
- It still assumes ASCII goes up to 255, which hasn't been true in Python
for a while now.
- The initialization may not be thread-safe; a table is being initialized
on first use. The code is too clever and uncommented.

"robotparser" was trying to check if a URL,
"http://www.highbeam.com/DynamicContent/”/mysaved/privacyPref.asp""
could be accessed, and there are some weird characters in there. Unicode
URLs are legal, so this is a real bug.

Logged in as Bug #1712522.

There has been a related discussion:

http://groups.google.com/group/comp...read/thread/b331dc3625dbfc41/ce6e6a3c0635e340

IIRC the outcome was that, while UTF-8 is recommended,
urllib.quote()/unquote() should not guess the encoding.

What changes that would imply for robotparser I don't know...
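
If quote() is not going to guess, the caller has to pick the encoding. A minimal sketch of that, assuming UTF-8 is the right choice for the URL in question; the helper name `quote_unicode_path` is hypothetical, and the import fallback is only there so the snippet also runs on Python 3:

```python
try:
    from urllib import quote        # Python 2, as discussed in this thread
except ImportError:
    from urllib.parse import quote  # Python 3 equivalent


def quote_unicode_path(path, encoding="utf-8"):
    """Encode text to bytes explicitly, then percent-quote the bytes."""
    if not isinstance(path, bytes):
        path = path.encode(encoding)
    return quote(path)


# The path from the report, with the right double quotation mark (U+201D).
print(quote_unicode_path(u"/DynamicContent/\u201d/mysaved/privacyPref.asp"))
# prints /DynamicContent/%E2%80%9D/mysaved/privacyPref.asp
```

Whether robotparser should do this itself, or leave it to whoever hands it the URL, is still the open question.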

Peter
 
