urllib.quote fails on Unicode URL

Discussion in 'Python' started by John Nagle, May 4, 2007.

  1. John Nagle

    John Nagle Guest

    The code in urllib.quote fails on Unicode input, when
    called by robotparser.

    That bit of code needs some attention.
    - It still assumes ASCII goes up to 255, which hasn't been true in Python
    for a while now.
    - The initialization may not be thread-safe; a table is being initialized
    on first use. The code is too clever and uncommented.

    "robotparser" was trying to check if a URL,
    "http://www.highbeam.com/DynamicContent/”/mysaved/privacyPref.asp""
    could be accessed, and there are some wierd characters in there. Unicode
    URLs are legal, so this is a real bug.

    Logged in as Bug #1712522.

    John Nagle
     
    John Nagle, May 4, 2007
    #1
    1. Advertisements

  2. John Nagle

    Peter Otten Guest

    There has been a related discussion:

    http://groups.google.com/group/comp...read/thread/b331dc3625dbfc41/ce6e6a3c0635e340

    IIRC the outcome was that while UTF-8 is recommended
    urllib.quote()/unquote() should not guess the encoding.

    What changes that would imply for robotparser I don't know...

    Peter
     
    Peter Otten, May 4, 2007
    #2
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.