URL 'special character' replacements

Discussion in 'Python' started by Claude Henchoz, Jan 9, 2006.

  1. Hi guys

    I have a huge list of URLs. These URLs all have ASCII codes for special
    characters, like "%20" for a space or "%21" for an exclamation mark.

    I've already googled quite some time, but I have not been able to find
    any elegant way on how to replace these with their 'real' counterparts
    (" " and "!").

    Of course, I could just replace(), but that seems to be a lot of work.

    Thanks for any help.

    Cheers, Claude
     
    Claude Henchoz, Jan 9, 2006
    #1

  2. [Claude]
    > I have a huge list of URLs. These URLs all have ASCII codes for special
    > characters, like "%20" for a space or "%21" for an exclamation mark.


    You need urllib.unquote:

    >>> import urllib
    >>> help(urllib.unquote)

    Help on function unquote in module urllib:

    unquote(s)
    unquote('abc%20def') -> 'abc def'.
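
The session above is Python 2. As a minimal sketch for readers on Python 3, where the same function is found in urllib.parse rather than directly in urllib (the variable name below is only illustrative):

```python
# Python 3: unquote is imported from urllib.parse instead of urllib.
from urllib.parse import unquote

# %20 is a space and %21 is an exclamation mark, as in the question.
decoded = unquote("abc%20def%21")
print(decoded)
```

Both escapes from the original question are replaced in a single call, so no chain of replace() calls is needed.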

    --
    Richie Hindle
     
    Richie Hindle, Jan 9, 2006
    #2

  3. Claude Henchoz wrote:

    > I have a huge list of URLs. These URLs all have ASCII codes for special
    > characters, like "%20" for a space or "%21" for an exclamation mark.
    >
    > I've already googled quite some time, but I have not been able to find
    > any elegant way on how to replace these with their 'real' counterparts
    > (" " and "!").
    >
    > Of course, I could just replace(), but that seems to be a lot of work.
    >


    urllib.unquote() or urllib.unquote_plus() as appropriate:

    unquote(string)

    Replace "%xx" escapes by their single-character equivalent.
    Example: unquote('/%7Econnolly/') yields '/~connolly/'.


    unquote_plus(string)

    Like unquote(), but also replaces plus signs by spaces, as required for
    unquoting HTML form values.
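
To make "as appropriate" concrete, here is a small sketch of how the two functions differ on the same input. It uses the Python 3 location of these functions (urllib.parse); the sample string and variable names are made up for illustration.

```python
from urllib.parse import unquote, unquote_plus

# A made-up value containing a literal '+', an escaped '+' (%2B)
# and an escaped space (%20).
form_value = "a+b%2Bc%20d"

# unquote() only decodes %xx escapes; a literal '+' is left alone.
plain = unquote(form_value)

# unquote_plus() additionally turns each literal '+' into a space,
# which is how HTML form data encodes spaces.
plus = unquote_plus(form_value)

print(repr(plain))
print(repr(plus))
```

For ordinary URL paths unquote() is usually the safer choice, since a literal plus sign in a path is just a plus sign; unquote_plus() fits query strings and form values.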
     
    Duncan Booth, Jan 9, 2006
    #3
  4. Claude Henchoz wrote:

    > I have a huge list of URLs. These URLs all have ASCII codes for special
    > characters, like "%20" for a space or "%21" for an exclamation mark.
    >
    > I've already googled quite some time, but I have not been able to find
    > any elegant way on how to replace these with their 'real' counterparts
    > (" " and "!").
    >
    > Of course, I could just replace(), but that seems to be a lot of work.


    >>> import urllib
    >>> urllib.unquote("http://docs.python.org/lib/module-urllib.html%20%21")

    'http://docs.python.org/lib/module-urllib.html !'
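
Since the question mentions a huge list of URLs, the same call can be applied to every entry with a list comprehension. This is only a sketch in Python 3 syntax (urllib.parse.unquote); the list contents and the names urls and decoded_urls are placeholders.

```python
from urllib.parse import unquote

# Placeholder data standing in for the real list of URLs.
urls = [
    "http://docs.python.org/lib/module-urllib.html%20%21",
    "http://example.com/%7Econnolly/",
]

# Decode every URL in one pass.
decoded_urls = [unquote(u) for u in urls]

for url in decoded_urls:
    print(url)
```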

    </F>
     
    Fredrik Lundh, Jan 9, 2006
    #4
  5. My outline for a solution would be:

    - Use StringIO or cStringIO for reading the original URLs character for
    character, and to build the result URLs character for character

    - When you read a '%', read the next 2 characters (they should be
    hexadecimal digits!) and create a new string from them
    - The numbers like '20' etc. are hexadecimal values, meaning integers
    with base 16.
    Get the actual int-value like this:
    code_int = int(code_str, 16)
    - Convert to character as: code_chr = chr(code_int)
    - Write this character to the output cStringIO buffer
    - When the whole URL is done, do getvalue() to get the string of the
    new URL and close the cStringIO buffer.
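
The outline above could be sketched roughly as follows. This is an illustration, not a replacement for the library function: it is written for Python 3, so io.StringIO stands in for cStringIO, and the name manual_unquote and the handling of malformed escapes (copied through unchanged) are assumptions of this sketch. It also maps each escape to a single character, so multi-byte UTF-8 sequences are not decoded correctly.

```python
import io
import string

HEX_DIGITS = set(string.hexdigits)


def manual_unquote(url):
    """Decode %xx escapes by hand, one character at a time."""
    src = io.StringIO(url)   # input buffer, read character by character
    out = io.StringIO()      # output buffer for the decoded URL
    while True:
        ch = src.read(1)
        if ch == "":
            break            # end of input
        if ch != "%":
            out.write(ch)    # ordinary character: copy as-is
            continue
        code_str = src.read(2)
        if len(code_str) == 2 and all(c in HEX_DIGITS for c in code_str):
            code_int = int(code_str, 16)   # hexadecimal string -> integer
            out.write(chr(code_int))       # integer -> character
        else:
            # Malformed or truncated escape: keep the text unchanged.
            out.write("%" + code_str)
    result = out.getvalue()
    out.close()
    return result


print(manual_unquote("abc%20def%21"))
```

In practice the library function remains preferable, since it already deals with the edge cases this sketch skips.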

    Is that sufficiently comprehensible? Or still too convoluted for you?

    (PS: I researched doing it the manual way, 'the hard way'. However,
    there are plenty of libraries in Python for all sorts of internet
    stuff. Perhaps urllib or urllib2 already has the functionality that you
    need -- didn't look it up)

    cheers,

    --Tim
     
    Tim N. van der Leeuw, Jan 9, 2006
    #5
  6. Claude Henchoz wrote:
    > Hi guys
    >
    > I have a huge list of URLs. These URLs all have ASCII codes for special
    > characters, like "%20" for a space or "%21" for an exclamation mark.
    >
    > I've already googled quite some time, but I have not been able to find
    > any elegant way on how to replace these with their 'real' counterparts
    > (" " and "!").
    >
    > Of course, I could just replace(), but that seems to be a lot of work.
    >
    > Thanks for any help.
    >
    > Cheers, Claude
    >


    The standard library module 'urllib' gives you two choices, depending on
    the exact behavior you'd like:

    http://www.python.org/doc/2.3.2/lib/module-urllib.html
    unquote(string)
    Replace "%xx" escapes by their single-character equivalent.

    Example: unquote('/%7Econnolly/') yields '/~connolly/'.

    unquote_plus(string)
    Like unquote(), but also replaces plus signs by spaces, as required
    for unquoting HTML form values.


    --
    // Today's Oblique Strategy (© Brian Eno/Peter Schmidt):
    // Accretion
    // Brett g Porter *
     
    Brett g Porter, Jan 9, 2006
    #6
  7. Thanks guys, I like the urllib solution. Stupid me, looked at urllib
    reference, but thought that "quote" and "unquote" deal with
    &nbsp;-style entities.
     
    Claude Henchoz, Jan 9, 2006
    #7
