URL 'special character' replacements

Claude Henchoz · Jan 9, 2006

Hi guys

I have a huge list of URLs. These URLs all have ASCII codes for special
characters, like "%20" for a space or "%21" for an exclamation mark.

I've already googled quite some time, but I have not been able to find
any elegant way on how to replace these with their 'real' counterparts
(" " and "!").

Of course, I could just replace(), but that seems to be a lot of work.

Thanks for any help.

Cheers, Claude

Richie Hindle · Jan 9, 2006

[Claude]

I have a huge list of URLs. These URLs all have ASCII codes for special
characters, like "%20" for a space or "%21" for an exclamation mark.

You need urllib.unquote:
Help on function unquote in module urllib:

unquote(s)
unquote('abc%20def') -> 'abc def'.
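
A minimal sketch of applying this to a whole list of URLs. The sample URLs are made up for illustration; the try/except import is only there because Python 3 moved unquote from urllib into urllib.parse.

```python
try:
    from urllib.parse import unquote  # Python 3 location
except ImportError:
    from urllib import unquote  # Python 2, as used in this thread

# Hypothetical sample data standing in for the "huge list of URLs".
urls = [
    "http://example.com/my%20file.txt",
    "http://example.com/hello%21",
]

# Decode every "%xx" escape in every URL in one pass.
decoded = [unquote(u) for u in urls]

for original, clean in zip(urls, decoded):
    print(original, "->", clean)
```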

Duncan Booth · Jan 9, 2006

Claude said:
I have a huge list of URLs. These URLs all have ASCII codes for special
characters, like "%20" for a space or "%21" for an exclamation mark.

I've already googled quite some time, but I have not been able to find
any elegant way on how to replace these with their 'real' counterparts
(" " and "!").

Of course, I could just replace(), but that seems to be a lot of work.

urllib.unquote() or urllib.unquote_plus() as appropriate:

unquote( string)

Replace "%xx" escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.

unquote_plus( string)

Like unquote(), but also replaces plus signs by spaces, as required for
unquoting HTML form values.
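
To make the difference between the two functions concrete, here is a small sketch. The sample string is invented for illustration, and the import fallback assumes you may be on Python 3, where both functions live in urllib.parse.

```python
try:
    from urllib.parse import unquote, unquote_plus  # Python 3 location
except ImportError:
    from urllib import unquote, unquote_plus  # Python 2, as used in this thread

# Hypothetical form-style value: a literal '+', a "%20" escape,
# and an escaped plus sign ("%2B").
value = "a+b%20c%2Bd"

# unquote() only touches the "%xx" escapes; the literal '+' survives.
plain = unquote(value)

# unquote_plus() additionally turns the literal '+' into a space,
# while the escaped "%2B" still decodes to a real '+'.
form = unquote_plus(value)

print(repr(plain))
print(repr(form))
```

For ordinary URL paths unquote() is usually the safer choice; unquote_plus() is meant for query strings and form data, where '+' stands for a space.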

Fredrik Lundh · Jan 9, 2006

Claude said:
I have a huge list of URLs. These URLs all have ASCII codes for special
characters, like "%20" for a space or "%21" for an exclamation mark.

I've already googled quite some time, but I have not been able to find
any elegant way on how to replace these with their 'real' counterparts
(" " and "!").

Of course, I could just replace(), but that seems to be a lot of work.

'http://docs.python.org/lib/module-urllib.html !'

</F>

Tim N. van der Leeuw · Jan 9, 2006

My outline for a solution would be:

- Use StringIO or cStringIO for reading the original URLs character by
character, and to build the result URLs character by character

- When you read a '%', read the next 2 characters (they should be
hexadecimal digits!) and create a new string with them
- The numbers like '20' etc. are hexadecimal values, meaning integers
with base 16.
Get the actual int-value like this:
code_int = int(code_str, 16)
- Convert to character as: code_chr = chr(code_int)
- Write this character to the output cStringIO buffer
- When the whole URL is done, do getvalue() to get the string of the
new URL and close the cStringIO buffer.
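
The outline above could be sketched roughly as follows. This is an illustration only, written for Python 3 with io.StringIO in place of cStringIO; the function name manual_unquote and the handling of malformed escapes are my own assumptions, not part of the outline.

```python
from io import StringIO

HEX_DIGITS = "0123456789abcdefABCDEF"


def manual_unquote(url):
    """Decode "%xx" escapes by hand, one character at a time."""
    src = StringIO(url)   # input buffer, read character by character
    out = StringIO()      # output buffer for the decoded URL
    while True:
        ch = src.read(1)
        if ch == "":      # end of input
            break
        if ch != "%":
            out.write(ch)
            continue
        pos = src.tell()              # remember where the escape digits start
        code_str = src.read(2)
        if len(code_str) == 2 and all(c in HEX_DIGITS for c in code_str):
            code_int = int(code_str, 16)   # hexadecimal, i.e. base 16
            out.write(chr(code_int))
        else:
            # Assumption: a '%' not followed by two hex digits is kept as-is,
            # and reading resumes right after it.
            out.write("%")
            src.seek(pos)
    result = out.getvalue()
    src.close()
    out.close()
    return result


print(manual_unquote("abc%20def"))
print(manual_unquote("/%7Econnolly/"))
```

Note that this decodes each escape to a single character on its own, so multi-byte UTF-8 sequences such as "%C3%A9" will not come out the way urllib's unquote in Python 3 decodes them. For real work the library function is the better choice.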

Is that sufficiently comprehensible? Or still too convoluted for you?

(PS: I researched doing it the manual way, 'the hard way'. However,
there are plenty of libraries in Python for all sorts of internet
stuff. Perhaps urllib or urllib2 already has the functionality that you
need -- didn't look it up)

cheers,

--Tim

Brett g Porter · Jan 9, 2006

Claude said:
Hi guys

I have a huge list of URLs. These URLs all have ASCII codes for special
characters, like "%20" for a space or "%21" for an exclamation mark.

I've already googled quite some time, but I have not been able to find
any elegant way on how to replace these with their 'real' counterparts
(" " and "!").

Of course, I could just replace(), but that seems to be a lot of work.

Thanks for any help.

Cheers, Claude

The standard library module 'urllib' gives you two choices, depending on
the exact behavior you'd like:

http://www.python.org/doc/2.3.2/lib/module-urllib.html
unquote(string)
Replace "%xx" escapes by their single-character equivalent.

Example: unquote('/%7Econnolly/') yields '/~connolly/'.

unquote_plus(string)
Like unquote(), but also replaces plus signs by spaces, as required
for unquoting HTML form values.

Claude Henchoz · Jan 9, 2006

Thanks guys, I like the urllib solution. Stupid me: I looked at the
urllib reference, but thought that "quote" and "unquote" dealt with
&nbsp;-style entities.
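
For anyone with the same confusion, a short sketch of the two kinds of decoding side by side. It assumes Python 3.4 or later, where the standard library has html.unescape for HTML entities; the sample strings are made up.

```python
from html import unescape          # HTML entities such as "&amp;"
from urllib.parse import unquote   # URL percent-escapes such as "%20"

url_text = "fish%20and%20chips"
html_text = "fish &amp; chips"

# Each function only understands its own kind of escape.
url_decoded = unquote(url_text)
html_decoded = unescape(html_text)

# Applying the "wrong" function leaves the text unchanged.
url_untouched = unescape(url_text)
html_untouched = unquote(html_text)

print(url_decoded)
print(html_decoded)
```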
