Unescaping URLs in Python

John Nagle · Dec 25, 2006

Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

http://www.htmlhelp.com/tools/validator/problems.html#amp

What's the appropriate Python function to call to unescape a URL which might
contain things like that? Will this interfere with the usual "%" type escapes
in URLs?

What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.
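
A minimal sketch of that "HTML escaped form to URL escaped form" step, using only the modern Python 3 standard library (html.unescape and urllib.parse.quote, neither of which existed under those names in 2006). The helper name html_attr_to_url and the SAFE_CHARS set are assumptions made for illustration, not an established API, and the choice of safe characters would need tuning for real-world URLs.

```python
import html
from urllib.parse import quote

# Characters left alone when re-quoting (an illustrative choice).
# "%" is included so existing URL escapes such as %20 are not double-encoded.
SAFE_CHARS = "/?&=%:#;+,@!$'()*~"

def html_attr_to_url(raw_attr):
    """Hypothetical helper: HTML-escaped attribute text -> URL-escaped form."""
    text = html.unescape(raw_attr)        # "&amp;" -> "&", "&eacute;" -> "é"
    return quote(text, safe=SAFE_CHARS)   # non-ASCII -> UTF-8 percent escapes

print(html_attr_to_url("/adsk/servlet/index?siteID=123112&amp;id=1860142"))
print(html_attr_to_url("/caf&eacute;?q=a%20b&amp;x=1"))
```

The result never passes through a fully unescaped form that is handed to the caller: HTML entities are decoded, and anything that is not URL-safe is immediately percent-encoded again.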

John Nagle

Lawrence D'Oliveiro · Dec 25, 2006

John Nagle said:
Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

What's the appropriate Python function to call to unescape a URL
which might contain things like that?

Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

Will this interfere with the usual "%" type escapes in URLs?

No. Just think of it as an HTML attribute value; the fact that it's a URL is
a question of later interpretation, nothing to do with the fact that it
comes from an HTML attribute.
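
As a rough illustration of this point, here is a sketch using the Python 3 descendant of that module, html.parser.HTMLParser. The class name HrefCollector and the function extract_hrefs are made up for this example. The parser hands attribute values to handle_starttag with entities already decoded, while "%" escapes pass through untouched.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects href values from <a> tags (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive with
        # HTML entities such as "&amp;" already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(markup):
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs

page = '<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>'
print(extract_hrefs(page))
```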

John Nagle · Dec 25, 2006

Lawrence said:
Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

I'm using BeautifulSoup, because I need to process real-world
HTML. At least by default, it doesn't unescape URLs like that.

Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.

John Nagle

Jeffrey Froman · Dec 25, 2006

John said:
What's the appropriate Python function to call to unescape a URL which
might contain things like that?

xml.sax.saxutils.unescape()

Will this interfere with the usual "%"
type escapes in URLs?

Nope, and urllib.unquote() can be used to translate URL escapes manually.
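
A short sketch of the two steps side by side, assuming Python 3, where urllib.unquote() lives at urllib.parse.unquote(). The sample href is made up for illustration. Note that xml.sax.saxutils.unescape() only handles &amp;, &lt; and &gt; unless given an extra entity dictionary.

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

# Made-up sample containing both an HTML entity and a URL escape.
href = "/docs/annual%20report?siteID=123112&amp;id=1860142"

step1 = unescape(href)   # HTML level: "&amp;" -> "&"; "%20" is left alone
step2 = unquote(step1)   # URL level: "%20" -> " "

print(step1)
print(step2)
```

The second step is only appropriate for display or inspection; a URL that will be fetched should normally keep its "%" escapes.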

Jeffrey
