Unescaping URLs in Python


John Nagle

Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

http://www.htmlhelp.com/tools/validator/problems.html#amp

What's the appropriate Python function to call to unescape a URL which might
contain things like that? Will this interfere with the usual "%" type escapes
in URLs?

What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.
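
For readers on a current Python, here is a minimal sketch of the HTML-escaped to URL-escaped conversion described above. It assumes Python 3.4 or later, where html.unescape() is part of the standard library. The helper name html_attr_to_url and the set of characters left unquoted are illustrative assumptions, not an established API.

```python
from html import unescape
from urllib.parse import quote

# Characters left untouched: URL delimiters plus "%", so that percent
# escapes already present in the link are not encoded a second time.
URL_SAFE = "%/:?#[]@!$&'()*+,;="

def html_attr_to_url(attr_value):
    """Hypothetical helper: HTML-escaped attribute text -> percent-escaped URL."""
    decoded = unescape(attr_value)        # "&amp;" becomes "&"
    return quote(decoded, safe=URL_SAFE)  # unsafe characters become %XX escapes

print(html_attr_to_url("/adsk/servlet/index?siteID=123112&amp;id=1860142"))
print(html_attr_to_url("/caf&eacute;/menu?item=a%20b&amp;size=large"))
```

Because "%" is treated as safe, a stray literal "%" that is not part of an escape passes through unchanged; whether that is acceptable depends on the pages being processed.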

John Nagle
 

Lawrence D'Oliveiro

John Nagle said:
Here's a URL from a link on the home page of a major company.

<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>

What's the appropriate Python function to call to unescape a URL
which might contain things like that?

Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

Will this interfere with the usual "%" type escapes in URLs?

No. Just think of it as an HTML attribute value; the fact that it's a URL is
a question of later interpretation, nothing to do with the fact that it
comes from an HTML attribute.
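
A minimal sketch of that approach, assuming Python 3, where the parser lives in html.parser (it was the HTMLParser module in Python 2). The class name LinkCollector is made up for illustration. The parser replaces entity references in attribute values before handing them to handle_starttag, and it leaves "%" escapes alone.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Illustrative subclass that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

parser = LinkCollector()
parser.feed('<a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>')
parser.close()
print(parser.hrefs)
```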
 

John Nagle

Lawrence said:
Just use any HTML-parsing library. I think the standard Python HTMLParser
will do the trick, provided there aren't any errors in the HTML.

I'm using BeautifulSoup, because I need to process real world
HTML. At least by default, it doesn't unescape URLs like that.

Nor, on the output side, does it escape standalone "&" characters,
as in text like "Sales & Advertising Department".
But there are various BeautifulSoup options; more on this later.

John Nagle
 

Jeffrey Froman

John said:
What's the appropriate Python function to call to unescape a URL which
might contain things like that?

xml.sax.saxutils.unescape()


Will this interfere with the usual "%"
type escapes in URLs?

Nope, and urllib.unquote() can be used to translate URL escapes manually.
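
A short sketch of both steps, assuming Python 3, where urllib.unquote() has moved to urllib.parse.unquote(). The sample href is made up for illustration. Note that saxutils.unescape() only handles &amp;, &lt; and &gt; unless extra entities are passed in.

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

# Hypothetical href text as it appears in the HTML source.
raw_href = "/search?q=a%20b&amp;lang=en"

# Step 1: undo the HTML escaping; the "%20" is left as it is.
url = unescape(raw_href)
print(url)

# Step 2 (optional): translate the URL escapes for display.
print(unquote(url))
```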



Jeffrey
 
