URLs and ampersands

  • Thread starter Steven D'Aprano

Gabriel Genellina

On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano wrote:
I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
snag with URLs containing ampersands:

http://www.example.com/parrot.php?x=1&y=2

Somewhere in the process, urls like the above are escaped to:

http://www.example.com/parrot.php?x=1&amp;y=2

which naturally fails to exist.

I could just do a string replace, but is there a "right" way to escape
and unescape URLs? I've looked through the standard lib, but I can't find
anything helpful.

This works fine for me:

py> import urllib
py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
py> open(fn,"rb").read()
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...

So it's not urlretrieve escaping the url, but something else in your
code...
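
A quick way to see why the escaped form of the URL fails is to look at the query string the server would receive. This is only a Python 3 sketch (the session above is Python 2), and `query_keys` is a hypothetical helper written for illustration, not a standard library function:

```python
from urllib.parse import urlsplit


def query_keys(url):
    """Return the parameter names found in the query string of url."""
    query = urlsplit(url).query
    # Split on the literal '&' separator, as a server typically would.
    return [pair.split("=", 1)[0] for pair in query.split("&") if pair]


plain = "http://www.example.com/parrot.php?x=1&y=2"
escaped = "http://www.example.com/parrot.php?x=1&amp;y=2"

print(query_keys(plain))    # parameter names in the intended URL
print(query_keys(escaped))  # the second name now carries a stray 'amp;' prefix
```

With the HTML-escaped URL the second parameter is no longer named `y`, so the request is for a different resource.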
 

Steven D'Aprano

I didn't say urlretrieve was escaping the URL. I actually think the
URLs are pre-escaped when I scrape them from an HTML file. I have searched
for, but been unable to find, standard library functions that escape or
unescape URLs. Are there any such functions?
 

Richard Brodie

Steven D'Aprano said:
I could just do a string replace, but is there a "right" way to escape
and unescape URLs?

The right way is to parse your HTML with an HTML parser. URLs are not
exempt from the normal HTML escaping rules, although there are an awful lot
of pages that get this wrong.

You didn't post any code, so it's hard to tell, but maybe something like
ElementTree or lxml would be a better tool than the ones you are currently using.
 

Gabriel Genellina

Steven D'Aprano said:
I didn't say urlretrieve was escaping the URL. I actually think the
URLs are pre-escaped when I scrape them from an HTML file.

(Ok, you didn't even mention you were scraping HTML pages...)

Steven D'Aprano said:
I have searched for, but been unable to find, standard library functions
that escape or unescape URLs. Are there any such functions?

Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
How are you scraping the HTML source? Both BeautifulSoup and
ElementTree.HTMLTreeBuilder already do that work for you.
 

Matthew Woodcraft

Steven said:
I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
snag with URLs containing ampersands:

http://www.example.com/parrot.php?x=1&y=2

Somewhere in the process, urls like the above are escaped to:

http://www.example.com/parrot.php?x=1&amp;y=2

which naturally fails to exist.

I could just do a string replace, but is there a "right" way to escape
and unescape URLs? I've looked through the standard lib, but I can't find
anything helpful.


I don't believe there is a concept of 'escaping a URL' as such. How you
escape or unescape a URL depends on what context you're embedding it in
or extracting it from.

In this case, it looks like you have URLs which have been escaped to go
into an html CDATA attribute value (such as <a href="...">).

I believe there is no documented function in the Python standard library
which reverses this escaping (short of putting your string into a
larger document and parsing that with a full html or xml parser).

-M-
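
Later Python versions did gain a documented function for this: `html.unescape`, available since Python 3.4, which postdates this thread. A minimal sketch of using it on a URL copied out of an attribute value, assuming Python 3 (the name `unescape_attribute_url` is made up for this example):

```python
import html


def unescape_attribute_url(raw_value):
    """Reverse HTML escaping on a URL taken verbatim from an attribute value."""
    # html.unescape handles named references (&amp;) as well as
    # numeric ones (&#38; and &#x26;).
    return html.unescape(raw_value)


escaped = "http://www.example.com/parrot.php?x=1&amp;y=2"
print(unescape_attribute_url(escaped))
```

This should only be applied to text known to be HTML-escaped. Running it over a URL that is already plain can corrupt query parameters whose names happen to look like entity names.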
 

Matthew Woodcraft

Gabriel said:
Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.

I don't see a cgi.unescape in the standard library.

I don't think xml.sax.saxutils.unescape will be suitable for Steven's
purpose, because it doesn't process numeric character references (which
are both legal and seen in the wild in /href/ attributes).

-M-
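
The point about numeric character references can be checked directly. The sketch below assumes Python 3 and compares `xml.sax.saxutils.unescape` with `html.unescape` (the latter was added in Python 3.4, after this thread); the sample strings are invented for illustration:

```python
import html
from xml.sax.saxutils import unescape as sax_unescape

named = "parrot.php?x=1&amp;y=2"    # named entity reference
numeric = "parrot.php?x=1&#38;y=2"  # numeric character reference for '&'

# saxutils.unescape only replaces &amp;, &lt; and &gt; by default.
print(sax_unescape(named))
print(sax_unescape(numeric))   # left as it was

# html.unescape also resolves numeric character references.
print(html.unescape(numeric))
```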
 

Steven D'Aprano

Whenever you put a URL into an HTML file you need to escape it, so
naturally you will also need to unescape it when it is retrieved from
the file. However, whatever you use to parse the HTML ought to be
unescaping text and attributes as part of the parsing process, so you
shouldn't need a separate function for this.
....

Even Python's builtin HTMLParser class will do this for you. What parser
are you using?

A regex.

I know, I know, now I have two problems :)

It's a quick and dirty hack, not a production piece of code, and I have a
quick and dirty fix by just using url.replace('&amp;', '&').

Thanks to everybody who replied. I guess I really have to bite the bullet
and learn how to use a proper HTML parser.
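
For anyone taking the same route, here is a minimal sketch of pulling links out with the standard library parser instead of a regex. It assumes Python 3, where the class lives in `html.parser`; `LinkCollector` and `extract_links` are names invented for this example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # replaced character and entity references in each value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


page = '<p><a href="http://www.example.com/parrot.php?x=1&amp;y=2">parrot</a></p>'
print(extract_links(page))
```

Because the parser unescapes attribute values itself, no separate replace('&amp;', '&') step is needed.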
 

Paul Rubin

Steven D'Aprano said:
I could just do a string replace, but is there a "right" way to escape
and unescape URLs? I've looked through the standard lib, but I can't find
anything helpful.

xml.sax.saxutils.unescape()
 
