URLs and ampersands

Discussion in 'Python' started by Steven D'Aprano, Aug 5, 2008.

  1. I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
    snag with URLs containing ampersands:

    http://www.example.com/parrot.php?x=1&y=2

    Somewhere in the process, urls like the above are escaped to:

    http://www.example.com/parrot.php?x=1&amp;y=2

    which naturally fails to exist.

    I could just do a string replace, but is there a "right" way to escape
    and unescape URLs? I've looked through the standard lib, but I can't find
    anything helpful.


    --
    Steven
     
    Steven D'Aprano, Aug 5, 2008
    #1

  2. On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
    <> wrote:

    > I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
    > snag with URLs containing ampersands:
    >
    > http://www.example.com/parrot.php?x=1&y=2
    >
    > Somewhere in the process, urls like the above are escaped to:
    >
    > http://www.example.com/parrot.php?x=1&amp;y=2
    >
    > which naturally fails to exist.
    >
    > I could just do a string replace, but is there a "right" way to escape
    > and unescape URLs? I've looked through the standard lib, but I can't find
    > anything helpful.


    This works fine for me:

    py> import urllib
    py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
    py> open(fn,"rb").read()
    '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...

    So it's not urlretrieve escaping the url, but something else in your
    code...

    --
    Gabriel Genellina
     
    Gabriel Genellina, Aug 5, 2008
    #2

  3. On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:

    > On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
    > <> wrote:
    >
    >> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
    >> snag with URLs containing ampersands:
    >>
    >> http://www.example.com/parrot.php?x=1&y=2
    >>
    >> Somewhere in the process, urls like the above are escaped to:
    >>
    >> http://www.example.com/parrot.php?x=1&amp;y=2
    >>
    >> which naturally fails to exist.
    >>
    >> I could just do a string replace, but is there a "right" way to escape
    >> and unescape URLs? I've looked through the standard lib, but I can't
    >> find anything helpful.

    >
    > This works fine for me:
    >
    > py> import urllib
    > py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
    > py> open(fn,"rb").read()
    > '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
    >
    > So it's not urlretrieve escaping the url, but something else in your
    > code...


    I didn't say that urlretrieve was escaping the URL. I actually think the
    URLs are pre-escaped when I scrape them from an HTML file. I have searched
    for, but been unable to find, standard library functions that escape or
    unescape URLs. Are there any such functions?



    --
    Steven
     
    Steven D'Aprano, Aug 5, 2008
    #3
  4. "Steven D'Aprano" <> wrote in message
    news:00a78f7e$0$20302$...

    > I could just do a string replace, but is there a "right" way to escape
    > and unescape URLs?


    The right way is to parse your HTML with an HTML parser. URLs are not
    exempt from the normal HTML escaping rules, although there are an awful lot
    of pages that get this wrong.

    You didn't post any code, so it's hard to tell, but maybe something like
    ElementTree or lxml would be a better tool than the ones you are currently using.
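
    A minimal sketch of that idea, assuming a well-formed fragment (real-world
    pages usually need a more forgiving parser such as lxml.html or
    BeautifulSoup); the parser hands back the href with the &amp; already
    decoded:

    py> from xml.etree import ElementTree
    py> doc = ElementTree.fromstring('<a href="http://www.example.com/parrot.php?x=1&amp;y=2">parrot</a>')
    py> doc.get('href')
    'http://www.example.com/parrot.php?x=1&y=2'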
     
    Richard Brodie, Aug 5, 2008
    #4
  5. On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <> wrote:

    > On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
    >
    >> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
    >> <> wrote:
    >>
    >>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
    >>> snag with URLs containing ampersands:
    >>>
    >>> http://www.example.com/parrot.php?x=1&y=2
    >>>
    >>> Somewhere in the process, urls like the above are escaped to:
    >>>
    >>> http://www.example.com/parrot.php?x=1&amp;y=2
    >>>
    >>> which naturally fails to exist.
    >>>
    >>> I could just do a string replace, but is there a "right" way to escape
    >>> and unescape URLs? I've looked through the standard lib, but I can't
    >>> find anything helpful.

    >>
    >> This works fine for me:
    >>
    >> py> import urllib
    >> py> fn = urllib.urlretrieve("http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
    >> py> open(fn,"rb").read()
    >> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
    >>
    >> So it's not urlretrieve escaping the url, but something else in your
    >> code...

    >
    > I didn't say that urlretrieve was escaping the URL. I actually think the
    > URLs are pre-escaped when I scrape them from an HTML file.


    (OK, you didn't even mention you were scraping HTML pages...)

    > I have searched
    > for, but been unable to find, standard library functions that escapes or
    > unescapes URLs. Are there any such functions?


    Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.
    How are you scraping the HTML source? Both BeautifulSoup and ElementTree.HTMLTreeBuilder already do that work for you.
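
    For example, applied to the URL from the original post (by default
    unescape only knows &amp;, &lt; and &gt;; anything else has to be passed
    in via its entities argument):

    py> from xml.sax.saxutils import unescape
    py> unescape('http://www.example.com/parrot.php?x=1&amp;y=2')
    'http://www.example.com/parrot.php?x=1&y=2'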

    --
    Gabriel Genellina
     
    Gabriel Genellina, Aug 5, 2008
    #5
  6. Steven D'Aprano wrote:
    > I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
    > snag with URLs containing ampersands:
    >
    > http://www.example.com/parrot.php?x=1&y=2
    >
    > Somewhere in the process, urls like the above are escaped to:
    >
    > http://www.example.com/parrot.php?x=1&amp;y=2
    >
    > which naturally fails to exist.
    >
    > I could just do a string replace, but is there a "right" way to escape
    > and unescape URLs? I've looked through the standard lib, but I can't find
    > anything helpful.



    I don't believe there is a concept of 'escaping a URL' as such. How you
    escape or unescape a URL depends on what context you're embedding it in
    or extracting it from.

    In this case, it looks like you have URLs which have been escaped to go
    into an html CDATA attribute value (such as <a href="...">).

    I believe there is no documented function in the Python standard library
    which reverses this escaping (short of putting your string into a
    larger document and parsing that with a full html or xml parser).

    -M-
     
    Matthew Woodcraft, Aug 5, 2008
    #6
  7. Gabriel Genellina wrote:
    > Steven D'Aprano wrote:


    >> I have searched for, but been unable to find, standard library
    >> functions that escapes or unescapes URLs. Are there any such
    >> functions?


    > Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.


    I don't see a cgi.unescape in the standard library.

    I don't think xml.sax.saxutils.unescape will be suitable for Steven's
    purpose, because it doesn't process numeric character references (which
    are both legal and seen in the wild in /href/ attributes).
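
    A small helper can cover both cases: decode the numeric references first,
    then hand the named entities to saxutils.unescape. This is only a sketch
    (the name decode_charrefs is made up here):

    import re
    from xml.sax.saxutils import unescape

    def decode_charrefs(s):
        # Decode decimal (&#38;) and hexadecimal (&#x26;) character references,
        # then let saxutils.unescape deal with named entities such as &amp;.
        def repl(match):
            ref = match.group(1)
            if ref[0] in 'xX':
                return unichr(int(ref[1:], 16))
            return unichr(int(ref))
        return unescape(re.sub(r'&#(\d+|[xX][0-9a-fA-F]+);', repl, s))

    print decode_charrefs('http://www.example.com/parrot.php?x=1&#38;y=2')
    # http://www.example.com/parrot.php?x=1&y=2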

    -M-
     
    Matthew Woodcraft, Aug 5, 2008
    #7
  8. On Tue, 05 Aug 2008 12:07:39 +0000, Duncan Booth wrote:

    > Whenever you put a URL into an HTML file you need to escape it, so
    > naturally you will also need to unescape it when it is retrieved from
    > the file. However, whatever you use to parse the HTML ought to be
    > unescaping text and attributes as part of the parsing process, so you
    > shouldn't need a separate function for this.


    [...]

    > Even Python's builtin HTMLParser class will do this for you. What parser
    > are you using?


    A regex.

    I know, I know, now I have two problems :)

    It's a quick and dirty hack, not a production piece of code, and I have a
    quick and dirty fix by just using url.replace('&amp;', '&').

    Thanks to everybody who replied. I guess I really have to bite the bullet
    and learn how to use a proper HTML parser.
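
    A minimal sketch of that route, using the stdlib HTMLParser class
    mentioned above (Python 2 spelling; LinkCollector is just an illustrative
    name). The parser hands handle_starttag its attribute values with the
    common entities, &amp; included, already decoded:

    from HTMLParser import HTMLParser

    class LinkCollector(HTMLParser):
        # Collect href attributes from <a> tags.
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links.append(value)

    parser = LinkCollector()
    parser.feed('<a href="http://www.example.com/parrot.php?x=1&amp;y=2">parrot</a>')
    print parser.links
    # ['http://www.example.com/parrot.php?x=1&y=2']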



    --
    Steven
     
    Steven D'Aprano, Aug 6, 2008
    #8
  9. Steven D'Aprano <> writes:
    > I could just do a string replace, but is there a "right" way to escape
    > and unescape URLs? I've looked through the standard lib, but I can't find
    > anything helpful.


    xml.sax.saxutils.unescape()
     
    Paul Rubin, Aug 6, 2008
    #9
