setting Referer for urllib.urlretrieve

Discussion in 'Python' started by samwyse, Aug 9, 2009.

  1. samwyse

    samwyse Guest

    Here's what I have so far:

    import urllib

    class AppURLopener(urllib.FancyURLopener):
        version = "App/1.7"
        referrer = None
        def __init__(self, *args):
            urllib.FancyURLopener.__init__(self, *args)
            if self.referrer:
                addheader('Referer', self.referrer)

    urllib._urlopener = AppURLopener()

    Unfortunately, the 'Referer' header potentially varies for each url
    that I retrieve, and the way the module is written, I can't change the
    calls to __init__ or open. The best idea I've had is to assign a new
    value to my class variable just before calling urllib.urlretrieve(),
    but that just seems ugly. Any ideas? Thanks.

    PS for anyone not familiar with the RFCs: Yes, I'm spelling
    "referrer" correctly everywhere in my code.
     
    samwyse, Aug 9, 2009
    #1

  2. Steven D'Aprano

    On Sun, 09 Aug 2009 06:13:38 -0700, samwyse wrote:

    > Here's what I have so far:
    >
    > import urllib
    >
    > class AppURLopener(urllib.FancyURLopener):
    >     version = "App/1.7"
    >     referrer = None
    >     def __init__(self, *args):
    >         urllib.FancyURLopener.__init__(self, *args)
    >         if self.referrer:
    >             addheader('Referer', self.referrer)
    >
    > urllib._urlopener = AppURLopener()
    >
    > Unfortunately, the 'Referer' header potentially varies for each url that
    > I retrieve, and the way the module is written, I can't change the calls
    > to __init__ or open. The best idea I've had is to assign a new value to
    > my class variable just before calling urllib.urlretrieve(), but that
    > just seems ugly. Any ideas? Thanks.


    [Aside: an int variable is an int. A str variable is a str. A list
    variable is a list. A class variable is a class. You probably mean a
    class attribute, not a variable. If other languages want to call it a
    variable, or a sausage, that's their problem.]

    If you're prepared for a bit of extra work, you could take over all the
    URL handling instead of relying on automatic openers. This will give you
    much finer control, but it will also require more effort on your part.
    The basic idea is that, instead of installing an opener and then asking
    the urllib module to handle the connection, you handle the connection
    yourself:

    make a Request object using urllib2.Request
    make an Opener object using urllib2.build_opener
    call opener.open(request) to connect to the server
    deal with the connection (retry, fail or read)

    Essentially, you use the Request object instead of a URL, and you would
    add the appropriate referer header to the Request object.
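
    A minimal sketch of the steps above, under a few assumptions:
    make_request and retrieve are hypothetical helper names (they are not
    part of urllib2), the retry/fail handling is left out, and the
    try/except import is only there so the same code also runs on Python 3,
    where this API lives in urllib.request.

```python
try:
    import urllib2                      # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2    # Python 3 home of the same API

import shutil


def make_request(url, referer=None):
    """Build a Request for url, adding a Referer header if one is given."""
    request = urllib2.Request(url)
    if referer:
        request.add_header('Referer', referer)
    return request


def retrieve(url, filename, referer=None, opener=None):
    """Hypothetical stand-in for urlretrieve with a per-call Referer."""
    opener = opener or urllib2.build_opener()
    response = opener.open(make_request(url, referer))
    try:
        # Stream the response body to disk without loading it all at once.
        with open(filename, 'wb') as out:
            shutil.copyfileobj(response, out)
    finally:
        response.close()
    return filename
```

    Because the referrer is an argument of each call, nothing has to be
    reassigned on a shared opener between retrievals, e.g.
    retrieve("http://example.com/a.html", "a.html", referer="http://example.com/").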

    Another approach, perhaps a more minimal change than the above, would be
    something like this:

    # untested
    class AppURLopener(urllib.FancyURLopener):
        version = "App/1.7"
        def __init__(self, *args):
            urllib.FancyURLopener.__init__(self, *args)
        def add_referrer(self, url=None):
            if url:
                addheader('Referer', url)

    urllib._urlopener = AppURLopener()
    urllib._urlopener.add_referrer("http://example.com/")



    --
    Steven
     
    Steven D'Aprano, Aug 9, 2009
    #2

  3. samwyse

    samwyse Guest

    Thanks for the ideas. I'd briefly considered something similar to
    your first idea, implementing my own version of urlretrieve to accept
    a Request object, but it does seem like a good bit of work. Maybe
    over Labor Day. :)

    The second idea is pretty much what I'm going to go with for now. The
    program that I'm writing is almost a clone of wget, but it fixes some
    personal dislikes with the way recursive retrievals are done. (Or
    maybe I just don't understand wget's full array of options well
    enough.) This means that my referrer changes as I bounce up and down
    the hierarchy, which makes this less convenient. Still, it does seem
    more convenient than rewriting the module from scratch.
     
    samwyse, Aug 10, 2009
    #3
  4. E

    E Guest

    Just wanted to add a note. I used the sample code posted above and got
    this error (a runtime NameError, not a syntax error):
        NameError: global name 'addheader' is not defined
    The fix is to call addheader as a method on self. Change the line that
    references addheader to:
        self.addheader('Referer', url)
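
    One caveat when the referrer changes for each URL: as far as I can
    tell, addheader() simply appends to the opener's addheaders list, so
    calling it before every retrieval would pile up Referer entries. Here is
    a minimal sketch of a helper that replaces the entry instead. The name
    set_referer is hypothetical; it only assumes the opener exposes an
    addheaders list of (name, value) tuples, as urllib's URLopener and
    urllib2's OpenerDirector do.

```python
def set_referer(opener, url=None):
    """Replace any Referer entry in opener.addheaders with url.

    Passing url=None (or an empty string) just removes the old entry.
    """
    # Drop every existing Referer entry, keeping the other headers in order.
    headers = [(name, value) for (name, value) in opener.addheaders
               if name.lower() != 'referer']
    if url:
        headers.append(('Referer', url))
    opener.addheaders = headers


# Intended use with the Python 2 code from this thread (not run here):
#   set_referer(urllib._urlopener, "http://example.com/")
#   urllib.urlretrieve("http://example.com/a.html", "a.html")
```

    Calling it before each urlretrieve() keeps exactly one Referer header
    on the shared opener, whichever page the crawl is currently on.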

    ~E
     
    E, Sep 5, 2009
    #4