setting Referer for urllib.urlretrieve


samwyse

Here's what I have so far:

import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"
    referrer = None
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
        if self.referrer:
            addheader('Referer', self.referrer)

urllib._urlopener = AppURLopener()

Unfortunately, the 'Referer' header potentially varies for each url
that I retrieve, and the way the module is written, I can't change the
calls to __init__ or open. The best idea I've had is to assign a new
value to my class variable just before calling urllib.urlretrieve(),
but that just seems ugly. Any ideas? Thanks.

PS for anyone not familiar with the RFCs: Yes, I'm spelling
"referrer" correctly everywhere in my code.
 

Steven D'Aprano

[Aside: an int variable is an int. A str variable is a str. A list
variable is a list. A class variable is a class. You probably mean a
class attribute, not a variable. If other languages want to call it a
variable, or a sausage, that's their problem.]

If you're prepared for a bit of extra work, you could take over all the
URL handling instead of relying on automatic openers. This will give you
much finer control, but it will also require more effort on your part.
The basic idea is that, instead of installing an opener and then asking
the urllib module to handle the connection, you handle the connection
yourself:

1. make a Request object using urllib2.Request
2. make an Opener object using urllib2.build_opener
3. call opener.open(request) to connect to the server
4. deal with the connection (retry, fail or read)

Essentially, you pass the Request object instead of a URL, and you add
the appropriate Referer header to that Request object.
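
For what it's worth, the steps above can be sketched roughly as follows. This is only a sketch, and it assumes Python 3, where urllib2's Request and build_opener live in urllib.request. The names make_request and retrieve are hypothetical helpers, not part of any library, and the default user agent is simply the "App/1.7" string from the original post.

```python
import shutil
import urllib.request


def make_request(url, referer=None, user_agent="App/1.7"):
    """Build a Request that carries its own per-URL headers (hypothetical helper)."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


def retrieve(url, filename, referer=None, opener=None):
    """Rough stand-in for urlretrieve() that takes a Referer per call."""
    opener = opener or urllib.request.build_opener()
    request = make_request(url, referer)
    # Connect, then copy the response body to the target file.
    # Retry and error handling are left out of this sketch.
    with opener.open(request) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
    return filename
```

Because the Referer travels with each Request, nothing global has to be changed between downloads; a call such as retrieve(url, "page.html", referer=parent_url) would be enough, where url and parent_url are whatever the crawler is currently working with.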

Another approach, perhaps a more minimal change than the above, would be
something like this:

# untested
class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"
    def __init__(self, *args):
        urllib.FancyURLopener.__init__(self, *args)
    def add_referrer(self, url=None):
        if url:
            addheader('Referer', url)

urllib._urlopener = AppURLopener()
urllib._urlopener.add_referrer("http://example.com/")
 

samwyse

Thanks for the ideas. I'd briefly considered something similar to
your first idea, implementing my own version of urlretrieve to accept
a Request object, but it does seem like a good bit of work. Maybe
over Labor Day. :)

The second idea is pretty much what I'm going to go with for now. The
program that I'm writing is almost a clone of wget, but it fixes some
personal dislikes with the way recursive retrievals are done. (Or
maybe I just don't understand wget's full array of options well
enough.) This means that my referrer changes as I bounce up and down
the hierarchy, which makes this less convenient. Still, it does seem
more convenient than rewriting the module from scratch.
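
One wrinkle with a referrer that keeps changing: addheader() appends to the opener's addheaders list, so calling add_referrer() repeatedly would most likely pile up several Referer headers instead of replacing the previous one. Below is a minimal sketch of a replace-instead-of-append approach. It assumes Python 3's urllib.request, where the installed opener is what urlretrieve() ends up using; set_referer is a hypothetical helper name, not a library function.

```python
import urllib.request


def set_referer(opener, referer=None):
    """Replace any existing Referer entry in opener.addheaders (hypothetical helper)."""
    kept = [(name, value) for name, value in opener.addheaders
            if name.lower() != "referer"]
    if referer:
        kept.append(("Referer", referer))
    opener.addheaders = kept
    return opener


opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "App/1.7")]
urllib.request.install_opener(opener)  # urlretrieve() goes through this opener

# Each call replaces the previous value rather than adding another header.
set_referer(opener, "http://example.com/")
set_referer(opener, "http://example.com/docs/")
```

After the second call the opener holds exactly one Referer entry, so a following urllib.request.urlretrieve() would send only the most recent value.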
 

E

Just wanted to add a note. I used the sample code posted above, and I
got this error:

    NameError: global name 'addheader' is not defined

The fix is to change the line that calls addheader so that it goes
through self:

    self.addheader('Referer', url)

~E
 
