Impersonating other browsers...

sboyle55

So I wrote a quick python program (my first ever) that needs to
download pages off the web.

I'm using urlopen, and it works fine. But I'd like to be able to
change my browser string from "Python-urllib/1.15" to instead
impersonate Internet Explorer.

I know this can be done very easily with Perl, so I'm assuming it's
also easy in Python. How do I do it?
 
Diez B. Roggisch

From the urllib docs:

'''
class URLopener([proxies[, **x509]])

Base class for opening and reading URLs. Unless you need to support opening
objects using schemes other than http:, ftp:, gopher: or file:, you
probably want to use FancyURLopener.

By default, the URLopener class sends a User-Agent: header of "urllib/VVV",
where VVV is the urllib version number. Applications can define their own
User-Agent: header by subclassing URLopener or FancyURLopener and setting
the instance attribute version to an appropriate string value before the
open() method is called.


The optional proxies parameter should be a dictionary mapping scheme names
to proxy URLs, where an empty dictionary turns proxies off completely. Its
default value is None, in which case environmental proxy settings will be
used if present, as discussed in the definition of urlopen(), above.


Additional keyword parameters, collected in x509, are used for
authentication with the https: scheme. The keywords key_file and cert_file
are supported; both are needed to actually retrieve a resource at an https:
URL.

'''
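
For anyone reading this on Python 3, where this functionality lives in urllib.request: a quick way to see the default string the docs describe is to look at a fresh opener's header list. This is only a sketch using the standard library; the variable names are mine.

```python
import urllib.request

# A freshly built opener carries the library's default headers.
opener = urllib.request.build_opener()

# addheaders is a list of (name, value) pairs; turn it into a dict for lookup.
default_headers = dict(opener.addheaders)
default_user_agent = default_headers.get("User-agent")

# Prints something of the form "Python-urllib/<version>".
print(default_user_agent)
```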
 
Skip Montanaro

sboyle> I'm using urlopen, and it works fine. But I'd like to be able
sboyle> to change my browser string from "Python-urllib/1.15" to instead
sboyle> impersonate Internet Explorer.

sboyle> I know this can be done very easily with Perl, so I'm assuming
sboyle> it's also easy in Python. How do I do it?

Easy is in the eye of the beholder I suppose. It doesn't look as
straightforward as I would have thought. You can subclass the
FancyURLopener class like so:

import urllib

class MSIEURLopener(urllib.FancyURLopener):
    version = "Internet Exploder"

then set urllib._urlopener to an instance of it:

urllib._urlopener = MSIEURLopener()

After that, urllib.urlopen() should spit out your user-agent string.
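
In Python 3 the supported way to change what urlopen() uses globally is the public urllib.request.install_opener() rather than a private module attribute. A minimal sketch, assuming an IE 6 style string as a stand-in; IE_USER_AGENT and install_user_agent are illustrative names, not part of the library:

```python
import urllib.request

# Placeholder User-Agent string; substitute whatever the target site expects.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

def install_user_agent(user_agent):
    """Make later urllib.request.urlopen() calls send `user_agent`."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to each request.
    opener.addheaders = [("User-Agent", user_agent)]
    # Register the opener as the process-wide default used by urlopen().
    urllib.request.install_opener(opener)
    return opener

opener = install_user_agent(IE_USER_AGENT)
```

Note that this affects every later urlopen() call in the process, which may or may not be what you want.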

Seems like FancyURLopener should support setting the user agent string
directly. You can accomplish that with something like this:

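class FlexibleUAopener(urllib.FancyURLopener):
    def set_user_agent(self, user_agent):
        # Drop any existing User-Agent header. HTTP header names are
        # case-insensitive, so compare in lower case.
        self.addheaders = [(hdr, val) for (hdr, val) in self.addheaders
                           if hdr.lower() != "user-agent"]
        # addheader() takes the name and value as separate arguments.
        self.addheader("User-agent", user_agent)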

You'd then be able to set the user agent, but have to use your new opener
class directly:

opener = FlexibleUAopener(...)
opener.set_user_agent("Internet Exploder")
f = opener.open(url)
print f.read()
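
The same replace-the-header idea carries over to Python 3's opener objects, which expose addheaders as a plain list. A sketch under that assumption; set_user_agent here is a free function of my own naming, not a library method:

```python
import urllib.request

def set_user_agent(opener, user_agent):
    """Replace any User-Agent entry in opener.addheaders with `user_agent`."""
    # Keep every header except an existing User-Agent (case-insensitive match).
    kept = [(name, value) for (name, value) in opener.addheaders
            if name.lower() != "user-agent"]
    kept.append(("User-Agent", user_agent))
    opener.addheaders = kept

opener = urllib.request.build_opener()
set_user_agent(opener, "Internet Exploder")
```

After this, opener.open(url) would send the new string, and calling set_user_agent() again swaps it out rather than adding a second header.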

It doesn't look any easier to do this using urllib2. Seems like a
semi-obvious oversight for both modules. That suggests few people have ever
desired this capability.
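
For what it's worth, urllib2's Request class does take a headers mapping, and the same interface is available as urllib.request.Request in Python 3. A hedged sketch of the per-request approach; the URL and the IE-style string are placeholders, and make_request is a hypothetical helper:

```python
import urllib.request

def make_request(url, user_agent):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("http://example.com/",
                   "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")

# Request normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req) would then send that header; the call is left out here so the snippet runs without network access.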

Skip
 
Eric Pederson

Skip Montanaro said:
Skip> It doesn't look any easier to do this using urllib2. Seems like a
Skip> semi-obvious oversight for both modules. That suggests few people
Skip> have ever desired this capability.


my $.02:

I have trouble believing that few people have desired this, for two reasons:

(1) Some web sites will shut out user agents they do not recognize, to preserve bandwidth or for other reasons; the right User Agent ID can be required to get the data one wants;

(2) It seems like it is a worthwhile courtesy to identify oneself when spidering or data scraping, and the User Agent ID seems like the obvious way to do that. I'd guess (and like to think) that Python users are generally a little more concerned with such courtesies than the user population of some other languages.

E.g., your website might get a hit from: "Mozilla/5.0 (Songzilla MP3 Blog, http://songzilla.blogspot.com) Gecko/20041107 Firefox/1.0"

And you'll get to decide whether to shut them out or not, but at least it won't seem like the black hats are attacking.




Eric Pederson
http://www.songzilla.blogspot.com
:::::::::::::::::::::::::::::::::::
domainNot="@something.com"
domainIs=domainNot.replace("s","z")
ePrefix="".join([chr(ord(x)+1) for x in "do"])
mailMeAt=ePrefix+domainIs
:::::::::::::::::::::::::::::::::::
 
