Getting Wikipedia page source fails (urllib2)

shahargs

Hi,
I'm trying to get the Wikipedia page source with urllib2:
usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
data = usock.read()
usock.close()
return data
I get an exception because of an HTTP 403 error. Why? With my browser I can
access it without any problem.

Thanks,
Shahar.
 
shahargs

This code works fine for other sites; the problem is only with Wikipedia.
Does anyone know a solution for this problem?
 
Michael J. Fromberger

[email protected] wrote:

> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
> usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
> data = usock.read()
> usock.close()
> return data
> I get an exception because of an HTTP 403 error. Why? With my browser I can
> access it without any problem.
>
> Thanks,
> Shahar.

It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:

import urllib

# Python 2: FancyURLopener sends its "version" attribute as the User-Agent.
class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

# Install the custom opener so that urllib.urlopen() uses it.
urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()

I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will permit
access to Y only if the Referer field indicates that the request came
from somewhere internal to the site. I have seen both of these
techniques used to foil screen-scraping.

Cheers,
-M

-- 
Michael J. Fromberger | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA
 
