get wikipedia source failed (urllib2)

Discussion in 'Python' started by shahargs@gmail.com, Aug 7, 2007.

  1. Guest

    Hi,
    I'm trying to get a Wikipedia page source with urllib2:
    usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
    data = usock.read()
    usock.close()
    return data
    I get an exception because of an HTTP 403 error. Why? With my browser I can
    access it without any problem.

    Thanks,
    Shahar.
    , Aug 7, 2007
    #1

  2. Guest

    On Aug 7, 11:54, wrote:
    > Hi,
    > I'm trying to get a Wikipedia page source with urllib2:
    > usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
    > data = usock.read()
    > usock.close()
    > return data
    > I get an exception because of an HTTP 403 error. Why? With my browser I
    > can access it without any problem.
    >
    > Thanks,
    > Shahar.


    This code works fine for other sites; the problem is specific to Wikipedia.
    Does anyone know a solution to this problem?
    , Aug 7, 2007
    #2

  3. <> wrote:
    > This code works fine for other sites; the problem is specific to Wikipedia.
    > Does anyone know a solution to this problem?


    Wikipedia, AFAIK, bans requests without a User-Agent header.
    http://www.voidspace.org.uk/python/articles/urllib2.shtml#headers
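
    For example, a minimal sketch along those lines with urllib2 (untested here;
    the User-Agent string below is just a browser-like placeholder):

    import urllib2

    # Send an explicit User-Agent so the request is not rejected with 403
    req = urllib2.Request(
        "http://en.wikipedia.org/wiki/Albert_Einstein",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    usock = urllib2.urlopen(req)
    data = usock.read()
    usock.close()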

    --
    Lawrence, oluyede.org - neropercaso.it
    "It is difficult to get a man to understand
    something when his salary depends on not
    understanding it" - Upton Sinclair
    Lawrence Oluyede, Aug 7, 2007
    #3
  4. Re: get wikipedia source failed (urllib2)

    In article <>, wrote:

    > Hi,
    > I'm trying to get a Wikipedia page source with urllib2:
    > usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
    > data = usock.read()
    > usock.close()
    > return data
    > I get an exception because of an HTTP 403 error. Why? With my browser I
    > can access it without any problem.
    >
    > Thanks,
    > Shahar.

    It appears that Wikipedia may inspect the contents of the User-Agent
    HTTP header, and that it does not particularly like the string it
    receives from Python's urllib. I was able to make it work with urllib
    via the following code:

    import urllib

    class CustomURLopener(urllib.FancyURLopener):
        version = 'Mozilla/5.0'

    urllib._urlopener = CustomURLopener()

    u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
    data = u.read()

    I'm assuming a similar trick could be used with urllib2, though I didn't
    actually try it. Another thing to watch out for is that some sites
    will redirect a public URL X to an internal URL Y, and will check that
    access to Y is only permitted if the Referer field indicates coming from
    somewhere internal to the site. I have seen both of these techniques
    used to foil screen-scraping.
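
    For what it's worth, here is an untested urllib2 sketch of the same idea,
    installing both headers on an opener (the Referer value is only a
    hypothetical example of what such a site might check for):

    import urllib2

    # These headers are sent with every request made through this opener
    opener = urllib2.build_opener()
    opener.addheaders = [
        ('User-Agent', 'Mozilla/5.0'),
        ('Referer', 'http://en.wikipedia.org/'),
    ]

    u = opener.open('http://en.wikipedia.org/wiki/Albert_Einstein')
    data = u.read()
    u.close()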

    Cheers,
    -M

    --
    Michael J. Fromberger | Lecturer, Dept. of Computer Science
    http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA
    Michael J. Fromberger, Aug 7, 2007
    #4
