Re: urlopen returns forbidden

Discussion in 'Python' started by Chris Rebert, Feb 28, 2011.

  1. Chris Rebert

    Chris Rebert Guest

    On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw <> wrote:
    > I have a working urlopen routine which opens
    > a url, parses it for <a> tags and prints out
    > the links in the page. On some sites, wikipedia for
    > instance, i get a
    >
    > HTTP error 403, forbidden.
    >
    > What is the difference in accessing the site through a web browser
    > and opening/reading the URL with python urllib2.urlopen?


    The User-Agent header (http://en.wikipedia.org/wiki/User_agent ).
    "By default, the URLopener class sends a User-Agent header of
    urllib/VVV, where VVV is the urllib version number."
    – http://docs.python.org/library/urllib.html

    Some sites block obvious non-search-engine bots based on their HTTP
    User-Agent header value.

    You can override the urllib default:
    http://docs.python.org/library/urllib.html#urllib.URLopener.version
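
    A minimal sketch of that override, shown with Python 3's urllib.request
    (in Python 2, urllib2.Request takes the same headers argument). The URL
    and agent string here are just placeholders, not recommendations:

    ```python
    import urllib.request

    # Build a request that sends a custom User-Agent instead of the
    # default "Python-urllib/x.y" that some sites reject with a 403.
    req = urllib.request.Request(
        "http://en.wikipedia.org/wiki/Python_(programming_language)",
        headers={"User-Agent": "my-link-lister/0.1 (example placeholder)"},
    )

    # No network traffic happens yet; urlopen(req) would send the header.
    print(req.get_header("User-agent"))  # -> my-link-lister/0.1 (example placeholder)
    ```

    Note that Request normalizes stored header names (capitalized first
    letter only), hence the "User-agent" spelling when reading it back.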

    Sidenote: Wikipedia has a proper API for programmatic access, which is
    likely why it's blocking your program.

    Cheers,
    Chris
     
    Chris Rebert, Feb 28, 2011
    #1

  2. On Sun, 27 Feb 2011 22:19:18 -0800, Chris Rebert wrote:

    > On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw <>
    > wrote:
    >> I have a working urlopen routine which opens a url, parses it for <a>
    >> tags and prints out the links in the page. On some sites, wikipedia for
    >> instance, i get a
    >>
    >> HTTP error 403, forbidden.
    >>
    >> What is the difference in accessing the site through a web browser and
    >> opening/reading the URL with python urllib2.urlopen?

    [...]
    > Sidenote: Wikipedia has a proper API for programmatic access, which is
    > likely why it's blocking your program.


    What he said. Please don't abuse Wikipedia by screen-scraping it.


    --
    Steven
     
    Steven D'Aprano, Feb 28, 2011
    #2

  3. On 2011-02-28, Chris Rebert <> wrote:
    > On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw <> wrote:
    >> I have a working urlopen routine which opens
    >> a url, parses it for <a> tags and prints out
    >> the links in the page. On some sites, wikipedia for
    >> instance, i get a
    >>
    >> HTTP error 403, forbidden.
    >>
    >> What is the difference in accessing the site through a web browser
    >> and opening/reading the URL with python urllib2.urlopen?

    >
    > The User-Agent header (http://en.wikipedia.org/wiki/User_agent ).


    Sometimes you also need to set the Referer header for pages that
    don't allow direct-linking from "outside".
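
    Both headers can go in the same headers dict. A sketch with made-up
    URLs (Python 3's urllib.request; urllib2 works the same way in
    Python 2). Note that HTTP itself spells the header "Referer", a
    misspelling baked into the protocol:

    ```python
    import urllib.request

    # Hypothetical hotlink-protected image and the page that links to it.
    req = urllib.request.Request(
        "http://example.com/images/photo.jpg",
        headers={
            "User-Agent": "example-fetcher/0.1 (placeholder)",
            # The standard header name really is "Referer", not "Referrer".
            "Referer": "http://example.com/gallery.html",
        },
    )
    print(req.get_header("Referer"))  # -> http://example.com/gallery.html
    ```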

    As somebody else has already said, if the site provides an API that
    they want you to use you should do so rather than hammering their web
    server with a screen-scraper.

    Not only is it a lot less load on the site, it's usually a lot easier.

    --
    Grant Edwards (grant.b.edwards at gmail.com)
    Yow! Look DEEP into the OPENINGS!! Do you see any ELVES or
    EDSELS ... or a HIGHBALL?? ...
     
    Grant Edwards, Feb 28, 2011
    #3
  4. Terry Reedy

    Terry Reedy Guest

    On 2/28/2011 10:21 AM, Grant Edwards wrote:

    > As somebody else has already said, if the site provides an API that
    > they want you to use you should do so rather than hammering their web
    > server with a screen-scraper.


    Is there any generic method for finding out if a site provides an
    API, and specifically, how to find Wikipedia's?

    I looked at the Wikipedia articles on API and web services and did not
    find any mention of theirs (though there is one for Amazon).

    --
    Terry Jan Reedy
     
    Terry Reedy, Feb 28, 2011
    #4
  5. Chris Rebert

    Chris Rebert Guest

    On Mon, Feb 28, 2011 at 9:44 AM, Terry Reedy <> wrote:
    > On 2/28/2011 10:21 AM, Grant Edwards wrote:
    >> As somebody else has already said, if the site provides an API that
    >> they want you to use you should do so rather than hammering their web
    >> server with a screen-scraper.

    >
    > Is there any generic method for finding out if a site provides an API,
    > and specifically, how to find Wikipedia's?
    >
    > I looked at the Wikipedia articles on API and web services and did not
    > find any mention of theirs (though there is one for Amazon).


    Technically, the API belongs to MediaWiki, the wiki software
    underlying Wikipedia:
    http://www.mediawiki.org/wiki/API
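
    As a sketch of what using it looks like, you can ask the API for a
    page's links as structured JSON instead of scraping <a> tags. The
    parameter names come from the documented MediaWiki query module; the
    page title is just an example:

    ```python
    from urllib.parse import urlencode

    # Build a MediaWiki API request for the links on a page.
    params = {
        "action": "query",
        "titles": "Python (programming language)",
        "prop": "links",    # list the wiki links on the page
        "pllimit": "50",    # up to 50 links per request
        "format": "json",
    }
    url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)
    print(url)
    # urlopen(url) would return JSON you can parse with the json module,
    # which is far more robust than scraping the rendered HTML.
    ```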

    Cheers,
    Chris
     
    Chris Rebert, Feb 28, 2011
    #5
