Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

Discussion in 'Python' started by John Nagle, Oct 2, 2007.

  1. John Nagle

    John Nagle Guest

    For some reason, Python's parser for "robots.txt" files
    doesn't like Wikipedia's "robots.txt" file:

    >>> import robotparser
    >>> url = 'http://wikipedia.org/robots.txt'
    >>> chk = robotparser.RobotFileParser()
    >>> chk.set_url(url)
    >>> chk.read()
    >>> testurl = 'http://wikipedia.org'
    >>> chk.can_fetch('Mozilla', testurl)
    False
    >>>


    The Wikipedia robots.txt file passes robots.txt validation,
    and it doesn't disallow unknown browsers. But the Python
    parser doesn't see it that way. No matter what user agent or URL is
    specified for that robots.txt file, the only answer is "False".
    It fails in Python 2.4 on Windows and 2.5 on Fedora Core.

    I use "robotparser" on lots of other robots.txt files, and it
    normally works. It even used to work on Wikipedia's older file.
    But there's something in there now that robotparser doesn't like.
    Any ideas?

    John Nagle
    John Nagle, Oct 2, 2007
    #1

  2. Re: Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

    In message <HYiMi.9932$>, John Nagle
    wrote:

    > For some reason, Python's parser for "robots.txt" files
    > doesn't like Wikipedia's "robots.txt" file:
    >
    > >>> import robotparser
    > >>> url = 'http://wikipedia.org/robots.txt'
    > >>> chk = robotparser.RobotFileParser()
    > >>> chk.set_url(url)
    > >>> chk.read()
    > >>> testurl = 'http://wikipedia.org'
    > >>> chk.can_fetch('Mozilla', testurl)
    > False
    > >>>


    >>> chk.errcode
    403

    Significant?
    Lawrence D'Oliveiro, Oct 2, 2007
    #2

  3. On 02/10/2007, John Nagle <> wrote:
    >
    > But there's something in there now that robotparser doesn't like.
    > Any ideas?


    Wikipedia denies _all_ access for the standard urllib user agent, and
    when the robotparser gets a 401 or 403 response when trying to fetch
    robots.txt, it is equivalent to "Disallow: *".

    http://infix.se/2006/05/17/robotparser
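
    A rough sketch of a workaround, assuming all you need is for robotparser
    to see the real rules: fetch robots.txt yourself with a non-default
    User-Agent header and hand the lines to RobotFileParser.parse(). The
    user-agent string below is only a placeholder, not anything Wikipedia
    specifically requires (Python 2, to match the versions in this thread):

        import urllib2
        import robotparser

        # Fetch robots.txt with an explicit User-Agent so the server
        # doesn't refuse the request with a 403.
        req = urllib2.Request('http://wikipedia.org/robots.txt',
                              headers={'User-Agent': 'ExampleBot/0.1'})
        lines = urllib2.urlopen(req).read().splitlines()

        # Parse the fetched lines instead of letting read() do the fetch.
        chk = robotparser.RobotFileParser()
        chk.parse(lines)
        print chk.can_fetch('Mozilla', 'http://wikipedia.org')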

    It could also be worth mentioning that if you were planning on
    crawling a lot of Wikipedia pages, you may be better off downloading
    the whole thing instead: <http://download.wikimedia.org/>
    (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
    wiki markup to HTML).
    --
    filip salomonsson
    Filip Salomonsson, Oct 2, 2007
    #3
  4. John Nagle

    John Nagle Guest

    Lawrence D'Oliveiro wrote:
    > In message <HYiMi.9932$>, John Nagle
    > wrote:
    >
    >> For some reason, Python's parser for "robots.txt" files
    >> doesn't like Wikipedia's "robots.txt" file:
    >>
    >> >>> import robotparser
    >> >>> url = 'http://wikipedia.org/robots.txt'
    >> >>> chk = robotparser.RobotFileParser()
    >> >>> chk.set_url(url)
    >> >>> chk.read()
    >> >>> testurl = 'http://wikipedia.org'
    >> >>> chk.can_fetch('Mozilla', testurl)
    >> False
    >> >>>

    >
    > >>> chk.errcode
    > 403
    >
    > Significant?
    >

    Helpful. Also an undocumented feature. See

    http://docs.python.org/lib/module-robotparser.html

    John Nagle
    John Nagle, Oct 2, 2007
    #4
  5. John Nagle

    John Nagle Guest

    Filip Salomonsson wrote:
    > On 02/10/2007, John Nagle <> wrote:
    >> But there's something in there now that robotparser doesn't like.
    >> Any ideas?

    >
    > Wikipedia denies _all_ access for the standard urllib user agent, and
    > when the robotparser gets a 401 or 403 response when trying to fetch
    > robots.txt, it is equivalent to "Disallow: *".
    >
    > http://infix.se/2006/05/17/robotparser


    That explains it. It's an undocumented feature of "robotparser",
    as is the 'errcode' variable. The documentation of "robotparser" is
    silent on error handling (can it raise an exception?) and should be
    updated.

    > It could also be worth mentioning that if you were planning on
    > crawling a lot of Wikipedia pages, you may be better off downloading
    > the whole thing instead: <http://download.wikimedia.org/>
    > (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
    > wiki markup to HTML).


    This is for SiteTruth, the site rating system (see "sitetruth.com"),
    and we never look at more than 21 pages per site. We're looking for
    the name and address of the business behind the web site, and if we
    can't find that after checking 20 of the most obvious places, it's
    either not there or not "prominently disclosed".

    John Nagle
    John Nagle, Oct 2, 2007
    #5
  6. Re: Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

    In article <ActMi.30614$>,
    John Nagle <> wrote:

    > Filip Salomonsson wrote:
    > > On 02/10/2007, John Nagle <> wrote:
    > >> But there's something in there now that robotparser doesn't like.
    > >> Any ideas?

    > >
    > > Wikipedia denies _all_ access for the standard urllib user agent, and
    > > when the robotparser gets a 401 or 403 response when trying to fetch
    > > robots.txt, it is equivalent to "Disallow: *".
    > >
    > > http://infix.se/2006/05/17/robotparser

    >
    > That explains it. It's an undocumented feature of "robotparser",
    > as is the 'errcode' variable. The documentation of "robotparser" is
    > silent on error handling (can it raise an exception?) and should be
    > updated.


    Hi John,
    Robotparser is probably following the never-approved RFC for robots.txt,
    which is the closest thing there is to a standard. It says, "On server
    response indicating access restrictions (HTTP Status Code 401 or 403) a
    robot should regard access to the site completely restricted."
    http://www.robotstxt.org/wc/norobots-rfc.html
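
    Given that, a minimal sketch of how a crawler might detect the refusal
    explicitly instead of just getting a blanket "False" back (it relies on
    the undocumented 'errcode' attribute shown earlier in the thread, so
    treat it as a best-effort check, Python 2 style):

        import robotparser

        chk = robotparser.RobotFileParser()
        chk.set_url('http://wikipedia.org/robots.txt')
        chk.read()

        # 'errcode' is undocumented but set by read() from the HTTP status;
        # per the draft RFC, 401/403 means the whole site is off limits.
        code = getattr(chk, 'errcode', None)
        if code in (401, 403):
            print 'robots.txt fetch refused (HTTP %d); site treated as disallowed' % code
        else:
            print chk.can_fetch('Mozilla', 'http://wikipedia.org')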

    If you're interested, I have a replacement for the robotparser module
    that works a little better (IMHO) and which you might also find better
    documented. I'm using it in production code:
    http://nikitathespider.com/python/rerp/

    Happy spidering

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
    Nikita the Spider, Oct 4, 2007
    #6