Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

Discussion in 'Python' started by John Nagle, Oct 2, 2007.

  1. John Nagle

    John Nagle Guest

    For some reason, Python's parser for "robots.txt" files
    doesn't like Wikipedia's "robots.txt" file:

    The Wikipedia robots.txt file passes robots.txt validation,
    and it doesn't disallow unknown browsers. But the Python
    parser doesn't see it that way. No matter what user agent or URL is
    specified, the only answer for that robots.txt file is "False".
    It's failing in Python 2.4 on Windows and in 2.5 on Fedora Core.
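
    As a minimal reproduction sketch (Python 2.x, using the robotparser
    module as shipped with 2.4/2.5; the user agent string here is just an
    example):

        import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("http://en.wikipedia.org/robots.txt")
        rp.read()

        # Always prints False, whatever agent or URL is passed in.
        print rp.can_fetch("Mozilla/5.0",
                           "http://en.wikipedia.org/wiki/Main_Page")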

    I use "robotparser" on lots of other robots.txt files, and it
    normally works. It even used to work on Wikipedia's older file.
    But there's something in there now that robotparser doesn't like.
    Any ideas?

    John Nagle
     
    John Nagle, Oct 2, 2007
    #1

  2. 403

    Significant?
     
    Lawrence D'Oliveiro, Oct 2, 2007
    #2

  3. Wikipedia denies _all_ access for the standard urllib user agent, and
    when robotparser gets a 401 or 403 response while trying to fetch
    robots.txt, it treats the whole site as disallowed (the equivalent of
    "Disallow: /").

    http://infix.se/2006/05/17/robotparser
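
    One possible workaround (just a sketch; the User-Agent value and bot
    name below are made up for illustration) is to fetch robots.txt
    yourself with urllib2 and a custom User-Agent header, then hand the
    lines to the parser via parse() instead of letting read() use
    urllib's default opener:

        import urllib2
        import robotparser

        # Fetch robots.txt with an explicit User-Agent instead of the
        # default urllib one, which Wikipedia answers with a 403.
        req = urllib2.Request("http://en.wikipedia.org/robots.txt",
                              headers={"User-Agent": "ExampleBot/1.0"})
        lines = urllib2.urlopen(req).read().splitlines()

        rp = robotparser.RobotFileParser()
        rp.set_url("http://en.wikipedia.org/robots.txt")
        rp.parse(lines)

        # can_fetch now reflects the actual rules rather than a blanket False.
        print rp.can_fetch("ExampleBot",
                           "http://en.wikipedia.org/wiki/Main_Page")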

    It could also be worth mentioning that if you're planning on
    crawling a lot of Wikipedia pages, you may be better off downloading
    the whole thing instead: <http://download.wikimedia.org/>
    (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
    wiki markup to HTML).
     
    Filip Salomonsson, Oct 2, 2007
    #3
  4. John Nagle

    John Nagle Guest

    John Nagle, Oct 2, 2007
    #4
  5. John Nagle

    John Nagle Guest

    That explains it. It's an undocumented feature of "robotparser",
    as is the 'errcode' variable. The documentation of "robotparser" is
    silent on error handling (can it raise an exception?) and should be
    updated.
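
    For what it's worth, a quick way to see those undocumented attributes
    after read() (Python 2.x; errcode only exists once read() has run) is
    something like:

        import robotparser

        rp = robotparser.RobotFileParser("http://en.wikipedia.org/robots.txt")
        rp.read()
        print getattr(rp, "errcode", None)  # e.g. 403 when the fetch was refused
        print rp.disallow_all               # True after a 401/403 response
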
    This is for SiteTruth, the site rating system (see "sitetruth.com"),
    and we never look at more than 21 pages per site. We're looking for
    the name and address of the business behind the web site, and if we
    can't find that after looking in the 20 most obvious places, it's
    either not there or not "prominently disclosed".

    John Nagle
     
    John Nagle, Oct 2, 2007
    #5
  6. Hi John,
    Robotparser is probably following the never-approved RFC for robots.txt,
    which is the closest thing there is to a standard. It says, "On server
    response indicating access restrictions (HTTP Status Code 401 or 403) a
    robot should regard access to the site completely restricted."
    http://www.robotstxt.org/wc/norobots-rfc.html

    If you're interested, I have a replacement for the robotparser module
    that works a little better (IMHO) and which you might also find better
    documented. I'm using it in production code:
    http://nikitathespider.com/python/rerp/

    Happy spidering
     
    Nikita the Spider, Oct 4, 2007
    #6
