robotparser behavior on 403 (Forbidden) robot.txt files

John Nagle · Jun 2, 2008

I just discovered that the "robotparser" module interprets
a 403 ("Forbidden") status on a "robots.txt" file as meaning
"all access disallowed". That's unexpected behavior.

A major site ("http://www.aplus.net/robot.txt") has their
"robots.txt" file set up that way.

There's no real "robots.txt" standard, unfortunately.
So it's not definitively a bug.

John Nagle
SiteTruth

Martin v. Löwis · Jun 2, 2008

> I just discovered that the "robotparser" module interprets
> a 403 ("Forbidden") status on a "robots.txt" file as meaning
> "all access disallowed". That's unexpected behavior.

That's specified in the "norobots RFC":

http://www.robotstxt.org/norobots-rfc.txt

- On server response indicating access restrictions (HTTP Status
Code 401 or 403) a robot should regard access to the site
completely restricted.

So if a site returns 403, we should assume that it did so
deliberately, and doesn't want to be indexed.

> A major site ("http://www.aplus.net/robot.txt") has their
> "robots.txt" file set up that way.

You should try "http://www.aplus.net/robots.txt" instead,
which can be accessed just fine.

Regards,
Martin

