Python "robots.txt" parser broken since 2003

Discussion in 'Python' started by John Nagle, Apr 21, 2007.

  1. John Nagle

    John Nagle Guest

    This bug, "[ 813986 ] robotparser interactively prompts for username and
    password", has been open since 2003. It killed a big batch job of ours
    last night.

    Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
    If the server asks for basic authentication on that file, "robotparser"
    prompts for the password on standard input. Which is rarely what you
    want. You can demonstrate this with:

    import robotparser
    url = 'http://mueblesmoraleda.com' # this site is password-protected.
    parser = robotparser.RobotFileParser()
    parser.set_url(url)
    parser.read() # Prompts for password

    That's the standard, although silly, "urllib" behavior.

    This was reported in 2003, and a patch was uploaded in 2005, but the patch
    never made it into Python 2.4 or 2.5.

    A temporary workaround is this:

    import robotparser

    def prompt_user_passwd(self, host, realm):
        # Return no credentials so urllib never prompts on stdin.
        return None, None

    robotparser.URLopener.prompt_user_passwd = prompt_user_passwd  # temp patch
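
    Another way to avoid the prompt is to keep the fetch out of robotparser
    altogether and hand it only the text. What follows is a minimal sketch, not
    the patch from the tracker. It assumes Python 3, where the module is
    urllib.robotparser rather than the robotparser module used above, and it
    assumes the robots.txt content was already fetched by code that deals with
    authentication failures itself. parser_from_text is a hypothetical helper
    name.

```python
# Sketch for Python 3 (assumption); the thread itself is about Python 2.
# parser_from_text is a hypothetical helper, not part of the library.
from urllib import robotparser


def parser_from_text(robots_txt):
    """Build a RobotFileParser from robots.txt content fetched elsewhere."""
    parser = robotparser.RobotFileParser()
    # parse() takes an iterable of lines and does no network access,
    # so nothing here can block waiting for a password on stdin.
    parser.parse(robots_txt.splitlines())
    return parser


rules = parser_from_text("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("MyBot", "http://example.com/private/page"))
print(rules.can_fetch("MyBot", "http://example.com/index.html"))
```

    With this split, a batch job decides for itself what a 401 on robots.txt
    means before any parsing happens.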


    John Nagle
     
    John Nagle, Apr 21, 2007
    #1

  2. Terry Reedy

    Terry Reedy Guest

    "John Nagle" <> wrote in message
    news:FvtWh.11824$...
    | This was reported in 2003, and a patch was uploaded in 2005, but the patch
    | never made it into Python 2.4 or 2.5.

    If the patch is still open, perhaps you could review it.

    tjr
     
    Terry Reedy, Apr 22, 2007
    #2

  3. John Nagle

    John Nagle Guest

    Terry Reedy wrote:
    > "John Nagle" <> wrote in message
    > news:FvtWh.11824$...
    > | This was reported in 2003, and a patch was uploaded in 2005, but the patch
    > | never made it into Python 2.4 or 2.5.
    >
    > If the patch is still open, perhaps you could review it.
    >

    I tried it on Python 2.4 and it's in our production system now.
    But someone who regularly does check-ins should do this.

    John Nagle
     
    John Nagle, Apr 22, 2007
    #3
  4. Steven Bethard

    John Nagle wrote:
    > Terry Reedy wrote:
    >> "John Nagle" <> wrote in message
    >> news:FvtWh.11824$...
    >> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
    >> | never made it into Python 2.4 or 2.5.
    >>
    >> If the patch is still open, perhaps you could review it.
    >>

    > I tried it on Python 2.4 and it's in our production system now.
    > But someone who regularly does check-ins should do this.


    If you post such a review (even just the short sentence above) to the
    patch tracker, it often increases the chance of someone committing the
    patch.

    Steve
     
    Steven Bethard, Apr 22, 2007
    #4
  5. Nikita the Spider

    In article <FvtWh.11824$>,
    John Nagle <> wrote:

    > This bug, "[ 813986 ] robotparser interactively prompts for username and
    > password", has been open since 2003. It killed a big batch job of ours
    > last night.
    >
    > Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
    > If the server asks for basic authentication on that file, "robotparser"
    > prompts for the password on standard input. Which is rarely what you
    > want. You can demonstrate this with:
    >
    > import robotparser
    > url = 'http://mueblesmoraleda.com' # this site is password-protected.
    > parser = robotparser.RobotFileParser()
    > parser.set_url(url)
    > parser.read() # Prompts for password
    >
    > That's the standard, although silly, "urllib" behavior.


    John,
    robotparser is (IMO) suboptimal in a few other ways, too.
    - It doesn't handle non-ASCII characters. (They're infrequent but when
    writing a spider which sees thousands of robots.txt files in a short
    time, "infrequent" can become "daily").
    - It doesn't account for BOMs in robots.txt (which are rare).
    - It ignores any Expires header sent with the robots.txt file.
    - It handles some ambiguous return codes (e.g. 503) that it ought to
    pass up to the caller.
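
    The BOM and non-ASCII points can be illustrated with a rough sketch. It
    assumes Python 3's urllib.robotparser (this thread is about the Python 2
    module) and is not code from the parser linked below. parse_robots_bytes
    is a hypothetical helper name, and decoding as UTF-8 with replacement
    characters is only one possible policy.

```python
# Sketch for Python 3 (assumption). parse_robots_bytes is a hypothetical
# helper, not part of the standard library or of any parser in this thread.
from urllib import robotparser


def parse_robots_bytes(raw):
    """Decode raw robots.txt bytes defensively, then parse them."""
    # "utf-8-sig" drops a leading UTF-8 BOM if one is present;
    # errors="replace" keeps stray non-UTF-8 bytes from raising.
    text = raw.decode("utf-8-sig", errors="replace")
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A file that starts with a UTF-8 BOM: without stripping it, the first
# "User-agent" line would not be recognized as a directive.
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /private/\n"
rules = parse_robots_bytes(raw)
print(rules.can_fetch("MyBot", "http://example.com/private/x"))
```

    Expires handling and status codes such as 503 belong to the fetching
    code, which this sketch deliberately leaves to the caller.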

    I wrote my own parser to address these problems. It probably suffers
    from the same urllib hang that you've found (I have not encountered it
    myself) and I appreciate you posting a fix. Here's the code &
    documentation in case you're interested:
    http://NikitaTheSpider.com/python/rerp/

    Cheers

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Apr 22, 2007
    #5
  6. John Nagle

    John Nagle Guest

    Steven Bethard wrote:
    > John Nagle wrote:
    >
    >> Terry Reedy wrote:
    >>
    >>> "John Nagle" <> wrote in message
    >>> news:FvtWh.11824$...
    >>> | This was reported in 2003, and a patch was uploaded in 2005, but the patch
    >>> | never made it into Python 2.4 or 2.5.
    >>>
    >>> If the patch is still open, perhaps you could review it.
    >>>

    >> I tried it on Python 2.4 and it's in our production system now.
    >> But someone who regularly does check-ins should do this.

    >
    >
    > If you post such a review (even just the short sentence above) to the
    > patch tracker, it often increases the chance of someone committing the
    > patch.
    >
    > Steve


    OK, updated the tracker comments.

    John Nagle
     
    John Nagle, Apr 22, 2007
    #6
