uses of robots.txt

Discussion in 'HTML' started by Math, Oct 6, 2007.

  1. Math

    Math Guest

    Hi,

    There is something I really don't understand, and I would like your
    advice...

    1. Some websites (for instance news.google.fr) contain a
    syndication feed (like http://news.google.fr/nwshp?topic=po&output=atom).

    2. These websites have a robots.txt file that prevents some robots
    (identified by user-agent) from indexing them.
    For example, http://news.google.fr/robots.txt contains (extract):
    User-agent: *
    Disallow: /nwshp

    3. I've developed a syndication aggregator, and I would like to
    respect these robots.txt files. But as far as I can see and understand,
    my user-agent isn't authorized to access /nwshp?topic=po&output=atom
    because of this robots.txt...

    So, is this normal? Are robots.txt files only for indexing robots?
    To sum up, should my syndication aggregator respect these files or
    not?

    Thanks.
     
    Math, Oct 6, 2007
    #1

  2. In article <>,
    Math <> wrote:

    > Hi,
    >
    > There is something I really don't understand, and I would like your
    > advice...
    >
    > 1. Some websites (for instance news.google.fr) contain a
    > syndication feed (like http://news.google.fr/nwshp?topic=po&output=atom).
    >
    > 2. These websites have a robots.txt file that prevents some robots
    > (identified by user-agent) from indexing them.
    > For example, http://news.google.fr/robots.txt contains (extract):
    > User-agent: *
    > Disallow: /nwshp
    >
    > 3. I've developed a syndication aggregator, and I would like to
    > respect these robots.txt files. But as far as I can see and understand,
    > my user-agent isn't authorized to access /nwshp?topic=po&output=atom
    > because of this robots.txt...
    >
    > So, is this normal? Are robots.txt files only for indexing robots?
    > To sum up, should my syndication aggregator respect these files or
    > not?


    Hi Math,
    It's hard to say, but if they prefer to keep this content from being
    copied to other sites, robots.txt is the way to do it. In other words,
    you can't assume they just want to keep indexing bots out; they might
    want to keep all bots out.

    If your aggregator is only being used by you and a few friends, then
    probably Google et al wouldn't care if your bot visits them once per
    hour or so. But if you want this aggregator to be used by lots of
    people, then I'd say you need to respect robots.txt.
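
    As a rough illustration of what "respecting robots.txt" means for the
    feed URL in question, here is a minimal sketch using Python's standard
    urllib.robotparser module. It parses the two-line extract quoted above
    rather than fetching the live file, and the user-agent string
    "MyAggregator/0.1" and the helper name is_allowed are made up for the
    example.

```python
from urllib.robotparser import RobotFileParser

# Extract of news.google.fr/robots.txt as quoted in this thread.
ROBOTS_EXTRACT = """\
User-agent: *
Disallow: /nwshp
"""


def is_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# "MyAggregator/0.1" is a hypothetical user-agent name.
feed = "http://news.google.fr/nwshp?topic=po&output=atom"
print(is_allowed(ROBOTS_EXTRACT, "MyAggregator/0.1", feed))  # prints False
```

    Because the rule applies to "User-agent: *" and the feed path starts
    with /nwshp, any bot that honors the file is excluded, whether or not
    it is an indexer.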

    BTW the closest thing there is to a standard for robots.txt is here:
    http://www.robotstxt.org/wc/norobots-rfc.html

    When describing robots, it focuses on indexing bots. But it was written
    at a time when Web robots were less varied than they are now, so the
    author may not have considered your case.

    Good luck

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Oct 7, 2007
    #2

  3. Math

    Newsgroups Guest

    Thanks for your answers Nikita the Spider,


    > If your aggregator is only being used by you and a few friends,

    Currently, yes ;-( but I also developed it for anybody who wants to use
    it. :)

    > But if you want this aggregator to be used by lots of
    > people, then I'd say you need to respect robots.txt.

    The problem is: where is the limit between "few friends" and "lots of
    people"...


    > When describing robots, it focuses on indexing bots. But it was written
    > at a time when Web robots were less varied than they are now, so the
    > author may not have considered your case.

    Yes, I agree. That's another debate. I'm not used to reading RFCs, so
    what does "Expires June 4, 1997" mean on this RFC? Does it mean that
    comments are no longer considered after this date? If not, I could
    comment on this RFC. :)
     
    Newsgroups, Oct 7, 2007
    #3
  4. Math

    Ken Sims Guest

    On Sat, 06 Oct 2007 23:19:49 -0400, Nikita the Spider
    <> wrote:

    >In article <>,
    > Math <> wrote:
    >>
    >> So, is this normal? Are robots.txt files only for indexing robots?
    >> To sum up, should my syndication aggregator respect these files or
    >> not?

    >
    >Hi Math,
    >It's hard to say, but if they prefer to keep this content from being
    >copied to other sites, robots.txt is the way to do it. In other words,
    >you can't assume they just want to keep indexing bots out; they might
    >want to keep all bots out.
    >
    >If your aggregator is only being used by you and a few friends, then
    >probably Google et al wouldn't care if your bot visits them once per
    >hour or so. But if you want this aggregator to be used by lots of
    >people, then I'd say you need to respect robots.txt.


    I missed the original message because it was posted from Google
    Gropes, but my opinion is that *all* automated software should
    retrieve and respect robots.txt. I enforce it on my server by
    blocking the IP addresses of bad software at the router.

    --
    Ken
    http://www.kensims.net/
     
    Ken Sims, Oct 7, 2007
    #4
  5. In article <1191750311.5505.12.camel@localhost>,
    Newsgroups <> wrote:

    > Thanks for your answers Nikita the Spider,
    >
    >
    > > If your aggregator is only being used by you and a few friends,

    > Currently, yes ;-( but I also developed it for anybody who wants to use
    > it. :)
    >
    > > But if you want this aggregator to be used by lots of
    > > people, then I'd say you need to respect robots.txt.

    > The problem is: where is the limit between "few friends" and "lots of
    > people"...


    That's where it gets tricky. =) But consider this -- if you obey
    robots.txt 100% from the start, you'll always be doing the right thing
    no matter how many people use your aggregator.
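
    One way an aggregator could apply this to every feed it polls is
    sketched below: parse each host's robots.txt once, cache the result,
    and skip any feed URL the rules disallow. This is only an illustration.
    The class name RobotsGate, the user-agent string, the load_robots
    callback and the second feed URL are all made up for the example, and a
    real aggregator would pass in a function that downloads robots.txt over
    HTTP instead of the fake loader used here.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Caches one parsed robots.txt per host and answers allow/deny questions."""

    def __init__(self, user_agent, load_robots):
        self.user_agent = user_agent
        self.load_robots = load_robots  # callable: robots.txt URL -> file text
        self._parsers = {}

    def allowed(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # First request for this host: load and parse its robots.txt.
            parser = RobotFileParser()
            parser.parse(self.load_robots(origin + "/robots.txt").splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(self.user_agent, url)


# Stand-in for a real HTTP download, so the sketch runs offline.
fake_site = {"http://news.google.fr/robots.txt": "User-agent: *\nDisallow: /nwshp\n"}
calls = []


def fake_loader(robots_url):
    calls.append(robots_url)
    return fake_site.get(robots_url, "")


gate = RobotsGate("MyAggregator/0.1", fake_loader)
feeds = [
    "http://news.google.fr/nwshp?topic=po&output=atom",
    "http://news.google.fr/news?output=atom",  # hypothetical URL
]
allowed_feeds = [u for u in feeds if gate.allowed(u)]
print(allowed_feeds)
```

    With the extract quoted in this thread, only the second (hypothetical)
    URL passes the check, and robots.txt is loaded a single time for the
    host.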

    > > When describing robots, it focuses on indexing bots. But it was written
    > > at a time when Web robots were less varied than they are now, so the
    > > author may not have considered your case.

    > Yes, I agree. That's another debate. I'm not used to reading RFCs, so
    > what does "Expires June 4, 1997" mean on this RFC? Does it mean that
    > comments are no longer considered after this date? If not, I could
    > comment on this RFC. :)


    That RFC was only a draft and it expired before it was approved.
    However, no other RFC governing the use of robots.txt has ever been
    approved or even written as far as I know, so that RFC is the closest
    thing we have to an official standard.

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Oct 8, 2007
    #5
  6. Math

    Newsgroups Guest

    > That's where it gets tricky. =) But consider this -- if you obey
    > robots.txt 100% from the start, you'll always be doing the right thing
    > no matter how many people use your aggregator.


    I agree, but if I obey robots.txt, my aggregator won't aggregate lots
    of RSS feeds. Who wants to use an aggregator which does not aggregate? :)

    For information: there are currently about 70 users of my
    aggregator... It's difficult for me to recruit more. :) But I really
    want to conform 100% to the rules and standards...

    Thanks for your help and opinion.
     
    Newsgroups, Oct 8, 2007
    #6