Web robots

Discussion in 'HTML' started by Paul, Aug 23, 2006.

  1. Paul

    Paul Guest

    I am tearing my hair out. It appears my website is under attack from
    these search engines. I have heard that I can place code in my header
    somewhere to stop this. Any help?

    The browser information that I have collected shows the following:

    Mozilla/5.0 (compatible; Yahoo! Slurp;
    http://help.yahoo.com/help/us/ysearch/slurp)

    Mozilla/5.0 (compatible; Googlebot/2.1;
    +http://www.google.com/bot.html)

    Please help.

    Desmond.
     
    Paul, Aug 23, 2006
    #1

  2. Paul

    Andy Dingley Guest

    Paul wrote:

    > It appears my website is under attack from these search engines.


    Evil Google! No doughnut!

    Do a web or newsgroup search for "robots.txt".

    Apart from that, post a URL to your site if you want better advice.
    We're not psychic.
     
    Andy Dingley, Aug 23, 2006
    #2

  3. Paul

    Paul Guest

    The website is www.des-otoole.co.uk.
    Also, can I add that I do not have any meta data describing the site.
    Can someone nominate me to a search engine? They should not have found
    me in the first place.

     
    Paul, Aug 23, 2006
    #3
  4. Paul wrote:
    > Can someone nominate me to a search engine? They should not have found
    > me in the first place


    Someone could have linked to your site from a site that the search
    engines know about.

    Please don't top post.
     
    David Dorward, Aug 23, 2006
    #4
  5. TreatmentPlant, Aug 23, 2006
    #5
  6. Paul

    Ken Sims Guest

    Hi Paul -

    On 23 Aug 2006 03:34:44 -0700, "Paul" <> wrote:

    >The website is www.des-otoole.co.uk


    You need a robots.txt text file at the root of the site (e.g.
    accessible as <www.des-otoole.co.uk/robots.txt>).

    See http://www.robotstxt.org/wc/norobots.html

    This robots.txt file tells all robots to not access any part of your
    website:

    User-agent: *
    Disallow: /

    Of course bad robots won't bother to even retrieve the file or will
    retrieve it and ignore it, but that's another issue.

    Google, Yahoo, MSN, etc. will retrieve and obey the robots.txt (though
    you may still see some activity for a little while since they use
    multiple servers for indexing and it may take a while for any given
    server to retrieve an up-to-date copy of robots.txt).
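
    One way to check what a robots.txt like the one above permits is to feed
    it to a parser. The sketch below assumes Python and its standard-library
    urllib.robotparser module, neither of which is mentioned in the thread.
    It shows only how that one parser reads the file; real crawlers use
    their own parsers. The helper name is_allowed and the bot name
    "SomeOtherBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch path (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "SomeOtherBot" is a hypothetical name used only for this example.
    for agent in ("Googlebot", "Yahoo! Slurp", "SomeOtherBot"):
        print(agent, is_allowed(agent, "/index.html"))
```

    Because the file applies to "User-agent: *" and disallows "/", every
    agent and every path is refused by a parser that obeys it.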

    --
    Ken
    http://www.kensims.net/
     
    Ken Sims, Aug 23, 2006
    #6
  7. In article <>,
    "Paul" <> wrote:

    > I am tearing my hair out. It appears my website is under attack from
    > these search engines. I have heard that I can place code in my header
    > somewhere to stop this. Any help?
    >
    > The browser information that I have collected shows the following:
    >
    > Mozilla/5.0 (compatible; Yahoo! Slurp;
    > http://help.yahoo.com/help/us/ysearch/slurp)
    >
    > Mozilla/5.0 (compatible; Googlebot/2.1;
    > +http://www.google.com/bot.html)



    Desmond,
    Ken has already given you good practical advice to which I have nothing
    to add. But I'm wondering what you mean by saying your Web site is
    "under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
    when spidering a site.

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Aug 24, 2006
    #7
  8. Paul

    Mike Collins Guest

    On 23 Aug 2006 02:36:09 -0700, "Paul" <> wrote:

    >I am tearing my hair out. It appears my website is under attack from
    >these search engines. I have heard that I can place code in my header
    >somewhere to stop this. Any help?


    http://danielwebb.us/software/bot-trap/

    You need a bot-trap. It catches bots that ignore robots.txt and writes
    the IP to a blacklist. The one referenced above works with PHP/Apache.
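
    The idea behind a bot trap can be sketched in a few lines. This is not
    the code of the package linked above (which is PHP/Apache); it is a
    minimal Python illustration under stated assumptions: the trap path
    "/bot-trap/" is hypothetical, it is assumed to be listed under Disallow
    in robots.txt and never linked visibly, and the blacklist is kept in
    memory rather than written to a file.

```python
# Hypothetical trap path: assumed to be disallowed in robots.txt, so a
# well-behaved robot never requests it.
TRAP_PATH = "/bot-trap/"


class BotTrap:
    """Blacklist any IP address that requests the trap path."""

    def __init__(self):
        self.blacklist = set()

    def handle(self, ip: str, path: str) -> int:
        """Return an HTTP status code for a request from ip to path."""
        if ip in self.blacklist:
            return 403  # already caught earlier: refuse everything
        if path.startswith(TRAP_PATH):
            self.blacklist.add(ip)  # ignored robots.txt: remember this IP
            return 403
        return 200


if __name__ == "__main__":
    trap = BotTrap()
    # 192.0.2.x addresses are reserved for documentation examples.
    print(trap.handle("192.0.2.1", "/index.html"))
    print(trap.handle("192.0.2.1", "/bot-trap/"))
    print(trap.handle("192.0.2.1", "/index.html"))
```

    A robot that honours robots.txt never touches the trap path, so only
    clients that ignore the file end up on the blacklist.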

     
    Mike Collins, Aug 24, 2006
    #8
  9. Paul

    Mike Collins Guest

    On Thu, 24 Aug 2006 12:42:38 GMT, Mike Collins
    <webspammer_@_yaho-o_.com> wrote:

    >On 23 Aug 2006 02:36:09 -0700, "Paul" <> wrote:
    >
    >>I am tearing my hair out. It appears my website is under attack from
    >>these search engines. I have heard that I can place code in my header
    >>somewhere to stop this. Any help?

    >
    >http://danielwebb.us/software/bot-trap/
    >
    >You need a bot-trap. It catches bots that ignore robots.txt and writes
    >the IP to a blacklist. The one referenced above works with PHP/Apache.


    http://www.homelandstupidity.us/software/bad-behavior/

    bad-behavior will control aggressive scraping bots
     
    Mike Collins, Aug 24, 2006
    #9
  10. Paul

    rf Guest

    Mike Collins wrote:

    >>You need a bot-trap. It catches bots that ignore robots.txt and writes
    >>the IP to a blacklist. The one referenced above works with PHP/Apache.

    >
    > http://www.homelandstupidity.us/software/bad-behavior/


    Hmmm.

    "Help contribute directly to Bad Behaviour Development"
    followed by a list of monetary amounts in $US, pounds sterling and Euros.

    I guess this site does not want my Australian dollars. Fine with me :)

    (short sighted bastards)

    --
    Cheers
    Richard.
     
    rf, Aug 24, 2006
    #10
  11. "Nikita the Spider" <> wrote in message
    news:...
    > In article <>,
    > "Paul" <> wrote:
    >


    >
    > Desmond,
    > Ken has already given you good practical advice to which I have nothing
    > to add. But I'm wondering what you mean by saying your Web site is
    > "under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
    > when spidering a site.
    >
    >


    They are, unless you have a shopping cart on your site. If you're building
    carts, you have to be aware that bots will follow any link.

    This includes links that may add products to a temp cart or delete them,
    which can play havoc if you are using any kind of real-time SKU tracking
    code.

    Google and Yahoo are pretty good at obeying robots.txt exclusions; certain
    image indexer bots are not.
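
    One common way to keep link-following bots from changing cart state is
    to accept changes only through POST requests, since following a plain
    link is a GET. The Python sketch below is only an illustration of that
    idea, not anything described in the thread: the Cart class, its handle
    method and the SKU value are hypothetical, and a real cart would sit
    behind a web framework.

```python
class Cart:
    """Toy cart that changes state only on POST requests."""

    def __init__(self):
        self.items = {}  # SKU -> quantity

    def handle(self, method: str, action: str, sku: str) -> int:
        """Apply a cart action and return an HTTP status code."""
        if method != "POST":
            return 405  # a followed link arrives as GET: change nothing
        if action == "add":
            self.items[sku] = self.items.get(sku, 0) + 1
        elif action == "remove":
            self.items.pop(sku, None)
        else:
            return 400  # unknown action
        return 200


if __name__ == "__main__":
    cart = Cart()
    print(cart.handle("GET", "add", "SKU-1"), cart.items)
    print(cart.handle("POST", "add", "SKU-1"), cart.items)
```

    Listing the cart URLs under Disallow in robots.txt helps as well, but
    only with robots that obey the file.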

    Runnin'
     
    Runnin' on Empty, Aug 24, 2006
    #11
  12. Paul

    Paul Guest

    I have a hit counter that logs how many visitors I get. Over the last
    month this counter has gone through the roof. It now appears that it is
    robots. My website does not have any meta tags like keywords or
    description, so they should not be going there. I think someone has
    nominated me to them, but I would not know. The database records
    clearly indicate a date. I can re-adjust the counter because I have
    database records, but I don't want robots increasing my counter.

    Desmond.

     
    Paul, Aug 26, 2006
    #12
  13. In article <>,
    "Paul" <> wrote:
    > Nikita the Spider wrote:
    > > In article <>,
    > > "Paul" <> wrote:
    > >
    > > > I am tearing my hair out. It appears my website is under attack from
    > > > these search engines. I have heard that I can place code in my header
    > > > somewhere to stop this. Any help?
    > > >

    > > to add. But I'm wondering what you mean by saying your Web site is
    > > "under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
    > > when spidering a site.

    >
    > I have a hit counter that logs how many visitors I get. Over the last
    > month this counter has gone through the roof. It now appears that it is
    > robots. My website does not have any meta tags like keywords or
    > description, so they should not be going there. I think someone has
    > nominated me to them, but I would not know. The database records
    > clearly indicate a date. I can re-adjust the counter because I have
    > database records, but I don't want robots increasing my counter.


    Desmond,
    I think you misunderstand how search engine bots work. It is an
    unwritten rule on the Net that any site that is public is open to anyone
    who wants to visit, be that a human with a Web browser or a search
    engine bot or any other kind of user agent. Search spiders don't wait
    for an invitation to spider a Web site. You don't have to have meta tags
    and you don't have to submit your site to the search engines. Any public
    mention of your site (such as in this newsgroup!) or in some cases even
    a non-public mention (such as a URL sent via GMail, which might be
    picked up by Google) can make search engines aware of your site. They're
    aggressively competing against one another to provide the best results
    and part of "best" is "most complete" which means that if search engine
    A knows about more Web sites than search engine B, then A has an
    advantage -- hence their enthusiasm for discovering new sites.

    They also realize that they will get banned from sites if they spider
    them too aggressively and piss people off, so they're (usually) polite
    and will try not to overwhelm a site with too many requests at once.
    That statement is almost sure to spur a comment from a Webmaster who
    feels that her site has been abused by Googlebot/Yahoo Slurp/MSNBot and
    I'm sure that happens once in a while, but by and large they try to be
    nice because generating hostility works heavily against them.

    Also note (as I believe someone else mentioned) that the user agent that
    is sent along with a request is based on an honor system. It is trivial
    for an evil bot to masquerade as some other bot via the user agent
    string.
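
    For a hit counter, the simplest filter is to skip requests whose user
    agent string names a known robot. The Python sketch below assumes the
    counter has the user agent string available for each hit; the token list
    and function names are made up for illustration. Because the user agent
    is an honor system, this only filters robots that identify themselves
    honestly.

```python
# Substrings taken from the robot names mentioned in this thread; the
# list is illustrative, not complete.
BOT_TOKENS = ("googlebot", "slurp", "msnbot")


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user agent string names a known robot."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)


def count_human_hits(user_agents) -> int:
    """Count the hits whose user agent does not look like a known robot."""
    return sum(1 for ua in user_agents if not looks_like_bot(ua))


if __name__ == "__main__":
    hits = [
        "Mozilla/5.0 (compatible; Yahoo! Slurp; "
        "http://help.yahoo.com/help/us/ysearch/slurp)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        # Hypothetical browser user agent for comparison.
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/1.5",
    ]
    print(count_human_hits(hits))
```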

    Please don't top-post.
    http://en.wikipedia.org/wiki/Top_posting

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Aug 26, 2006
    #13
  14. Paul

    jokla Guest

    Paul wrote:
    > I am tearing my hair out. It appears my website is under attack from
    > these search engines. I have heard that I can place code in my header
    > somewhere to stop this. Any help?
    >
    > The browser information that I have collected shows the following:
    >
    > Mozilla/5.0 (compatible; Yahoo! Slurp;
    > http://help.yahoo.com/help/us/ysearch/slurp)
    >
    > Mozilla/5.0 (compatible; Googlebot/2.1;
    > +http://www.google.com/bot.html)
    >
    > Please help.
    >
    > Desmond.



    Last week Yahoo and MSN came to one of my sites a dozen times a day,
    and Google did this yesterday... looks like they're catching up with
    the crawling.
     
    jokla, Aug 27, 2006
    #14
  15. While the city slept, Nikita the Spider ()
    feverishly typed...

    [...]
    > But I'm wondering what you mean by saying your Web
    > site is "under attack". Yahoo! Slurp and Googlebot try to be
    > reasonably polite when spidering a site.


    Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

    Cheers,
    Nige

    --
    Nigel Moss http://www.nigenet.org.uk
    Mail address will bounce. | Take the DOG. out!
    "Your mother ate my dog!", "Not all of him!"
     
    nice.guy.nige, Aug 30, 2006
    #15
  16. Paul

    Paul Guest

    Can I get a web robot to see only one or two files, since they link to
    areas of my site that I do want indexed? Like this:

    User-agent: *
    Allow: /history.php
    Disallow: /

    This would help me enormously.



     
    Paul, Sep 1, 2006
    #16
  17. In article <>,
    "Paul" <> wrote:
    > nice.guy.nige wrote:
    > > While the city slept, Nikita the Spider ()
    > > feverishly typed...
    > >
    > > [...]
    > > > But I'm wondering what you mean by saying your Web
    > > > site is "under attack". Yahoo! Slurp and Googlebot try to be
    > > > reasonably polite when spidering a site.

    > >
    > > Indeed... Last time around, Googlebot even made me a cup of tea! ;-)
    > >

    >
    > Can I get a web robot to see only one or two files, since they link to
    > areas of my site that I do want indexed? Like this:
    >
    > User-agent: *
    > Allow: /history.php
    > Disallow: /


    Paul,
    My guess is that this will probably work with most bots, but it isn't a
    sure thing. Oddly enough, robots.txt is not as clearly standardized as
    HTML or HTTP. The authoritative reference for it is those few pages on
    robotstxt.org -- there's no RFC that defines the format. The original
    description of robots.txt in 1994 didn't permit "Allow:" fields. An
    updated proposal from 1996 defines the Allow fields, but that proposal
    never made it beyond draft stage:
    http://www.robotstxt.org/wc/norobots-rfc.html

    Since it was a draft proposal, does that make it more or less
    authoritative than the 1994 document? It's up to robot authors to
    decide. My spider (see my sig) obeys all of the 1994 and 1996
    specifications (except for one small part where the 1996 spec
    contradicts the 1994 document), so my spider would understand Allow:
    fields in your robots.txt.

    Yahoo and MSNBot make no mention of it and state clearly that they
    follow the 1994 version of the spec:
    http://help.yahoo.com/help/us/ysearch/slurp/index.html
    http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm

    I don't know if Googlebot will be as nice to you as it is to Nige (who
    made me laugh), but even though Google says the same as Yahoo & MSNBot,
    they also use "Allow" fields in their examples, so they clearly support
    it.

    My guess is that all of the big name bots support it, just because it
    isn't hard to support. Robots.txt just isn't that hard to parse in the
    first place. But I can't back up my assertion with anything other than
    warm fuzzies which sound nice but are no substitute for hard facts (or
    even documentation!) which I can't provide.
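
    One data point that can be checked is how a single parser treats the
    Allow line. The sketch below assumes Python and its standard-library
    urllib.robotparser; it says nothing about what Googlebot, Slurp or
    MSNBot actually do. The helper name can_fetch and the agent name
    "ExampleBot" are hypothetical. Python's parser applies the first rule
    that matches, so the Allow line has to come before "Disallow: /" for
    this to work there.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt proposed in the question above.
ROBOTS_TXT = """\
User-agent: *
Allow: /history.php
Disallow: /
"""


def can_fetch(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if Python's parser lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/history.php", "/index.html", "/"):
        print(path, can_fetch(path))
```

    With this parser, /history.php is permitted and everything else is
    refused. A robot that implements only the 1994 document may ignore the
    Allow line and treat the whole site as disallowed.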

    HTH

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Sep 1, 2006
    #17
  18. Nikita the Spider wrote:
    > My guess is that this will probably work with most bots, but it isn't a
    > sure thing. Oddly enough, robots.txt is not as clearly standardized as
    > HTML or HTTP.


    Nor should it be in its current state. It's very poorly-implemented. The
    idea of using a resource for this purpose is ridiculous: asking for
    permission to use a resource is a metadata issue; it should be handled
    with a special HTTP header, or something similar.

    Using a resource for other resources' metadata has practical problems;
    for instance, Wiki software which lets users create their own files has
    to special-case the name 'robots.txt'. If they don't do that, a
    malicious user could, through the standard Wiki interface, cause search
    engines to ignore the site. Robots.txt is broken.
     
    Leif K-Brooks, Oct 22, 2006
    #18
  19. Not sure that I understand the problem with this. Those of us lucky enough
    to be running an Apache web server don't need to bother with the "IP
    blacklist" - I simply put the IP into the httpd.conf file after the word
    "Deny" - it works like a charm - and carries on logging their failed attempts.


    Frederick



     
    Frederick Smith, Oct 22, 2006
    #19
