Web robots

Paul · Aug 23, 2006

I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

The browser information that I have collected shows the following:

Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)

Please help.

Desmond.

Andy Dingley · Aug 23, 2006

Paul said:
It appears my website is under attack from these search engines.

Evil Google! No doughnut!

Web or newsgroup search for "robots.txt"

Apart from that, post a URL to your site if you want better advice.
We're not psychic.

Paul · Aug 23, 2006

The website is www.des-otoole.co.uk
Also, can I add that I do not have any metadata describing the site.
Could someone have nominated me to a search engine? They should not have
found me in the first place.

David Dorward · Aug 23, 2006

Paul said:
Could someone have nominated me to a search engine? They should not have
found me in the first place.

Someone could have linked to your site from a site that the search
engines know about.

Please don't top post.

TreatmentPlant · Aug 23, 2006

David said:
Someone could have linked to your site from a site that the search
engines know about.

Please don't top post.

http://www.google.com/support/webmasters/bin/topic.py?topic=8843

http://www.google.com/support/webmasters/bin/topic.py?topic=8459

might help?

Ken Sims · Aug 23, 2006

Hi Paul -

The website is www.des-otoole.co.uk

You need a robots.txt text file at the root of the site (e.g.
accessible as <www.des-otoole.co.uk/robots.txt>).

See http://www.robotstxt.org/wc/norobots.html

This robots.txt file tells all robots to not access any part of your
website:

User-agent: *
Disallow: /

Of course bad robots won't bother to even retrieve the file or will
retrieve it and ignore it, but that's another issue.

Google, Yahoo, MSN, etc. will retrieve and obey the robots.txt (though
you may still see some activity for a little while since they use
multiple servers for indexing and it may take a while for any given
server to retrieve an up-to-date copy of robots.txt).
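
As a quick sanity check of the rules above, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler names and paths passed in are only illustrative, and each real crawler uses its own parser, so this shows what a compliant robot should conclude rather than what any particular one will do.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if a compliant robot named user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Illustrative robot names and path; every combination should be refused.
    for agent in ("Googlebot", "Slurp"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "/index.html"))
```

With these rules every path is refused for every robot name, which matches the intent of "Disallow: /".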

Nikita the Spider · Aug 24, 2006

Paul said:
I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

The browser information that I have collected shows the following:

Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)

Desmond,
Ken has already given you good practical advice to which I have nothing
to add. But I'm wondering what you mean by saying your Web site is
"under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
when spidering a site.

Mike Collins · Aug 24, 2006

I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

http://danielwebb.us/software/bot-trap/

You need a bot-trap. It catches bots that ignore robots.txt and writes
the IP to a blacklist. The one referenced above works with PHP/Apache.
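
The general idea behind a bot trap can be sketched in a few lines of Python. This is not the code of the script linked above, only an illustration of the technique it describes: the trap path, the function name, and the in-memory set standing in for a blacklist file are all assumptions.

```python
# Hypothetical trap URL (an assumption for this sketch): it would be listed
# under "Disallow" in robots.txt and linked nowhere a human would click, so
# only bots that ignore robots.txt should ever request it.
TRAP_PATH = "/bot-trap/"


def handle_request(path: str, ip: str, blacklist: set) -> int:
    """Return an HTTP status code for a request and update the blacklist."""
    if ip in blacklist:
        return 403  # this address was caught on an earlier request
    if path.startswith(TRAP_PATH):
        blacklist.add(ip)  # the client ignored robots.txt, so remember it
        return 403
    return 200


if __name__ == "__main__":
    blocked = set()
    # 192.0.2.10 is a documentation-range address used as a stand-in.
    print(handle_request("/index.html", "192.0.2.10", blocked))
    print(handle_request("/bot-trap/page", "192.0.2.10", blocked))
    print(handle_request("/index.html", "192.0.2.10", blocked))
```

A real deployment would persist the blacklist and enforce it in the web server rather than in application code.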

Mike Collins · Aug 24, 2006

http://www.homelandstupidity.us/software/bad-behavior/

bad-behavior will control aggressive scraping bots

rf · Aug 24, 2006

Mike said:
http://www.homelandstupidity.us/software/bad-behavior/

Hmmm.

"Help contribute directly to Bad Behaviour Development"
followed by a list of monetary amounts in $US, pounds sterling and Euros.

I guess this site does not want my Australian dollars. Fine with me.

(short sighted bastards)

Runnin' on Empty · Aug 24, 2006

Desmond,
Ken has already given you good practical advice to which I have nothing
to add. But I'm wondering what you mean by saying your Web site is
"under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
when spidering a site.

They are, unless you have a shopping cart on your site. If you're building
carts, you have to be aware that bots will follow any link.

This includes links that may add products to a temp cart or delete them.
This can play havoc if you are using any kind of real-time SKU tracking
code.

Google and Yahoo are pretty good at obeying robots.txt exclusions; certain
image indexer bots are not.

Runnin'

Paul · Aug 26, 2006

I have a hit counter that logs how many visitors I get. Over the last
month this counter has gone through the roof. It now appears that it is
robots. My website does not have any meta tags like keywords or
description, so they should not be going there. I think someone has
nominated me to them, but I would not know. The database records
clearly indicate a date. I can readjust the counter because I have
database records, but I don't want robots increasing my counter.

Desmond.

Nikita the Spider · Aug 26, 2006

I have a hit counter that logs how many visitors I get. Over the last
month this counter has gone through the roof. It now appears that it is
robots. My website does not have any meta tags like keywords or
description, so they should not be going there. I think someone has
nominated me to them, but I would not know. The database records
clearly indicate a date. I can readjust the counter because I have
database records, but I don't want robots increasing my counter.

Desmond,
I think you misunderstand how search engine bots work. It is an
unwritten rule on the Net that any site that is public is open to anyone
who wants to visit, be that a human with a Web browser or a search
engine bot or any other kind of user agent. Search spiders don't wait
for an invitation to spider a Web site. You don't have to have meta tags
and you don't have to submit your site to the search engines. Any public
mention of your site (such as in this newsgroup!) or in some cases even
a non-public mention (such as a URL sent via Gmail, which might be
picked up by Google) can make search engines aware of your site. They're
aggressively competing against one another to provide the best results
and part of "best" is "most complete" which means that if search engine
A knows about more Web sites than search engine B, then A has an
advantage -- hence their enthusiasm for discovering new sites.

They also realize that they will get banned from sites if they spider
them too aggressively and piss people off, so they're (usually) polite
and will try not to overwhelm a site with too many requests at once.
That statement is almost sure to spur a comment from a Webmaster who
feels that her site has been abused by Googlebot/Yahoo Slurp/MSNBot and
I'm sure that happens once in a while, but by and large they try to be
nice because generating hostility works heavily against them.

Also note (as I believe someone else mentioned) that the user agent that
is sent along with a request is based on an honor system. It is trivial
for an evil bot to masquerade as some other bot via the user agent
string.
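
For the hit counter specifically, one rough approach is to skip requests whose user agent contains a known crawler name. The sketch below assumes the counter code can see the User-Agent header; the token list and function names are illustrative, not a complete or official list. Because the header is self-reported, this filters out only bots that identify themselves honestly.

```python
# Substrings based on the user-agent strings and bot names mentioned in this
# thread. This tuple is an assumption for the sketch, not a complete list.
KNOWN_CRAWLER_TOKENS = ("googlebot", "yahoo! slurp", "msnbot")


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the self-reported user agent names a known crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)


def count_visit(counter: int, user_agent: str) -> int:
    """Return the new counter value, leaving it unchanged for crawlers."""
    return counter if looks_like_crawler(user_agent) else counter + 1


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    human = "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.5"
    print(count_visit(10, bot), count_visit(10, human))
```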

Please don't top-post.
http://en.wikipedia.org/wiki/Top_posting

jokla · Aug 27, 2006

Paul said:
I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

The browser information that I have collected shows the following:

Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)

Please help.

Desmond.

Last week Yahoo and MSN came to one of my sites a dozen times a day,
and Google did this yesterday... looks like they're catching up with
the crawling.

nice.guy.nige · Aug 30, 2006

While the city slept, Nikita the Spider
feverishly typed...

[...]

But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Cheers,
Nige

Paul · Sep 1, 2006

Can I get a web robot to see only one or two files, as they link to areas
of my site that I do want indexed? Like:

User-agent: *
Allow: /history.php
Disallow: /

This would help me enormously.

nice.guy.nige said:
While the city slept, Nikita the Spider
feverishly typed...

[...]

But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Cheers,
Nige

Nikita the Spider · Sep 1, 2006

nice.guy.nige said:
While the city slept, Nikita the Spider
feverishly typed...

[...]

But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Paul said:
Can I get a web robot to see only one or two files, as they link to areas
of my site that I do want indexed? Like:

User-agent: *
Allow: /history.php
Disallow: /

Paul,
My guess is that this will probably work with most bots, but it isn't a
sure thing. Oddly enough, robots.txt is not as clearly standardized as
HTML or HTTP. The authoritative reference for it is those few pages on
robotstxt.org -- there's no RFC that defines the format. The original
description of robots.txt in 1994 didn't permit "Allow:" fields. An
updated proposal from 1996 defines the Allow fields, but that proposal
never made it beyond draft stage:
http://www.robotstxt.org/wc/norobots-rfc.html

Since it was a draft proposal, does that make it more or less
authoritative than the 1994 document? It's up to robot authors to
decide. My spider (see my sig) obeys all of the 1994 and 1996
specifications (except for one small part where the 1996 spec
contradicts the 1994 document), so my spider would understand Allow:
fields in your robots.txt.

Yahoo and MSNBot make no mention of it and state clearly that they
follow the 1994 version of the spec:
http://help.yahoo.com/help/us/ysearch/slurp/index.html
http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm

I don't know if Googlebot will be as nice to you as it is to Nige (who
made me laugh), but even though Google says the same as Yahoo & MSNBot,
they also use "Allow" fields in their examples, so they clearly support
it.

My guess is that all of the big name bots support it, just because it
isn't hard to support. Robots.txt just isn't that hard to parse in the
first place. But I can't back up my assertion with anything other than
warm fuzzies which sound nice but are no substitute for hard facts (or
even documentation!) which I can't provide.
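
One parser whose behaviour is easy to check is the one in Python's standard library. The sketch below feeds it the rules from the question; the robot name "ExampleBot" and the second path are made up for illustration. Python's parser applies the first rule that matches, so the Allow line has to come before "Disallow: /", and other parsers may resolve such conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# The rules proposed in the question, with the Allow line listed first.
ROBOTS_TXT = """\
User-agent: *
Allow: /history.php
Disallow: /
"""


def allowed_paths(robots_txt: str, user_agent: str, paths: list) -> dict:
    """Map each path to True/False according to Python's robots.txt parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


if __name__ == "__main__":
    # "ExampleBot" and "/index.html" are hypothetical names for this sketch.
    print(allowed_paths(ROBOTS_TXT, "ExampleBot", ["/history.php", "/index.html"]))
```

This only shows how one parser reads the file; it says nothing certain about Googlebot, Slurp, or MSNBot.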

HTH

Leif K-Brooks · Oct 22, 2006

Nikita said:
My guess is that this will probably work with most bots, but it isn't a
sure thing. Oddly enough, robots.txt is not as clearly standardized as
HTML or HTTP.

Nor should it be in its current state. It's very poorly-implemented. The
idea of using a resource for this purpose is ridiculous: asking for
permission to use a resource is a metadata issue; it should be handled
with a special HTTP header, or something similar.

Using a resource for other resources' metadata has practical problems;
for instance, Wiki software which lets users create their own files has
to special-case the name 'robots.txt'. If they don't do that, a
malicious user could, through the standard Wiki interface, cause search
engines to ignore the site. Robots.txt is broken.

Frederick Smith · Oct 22, 2006

Not sure that I understand the problem with this. Those of us lucky enough
to be running an Apache web server don't need to bother with the "IP
blacklist" - I simply put the IP into the httpd.conf file after the word
"Deny" - it works like a charm - and carries on logging their failed attempts.

Frederick
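
If the list of addresses grows, the Deny lines could be generated rather than typed by hand. The following is a small sketch under the assumption that the server uses the classic "Deny from" directive described above; the function name and the sample addresses (taken from documentation ranges) are illustrative, and where the lines go in httpd.conf depends on the existing configuration.

```python
import ipaddress


def deny_lines(ips: list) -> list:
    """Build Apache 'Deny from' lines, skipping entries that are not valid IPs."""
    lines = []
    for raw in ips:
        try:
            addr = ipaddress.ip_address(raw.strip())
        except ValueError:
            continue  # ignore anything that does not parse as an IP address
        lines.append(f"Deny from {addr}")
    return lines


if __name__ == "__main__":
    # Sample input only; these addresses are reserved for documentation.
    for line in deny_lines(["192.0.2.10", "not-an-ip", " 203.0.113.7 "]):
        print(line)
```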
