Web robots

Andy Dingley

Paul said:
It appears my website is under attack from these search engines.

Evil Google! No doughnut!

Do a web or newsgroup search for "robots.txt".

Apart from that, post a URL to your site if you want better advice.
We're not psychic.
 
Paul

The website is www.des-otoole.co.uk
Can I also add that I do not have any metadata describing the site.
Could someone have nominated me to a search engine? They should not have
found me in the first place.
 
David Dorward

Paul said:
Could someone have nominated me to a search engine? They should not have
found me in the first place.

Someone could have linked to your site from a site that the search
engines know about.

Please don't top post.
 
Ken Sims

Hi Paul -


You need a robots.txt text file at the root of the site (e.g.
accessible as <www.des-otoole.co.uk/robots.txt>).

See http://www.robotstxt.org/wc/norobots.html

This robots.txt file tells all robots to not access any part of your
website:

User-agent: *
Disallow: /

Of course bad robots won't bother to even retrieve the file or will
retrieve it and ignore it, but that's another issue.

Google, Yahoo, MSN, etc. will retrieve and obey the robots.txt (though
you may still see some activity for a little while since they use
multiple servers for indexing and it may take a while for any given
server to retrieve an up-to-date copy of robots.txt).
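
One way to check what a robots.txt like the one above permits is Python's standard urllib.robotparser module. This is only a sketch of how a well-behaved parser reads the file; the page URL is an assumed example, and real crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# index.html is an assumed page name, used only for illustration.
allowed = parser.can_fetch("Googlebot", "http://www.des-otoole.co.uk/index.html")
print(allowed)  # False: every path is disallowed for every user agent
```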
 
Nikita the Spider

Paul said:
I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

The browser information that I have collected shows the following:

Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)


Desmond,
Ken has already given you good practical advice to which I have nothing
to add. But I'm wondering what you mean by saying your Web site is
"under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
when spidering a site.
 
Mike Collins

Paul said:
I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

http://danielwebb.us/software/bot-trap/

You need a bot-trap. It catches bots that ignore robots.txt and writes
the IP to a blacklist. The one referenced above works with PHP/Apache.
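
The idea behind a bot-trap can be sketched in a few lines of Python. This is not the code of the PHP/Apache tool linked above, just an illustration of the technique; the trap path, the in-memory blacklist, and the function name are all assumptions made for the example. The trap URL would be listed under Disallow in robots.txt, so only robots that ignore the file ever request it.

```python
# Hypothetical trap path; it would also appear as "Disallow: /bot-trap/" in robots.txt.
TRAP_PATH = "/bot-trap/"

# In-memory blacklist for illustration; a real setup would persist this somewhere.
blacklist = set()

def handle_request(ip, path):
    """Return an HTTP status code for a request from `ip` for `path`."""
    if ip in blacklist:
        return 403  # already caught earlier
    if path.startswith(TRAP_PATH):
        blacklist.add(ip)  # the client ignored robots.txt, so remember it
        return 403
    return 200

print(handle_request("192.0.2.10", "/index.html"))        # 200
print(handle_request("192.0.2.10", "/bot-trap/page.php"))  # 403
print(handle_request("192.0.2.10", "/index.html"))        # 403
```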
 
Runnin' on Empty

Nikita the Spider said:
Desmond,
Ken has already given you good practical advice to which I have nothing
to add. But I'm wondering what you mean by saying your Web site is
"under attack". Yahoo! Slurp and Googlebot try to be reasonably polite
when spidering a site.

They are, unless you have a shopping cart on your site. If you're building
carts, you have to be aware that bots will follow any link.

This includes links that add products to a temporary cart or delete them,
which can play havoc if you are using any kind of real-time SKU tracking
code.

Google and Yahoo are pretty good at obeying robots.txt exclusions; certain
image indexer bots are not.

Runnin'
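
Crawlers follow links with GET requests, so one common mitigation (an addition here, not something described in the post above) is to change the cart only when a request arrives as a POST. The sketch below assumes a hypothetical handler and a plain dict as the cart; it is not tied to any particular shopping cart package.

```python
def handle_cart_request(method, action, cart, sku):
    """Apply a cart action only for POST requests; return an HTTP status code."""
    if method.upper() != "POST":
        return 405  # a crawler following a plain link lands here and changes nothing
    if action == "add":
        cart[sku] = cart.get(sku, 0) + 1
    elif action == "remove":
        cart.pop(sku, None)
    else:
        return 400  # unknown action
    return 200

cart = {}
print(handle_cart_request("GET", "add", cart, "SKU-1"), cart)   # 405 {}
print(handle_cart_request("POST", "add", cart, "SKU-1"), cart)  # 200 {'SKU-1': 1}
```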
 
Paul

I have a hit counter that logs how many visitors I get. Over the last
month this counter has gone through the roof. It now appears that it is
robots. My website does not have any meta tags such as keywords or
description, so they should not be going there. I think someone has
nominated me to them, but I would not know. The database records
clearly indicate a date. I can readjust the counter because I have
database records, but I don't want robots increasing my counter.

Desmond.
 
Nikita the Spider

Paul said:
I have a hit counter that logs how many visitors I get. Over the last
month this counter has gone through the roof. It now appears that it is
robots. My website does not have any meta tags such as keywords or
description, so they should not be going there. I think someone has
nominated me to them, but I would not know. The database records
clearly indicate a date. I can readjust the counter because I have
database records, but I don't want robots increasing my counter.

Desmond,
I think you misunderstand how search engine bots work. It is an
unwritten rule on the Net that any site that is public is open to anyone
who wants to visit, be that a human with a Web browser or a search
engine bot or any other kind of user agent. Search spiders don't wait
for an invitation to spider a Web site. You don't have to have meta tags
and you don't have to submit your site to the search engines. Any public
mention of your site (such as in this newsgroup!) or in some cases even
a non-public mention (such as a URL sent via GMail, which might be
picked up by Google) can make search engines aware of your site. They're
aggressively competing against one another to provide the best results
and part of "best" is "most complete" which means that if search engine
A knows about more Web sites than search engine B, then A has an
advantage -- hence their enthusiasm for discovering new sites.

They also realize that they will get banned from sites if they spider
them too aggressively and piss people off, so they're (usually) polite
and will try not to overwhelm a site with too many requests at once.
That statement is almost sure to spur a comment from a Webmaster who
feels that her site has been abused by Googlebot/Yahoo Slurp/MSNBot and
I'm sure that happens once in a while, but by and large they try to be
nice because generating hostility works heavily against them.

Also note (as I believe someone else mentioned) that the user agent that
is sent along with a request is based on an honor system. It is trivial
for an evil bot to masquerade as some other bot via the user agent
string.
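
For the hit counter itself, a rough sketch of skipping the well-known crawlers by user agent might look like the following. The token list and function names are assumptions for illustration, and because the user agent is only an honor system, this filters out polite bots and nothing else.

```python
# Substrings taken from the crawler names mentioned in this thread;
# the list is illustrative, not complete.
KNOWN_BOT_TOKENS = ("googlebot", "slurp", "msnbot")

def is_known_bot(user_agent):
    """Return True if the user agent string names one of the listed crawlers."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)

def count_hit(counter, user_agent):
    """Increment the visitor count unless the request claims to be a known bot."""
    if not is_known_bot(user_agent):
        counter["visitors"] += 1
    return counter["visitors"]

counter = {"visitors": 0}
count_hit(counter, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
count_hit(counter, "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/1.5")
print(counter["visitors"])  # 1
```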

Please don't top-post.
http://en.wikipedia.org/wiki/Top_posting
 
jokla

Paul said:
I am tearing my hair out. It appears my website is under attack from
these search engines. I have heard that I can place code in my header
somewhere to stop this. Any help?

The browser information that I have collected shows the following:

Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)

Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)

Please help.

Desmond.


Last week Yahoo and MSN came to one of my sites a dozen times a day,
and Google did the same yesterday. It looks like they're catching up
with the crawling.
 
nice.guy.nige

While the city slept, Nikita the Spider
feverishly typed...

[...]
But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Cheers,
Nige
 
Paul

Can I get a web robot to see only one or two files, since they link to
areas of my site that I do want indexed? Like this:

User-agent: *
Allow: /history.php
Disallow: /

This would help me enormously.



nice.guy.nige said:
While the city slept, Nikita the Spider
feverishly typed...

[...]
But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Cheers,
Nige
 
Nikita the Spider

nice.guy.nige said:
While the city slept, Nikita the Spider
feverishly typed...

[...]
But I'm wondering what you mean by saying your Web
site is "under attack". Yahoo! Slurp and Googlebot try to be
reasonably polite when spidering a site.

Indeed... Last time around, Googlebot even made me a cup of tea! ;-)

Paul said:
Can I get a web robot to see only one or two files, since they link to
areas of my site that I do want indexed? Like this:

User-agent: *
Allow: /history.php
Disallow: /

Paul,
My guess is that this will probably work with most bots, but it isn't a
sure thing. Oddly enough, robots.txt is not as clearly standardized as
HTML or HTTP. The authoritative reference for it is those few pages on
robotstxt.org -- there's no RFC that defines the format. The original
description of robots.txt in 1994 didn't permit "Allow:" fields. An
updated proposal from 1996 defines the Allow fields, but that proposal
never made it beyond draft stage:
http://www.robotstxt.org/wc/norobots-rfc.html

Since it was a draft proposal, does that make it more or less
authoritative than the 1994 document? It's up to robot authors to
decide. My spider (see my sig) obeys all of the 1994 and 1996
specifications (except for one small part where the 1996 spec
contradicts the 1994 document), so my spider would understand Allow:
fields in your robots.txt.

Yahoo and MSNBot make no mention of it and state clearly that they
follow the 1994 version of the spec:
http://help.yahoo.com/help/us/ysearch/slurp/index.html
http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_FAQ_MSNBotIndexing.htm

I don't know if Googlebot will be as nice to you as it is to Nige (who
made me laugh), but even though Google says the same as Yahoo & MSNBot,
they also use "Allow" fields in their examples, so they clearly support
it.

My guess is that all of the big name bots support it, just because it
isn't hard to support. Robots.txt just isn't that hard to parse in the
first place. But I can't back up my assertion with anything other than
warm fuzzies which sound nice but are no substitute for hard facts (or
even documentation!) which I can't provide.
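
For what it's worth, one parser whose behaviour is easy to check is Python's standard urllib.robotparser, which understands Allow lines and applies the first rule that matches. The sketch below feeds it Paul's proposed file; the host name and the second page are assumed examples, and this says nothing about how any particular search engine's own parser behaves.

```python
from urllib.robotparser import RobotFileParser

# Paul's proposed robots.txt. With a first-match parser like this one,
# the Allow line has to come before the Disallow line.
ROBOTS_TXT = """\
User-agent: *
Allow: /history.php
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com and index.php are placeholders for illustration.
history_ok = parser.can_fetch("AnyBot", "http://example.com/history.php")
index_ok = parser.can_fetch("AnyBot", "http://example.com/index.php")
print(history_ok, index_ok)  # True False
```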

HTH
 
Leif K-Brooks

Nikita said:
My guess is that this will probably work with most bots, but it isn't a
sure thing. Oddly enough, robots.txt is not as clearly standardized as
HTML or HTTP.

Nor should it be in its current state. It's very poorly-implemented. The
idea of using a resource for this purpose is ridiculous: asking for
permission to use a resource is a metadata issue; it should be handled
with a special HTTP header, or something similar.

Using a resource for other resources' metadata has practical problems;
for instance, Wiki software which lets users create their own files has
to special-case the name 'robots.txt'. If they don't do that, a
malicious user could, through the standard Wiki interface, cause search
engines to ignore the site. Robots.txt is broken.
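
The special-casing described above might look something like this in a Wiki's page-creation code. It is a minimal sketch with made-up names, not taken from any real Wiki package.

```python
# Names that users must not be able to create through the Wiki interface.
# Only robots.txt comes from the discussion; the set is otherwise hypothetical.
RESERVED_NAMES = {"robots.txt"}

def can_create_page(name):
    """Return False for page names that would shadow a reserved resource."""
    normalized = name.strip().lstrip("/").lower()
    return normalized not in RESERVED_NAMES

print(can_create_page("history"))     # True
print(can_create_page("Robots.TXT"))  # False
```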
 
Frederick Smith

I'm not sure that I understand the problem with this. Those of us lucky
enough to be running an Apache web server don't need to bother with the
"IP blacklist". I simply put the IP into the httpd.conf file after the
word "Deny". It works like a charm, and Apache carries on logging their
failed attempts.


Frederick
 
