robot.txt

David Graham

Hi
I have a folder on my site that I use to practice on, I don't want robots
indexing this folder. I believe the meta tag is not as good as a robot.txt
file. I would like to use a robot.txt file but...

1. What is the syntax of the line that I write to prevent access to a folder?
(The folder is called 'sefriendly' and it lives off the root folder, which is
called 'www'.)

2. In which folder is the robot.txt file stored?

thanks

David
 
PeterMcC

David said:
Hi
I have a folder on my site that I use to practice on, I don't want
robots indexing this folder. I believe the meta tag is not as good as
a robot.txt file. I would like to use a robot.txt file but...

1. What is the syntax of the line that I write to prevent access to a
folder? (The folder is called 'sefriendly' and it lives off the root
folder, which is called 'www'.)

User-agent: *
Disallow: /sefriendly/

2. In which folder is the robot.txt file stored?

In your root folder (in your case, www).

There's lots of info at:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
And a script that checks your robot.txt file
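
For anyone who wants to check rules like the two lines above without waiting for a spider to visit, here is a minimal sketch using Python's standard urllib.robotparser module. The user-agent name "ExampleBot" and the test paths are made up for illustration; only the two rules come from this thread. It shows what a well-behaved robot would conclude from the rules, not what any particular search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, as they would appear in the robots.txt file.
RULES = """\
User-agent: *
Disallow: /sefriendly/
"""


def is_allowed(url_path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a rule-abiding robot may fetch url_path.

    "ExampleBot" is a hypothetical user-agent name used for illustration.
    """
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # Hypothetical paths: one inside the practice folder, one outside it.
    for path in ("/sefriendly/test.html", "/index.html"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Note that this only models a robot that chooses to obey the file; an ordinary browser never consults it.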
 
David Graham

PeterMcC said:
User-agent: *
Disallow: /sefriendly/

I put the robot.txt file into the www folder containing the two lines above
(exactly as you indicate, i.e. on two lines) but I can still visit the site
using IE6. I thought those two lines ban access from all UAs. I have
cleared out my browser's cache in case that was what I was viewing, but that
made no difference. I will read up on this subject, but could you point out
where my thinking is a bit off here? Does the robot.txt file just ban spiders
and not browsers?

TIA
David
 
PeterMcC

David said:
I put the robot.txt file into the www folder containing the two lines
above (exactly as you indicate, i.e. on two lines) but I can still
visit the site using IE6. I thought those two lines ban access from
all UAs. I have cleared out my browser's cache in case that was what
I was viewing, but that made no difference. I will read up on this
subject, but could you point out where my thinking is a bit off here?
Does the robot.txt file just ban spiders and not browsers?

Just spiders.
 
PeterMcC

PeterMcC said:
David Graham wrote:

Just spiders.

BTW - if you don't have a link to a page, it won't get spidered because the
spider only follows links.

If you want to have links to the page but don't want it spidered or seen
by others, use .htaccess to password protect the directory that holds the
page.

HTH
 
David Graham

PeterMcC said:
BTW - if you don't have a link to a page, it won't get spidered because the
spider only follows links.

If you want to have links to the page but don't want it spidered or seen
by others, use .htaccess to password protect the directory that holds the
page.

HTH
--
PeterMcC
If you feel that any of the above is incorrect,
inappropriate or offensive in any way,
please ignore it and accept my apologies.

Thanks for the help. I have one more question. Google indexed one of my
practice sites before I had a chance to use a robot.txt file. Do you know
how long it will be before Google deletes the cached version of this site,
which I never intended to be indexed? The reason I ask is that the
unwanted site is competing in the search results with the site which I want
to be indexed. (The unwanted site is doing better than the wanted site; I
have not yet got round to making my main site more optimised for search
engines.)

TIA
David
 
Denise Enck

David Graham said:
Hi
I have a folder on my site that I use to practice on, I don't want robots
indexing this folder. I believe the meta tag is not as good as a robot.txt
file. I would like to use a robot.txt file but...

1. What is the syntax of the line that I write to prevent access to a folder?
(The folder is called 'sefriendly' and it lives off the root folder, which is
called 'www'.)

2. In which folder is the robot.txt file stored?

thanks

David


the file should be called robots.txt rather than robot.txt else it won't
keep any spiders out ~

Denise
 
David Graham

Denise Enck said:
the file should be called robots.txt rather than robot.txt else it won't
keep any spiders out ~

Denise

Thanks loads - didn't know it had to have the 's' on the name.

David
 
PeterMcC

David said:
Thanks loads - didn't know it had to have the 's' on the name.

Ooops - picked up the "robot.txt" from the OP and it didn't register.
Thanks, Denise.
 
Jukka K. Korpela

Headless said:
That would be silly and it would make the concept practically
unusable.

_What_ would be silly? The robots.txt concept _is_ defined the way I
described, both in the HTML specification I referred to and in the
"Robots Exclusion Standard".

I'm on a bog standard shared Apache user web space provided with my
dial account (so virtual root). Using a robots.txt works fine (I
can see that it works because I use Atomz site search on one of my
sites, and it echoes back the robots.txt exclusions as it indexes the
site).

What you see is what the Atomz software does. Everyone and his dog or
search system may use a name like robots.txt, or robot.txt, or
foo.bar for some private purposes. But that's _not_ what the Robots
Exclusion Standard for the World Wide Web means.

Don't get lured by statements of compliance. On the average, any
statement about complying with some standard is bogus.

If Atomz actually uses robots.txt other than at the server root, then
http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
mildly. It says: "Yes, Atomz Search is compliant with the Robots
Exclusion Protocol and it will examine the robots.txt file if it is
present on your site." and refers to common resources on that
protocol/standard. And those resources make it clear that robots.txt is
_server-wide_, residing at address /robots.txt. In particular,
http://www.robotstxt.org/wc/faq.html#noindex
says:
"What if I can't make a /robots.txt file?
Sometimes you cannot make a /robots.txt file, because you don't
administer the entire server. All is not lost: there is a new standard
for using HTML META tags to keep robots out of your documents. - -"

(Of course, "sometimes" and "new" are somewhat funny words in this
context.)
 
Headless

Jukka K. Korpela said:
_What_ would be silly? The robots.txt concept _is_ defined the way I
described, both in the HTML specification I referred to and in the
"Robots Exclusion Standard".

Afaics you read too much into references to "/" and "only a server
administrator can maintain such a list". "/" refers to the root of my
web space, and I am the "server administrator" (virtually ;-).

Afaik there is no way for a robot to access the physical server root (as
opposed to the virtual server root).


Headless
 
David Graham

Jukka K. Korpela said:
_What_ would be silly? The robots.txt concept _is_ defined the way I
described, both in the HTML specification I referred to and in the
"Robots Exclusion Standard".


What you see is what the Atomz software does. Everyone and his dog or
search system may use a name like robots.txt, or robot.txt, or
foo.bar for some private purposes. But that's _not_ what the Robots
Exclusion Standard for the World Wide Web means.

Don't get lured by statements of compliance. On the average, any
statement about complying with some standard is bogus.

If Atomz actually uses robots.txt other than at the server root, then
http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
mildly. It says: "Yes, Atomz Search is compliant with the Robots
Exclusion Protocol and it will examine the robots.txt file if it is
present on your site." and refers to common resources on that
protocol/standard. And those resources make it clear that robots.txt is
_server-wide_, residing at address /robots.txt. In particular,
http://www.robotstxt.org/wc/faq.html#noindex
says:
"What if I can't make a /robots.txt file?
Sometimes you cannot make a /robots.txt file, because you don't
administer the entire server. All is not lost: there is a new standard
for using HTML META tags to keep robots out of your documents. - -"

(Of course, "sometimes" and "new" are somewhat funny words in this
context.)

Yucca has my respect, his answers are good, but Headless is no dummy either.
Has Headless conceded defeat on this one? Anyway, I will be adding the meta
tag exclusion thing to every page. Thanks to everyone who helped.
David
 
Jukka K. Korpela

Headless said:
Afaics you read too much into references to "/" and "only a server
administrator can maintain such a list". "/" refers to the root of
my web space, and I am the "server administrator" (virtually ;-).

No, I don't. The meaning of a URL that begins with "/" is well-defined
in URL specifications, and this part of the specs is honored by all
relevant parties. The meaning of "/robots.txt" only depends on the
server part of the base address, and the meaning is
http://www.sample.example/robots.txt
where www.sample.example is the server part of the base address.
There's no vagueness here. Ref.: RFC 2396.

And the Robots Exclusion Standard defines only that URL as the
location of the file for exclusion specifications.

Afaik there is no way for a robot to access the physical server
root (as opposed to the virtual server root).

The only thing that a robot, or a browser for that matter, knows and
cares about is that it sends a request for
http://www.sample.example/robots.txt
How the server www.sample.example processes it is its business. For all
that robots (or browsers) can know, the server might pick up file
vdsdghuigae.fig from folder yhftgy\dahjks\fhgj, transmogrify its
content, and send back the result. Or it might run a server-side script
to generate something. Or it might connect to typing machines operated
by chimpanzees and record and send back what they are currently
producing.
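
To illustrate the URL resolution described above, here is a small sketch using Python's standard urllib.parse.urljoin, which performs ordinary relative-URL resolution. The function name robots_url and the page URLs (including the www.host.example placeholder) are made up for illustration. The point is only that "/robots.txt" resolves against the server part of the address, whatever path the page itself lives under.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Resolve "/robots.txt" against the address of a page.

    Because the reference starts with "/", the path of page_url is
    discarded and only the scheme and server part are kept.
    """
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    # Hypothetical page addresses, flagged as such with .example.
    pages = (
        "http://www.sample.example/index.html",
        "http://www.sample.example/somestuff/page.html",
        "http://www.host.example/~user/index.html",
    )
    for page in pages:
        print(page, "->", robots_url(page))
```

What the server then does to answer that request is, as said above, its own business.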
 
David Graham

lostinspace said:
----- Original Message -----
From: David Graham <>
Newsgroups: alt.html
Sent: Saturday, June 28, 2003 6:23 AM
Subject: robot.txt

David,
Perhaps it's just an off day for most folks?
I've seen some very knowledgeable folks here provide incomplete information.

Robots.txt will NOT ban any robot.
Instead, it is a "suggestion" that honorable bots comply with.
Most dishonorable bots won't read your robots.txt anyway. Any path in there
will only point them in the direction of possibly hidden and unprotected
content.
Jdmorgan has some extensive suggestions on robots:
http://www.webmasterworld.com/forum23/2200.htm

On the other hand, if you're interested in banning and denying admission to
bots, then in most instances that requires the use of htaccess.
See the "Close to Perfect Ban" (a very long thread):
http://www.webmasterworld.com/forum13/687.htm?highlight=perfect+ban

Thanks, I will read the links. I thought this robots.txt post would just be
a simple little matter - perhaps not!
thanks
David
 
Jukka K. Korpela

Jacqui or (maybe) Pete said:
The spec at http://www.robotstxt.org isn't exactly clear on
anything

http://www.robotstxt.org/wc/norobots.html#method says:

'The method used to exclude robots from a server is to create a file on
the server which specifies an access policy for robots. This file must
be accessible via HTTP on the local URL "/robots.txt".'

The only thing that isn't quite clear IMHO is why they call it "local
URL" when they apparently mean _relative_ URL, which _must_ be globally
accessible of course. But URL terminology is generally confused, and
the intentions are clear.
Now what does that mean? Take porjes.com/robots.txt [1]. Its
intention is *not* to ask robots to exclude files from the server
(ananke.affordablehost.com). However it _is_ accessible at the URL
http://porjes.com/robots.txt.

By the robots exclusion standard, it _is_ such a resource that is to be
used for restricting robot access to any URLs that begin with
http://porjes.com/ (and only them). Physical servers are irrelevant in
URL considerations.
 
Jukka K. Korpela

Headless said:
please clarify the following phrases:

_server root_

The address http://www.foo.example/ or the physical directory
corresponding to it, depending on whether you consider the situation
from the robot and client perspective or the author perspective.
Normal authors

The majority of Web authors who just create (and possibly maintain)
pages and try to avoid knowing about any server issues.
own server

A server controlled by the person in question.
Folk in this group typically host their websites on a shared
server. This presents no problems with regard to using a robots.txt
as long as they have their own domain or if the site has this type
of url: http://www.user.host.com

Folk in this group maybe (I have no statistics on this), but surely
most people who create pages just put them somewhere without owning a
domain.

In the situation you describe, though perhaps not with the particular
URL you mention (domain host.com exists, but subdomain user.host.com
doesn't [there's an implicit hint here, suggesting that sample URLs
should be flagged as such using .example]), the author has control over
the server root. So I was inexact in that "unless they run their own
server", in the sense that it need not be a separate HTTP server but
can be a server "only" from the viewpoint of everyone else.
The only situation that does present a problem is if the site has
this type of url: http://www.host.com/~user

In that particular case, it simply depends on
http://www.host.com/.htaccess, which does not currently exist.

But there is a _very_ common situation where an author has control over
a single page, or set of pages, like
http://www.foo.example/somestuff/...
where ... denotes an arbitrary string. If he creates
http://www.foo.example/somestuff/robots.txt
it won't affect normal indexing robots in the least (though it might
affect Atomz). He would need to talk to (e-mail address removed) to make
her modify http://www.foo.example/robots.txt. Or, more realistically,
he would just use <meta name="robots" ...> tags.
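
As a rough sketch of the meta-tag alternative mentioned above, here is how a robot might read a <meta name="robots"> tag from a page, using Python's standard html.parser module. The class and function names are made up for illustration, and the content value "noindex, nofollow" is the commonly documented one rather than anything quoted in this thread. How a given robot treats the tag is up to that robot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set[str]:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page that asks robots not to index it or follow its links.
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```

Unlike /robots.txt, the tag has to be placed on every page that should be kept out, which is why it suits an author who only controls a few pages.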
 
PeterMcC

David said:
I can't follow most of this thread, could you very simply, in
non-technical jargon, just confirm if robots.txt is any good or not!
If it helps, I own the domain
http://www.catalysys.co.uk
which is hosted by phpwebhosting.

As far as your implementation of the robots.txt file is concerned, it looks
to be the correct way to *ask* the spiders not to index the sefriendly
folder.

User-agent: *
Disallow: /sefriendly/

Most search engines seem to adhere to the rules but, as has been pointed
out, robots.txt doesn't present any barrier other than putting up a keep-out
sign.

If you don't have a link to the page from an already spidered site, your
sefriendly directory won't be found anyway - robots.txt or not.

And, if you really want to be safe, you could always password protect the
directory with .htaccess - dead easy and the spiders don't get past the
password protect.
 
Headless

Jukka K. Korpela said:
The address http://www.foo.example/ or the physical directory
corresponding to it, depending on whether you consider the situation
from the robot and client perspective or the author perspective.

"Server root" means something entirely different from a sysadmin angle.
I suggest using a different terminology to remove the ambiguity,
"(sub)domain root" seems more appropriate.
The majority of Web authors who just create (and possibly maintain)
pages and try to avoid knowing about any server issues.

Assuming that "The majority of Web authors" use
http://www.host.com/~user URLs is a very bold claim.

A server controlled by the person in question.

I don't control any "server", yet usage of robots.txt on my site is
fully valid, correct and functioning.
Folk in this group maybe (I have no statistics on this), but surely
most people who create pages just put them somewhere without owning a
domain.

Again there is a risk of ambiguity here. http://www.user.host.com should
be labeled as a "sub-domain"; it's not registered anywhere and it's not
portable, so you certainly cannot call it "owning a domain".
In that particular case, it simply depends on
http://www.host.com/.htaccess, which does not currently exist.

I don't see how the robots.txt convention relates to Apache .htaccess
files. Regardless of any .htaccess file anywhere,
http://www.host.com/~user would resolve to
http://www.host.com/robots.txt for compliant clients looking for a
robots.txt
http://www.foo.example/somestuff/robots.txt
it won't affect normal indexing robots in the least (though it might
affect Atomz).

You have not provided any evidence that Atomz does not follow the
correct procedure for retrieving a robots.txt. It works correctly on my
site because it should (all my sites use http://www.user.host.com URLs).


Headless
 
