robot.txt

Discussion in 'HTML' started by David Graham, Jun 28, 2003.

  1. David Graham

    David Graham Guest

    Hi
    I have a folder on my site that I use to practice on, and I don't want robots
    indexing this folder. I believe the meta tag is not as good as a robot.txt
    file. I would like to use a robot.txt file but...

    1. What is the syntax of the line that I write to prevent access to a folder?
    (The folder is called 'sefriendly' and it lives off the root folder, which is
    called 'www'.)

    2. In which folder is the robot.txt file stored?

    thanks

    David
     
    David Graham, Jun 28, 2003
    #1

  2. David Graham

    PeterMcC Guest

    David Graham wrote:
    > Hi
    > I have a folder on my site that I use to practice on, I don't want
    > robots indexing this folder. I believe the meta tag is not as good as
    > a robot.txt file. I would like to use a robot.txt file but...
    >
    > 1. What is the syntax of the line that I write to prevent access to a
    > folder (the folder is called 'sefriendly' and it lives off the root
    > folder which is called 'www'


    User-agent: *
    Disallow: /sefriendly/

    > 2. In which folder is the robot.txt file stored?

    in your root - in your case, www - folder

    There's lots of info, and a script that checks your robot.txt file, at:
    http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
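
    As a rough offline check of what those two lines ask for, here is a minimal
    sketch using Python's standard urllib.robotparser module. The host
    www.sample.example and the agent name ExampleBot are made up for
    illustration; only the two rule lines come from the answer above.

```python
from urllib.robotparser import RobotFileParser

# The two rule lines from the answer above.
RULES = """\
User-agent: *
Disallow: /sefriendly/
"""


def build_parser(rules_text):
    """Parse robots rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(RULES)

# Hypothetical URLs on a hypothetical host.
blocked = parser.can_fetch("ExampleBot", "http://www.sample.example/sefriendly/index.html")
allowed = parser.can_fetch("ExampleBot", "http://www.sample.example/index.html")

print(blocked)  # False: the path starts with /sefriendly/
print(allowed)  # True: no rule matches this path
```

    This only models the decision a well-behaved robot makes for itself. The
    rules are read and applied by the robot, not enforced by the server.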

    --
    PeterMcC
    If you feel that any of the above is incorrect,
    inappropriate or offensive in any way,
    please ignore it and accept my apologies.
     
    PeterMcC, Jun 28, 2003
    #2

  3. David Graham

    David Graham Guest

    "PeterMcC" <> wrote in message
    news:uweLa.44927$9.net...
    > David Graham wrote:
    > > Hi
    > > I have a folder on my site that I use to practice on, I don't want
    > > robots indexing this folder. I believe the meta tag is not as good as
    > > a robot.txt file. I would like to use a robot.txt file but...
    > >
    > > 1. What is the syntax of the line that I write to prevent access to a
    > > folder (the folder is called 'sefriendly' and it lives off the root
    > > folder which is called 'www'

    >
    > User-agent: *
    > Disallow: /sefriendly/
    >
    > > 2. In which folder is the robot.txt file stored?

    > in your root - in your case, www - folder
    >
    > There's lots of info at:
    > http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
    > And a script that checks your robot.txt file


    Thanks for the link

    David
     
    David Graham, Jun 28, 2003
    #3
  4. David Graham

    David Graham Guest

    "PeterMcC" <> wrote in message
    news:uweLa.44927$9.net...
    > David Graham wrote:
    > > Hi
    > > I have a folder on my site that I use to practice on, I don't want
    > > robots indexing this folder. I believe the meta tag is not as good as
    > > a robot.txt file. I would like to use a robot.txt file but...
    > >
    > > 1. What is the syntax of the line that I write to prevent access to a
    > > folder (the folder is called 'sefriendly' and it lives off the root
    > > folder which is called 'www'

    >
    > User-agent: *
    > Disallow: /sefriendly/
    >


    I put the robot.txt file into the www folder containing the two lines above
    (exactly as you indicate, i.e. on two lines) but I can still visit the site
    using IE6. I thought those two lines banned access from all UAs. I have
    cleared out my browser's cache in case that was what I was viewing, but that
    made no difference. I will read up on this subject, but could you point out
    where my thinking is a bit off here? Does the robot.txt file just ban spiders
    and not browsers?

    TIA
    David
     
    David Graham, Jun 28, 2003
    #4
  5. David Graham

    PeterMcC Guest

    David Graham wrote:
    > "PeterMcC" <> wrote in message
    > news:uweLa.44927$9.net...
    >> David Graham wrote:
    >>> Hi
    >>> I have a folder on my site that I use to practice on, I don't want
    >>> robots indexing this folder. I believe the meta tag is not as good
    >>> as a robot.txt file. I would like to use a robot.txt file but...
    >>>
    >>> 1. What is the syntax of the line that I write to prevent access to
    >>> a folder (the folder is called 'sefriendly' and it lives off the
    >>> root folder which is called 'www'

    >>
    >> User-agent: *
    >> Disallow: /sefriendly/
    >>

    >
    > I put the robot.txt file into the www folder containing the two lines
    > above (exactly as you indicate i.e. on two lines) but I can still
    > visit the site using IE6. I thought those two lines ban access from
    > all UA's. I have cleared out my browsers cache in case that was what
    > I was viewing, but that made no difference. I will read up on this
    > subject, but could you point out were my thinking is a bit off here.
    > Does the robot.txt file just ban spiders and not browsers?


    Just spiders.

    --
    PeterMcC
    If you feel that any of the above is incorrect,
    inappropriate or offensive in any way,
    please ignore it and accept my apologies.
     
    PeterMcC, Jun 28, 2003
    #5
  6. David Graham

    PeterMcC Guest

    PeterMcC wrote:
    > David Graham wrote:

    <snip>
    >> I put the robot.txt file into the www folder containing the two lines
    >> above (exactly as you indicate i.e. on two lines) but I can still
    >> visit the site using IE6. I thought those two lines ban access from
    >> all UA's. I have cleared out my browsers cache in case that was what
    >> I was viewing, but that made no difference. I will read up on this
    >> subject, but could you point out were my thinking is a bit off here.
    >> Does the robot.txt file just ban spiders and not browsers?

    >
    > Just spiders.


    BTW - if you don't have a link to a page, it won't get spidered because the
    spider only follows links.

    If you want to have links to the page but don't want it spidered or seen
    by others, use .htaccess to password protect the directory that holds the
    page.

    HTH
    --
    PeterMcC
    If you feel that any of the above is incorrect,
    inappropriate or offensive in any way,
    please ignore it and accept my apologies.
     
    PeterMcC, Jun 28, 2003
    #6
  7. In article <U%gLa.1981$>,
    says...
    >
    > "PeterMcC" <> wrote in message
    > news:uweLa.44927$9.net...
    > > David Graham wrote:


    > > > I have a folder on my site that I use to practice on, I don't want
    > > > robots indexing this folder. I believe the meta tag is not as good as
    > > > a robot.txt file. I would like to use a robot.txt file but...

    ....
    > > User-agent: *
    > > Disallow: /sefriendly/
    > >

    ....
    > Does the robot.txt file just ban spiders
    > and not browsers?
    >

    Correct.
     
    Jacqui or (maybe) Pete, Jun 28, 2003
    #7
  8. David Graham

    David Graham Guest

    "PeterMcC" <> wrote in message
    news:SBhLa.44961$9.net...
    > PeterMcC wrote:
    > > David Graham wrote:

    > <snip>
    > >> I put the robot.txt file into the www folder containing the two lines
    > >> above (exactly as you indicate i.e. on two lines) but I can still
    > >> visit the site using IE6. I thought those two lines ban access from
    > >> all UA's. I have cleared out my browsers cache in case that was what
    > >> I was viewing, but that made no difference. I will read up on this
    > >> subject, but could you point out were my thinking is a bit off here.
    > >> Does the robot.txt file just ban spiders and not browsers?

    > >
    > > Just spiders.

    >
    > BTW - if you don't have a link to a page, it won't get spidered because
    > the spider only follows links.
    >
    > If you want to have links to the page but don't want it spidering or
    > seeing by others, use .htaccess to password protect the directory that
    > holds the page.
    >
    > HTH
    > --
    > PeterMcC
    > If you feel that any of the above is incorrect,
    > inappropriate or offensive in any way,
    > please ignore it and accept my apologies.


    Thanks for the help. I have one more question. Google indexed one of my
    practice sites before I had a chance to use a robot.txt file. Do you know
    how long it will be before Google deletes the cached version of this site,
    which I never intended to be indexed? The reason I ask is that the
    unwanted site is competing in the search results with the site which I want
    to be indexed (the unwanted site is doing better than the wanted site; I
    have not yet got round to making my main site more optimised for search
    engines).

    TIA
    David
     
    David Graham, Jun 28, 2003
    #8
  9. David Graham

    Denise Enck Guest

    "David Graham" <> wrote in message
    news:n6eLa.339$...
    > Hi
    > I have a folder on my site that I use to practice on, I don't want robots
    > indexing this folder. I believe the meta tag is not as good as a robot.txt
    > file. I would like to use a robot.txt file but...
    >
    > 1. What is the syntax of the line that I write to prevent access to a
    > folder (the folder is called 'sefriendly' and it lives off the root folder
    > which is called 'www'
    >
    > 2. In which folder is the robot.txt file stored?
    >
    > thanks
    >
    > David
    >



    the file should be called robots.txt rather than robot.txt else it won't
    keep any spiders out ~

    Denise
     
    Denise Enck, Jun 28, 2003
    #9
  10. David Graham

    David Graham Guest

    "Denise Enck" <> wrote in message
    news:tQiLa.69023$...
    > "David Graham" <> wrote in message
    > news:n6eLa.339$...
    > > Hi
    > > I have a folder on my site that I use to practice on, I don't want
    > > robots indexing this folder. I believe the meta tag is not as good as a
    > > robot.txt file. I would like to use a robot.txt file but...
    > >
    > > 1. What is the syntax of the line that I write to prevent access to a
    > > folder (the folder is called 'sefriendly' and it lives off the root
    > > folder which is called 'www'
    > >
    > > 2. In which folder is the robot.txt file stored?
    > >
    > > thanks
    > >
    > > David
    > >

    >
    >
    > the file should be called robots.txt rather than robot.txt else it won't
    > keep any spiders out ~
    >
    > Denise
    >

    Thanks loads - didn't know it had to have the 's' on the name

    David
     
    David Graham, Jun 28, 2003
    #10
  11. David Graham

    PeterMcC Guest

    David Graham wrote:
    > "Denise Enck" <> wrote in message
    > news:tQiLa.69023$...
    >> "David Graham" <> wrote in message
    >> news:n6eLa.339$...
    >>> Hi
    >>> I have a folder on my site that I use to practice on, I don't want
    >>> robots indexing this folder. I believe the meta tag is not as good
    >>> as a robot.txt file. I would like to use a robot.txt file but...
    >>>
    >>> 1. What is the syntax of the line that I write to prevent access to
    >>> a folder (the folder is called 'sefriendly' and it lives off the
    >>> root folder which is called 'www'
    >>>
    >>> 2. In which folder is the robot.txt file stored?
    >>>
    >>> thanks
    >>>
    >>> David
    >>>

    >>
    >>
    >> the file should be called robots.txt rather than robot.txt else it
    >> won't keep any spiders out ~
    >>
    >> Denise
    >>

    > Thanks loads - didn't know it had to have the the 's' on the name


    Ooops - picked up the "robot.txt" from the OP and it didn't register.
    Thanks, Denise.

    --
    PeterMcC
    If you feel that any of the above is incorrect,
    inappropriate or offensive in any way,
    please ignore it and accept my apologies.
     
    PeterMcC, Jun 28, 2003
    #11
  12. Headless <> wrote:

    > "Jukka K. Korpela" <> wrote:
    >
    >>> the file should be called robots.txt rather than robot.txt else
    >>> it won't keep any spiders out ~

    >>
    >>Besides, it needs to reside in the _server root_. Normal authors
    >>have no access to it, unless they run their own server.

    >
    > That would be silly and it would make the concept practically
    > unusable.


    _What_ would be silly? The robots.txt concept _is_ defined the way I
    described, both in the HTML specification I referred to and in the
    "Robots Exclusion Standard".

    > I'm on a bog standard shared Apache user web space provided with my
    > dial account (so virtual root). Using a robots.txt works fine (I
    > can see that it works because I use Atomz site search on one of my
    > sites, it echos back the robots.txt exclusions as it indexes the
    > site).


    What you see is what the Atomz software does. Everyone and his dog or
    search system may use a name like robots.txt, or robot.txt, or
    foo.bar for some private purposes. But that's _not_ what the Robots
    Exclusion Standard for the World Wide Web means.

    Don't get lured by statements of compliance. On the average, any
    statement about complying with some standard is bogus.

    If Atomz actually uses robots.txt other than at the server root, then
    http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
    mildly. It says: "Yes, Atomz Search is compliant with the Robots
    Exclusion Protocol and it will examine the robots.txt file if it is
    present on your site." and refers to common resources on that
    protocol/standard. And those resources make it clear that robots.txt is
    _server-wide_, residing at address /robots.txt. In particular,
    http://www.robotstxt.org/wc/faq.html#noindex
    says:
    "What if I can't make a /robots.txt file?
    Sometimes you cannot make a /robots.txt file, because you don't
    administer the entire server. All is not lost: there is a new standard
    for using HTML META tags to keep robots out of your documents. - -"

    (Of course, "sometimes" and "new" are somewhat funny words in this
    context.)

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 29, 2003
    #12
  13. David Graham

    Headless Guest

    "Jukka K. Korpela" <> wrote:

    >>>Besides, it needs to reside in the _server root_. Normal authors
    >>>have no access to it, unless they run their own server.

    >>
    >> That would be silly and it would make the concept practically
    >> unusable.

    >
    >_What_ would be silly? The robots.txt concept _is_ defined the way I
    >described, both in the HTML specification I referred to and in the
    >"Robots Exclusion Standard".


    Afaics you read too much into references to "/" and "only a server
    administrator can maintain such a list". "/" refers to the root of my
    web space, and I am the "server administrator" (virtually ;-).

    Afaik there is no way for a robot to access the physical server root (as
    opposed to the virtual server root).


    Headless
     
    Headless, Jun 29, 2003
    #13
  14. David Graham

    David Graham Guest

    "Jukka K. Korpela" <> wrote in message
    news:Xns93A962B0A5029jkorpelacstutfi@193.229.0.31...
    > Headless <> wrote:
    >
    > > "Jukka K. Korpela" <> wrote:
    > >
    > >>> the file should be called robots.txt rather than robot.txt else
    > >>> it won't keep any spiders out ~
    > >>
    > >>Besides, it needs to reside in the _server root_. Normal authors
    > >>have no access to it, unless they run their own server.

    > >
    > > That would be silly and it would make the concept practically
    > > unusable.

    >
    > _What_ would be silly? The robots.txt concept _is_ defined the way I
    > described, both in the HTML specification I referred to and in the
    > "Robots Exclusion Standard".
    >
    > > I'm on a bog standard shared Apache user web space provided with my
    > > dial account (so virtual root). Using a robots.txt works fine (I
    > > can see that it works because I use Atomz site search on one of my
    > > sites, it echos back the robots.txt exclusions as it indexes the
    > > site).

    >
    > What you see is what the Atomz software does. Everyone and his dog or
    > search system may use a name like robots.txt, or robot.txt, or
    > foo.bar for some private purposes. But that's _not_ what the Robots
    > Exclusion Standard for the World Wide Web means.
    >
    > Don't get lured by statements of compliance. On the average, any
    > statement about complying with some standard is bogus.
    >
    > If Atomz actually uses robots.txt other than at the server root, then
    > http://www.atomz.com/search/faqs.htm#189 is misleading, to put it
    > mildly. It says: "Yes, Atomz Search is compliant with the Robots
    > Exclusion Protocol and it will examine the robots.txt file if it is
    > present on your site." and refers to common resources on that
    > protocol/standard. And those resources make it clear that robots.txt is
    > _server-wide_, residing at address /robots.txt. In particular,
    > http://www.robotstxt.org/wc/faq.html#noindex
    > says:
    > "What if I can't make a /robots.txt file?
    > Sometimes you cannot make a /robots.txt file, because you don't
    > administer the entire server. All is not lost: there is a new standard
    > for using HTML META tags to keep robots out of your documents. - -"
    >
    > (Of course, "sometimes" and "new" are somewhat funny words in this
    > context.)
    >
    > --
    > Yucca, http://www.cs.tut.fi/~jkorpela/
    > Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html


    Yucca has my respect, his answers are good, but Headless is no dummy either.
    Has Headless conceded defeat on this one? Anyway, I will be adding the meta
    tag exclusion thing to every page. Thanks to everyone who helped.
    David
     
    David Graham, Jun 29, 2003
    #14
  15. Headless <> wrote:

    > Afaics you read to much into references to "/" and " only a server
    > administrator can maintain such a list". "/" refers to the root of
    > my web space, and I am the "server administrator" (virtually ;-).


    No, I don't. The meaning of a URL that begins with "/" is well-defined
    in URL specifications, and this part of the specs is honored by all
    relevant parties. The meaning of "/robots.txt" only depends on the
    server part of the base address, and the meaning is
    http://www.sample.example/robots.txt
    where www.sample.example is the server part of the base address.
    There's no vagueness here. Ref.: RFC 2396.

    And the Robots Exclusion Standard defines that URL only as the
    residence of the file for exclusion specifications.
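
    The resolution rule described above can be sketched with Python's standard
    urllib.parse.urljoin, which applies the usual relative-reference rules. The
    page URLs below are hypothetical and only www.sample.example comes from the
    post; this is an illustration of the rule, not how any particular robot is
    implemented.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Resolve the reference "/robots.txt" against the URL of a page.

    A reference that begins with "/" keeps the scheme and server part of
    the base URL and replaces the whole path.
    """
    return urljoin(page_url, "/robots.txt")


# Hypothetical pages at different depths on the same server.
top_level = robots_url("http://www.sample.example/index.html")
nested = robots_url("http://www.sample.example/somedir/deeper/page.html")

print(top_level)  # http://www.sample.example/robots.txt
print(nested)     # http://www.sample.example/robots.txt
```

    However deep the page sits, the result depends only on the server part of
    its address, which is the point being made.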

    > Afaik there is no way for a robot to access the physical server
    > root (as opposed to the virtual server root).


    The only thing that a robot, or a browser for that matter, knows and
    cares about is that it sends a request for
    http://www.sample.example/robots.txt
    How the server www.sample.example processes it is its business. For all
    that robots (or browsers) can know, the server might pick up file
    vdsdghuigae.fig from folder yhftgy\dahjks\fhgj, transmogrify its
    content, and send back the result. Or it might run a server-side script
    to generate something. Or it might connect to typing machines operated
    by chimpanzees and record and send back what they are currently
    producing.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 29, 2003
    #15
  16. David Graham

    David Graham Guest

    "lostinspace" <> wrote in message
    news:UNALa.3268$...
    > ----- Original Message -----
    > From: David Graham <>
    > Newsgroups: alt.html
    > Sent: Saturday, June 28, 2003 6:23 AM
    > Subject: robot.txt
    >
    >
    > > Hi
    > > I have a folder on my site that I use to practice on, I don't want
    > > robots indexing this folder. I believe the meta tag is not as good as a
    > > robot.txt file. I would like to use a robot.txt file but...
    > >
    > > 1. What is the syntax of the line that I write to prevent access to a
    > > folder (the folder is called 'sefriendly' and it lives off the root
    > > folder which is called 'www'
    > >
    > > 2. In which folder is the robot.txt file stored?
    > >
    > > thanks
    > >
    > > David
    > >
    > >

    >
    > David,
    > Perhaps it's just an off day for most folks?
    > I've seen some very knowledgeable folks here provide incomplete information.
    >
    > Robots.txt will NOT ban any robot.
    > Instead, it is a "suggestion" to honorable bots to comply.
    > Most dishonorable bots won't read your robots.txt anyway. Any path in
    > there will only point them towards the possibly hidden and unprotected
    > direction.
    > Jdmorgan has some extensive suggestions on robots:
    > http://www.webmasterworld.com/forum23/2200.htm
    >
    > On the other hand, if you're interested in banning and denying admission of
    > bots, then in most instances that requires the use of htaccess.
    > See the "Close to Perfect Ban", a very long thread:
    > http://www.webmasterworld.com/forum13/687.htm?highlight=perfect ban
    >

    Thanks, I will read the links. I thought this robots.txt post would just be
    a simple little matter - perhaps not!
    thanks
    David
     
    David Graham, Jun 29, 2003
    #16
  17. Jacqui or (maybe) Pete <> wrote:

    > The spec at http://www.robotstxt.org isn't exactly clear on
    > anything
    >
    > http://www.robotstxt.org/wc/norobots.html#method says:
    >
    > 'The method used to exclude robots from a server is to create a
    > file on
    > the server which specifies an access policy for robots. This file
    > must be accessible via HTTP on the local URL "/robots.txt".'


    The only thing that isn't quite clear IMHO is why they call it "local
    URL" when they apparently mean _relative_ URL, which _must_ be globally
    accessible of course. But URL terminology is generally confused, and
    the intentions are clear.

    > Now what does that mean? Take porjes.com/robots.txt [1]. Its
    > intention is *not* to ask robots to exclude files from the server
    > (ananke.affordablehost.com). However it _is_ accessible at the URL
    > http://porjes.com/robots.txt.


    By the robots exclusion standard, it _is_ such a resource that is to be
    used for restricting robot access to any URLs that begin with
    http://porjes.com/ (and only them). Physical servers are irrelevant in
    URL considerations.
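
    The scoping described here (one robots.txt governs the URLs that begin
    with the same scheme and host, and no others) can be sketched as follows.
    The function name and the page paths are hypothetical, and comparing only
    scheme and host is a simplifying assumption taken from the statement
    above; the two host names are the ones mentioned in the quoted post.

```python
from urllib.parse import urlsplit


def same_robots_scope(robots_txt_url, target_url):
    """Return True if target_url shares scheme and host with robots_txt_url.

    Which physical machine answers for each host name is not consulted,
    only what is written in the URLs.
    """
    robots = urlsplit(robots_txt_url)
    target = urlsplit(target_url)
    return (robots.scheme, robots.netloc.lower()) == (target.scheme, target.netloc.lower())


ROBOTS = "http://porjes.com/robots.txt"

# Hypothetical page paths on the two host names from the post.
in_scope = same_robots_scope(ROBOTS, "http://porjes.com/some/page.html")
out_of_scope = same_robots_scope(ROBOTS, "http://ananke.affordablehost.com/some/page.html")

print(in_scope)      # True
print(out_of_scope)  # False
```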

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 29, 2003
    #17
  18. Headless <> wrote:

    > please clarify the following phrases:
    >
    > _server root_


    The address http://www.foo.example/ or the physical directory
    corresponding to it, depending on whether you consider the situation
    from the robot and client perspective or the author perspective.

    > Normal authors


    The majority of Web authors who just create (and possibly maintain)
    pages and try to avoid knowing about any server issues.

    > own server


    A server controlled by the person in question.

    > Folk in this group typically host their websites on a shared
    > server. This presents no problems with regard to using a robots.txt
    > as long as they have their own domain or if the site has this type
    > of url: http://www.user.host.com


    Folk in this group maybe (I have no statistics on this), but surely
    most people who create pages just put them somewhere without owning a
    domain.

    In the situation you describe, though perhaps not with the particular
    URL you mention (domain host.com exists, but subdomain user.host.com
    doesn't [there's an implicit hint here, suggesting that sample URLs
    should be flagged as such using .example]), the author has control over
    the server root. So I was inexact in that "unless they run their own
    server", in the sense that it need not be a separate HTTP server but
    can be a server "only" from the viewpoint of everyone else.

    > The only situation that does present a problem is if the site has
    > this type of url: http://www.host.com/~user


    In that particular case, it simply depends on
    http://www.host.com/.htaccess, which does not currently exist.

    But there is a _very_ common situation where an author has control over
    a single page, or set of pages, like
    http://www.foo.example/somestuff/...
    where ... denotes an arbitrary string. If he creates
    http://www.foo.example/somestuff/robots.txt
    it won't affect normal indexing robots in the least (though it might
    affect Atomz). He would need to talk to ple to make
    her modify http://www.foo.example/robots.txt. Or, more realistically,
    he would just use <meta name="robots" ...> tags.
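
    For the per-page alternative, here is a minimal sketch of how a robot
    might read such a tag, using Python's standard html.parser. The sample
    page and its content value "noindex, nofollow" are assumptions chosen for
    illustration (the post leaves the attribute value elided), and the class
    and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


# Hypothetical practice page that asks not to be indexed or followed.
PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "<title>Practice page</title></head><body></body></html>"
)

directives = robots_directives(PAGE)
print(sorted(directives))  # ['nofollow', 'noindex']
```

    Like robots.txt, the tag is a request that each robot chooses to honour;
    unlike robots.txt, it needs no access to the server root.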

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 30, 2003
    #18
  19. David Graham

    PeterMcC Guest

    David Graham wrote:
    > "Jukka K. Korpela" <> wrote in message
    > news:Xns93A9A09F49E7Ajkorpelacstutfi@193.229.0.31...
    >> Jacqui or (maybe) Pete <> wrote:
    >>
    >>> The spec at http://www.robotstxt.org isn't exactly clear on
    >>> anything

    >
    > I can't follow most of this thread, could you very simply, in
    > non-technical jargon, just confirm if robots.txt is any good or not!
    > If it helps, I own the domain
    > http://www.catalysys.co.uk
    > which is hosted by phpwebhosting.


    As far as your implementation of the robots.txt file is concerned, it looks
    to be the correct way to *ask* the spiders not to index the sefriendly
    folder.

    User-agent: *
    Disallow: /sefriendly/

    Most search engines seem to adhere to the rules but, as has been pointed
    out, robots.txt doesn't present any barrier other than putting up a keep-out
    sign.

    If you don't have a link to the page from an already spidered site, your
    sefriendly directory won't be found anyway - robots.txt or not.

    And, if you really want to be safe, you could always password protect the
    directory with .htaccess - dead easy and the spiders don't get past the
    password protect.

    --
    PeterMcC
    If you feel that any of the above is incorrect,
    inappropriate or offensive in any way,
    please ignore it and accept my apologies.
     
    PeterMcC, Jun 30, 2003
    #19
  20. David Graham

    Headless Guest

    "Jukka K. Korpela" <> wrote:

    >> please clarify the following phrases:
    >>
    >> _server root_

    >
    >The address http://www.foo.example/ or the physical directory
    >corresponding to it, depending on whether you consider the situation
    >from the robot and client perspective or the author perspective.


    "Server root" means something entirely different from a sysadmin angle.
    I suggest using a different terminology to remove the ambiguity,
    "(sub)domain root" seems more appropriate.

    >> Normal authors

    >
    >The majority of Web authors who just create (and possibly maintain)
    >pages and try to avoid knowing about any server issues.


    Assuming that "The majority of Web authors" use
    http://www.host.com/~user url's is a very bold claim.

    >> own server

    >
    >An server controlled by the person in question.


    I don't control any "server", yet usage of robots.txt on my site is
    fully valid, correct and functioning.

    >> Folk in this group typically host their websites on a shared
    >> server. This presents no problems with regard to using a robots.txt
    >> as long as they have their own domain or if the site has this type
    >> of url: http://www.user.host.com

    >
    >Folk in this group maybe (I have no statistics on this), but surely
    >most people who create pages just put them somewhere without owning a
    >domain.


    Again there is a risk of ambiguity here, http://www.user.host.com should
    be labeled as a "sub-domain", it's not registered anywhere and it's not
    portable, so you certainly can not call it "owning a domain".

    >> The only situation that does present a problem is if the site has
    >> this type of url: http://www.host.com/~user

    >
    >In that particular case, it simply depends on
    >http://www.host.com/.htaccess, which does not currently exist.


    I don't see how the robots.txt convention relates to Apache .htaccess
    files. Regardless of any .htaccess file anywhere,
    http://www.host.com/~user would resolve to
    http://www.host.com/robots.txt for compliant clients looking for a
    robots.txt

    >http://www.foo.example/somestuff/robots.txt
    >it won't affect normal indexing robots the least (though it might
    >affect Atomz).


    You have not provided any evidence that Atomz does not follow the
    correct procedure for retrieving a robots.txt. It works correctly on my
    site because it should (all my sites use http://www.user.host.com urls).


    Headless
     
    Headless, Jun 30, 2003
    #20