Stopping robots searching a particular page

Discussion in 'HTML' started by dorayme, Sep 11, 2007.

  1. dorayme

    dorayme Guest

    A website is on a server. Just one or two of the pages are not
    for public consumption. They are not top secret and no big harm
    would be done if it was not 100% possible, but it would be best
    if they did not come up in search engines. (A sort of provision
    by a company for making some files available to those who have
    the address. Company does not want password protection; but I am
    considering persuading them).

    What is the simplest and most effective way of stopping robots
    searching particular HTML pages on a server? Am looking for an
    actual example and clear instructions. Getting confused by
    looking at http://www.searchtools.com/index.html though doubtless
    I will get less confused after much study.

    --
    dorayme
    dorayme, Sep 11, 2007
    #1

  2. Scripsit dorayme:

    > What is the simplest and most effective way of stopping robots
    > searching particular HTML pages on a server?


    Put the following into the head part of each of those pages:

    <meta name="robots" content="noindex">

    Replace "noindex" by "noindex, nofollow" if you also want to stop robots
    from following any links on the page (i.e. from finding new indexable pages
    through it).
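
    For example, the combined form would be:

    <meta name="robots" content="noindex, nofollow">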

    This follows the de-facto standard (Robots Exclusion Standard) that has long
    been obeyed by any well-behaving indexing robots. And there's not much you
    can do to the ill-behaving robots.

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Sep 11, 2007
    #2

  3. Tina Peters

    Tina Peters Guest

    "dorayme" <> wrote in message
    news:...
    >A website is on a server. Just one or two of the pages are not
    > for public consumption. They are not top secret and no big harm
    > would be done if it was not 100% possible, but it would be best
    > if they did not come up in search engines.


    If it's not linked to any other webpage, in any way, it shouldn't be
    spidered.

    --Tina
    --
    AxisHOST.com - cPanel Hosting
    BuyAVPS.com - VPS Accounts
    Serving the web since 1997
    Tina Peters, Sep 11, 2007
    #3
  4. Scripsit Tina Peters:

    > If it's not linked to any other webpage, in any way, it shouldn't be
    > spidered.


    Yet it may be spidered. Actually, it would be an interesting exercise in a
    course on web issues to ask the students to list 10 possible situations
    where the page might be spidered.

    And to make the task a little more difficult, let's exclude the perhaps most
    obvious scenario: someone who knows the page address submits it to a search
    engine via its "Add URL" form.

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Sep 11, 2007
    #4
  5. dorayme

    dorayme Guest

    In article <2sEFi.222058$>,
    "Jukka K. Korpela" <> wrote:

    > Scripsit dorayme:
    >
    > > What is the simplest and most effective way of stopping robots
    > > searching particular HTML pages on a server?

    >
    > Put the following into the head part of each of those pages:
    >
    > <meta name="robots" content="noindex">
    >
    > Replace "noindex" by "noindex, nofollow" if you also want to stop robots
    > from following any links on the page (i.e. from finding new indexable pages
    > through it).
    >
    > This follows the de-facto standard (Robots Exclusion Standard) that has long
    > been obeyed by any well-behaving indexing robots. And there's not much you
    > can do to the ill-behaving robots.


    Thank you. This is the level of exclusion that I want. Job done.

    --
    dorayme
    dorayme, Sep 12, 2007
    #5
  6. dorayme <> writes:

    > What is the simplest and most effective way of stopping robots
    > searching particular HTML pages on a server?


    There are two popular "standards" (neither of which is a standard in
    the formal sense). One uses <meta ...> elements in your HTML, and the
    other uses separate robots.txt files. Both are described here:

    <http://www.robotstxt.org/>
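
    As a rough robots.txt sketch (the /private/... paths are made up here;
    use whatever the real pages are), something like this placed at the
    root of the site, i.e. at /robots.txt, asks all robots to keep out of
    those pages:

    User-agent: *
    Disallow: /private/example.html
    Disallow: /private/other.html

    Bear in mind that robots.txt is itself publicly readable, so it
    advertises the very URLs you are trying to keep quiet.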

    Both approaches depend on cooperative robots. For uncooperative robots,
    all you can do is shout "klaatu barada nikto" and hope for the best.

    sherm--

    --
    Web Hosting by West Virginians, for West Virginians: http://wv-www.net
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Sherm Pendley, Sep 12, 2007
    #6
  7. dorayme

    dorayme Guest

    In article <>,
    Sherm Pendley <> wrote:

    > dorayme <> writes:
    >
    > > What is the simplest and most effective way of stopping robots
    > > searching particular HTML pages on a server?

    >
    > There are two popular "standards" (neither of which is a standard in
    > the formal sense). One uses <meta ...> elements in your HTML, and the
    > other uses separate robots.txt files. Both are described here:
    >
    > <http://www.robotstxt.org/>
    >
    > Both approaches depend on cooperative robots. For uncooperative robots,
    > all you can do is shout "klaatu barada nikto" and hope for the best.
    >


    Thanks. If I get any reports of the pages concerned being found
    now that I have gone the meta route, I will look further into the
    robots.txt approach.

    (Actually, sherm, I started reading about this before posting my
    question, got restless and slightly confused, and thought: I know
    what to do, I will pop my head above the trench line a mo and see
    if something comes back from alt.html to make this thing stop
    buzzing around my brain. I know, it was a bit reckless. But who
    dares... you know... <g>

    I also have a search engine on the particular site concerned and
    they have various masking procedures I have since looked into.)

    --
    dorayme
    dorayme, Sep 12, 2007
    #7
  8. On Sep 11, 6:47 pm, "Jukka K. Korpela" <> wrote:
    > > If it's not linked to any other webpage, in any way, it shouldn't be
    > > spidered.

    > Yet it may be spidered. Actually, it would be an interesting exercise in a
    > course on web issues to ask the students to list 10 possible situations
    > where the page might be spidered.


    Well, that's just a stupid assignment. The students might actually be
    forced to learn something from it. What the heck is your problem,
    suggesting something where a student could learn...
    Travis Newbury, Sep 12, 2007
    #8
  9. Ben C

    Ben C Guest

    On 2007-09-11, Jukka K. Korpela <> wrote:
    > Scripsit Tina Peters:
    >
    >> If it's not linked to any other webpage, in any way, it shouldn't be
    >> spidered.

    >
    > Yet it may be spidered. Actually, it would be an interesting exercise in a
    > course on web issues to ask the students to list 10 possible situations
    > where the page might be spidered.
    >
    > And to make the task a little more difficult, let's exclude the perhaps most
    > obvious scenario: someone who knows the page address submits it to a search
    > engine via its "Add URL" form.


    1. Someone posts the URL to a newsgroup.
    2. You forget to turn off the webserver's AutoIndex or similar, so the
    spider can just navigate its way to the URL by going through
    auto-generated directory indexes.

    What are the other 8?
    Ben C, Sep 12, 2007
    #9
  10. Scripsit Ben C:

    > 1. Someone posts the URL to a newsgroup.
    > 2. You forget to turn off the webserver's AutoIndex or similar, so the
    > spider can just navigate its way to the URL by going through
    > auto-generated directory indexes.
    >
    > What are the other 8?


    To mention some other scenarios of having a page indexed without having been
    linked to from any other web page*), here's one relatively obvious one and
    one imaginary though realistic (we know such things are being done with
    email addresses for spamming purposes):

    3. The page _was_ linked to from another page.

    4. An indexing robot generates URLs automatically, more or less at random,
    and tries them. It might for example try servers known to exist and append
    to the server name some strings that are known to be common for web pages,
    like /help.htm, /news.html....

    *) Of course an author cannot prevent linking by others. You tell the URL to
    your friend, who tells it to his pal, who sets up a link. But this common
    way of getting indexed against your will falls outside the current exercise.

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Sep 12, 2007
    #10
  11. Dylan Parry

    Dylan Parry Guest

    Jukka K. Korpela wrote:

    >> 1. Someone posts the URL to a newsgroup.
    >> 2. You forget to turn off the webserver's AutoIndex or similar, so the
    >> spider can just navigate its way to the URL by going through
    >> auto-generated directory indexes.
    >>

    > 3. The page _was_ linked to from another page.
    >
    > 4. An indexing robot generates URLs automatically, more or less at random,
    > and tries them. It might for example try servers known to exist and append
    > to the server name some strings that are known to be common for web pages,
    > like /help.htm, /news.html....


    5. Someone visits your page[1] and has the Google Toolbar (or other
    similar things) installed and reporting back to Google about the sites
    they are visiting, thus allowing Google to add the site to their index.

    ____
    [1] How they got the URL in the first place might be an issue here, but
    it could be that you personally gave it to them or that it was written
    down somewhere that wasn't necessarily an online resource (business card
    etc).

    --
    Dylan Parry
    http://electricfreedom.org | http://webpageworkshop.co.uk

    The opinions stated above are not necessarily representative of
    those of my cats. All opinions expressed are entirely your own.
    Dylan Parry, Sep 12, 2007
    #11
  12. On Sep 12, 5:55 am, Ben C <> wrote:

    >
    > >

    > 2. You forget to turn off the webserver's AutoIndex or similar, so the
    > spider can just navigate its way to the URL by going through
    > auto-generated directory indexes.



    At least one robot does this. I have a template page (definitely not
    mentioned anywhere else) in a subdirectory that seems to get spidered
    by the Yahoo Slurp robot.
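
    If the server happens to be Apache (just an assumption; other servers
    have their own switches), those auto-generated listings can be turned
    off with a one-line .htaccess entry in the directory:

    # disable auto-generated directory listings
    Options -Indexes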

    Nick

    --
    Nick Theodorakis
    Nick Theodorakis, Sep 12, 2007
    #12
  13. Gazing into my crystal ball I observed dorayme
    <> writing in news:doraymeRidThis-
    :

    > A website is on a server. Just one or two of the pages are not
    > for public consumption. They are not top secret and no big harm
    > would be done if it was not 100% possible, but it would be best
    > if they did not come up in search engines. (A sort of provision
    > by a company for making some files available to those who have
    > the address. Company does not want password protection; but I am
    > considering persuading them).
    >
    > What is the simplest and most effective way of stopping robots
    > searching particular HTML pages on a server? Am looking for an
    > actual example and clear instructions. Getting confused by
    > looking at http://www.searchtools.com/index.html though doubtless
    > I will get less confused after much study.
    >


    1. Robots exclusion (robots.txt), where you can name a particular file,
    e.g. backoffice.asp (see the sketch below)
    2. Meta route (in my experience, not quite as reliable as the first)
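
    For example, assuming backoffice.asp sits at the site root (its actual
    location is an assumption here), the robots.txt entry would look
    something like:

    User-agent: *
    Disallow: /backoffice.asp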

    --
    Adrienne Boswell at Home
    Arbpen Web Site Design Services
    http://www.cavalcade-of-coding.info
    Please respond to the group so others can share
    Adrienne Boswell, Sep 12, 2007
    #13
  14. Ed Mullen

    Ed Mullen Guest

    dorayme wrote:
    > A website is on a server. Just one or two of the pages are not
    > for public consumption. They are not top secret and no big harm
    > would be done if it was not 100% possible, but it would be best
    > if they did not come up in search engines. (A sort of provision
    > by a company for making some files available to those who have
    > the address. Company does not want password protection; but I am
    > considering persuading them).
    >
    > What is the simplest and most effective way of stopping robots
    > searching particular HTML pages on a server? Am looking for an
    > actual example and clear instructions. Getting confused by
    > looking at http://www.searchtools.com/index.html though doubtless
    > I will get less confused after much study.
    >


    Why not just put it in a password-protected directory?
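
    On Apache, for instance (an assumption about the server), protecting a
    directory is only a few lines of .htaccess plus an htpasswd file,
    roughly:

    AuthType Basic
    # the label shown in the login prompt
    AuthName "Company files"
    # hypothetical path; point it at wherever the htpasswd file lives
    AuthUserFile /home/example/.htpasswd
    Require valid-user

    The .htpasswd file itself is created with Apache's htpasswd utility.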

    --
    Ed Mullen
    http://edmullen.net
    http://mozilla.edmullen.net
    http://abington.edmullen.net
    I used to be schizophrenic, but we're all right now.
    Ed Mullen, Sep 12, 2007
    #14
  15. dorayme

    dorayme Guest

    In article <>,
    Ed Mullen <> wrote:

    > it would be best
    > > if they did not come up in search engines. (A sort of provision
    > > by a company for making some files available to those who have
    > > the address. Company does not want password protection; but I am
    > > considering persuading them).
    > >
    > > What is the simplest and most effective way of stopping robots
    > > searching particular HTML pages on a server?
    > >

    >
    > Why not just put it in a password-protected directory?


    I guess because it puts up a hurdle for the company and the
    particular companies to which they need to communicate this
    address. People forget passwords and it is extra work to be
    transmitting password information. I understand the reluctance on
    this occasion. But see above.

    [I am working on a psychologically based scheme at the moment,
    Ed, in consultation with my psychologist, to make pages that have
    a level of natural repugnance. The level must be such that people
    with no real need or interest in the purpose of the page will
    flee from it quickly whereas those with a task that requires the
    resources to be found on that page will persist till they get
    them. At the crudest level, perhaps a picture of a dead
    decomposing rat at the top? Animated gif of fumes emanating from
    it? Embedded horrible dead rat sounds? If you care to invest in
    the further development of this promising new scheme, please send
    $10.]

    --
    dorayme
    dorayme, Sep 12, 2007
    #15
  16. John Clayton

    John Clayton Guest

    "dorayme" <> wrote in message
    news:...
    > In article <2sEFi.222058$>,
    > "Jukka K. Korpela" <> wrote:
    >
    >> Scripsit dorayme:
    >>
    >> > What is the simplest and most effective way of stopping robots
    >> > searching particular HTML pages on a server?

    >>
    >> Put the following into the head part of each of those pages:
    >>
    >> <meta name="robots" content="noindex">
    >>
    >> Replace "noindex" by "noindex, nofollow" if you also want to stop robots
    >> from following any links on the page (i.e. from finding new indexable
    >> pages
    >> through it).



    Would this also help answer the recent, earlier question "how to prevent
    spiders from indexing 'mailto' addresses"?
    Just asking.

    John
    John Clayton, Sep 12, 2007
    #16
  17. Ed Mullen

    Ed Mullen Guest

    dorayme wrote:
    > In article <>,
    > Ed Mullen <> wrote:
    >
    >> it would be best
    >>> if they did not come up in search engines. (A sort of provision
    >>> by a company for making some files available to those who have
    >>> the address. Company does not want password protection; but I am
    >>> considering persuading them).
    >>>
    >>> What is the simplest and most effective way of stopping robots
    >>> searching particular HTML pages on a server?
    >>>

    >> Why not just put it in a password-protected directory?

    >
    > I guess because it puts up a hurdle for the company and the
    > particular companies to which they need to communicate this
    > address. People forget passwords and it is extra work to be
    > transmitting password information. I understand the reluctance on
    > this occasion. But see above.


    But, most browsers have the ability to "remember" logon info so it's a
    case of "do it once". Geez, how hard is that? Set up an example and
    show them. I have two different sites with protected pages/files. My
    Mozilla-based browsers remember the logon info just fine. I click on a
    link/favorite/bookmark, the logon pop-up comes up, I click OK.

    >
    > [I am working on a psychologically based scheme at the moment,
    > Ed, in consultation with my psychologist, to make pages that have
    > a level of natural repugnance. The level must be such that people
    > with no real need or interest in the purpose of the page will
    > flee from it quickly whereas those with a task that requires the
    > resources to be found on that page will persist till they get
    > them. At the crudest level, perhaps a picture of a dead
    > decomposing rat at the top? Animated gif of fumes emanating from
    > it? Embedded horrible dead rat sounds? If you care to invest in
    > the further development of this promising new scheme, please send
    > $10.]
    >


    I doubt that decomposing rats will be a sufficiently universal
    deterrent. In fact, I'm not sure you can settle on any image that will,
    say, tick off, what? 80% of viewers? 90%? Now, if you could be certain
    that everyone was browsing with sound on and the volume set to max, well
    .... ooooo, baby! Then we got something!

    --
    Ed Mullen
    http://edmullen.net
    http://mozilla.edmullen.net
    http://abington.edmullen.net
    Give me ambiguity or give me something else.
    Ed Mullen, Sep 13, 2007
    #17
  18. dorayme

    dorayme Guest

    In article <>,
    Ed Mullen <> wrote:

    > But, most browsers have the ability to "remember" logon info so it's a
    > case of "do it once". Geez, how hard is that? Set up an example and
    > show them. I have two different sites with protected pages/files. My
    > Mozilla-based browsers remember the logon info just fine. I click on a
    > link/favorite/bookmark, the logon pop-up comes up, I click OK.


    I knew someone would take this line <g> I remind you that I said
    that I am considering so persuading in my original post. That is
    point one. And yes, I am aware of some browsers having such
    facilities, I would be personally lost without them or the Mac
    keychain. But step back, Ed, and see why I am only considering
    persuading and not headlong rushing into it. You are a young man,
    full of natural enthusiasms, I am a 570 year old martian,
    reserved, restrained, conservative, not the least pushy.

    You are basically asking me to persuade not only the company to
    change browsers but also to persuade them to persuade their
    clients/suppliers (all over the world, rich and poor countries)
    who need the resources on the page concerned to make sure they
    have the appropriate browsers. How hard is that? It is much
    harder than me not doing anything but sticking in the meta thing
    that JK said on the nice web page I made for them and now sitting
    back with pleasant thoughts of sorting out pictures of the dog I
    walk, of all the gorgeous pics from babyhood to married of some
    family members, of a new screen (cheap from Dell) for my desk and
    getting ready to go and have a swim on a Sydney beach this avo
    (have you any idea how lovely Sydney smells and feels today,
    jasmine and clear blue sky... Almost a caricature of spring,
    except it is real).

    [not a snowflake in sight - Whack!]

    --
    dorayme
    dorayme, Sep 13, 2007
    #18
  19. Ed Mullen

    Ed Mullen Guest

    dorayme wrote:
    > In article <>,
    > Ed Mullen <> wrote:
    >
    >> But, most browsers have the ability to "remember" logon info so it's a
    >> case of "do it once". Geez, how hard is that? Set up an example and
    >> show them. I have two different sites with protected pages/files. My
    >> Mozilla-based browsers remember the logon info just fine. I click on a
    >> link/favorite/bookmark, the logon pop-up comes up, I click OK.

    >
    > I knew someone would take this line <g> I remind you that I said
    > that I am considering so persuading in my original post. That is
    > point one. And yes, I am aware of some browsers having such
    > facilities, I would be personally lost without them or the Mac
    > keychain. But step back, Ed, and see why I am only considering
    > persuading and not headlong rushing into it. You are a young man,
    > full of natural enthusiasms, I am a 570 year old martian,
    > reserved, restrained, conservative, not the least pushy.
    >
    > You are basically asking me to persuade not only the company to
    > change browsers but also to persuade them to persuade their
    > clients/suppliers (all over the world, rich and poor countries)
    > who need the resources on the page concerned to make sure they
    > have the appropriate browsers. How hard is that? It is much
    > harder than me not doing anything but sticking in the meta thing
    > that JK said on the nice web page I made for them and now sitting
    > back with pleasant thoughts of sorting out pictures of the dog I
    > walk, of all the gorgeous pics from babyhood to married of some
    > family members, of a new screen (cheap from Dell) for my desk and
    > getting ready to go and have a swim on a Sydney beach this avo
    > (have you any idea how lovely Sydney smells and feels today,
    > jasmine and clear blue sky... Almost a caricature of spring,
    > except it is real).
    >
    > [not a snowflake in sight - Whack!]
    >


    I gotta go get a drink. I read it, I (sorta) got it, and now my head
    hurts so much ...

    Do it or don't do it. It is a solution. If you or your client don't
    like it, fine. Your choice. But, it's simple, it exists, and, let's
    face it, if it's a commercial app? "Use of this site/facility requires
    ...." And it is NOT onerous.

    Ok, I'm wandering downstairs now ...

    --
    Ed Mullen
    http://edmullen.net
    http://mozilla.edmullen.net
    http://abington.edmullen.net
    An ounce of practice is worth more than tons of preaching. - Mohandas Gandhi
    Ed Mullen, Sep 13, 2007
    #19
  20. Scripsit John Clayton:

    >>> <meta name="robots" content="noindex">

    - -
    > Would this also help answer the recent, earlier question "how to
    > prevent spiders from indexing 'mailto' addresses"?


    It would prevent well-behaving robots from indexing the page at all, but
    robots that collect addresses for spamming can hardly be expected to be
    well-behaving.

    (If you want to prevent spammers from getting your email address at any
    cost, get
    rid of all email addresses you have and don't ever get one. That's the only
    method that actually works for the purpose. If you just want to use email
    for something useful, find out an optimal way of doing spam filtering. Do
    _not_ make this your visitors' problem.)

    --
    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Sep 13, 2007
    #20
