Stopping robots from searching particular pages


dorayme

A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if the blocking were not 100% effective, but it would
be best if they did not come up in search engines. (A sort of
provision by a company for making some files available to those who
have the address. The company does not want password protection,
but I am considering persuading them.)

What is the simplest and most effective way of stopping robots
from searching particular HTML pages on a server? I am looking for
an actual example and clear instructions. I am getting confused by
looking at http://www.searchtools.com/index.html, though doubtless
I will get less confused after much study.
 

Jukka K. Korpela

Scripsit dorayme:
What is the simplest and most effective way of stopping robots
from searching particular HTML pages on a server?

Put the following into the head part of each of those pages:

<meta name="robots" content="noindex">

Replace "noindex" by "noindex, nofollow" if you also want to stop robots
from following any links on the page (i.e. from finding new indexable pages
through it).

This follows the de facto standard (the Robots Exclusion Standard) that has
long been obeyed by all well-behaving indexing robots. And there's not much
you can do about the ill-behaving robots.
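For illustration, here is a minimal sketch of a complete page carrying the stricter directive; the title and body text are placeholders, not anything from this thread:

<!DOCTYPE html>
<html>
<head>
<title>Example page kept out of search engines</title>
<!-- asks well-behaving robots not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
</head>
<body>
<p>Page content here.</p>
</body>
</html>

The element has to be repeated in the head of every page you want excluded; it travels with the page itself rather than sitting in a separate file.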
 

Tina Peters

dorayme said:
A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if the blocking were not 100% effective, but it would
be best if they did not come up in search engines.

If it's not linked to from any other web page, in any way, it shouldn't be
spidered.

--Tina
 

Jukka K. Korpela

Scripsit Tina Peters:
If it's not linked to from any other web page, in any way, it shouldn't be
spidered.

Yet it may be spidered. Actually, it would be an interesting exercise in a
course on web issues to ask the students to list ten possible situations
in which the page might be spidered.

And to make the task a little more difficult, let's exclude perhaps the most
obvious scenario: someone who knows the page address submits it to a search
engine via its "Add URL" form.
 

dorayme

"Jukka K. Korpela said:
Scripsit dorayme:


Put the following into the head part of each of those pages:

<meta name="robots" content="noindex">

Replace "noindex" by "noindex, nofollow" if you also want to stop robots
from following any links on the page (i.e. from finding new indexable pages
through it).

This follows the de facto standard (the Robots Exclusion Standard) that has
long been obeyed by all well-behaving indexing robots. And there's not much
you can do about the ill-behaving robots.

Thank you. This is the level of exclusion that I want. Job done.
 

Sherm Pendley

dorayme said:
What is the simplest and most effective way of stopping robots
from searching particular HTML pages on a server?

There are two popular "standards" (neither of which is a standard in
the formal sense). One uses <meta ...> elements in your HTML, and the
other uses separate robots.txt files. Both are described here:

<http://www.robotstxt.org/>

Both approaches depend on cooperative robots. For uncooperative robots,
all you can do is shout "klaatu barada nikto" and hope for the best.
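To make the robots.txt route concrete, here is a minimal sketch of such a file; it sits at the root of the site, and the two page names below are hypothetical, not paths mentioned anywhere in this thread:

# robots.txt at the top of the site, e.g. http://www.example.com/robots.txt
# "User-agent: *" addresses every cooperating robot
User-agent: *
# list each page that should stay out of the indexes
Disallow: /private-notes.html
Disallow: /internal-pricing.html

One trade-off worth noting: robots.txt is itself publicly readable, so it advertises exactly which paths you would rather keep quiet, whereas the meta element gives away nothing beyond the page it sits on.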

sherm--
 

dorayme

Sherm Pendley said:
There are two popular "standards" (neither of which is a standard in
the formal sense). One uses <meta ...> elements in your HTML, and the
other uses separate robots.txt files. Both are described here:

<http://www.robotstxt.org/>

Both approaches depend on cooperative robots. For uncooperative robots,
all you can do is shout "klaatu barada nikto" and hope for the best.

Thanks. If I get any reports of the pages concerned being found
now that I have gone the meta route, I will look further into the
robots.txt approach.

(Actually, sherm, I started reading about this before
posting my question, got restless and slightly confused and
thought: I know what to do, I will pop my head above the trench
line for a mo and see if something comes back from alt.html to make
this thing stop buzzing around my brain. I know, it was a bit
reckless. But... who dares... you know... <g>

I also have a search engine on the particular site concerned, and
it has various masking procedures I have since looked into.)
 

Travis Newbury

Yet it may be spidered. Actually, it would be an interesting exercise in a
course on web issues to ask the students to list ten possible situations
in which the page might be spidered.

Well, that's just a stupid assignment. The students might actually be
forced to learn something from it. What the heck is your problem,
suggesting something where a student could learn...
 

Ben C

Jukka K. Korpela wrote:

Yet it may be spidered. Actually, it would be an interesting exercise in a
course on web issues to ask the students to list ten possible situations
in which the page might be spidered.

And to make the task a little more difficult, let's exclude perhaps the most
obvious scenario: someone who knows the page address submits it to a search
engine via its "Add URL" form.

1. Someone posts the URL to a newsgroup.
2. You forget to turn off the webserver's AutoIndex or similar, so the
spider can just navigate its way to the URL by going through
auto-generated directory indexes.

What are the other 8?
 

Jukka K. Korpela

Scripsit Ben C:
1. Someone posts the URL to a newsgroup.
2. You forget to turn off the webserver's AutoIndex or similar, so the
spider can just navigate its way to the URL by going through
auto-generated directory indexes.

What are the other 8?

To mention some other scenarios of having a page indexed without having been
linked to from any other web page*), here's one relatively obvious one and
one imaginary though realistic (we know such things are being done with
email addresses for spamming purposes):

3. The page _was_ linked to from another page.

4. An indexing robot generates URLs automatically, more or less at random,
and tries them. It might for example try servers known to exist and append
to the server name some strings that are known to be common for web pages,
like /help.htm, /news.html....

*) Of course an author cannot prevent linking by others. You tell the URL to
your friend, who tells it to his pal, who sets up a link. But this common
way of getting indexed against your will falls outside the current exercise.
 

Dylan Parry

Jukka said:
3. The page _was_ linked to from another page.

4. An indexing robot generates URLs automatically, more or less at random,
and tries them. It might for example try servers known to exist and append
to the server name some strings that are known to be common for web pages,
like /help.htm, /news.html....

5. Someone visits your page[1] and has the Google Toolbar (or other
similar things) installed and reporting back to Google about the sites
they are visiting, thus allowing Google to add the site to its index.

____
[1] How they got the URL in the first place might be an issue here, but
it could be that you personally gave it to them or that it was written
down somewhere that wasn't necessarily an online resource (business card
etc).

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
 

Nick Theodorakis

2. You forget to turn off the webserver's AutoIndex or similar, so the
spider can just navigate its way to the URL by going through
auto-generated directory indexes.


At least one robot does this. I have a template page (definitely not
mentioned anywhere else) in a subdirectory that seems to get spidered
by the Yahoo Slurp robot.

Nick
 

Adrienne Boswell

Gazing into my crystal ball I observed dorayme writing:
A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if the blocking were not 100% effective, but it would
be best if they did not come up in search engines. (A sort of
provision by a company for making some files available to those who
have the address. The company does not want password protection,
but I am considering persuading them.)

What is the simplest and most effective way of stopping robots
from searching particular HTML pages on a server? I am looking for
an actual example and clear instructions. I am getting confused by
looking at http://www.searchtools.com/index.html, though doubtless
I will get less confused after much study.

1. The robots exclusion file (robots.txt), in which you can name a particular
file, e.g. backoffice.asp
2. The meta route (in my experience, not quite as reliable as the first)
 

Ed Mullen

dorayme said:
A website is on a server. Just one or two of the pages are not
for public consumption. They are not top secret and no big harm
would be done if the blocking were not 100% effective, but it would
be best if they did not come up in search engines. (A sort of
provision by a company for making some files available to those who
have the address. The company does not want password protection,
but I am considering persuading them.)

What is the simplest and most effective way of stopping robots
from searching particular HTML pages on a server? I am looking for
an actual example and clear instructions. I am getting confused by
looking at http://www.searchtools.com/index.html, though doubtless
I will get less confused after much study.

Why not just put it in a password-protected directory?
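For what that would involve, a minimal sketch of HTTP Basic authentication follows, assuming an Apache server (nobody in this thread has said what the server runs); the directory path, realm name and user name are placeholders:

# .htaccess placed in the directory to be protected (Apache)
AuthType Basic
AuthName "Company partner files"
AuthUserFile /home/example/.htpasswd
Require valid-user

# the password file is created once on the server, for example:
#   htpasswd -c /home/example/.htpasswd partneruser

The browser prompts for the name and password the first time the directory is visited, and robots simply get turned away at the door.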

--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
I used to be schizophrenic, but we're all right now.
 

dorayme

Ed Mullen said:
it would be best

Why not just put it in a password-protected directory?

I guess because it puts up a hurdle for the company and the
particular companies to which they need to communicate this
address. People forget passwords, and it is extra work to
transmit password information. I understand the reluctance on
this occasion. But see above.

[I am working on a psychologically based scheme at the moment,
Ed, in consultation with my psychologist, to make pages that have
a level of natural repugnance. The level must be such that people
with no real need or interest in the purpose of the page will
flee from it quickly whereas those with a task that requires the
resources to be found on that page will persist till they get
them. At the crudest level, perhaps a picture of a dead
decomposing rat at the top? Animated gif of fumes emanating from
it? Embedded horrible dead rat sounds? If you care to invest in
the further development of this promising new scheme, please send
$10.]
 

John Clayton

Would this also help answer the recent, earlier question "how to prevent
spiders from indexing 'mailto' addresses"?
Just asking.

John
 

Ed Mullen

dorayme said:
I guess because it puts up a hurdle for the company and the
particular companies to which they need to communicate this
address. People forget passwords, and it is extra work to
transmit password information. I understand the reluctance on
this occasion. But see above.

But most browsers have the ability to "remember" logon info, so it's a
case of "do it once". Geez, how hard is that? Set up an example and
show them. I have two different sites with protected pages/files. My
Mozilla-based browsers remember the logon info just fine. I click on a
link/favorite/bookmark, the logon pop-up comes up, I click OK.

[I am working on a psychologically based scheme at the moment,
Ed, in consultation with my psychologist, to make pages that have
a level of natural repugnance. The level must be such that people
with no real need or interest in the purpose of the page will
flee from it quickly whereas those with a task that requires the
resources to be found on that page will persist till they get
them. At the crudest level, perhaps a picture of a dead
decomposing rat at the top? Animated gif of fumes emanating from
it? Embedded horrible dead rat sounds? If you care to invest in
the further development of this promising new scheme, please send
$10.]

I doubt that decomposing rats will be a sufficiently universal
deterrent. In fact, I'm not sure you can settle on any image that will,
say, tick off, what? 80% of viewers? 90%? Now, if you could be certain
that everyone was browsing with sound on and the volume set to max, well
.... ooooo, baby! Then we got something!

--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
Give me ambiguity or give me something else.
 

dorayme

Ed Mullen said:
But most browsers have the ability to "remember" logon info, so it's a
case of "do it once". Geez, how hard is that? Set up an example and
show them. I have two different sites with protected pages/files. My
Mozilla-based browsers remember the logon info just fine. I click on a
link/favorite/bookmark, the logon pop-up comes up, I click OK.

I knew someone would take this line <g> I remind you that I said
in my original post that I am considering such persuasion. That is
point one. And yes, I am aware of some browsers having such
facilities; I would be personally lost without them or the Mac
keychain. But step back, Ed, and see why I am only considering
persuading and not rushing headlong into it. You are a young man,
full of natural enthusiasms; I am a 570-year-old Martian,
reserved, restrained, conservative, not the least bit pushy.

You are basically asking me to persuade not only the company to
change browsers but also to persuade them to persuade their
clients/suppliers (all over the world, rich and poor countries)
who need the resources on the page concerned to make sure they
have the appropriate browsers. How hard is that? It is much
harder than me doing nothing but sticking in the meta thing
that JK suggested on the nice web page I made for them, and then
sitting back with pleasant thoughts of sorting out pictures of the
dog I walk, of all the gorgeous pics from babyhood to marriage of
some family members, of a new screen (cheap from Dell) for my desk,
and getting ready to go and have a swim on a Sydney beach this avo
(have you any idea how lovely Sydney smells and feels today,
jasmine and clear blue sky... Almost a caricature of spring,
except it is real).

[not a snowflake in sight - Whack!]
 

Ed Mullen

dorayme said:
[snip]

I gotta go get a drink. I read it, I (sorta) got it, and now my head
hurts so much ...

Do it or don't do it. It is a solution. If you or your client don't
like it, fine. Your choice. But, it's simple, it exists, and, let's
face it, if it's a commercial app? "Use of this site/facility requires
...." And it is NOT onerous.

Ok, I'm wandering downstairs now ...

--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
An ounce of practice is worth more than tons of preaching. - Mohandas Gandhi
 

Jukka K. Korpela

Scripsit John Clayton:
- -
Would this also help answer the recent, earlier question "how to
prevent spiders from indexing 'mailto' addresses"?

It would prevent well-behaving robots from indexing the page at all, but
robots that collect addresses for spamming can hardly be expected to be
well-behaving.

(If you want to prevent spammers from getting your email address at any
cost, get rid of all the email addresses you have and don't ever get one.
That's the only method that actually works for the purpose. If you just want
to use email for something useful, find an optimal way of doing spam
filtering. Do _not_ make this your visitors' problem.)
 
