How does the Google spider access my web site?

baroque Chou

Does anyone know how Google spiders access a web site? How do they manage to
get the href information? Do they have special access rights or something?
Any help is appreciated.
 
Brian Cryer

baroque Chou said:
Does anyone know how Google spiders access a web site? How do they manage to
get the href information? Do they have special access rights or something?
Any help is appreciated.

No, Google doesn't have any special access rights; they access your website
the same way as anyone else. This means that if you have a login screen
which visitors need to get past to view your site, then the Google spider
won't get past it either. Some sites explicitly grant the Google bot (or
other bots) access, but that's the exception, not the rule.

In summary, what you can see in your browser (or better still, what I could
see in my browser if you gave me the URL) is what the Google spider can see.
The only exception is that the Google spider is a little fussier about
correct HTML than most browsers are, so it's worth checking that your code
validates and that your links are correct.
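To make this concrete: all the spider has to work with is the served markup, from which it pulls the href values. Here is a minimal sketch in Python (the thread is about ASP.NET, so treat this as illustrative, not as Google's actual code) using the standard library's HTML parser; the sample page content is made up.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, as a spider would."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

# The served HTML is all the spider (or a browser) ever receives.
page = """
<html><body>
  <a href="about.aspx">About</a>
  <a href="Middlelayer_Top10.aspx?id=105">Top 10</a>
</body></html>
"""

collector = HrefCollector()
collector.feed(page)
print(collector.hrefs)  # ['about.aspx', 'Middlelayer_Top10.aspx?id=105']
```

Note that a strict parser like this is also why valid HTML matters: a browser may tolerate an unclosed quote in an href, but a parser extracting links may not.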
 
baroque Chou

Thanks. It seems the Google spider has some of the attributes a browser has.
But if I am using dynamic pages, say aspx, which don't produce an output
page until the web server executes them, how does Google know the hrefs in
that page? And most of the time, even in the executed page, the href has a
form like

<a href='Middlelayer_Top10.aspx?id=105'>

How will the spider make a deeper crawl, if it can neither access my source
code nor make the request itself?
 
KMA

Generally it goes like this:

You send Google a reference to your homepage. Obviously this page shouldn't
require logging in or a password.
The Google bot downloads this page and strips out all the links. It computes
a "score" of the page for the Google index, then downloads every page from
the link list and repeats the same procedure until all links are processed.

The exact details of the scoring mechanism are not published, to prevent
people from artificially pushing their pages up the rankings.

Some say that parameterised links (with a gfdg.aspx?productID=1234) are not
followed.

To get more of an idea, create an aspx page with links, run your prog, then
in the browser right-click and choose View Source. That is exactly what the
googlebot gets.
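The download-strip-repeat loop described above is a breadth-first crawl. A toy sketch in Python, run against a made-up in-memory "site" (a dict standing in for HTTP responses) rather than real requests:

```python
import re
from collections import deque

# A toy "site": URL -> the HTML the server would send back.
# The spider never sees your source code, only these responses.
site = {
    "index.aspx": '<a href="products.aspx">Products</a>',
    "products.aspx": '<a href="gfdg.aspx?productID=1234">Toaster</a> '
                     '<a href="index.aspx">Home</a>',
    "gfdg.aspx?productID=1234": "<p>Toast-O-Matic 5000</p>",
}

def crawl(start):
    """Download a page, strip out its links, then repeat for every new
    link until all links are processed."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        html = site.get(url, "")                      # stands in for the bot's HTTP GET
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:                      # don't re-crawl known pages
                seen.add(link)
                queue.append(link)
    return order

print(crawl("index.aspx"))
# ['index.aspx', 'products.aspx', 'gfdg.aspx?productID=1234']
```

Notice the crawler reaches the parameterised URL only because it appeared in a served page; whether a real bot chooses to follow such links is a separate policy question, as KMA notes.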
 
baroque Chou

Thank you very much. Someone suggested that you should use rewrite rules to
make the URLs more search-engine friendly,
e.g. rewriting gfdg.aspx?productID=1234 to gfdg.aspx/productID/1234.
But this page doesn't actually exist on my web server;
what exists is just the source page, and the "instance" of that page is
created for each individual request. So do I need to archive the
instances of that page to some location (with the directory hierarchy
packaged to follow the URL pattern, so that the spider can have a better
crawl)?
 
KMA

OK, basically it goes like this.

On your web pages you write bot-friendly URLs, like
gfds/product/toasters/toastomatic5000.aspx.

But as you say, this page doesn't really exist. When the bot requests the
page, IIS will not be able to find it, but if you implement your own
404 handler then IIS will call that instead. A normal 404 handler just gives
back a page saying "Sorry, page not found", but your special 404 handler
will be passed the URL of the requested page. You can then strip the product
ID off the URL and build the page for that product. This page is then sent
back to the bot.

In a way you are fooling the bot into thinking you have lots of web pages,
but really you just have one page handler plus a database of product data.
Bot writers expect this, because they know that it's very difficult to
maintain a large site any other way.
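The 404-handler trick boils down to: take the "missing" URL, pull the product key out of it, and build the page from the database. A sketch of that logic in Python (in a real ASP.NET site this would live in the custom error handler; the product table and page markup here are entirely hypothetical):

```python
# Hypothetical product table standing in for the real database.
PRODUCTS = {"toastomatic5000": "Toast-O-Matic 5000, $49.99"}

def handle_404(requested_path):
    """Custom 404 handler: the 'missing' URL itself tells us which page
    to build, e.g. gfds/product/toasters/toastomatic5000.aspx ->
    product key 'toastomatic5000'."""
    slug = requested_path.rstrip("/").rsplit("/", 1)[-1]  # last path segment
    key = slug.removesuffix(".aspx")
    if key in PRODUCTS:
        # Build the page on the fly and return it with a 200, not a 404.
        return 200, "<h1>%s</h1>" % PRODUCTS[key]
    return 404, "<h1>Sorry, page not found</h1>"

print(handle_404("gfds/product/toasters/toastomatic5000.aspx"))
```

One caveat with this approach: make sure the handler really returns status 200 for pages it can build, since some bots will not index a page served with a 404 status.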
 
Alan Silver

Why not just use URL rewriting? Much cleaner.

KMA said:
OK, basically it goes like this.

On your web pages you write bot-friendly urls, like
gfds/product/toasters/toastomatic5000.aspx.

But like you say, this page doesn't really exist. When the bot requests the
page, IIS will not be able to find the page, but if you implement your own
404 handler then IIS will call this. A normal 404 handler just gives back a
page saying "Sorry, page not found" but your special 404 handler will be
passed the url of the requested page. You then can strip off the productID
from the url and build your page for the product. This page is then sent
back to the bot.

In a way you are fooling the bot that you have lots of web pages, but really
you just have one page handler plus a database of product data. Bot writers
expect this, because they know that it's very difficult to maintain a large
site in any other way.
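URL rewriting means the server maps the friendly URL back to the real query-string form before the page runs, so no 404 is ever raised. The mapping itself is just a pattern substitution; a sketch in Python (IIS would do this with a rewrite rule or an HttpModule, and the URL shapes are the ones from this thread):

```python
import re

def rewrite(url):
    """Rewrite a search-engine-friendly URL back to the query-string form
    the real page expects: gfdg.aspx/productID/1234 -> gfdg.aspx?productID=1234
    """
    return re.sub(r"(\.aspx)/([^/]+)/([^/]+)$", r"\1?\2=\3", url)

print(rewrite("gfdg.aspx/productID/1234"))  # gfdg.aspx?productID=1234
print(rewrite("about.aspx"))                # unchanged: about.aspx
```

Compared with the 404-handler trick, this keeps the normal request pipeline: the rewritten request hits the existing gfdg.aspx page handler directly, and ordinary URLs pass through untouched.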
 
