Karl Groves said:
Search engine programming is way too complicated for even the most
experienced programmers.
Who writes the software that drives the web's search engines?
Dealing with things like misspellings, homophones,
synonyms and all that stuff is just the tip of the iceberg.
But they are not all that hard to do, in my experience. The hardest part of
the submerged portion of the iceberg is thinking your way through the
task before writing any code.
Then, when you get into things like ranking the results based on
relevance, you have yourself a major nightmare.
Oh, I dunno. It's not as hard as it might seem. The hardest part is
coming up with a decent index from which to work. You have to spend a
lot of time thinking about your indexing algorithms, but I wouldn't go
so far as to call it a nightmare.
I like to do a hybrid sort of a thing, first indexing all of the words
as they appear, then applying the Porter Stemming Algorithm
( http://tartarus.org/~martin/PorterStemmer/ ) to derive their stems,
which receive a lower basic score so that whole words are viewed as
"more relevant". I then factor both based upon their position in the
"stream", their containing elements (h1...h5, bold, italic, etc.), and
their occurrence in any of the more interesting places (path/filename,
document title, META descriptions/keywords, etc.) Then off into the
monstrous database they go. In a single-site search, you don't have to
do any complex heuristics to detect spamdexing or doorway pages, punish
zero-timed redirects, etc. so that bit of nightmare doesn't count.
Still, those things are easily enough detected if you have need of
protective measures.
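A rough sketch of that scoring scheme in Python might look something like this. The weights, the field names, and the idea that earlier words in the stream count for more are all assumptions made up for illustration. The crude_stem helper is a hypothetical stand-in that strips a few suffixes; it is not the Porter algorithm.

```python
import re
from collections import defaultdict

# Assumed weights; pick your own after looking at real documents.
WHOLE_WORD_SCORE = 1.0
STEM_SCORE = 0.5  # stems score lower, so whole words look "more relevant"
FIELD_BOOST = {
    "path": 2.0, "title": 3.0, "meta": 2.0,
    "h1": 2.5, "bold": 1.5, "italic": 1.2, "body": 1.0,
}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def position_factor(position, total):
    """Assumed rule: the factor falls from 1.0 toward 0.5 along the stream."""
    return 1.0 - 0.5 * (position / max(total, 1))


def index_document(fields):
    """fields maps a field name (e.g. 'title') to its text; returns {term: score}."""
    scores = defaultdict(float)
    for field, text in fields.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        boost = FIELD_BOOST.get(field, 1.0)
        for pos, word in enumerate(words):
            weight = boost * position_factor(pos, len(words))
            scores[word] += WHOLE_WORD_SCORE * weight
            stem = crude_stem(word)
            if stem != word:
                # The stem is indexed too, at the lower basic score.
                scores[stem] += STEM_SCORE * weight
    return dict(scores)
```

Calling index_document({"title": "Porter Stemming", "body": "stemming errors"}) gives "stemming" a higher score than its stem, and title words outrank body words.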
The second thing you have to spend a lot of time thinking about is the
database. No matter how you optimize it, it's going to suck. Resources,
that is. Lots and lots of resources.
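For a feel of what goes into that database, here is a minimal sketch using Python's sqlite3 module. The one-table layout (term, url, score) and the function names are assumptions for illustration, not the schema described above; a real index would carry more columns and need far more tuning.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a tiny postings table: one row per term per document."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        " term TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " score REAL NOT NULL,"
        " PRIMARY KEY (term, url))"
    )
    return conn


def add_document(conn, url, term_scores):
    """Store the {term: score} mapping computed for one document."""
    conn.executemany(
        "INSERT OR REPLACE INTO postings (term, url, score) VALUES (?, ?, ?)",
        [(term, url, score) for term, score in term_scores.items()],
    )
    conn.commit()


def lookup(conn, terms):
    """Return (url, summed score) for documents matching any term, best first."""
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    rows = conn.execute(
        "SELECT url, SUM(score) AS total FROM postings"
        f" WHERE term IN ({placeholders})"
        " GROUP BY url ORDER BY total DESC, url",
        list(terms),
    )
    return rows.fetchall()
```

Even this toy version stores a row for every distinct term in every document, which hints at where the resources go.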
My favorite hand-rolled algorithm says that the document at the
(Porter Stemming Algorithm) URL above is most relevant to the
following "natural" keyword groups:
porter, stemming (most relevant)
common, ansi, encodings, published, errors (relevant but too common)
with the top five scored terms being version, algorithm, porter,
stemming (and) common. Using the first three, four, or all five of the
top-scored terms at Lycos lands the URL at number one. Using the first
two lands it at number two. (I use Lycos in this example because it
doesn't have a "PageRank" algorithm that would require web spidering
for validation.) It'd be kinda silly to even think of looking for it
in the results for the single term "version". Mixing and matching
any two from the list of "natural" search terms (at Lycos) brings the
site in at number one most of the time.
The least relevant natural group brings up that URL, at Lycos, in the
number one spot. Popping "errors" off the end moves it to number five.
Those terms are just way too common, even if they're what the document
appears to be "about." (It'd be really easy to knock that site out of
the number one spot, even for "porter stemming", as it obviously has
not been optimized for search engine ranking.)
In my retrieval algorithm, I first spellcheck, then generate a list of
synonyms, homonyms, and common abbreviations of the user-provided
search terms, giving the highest preference to the terms in the order
provided by the user, then the various permutations thereof working
down from best-fit to least-fit. A bit of heuristic manipulation (AKA
"magic") happens when I look at the results, to eliminate some that
might otherwise appear attractive, but for a single small or moderately
sized site these heuristics may be unimportant. If I get too few
"hits", I pop the last term off of the list, and reiterate to add more
hits after the first group, terminating either when I get a reasonable
set, or the relevance factor falls below some threshold.
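The "pop the last term and go again" loop can be sketched in Python roughly as follows. It leaves out the spellcheck, the synonym expansion, and the "magic" entirely. The search callable, min_hits, and min_relevance are hypothetical names and values chosen for illustration.

```python
def search_with_backoff(terms, search, min_hits=10, min_relevance=0.2):
    """Collect hits, dropping the last term whenever there are too few.

    search(terms) is a hypothetical callable returning (url, relevance)
    pairs for documents matching all of the given terms.
    """
    results = []
    seen = set()
    terms = list(terms)  # work on a copy; terms are in the user's order
    while terms:
        hits = search(terms)
        # Stop once even the best hit of this round is below the threshold.
        if hits and max(rel for _, rel in hits) < min_relevance:
            break
        for url, rel in hits:
            if rel >= min_relevance and url not in seen:
                seen.add(url)
                results.append((url, rel))  # later rounds go after earlier ones
        if len(results) >= min_hits:
            break
        terms.pop()  # too few hits: drop the last term and try again
    return results
```

Hits found with the full term list stay at the front, and hits from the shortened lists are appended behind them.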
If the site is all static content and has META descriptions/keywords,
it might be best to conserve resources by indexing only the path and
file name, title, and META description/keywords. The task gets far
simpler and the resource consumption falls off dramatically.
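As a sketch of that cheaper approach, the standard library's html.parser can pull out just the title and the META description/keywords. The class and function names here are made up for illustration, and the body text is skipped on purpose.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and META description/keywords only."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_fields(path, html):
    """Return only the fields worth indexing for a static page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "path": path,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
    }
```

The resulting dictionary has four short strings per document instead of the full word stream, which is where the savings come from.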
Having done it, albeit on a small scale (single sites of just tens of
thousands of documents, not hundreds of thousands or millions) I don't
consider search engine development to be "way too complicated for even
the most experienced programmers." Indexing the entire web would
require an experienced programmer (a la http://www.gigablast.com/,
which is one guy with just eight servers), but indexing a single site
isn't.
It's a good stretching exercise even for moderately skilled programmers
who aren't betting their careers on the product, and lots of fun, too.