OT: Full text search

Jeff Thies · Aug 23, 2004

I think many of us use MySQL...

I notice that MySQL has a full text search. This matches a phrase
like: "full text search of website", and returns a list of results
ordered by the highest degree of matches. Minor words and very frequent
words are excluded.
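
For anyone who has not seen the feature, it is queried with MATCH ... AGAINST on columns covered by a FULLTEXT index. Here is a minimal sketch of building such a query in Python. The table name `pages` and the columns `url`, `title` and `body` are hypothetical, and the `%s` placeholders assume a database driver that uses that parameter style.

```python
def build_fulltext_query(columns, table="pages"):
    """Return SQL for a MySQL natural-language full-text search.

    `table` and `columns` are hypothetical names; the columns must be
    covered by one FULLTEXT index for MATCH() to work.
    """
    cols = ", ".join(columns)
    return (
        f"SELECT url, MATCH({cols}) AGAINST (%s) AS score "
        f"FROM {table} "
        f"WHERE MATCH({cols}) AGAINST (%s) "
        f"ORDER BY score DESC"
    )


sql = build_fulltext_query(["title", "body"])
# The same search phrase is bound to both placeholders.
params = ("full text search of website",) * 2
print(sql)
```

The phrase is passed as a bound parameter rather than pasted into the SQL string, so user input never becomes part of the statement itself.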

Sounds very powerful, and since it's nearly trivial to spider a site and
stuff the pages into a table, it would be easy to implement.

So, anyone used this? Or something like it?

Jeff

Karl Groves · Aug 23, 2004

Jeff Thies said:
So, anyone used this? Or something like it?

Search engine programming is way too complicated for even the most
experienced programmers. Dealing with things like misspellings, homophones,
synonyms and all that stuff is just the tip of the iceberg. Then, when
you get into things like ranking the results based on relevance, you have
yourself a major nightmare.

-Karl

Jeff Thies · Aug 23, 2004

Karl said:
Search engine programming is way too complicated for even the most
experienced programmers.

Well, keyword/multi searches on shopping sites are very common and not
hard to implement. You wouldn't want a full-fledged search engine there,
and for most other site apps you wouldn't either.

Dealing with things like misspellings, homophones,
synonyms and all that stuff is just the tip of the iceberg.

Why bother? Spellchecking isn't that hard, though. I posted a client-side
version a couple of years ago; doing it server-side would be even easier.
But there are a lot of ways to work around no-match scenarios.

Then, when
you get into things like ranking the results based on relevance, you have
yourself a major nightmare.

But that's where MySQL does it for you. Besides, Google ranks on a lot
of criteria that would be less than helpful on a mid-size site.

So, your answer is no?

Jeff

Art Sackett · Aug 24, 2004

Jeff Thies said:
So, anyone used this? Or something like it?

I toyed with it a while back, and it's slower than the rectification of
sin. I wouldn't use it at all on a large dataset or a busy server.

Toby Inkster · Aug 24, 2004

Karl said:
Search engine programming is way too complicated for even the most
experienced programmers. Dealing with things like misspellings, homophones,
synonyms and all that stuff is just the tip of the iceberg. Then, when
you get into things like ranking the results based on relevance, you have
yourself a major nightmare.

Rankings aren't hard. I'm pretty happy with the rankings on my search
engine. I only have a handful of pages, but the algorithm should work fine
even with many thousands.

I don't deal with misspellings, etc. on my site -- if you misspell your
search term you don't *deserve* a result ;-) I can't imagine misspellings
would be that hard though, if you used some third-party software to
suggest corrections (e.g. ispell).
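
As a rough illustration of suggesting corrections, here is a sketch that uses Python's standard `difflib` module as a stand-in for ispell, which is a different tool. It compares the search term against a word list, which is assumed here to be the vocabulary already indexed for the site. The word list and the `suggest` helper are made up for the example.

```python
import difflib


def suggest(term, vocabulary, limit=3):
    """Suggest known words close to a possibly misspelled search term."""
    if term in vocabulary:
        return []  # spelled like a known word, nothing to suggest
    return difflib.get_close_matches(term, vocabulary, n=limit, cutoff=0.6)


# Hypothetical vocabulary, e.g. the distinct words found while indexing.
vocabulary = ["search", "server", "stemming", "ranking"]
suggestions = suggest("serch", vocabulary)
print(suggestions)
```

This only catches terms that are not already words in the vocabulary. A typo that happens to produce another valid word passes straight through.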

I've also not implemented boolean searches (just exact phrase) yet.

Jeff Thies · Aug 24, 2004

Art said:
I toyed with it a while back, and it's slower than the rectification of
sin. I wouldn't use it at all on a large dataset or a busy server.

I was afraid of that! Must have one hell of an index, or maybe not!

I figured it would be slow slurping data in. It's slow querying also?

Cheers,
Jeff

Jeff Thies · Aug 24, 2004

Rankings aren't hard. I'm pretty happy with the rankings on my search
engine. I only have a handful of pages, but the algorithm should work fine
even with many thousands.

How are you going about that?

I don't deal with misspellings, etc. on my site -- if you misspell your
search term you don't *deserve* a result ;-) I can't imagine misspellings
would be that hard though, if you used some third-party software to
suggest corrections (e.g. ispell).

I've also not implemented boolean searches (just exact phrase) yet.

I've been doing something like this:

AND search:

foreach my $keyword (@keywords) {
    $sort .= ' AND search_field like ' . '\'%' . $keyword . '%\' ';
}

seems too easy... you gotta clean out leading/trailing spaces in the
keywords is all.
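
One caution about the loop above: pasting keywords straight into the SQL string breaks on any keyword containing a quote, and it leaves the query open to injection. Here is a sketch of the same AND search with bound parameters, written in Python against an in-memory SQLite database so it can be run as-is. The `items` table and its rows are made up for the example; only `search_field` comes from the snippet above.

```python
import sqlite3


def and_search(conn, keywords):
    """Return rows whose search_field contains every keyword (AND search)."""
    # Clean out leading/trailing spaces and drop empty keywords.
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        return []
    where = " AND ".join("search_field LIKE ?" for _ in cleaned)
    params = [f"%{k}%" for k in cleaned]
    sql = f"SELECT id, search_field FROM items WHERE {where} ORDER BY id"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, search_field TEXT)")
conn.executemany(
    "INSERT INTO items (search_field) VALUES (?)",
    [("red wool sweater",), ("red cotton shirt",), ("blue wool scarf",)],
)
results = and_search(conn, [" red ", "wool"])
print(results)
```

Note that `%` and `_` inside a keyword still act as LIKE wildcards here; escaping them is left out to keep the sketch short.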

Jeff

Art Sackett · Aug 24, 2004

Karl Groves said:
Search engine programming is way too complicated for even the most
experienced programmers.

Who writes the software that drives the web's search engines?

Dealing with things like misspellings, homophones,
synonyms and all that stuff is just the tip of the iceberg.

But are not all that hard to do, in my experience. The hardest part of
the submerged portion of the iceberg is thinking your way through the
task before writing any code.

Then, when
you get into things like ranking the results based on relevance, you have
yourself a major nightmare.

Oh, I dunno. It's not as hard as it might seem. The hardest part is
coming up with a decent index from which to work. You have to spend a
lot of time thinking about your indexing algorithms, but I wouldn't go
so far as to call it a nightmare.

I like to do a hybrid sort of a thing, first indexing all of the words
as they appear, then applying the Porter Stemming Algorithm
(http://tartarus.org/~martin/PorterStemmer/ ) to derive their stems,
which receive a lower basic score so that whole words are viewed as
"more relevant". I then factor both based upon their position in the
"stream", their containing elements (h1...h5, bold, italic, etc.), and
their occurrence in any of the more interesting places (path/filename,
document title, META descriptions/keywords, etc.) Then off into the
monstrous database they go. In a single-site search, you don't have to
do any complex heuristics to detect spamdexing or doorway pages, punish
zero-timed redirects, etc. so that bit of nightmare doesn't count.
Still, those things are easily enough detected if you have need of
protective measures.
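
The indexing scheme just described might be sketched as follows. The weights are invented for illustration, since no actual numbers are given above, and `crude_stem` is a deliberately naive suffix stripper standing in for the real Porter algorithm.

```python
from collections import defaultdict

# Hypothetical weights, chosen only to illustrate the idea.
WORD_SCORE = 1.0   # a whole word as it appears
STEM_SCORE = 0.5   # its stem scores lower than the whole word
ELEMENT_BOOST = {"title": 4.0, "h1": 3.0, "b": 1.5, "body": 1.0}


def crude_stem(word):
    """Naive stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(tokens):
    """Map each term to a score; tokens are (word, element) in stream order."""
    scores = defaultdict(float)
    total = len(tokens)
    for position, (word, element) in enumerate(tokens):
        word = word.lower()
        # Words earlier in the stream count a little more.
        position_factor = 1.0 + (total - position) / total
        boost = ELEMENT_BOOST.get(element, 1.0) * position_factor
        scores[word] += WORD_SCORE * boost
        stem = crude_stem(word)
        if stem != word:
            scores[stem] += STEM_SCORE * boost
    return dict(scores)


doc = [("Stemming", "title"), ("algorithm", "h1"),
       ("stemming", "body"), ("errors", "body")]
index = index_document(doc)
print(sorted(index.items(), key=lambda kv: kv[1], reverse=True))
```

A real index would also need to record which document each score belongs to and to handle the path, filename and META fields, all of which are omitted here.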

The second thing you have to spend a lot of time thinking about is the
database. No matter how you optimize it, it's going to suck. Resources,
that is. Lots and lots of resources.

My favorite hand-rolled algorithm says that the document at the
(Porter Stemming Algorithm) URL above is most relevant to the
following "natural" keyword groups:

porter, stemming (most relevant)
common, ansi, encodings, published, errors (relevant but too common)

with the top five scored terms being version, algorithm, porter,
stemming (and) common. Using the first three, four, or all five of the
top-scored terms at Lycos lands the URL at number one. Using the first
two lands it at number two. (I use Lycos in this example because it
doesn't have a "PageRank" algorithm that would require web spidering
for validation.) It'd be kinda silly to even think of looking for it
in the results for the single term "version".

Mixing and matching
any two from the list of "natural" search terms (at Lycos) brings the
site in at number one most of the time.

The least relevant natural group brings up that URL, at Lycos, in the
number one spot. Popping "errors" off the end moves it to number five.
Those terms are just way too common, even if they're what the document
appears to be "about." (It'd be really easy to knock that site out of
the number one spot, even for "porter stemming", as it obviously has
not been optimized for search engine ranking.)

In my retrieval algorithm, I first spellcheck, then generate a list of
synonyms, homonyms, and common abbreviations of the user-provided
search terms, giving the highest preference to the terms in the order
provided by the user, then the various permutations thereof working
down from best-fit to least-fit. A bit of heuristic manipulation (AKA
"magic") happens when I look at the results, to eliminate some that
might otherwise appear attractive, but for a single small or moderately
sized site these heuristics may be unimportant. If I get too few
"hits", I pop the last term off of the list, and reiterate to add more
hits after the first group, terminating either when I get a reasonable
set, or the relevance factor falls below some threshold.
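
The back-off part of that retrieval loop could look roughly like this. It is only a sketch: the spellcheck, synonym and heuristic steps are left out, the relevance threshold is not modeled, and `lookup` plus the toy `INDEX` are hypothetical stand-ins for a real index query.

```python
def search_with_backoff(terms, lookup, min_hits=3):
    """Search with all terms; while hits are too few, drop the last term and retry.

    `lookup` takes a list of terms and returns an ordered list of ids of
    documents matching all of them.
    """
    results = []
    seen = set()
    remaining = list(terms)
    while remaining:
        for doc_id in lookup(remaining):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)  # looser matches land after earlier ones
        if len(results) >= min_hits:
            break
        remaining.pop()  # pop the last term off the list and reiterate
    return results


# Toy inverted index: term -> set of document ids.
INDEX = {"porter": {1, 2, 3}, "stemming": {1, 2}, "ansi": {1}}


def lookup(terms):
    sets = [INDEX.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


hits = search_with_backoff(["porter", "stemming", "ansi"], lookup)
print(hits)
```

Documents found with the full term list stay at the front, and each looser pass only appends documents that have not been seen yet.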

If the site is all static content and has META descriptions/keywords,
it might be best to conserve resources by indexing only the path and
file name, title, and META description/keywords. The task gets far
simpler and the resource consumption falls off dramatically.

Having done it, albeit on a small scale (single sites of just tens of
thousands of documents, not hundreds of thousands or millions) I don't
consider search engine development to be "way too complicated for even
the most experienced programmers." Indexing the entire web would
require an experienced programmer (a la http://www.gigablast.com/ which
is one guy with just eight servers), but indexing a single site isn't.
It's a good stretching exercise even for moderately skilled programmers
who aren't betting their careers on the product, and lots of fun, too.

Art Sackett · Aug 24, 2004

Jeff Thies said:
foreach my $keyword (@keywords) {
    $sort .= ' AND search_field like ' . '\'%' . $keyword . '%\' ';
}

You might consider using:

([[:<:]]|[[:punct:]])$keyword([[:punct:]]|[[:>:]])

to ensure you get 'em all. Just a thought...
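
To show what word boundaries buy you over a plain substring match, here is a small Python sketch. Python's `\b` is used as an approximate counterpart to the `[[:<:]]` and `[[:>:]]` markers; the two are not identical in every edge case, and the helper name is made up.

```python
import re


def whole_word_match(keyword, text):
    """True if keyword occurs in text as a whole word, not inside another word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# A substring test behaves like LIKE '%cat%' and matches inside "category".
substring_hit = "cat" in "product category"
# The word-boundary test rejects that, but still accepts "cat?".
word_hit = whole_word_match("cat", "product category")
punct_hit = whole_word_match("cat", "Is that a cat?")
print(substring_hit, word_hit, punct_hit)
```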

Art Sackett · Aug 24, 2004

Jeff Thies said:
I figured it would be slow slurping data in. It's slow querying also?

Painfully so.

Karl Groves · Aug 24, 2004

Jeff Thies said:
Well, keyword/multi searches on shopping sites are very common and not
hard to implement. You wouldn't want a full-fledged search engine there,
and for most other site apps you wouldn't either.

So, how do you know what "keywords" have relevance to your user?
How do you know that your user is familiar with your vocabulary?

Here's an example: at a credit union they refer to a checking account as a
"share draft". So, if a user comes to the search engine and types in
"checking", the product results that get returned should be the "share
draft" products.
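
One simple way to handle a case like this is a hand-maintained synonym table that maps the user's words onto the site's own vocabulary before searching. Here is a minimal sketch in Python; the `SYNONYMS` table and `expand_query` helper are hypothetical and only cover the credit union example.

```python
# Hypothetical mapping from user vocabulary to the site's own terms.
SYNONYMS = {
    "checking": ["share draft"],
    "checking account": ["share draft"],
}


def expand_query(query):
    """Return the normalized query plus any site-specific equivalents."""
    normalized = query.strip().lower()
    return [normalized] + SYNONYMS.get(normalized, [])


expanded = expand_query("Checking")
print(expanded)
```

Each entry in the returned list would then be run as its own search, with results for the user's original wording kept first.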

Why bother?

Because it isn't the user's job to learn your system. It is your system's
job to learn your user.

Spellchecking isn't that hard, though. I posted a client-side
version a couple of years ago; doing it server-side would be even easier.
But there are a lot of ways to work around no-match scenarios.

But you've still not addressed homophones, synonyms, etc.

But that's where MySQL does it for you. Besides, Google ranks on a lot
of criteria that would be less than helpful on a mid-size site.

So, your answer is no?

Correct.

-Karl

Karl Groves · Aug 24, 2004

Toby Inkster said:
Rankings aren't hard. I'm pretty happy with the rankings on my search
engine. I only have a handful of pages, but the algorithm should work fine
even with many thousands.

Ranking should involve relevance not just on the search string but also
reflect the context of what they're searching for.
It's one thing to count the instances of a search string and sort them. It is
an entirely different matter to return a result based on relevance to the
user.

I don't deal with misspellings, etc. on my site -- if you misspell your
search term you don't *deserve* a result ;-) I can't imagine misspellings
would be that hard though, if you used some third-party software to
suggest corrections (e.g. ispell).

*Most* misspellings shouldn't be hard, except when someone misspells the
word they intend and ends up with a completely different word.

I've also not implemented boolean searches (just exact phrase) yet.

Most users don't know a thing about boolean searching anyway, so implement
it understanding that only "power users" will be taking advantage of the
feature.

More info on searching

Half Web Searchers enter One Query, look at One Page of Results
http://usabilitynews.com/news/article1213.asp

Linking And Searching
http://www.humanfactors.com/downloads/jan032.htm

Brands Suffer from Search Dysfunctions
http://www.clickz.com/experts/brand/brand/article.php/1477641

Why On-Site Searching Stinks
http://www.uie.com/articles/search_stinks/

Why Searches Fail
http://www.searchtools.com/info/whysearchesfail.html

-Karl

Karl Groves · Aug 24, 2004

Art Sackett said:
Who writes the software that drives the web's search engines?

*Teams* of people and usability experts.
The largest internet search engines and companies like MondoSearch have
hired teams of usability experts to ensure that the search engine is
easy-to-use.

-Karl

Chris Morris · Aug 24, 2004

Jeff Thies said:
I think many of us use MySQL...

I notice that MySQL has a full text search. This matches a phrase
like: "full text search of website", and returns a list of results
ordered by the highest degree of matches. Minor words and very
frequent words are excluded.

Sounds very powerful, and since it's nearly trivial to spider a site
and stuff the pages into a table, it would be easy to implement.

So, anyone used this? Or something like it?

We use it in places on our new internal search engine (you can try it
with the search box at http://www.dur.ac.uk/its/) - unlike some other
posters I've found it to be very fast (for a dataset of ~1500 pages
and about 2500 pieces of searchable content).

Chris Morris · Aug 24, 2004

Karl Groves said:
Search engine programming is way too complicated for even the most
experienced programmers.

Now, I wouldn't call myself one of those, but I've implemented a
search engine that gives us better results (mostly) than the htdig
search engine which we used. It also copes a lot better with other
databases by searching them directly, which obviously isn't an option
for most search engines, but in the context of a CMS (which we have)
is useful.

Dealing with things like misspellings, homophones, synonyms and all
that stuff is just the tip of the iceberg.

I have a search log set up and they aren't really that big a problem
(and I don't know of any search engines that really deal properly with
those, famous ones like Google included). e-mail vs email is a tricky
one (especially when you have a lot of content editors) - so that one
had to be dealt with, and I had a much more minor one today, but other
than that it's been fairly straightforward.

Then, when you get into things like ranking the results based on
relevance, you have yourself a major nightmare.

Ranking isn't that tricky - I didn't fiddle around too much with the
ranking algorithm and it gives good results for most search terms.

It's been the right solution for us, I think, anyway.

Chris Morris · Aug 24, 2004

Jeff Thies said:
I notice that MySQL has a full text search.

So, anyone used this? Or something like it?

Forgot to mention in my last post - it has a default lower limit on
word length to index of 4. You will almost certainly want to turn this
down to 3, or possibly lower if you have a lot of short acronyms that
people might want to search on.
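
In MySQL the setting in question is the server variable `ft_min_word_len`, and existing FULLTEXT indexes have to be rebuilt after changing it. To see why the cutoff matters, here is a toy tokenizer in Python that mimics a minimum word length. It is not MySQL's actual parser and has no stopword list; the sample sentence is made up.

```python
import re


def indexable_words(text, min_word_len=4):
    """Return lowercase words at least min_word_len characters long."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_word_len]


text = "Contact the ITS help desk about CMS training"
default_words = indexable_words(text)     # three-letter acronyms are dropped
lowered_words = indexable_words(text, 3)  # they survive with a limit of 3
print(default_words)
print(lowered_words)
```

With the limit at 4, a search for "CMS" can never match, because the word was never indexed in the first place.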

Toby Inkster · Aug 24, 2004

Jeff said:
How are you going about that?

Each page on my (database-backed) site has a lot of meta data stored on
it. In particular, each page has a title, keywords and description.

The search algorithm goes much like this:

============================================================

$q = The word they're looking for;

foreach ( page $p on my site ) {

    $p.score = 0;

    if ($p.body contains $q) {
        $p.score += 1;
    }

    if ($p.description contains $q) {
        $p.score += 2;
    }

    if ($p.keywords contains $q) {
        $p.score += 3;
    }

    if ($p.title contains $q) {
        $p.score += 4;
    }

}

sort pages by score;

foreach ( page $p on my site ) {
    if ($p.score > 0) {
        print link to $p;
    }
}

============================================================

It's pretty simplistic, but if you take care over your description and
keywords then it seems to work well.
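
For anyone who wants to try the idea, here is one way the pseudocode above might be written as runnable Python. The field weights are the ones from the pseudocode; the `Page` class, the sample pages and the case-insensitive matching are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    keywords: str
    description: str
    body: str


# Field weights taken from the pseudocode above.
WEIGHTS = {"body": 1, "description": 2, "keywords": 3, "title": 4}


def score(page, q):
    """Sum the weights of every field that contains the search word."""
    q = q.lower()  # case-insensitive matching is an assumption
    return sum(w for field, w in WEIGHTS.items()
               if q in getattr(page, field).lower())


def search(pages, q):
    """Return (url, score) pairs for matching pages, best score first."""
    scored = [(score(p, q), p) for p in pages]
    hits = [(s, p) for s, p in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [(p.url, s) for s, p in hits]


pages = [
    Page("/about", "About", "contact, staff", "Who we are",
         "We sell widgets and search tools."),
    Page("/search", "Search help", "search, help", "How to search the site",
         "Type a word into the search box."),
    Page("/news", "News", "updates", "Latest news", "Nothing relevant here."),
]
ranked = search(pages, "search")
print(ranked)
```

A page that mentions the word only in its body scores 1, while one that has it in every field scores 10, so well-maintained titles and keywords dominate the ordering.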

Art Sackett · Aug 24, 2004

Karl Groves said:
*Teams* of people and usability experts.

First: "*Teams* of people...":

Uh, Karl, didja somehow miss my reference to Gigablast? One guy, eight
servers: http://www.gigablast.com/

Second: "...and usability experts."

I admit that usability experts can be very valuable folks to have
around, but they don't write software. They design interfaces.

The most-used search engines on the web today are all pretty much the
same on the surface. Some have more space devoted to advertising than
others, but in the main, there's a single text input element sitting on
top of an engine that defaults to a boolean AND query, perhaps with an
option to perform an "advanced search" somewhere. You don't need a
usability expert to tell you how to do that.

Karl Groves · Aug 24, 2004

Art Sackett said:
First: "*Teams* of people...":

Uh, Karl, didja somehow miss my reference to Gigablast? One guy, eight
servers: http://www.gigablast.com/

Looks like it does a great job of searching and a piss-poor job of
displaying results in a manner that is helpful and relevant to users.
I maintain that it usually takes teams to develop them.
Then again, maybe it should be done by one guy, because I've seen teams do a
fine job of buggering the whole thing.

Second: "...and usability experts."

I admit that usability experts can be very valuable folks to have
around, but they don't write software. They design interfaces.

The most-used search engines on the web today are all pretty much the
same on the surface. Some have more space devoted to advertising than
others, but in the main, there's a single text input element sitting on
top of an engine that defaults to a boolean AND query, perhaps with an
option to perform an "advanced search" somewhere. You don't need a
usability expert to tell you how to do that.

Usability experts are used in search engine companies to help development
teams understand how to help users get valuable information from searching.
This includes helping in the process of developing algorithms that reflect
what users *really* want, rather than what they type into a search widget.
One person in this thread said (jokingly) that users who misspell words "get
what they deserve", so to speak. Truth is, that's the attitude we (at my
day job) actually face from development teams who have no clue whatsoever
about how users actually interact with computers. I've seen proprietary
search engines that are so bad that we tell the client to remove the stupid
thing because they're doing more to frustrate users than help them.

My opinion is that if it can't be done right, it needs to be thrown in the
trash. More often than not, the search tools on most sites are pure shit. If
someone thinks a plain old 'SELECT from table WHERE stuff LIKE %string%'
into a database is gonna do the trick, they're kidding themselves and
disappointing their visitors.

-Karl

Art Sackett · Aug 24, 2004

Karl Groves said:
I maintain that it usually takes teams to develop them.

Usually, though, assumes the least common denominator, doesn't it?

Then again, maybe it should be done by one guy, because I've seen teams do a
fine job of buggering the whole thing.

The more developers, the bigger the mess, in my experience. There have
been great teams, certainly, but in my 24 years of professional
involvement in technology (both hardware and software), most "teams"
are comprised of "team players" -- a group who, in the main, I have
absolutely no use for. One great hacker can provide in a few months a
product that outshines by far the collective yearly output of a whole
corporation full of groupthinking team players. I digress...

Usability experts are used in search engine companies to help development
teams understand how to help users get valuable information from searching.

I understand and appreciate that.

This includes helping in the process of developing algorithms that reflect
what users *really* want, rather than what they type into a search widget.

We could go on and on about "what users *really* want", I think. It's a
big subject. Myself, I'm happiest with a search interface that lets me
plug in my booleans directly, a la the old Magellan interface. Google
themselves finally recognized this and began supporting it. Others
can't even define boolean and don't care to know what it means. Users
are a very diverse group. <rant>Dammit, if I plug "flambibulated
gonkulator" into a search engine, I want the results for flambibulated
gonkulator, not rabbit flambe. Don't tell me what I really want!</rant>

I prefer to focus on what users need.

One person in this thread said (jokingly) that users who misspell words "get
what they deserve", so to speak.

The lower vertebra in my back locked up when I read that...

What the user deserves is entirely dependent upon what you can afford
to give him. Not many of us can afford to operate in the red until an
Amazon miracle happens. Spellchecking should obviously be in every
search engine, and synonym provisions in most. Homonyms, well... if you
can afford to add 'em in. Complex heuristics: How much money do you
have?

Truth is, that's the attitude we (at my
day job) actually face from development teams who have no clue whatsoever
about how users actually interact with computers.

You have my sympathy. There's no excuse for that. Do it right, or don't
bother.

I've seen proprietary
search engines that are so bad that we tell the client to remove the stupid
thing because they're doing more to frustrate users than help them.

I've encountered many of those on the web. There's no excuse for them.

My opinion is that if it can't be done right, it needs to be thrown in the
trash.

Yep. If you can't do it right, don't bother.

If
someone thinks a plain old 'SELECT from table WHERE stuff LIKE %string%'
into a database is gonna do the trick, they're kidding themselves and
disappointing their visitors.

I agree. There's a lot more to it than that. I wouldn't expect a guy
who learned a little perl a few months ago to create a solid search
engine, but I also know that teams and committees are responsible for
more project failures than all other causes combined. Google (like HP)
started as two guys in a garage who really knew what they were doing.
(Admittedly, a Wien bridge oscillator is far simpler than a nearly
complete index of the web, and never mind that I have nothing but
disrespect for HP today.)

Remember the infamous passing of the torch from AltaVista to Google?
AltaVista had a bigger team and a lot more money to spend, but they
allowed a stoopud technical mistake to cost them the farm.
