Looking for fast string hash searching

Thomas Christmann · May 13, 2004

Hi!

First, let me apologize for asking this question when there are so many answers
to it on Google, but most of them contradict each other, and making what I
want to do perform well is crucial to my project. So, here's what I have:

My C program connects to a database and gets ca. 50-100K domain name/file path
pairs. Those pairs have to be cached by my application. Building the cache may
take a second or two, but retrieving from it must be very fast. Since I get
the data from a database, I'd be able to order by domain name (which will be
my key, and is guaranteed to be unique), so I thought something like a btree
search for strings might be a good idea. I only have to look up by domain name
from the hash; searching by path is not permitted.
Since I'm far from being an expert on the subject of hashing and search
algorithms, your opinion on how to make this fast is humbly requested.

TIA,

Thomas

Stephen L. · May 13, 2004

This isn't _really_ a `C' question...

If the distribution of "domain" names
is pretty even across the alphabet,
then you could use the 1st letter of
the name as an index to an array of
"pointers" to name/path pairs that
you can `bsearch()'. 100,000 entries
isn't that much now-a-days, and
dividing by 26 (for about 4,000 entries)
should provide a very fast lookup.

Stephen
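The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct entry`, `struct index`, `index_build()` and `index_lookup()` are hypothetical names that do not appear in the thread, the pairs are assumed to be already loaded into one array, and the buckets are keyed on the first byte (256 possible values) rather than on 26 letters. Only `qsort()`, `bsearch()` and `strcmp()` are standard library calls.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: one domain name / file path pair. */
struct entry {
    const char *domain;
    const char *path;
};

/* Hypothetical index: the sorted array plus, for every possible first
   byte c, the offset of the first entry whose domain starts with c.
   start[256] holds the total number of entries. */
struct index {
    struct entry *entries;
    size_t start[257];
};

/* Comparison callback shared by qsort() and bsearch(): order by domain. */
static int cmp_entry(const void *a, const void *b)
{
    const struct entry *ea = a;
    const struct entry *eb = b;
    return strcmp(ea->domain, eb->domain);
}

/* Sort the entries in place and record where each first-byte bucket begins. */
static void index_build(struct index *ix, struct entry *entries, size_t n)
{
    size_t i = 0;
    int c;

    qsort(entries, n, sizeof entries[0], cmp_entry);
    ix->entries = entries;
    for (c = 0; c < 256; c++) {
        ix->start[c] = i;
        /* strcmp() orders bytes as unsigned char, so each bucket is contiguous. */
        while (i < n && (unsigned char)entries[i].domain[0] == c)
            i++;
    }
    ix->start[256] = n;
}

/* Binary-search only the bucket selected by the first byte of the key.
   Returns the cached path, or NULL if the domain is unknown. */
static const char *index_lookup(const struct index *ix, const char *domain)
{
    unsigned char c = (unsigned char)domain[0];
    size_t lo = ix->start[c];
    size_t count = ix->start[c + 1] - lo;
    struct entry key;
    const struct entry *hit;

    key.domain = domain;
    key.path = NULL;
    hit = bsearch(&key, ix->entries + lo, count, sizeof key, cmp_entry);
    return hit != NULL ? hit->path : NULL;
}
```

A caller would fill an array of `struct entry`, call `index_build()` once, and then call `index_lookup()` per request. Note that a plain binary search over 100,000 sorted entries already needs only about 17 string comparisons, and over 4,000 entries about 12, so the first-letter split saves a handful of comparisons at best, and only if the keys really are spread evenly over the buckets.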

Thomas Christmann · May 13, 2004

> This isn't _really_ a `C' question...

I know, I know, and I'm sorry to post here, but you guys usually
help me very much (not knowingly, I suppose) with your posts. Also,
there isn't really an alt.hash.maps.

> If the distribution of "domain" names
> is pretty even across the alphabet,
> then you could use the 1st letter of
> the name as an index to an array of
> "pointers" to name/path pairs that
> you can `bsearch()'. 100,000 entries
> isn't that much now-a-days, and
> dividing by 26 (for about 4,000 entries)
> should provide a very fast lookup.

Sounds good, I'll give that a try.

Thanks,

Thomas

August Derleth · May 14, 2004

> I know, I know, and I'm sorry to post here, but you guys usually
> help me very much (not knowingly, I suppose) with your posts. Also,
> there isn't really an alt.hash.maps

comp.programming or something like that (I've forgotten the exact name)
handles language-agnostic algorithm questions. You should get the
algorithm ironed out first before trying a specific implementation anyway.

James Kanze · May 16, 2004

|> Thomas Christmann wrote:

|> > First let me apologize for asking this question when there are so
|> > many answers to it on Google, but most of them are really
|> > contradicting, and making what I want to do very performant is
|> > crucial to my project. So, here's what I have:

|> > My C program connects to a database and gets ca. 50-100K domain
|> > name/file path pairs. Those pairs have to be cached by my
|> > application. Building the cache may take a second or two, but
|> > retrieving from it must be very fast. Since I get the data from a
|> > database, I'd be able to order by domain name (which will be my
|> > key, and is guaranteed to be unique), so I thought something like
|> > a btree search for strings might be a good idea. I only have to
|> > look up by domain name from the hash, searching by path is not
|> > permitted. Since I'm far from being an expert on the subject of
|> > hashing and search algorithms, your opinion on how to make this
|> > fast is humbly requested

|> If the distribution of "domain" names
|> is pretty even across the alphabet,
|> then you could use the 1st letter of
|> the name as an index to an array of
|> "pointers" to name/path pairs that
|> you can `bsearch()'.

They aren't. I'll bet that well over half of all domains start with
"www.". Also, the alphabet for domain names isn't limited to letters.

I think that for this application, nothing will beat a good hash code.
The trick is, of course, to avoid a bad one; for some reason, URLs
seem to be very sensitive to bad hash codes. A Google search for FNV
hashing should turn up what you need -- if performance of the hash
itself turns out to be an issue, and your hardware doesn't handle
arbitrary multiplies very rapidly, I've also used Mersenne prime based
hash codes in the past with good results. (The basic algorithm is the
same as for FNV hashing, but the multiplier is a Mersenne prime, so the
multiplication can easily be done with a shift and a subtraction.)
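For illustration, here is a minimal sketch of the two hash functions mentioned above, plus a small lookup table built on the first one. The FNV-1a constants are the published 32-bit ones. Everything else is an assumption made for this example: the Mersenne prime 127 and the exact loop of `hash_mersenne()` are one plausible reading of the description, not necessarily the code the poster used, and `struct cache`, `cache_init()`, `cache_put()` and `cache_get()` are hypothetical names for a simple open-addressing table with linear probing.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime. */
static uint32_t hash_fnv1a(const char *s)
{
    uint32_t h = 2166136261u;               /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                     /* FNV prime */
    }
    return h;
}

/* Assumed variant: multiplier is the Mersenne prime 127 = 2^7 - 1,
   so h * 127 is computed as (h << 7) - h, with no multiply needed. */
static uint32_t hash_mersenne(const char *s)
{
    uint32_t h = 0;
    for (; *s != '\0'; s++)
        h = ((h << 7) - h) + (unsigned char)*s;
    return h;
}

/* Hypothetical table slot; domain == NULL marks an empty slot. */
struct slot {
    const char *domain;
    const char *path;
};

struct cache {
    struct slot *slots;
    size_t mask;                            /* capacity - 1; capacity is a power of two */
};

/* Allocate at least twice as many slots as expected entries, so the
   table never fills up. Returns 1 on success, 0 on allocation failure. */
static int cache_init(struct cache *c, size_t expected)
{
    size_t cap = 16, i;

    while (cap < expected * 2)
        cap *= 2;
    c->slots = malloc(cap * sizeof *c->slots);
    if (c->slots == NULL)
        return 0;
    for (i = 0; i < cap; i++)
        c->slots[i].domain = NULL;
    c->mask = cap - 1;
    return 1;
}

/* Insert or overwrite a pair. The strings are not copied. The caller
   must not store more than the `expected` count given to cache_init(). */
static void cache_put(struct cache *c, const char *domain, const char *path)
{
    size_t i = hash_fnv1a(domain) & c->mask;

    while (c->slots[i].domain != NULL && strcmp(c->slots[i].domain, domain) != 0)
        i = (i + 1) & c->mask;              /* linear probing */
    c->slots[i].domain = domain;
    c->slots[i].path = path;
}

/* Return the cached path, or NULL if the domain is unknown. */
static const char *cache_get(const struct cache *c, const char *domain)
{
    size_t i = hash_fnv1a(domain) & c->mask;

    while (c->slots[i].domain != NULL) {
        if (strcmp(c->slots[i].domain, domain) == 0)
            return c->slots[i].path;
        i = (i + 1) & c->mask;
    }
    return NULL;
}
```

With the table kept at most half full, a lookup costs one hash computation and, typically, one or two `strcmp()` calls, regardless of how many keys share a "www." prefix. Swapping `hash_fnv1a()` for `hash_mersenne()` inside `cache_put()` and `cache_get()` would be the way to compare the two on real data.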
