Looking for fast string hash searching

Discussion in 'C Programming' started by Thomas Christmann, May 13, 2004.

  1. Hi!

    First let me apologize for asking this question when there are so many answers
    to it on Google, but most of them are really contradictory, and making what I
    want to do very performant is crucial to my project. So, here's what I have:

    My C program connects to a database and gets ca. 50-100K domain name/file path
    pairs. Those pairs have to be cached by my application. Building the cache may
    take a second or two, but retrieving from it must be very fast. Since I get
    the data from a database, I'd be able to order by domain name (which will be
    my key, and is guaranteed to be unique), so I thought something like a btree
    search for strings might be a good idea. I only have to look up by domain name
    from the hash, searching by path is not permitted.
    Since I'm far from being an expert on the subject of hashing and search
    algorithms, your opinion on how to make this fast is humbly requested :)

    TIA,

    Thomas
     
    Thomas Christmann, May 13, 2004
    #1

  2. Stephen L.

    Thomas Christmann wrote:
    >
    > Hi!
    >
    > First let me apologize for asking this question when there are so many answers
    > to it on Google, but most of them are really contradictory, and making what I
    > want to do very performant is crucial to my project. So, here's what I have:
    >
    > My C program connects to a database and gets ca. 50-100K domain name/file path
    > pairs. Those pairs have to be cached by my application. Building the cache may
    > take a second or two, but retrieving from it must be very fast. Since I get
    > the data from a database, I'd be able to order by domain name (which will be
    > my key, and is guaranteed to be unique), so I thought something like a btree
    > search for strings might be a good idea. I only have to look up by domain name
    > from the hash, searching by path is not permitted.
    > Since I'm far from being an expert on the subject of hashing and search
    > algorithms, your opinion on how to make this fast is humbly requested :)
    >
    > TIA,
    >
    > Thomas


    This isn't _really_ a `C' question...

    If the distribution of "domain" names
    is pretty even across the alphabet,
    then you could use the 1st letter of
    the name as an index to an array of
    "pointers" to name/path pairs that
    you can `bsearch()'. 100,000 entries
    isn't that much nowadays, and
    dividing by 26 (for about 4,000 entries
    per letter) should provide a very fast lookup.
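
    A minimal sketch of that idea, assuming the pairs sit in one array sorted
    by domain name (via ORDER BY on the database side, or qsort() as here).
    The names `struct entry', `build_buckets' and `lookup' are made up for
    illustration, and the index covers all 256 byte values rather than 26
    letters, since domain names can also start with digits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: one cached domain name / file path pair. */
struct entry {
    const char *domain;
    const char *path;
};

/* Comparator shared by qsort() and bsearch(): order by domain name. */
static int cmp_entry(const void *a, const void *b)
{
    const struct entry *ea = a;
    const struct entry *eb = b;
    return strcmp(ea->domain, eb->domain);
}

/* bucket_start[c] is the index of the first entry whose domain begins
 * with byte c; bucket_start[c + 1] is one past the last such entry. */
static size_t bucket_start[257];

/* The table must already be sorted with cmp_entry. */
static void build_buckets(const struct entry *tab, size_t n)
{
    size_t i = 0;
    for (int c = 0; c < 256; c++) {
        bucket_start[c] = i;
        while (i < n && (unsigned char)tab[i].domain[0] == c)
            i++;
    }
    bucket_start[256] = n;
}

/* Returns the path stored for domain, or NULL if it is not cached. */
static const char *lookup(const struct entry *tab, const char *domain)
{
    assert(tab != NULL && domain != NULL);

    unsigned char c = (unsigned char)domain[0];
    size_t lo = bucket_start[c];
    size_t hi = bucket_start[c + 1];
    struct entry key = { domain, NULL };
    const struct entry *hit;

    if (lo == hi)
        return NULL;            /* no domain starts with this byte */
    hit = bsearch(&key, tab + lo, hi - lo, sizeof *tab, cmp_entry);
    return hit ? hit->path : NULL;
}
```

    Whether the first-byte index pays off depends on how evenly the names
    are spread. A plain bsearch() over all 100,000 entries already needs
    only about 17 comparisons.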


    Stephen
     
    Stephen L., May 13, 2004
    #2

  3. > This isn't _really_ a `C' question...

    I know, I know, and I'm sorry to post here, but you guys usually
    help me very much (not knowingly, I suppose) with your posts. Also,
    there isn't really an alt.hash.maps :)

    > If the distribution of "domain" names
    > is pretty even across the alphabet,
    > then you could use the 1st letter of
    > the name as an index to an array of
    > "pointers" to name/path pairs that
    > you can `bsearch()'. 100,000 entries
    > isn't that much now-a-days, and
    > dividing by 26 (for about 4,000 entries)
    > should provide a very fast lookup.


    Sounds good, I'll give that a try.

    Thanks,

    Thomas
     
    Thomas Christmann, May 13, 2004
    #3
  4. On Thu, 13 May 2004 07:47:36 -0700, Thomas Christmann wrote:

    >> This isn't _really_ a `C' question...

    >
    > I know, I know, and I'm sorry to post here, but you guys usually
    > help me very much (not knowingly, I suppose) with your posts. Also,
    > there isn't really an alt.hash.maps :)


    comp.programming or something like that (I've forgotten the exact name)
    handles language-agnostic algorithm questions. You should get the
    algorithm ironed out first before trying a specific implementation anyway.

    --
    yvoregnevna gjragl-guerr gjb-gubhfnaq guerr ng lnubb qbg pbz
    To email me, rot13 and convert spelled-out numbers to numeric form.
    "Makes hackers smile" makes hackers smile.
     
    August Derleth, May 14, 2004
    #4
  5. James Kanze

    "Stephen L." writes:

    |> Thomas Christmann wrote:

    |> > First let me apologize for asking this question when there are so
    |> > many answers to it on Google, but most of them are really
    |> > contradictory, and making what I want to do very performant is
    |> > crucial to my project. So, here's what I have:

    |> > My C program connects to a database and gets ca. 50-100K domain
    |> > name/file path pairs. Those pairs have to be cached by my
    |> > application. Building the cache may take a second or two, but
    |> > retrieving from it must be very fast. Since I get the data from a
    |> > database, I'd be able to order by domain name (which will be my
    |> > key, and is guaranteed to be unique), so I thought something like
    |> > a btree search for strings might be a good idea. I only have to
    |> > look up by domain name from the hash, searching by path is not
    |> > permitted. Since I'm far from being an expert on the subject of
    |> > hashing and search algorithms, your opinion on how to make this
    |> > fast is humbly requested :)

    |> If the distribution of "domain" names
    |> is pretty even across the alphabet,
    |> then you could use the 1st letter of
    |> the name as an index to an array of
    |> "pointers" to name/path pairs that
    |> you can `bsearch()'.

    They aren't. I'll bet that well over half of all domains start with
    "www.". Also, the alphabet for domain names isn't limited to letters.

    I think that for this application, nothing will beat a good hash code.
    The trick is, of course, to avoid a bad one :); for some reason, URLs
    seem to be very sensitive to bad hash codes. A Google search for FNV
    hashing should turn up what you need. If performance of the hash
    itself turns out to be an issue, and your hardware doesn't handle
    arbitrary multiplies very rapidly, I've also used Mersenne prime based
    hash codes in the past with good results. (The basic algorithm is the
    same as for FNV hashing, but the multiplier is a Mersenne prime, so the
    multiplication can easily be done with a shift and a subtraction.)
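
    For illustration, a rough sketch of both hash functions and a small
    chained table built on them. The constants are those of the 32-bit
    FNV-1a variant; the multiplier 127 (2^7 - 1), the table size and the
    names `cache_put' / `cache_get' are assumptions chosen for the example,
    not something taken from the post.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime. */
static uint32_t hash_fnv1a(const char *s)
{
    uint32_t h = 2166136261u;               /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                     /* FNV prime */
    }
    return h;
}

/* Same loop shape with a Mersenne prime multiplier. 127 = 2^7 - 1 is an
 * assumed choice; h * 127 is computed as (h << 7) - h. */
static uint32_t hash_mersenne(const char *s)
{
    uint32_t h = 0;
    for (; *s != '\0'; s++)
        h = ((h << 7) - h) + (unsigned char)*s;
    return h;
}

/* Hypothetical table size: a power of two above the expected 100K keys,
 * so the hash can be reduced with a mask instead of a division. */
#define NBUCKETS 131072u

struct node {
    const char *domain;
    const char *path;
    struct node *next;                      /* collision chain */
};

static struct node *buckets[NBUCKETS];

/* Inserts a pair; keys are assumed unique. Returns 0, or -1 if out of memory. */
static int cache_put(const char *domain, const char *path)
{
    uint32_t i = hash_fnv1a(domain) & (NBUCKETS - 1);
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->domain = domain;
    n->path = path;
    n->next = buckets[i];
    buckets[i] = n;
    return 0;
}

/* Returns the path stored for domain, or NULL if it is not cached. */
static const char *cache_get(const char *domain)
{
    assert(domain != NULL);
    uint32_t i = hash_fnv1a(domain) & (NBUCKETS - 1);
    for (const struct node *n = buckets[i]; n != NULL; n = n->next)
        if (strcmp(n->domain, domain) == 0)
            return n->path;
    return NULL;
}
```

    The table stores the pointers it is given, so the strings must outlive
    the cache. Swapping `hash_fnv1a' for `hash_mersenne' in the two table
    functions is enough to compare them on real data.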

    --
    James Kanze
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
     
    James Kanze, May 16, 2004
    #5
