efficient way of looking up huge hashes

Discussion in 'C Programming' started by Ram Prasad, May 25, 2007.

  1. Ram Prasad

    Ram Prasad Guest

    Hi,
    I am quite a newbie to C programming (not new to programming
    itself, though).
    I am writing an email blacklist application that will look up huge DB
    files like

    ----------
    somedomain.com=emailid1
    someotherdomain.com=emailid2
    somedomain.com=senderdomain1
    ....
    ....

    ---------------------

    There would be anywhere between 10k and 200k such records.
    What is the best way of looking up such a list? I want the
    queries to return very fast; the memory footprint would ideally be low,
    but that is not a limiting factor.

    Should I use something like a Berkeley DB and read from a DB file, or
    should I read the entire stuff into memory?


    Thanks
    Ram

    PS:
    Note to Spammers: Go ahead, send me spam

    http://ecm.netcore.co.in/spamtrap.html
    Ram Prasad, May 25, 2007
    #1

  2. In article <>,
    Ram Prasad <> wrote:

    >I am writing an email blacklist application that will lookup huge DB
    >files like
    >
    >----------
    >somedomain.com=emailid1
    >someotherdomain.com=emailid2
    >somedomain.com=senderdomain1
    >...
    >...
    >
    >---------------------
    >
    >there would be anywhere between 10k-200k such records


    Even with 200k records that's still only about 5MB, which is hardly
    huge. Does a single run of the program just do one lookup? If so,
    and it runs infrequently, you could just read the file in and compare
    the strings as you go. On the other hand, if it sits there repeatedly
    processing addresses, it would be reasonable to use an in-memory hash
    table.

    If you're doing single lookups but running the program very often (say
    once a second), or if you think the list could get much bigger, then
    an on-disk hash table such as Berkeley DB would be the way to go. It
    has the advantage that it won't read the whole file.

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, May 25, 2007
    #2
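    [For illustration: a minimal sketch of the read-the-file-and-compare
    approach Richard describes for an infrequent, one-off lookup. Standard C
    only; key_in_file() is a made-up helper name, and each line of the file
    is treated as one complete key.]

    ----------
    #include <stdio.h>
    #include <string.h>

    /* Scan the blacklist file line by line; return 1 on an exact match. */
    static int key_in_file(const char *path, const char *key)
    {
        char line[512];
        FILE *fp = fopen(path, "r");
        int found = 0;

        if (fp == NULL)
            return 0;
        while (!found && fgets(line, sizeof line, fp) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
            if (strcmp(line, key) == 0)
                found = 1;
        }
        fclose(fp);
        return found;
    }
    ----------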

  3. Ram Prasad

    Ram Prasad Guest

    On May 25, 5:46 pm, (Richard Tobin) wrote:
    > In article <>,
    > Ram Prasad <> wrote:
    >
    > >I am writing an email blacklist application that will lookup huge DB
    > >files like
    > >
    > >----------
    > >somedomain.com=emailid1
    > >someotherdomain.com=emailid2
    > >somedomain.com=senderdomain1
    > >...
    > >...
    > >
    > >---------------------
    > >
    > >there would be anywhere between 10k-200k such records
    >
    > Even with 200k records that's still only about 5MB, which is hardly
    > huge. Does a single run of the program just do one lookup? If so,
    > and it runs infrequently, you could just read the file in and compare
    > the strings as you go. On the other hand, if it sits there repeatedly
    > processing addresses, it would be reasonable to use an in-memory hash
    > table.
    >
    > If you're doing single lookups but running the program very often (say
    > once a second), or if you think the list could get much bigger, then
    > an on-disk hash table such as Berkeley DB would be the way to go. It
    > has the advantage that it won't read the whole file.
    >
    > -- Richard
    > --
    > "Consideration shall be given to the need for as many as 32 characters
    > in some alphabets" - X3.4, 1963.


    This will be running in a mail-filter daemon. A single instance
    would potentially do thousands of lookups. Since the program would be a
    daemon, I could read the entire DB into memory during startup and
    use it from memory.
    Which libraries should I use for such lookups? I don't
    need a hash lookup, just an if_key_exists() lookup.
    Ram Prasad, May 25, 2007
    #3
  4. On Fri, 25 May 2007 06:17:50 -0700, Ram Prasad wrote:

    > On May 25, 5:46 pm, (Richard Tobin) wrote:
    >> In article <>,
    >> Ram Prasad <> wrote:
    >>
    >> >I am writing an email blacklist application that will lookup huge DB
    >> >files like
    >> >
    >> >----------
    > <snip>
    > This will be running in a mail-filter daemon. A single instance
    > would potentially do thousands of lookups. Since the program would be a
    > daemon, I could read the entire DB into memory during startup and
    > use it from memory.
    > Which libraries should I use for such lookups? I don't
    > need a hash lookup, just an if_key_exists() lookup.


    http://judy.sourceforge.net/
    Duncan Muirhead, May 25, 2007
    #4
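    [For illustration: a rough sketch of an existence check with a JudySL
    string-keyed array, based on the JSLI/JSLG macros and Pvoid_t/PWord_t
    types from <Judy.h> as documented at the site above. Treat the exact
    macro usage as an assumption and check the Judy docs before relying on
    it; the sample keys are made up and error checking (PJERR) is omitted.]

    ----------
    #include <stdio.h>
    #include <stdint.h>
    #include <Judy.h>                 /* JudySL: dynamic array keyed by strings */

    int main(void)
    {
        Pvoid_t blacklist = (Pvoid_t) NULL;   /* empty JudySL array */
        PWord_t pvalue;

        /* insert: JSLI points pvalue at the word stored under that key */
        JSLI(pvalue, blacklist, (uint8_t *) "somedomain.com=emailid1");
        *pvalue = 1;

        /* lookup: JSLG sets pvalue to NULL when the key is absent */
        JSLG(pvalue, blacklist, (uint8_t *) "somedomain.com=emailid1");
        printf("listed: %s\n", pvalue ? "yes" : "no");

        JSLG(pvalue, blacklist, (uint8_t *) "other.example=nobody");
        printf("listed: %s\n", pvalue ? "yes" : "no");
        return 0;
    }
    ----------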
  5. Tor Rustad

    Tor Rustad Guest

    Ram Prasad wrote:

    >
    > Should I use something like a Berkeley DB and read from a DB file or
    > should I read the entire stuff into memory.
    >


    I'm having a similar problem, and was thinking about using Berkeley DB too.

    I wouldn't worry too much about the performance issue: if you have the table
    in memory, the OS might swap your unused memory pages to disk anyway.
    Likewise, if you access some disk location a lot, it will typically be in
    cache.

    So, using Berkeley DB is a good idea. Knuth once said:

    "premature optimization is the root of all evil"

    --
    Tor <torust [at] online [dot] no>
    Tor Rustad, May 26, 2007
    #5
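    [For illustration: a rough sketch of the Berkeley DB approach, using the
    4.x C API (db_create / DB->open / DB->get). The file name "blacklist.db"
    and the key_exists() helper are made up, and a real daemon would open the
    handle once at startup rather than per lookup.]

    ----------
    #include <stdio.h>
    #include <string.h>
    #include <db.h>                       /* Berkeley DB */

    /* Return 1 if keystr exists in the on-disk hash database, 0 otherwise. */
    static int key_exists(DB *dbp, const char *keystr)
    {
        DBT key, data;

        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        key.data = (void *) keystr;
        key.size = (u_int32_t) strlen(keystr);
        return dbp->get(dbp, NULL, &key, &data, 0) == 0;
    }

    int main(void)
    {
        DB *dbp;
        int rc;

        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        /* open an existing DB_HASH database read-only */
        rc = dbp->open(dbp, NULL, "blacklist.db", NULL, DB_HASH, DB_RDONLY, 0);
        if (rc == 0)
            printf("listed: %s\n",
                   key_exists(dbp, "somedomain.com=emailid1") ? "yes" : "no");
        dbp->close(dbp, 0);
        return rc == 0 ? 0 : 1;
    }
    ----------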
  6. Ram Prasad

    Ram Prasad Guest

    On May 26, 5:58 am, Tor Rustad <> wrote:
    > Ram Prasad wrote:
    >
    > > Should I use something like a Berkeley DB and read from a DB file or
    > > should I read the entire stuff into memory.
    >
    > I'm having a similar problem, and was thinking about using Berkeley DB too.
    >
    > I wouldn't worry too much about the performance issue, if you have the table
    > in memory, the OS might swap your unused memory pages to disk anyway.
    > Likewise, if you access some disk location a lot, it will typically be in
    > cache.
    >
    > So, using Berkeley DB is a good idea.


    I am evaluating multiple options. I think tinycdb looks very
    promising:
    http://www.corpit.ru/mjt/tinycdb.html

    It compiles easily on my machine and the example scripts work without
    much fuss.
    Unfortunately, with so many options available, there is no single
    standard method. How does one make a choice?
    Ram Prasad, May 28, 2007
    #6
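    [For illustration: a rough sketch against the tinycdb API (cdb_init /
    cdb_find / cdb_free from <cdb.h>), assuming a constant database file
    "blacklist.cdb" has been built beforehand with the package's tools; the
    file name and key are made up.]

    ----------
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cdb.h>                      /* tinycdb */

    int main(void)
    {
        struct cdb cdb;
        const char *key = "somedomain.com=emailid1";
        int fd = open("blacklist.cdb", O_RDONLY);

        if (fd < 0 || cdb_init(&cdb, fd) != 0)
            return 1;

        /* cdb_find() returns > 0 when the key is present */
        if (cdb_find(&cdb, key, (unsigned) strlen(key)) > 0)
            printf("blacklisted\n");
        else
            printf("not listed\n");

        cdb_free(&cdb);
        close(fd);
        return 0;
    }
    ----------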
  7. Tor Rustad

    Tor Rustad Guest

    Re: efficient way of looking up huge hashes [OT]

    Ram Prasad wrote:
    > On May 26, 5:58 am, Tor Rustad <> wrote:
    >> Ram Prasad wrote:
    >>
    >>> Should I use something like a Berkeley DB and read from a DB file or
    >>> should I read the entire stuff into memory.
    >>
    >> I'm having a similar problem, and was thinking about using Berkeley DB too.
    >>
    >> I wouldn't worry too much about the performance issue, if you have the table
    >> in memory, the OS might swap your unused memory pages to disk anyway.
    >> Likewise, if you access some disk location a lot, it will typically be in
    >> cache.
    >>
    >> So, using Berkeley DB is a good idea.
    >
    > I am evaluating multiple options. I think tinycdb looks very
    > promising
    > http://www.corpit.ru/mjt/tinycdb.html


    Nice and simple C API, but rather limited platform support.

    > Unfortunately with so many options available there is no single
    > standard method. How does one make his choice


    Yes, there are pros and cons. If I have a tuning problem, I
    prototype multiple solutions and select the one that gets the job done
    with minimal drawbacks. In practice,

    * robustness / maturity
    * maintainability
    * support / documentation
    * portability
    * fingerprint
    * security
    * error handling
    * simplicity

    etc.

    might be important design parameters too. I rarely select the fastest
    solution.

    The simplest solution to compare with would perhaps be holding the
    blacklist in memory and using qsort()/bsearch(). That prototype can be
    made in no time.

    However, for expert advice, you should instead consult one of the many
    database newsgroups.

    --
    Tor <torust [at] online [dot] no>
    Tor Rustad, May 28, 2007
    #7
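    [For illustration: a minimal sketch of the qsort()/bsearch() prototype
    Tor suggests. Standard C only; cmpstr() is a made-up helper name, and a
    few hard-coded entries from the original post stand in for the records
    the daemon would read from the DB file at startup.]

    ----------
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* comparison function for an array of string pointers */
    static int cmpstr(const void *a, const void *b)
    {
        return strcmp(*(const char * const *) a, *(const char * const *) b);
    }

    int main(void)
    {
        /* in the real daemon these entries would be read from the DB file */
        const char *blacklist[] = {
            "somedomain.com=emailid1",
            "someotherdomain.com=emailid2",
            "somedomain.com=senderdomain1",
        };
        size_t n = sizeof blacklist / sizeof blacklist[0];
        const char *probe = "somedomain.com=emailid1";
        const char **hit;

        qsort(blacklist, n, sizeof blacklist[0], cmpstr);  /* once, at startup */
        hit = bsearch(&probe, blacklist, n, sizeof blacklist[0], cmpstr);
        printf("%s\n", hit ? "listed" : "not listed");
        return 0;
    }
    ----------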
  8. Ram Prasad

    Ram Prasad Guest

    Re: efficient way of looking up huge hashes [OT]

    On May 28, 7:46 pm, Tor Rustad <torust_at_online.no> wrote:
    > Ram Prasad wrote:
    > <snip>
    > > I am evaluating multiple options. I think tinycdb looks very
    > > promising:
    > > http://www.corpit.ru/mjt/tinycdb.html
    >
    > Nice and simple C API, but rather limited platform support.
    >
    > > Unfortunately, with so many options available, there is no single
    > > standard method. How does one make a choice?
    >
    > Yes, there are pros and cons. If I have a tuning problem, I
    > prototype multiple solutions and select the one that gets the job done
    > with minimal drawbacks. In practice,
    >
    > * robustness / maturity
    > * maintainability
    > * support / documentation
    > * portability
    > * fingerprint
    > * security
    > * error handling
    > * simplicity
    >
    > etc.
    >
    > might be important design parameters too. I rarely select the fastest
    > solution.
    >
    > The simplest solution to compare with would perhaps be holding the
    > blacklist in memory and using qsort()/bsearch(). That prototype can be
    > made in no time.


    Can you please explain "fingerprint"?


    Would you suggest that bsearch() on 5MB of data would be better than a
    hash-DB lookup?
    What are the implications of multiple instances running? Can they share
    the same data area?


    Thanks
    Ram

    PS:
    Note to Spammers: Go ahead, send me spam

    http://ecm.netcore.co.in/spamtrap.html
    Ram Prasad, May 30, 2007
    #8
  9. Tor Rustad

    Tor Rustad Guest

    Re: efficient way of looking up huge hashes [OT]

    Ram Prasad wrote:

    > Can you please explain "fingerprint"?


    That was a typo for "footprint", i.e. the memory usage.

    > Would you suggest that bsearch() on 5MB of data would be better than a
    > hash-DB lookup?


    The relevant question for you: is a simple qsort/bsearch sufficient?


    > What are the implications of multiple instances running? Can they share
    > the same data area?


    Those implications are system-specific; we don't discuss file locks,
    mutexes, thread programming, etc. here.

    --
    Tor <torust [at] online [dot] no>
    Tor Rustad, May 30, 2007
    #9