How to quickly search through arrays?

Discussion in 'C++' started by Steve555, Feb 22, 2010.

  1. Steve555

    Steve555 Guest

    Hi

    I have a set of 171 entities (in this case they're integers
    representing musical phrases).
    These exist in sequences of length 4, and I have a library (lookup
    table) of such sequences,
    e.g.
    2 58 93 170
    38 39 100 150
    38 40 41 87
    etc.

    The table is sorted and is quite sparse; it contains 333,932 of a
    possible 855,036,081 (171^4) sequences.

    I'm looking for fast ways to find matches for a query containing 1 or
    2 wildcards.
    e.g. find matches for:
    2 58 ? 170
    ? 39 ? 150

    At the moment I'm using brute force, just iterating through the
    lookup table, but this is proving very slow as the algorithm needs to
    do hundreds of searches in close to real time.

    I know very little about searching, so any ideas for me to study would
    be welcomed.
    I do know how to do a binary search, but only from the first entry, so
    if that is a wildcard I'm stumped.

    I'm happy to re-format/re-package the data if it could improve the
    search speed. I'm not concerned about size or memory usage.

    Thanks for any suggestions

    Steve
     
    Steve555, Feb 22, 2010
    #1

  2. On 22/02/10 08:02, Steve555 wrote:
    > [...]
    > I'm looking for fast ways to find matches for a query containing 1 or
    > 2 wildcards.
    > [...]
    > I do know how to do a binary search, but only from the first entry, so
    > if that is a wildcard I'm stumped.


    Sounds like you might benefit from a ternary search tree. Its lookup
    performance is comparable to a hash table's, and it supports partial
    (wildcard) searches.

    http://www.cs.princeton.edu/~rs/strings/
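    For reference, here is a minimal ternary-search-tree sketch for these
    4-phrase sequences, with a Bentley/Sedgewick-style partial-match
    search (-1 plays the wildcard; the node layout and names are my own,
    the cited page describes the real thing):

```cpp
#include <cassert>
#include <vector>

struct Node {
    int split;           // phrase ID stored at this node
    Node *lo, *eq, *hi;  // standard ternary links
    bool end;            // true if a stored sequence ends here
    Node(int s) : split(s), lo(0), eq(0), hi(0), end(false) {}
};

// Insert a sequence of `len` phrase IDs.
Node* insert(Node* n, const int* s, int len) {
    if (!n) n = new Node(*s);
    if (*s < n->split)      n->lo = insert(n->lo, s, len);
    else if (*s > n->split) n->hi = insert(n->hi, s, len);
    else if (len == 1)      n->end = true;
    else                    n->eq = insert(n->eq, s + 1, len - 1);
    return n;
}

// Partial-match search: -1 in the query matches any phrase.
void pmsearch(const Node* n, const int* q, int len,
              std::vector<int>& prefix, std::vector<std::vector<int> >& out) {
    if (!n) return;
    bool wild = (*q == -1);
    if (wild || *q < n->split) pmsearch(n->lo, q, len, prefix, out);
    if (wild || *q == n->split) {
        prefix.push_back(n->split);
        if (len == 1) { if (n->end) out.push_back(prefix); }
        else pmsearch(n->eq, q + 1, len - 1, prefix, out);
        prefix.pop_back();
    }
    if (wild || *q > n->split) pmsearch(n->hi, q, len, prefix, out);
}
```

    A wildcard only fans out over the subtrees that actually exist, which
    is why this can beat a full scan on a sparse table.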

    --
    Max
     
    Maxim Yegorushkin, Feb 22, 2010
    #2

  3. Steve555

    Öö Tiib Guest

    On Feb 22, 10:02 am, Steve555 <> wrote:
    > [...]
    > I'm looking for fast ways to find matches for a query containing 1 or
    > 2 wildcards.
    > [...]
    > I do know how to do a binary search, but only from the first entry, so
    > if that is a wildcard I'm stumped.

    You may create index vectors into your table, each sorted by one of
    the other phrase positions. To search by the first phrase, use
    std::lower_bound() on the table itself; to search by another phrase,
    use std::lower_bound() on the corresponding sorted index.
    If your array is static (no new phrases are ever added), you may also
    want to build pre-searched indexes to quicken it up even more.
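    A sketch of the index-vector idea (C++11 for brevity; `buildIndex`
    and `findByColumn` are my own names, not from the post):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Seq { int p[4]; };

// Build an index vector: row numbers sorted by column `col` of the table.
std::vector<size_t> buildIndex(const std::vector<Seq>& table, int col) {
    std::vector<size_t> idx(table.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return table[a].p[col] < table[b].p[col]; });
    return idx;
}

// All rows whose column `col` equals `value`, via binary search on the index.
std::vector<size_t> findByColumn(const std::vector<Seq>& table,
                                 const std::vector<size_t>& idx,
                                 int col, int value) {
    auto lo = std::lower_bound(idx.begin(), idx.end(), value,
                               [&](size_t row, int v) { return table[row].p[col] < v; });
    auto hi = std::upper_bound(idx.begin(), idx.end(), value,
                               [&](int v, size_t row) { return v < table[row].p[col]; });
    return std::vector<size_t>(lo, hi);
}
```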

    > I'm happy to re-format/re-package the data if it could improve the
    > search speed. I'm not concerned about size or memory usage.


    Possibly you still want to make some alternative implementations and
    profile them integrated into your application. Memory usage is a
    factor that often drags performance down; it is common to
    underestimate how devastating frequent cache misses are to
    performance.

     
    Öö Tiib, Feb 22, 2010
    #3
  4. Hi,

    Steve555 wrote:
    > [...]
    > I'm looking for fast ways to find matches for a query containing 1 or
    > 2 wildcards.
    > [...]
    > At the moment I'm using brute force, just iterating through the
    > lookup table, but this is proving very slow as the algorithm needs to
    > do hundreds of searches in close to real time.


    What about the change rate of your data set?

    If the data changes rarely, create six hash tables, one for each pair
    of columns. If you want to speed up requests with only one wildcard
    even more, nest a hash table keyed on a third column inside the value
    type of the above hash tables. This makes all lookups with one or two
    wildcards O(1). Note that you do not need to support all combinations,
    since with three given columns you can choose the first hash table
    appropriately.
    If memory consumption matters more than scalability, you could
    replace the many inner hash tables with sorted lists.

    Btw., how large can your numbers grow? If an unsigned char is
    sufficient, you can pack an entire line into a single 32-bit
    structure. Even with unsigned short data, keeping redundant copies of
    your data in the data structures is better than storing references.
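    The packing mentioned here could look like this (a sketch; since
    every phrase ID fits in a byte, four of them fit one 32-bit word, and
    the packed words sort in the same order as the rows):

```cpp
#include <cassert>
#include <cstdint>

// Pack four phrase IDs (each 0..170, so one byte each) into a 32-bit word.
// The first column goes in the most significant byte, so sorting packed
// words gives the same order as sorting the rows lexicographically.
inline uint32_t pack(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Extract column `col` (0..3) back out of a packed word.
inline unsigned column(uint32_t w, int col) {
    return (w >> (8 * (3 - col))) & 0xFFu;
}
```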


    > I do know how to do a binary search, but only from the first entry, so
    > if that is a wildcard I'm stumped.


    You need different entry points for different types of queries.


    Marcel
     
    Marcel Müller, Feb 22, 2010
    #4
  5. Steve555

    Steve555 Guest

    On 22 Feb, 09:34, Maxim Yegorushkin <>
    wrote:
    > [...]
    > Sounds like you might benefit from a ternary search tree. Its
    > lookup performance is comparable to a hash table's, and it supports
    > partial (wildcard) searches.
    >
    > http://www.cs.princeton.edu/~rs/strings/


    Thanks Max, the Dr Dobbs link there has a lot of useful stuff.
     
    Steve555, Feb 22, 2010
    #5
  6. Steve555

    Steve555 Guest

    On 22 Feb, 09:53, Öö Tiib <> wrote:
    > [...]
    > You may create index vectors into your table, each sorted by one of
    > the other phrase positions. To search by the first phrase, use
    > std::lower_bound() on the table itself; to search by another phrase,
    > use std::lower_bound() on the corresponding sorted index.
    > If your array is static (no new phrases are ever added), you may
    > also want to build pre-searched indexes to quicken it up even more.


    Did you mean make 3 more index vectors, for the elements in the 2nd,
    3rd and 4th positions?
    I can see how that would let me use binary search on any 'column'.

    I'm not sure what 'pre-searched indexes' means.

    Thanks

    Steve
     
    Steve555, Feb 22, 2010
    #6
  7. Steve555

    Steve555 Guest

    On 22 Feb, 10:06, Marcel Müller <> wrote:
    > [...]
    > If the data changes rarely, create six hash tables, one for each
    > pair of columns. [...]
    >
    > Btw., how large can your numbers grow? If an unsigned char is
    > sufficient, you can pack an entire line into a single 32-bit
    > structure.
    > [...]
    > You need different entry points for different types of queries.

    Thanks for the comprehensive reply. I've just googled as much as I
    could to try and understand before replying.
    I really need to learn about hash tables, before I get why I need six
    of them!
    (I did say I was new to search techniques ;) )

    The lookup table (library of sequences) never changes. The elements
    will never be bigger than 171, so yes I could pack 4 of them into 32
    bits.
    Are you, then, saying to store each sequence-of-4 in a single
    unsigned long, and do some kind of bit search?
    Sorry, I don't know how that would work... how would I differentiate
    between the ones that exist in the library and those that don't?

    Thanks

    Steve
     
    Steve555, Feb 22, 2010
    #7
  8. On Feb 22, 2:32 pm, Steve555 <> wrote:
    > [...]
    > The lookup table (library of sequences) never changes. The elements
    > will never be bigger than 171, so yes I could pack 4 of them into 32
    > bits.


    In that case you can generate a perfect hash function for the full
    table, four more for the single-wildcard cases, and six for the
    double-wildcard cases. This should give you the fastest possible
    searches.
     
    Gert-Jan de Vos, Feb 22, 2010
    #8
  9. Steve555

    Öö Tiib Guest

    On Feb 22, 3:24 pm, Steve555 <> wrote:
    > [...]
    > Did you mean make 3 other index vectors for the elements in the 2nd,
    > 3rd, 4th positions?
    > I can see how that would allow me to use binary search on any
    > 'column'.


    Yes.

    > I'm not sure what 'pre-searched indexes' means.


    I meant: do not stop at making indexes.

    Your array will have 334K records (let's call it 'array'). It is
    sorted by the 0th phrase.
    There are also three 334K vectors containing array indexes, sorted by
    the 1st, 2nd and 3rd phrases (let's call them 'index1' etc.).

    Now suppose a typical first search is "where does the run of
    sequences whose 3rd phrase is 38 start in 'index2'?". Since your
    array is static, you can search for that in advance and put the
    results into a vector of 171 index-of-index positions (let's call it
    'firstIndex2'). The first array element with phrase 38 by index2 is
    then:

    array[ index2[ firstIndex2[38] ] ];

    Similarly you make 'lastIndex2'. The result is no searching at all
    over the 334K space: you scan only between these two positions,
    knowing that all matching records lie between them.
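    The firstIndex2/lastIndex2 idea can be sketched like this (C++11; the
    names follow the post, the helper functions are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

const int NUM_PHRASES = 171;

struct Seq { int p[4]; };

// Index vector: row numbers sorted by one column of the table.
std::vector<size_t> buildIndex(const std::vector<Seq>& table, int col) {
    std::vector<size_t> idx(table.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](size_t a, size_t b) { return table[a].p[col] < table[b].p[col]; });
    return idx;
}

// Pre-searched ranges: for each phrase value v, [first[v], last[v]) is the
// run of positions inside the index vector whose column equals v.
struct RangeTable { std::vector<size_t> first, last; };

RangeTable precompute(const std::vector<Seq>& table,
                      const std::vector<size_t>& idx, int col) {
    RangeTable r;
    r.first.assign(NUM_PHRASES, 0);
    r.last.assign(NUM_PHRASES, 0);
    for (int v = 0; v < NUM_PHRASES; ++v) {
        r.first[v] = std::lower_bound(idx.begin(), idx.end(), v,
            [&](size_t row, int x) { return table[row].p[col] < x; }) - idx.begin();
        r.last[v] = std::upper_bound(idx.begin(), idx.end(), v,
            [&](int x, size_t row) { return x < table[row].p[col]; }) - idx.begin();
    }
    return r;
}
```

    Queries then do no binary search at all: they read one precomputed
    range and scan only the rows inside it.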
     
    Öö Tiib, Feb 22, 2010
    #9
  10. Steve555 wrote:
    > Thanks for the comprehensive reply. I've just googled as much as I
    > could to try and understand before replying.
    > I really need to learn about hash tables, before I get why I need six
    > of them!
    > (I did say I was new to search techniques ;) )


    Feel free to use std::hash_map and forget about the internal structure
    of hash tables.


    > The lookup table (library of sequences) never changes. The elements
    > will never be bigger than 171, so yes I could pack 4 of them into 32
    > bits.
    > Are you, then, saying to store each sequence-of-4 in a single
    > unsigned long, and do some kind of bit search?


    typedef unsigned char MyEntry[4];

    will most likely fit your needs.

    If you encounter problems because array types are a bit special in
    C/C++, wrap it in a struct:
    struct MyEntry
    { unsigned char Column[4];
    };

    Objects of type MyEntry are now 32 bits.


    struct MyHashKey
    { unsigned char Key[2];
    };

    MyHashKey is intended to be used as hash key for two arbitrary columns.

    Because MyHashKey is not an integral type, you must define a hash
    function to use it in a hash table.

    struct MyHashFunc
    { int operator()(MyHashKey key) const
      { // portable version:
        return key.Key[0] | (key.Key[1] << 8);
        // fast non-portable version:
        // return *(const short*)key.Key;
      }
    };

    As content I would recommend to use something like:

    // C++03 has no template typedefs, so wrap the set in a small struct:
    template <typename Compare>
    struct MyHashEntry : std::set<MyEntry, Compare> {};

    It will contain the matching MyEntry lines of your data, ordered by
    the remaining columns, to speed up lookups with one or no wildcards.
    If you have two wildcards, enumerate the entire sublist.
    The Compare template parameter is needed because the content is
    sorted by different columns depending on your key columns. It is a
    functor that provides the comparator for two MyEntries.

    Now you need the comparer mentioned above.

    template <int COL1, int COL2>   // 0-based column numbers
    struct MyColumnComparer
    { bool operator()(const MyEntry& l, const MyEntry& r) const
      { unsigned lkey = (l.Column[COL1] << 8) | l.Column[COL2];
        unsigned rkey = (r.Column[COL1] << 8) | r.Column[COL2];
        return lkey < rkey;
      }
    };

    And now you can define your root data structures (the comparer's
    column indices are 0-based):

    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<2,3> >,
                  MyHashFunc> EntriesBy1234;

    The above hashtable instance is intended to store the entries hashed
    by the first two columns and, for each distinct pair of values of the
    first two columns, ordered by the third column. In the same way you
    can define further entry points for your searches.

    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<1,3> >,
                  MyHashFunc> EntriesBy1324;
    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<1,2> >,
                  MyHashFunc> EntriesBy1423;
    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<0,3> >,
                  MyHashFunc> EntriesBy2314;
    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<0,2> >,
                  MyHashFunc> EntriesBy2413;
    std::hash_map<MyHashKey,
                  MyHashEntry<MyColumnComparer<0,1> >,
                  MyHashFunc> EntriesBy3412;

    For lookups with only one wildcard you can choose any of the tables
    whose hash key covers the given columns; there is some redundancy in
    this case.


    Note that all of this is only a rough idea. It may even contain
    syntactical errors because I wrote it down quite quickly.
    Also note that the data structure is only optimized for lookups, not
    for the population process. Changes are quite slow because they have
    to be applied in several places simultaneously.


    > Sorry, I don't know how that would work... how would I differentiate
    > between the ones the exist in the library, and those that don't?


    Sorry, I didn't understand your last question.


    Marcel
     
    Marcel Müller, Feb 22, 2010
    #10
  11. Steve555

    red floyd Guest

    On Feb 22, 11:20 am, Marcel Müller <>
    wrote:

    > Feel free to use std::hash_map and forget about the internal structure
    > of hash tables.


    Except for the small detail that there is no such thing as
    std::hash_map.

    TR1 introduced unordered_map, unordered_multimap, unordered_set, and
    unordered_multiset (in namespace std::tr1).
     
    red floyd, Feb 23, 2010
    #11
  12. Steve555

    Steve555 Guest

    On 22 Feb, 15:17, Öö Tiib <> wrote:
    > [...]
    > Similarly you make 'lastIndex2'. The result is no searching at all
    > over the 334K space: you scan only between these two positions,
    > knowing that all matching records lie between them.


    Thanks for the clarification. I'm going to try to implement both this
    method and the hash method, and see which works best in context.
     
    Steve555, Feb 23, 2010
    #12
  13. Steve555

    Steve555 Guest

    On 22 Feb, 19:20, Marcel Müller <> wrote:
    > Feel free to use std::hash_map and forget about the internal
    > structure of hash tables.
    > [...]
    > Note that all of this is only a rough idea. It may even contain
    > syntactical errors because I wrote it down quite quickly.


    Thanks Marcel, this looks perfect, and is something I really want to
    get my teeth into now, but - frustration - I don't have std::hash_map
    on my system (Mac OS X 10.6, Xcode, GCC 4.2). Googling seems to
    suggest it's a custom SGI thing; possibly I can find and install it?
    I'll post when I've got a working version running. :(
     
    Steve555, Feb 23, 2010
    #13
  14. Steve555

    tonydee Guest

    On Feb 22, 10:46 pm, Gert-Jan de Vos <gert-
    > wrote:
    > [ given 333,932 keys out of a possible 855,036,081 (171^4) values ]
    >
    > In that case you can generate a perfect hash function for the full
    > table, four for the single wild card and six for the double wild
    > card cases. This should give you the fastest possible searches.


    If there's some special pattern to the 333,932 keys, it might be
    possible to generate a perfect hash in reasonable time, but that
    seems pretty unlikely, no...? Perfect hashes tend to be practical for
    smallish numbers of keys or increasingly sparse tables; the former
    doesn't apply here, and the latter doesn't scale and may be more
    trouble than it's worth (given cache misses). If I'm missing
    something, I'd be very keen to understand....

    Cheers,
    Tony
     
    tonydee, Feb 24, 2010
    #14
