How to quickly search through arrays?


Steve555

Hi

I have a set of 171 entities (in this case they're integers
representing musical phrases).
These exist in sequences of length 4, and I have a library (lookup-
table) of such sequences.
e.g.
2 58 93 170
38 39 100 150
38 40 41 87
etc.

The table is sorted and is quite sparse; there are 333,932 entries out
of a possible 855,036,081 (171^4).

I'm looking for fast ways to find matches for a query containing 1 or
2 wildcards.
e.g. find matches for:
2 58 ? 170
? 39 ? 150

At the moment I'm using brute force to just iterate through the
lookup table, but this is proving very slow as the algorithm needs to
do hundreds of searches in close to real time.

I know very little about searching, so any ideas for me to study would
be welcomed.
I do know how to do a binary search, but only from the first entry, so
if that is a wildcard I'm stumped.

I'm happy to re-format/re-package the data if it could improve the
search speed. I'm not concerned about size or memory usage.

Thanks for any suggestions

Steve
 

Maxim Yegorushkin

Hi

I have a set of 171 entities (in this case they're integers
representing musical phrases).
These exist in sequences of length 4, and I have a library (lookup-
table) of such sequences.
e.g.
2 58 93 170
38 39 100 150
38 40 41 87
etc.

The table is sorted and is quite sparse; there are 333,932 entries out
of a possible 855,036,081 (171^4).

I'm looking for fast ways to find matches for a query containing 1 or
2 wildcards.
e.g. find matches for:
2 58 ? 170
? 39 ? 150

At the moment I'm using brute force to just iterate through the
lookup table, but this is proving very slow as the algorithm needs to
do hundreds of searches in close to real time.

I know very little about searching, so any ideas for me to study would
be welcomed.
I do know how to do a binary search, but only from the first entry, so
if that is a wildcard I'm stumped.

Sounds like you might benefit from using a ternary search tree. They
have performance comparable to hash lookup, and they support partial
(wildcard) searches.

http://www.cs.princeton.edu/~rs/strings/
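
For illustration, a minimal (untested) sketch of such a tree over these
fixed-length integer sequences, with -1 standing in for the '?'
wildcard. All names are invented for the example; see the link above for
the real treatment:

#include <vector>

struct Node
{ int value;                // phrase stored at this node
  Node *lo, *eq, *hi;       // less / equal (next position) / greater
  bool end;                 // true if a stored sequence ends here
  Node(int v) : value(v), lo(0), eq(0), hi(0), end(false) {}
};

Node* insert(Node* n, const int* seq, int len)
{ if (!n) n = new Node(*seq);
  if (*seq < n->value)      n->lo = insert(n->lo, seq, len);
  else if (*seq > n->value) n->hi = insert(n->hi, seq, len);
  else if (len > 1)         n->eq = insert(n->eq, seq + 1, len - 1);
  else                      n->end = true;
  return n;
}

// Partial-match search: -1 in the query means "any phrase here".
void search(const Node* n, const int* q, int len,
            std::vector<int>& path, std::vector<std::vector<int> >& out)
{ if (!n) return;
  bool wild = (*q == -1);
  if (wild || *q < n->value) search(n->lo, q, len, path, out);
  if (wild || *q == n->value)
  { path.push_back(n->value);
    if (len == 1) { if (n->end) out.push_back(path); }
    else search(n->eq, q + 1, len - 1, path, out);
    path.pop_back();
  }
  if (wild || *q > n->value) search(n->hi, q, len, path, out);
}

Insert the 333,932 sequences once (ideally in shuffled order, so the
tree stays roughly balanced); each query then visits only the subtrees
consistent with its fixed positions.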
 

Öö Tiib

Hi

I have a set of 171 entities (in this case they're integers
representing musical phrases).
These exist in sequences of length 4, and I have a library (lookup-
table) of such sequences.
e.g.
2  58  93  170
38  39  100  150
38  40  41  87
etc.

The table is sorted and is quite sparse; there are 333,932 entries out
of a possible 855,036,081 (171^4).

I'm looking for fast ways to find matches for a query containing 1 or
2 wildcards.
e.g. find matches for:
2  58  ?  170
?  39  ?  150

At the moment I'm using brute force to just iterate through the
lookup table, but this is proving very slow as the algorithm needs to
do hundreds of searches in close to real time.

I know very little about searching, so any ideas for me to study would
be welcomed.
I do know how to do a binary search, but only from the first entry, so
if that is a wildcard I'm stumped.

You may create index vectors into your table (sorted by the other
phrases). To search by the first phrase, use std::lower_bound() on your
table; to search by another phrase, use std::lower_bound() on the
corresponding sorted index. If you have a static array (no new phrases
are ever added), you may want to make pre-searched indexes to quicken it
up even more.

I'm happy to re-format/re-package the data if it could improve the
search speed. I'm not concerned about size or memory usage.

Possibly you still want to make some alternative implementations and
profile them integrated into your application. Memory usage is a factor
that often drags performance down; it is common to underestimate how
devastating frequent cache misses are to performance.
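
As a rough, untested sketch of the index-vector idea (all names are
invented for the example):

#include <algorithm>
#include <utility>
#include <vector>

struct Entry { int p[4]; };          // one sequence of four phrases
std::vector<Entry> table;            // the ~334K records, sorted by p[0]

// Index on column 1: (phrase value, position in 'table'), sorted by value.
typedef std::pair<int, int> IndexItem;
std::vector<IndexItem> index1;

struct ByValue
{ bool operator()(const IndexItem& a, const IndexItem& b) const
  { return a.first < b.first; }
};

void buildIndex1()
{ index1.clear();
  for (size_t i = 0; i < table.size(); ++i)
    index1.push_back(IndexItem(table[i].p[1], (int)i));
  std::sort(index1.begin(), index1.end(), ByValue());
}

// All rows whose second phrase equals 'v' lie in [result.first,
// result.second); each item's .second member is a row number in 'table'.
std::pair<std::vector<IndexItem>::iterator, std::vector<IndexItem>::iterator>
findBySecondPhrase(int v)
{ return std::equal_range(index1.begin(), index1.end(),
                          IndexItem(v, 0), ByValue()); }

The same pattern repeated for columns 2 and 3 gives binary search on any
column; intersecting two such ranges would handle queries with two fixed
values.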
 

Marcel Müller

Hi,
I have a set of 171 entities (in this case they're integers
representing musical phrases).
These exist in sequences of length 4, and I have a library (lookup-
table) of such sequences.
e.g.
2 58 93 170
38 39 100 150
38 40 41 87
etc.

The table is sorted and is quite sparse; there are 333,932 entries out
of a possible 855,036,081 (171^4).

I'm looking for fast ways to find matches for a query containing 1 or
2 wildcards.
e.g. find matches for:
2 58 ? 170
? 39 ? 150

At the moment I'm using brute force to just iterate through the
lookup table, but this is proving very slow as the algorithm needs to
do hundreds of searches in close to real time.

What about the change rate of your data set?

If the data changes rarely, create six hash tables, one for each pair of
columns. If you want to speed up requests with only one wildcard even
more, place a nested hash table for a third column in the value type of
the above hash tables. This will make all lookups with one and two
wildcards O(1). Note that you do not need to support all combinations,
as you can choose the first hash table appropriately in the case of
three given columns.
If memory consumption is more important than scalability, you could
replace the many inner hash tables with sorted lists.

Btw., how large can your numbers grow? If an unsigned char is
sufficient, you can pack an entire line into a single 32-bit structure.
Even with unsigned short data, keeping redundant copies of your data in
the data structures is better than keeping references.
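
A hedged sketch of the packing idea (illustrative names; it assumes
every value fits in 8 bits, which 171 does):

#include <algorithm>
#include <vector>

typedef unsigned int PackedLine;     // 32 bits hold four 8-bit values

inline PackedLine pack(unsigned a, unsigned b, unsigned c, unsigned d)
{ return (a << 24) | (b << 16) | (c << 8) | d; }

// With the library stored as a sorted vector of packed words, an exact
// "does this sequence exist?" probe is a single binary search.
bool exists(const std::vector<PackedLine>& lib, PackedLine key)
{ return std::binary_search(lib.begin(), lib.end(), key); }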

I do know how to do a binary search, but only from the first entry, so
if that is a wildcard I'm stumped.

You need different entry points for different types of queries.


Marcel
 

Steve555

You may create index vectors into your table (sorted by the other
phrases). To search by the first phrase, use std::lower_bound() on your
table; to search by another phrase, use std::lower_bound() on the
corresponding sorted index. If you have a static array (no new phrases
are ever added), you may want to make pre-searched indexes to quicken it
up even more.


Possibly you still want to make some alternative implementations and
profile them integrated into your application. Memory usage is a factor
that often drags performance down; it is common to underestimate how
devastating frequent cache misses are to performance.

Did you mean make 3 other index vectors for the elements in the 2nd,
3rd, 4th positions?
I can see how that would allow me to use binary search on any
'column'.

I'm not sure what 'pre-searched indexes' means.

Thanks

Steve
 

Steve555

Hi,

What about the change rate of your data set?

If the data changes rarely, create six hash tables, one for each pair of
columns. If you want to speed up requests with only one wildcard even
more, place a nested hash table for a third column in the value type of
the above hash tables. This will make all lookups with one and two
wildcards O(1). Note that you do not need to support all combinations,
as you can choose the first hash table appropriately in the case of
three given columns.
If memory consumption is more important than scalability, you could
replace the many inner hash tables with sorted lists.

Btw., how large can your numbers grow? If an unsigned char is
sufficient, you can pack an entire line into a single 32-bit structure.
Even with unsigned short data, keeping redundant copies of your data in
the data structures is better than keeping references.


You need different entry points for different types of queries.

Marcel

Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying.
I really need to learn about hash tables before I get why I need six
of them!
(I did say I was new to search techniques ;) )

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes, I could pack 4 of them into 32
bits.
Are you, then, saying to store each sequence-of-4 in a single
unsigned long, and do some kind of bit search?
Sorry, I don't know how that would work... how would I differentiate
between the ones that exist in the library, and those that don't?

Thanks

Steve
 

Gert-Jan de Vos

Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying.
I really need to learn about hash tables before I get why I need six
of them!
(I did say I was new to search techniques ;) )

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes, I could pack 4 of them into 32
bits.

In that case you can generate a perfect hash function for the full
table, four more for the single-wildcard cases, and six for the double-
wildcard cases. This should give you the fastest possible searches.
 

Öö Tiib

Did you mean make 3 other index vectors for the elements in the 2nd,
3rd, 4th positions?
I can see how that would allow me to use binary search on any
'column'.
Yes.

I'm not sure what 'pre-searched indexes' means.

I meant: do not stop at making the indexes.

Your array will have 334K records (let's call it 'array'). It is sorted
by the 0th phrase.
There are also three 334K vectors containing array indexes sorted by the
1st, 2nd and 3rd phrases (let's call them 'index1' etc.).

Now ... it often occurs that your first search is something like: where
does the run of records with phrase 38 start in 'index2'? You can search
for that in advance (since your array is static) and put the results
into a vector of 171 indexes-into-the-index (let's call it
'firstIndex2'). As a result, the first array element with phrase 38, by
index2, is:

array[ index2[ firstIndex2[38] ] ];

Similarly you make 'lastIndex2'. This leaves no searching at all over
the 334K space: you search only between these two indexes, knowing that
all the matching records lie between them.
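
A sketch of how those pre-searched vectors might be built in one pass
(names follow the explanation above; 'Entry' is illustrative, and 172
slots allow for phrase values up to 171):

#include <vector>

struct Entry { int p[4]; };
const int NUM_PHRASES = 172;   // phrase values assumed to stay below this

// After the call, positions firstIndex2[v] .. lastIndex2[v]-1 of
// 'index2' (which is sorted by p[2]) hold exactly the rows with p[2] == v.
void buildPreSearched(const std::vector<Entry>& table,
                      const std::vector<int>& index2,
                      std::vector<int>& firstIndex2,
                      std::vector<int>& lastIndex2)
{ int n = (int)index2.size();
  firstIndex2.assign(NUM_PHRASES, n);    // empty range [n, n) by default
  lastIndex2.assign(NUM_PHRASES, n);
  for (int i = 0; i < n; ++i)
  { int v = table[index2[i]].p[2];
    if (firstIndex2[v] == n) firstIndex2[v] = i;  // first occurrence of v
    lastIndex2[v] = i + 1;                        // one past the last one
  }
}

The first record with third phrase 38 is then
table[ index2[ firstIndex2[38] ] ], exactly as above, with no searching
at run time.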
 

Marcel Müller

Steve555 said:
Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying.
I really need to learn about hash tables before I get why I need six
of them!
(I did say I was new to search techniques ;) )

Feel free to use std::hash_map and forget about the internal structure
of hash tables.

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes, I could pack 4 of them into 32
bits.
Are you, then, saying to store each sequence-of-4 in a single
unsigned long, and do some kind of bit search?

typedef unsigned char MyEntry[4];

will most likely fit your needs.

If you encounter problems because array types are a bit special in
C/C++, wrap it in a struct:

struct MyEntry
{ unsigned char Column[4];
};

Objects of type MyEntry are now 32 bits.


struct MyHashKey
{ unsigned char Key[2];
};

MyHashKey is intended to be used as hash key for two arbitrary columns.

Because MyHashKey is not an integral type, you must define a hash
functor to use it in a hash table, plus an equality operator for the
keys:

struct MyHashFunc
{ size_t operator()(MyHashKey key) const
  { // portable version:
    return key.Key[0] | (key.Key[1] << 8);
    // fast non-portable version:
    // return *(const short*)&key;
  }
};

inline bool operator==(MyHashKey l, MyHashKey r)
{ return l.Key[0] == r.Key[0] && l.Key[1] == r.Key[1]; }

As content I would recommend to use something like:

template <typename Compare>
struct MyHashEntry : std::set<MyEntry, Compare> {};

(C++03 has no template typedefs, so a thin derived struct stands in for
one.) It will contain the matching MyEntry lines of your data, ordered
by the remaining columns, to speed up lookups with one or no wildcards.
If you have two wildcards, enumerate the entire sublist.
The Compare template parameter is needed because the content is sorted
by different columns depending on your key columns. It is a functor
that provides the comparator for two MyEntries.

Now you need the comparer mentioned above (column indices are 0-based):

template <int COL1, int COL2>
struct MyColumnComparer
{ bool operator()(const MyEntry& l, const MyEntry& r) const
  { unsigned lkey = (l.Column[COL1] << 8) | l.Column[COL2];
    unsigned rkey = (r.Column[COL1] << 8) | r.Column[COL2];
    return lkey < rkey;
  }
};

And now you can define your root data structures (hash maps from the
two key columns to the sorted sublists):

std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<2,3> >,
              MyHashFunc> EntriesBy1234;

The above hash table instance is intended to store the entries hashed by
the first two columns and, for each distinct pair of values of those
columns, ordered by the remaining two. In the same way you could define
further entry points for your searches.

std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<1,3> >,
              MyHashFunc> EntriesBy1324;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<1,2> >,
              MyHashFunc> EntriesBy1423;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,3> >,
              MyHashFunc> EntriesBy2314;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,2> >,
              MyHashFunc> EntriesBy2413;
std::hash_map<MyHashKey,
              MyHashEntry<MyColumnComparer<0,1> >,
              MyHashFunc> EntriesBy3412;

For lookups with only one wildcard you can choose any table whose key
includes the required columns. There is some redundancy in this case.
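
For example, a query like "2 58 ? 170" has three given columns (1, 2
and 4), so EntriesBy1423 fits: it is keyed on columns 1 and 4, and its
sublists are ordered by column 2 first, which makes the matches one
contiguous run. A purely illustrative sketch, reusing the names above
and assuming the rough idea compiles as intended ('findMatches' is a
hypothetical helper):

typedef MyHashEntry<MyColumnComparer<1,2> > By1423Sublist;
typedef std::hash_map<MyHashKey, By1423Sublist, MyHashFunc> By1423Map;

// Find all lines matching "2 58 ? 170".
void findMatches(By1423Map& EntriesBy1423)
{ MyHashKey k = { { 2, 170 } };            // columns 1 and 4 of the query
  By1423Map::iterator hit = EntriesBy1423.find(k);
  if (hit == EntriesBy1423.end()) return;  // no line has cols 1,4 == 2,170
  MyEntry lo = { { 0, 58, 0,   0 } };      // bracket the run with col 2 == 58
  MyEntry hi = { { 0, 58, 255, 0 } };      // col 3 never exceeds 171 < 255
  By1423Sublist::iterator e    = hit->second.lower_bound(lo);
  By1423Sublist::iterator stop = hit->second.upper_bound(hi);
  for (; e != stop; ++e)
  { // *e matches "2 58 ? 170"; e->Column[2] fills in the wildcard
  }
}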


Note that all of this is only a rough idea. It may even contain
syntactical errors because I wrote it down quite quickly.
Also note that the data structure is optimized only for lookups, not for
the population process. Changes are quite slow because they have to be
applied in several places simultaneously.

Sorry, I don't know how that would work... how would I differentiate
between the ones that exist in the library, and those that don't?

Sorry, I didn't understand your last question.


Marcel
 

red floyd

Feel free to use std::hash_map and forget about the internal structure
of hash tables.

Except for the small detail that there is no such thing as
std::hash_map.

TR1 introduced std::unordered_map, std::unordered_multimap,
std::unordered_set, and std::unordered_multiset.
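
For what it's worth, a minimal TR1 example; on GCC the header lives
under tr1/ (treat the exact include path as an assumption to check on
your compiler):

#include <tr1/unordered_map>
#include <cstdio>

int main()
{ // key: two columns packed into one integer, as discussed upthread
  std::tr1::unordered_map<unsigned, int> byCols12;
  byCols12[(2u << 8) | 58u] = 1;           // columns 1 and 2 of one line
  std::printf("%d\n", (int)byCols12.count((2u << 8) | 58u)); // prints 1
  return 0;
}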
 

Steve555

Did you mean make 3 other index vectors for the elements in the 2nd,
3rd, 4th positions?
I can see how that would allow me to use binary search on any
'column'.
Yes.

I'm not sure what 'pre-searched indexes' means.

I meant: do not stop at making the indexes.

Your array will have 334K records (let's call it 'array'). It is sorted
by the 0th phrase.
There are also three 334K vectors containing array indexes sorted by the
1st, 2nd and 3rd phrases (let's call them 'index1' etc.).

Now ... it often occurs that your first search is something like: where
does the run of records with phrase 38 start in 'index2'? You can search
for that in advance (since your array is static) and put the results
into a vector of 171 indexes-into-the-index (let's call it
'firstIndex2'). As a result, the first array element with phrase 38, by
index2, is:

array[ index2[ firstIndex2[38] ] ];

Similarly you make 'lastIndex2'. This leaves no searching at all over
the 334K space: you search only between these two indexes, knowing that
all the matching records lie between them.

Thanks for the clarification. I'm going to try to implement this
method and the other hash method, and see which works best in context.
 

Steve555

Steve555 said:
Thanks for the comprehensive reply. I've just googled as much as I
could to try and understand before replying.
I really need to learn about hash tables before I get why I need six
of them!
(I did say I was new to search techniques ;) )

Feel free to use std::hash_map and forget about the internal structure
of hash tables.

The lookup table (library of sequences) never changes. The elements
will never be bigger than 171, so yes, I could pack 4 of them into 32
bits.
Are you, then, saying to store each sequence-of-4 in a single
unsigned long, and do some kind of bit search?

   typedef unsigned char MyEntry[4];

will most likely fit your needs.

If you encounter problems because array types are a bit special in
C/C++, wrap it in a struct:

   struct MyEntry
   { unsigned char Column[4];
   };

Objects of type MyEntry are now 32 bits.

   struct MyHashKey
   { unsigned char Key[2];
   };

MyHashKey is intended to be used as hash key for two arbitrary columns.

Because MyHashKey is not an integral type, you must define a hash
functor to use it in a hash table, plus an equality operator for the
keys:

   struct MyHashFunc
   { size_t operator()(MyHashKey key) const
     { // portable version:
       return key.Key[0] | (key.Key[1] << 8);
       // fast non-portable version:
       // return *(const short*)&key;
     }
   };

   inline bool operator==(MyHashKey l, MyHashKey r)
   { return l.Key[0] == r.Key[0] && l.Key[1] == r.Key[1]; }

As content I would recommend to use something like:

   template <typename Compare>
   struct MyHashEntry : std::set<MyEntry, Compare> {};

(C++03 has no template typedefs, so a thin derived struct stands in for
one.) It will contain the matching MyEntry lines of your data, ordered
by the remaining columns, to speed up lookups with one or no wildcards.
If you have two wildcards, enumerate the entire sublist.
The Compare template parameter is needed because the content is sorted
by different columns depending on your key columns. It is a functor
that provides the comparator for two MyEntries.

Now you need the comparer mentioned above (column indices are 0-based):

   template <int COL1, int COL2>
   struct MyColumnComparer
   { bool operator()(const MyEntry& l, const MyEntry& r) const
     { unsigned lkey = (l.Column[COL1] << 8) | l.Column[COL2];
       unsigned rkey = (r.Column[COL1] << 8) | r.Column[COL2];
       return lkey < rkey;
     }
   };

And now you can define your root data structures (hash maps from the
two key columns to the sorted sublists):

   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<2,3> >,
                 MyHashFunc> EntriesBy1234;

The above hash table instance is intended to store the entries hashed by
the first two columns and, for each distinct pair of values of those
columns, ordered by the remaining two. In the same way you could define
further entry points for your searches.

   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<1,3> >,
                 MyHashFunc> EntriesBy1324;
   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<1,2> >,
                 MyHashFunc> EntriesBy1423;
   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<0,3> >,
                 MyHashFunc> EntriesBy2314;
   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<0,2> >,
                 MyHashFunc> EntriesBy2413;
   std::hash_map<MyHashKey,
                 MyHashEntry<MyColumnComparer<0,1> >,
                 MyHashFunc> EntriesBy3412;

For lookups with only one wildcard you can choose any table whose key
includes the required columns. There is some redundancy in this case.

Note that all of this is only a rough idea. It may even contain
syntactical errors because I wrote it down quite quickly.
Also note that the data structure is optimized only for lookups, not for
the population process. Changes are quite slow because they have to be
applied in several places simultaneously.

Sorry, I don't know how that would work... how would I differentiate
between the ones that exist in the library, and those that don't?

Sorry, I didn't understand your last question.

Marcel

Thanks Marcel, this looks perfect, and is something I really want to
get my teeth into now, but - frustration - I don't have std::hash_map
on my system (Mac OSX 10.6, XCode compiler, GCC 4.2). Googling seems
to suggest it's a custom SGI thing; possibly I can find and install
it? I'll post when I've got a working version running. :(
 

tonydee

[ given 333,932 keys out of a possible 855,036,081 (171^4) values ]

In that case you can generate a perfect hash function for the full
table, four for the single wild card and six for the double wild
card cases. This should give you the fastest possible searches.

If there's some special pattern to the 333,932 keys, it might be
possible to generate a perfect hash in reasonable time, but it seems
pretty unlikely, no...? Perfect hashes tend to be practical for
smallish numbers of keys or increasingly sparse tables; the former
doesn't apply here and the latter doesn't scale, and may be more
trouble than it's worth (given cache misses). If I'm missing something,
I'd be very keen to understand....

Cheers,
Tony
 
