Best algorithm/data structure combination for word lookup?


pj

Hi,

I'm currently writing a program that performs transliteration (i.e.,
converts Greek text written using the English alphabet to "pure" Greek
text using the Greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
is where it starts getting interesting. I want to find an algorithm
that satisfies these requirements:

1. Can support an N-to-1 input-words-to-output-words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)

2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.

3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.

Things that I don't mind trading off:

1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.

My current thoughts are:

* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).

So we have a couple of different possible implementations:
Impl. A:
  Sorted std::vector<std::pair<std::string, int> > for lookups with
  binary search, and to associate each word with an index into another
  container
  std::vector<std::string> to store the output words
Impl. B:
  std::map<std::string, int> for lookups
  std::vector<std::string> for the output words
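To make that concrete, Impl. A could look roughly like this (an
untested sketch; all the names are made up):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordEntry;

// Comparator so std::lower_bound can compare an entry against a key.
struct KeyLess
{
    bool operator()(const WordEntry& e, const std::string& key) const
    { return e.first < key; }
};

class Dictionary
{
public:
    // 'pairs' maps input words to indexes into 'outputs'.
    Dictionary(const std::vector<WordEntry>& pairs,
               const std::vector<std::string>& outputs)
        : lookup_(pairs), outputs_(outputs)
    { std::sort(lookup_.begin(), lookup_.end()); } // sort once at startup

    // Returns the output word, or 0 if 'word' is unknown.
    const std::string* find(const std::string& word) const
    {
        std::vector<WordEntry>::const_iterator it =
            std::lower_bound(lookup_.begin(), lookup_.end(),
                             word, KeyLess());
        if (it == lookup_.end() || it->first != word)
            return 0;   // negative answer after at most ~log2(N) compares
        return &outputs_[it->second];
    }

private:
    std::vector<WordEntry> lookup_;
    std::vector<std::string> outputs_;
};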

A sorted std::vector would be faster for lookups than std::map, right?
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?

What do you suggest? Is there any other good option that I'm not
considering?

Thanks in advance!
Jon
 

Victor Bazarov

> [..]
> A sorted std::vector would be faster for lookups than std::map, right?

If so, then only marginally.
> What about using a non-standard std::hash_map? A hash-based
> implementation might be faster than binary search for large
> dictionaries, but is it worth the trouble for smallish ones (e.g. the
> 1000 words I mentioned above)?

Could be. Have you tried looking at the different approaches in
TAOCP by Knuth? Sorting and Searching takes up a whole volume.
> What do you suggest? Is there any other good option that I'm not
> considering?

Pick one, get it working. Think of an interface where you can
easily plug in different implementations. Then measure each of
the approaches.
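For example, something along these lines (just a sketch; the names
are invented):

#include <string>

// Abstract interface; each candidate implementation (sorted vector,
// std::map, hash table, ...) derives from this, so the rest of the
// program and the benchmark never see the concrete container.
class WordLookup
{
public:
    virtual ~WordLookup() {}
    // Returns the output word, or 0 if 'word' is not in the dictionary.
    virtual const std::string* find(const std::string& word) const = 0;
};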

V
 

Thomas Tutone

> [...]
>
> A sorted std::vector would be faster for lookups than std::map,
> right?

Generally, yes, but you would have to measure to be sure. Scott
Meyers has a chapter of his Effective STL book devoted to this topic,
and he concludes that where (as is the case for you) the data
structure will not change, a sorted vector is typically faster. In a
project I did involving look-up in a data structure consisting of just
several hundred words, the version of the program that used a sorted
vector was indeed more than 20% faster than the prior version using a
map. YMMV.
> What about using a non-standard std::hash_map?

Or the soon-to-be-standard std::tr1::unordered_map. Dinkumware has
already released a version of their standard library supporting it,
gcc 4.3 supports it as well, and other vendors are following suit.
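Usage is essentially the same as std::map; a minimal sketch, assuming
a TR1-capable library:

#include <string>
#include <tr1/unordered_map>   // just <unordered_map> with some vendors

std::tr1::unordered_map<std::string, int> lookup;

// Average-case constant time, assuming a decent hash of the key.
bool known(const std::string& word)
{
    return lookup.find(word) != lookup.end();
}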
> A hash-based implementation might be faster than binary search for
> large dictionaries, but is it worth the trouble for smallish ones
> (e.g. the 1000 words I mentioned above)?

Depends on the quality of the hashing function. If it is both very
fast and does not result in many collisions, it could be materially
faster than the binary search. With only 4 letters, I bet you could
accomplish that, but you'd have to try and see.

Good luck - sounds like a fun project.

Best regards,

Tom
 

Erik Wikström

> [...]
>
> 2. Has very fast negative answers to lookups, as I don't expect
> lookup dictionaries to get big (e.g. 1000 words maximum) and
> therefore most of the time lookups will turn out to be negative.

When using binary trees/searches, the time it takes to find out that
a word is not in the dictionary is about the same as the time it
takes to find that it is; if you think about it you'll see why.
> 3. Did I mention speed? This algorithm will be used only for small
> (e.g. 2-4 letters) words, with which the normal conversion is already
> fast. This one will have to be even faster to be useful.

A question: is there much work involved translating letter for
letter, i.e., isn't there more or less a 1:1 mapping of the Latin
letters to the Greek? Just curious, since you probably won't see any
gains if it's that simple.
> [...]
>
> * As per requirement #1, we should probably associate strings with
> integers. These integers will be used as indexes to another structure
> that provides the output words (std::vector most likely).

I'd probably use a pointer instead of an int; I doubt there'll be
much difference in speed.
> [...]
>
> A sorted std::vector would be faster for lookups than std::map,
> right?

For small sets the vector could probably be faster, but I think the
map will be better as the size increases (I don't have anything to
back that up with, though, so measure yourself).
> What about using a non-standard std::hash_map? A hash-based
> implementation might be faster than binary search for large
> dictionaries, but is it worth the trouble for smallish ones (e.g. the
> 1000 words I mentioned above)?

Unless you plan to use the application in an embedded system or on a
handheld, std::map/std::vector will probably be fast enough for a
dictionary of that size.
> What do you suggest? Is there any other good option that I'm not
> considering?

I don't think it will give you much with such small sets of words,
but a radix tree might be a good choice. It's a tree structure, and
the depth of each node is no more than the number of letters in the
word (it might be less). Another advantage is that it can give you
early negatives: you only have to walk as deep as the word has
letters in common with a word that's in the dictionary. The Wikipedia
article on the subject will give you the idea:
en.wikipedia.org/wiki/Radix_tree
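To illustrate the early-negative idea, a bare-bones (uncompressed)
trie might look like the sketch below; a real radix tree would merge
chains of single-child nodes rather than waste a 256-pointer table
per node:

#include <cstring>
#include <string>

struct TrieNode
{
    TrieNode* child[256];   // one slot per possible byte
    int output;             // index of the output word, or -1
    TrieNode() : output(-1) { std::memset(child, 0, sizeof child); }
};

void insert(TrieNode* root, const std::string& word, int outputIndex)
{
    TrieNode* node = root;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        unsigned char c = word[i];
        if (node->child[c] == 0)
            node->child[c] = new TrieNode;
        node = node->child[c];
    }
    node->output = outputIndex;
}

// Returns the output index, or -1. An unknown word fails as soon as
// its prefix diverges from every word in the dictionary.
int find(const TrieNode* root, const std::string& word)
{
    const TrieNode* node = root;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        node = node->child[static_cast<unsigned char>(word[i])];
        if (node == 0)
            return -1;      // early negative
    }
    return node->output;
}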
 

Thomas Tutone

> On 2007-07-27 17:59, (e-mail address removed) wrote:
> [snip]
>> A sorted std::vector would be faster for lookups than std::map,
>> right?
>
> For small sets the vector could probably be faster, but I think the
> map will be better as the size increases (I don't have anything to
> back that up with, though, so measure yourself).

I think you misunderstand. The OP was referring to a sorted vector
that one searches using std::lower_bound or a similar binary search
algorithm. Doing so is typically more efficient than using the find
member function of a std::map, which is a more complex data structure
than a vector. The drawback of using a sorted vector is that
inserting or deleting elements is typically far more time consuming
than doing so in a map. The advantage is that searching to see if an
element exists is typically faster in a sorted vector than in a map.
(Both searches should occur in logarithmic time, but the binary search
of a sorted vector may be done more efficiently.) This should hold
true whether the size of the container is 100 or 10,000, although the
gains may be marginal depending upon the implementation. Scott Meyers
devotes a chapter of Effective STL to this topic.

Best regards,

Tom
 

pj

> When using binary trees/searches, the time it takes to find out
> that a word is not in the dictionary is about the same as the time
> it takes to find that it is; if you think about it you'll see why.

Sure, but in other containers (e.g. radix trees) it could be
different, which is why I'm asking. :)
> A question: is there much work involved translating letter for
> letter, i.e., isn't there more or less a 1:1 mapping of the Latin
> letters to the Greek? Just curious, since you probably won't see any
> gains if it's that simple.

Quite a bit of work. For example, Greek has no less than 5 different
ways to write down the phoneme "EE" (as in "EAsy"), and usually it's
just written as an "i" (without regard to proper Greek spelling).
There are also many cases where letters have to be translated with
forward or backward seeking, e.g. "TH" may represent the two letters
"tau eta" or the single letter "theta" (think about the TH sound in
"think"). This can be decided for sure UNLESS the next letter (after
the "h") is one of "l", "m", "n", "r", in which case both
possibilities have to be tried. You get the point. :)

I'm using a dictionary lookup through aspell as the final stage of
the process; the catch is that asking aspell "is this word known?" is
more than two orders of magnitude faster than asking "give me
suggestions for this unknown word". Therefore, when I see a word with
ambiguous phonemes, I ask aspell about ALL the plausible spellings in
hopes of "guessing" the correct one. Even with some words having up
to 512 spelling combinations, this is about 4-5 times faster than
making an arbitrary translation and relying on aspell to correct
misspellings (and it gives better results too).
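In case anyone is curious, generating the plausible spellings is just
a cartesian product over the ambiguous spots. A much simplified
sketch (the real phoneme rules are more involved):

#include <cstddef>
#include <string>
#include <vector>

// Each slot holds the candidate spellings for one segment of the
// word; unambiguous segments are slots with a single candidate.
typedef std::vector<std::string> Choices;

// Appends every combination of the slots to 'out'; k two-way
// ambiguities yield 2^k candidate words.
void expand(const std::vector<Choices>& slots, std::size_t pos,
            const std::string& sofar, std::vector<std::string>& out)
{
    if (pos == slots.size()) {
        out.push_back(sofar);
        return;
    }
    for (std::size_t i = 0; i < slots[pos].size(); ++i)
        expand(slots, pos + 1, sofar + slots[pos][i], out);
}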

What I'm trying to do with this "home-made lookup" scheme is
obviously to bypass (and outperform!) aspell queries for small common
words, since I can fine-tune it for the specific type of data it will
be operating on.

---

Based on the replies so far, I think I'll go for the sorted vector
approach first, using a suitably generic interface. This extra
processing can only improve the results, so if it also improves speed
(by not having to go through aspell at all) I'll just stick with it.

Thanks to everyone for replying, although of course I'm still open to
good advice!

Jon
 

Tobias

For faster look-up you might create a pointer table at program
startup, or even at compile time if the word list is fixed.

For the English alphabet this would be an array of 26 indices. The
first entry is 0; that's where words with initial letter 'a' start in
your word list. The second entry is the index of the first word with
initial letter 'b' in your word list, and so on.

If you have to look up a word with initial letter 'b', you can start
at pointer-table entry 1 and stop at entry 2 (exclusive). If you
don't find it in this shorter range, then it does not exist in the
full list.

With a bit more memory you can also go for the first two letters. That
would make an index list with 26**2 = 676 entries. This should shorten
the look-up time significantly.
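A sketch of the one-letter variant over a sorted word list (this
assumes all-lowercase words; the names are illustrative):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// bucket[c] is the index of the first word starting with 'a' + c;
// bucket[26] is one past the last word. 'words' must be sorted.
void buildBuckets(const std::vector<std::string>& words,
                  std::size_t bucket[27])
{
    std::size_t w = 0;
    for (int c = 0; c < 26; ++c) {
        bucket[c] = w;
        while (w < words.size() && words[w][0] == 'a' + c)
            ++w;
    }
    bucket[26] = words.size();
}

bool contains(const std::vector<std::string>& words,
              const std::size_t bucket[27], const std::string& word)
{
    int c = word[0] - 'a';
    // Binary search restricted to the words sharing the first letter.
    return std::binary_search(words.begin() + bucket[c],
                              words.begin() + bucket[c + 1], word);
}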

Best regards,
Tobias
 

James Kanze

> [...]
>
> 1. Can support an N-to-1 input-words-to-output-words relationship
> without duplicating storage (contrived example: "133t" and "1337"
> should both somehow point to "leet" without having to store the
> string "leet" twice)

The alternative to storing the string "leet" twice is to use
pointers. And it's far from obvious that a pointer requires less
memory than the string: on my machine, a pointer is 8 bytes, whereas
I can store "leet" in five bytes.
> 2. Has very fast negative answers to lookups, as I don't expect
> lookup dictionaries to get big (e.g. 1000 words maximum) and
> therefore most of the time lookups will turn out to be negative.

For 1000 or so words, std::set is more than adequate. For
100000, of course, you'd have to consider some sort of hashing.

You might also consider some sort of trie.
> [...]
>
> * As per requirement #1, we should probably associate strings with
> integers. These integers will be used as indexes to another structure
> that provides the output words (std::vector most likely).

Which would make it significantly slower, without necessarily
reducing memory very much.
> * As per requirement #2, the fastest way to do the lookup would be
> either a hash function (that probably means using std::map) or binary
> search (that probably means a sorted std::vector).

If the data is immutable, and can be known at compile time, the
simplest solution is probably std::lower_bound over a C-style array,
statically initialized; if you can set a maximum length of, say, 7
for the strings, this avoids all pointers altogether, and is very,
very fast. (It also has very good locality, which can make a
significant difference in real life, even if it doesn't affect the
big-O complexity.)

Otherwise, std::map (normally a balanced tree, and never a hash
table) can be used.
> [...]
>
> A sorted std::vector would be faster for lookups than std::map,
> right?

What do the measurements say on your machine? As I said, if the data
were known at compile time, I'd use something like:

#include <cstring>   // for std::memcmp

struct Entry
{
    char from[ 8 ] ;
    char to  [ 8 ] ;

    bool operator<( Entry const& other ) const
    {
        return std::memcmp( from, other.from, 8 ) < 0 ;
    }
} ;

Entry const table[] =
{
    // Entries pre-sorted by 'from'.
} ;

and std::lower_bound. A bit brutal, perhaps, but remarkably
efficient, because locality is as good as possible.
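The lookup itself would then be something like this (continuing the
sketch above; assumes <algorithm> and <cstring> are included):

const char* lookup( const char* word )
{
    Entry key = { "", "" } ;        // zero-filled 'from' and 'to'
    std::strncpy( key.from, word, sizeof key.from - 1 ) ;
    Entry const* end = table + sizeof table / sizeof table[ 0 ] ;
    Entry const* it  = std::lower_bound( table, end, key ) ;
    return it != end && std::memcmp( it->from, key.from, 8 ) == 0
        ? it->to
        : 0 ;                       // not found
}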
> What about using a non-standard std::hash_map? A hash-based
> implementation might be faster than binary search for large
> dictionaries, but is it worth the trouble for smallish ones (e.g. the
> 1000 words I mentioned above)?

It depends on the efficiency of the hash function. There can be some
gain for as few as a couple of hundred elements, but in practice the
difference is very, very small, and a hash table will never have
locality as good as something like the above.
 

James Kanze

>> On 2007-07-27 17:59, (e-mail address removed) wrote:
>> [snip]
>>> A sorted std::vector would be faster for lookups than std::map,
>>> right?
>>
>> For small sets the vector could probably be faster, but I think the
>> map will be better as the size increases (I don't have anything to
>> back that up with, though, so measure yourself).
>
> I think you misunderstand. The OP was referring to a sorted vector
> that one searches using std::lower_bound or a similar binary search
> algorithm. Doing so is typically more efficient than using the find
> member function of a std::map, which is a more complex data
> structure than a vector.

I suspect that the real difference comes from the fact that the
vector has better locality; std::map is (always?) a node-based
structure, which means each entry is allocated separately and could
be anywhere with regard to the others. And even if the nodes happen
to be adjacent, you still have the extra pointers, etc., which means
that you get fewer entries in a cache line or a page.

Note that similar considerations may apply to the use of
std::string, unless the implementation uses the small-string
optimization.
 
