P
pj
Hi,
I 'm currently writing a program that performs transliteration (i.e.,
converts greek text written using the english alphabet to "pure" greek
text using the greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:
1. Can support a N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)
2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.
3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.
Things that I don't mind trading off:
1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.
My current thoughts are:
* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).
So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std:
air<std::string, int> > for lookups with
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words
A sorted std::vector would be faster for lookups than std::map, right?
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?
What do you suggest? Is there any other good option that I 'm not
considering?
Thanks in advance!
Jon
I 'm currently writing a program that performs transliteration (i.e.,
converts greek text written using the english alphabet to "pure" greek
text using the greek alphabet) as part of my thesis. I have decided to
add the capability to convert words using some sort of lookup
algorithm as a sidekick to the "normal" conversion algorithm, and here
it starts getting interesting. I want to find an algorithm that
satisfies these requirements:
1. Can support a N-to-1 input words to output words relationship
without duplicating storage (contrived example: "133t" and "1337"
should both somehow point to "leet" without having to store the string
"leet" twice)
2. Has very fast negative answers to lookups, as I don't expect lookup
dictionaries to get big (e.g. 1000 words maximum) and therefore most
of the time lookups will turn out to be negative.
3. Did I mention speed? This algorithm will be used only for small
(e.g. 2-4 letters) words, with which the normal conversion is already
fast. This one will have to be even faster to be useful.
Things that I don't mind trading off:
1. Mutability of the associated data structures. They are going to be
built once at program startup and not modified for the duration of the
run.
My current thoughts are:
* As per requirement #1, we should probably associate strings with
integers. These integers will be used as indexes to another structure
that provides the output words (std::vector most likely).
* As per requirement #2, the fastest way to do the lookup would be
either a hash function (that probably means using std::map) or binary
search (that probably means a sorted std::vector).
So we have a couple of different possible implementations:
Impl. A:
Sorted std::vector<std:
binary search and to associate each word with an index into another
container
std::vector<std::string> to store the output words
Impl. B:
std::map<std::string, int> for lookups
std::vector<std::string> for the output words
A sorted std::vector would be faster for lookups than std::map, right?
What about using a non-standard std::hash_map? A hash-based
implementation might be faster than binary search for large
dictionaries, but is it worth the trouble for smallish ones (e.g. the
1000 words I mentioned above)?
What do you suggest? Is there any other good option that I 'm not
considering?
Thanks in advance!
Jon