find words that contain some specific letters

Lew

Could you explain what you mean by "the interception of these four
SortedSet"? I don't understand that part.
 
Tom Anderson

Could you explain what you mean by "the interception of these four
SortedSet"? I don't understand that part.


I assume he means 'intersection'. I mean, intersection, interception,
O(1), O(c), what's the difference, right?

tom
 
Lew

Could you explain what you mean by "the interception of these four
SortedSet"?  I don't understand that part.


I assume he means 'intersection'. I mean, intersection, interception,
O(1), O(c), what's the difference, right?


I figured it meant "intersection", too, but that's the part I don't
understand. What does it mean to take an intersection of the result
sets?

With a HashSet approach, you take each of the n! permutations of the
search string and determine 'dictionary.contains( permutation )'. The
result you want is the union of the permutations that return 'true'.
I do not understand why there would be a test for portions of the
search string, much less taking an "intersection" of those results.

With a HashMap<String, Set<String>> approach, the entire Set of
resultant dictionary words is indexed by the search string, so one
simple 'dictionary.get( searchTerm )' yields an entire result set
directly.

So I remain puzzled, and still need the answer.
 
Arved Sandstrom

Arved said:
Assuming that we are talking about a fresh search every time, your
proposed algorithm (later post) is probably not all that bad, although
I'd be inclined to take the letters of interest, put them in a HashSet,
and then process each letter of a dictionary word and see if the set
contains it.

Any near-optimal method would also take into account the frequency of
letters in a language. For example, if you saw that your set of letters
contained one rare letter, you'd save time overall by first doing a
search for this letter specifically in each dictionary word, and
discarding all the words (the large majority) that do not contain that
rare letter (*). Then you can do the comprehensive search.

AHS

* If your entire dictionary was composed into a single string, and you
maintained an offsets table, you could just burn through that large
string and look for the rare letter locations, and from the table figure
out what words contain it.

Not correct as far as the HashSet part goes ... now that I've carefully
read the original post.

However, I'd still consider something to do with letter frequencies as a
possible speedup in some cases.

AHS
 
Tom McGlynn

I figured it meant "intersection", too, but that's the part I don't
understand. What does it mean to take an intersection of the result
sets?

With a HashSet approach, you take each of the n! permutations of the
search string and determine 'dictionary.contains( permutation )'. The
result you want is the union of the permutations that return 'true'.
I do not understand why there would be a test for portions of the
search string, much less taking an "intersection" of those results.

With a HashMap<String, Set<String>> approach, the entire Set of
resultant dictionary words is indexed by the search string, so one
simple 'dictionary.get( searchTerm )' yields an entire result set
directly.

So I remain puzzled, and still need the answer.

I can take a crack at it... I understood Giovanni as saying that we
pre-compute something like
Map<Character, Map<Integer, Set<String>>> countMap
which, given a letter and a number of occurrences of that letter, returns
the set of all words that have exactly that number of occurrences
(using some predefined dictionary).

E.g., countMap.get('d').get(1) will contain "ad" but not "add".


Next, one analyzes the base string into its letter occurrences, e.g.,
if we want to find the permutations of aurora we have:
a 2
u 1
o 1
r 2
and find the intersection of
countMap.get('a').get(2)
countMap.get('u').get(1)
countMap.get('o').get(1)
and
countMap.get('r').get(2)

Set doesn't have a named intersection operator, but it does
have retainAll, which basically does intersection. [Though
one needs to check how it's implemented.]

E.g., something like:
Set<String> wordList =
    new TreeSet<String>(countMap.get('a').get(2)); // copy, to keep the index intact
wordList.retainAll(countMap.get('u').get(1));
wordList.retainAll(countMap.get('o').get(1));
wordList.retainAll(countMap.get('r').get(2));
// Modulo the following discussion, wordList now contains
// the set of words that have 2 a's, 1 o, 1 u and 2 r's

This is I think what Giovanni described, but I don't think it's quite
enough. I think we also have to constrain the length of the words to
be the same as the length of the original key. (E.g., "auroras" would
match above.) So perhaps we could make countMap a List where the index
gives the length of the words indexed...

List<Map<Character, Map<Integer, Set<String>>>> countMap, so
countMap.get(6).get('a').get(2)
would get the set of six-character words that have exactly two a's in
them.

I think this modified procedure would work, though I don't know if it's
particularly efficient.
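
[Editor's note: to make the pre-computation concrete, here is a minimal
sketch of building such a length-indexed count map. The class and method
names, the maxLen parameter, and the plain List<String> dictionary are
scaffolding for illustration, not anything posted in this thread.]

<code>
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CountMapBuilder {

    // countMap.get(len).get(letter).get(count) is the set of words of
    // length len containing exactly count occurrences of letter.
    static List<Map<Character, Map<Integer, Set<String>>>> build(
            List<String> dictionary, int maxLen) {
        List<Map<Character, Map<Integer, Set<String>>>> countMap =
            new ArrayList<Map<Character, Map<Integer, Set<String>>>>();
        for (int len = 0; len <= maxLen; len++) {
            countMap.add(new HashMap<Character, Map<Integer, Set<String>>>());
        }
        for (String word : dictionary) {
            if (word.length() > maxLen) continue;
            // Count each letter's occurrences in this word.
            Map<Character, Integer> counts = new HashMap<Character, Integer>();
            for (char c : word.toCharArray()) {
                Integer n = counts.get(c);
                counts.put(c, n == null ? 1 : n + 1);
            }
            // File the word under (length, letter, count) for each letter.
            Map<Character, Map<Integer, Set<String>>> byLetter =
                countMap.get(word.length());
            for (Map.Entry<Character, Integer> e : counts.entrySet()) {
                Map<Integer, Set<String>> byCount = byLetter.get(e.getKey());
                if (byCount == null) {
                    byCount = new HashMap<Integer, Set<String>>();
                    byLetter.put(e.getKey(), byCount);
                }
                Set<String> words = byCount.get(e.getValue());
                if (words == null) {
                    words = new TreeSet<String>();
                    byCount.put(e.getValue(), words);
                }
                words.add(word);
            }
        }
        return countMap;
    }
}
</code>

Querying then follows the example above: start from a copy of
countMap.get(6).get('a').get(2) and retainAll the sets for the other
letters, so the index itself is never modified.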

Regards,
Tom McGlynn
 
John B. Matthews

[...]
With a HashMap<String, Set<String>> approach, the entire Set of
resultant dictionary words is indexed by the search string, so one
simple 'dictionary.get( searchTerm )' yields an entire result set
directly.

Implementing <http://en.wikipedia.org/wiki/Jumble>, your HashMap<String,
Set<String>> makes an excellent dictionary. The Map takes a little extra
time to construct, but the result is static. Once the characters of an
input word are sorted, O(n log n), the lookup is indeed O(1).
So I remain puzzled, and still need the answer.

<code>
package org.gcs.jumble;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/**
 * Jumble.
 * @author John B. Matthews
 */
public class Jumble {

    private static final String NAME = "/usr/share/dict/words";
    private static final Map<String, Set<String>> map =
        new HashMap<String, Set<String>>();

    // Build the dictionary once: each word is keyed by its sorted
    // letters, so all permutations of a word share one key.
    static {
        try {
            File file = new File(NAME);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file)));
            String s;
            while ((s = in.readLine()) != null) {
                byte[] ba = s.getBytes();
                Arrays.sort(ba);
                String sorted = new String(ba);
                Set<String> words = map.get(sorted);
                if (words == null) {
                    words = new TreeSet<String>();
                    words.add(s);
                    map.put(sorted, words);
                } else {
                    words.add(s);
                }
            }
            in.close();
        } catch (IOException ex) {
            System.err.println(ex.getMessage());
        }
    }

    public static void main(String... args) {
        if (args.length < 1) {
            showHelp();
        } else {
            for (String word : args) {
                System.out.println(word + ":");
                byte[] ba = word.getBytes();
                Arrays.sort(ba);
                Set<String> words = map.get(new String(ba));
                if (words != null) {
                    for (String s : words) {
                        System.out.println(s);
                    }
                }
            }
        }
    }

    private static void showHelp() {
        System.out.println(
            "Usage: java -jar Jumble.jar <word> [<word>]");
    }
}
</code>
 
Giovanni Azua

Giovanni Azua said:
SortedSet1 - "aa" - this means 'a' occurs 2x in the word
SortedSet2 - "rr" - this means 'r' occurs 2x in the word
SortedSet3 - "o"
SortedSet4 - "u"

Doing the map lookup you get these four SortedSet back. Finally
calculating the interception of these four SortedSet you get the desired
result e.g. [snip]
I meant intersection.

best regards,
Giovanni
 
Andy Dingley

I want to make an algorithm that searches for all words in a
dictionary that contain some specific set of letters.

Broadly three approaches:

One, search a stream of characters for a sequence of characters.
AFAIK, the 30+ year old Boyer–Moore algorithm is still the benchmark
here.

Two, search a stream of characters for a pattern. Here you look at
regex (the code is out there; don't reinvent it), because you _might_
find something better for some narrow set of circumstances, but it will
be a cold day in hell before the average joe coder beats the coding of
regex, especially if you only have "project time" to do it in (a quick
regex sketch follows below).

Three, searching a random loose pile of stuff for other stuff. You
know it's a "pattern" in a "pattern", but the real-world data just
isn't neat and tidy. Then you have to talk about it in a Java
newsgroup. So here you go and use Lucene, which isn't a search
algorithm as such, but is a great framework for applying algorithms
to real-world mounds of data.
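
[Editor's note: to illustrate the second approach, a minimal sketch
using java.util.regex: one lookahead per required letter, so a word
matches only if it contains every letter of interest. The letter set
and sample words are purely illustrative.]

<code>
import java.util.regex.Pattern;

public class LetterFilter {
    public static void main(String[] args) {
        // One lookahead per required letter: the word must contain
        // 'a', 'r', and 'u' somewhere, in any order.
        Pattern p = Pattern.compile("(?=.*a)(?=.*r)(?=.*u).*");
        for (String word : new String[] { "aurora", "arcane", "quart" }) {
            System.out.println(word + ": " + p.matcher(word).matches());
        }
    }
}
</code>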
 
Lew

Implementing <http://en.wikipedia.org/wiki/Jumble>, your HashMap<String,
Set<String>> makes an excellent dictionary. The Map takes a little extra
time to construct, but the result is static. Once the characters of an
input word are sorted, O(n log n), the lookup is indeed O(1).

Since n, the search term length, is quite small relative to m, the dictionary
size, this is not bad. Indeed, since nearly all the time one can figure the
search term to be less than, say, 32 characters, one can place a probabilistic
upper bound on search time at around 160 complexity units (32 * log2 32 =
32 * 5 = 160).
<code>... [elided - good code]
</code>

Nice. The Jumble link to Wikipedia is most informative, too. Thank you.
 
Giovanni Azua

Lew said:
You mean O(m), right?
yes

There will be a small List or similar structure at each bucket of the
Set, but generally speaking those lists will be very small relative to
m.
Big O is about worst-case complexity, not about "relatively this or that";
hoping for small rather than big data size is wishful thinking.

m being the size of the dictionary:
if you mean small being m/1000 this is still O(m)
if you mean small being m/1000000 this is still O(m)

in general m/c is still O(m) when c is constant with respect to m.
That is why the Javadocs for HashSet claim constant-time
HashSet does not claim constant time; your interpretation does.

"This class offers constant time performance for the basic operations (add,
remove, contains and size), *assuming* the hash function disperses the
elements properly among the buckets"
http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashSet.html

e.g. you can place a hashCode implementation that always returns 1, and that
will prove the constant-time offering wrong; or simply the data being
modelled does not have a perfect hashCode, e.g. String.
performance for HashSet#contains(). Are you saying the Javadocs are
wrong?
No, the javadocs are fine. I think your interpretation is wrong.
It is not common to do binary searches on HashSets.
Not only uncommon but plain wrong: there is no order in HashSets. I don't
know where you got this idea from.

FYI there are more implementations of Set than HashSet, e.g. TreeSet, which
offers *guaranteed* O(log n) search time, unlike the former.

"This implementation provides guaranteed log(n) time cost for the basic
operations (add, remove and contains)."
http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html
HashSet lookups tend to be much faster than binary searches because
the hash lookup takes one to the correct bucket in O(1) time, then
there is an O(x) search through the list at that bucket, where x is
some very small number. The nature of the hashing algorithm should
keep x more-or-less constant with respect to m, thus the claim of
constant-time complexity for 'contains()' is not invalidated.
Rubbish. The complexity of hash tables is worst-case O(n). TreeSet or
Collections.binarySearch (for RandomAccess implementations) is guaranteed
O(log n).
Again, this is the claim that the Javadocs make. I feel very
comfortable agreeing with the Javadocs on this matter.
The Javadoc is no substitute for a book on algorithms and complexity; it
just provides a general guideline, which is also very clear: "it does not
guarantee that the order will remain constant over time".
Order of the HashMap is not relevant; one finds the correct entry
directly via the search-term hash and a short linear search through
the matching bucket. The size of each bucket does not depend on m for
typical dictionaries.
Exactly. This is O(m), i.e. all elements (or many) fall under a few buckets;
the factors influencing this are the hashCode and a good table size (prime
number etc.).
The term "constant time" means O(1). [snip]
Yes, I made a mistake here. I thought the notation was O(c), but it is O(1).
 
Giovanni Azua

Hi John,

John B. Matthews said:
Implementing <http://en.wikipedia.org/wiki/Jumble>, your HashMap<String,
Set<String>> makes an excellent dictionary. The Map takes a little extra
time to construct, but the result is static. Once the characters of an
input word are sorted, O(n log n), the lookup is indeed O(1).

<code>
[snip]
Set<String> words = map.get(new String(ba));
This line will return wrong matches that do not satisfy the requirement of
the OP. You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall under
different buckets and that is not true.

Best regards,
Giovanni
 
Giovanni Azua

Hi Tom,

Tom McGlynn said:
I can take a crack at it... I understood Giovanni as saying that we
pre-compute something like
I bet you did :)
Set doesn't have a named intersection operator, but it does
have retainAll, which basically does intersection. [Though
one needs to check how it's implemented.]
The retainAll is indeed an intersection implementation, but surprisingly it
is not overridden and optimized in the TreeSet implementation to be O(n).
Instead it reuses the one inherited from AbstractCollection; I had a look at
that implementation and it is O(n log n), so unfortunately we need to build
our own here ...
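
[Editor's note: for illustration, here is what such a hand-rolled linear
intersection might look like: a merge-style walk over two sorted sets,
collecting matches into a List so the whole pass stays O(|a| + |b|). A
sketch assuming natural ordering, not Giovanni's forthcoming
implementation.]

<code>
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;

public class SortedIntersect {

    // Merge-style intersection: advance the iterator holding the
    // smaller element, emit on equality. O(|a| + |b|) comparisons.
    static <T extends Comparable<T>> List<T> intersect(SortedSet<T> a,
                                                       SortedSet<T> b) {
        List<T> result = new ArrayList<T>();
        Iterator<T> i = a.iterator();
        Iterator<T> j = b.iterator();
        if (!i.hasNext() || !j.hasNext()) return result;
        T x = i.next();
        T y = j.next();
        while (true) {
            int cmp = x.compareTo(y);
            if (cmp == 0) {
                result.add(x);
                if (!i.hasNext() || !j.hasNext()) break;
                x = i.next();
                y = j.next();
            } else if (cmp < 0) {
                if (!i.hasNext()) break;
                x = i.next();
            } else {
                if (!j.hasNext()) break;
                y = j.next();
            }
        }
        return result;
    }
}
</code>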
This is I think what Giovanni described,
Exactly.

but I don't think it's quite enough. I think we also have to constrain
the length of the words to be the same as the length of the original
key. (E.g., "auroras" would match
Good catch with the length, I did not see that before :) good job! I am
finishing a full implementation that I will post later, and it includes your
corrections. I am testing it using an online dictionary.

The complexity is what I posted before, O(n * m), where "n" is the number of
chars in the input and "m" the size of the dictionary; e.g. my newest big
Oxford dictionary contains 350k words and phrases.

By parallelizing the calculation of the intersections, the complexity can be
reduced to O(log n * m), i.e. calculating the intersection for every two Sets
in parallel and recursively, à la mergesort.
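
[Editor's note: a sequential sketch of that mergesort-shaped reduction;
the names are illustrative, and the two recursive halves are the
independent work one would hand to separate threads.]

<code>
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class IntersectAll {

    // Intersect sets[lo..hi) by halving, like the recursion tree of
    // mergesort; the left and right halves are independent, so they
    // could be computed in parallel.
    static Set<String> intersectAll(List<Set<String>> sets, int lo, int hi) {
        if (hi - lo == 1) {
            // Copy, so the caller's sets are never modified.
            return new TreeSet<String>(sets.get(lo));
        }
        int mid = (lo + hi) / 2;
        Set<String> left = intersectAll(sets, lo, mid);
        Set<String> right = intersectAll(sets, mid, hi);
        left.retainAll(right); // intersect the two halves
        return left;
    }
}
</code>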

Best regards,
Giovanni
 
John B. Matthews

Giovanni Azua said:
John B. Matthews said:
Implementing <http://en.wikipedia.org/wiki/Jumble>, your
HashMap<String, Set<String>> makes an excellent dictionary. The Map
takes a little extra time to construct, but the result is static.
Once the characters of an input word are sorted, O(n log n), the
lookup is indeed O(1).

<code>
[snip]
Set<String> words = map.get(new String(ba));

This line will return wrong matches that do not satisfy the
requirement of the OP.

This line does not return anything. The result is used to determine if
the current word is a permutation of an already encountered word or a
novel key for the Map.
You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall
under different buckets and that is not true.

The code [1] correctly implements the first algorithm described in the
article cited [2]. In particular, it uses the sorted letters of each
word as the key in a HashMap. The OP's requirements are not so clear to
me. Perhaps the OP can elucidate whether the algorithm addresses the
problem.

[1]<http://sites.google.com/site/drjohnbmatthews/jumble>
[2]<http://en.wikipedia.org/wiki/Jumble>
 
Lew

Giovanni said:
This line will return wrong matches that do not satisfy the requirement of
the OP. You wrongly assume that for every different sorted input "ba" String
there will be a different hashCode or that they will always fall under
different buckets and that is not true.

It assumes no such thing. The String hash code does a very decent job of
distributing keys. No one claims that it makes keys always fall under
different buckets. It does come quite close for random input, though. One
can expect constant-time performance for a get() from a HashSet or HashMap
with String keys.
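
[Editor's note: it may be worth adding that even when two String keys do
collide, a HashMap still returns the right entry, because get() compares
colliding keys with equals(); collisions cost time, not correctness. A
small demonstration, using the well-known collision "Aa"/"BB", both of
which hash to 2112.]

<code>
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // Same hash code, same bucket ...
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        Map<String, String> map = new HashMap<String, String>();
        map.put("Aa", "first");
        map.put("BB", "second");

        // ... but get() still distinguishes the keys via equals().
        System.out.println(map.get("Aa")); // first
        System.out.println(map.get("BB")); // second
    }
}
</code>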
 
Lew

Giovanni said:
Big O is about worst-case complexity, not about "relatively this or that";
hoping for small rather than big data size is wishful thinking.

m being the size of the dictionary:
if you mean small being m/1000 this is still O(m)
if you mean small being m/1000000 this is still O(m)

Nope. We're talking about String keys. Strings have a hash code that
distributes keys very well in practice.

What you can expect is order k, where k is the number of keys that hash to the
same bucket. For typical uses of String keys, that'll be very, very small, on
the order of ten or less, and will not vary with m.
e.g. you can place a hashCode implementation that always returns 1, and that
will prove the constant-time offering wrong; or simply the data being
modelled does not have a perfect hashCode, e.g. String.

You could, but Strings don't.

What do you mean by a "perfect" hash code? No hash code guarantees completely
distinct values for distinct keys. What String's hash achieves is
probabilistically extremely good distribution, thus fulfilling the requirement
of the Javadocs.
 
Giovanni Azua

Patricia Shanahan said:
A perfect hash function is one that guarantees distinct hash codes for a
specific set of keys. There are algorithms for constructing one, given a
set of keys.
For example, the Character class.

Best regards,
Giovanni
 
John B. Matthews

Giovanni Azua said:
For example the Character class.

I think all of the primitive wrappers (except Double) have trivially
perfect integer hashes, as they wrap distinct values. Patricia was
perhaps referring to algorithms of the kind discussed here:

<http://en.wikipedia.org/wiki/Perfect_hash_function>

The jumble algorithm relies on a hashed map for efficiency, but a
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:

<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>

I was intrigued to compare the two languages' library implementations
of a general purpose string hash. Using a multiplicative function with
an initial value of zero, both offer excellent performance. Java uses a
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
eight on an unsigned (modular) value whose size is implementation
defined (e.g. mod 2**32):

<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>
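
[Editor's note: for reference, the Java function is the familiar
Horner-style loop; a sketch equivalent to what String.hashCode()
computes.]

<code>
// Equivalent to java.lang.String#hashCode():
// s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
// evaluated Horner-style with 32-bit signed overflow.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}
</code>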

[Cross-posted to comp.lang.java.programmer and comp.lang.ada;
Followup-To header not set.]
 
Mark Space

John said:
I think all of the primitive wrappers (except Double) have trivially
perfect integer hashes, as they wrap distinct values. Patricia was
perhaps referring to algorithms of the kind discussed here:

<http://en.wikipedia.org/wiki/Perfect_hash_function>


I remember in school we implemented a minimal perfect hash function on a
given set of keys: the keywords in the Pascal language. We were given
the parameters of the algorithm: use the length of the keyword, the
first character and the last character only.

The concept was obviously to show how a fast look-up could be done when
implementing a compiler, but the hash function itself was interesting at
the time.
 
Lew

John said:
The jumble algorithm relies on a hashed map for efficiency, but a

Because it has constant-time performance for get() with the String hash code.
perfect hash is not essential. I recently implemented the algorithm
in both Java and Ada using library routines:

<http://en.wikipedia.org/wiki/Jumble>
<http://sites.google.com/site/drjohnbmatthews/jumble>
<http://home.roadrunner.com/~jbmatthews/jumble.html>

I was intrigued to compare the two languages' library implementations
of a general purpose string hash. Using a multiplicative function with
an initial value of zero, both offer excellent performance. Java uses a
multiplier of 31 on a signed 32-bit integer; Ada uses a multiplier of
eight on an unsigned (modular) value whose size is implementation
defined (e.g. mod 2**32):

<http://www.docjar.org/html/api/java/lang/String.java.html>
<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-strhas.adb>

The Java String hash is just about verbatim from Knuth. It has the virtue of
being a well-studied algorithm that has been shown to have good distribution
characteristics.

I'm not so sure about the Ada one. I haven't studied this in a while, but I
seem to recall that power-of-two multipliers mod 2**32 don't have as good a
distribution. Perhaps someone could correct me on that.
 
