Problem with searching in TreeSet expressions starting with a given text

S

setar

I store in TreeSet expressions in some natural language. This can be
any natural language. The content of a set is sorted in this language
by language collator:
collator = Collator.getInstance(locale);
collator.compare(string1, string2)

Sorting is working correctly. For example in English the letters have
following order: ..., c, d, e, ... so words: cat, dog, ear will be
sorted in the given order.

Sorting is also working correctly in other languages. In non-English
languages there can be additional letters. For example in Polish there
is a 'æ' letter (c with an accent). This letters change the order of
other letter. In Polish the order is: ..., c, æ, d, e, .... and for
example following words have given order: cena (en. price), æma (en.
moth), duma (en. pride), echo (en. echo).

So index is build correctly for any language and it is sorted
correctly.

Now I want to receive a range of its sorted values which are strings
beginning with a given string.
For English it is easy. If I want to get all expressions beginning with
a string "cat" I can invoke subSet method with the first argument
equal to this string and the second argument the same as the first but
with last letter replaced with its SUCCESOR. Successors in English are
easy to determine:
TreeSet index;
index.subSet("cat", "cau");

But for other languages determining successors of letters isn't easy.
For example in Polish the successor of letter 'c' is 'æ' not 'd'. I
could create tables with orders of letters for all languages, but there
is too many languages... I was looking in Java API for method
retrieving successor of given character but I found nothing. Does
anybody know if there is such a method?

Second way of retrieving expressions beginning with a given string is
invoking subSet with following arguments:
TreeSet index;
index.subSet("cat", "cat" + Character.MAX_VALUE);

I tested it with few string and everything seemed to work ok. But
recently I tried to return all expressions starting with a word
"cat", not only with a string. So I want to receive string "cat
is running", but I don't want to receive expression "catch". So I
invoked subSet with space after string "cat":
TreeSet index;
index.subSet("cat ", "cat " + Character.MAX_VALUE);

Unfortunately method returned also the "catch" word. I don't know
why subSet is working in this way.

Does anybody know how to correct described ideas or maybe problem can
be solved in other way?
 
R

Robert Klemme

I store in TreeSet expressions in some natural language. This can be
any natural language. The content of a set is sorted in this language
by language collator:
collator = Collator.getInstance(locale);
collator.compare(string1, string2)

Sorting is working correctly. For example in English the letters have
following order: ..., c, d, e, ... so words: cat, dog, ear will be
sorted in the given order.

Sorting is also working correctly in other languages. In non-English
languages there can be additional letters. For example in Polish there
is a 'ć' letter (c with an accent). This letters change the order of
other letter. In Polish the order is: ..., c, ć, d, e, .... and for
example following words have given order: cena (en. price), ćma (en.
moth), duma (en. pride), echo (en. echo).

So index is build correctly for any language and it is sorted
correctly.

Now I want to receive a range of its sorted values which are strings
beginning with a given string.
For English it is easy. If I want to get all expressions beginning with
a string "cat" I can invoke subSet method with the first argument
equal to this string and the second argument the same as the first but
with last letter replaced with its SUCCESOR. Successors in English are
easy to determine:
TreeSet index;
index.subSet("cat", "cau");

But for other languages determining successors of letters isn't easy.
For example in Polish the successor of letter 'c' is 'ć' not 'd'. I
could create tables with orders of letters for all languages, but there
is too many languages... I was looking in Java API for method
retrieving successor of given character but I found nothing. Does
anybody know if there is such a method?

Second way of retrieving expressions beginning with a given string is
invoking subSet with following arguments:
TreeSet index;
index.subSet("cat", "cat" + Character.MAX_VALUE);

I tested it with few string and everything seemed to work ok. But
recently I tried to return all expressions starting with a word
"cat", not only with a string. So I want to receive string "cat
is running", but I don't want to receive expression "catch". So I
invoked subSet with space after string "cat":
TreeSet index;
index.subSet("cat ", "cat " + Character.MAX_VALUE);

Unfortunately method returned also the "catch" word. I don't know
why subSet is working in this way.

It may be due to the workings of the Comparator you are using. Try
comparing "cat ", "catch" and "cat " + Character.MAX_VALUE with your
Comparator and see the outcome. After all Character.MAX_VALUE might not
be a valid Unicode character.

IIRC there is a method in Character that checks whether a char is a
valid character. You could use that to find the max legal character and
try again with that.
Does anybody know how to correct described ideas or maybe problem can
be solved in other way?

Some thoughts: it may be that you stretch the TreeSet too far. Could be
that with better access to internals you are better off, so you might
have to craft your own data structure (Tries come to mind).

A two phase approach might work as well: use subSet() as the first line
to reduce the set's size and then iterate through it and use some other
mechanism like a regexp to further narrow down your result.

Kind regards

robert
 
S

setar

User "Robert Klemme said:
It may be due to the workings of the Comparator you are using. Try
comparing "cat ", "catch" and "cat " + Character.MAX_VALUE with your
Comparator and see the outcome. After all Character.MAX_VALUE might not be
a valid Unicode character.
IIRC there is a method in Character that checks whether a char is a valid
character. You could use that to find the max legal character and try
again with that.


Some thoughts: it may be that you stretch the TreeSet too far. Could be
that with better access to internals you are better off, so you might have
to craft your own data structure (Tries come to mind).

A two phase approach might work as well: use subSet() as the first line to
reduce the set's size and then iterate through it and use some other
mechanism like a regexp to further narrow down your result.

Yes, you are right. These strings are sorted in this order: "cat ", "catch",
"cat " + Character.MAX_VALUE, so this is the reason why subSet returns also
"catch ". My collator ignores white spaces, maybe it is true for all
collators. In my project it was useful feature to ignore white spaces while
sorting set so I forgot about this. I will try to apply yours ideas.

Thanks
 
S

setar

Yes, you are right. These strings are sorted in this order: "cat ",
"catch", "cat " + Character.MAX_VALUE, so this is the reason why subSet
returns also "catch ". My collator ignores white spaces, maybe it is true
for all collators. In my project it was useful feature to ignore white
spaces while sorting set so I forgot about this. I will try to apply yours
ideas.

Strings returned by subSet should be additionally filtered. If we get only
these strings which start with searching string (String.startsWith)
everything will work. We must filter because RuleBasedCollaror "ignores"
some characters while sorting. Among these characters is space.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top