Storing and parsing variable length strings

D

dterzis

Hi folks...

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix. Conversely,
the program has to be able to store strings in such a way, so that
prefix-based retrieving is fast and accurate (i.e., all relative
strings are located and retrieved). Doing this linearly is obviously
slow and inefficient.

So, what is the best way of implementing this functionality in J2SE 1.4
or later? Or, if there is no ready (preprogrammed) way of doing it
properly in the standard distribution, are you aware of any open source
implementation (e.g., tree or something similar)?

Thanks,

Dimitris
 
I

Ian Pilcher


The OP needs to find all of the strings that start with a "prefix", so
so he effectively needs to find objects within a range of values.
Depending on where the data comes from, a TreeSet or a sorted array
would be better.
 
R

Roedy Green

So, what is the best way of implementing this functionality in J2SE 1.4
or later? Or, if there is no ready (preprogrammed) way of doing it
properly in the standard distribution, are you aware of any open source
implementation (e.g., tree or something similar)?

Since you do direct access with the prefix, and have perfect spelling
of the prefixes, you would start with a HashMap of with prefixes as
keys. They map to a collection of Strings, (possibly with the prefixes
chopped off to conserve space.)

What sort of Collection? You might use ArrayList or HashSet or even
simple arrays of exactly the correct size (small fast but not fast for
update since you must create a new array to add an elt).

if your strings use some reduced character set, you might encode them
with byte[] to make them more compact.

For fast access, you could convert the 3-char prefix to a numerical
index 0..17,575 a=0, b=1 ... and do a direct index lookup to get your
collection.

There may be a fair bit of overhead with all those objects. If you
need to reduce the ram footprint, you can consider combining the
24-char strings into a single byte array in toto or per prefix using a
marker separator characters. Don't worry about this until you prove a
need for optimisation.
 
T

Thomas Hawtin

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix. Conversely,
the program has to be able to store strings in such a way, so that
prefix-based retrieving is fast and accurate (i.e., all relative
strings are located and retrieved). Doing this linearly is obviously
slow and inefficient.

I've not done anything like this, but that wont stop me commenting.

A SortedSet with natural ordering looks like what you need. Find the
tailSet after (inclusive) the prefix. Iterate until you find a string
that doesn't have the prefix.

class PrefixMatcher {
private static final String[] EMPTY_STRING = new String[0];
private final SortedSet<String> set;
public PrefixMatcher(Collection<String> strings) {
this.set = new java.util.TreeSet(strings);
}
public String[] startsWith(String prefix) {
Iterator<String> tailIter = set.tailSet(prefix).iterator();

// A bit of optimisation for a potentially common case
// of no matches.
// Start by testing the first, if exists.
if (!tailIter.hasNext()) {
return EMPTY_STRING;
}
String first = tailIter.next();
if (!first.startsWith(prefix)) {
return EMPTY_STRING;
}

// At least one found, loop for the rest.
List<String> found = new java.util.ArrayList<String>();
found.add(first);
while (tailIter.hasNext()) {
String string = tailIter.next();
if (!string.startsWith(prefix)) {
break;
}
found.add(string);
}
return found.toArray(EMPTY_STRING);
}

Alternatively, you could 'increment' the prefix and find the bounded subSet.

public String[] startsWith(String prefix) {
return set.subSet(
prefix,
increment(prefix)
).toArray(EMPTY_STRING);
}
private static String increment(String string) {
final int len = string.length();
final char last = len==0 ? '\uffff' : string.charAt(len-1);
return
last=='\uffff' ? (
string+'\u0000'
) : (
string.substring(0, len-1) + (char)(last+1)
);
}
}
}

(Usual disclaimer.)

If you want non-exact matching rules (say case insensitive), then you
can supply a comparator.

Tom Hawtin
 
T

Thomas Hawtin

Thomas said:
public String[] startsWith(String prefix) {
return set.subSet(
prefix,
increment(prefix)
).toArray(EMPTY_STRING);
}
private static String increment(String string) {
final int len = string.length();
final char last = len==0 ? '\uffff' : string.charAt(len-1);
return
last=='\uffff' ? (
string+'\u0000'
) : (
string.substring(0, len-1) + (char)(last+1)
);
}

Argh. That's wrong. What you need is what I originally decided to write
but changed my mind (it's late). When the last character is \uffff
(which probably isn't a valid character anyway), you need to increment
the character preceeding. If the string is all -1, then you need to
either use tailSet or a nulls last sorting comparator.
}
}

(Usual disclaimer.)

Using arrays and java.util.Arrays would probably be faster if set didn't
change. Or use an array backed SortedSet implementation.

Tom Hawtin
 
W

Wibble

Hi folks...

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix. Conversely,
the program has to be able to store strings in such a way, so that
prefix-based retrieving is fast and accurate (i.e., all relative
strings are located and retrieved). Doing this linearly is obviously
slow and inefficient.

So, what is the best way of implementing this functionality in J2SE 1.4
or later? Or, if there is no ready (preprogrammed) way of doing it
properly in the standard distribution, are you aware of any open source
implementation (e.g., tree or something similar)?

Thanks,

Dimitris
Do you know all the strings a priori? Is it a large number of strings?
If you know all the strings up front, you can just keep them in a sorted
array and use binary search.

If there arent that many strings but you dont know them up front, you
can store them in 24 hash maps, each with the first n of 24 characters
as the hash key and a list of strings as the value.

A string trie would work too, but you'd have to find or build it.

If you have alot of strings, and don't know the complete set up front,
consider a database.
 
D

Dilton McGowan II

Hi folks...

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix. Conversely,
the program has to be able to store strings in such a way, so that
prefix-based retrieving is fast and accurate (i.e., all relative
strings are located and retrieved). Doing this linearly is obviously
slow and inefficient.

So, what is the best way of implementing this functionality in J2SE 1.4
or later? Or, if there is no ready (preprogrammed) way of doing it
properly in the standard distribution, are you aware of any open source
implementation (e.g., tree or something similar)?

Thanks,

Dimitris

Use a database with an index on the field of interest; either a local
database like Derby or server based one like DB2, this is what they're
designed for and excel at.
 
R

Roedy Green

Since you do direct access with the prefix, and have perfect spelling
of the prefixes, you would start with a HashMap of with prefixes as
keys. They map to a collection of Strings, (possibly with the prefixes
chopped off to conserve space.)

another approach would be to build the structure using ArrayLists,
then when it is complete, convert it to use plain arrays, then
serialise the thing so you can bring it back easily.
 
R

Roedy Green

The OP needs to find all of the strings that start with a "prefix", so
so he effectively needs to find objects within a range of values.
Depending on where the data comes from, a TreeSet or a sorted array
would be better.

He does not need a sort to do that. All he need do is chop off the
first three chars and use that as a HashMap style key.

He might consider using a sort first to help speed the building of the
list of strings attached to each prefix.
 
R

Roedy Green

Use a database with an index on the field of interest; either a local
database like Derby or server based one like DB2, this is what they're
designed for and excel at.

It is fascinating seeing the variety of responses. Each person brings
a different set of assumptions.

It is quite possible all you need is an array of strings and a loop if
the problem set is small enough.
 
T

Thomas Hawtin

Roedy said:
On Fri, 02 Sep 2005 20:31:20 -0500, Ian

He does not need a sort to do that. All he need do is chop off the
first three chars and use that as a HashMap style key.

He might consider using a sort first to help speed the building of the
list of strings attached to each prefix.

Depends on the data.

If he was sorting through tens of thousands of Java method names, he
might find performance for get and set methods lacking. Alternatively
looking for "it" would take 65536 look up. "@" would take, er, lots.

It's like Mozilla trying to lookup up a URL. You type 'h' for http or
'w' for www or anything else, and it takes a few minutes to respond.

Tom Hawtin
 
J

Jim Janney

Roedy Green said:
He does not need a sort to do that. All he need do is chop off the
first three chars and use that as a HashMap style key.

Except that he doesn't say that the prefixes are all three characters
long. Sorting into order gives you O(log(n)) searches for prefixes of
any length.
 
R

Roedy Green

Except that he doesn't say that the prefixes are all three characters
long. Sorting into order gives you O(log(n)) searches for prefixes of
any length.

On re-reading his request in the light of your observation I think he
means by prefix that he wants to do lookup by any of the first N chars
of the string, not that there is some single definable prefix, like a
telephone exchange prefix, for each string.

In that case, he wants a TreeMap which lets you look up by approximate
key and scan forward till you find one that does not begin with the
prefix string you want.

Another implementation is to sort the strings alphabetically and
search for what you want with binary search. see
http://mindprod.com/jgloss/binarysearch.html

IRRC Tremaps don't allow duplicate keys. If you have duplicate keys,
that is another problem to solve, which I won't concern myself with
yet.

The other question of interest is once you look up this string what do
you do? What sort of object do you need to process? JUST an array of
strings with the same lead chars? In that case you would need only a
TreeSet not a TreeMap.
 
L

Lasse Reichstein Nielsen

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix.

This sounds like a job for a Trie data strcuture.
<URL:http://en.wikipedia.org/wiki/Trie?

A Google search found this implementation:
<URL:http://software.topcoder.com/catalog/c_component.jsp?comp=6601809&ver=1>
I don't know anything about it except what is written on that page :)

/L
 
G

Googmeister

Hi folks...

I have an application which needs to store variable-length strings (up
to 24 characters long) that must then be matched according to prefix.
That is, whenever a prefix is given, the application has to be able to
retrieve all stored strings that start with this prefix. Conversely,
the program has to be able to store strings in such a way, so that
prefix-based retrieving is fast and accurate (i.e., all relative
strings are located and retrieved). Doing this linearly is obviously
slow and inefficient.

So, what is the best way of implementing this functionality in J2SE 1.4
or later? Or, if there is no ready (preprogrammed) way of doing it
properly in the standard distribution, are you aware of any open source
implementation (e.g., tree or something similar)?

If efficiency is crucial (time and space), use a trie, e.g., a ternary
search trie (TST). Its more flexible than a binary search tree
(e.g., supports prefix matches) and can be faster than hashing.
It can also be more space efficient that storing the strings in a
sorted array. Whether you are in need of such an optimized
solution depends on your application.
 
D

dterzis

Hi folks...

First of all, thanks very much for all the ideas and suggestions. Much
appreciated!

Allow me to provide a bit more information, based on your comments thus
far.

- The strings used act as variable-length prefixes to stored objects.

- The set of strings that act as key to the stored objects is not known
beforehand. It is supposed to grow and be dynamically updated (i.e.,
sort of a dictionary or a routing table).

- The data structure must be able to retrieve all objects with a common
string prefix (e.g., all objects indexed by "Hi", "High", "Highland",
if the prefix were "Hi").

- Once a more generic prefix is received, the application needs to
substitute all the existing ones with the more generic (this being some
sort of aggregation).

- The size of storage required is rather medium scale, i.e. up to tens
of thousands of objects.

- Performance has to be high - i.e., no more than N*log(N) for any
operation.

By the looks of it, a Trie seems the best solution, but I will need to
develop one. Based on the above, do you believe I could get away with a
Java Collection, e.g. a TreeMap?

Thanks,

Dimitris
 
T

Thomas Hawtin

- The size of storage required is rather medium scale, i.e. up to tens
of thousands of objects.

- Performance has to be high - i.e., no more than N*log(N) for any
operation.

By the looks of it, a Trie seems the best solution, but I will need to
develop one. Based on the above, do you believe I could get away with a
Java Collection, e.g. a TreeMap?

TreeMap will give you O(log n) performance - n (tens of thousands) times
better than apparently required.

O(n log n) is actually quite high. 30,000 * log2 30,000 is around
450,000. Multiply that by a nice constant and you get a lot of cycles.
But even a linear search is only O(n). n operations over a set of n
taking O(n log n) is more reasonable.

Binary search on arrays could be faster and have a lower memory
overhead. It would take some work to make arrays work efficiently with
insertion, but perhaps not as much as a Trie implementation.

Tries are going to be much more work, and quite possibly end up slower.

Tom Hawtin
 
W

Wibble

Hi folks...

First of all, thanks very much for all the ideas and suggestions. Much
appreciated!

Allow me to provide a bit more information, based on your comments thus
far.

- The strings used act as variable-length prefixes to stored objects.

- The set of strings that act as key to the stored objects is not known
beforehand. It is supposed to grow and be dynamically updated (i.e.,
sort of a dictionary or a routing table).

- The data structure must be able to retrieve all objects with a common
string prefix (e.g., all objects indexed by "Hi", "High", "Highland",
if the prefix were "Hi").

- Once a more generic prefix is received, the application needs to
substitute all the existing ones with the more generic (this being some
sort of aggregation).

- The size of storage required is rather medium scale, i.e. up to tens
of thousands of objects.

- Performance has to be high - i.e., no more than N*log(N) for any
operation.

By the looks of it, a Trie seems the best solution, but I will need to
develop one. Based on the above, do you believe I could get away with a
Java Collection, e.g. a TreeMap?

Thanks,

Dimitris
A TreeMap and a trie are pretty different. A TreeMap is a sorted
collection and and the tailMap() method could be used to find the
first the first string with a prefix. I think for you're app,
TreeMap is simpler and probably just as good as a trie.
 
G

Googmeister

By the looks of it, a Trie seems the best solution, but I will need to
develop one. Based on the above, do you believe I could get away with a
Java Collection, e.g. a TreeMap?

I was among those who proposed a trie for large problems. But given
the type of problem you describe, a balanced binary search tree
(TreeMap) will be plenty fast enough here.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,733
Messages
2,569,440
Members
44,830
Latest member
ZADIva7383

Latest Threads

Top