How do I rewrite this in a cleaner way?

F

Fencer

Hello, I have a large set of strings and I need to find all combinations
of two of those string and perform an algorithm on them (the algorithm
is actually an XQuery that runs over a large dataset so I won't post
that here). The combinations foo-bar and bar-foo are considered to be
equal so only one should be generated.

I wrote a static class method that takes an array of Strings and returns
all combinations in a Vector. The Vector stores a class Pair which I've
written myself too. The complete code is posted below with output from a
sample run. It seems to work, but how can it be improved and get rid of
the Pair class? Thanks!

package utility;

import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class MiscUtils {

public static void main(String[] args) {
Vector<Pair> combos = findCombinations(new String[]{"chebi",
"kegg.compound"});
System.out.println(combos.size());
System.out.println(combos);
}

public static Vector<Pair> findCombinations(String[] strings) {
Vector<Pair> combinations = new Vector<Pair>();
Set<String> alreadyPaired = new HashSet<String>();

for (int i = 0; i < strings.length; ++i) {
for (int j = 0; j < strings.length; ++j) {
if (i != j && !alreadyPaired.contains(strings[j])) {
combinations.add(new Pair(strings, strings[j]));
}
}

alreadyPaired.add(strings);
}

return combinations;
}
}

class Pair {
public Pair(String s1, String s2) {
this.s1 = s1;
this.s2 = s2;
}

public String getFirst() {
return s1;
}

public String getSecond() {
return s2;
}

public String toString() {
return s1 + ":" + s2;
}

private String s1 = null;
private String s2 = null;
}

Sample run:
1
[chebi:kegg.compound]

- Fencer
 
F

Fencer

Do you mean you perform a single operation on the entire set of string
pairs? Or that you iterate through the set of string pairs, performing
some operation on each pair?

Thanks for your reply my Duniho. I should have been clearer: I iterate
through all my combinations and runs the all algorithm on each
combination. The algorithm is a complicated XQuery that runs over a
large data set. The number of combinations I have is not that large
actually, I just didn't want to do the pairing by hand. It's true that I
don't really need to keep the combinations around but I would actually
like to do that.
The input data could actually contain duplicates but that would be an
error on my part. I mostly want to get rid of the Pair class. :)
If the latter, then one of the biggest improvements you might make is to
not keep the pairs in a collection at all. Just feed them one pair at a
time to the algorithm as you generate them, discarding the pair once
it's been processed.
I wrote a static class method that takes an array of Strings and
returns all combinations in a Vector. The Vector stores a class Pair
which I've written myself too. The complete code is posted below with
output from a sample run. It seems to work, but how can it be improved
and get rid of the Pair class? Thanks!

It seems to me that another significant improvement you could easily
make is to get rid of the Set<String> for excluding pairs you already
have. Instead, just don't generate the reversed pairs in the first place:

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combinations.add(new Pair(strings, strings[j]));
}
}

The above does not account for the possibility of duplicate strings in
the input. In that case, you would still get a duplicated pair. That can
be resolved by removing the duplicates in the input before generating
the pairs.

Of course, if you have huge amounts of RAM, then you may find that
reducing the memory overhead is less important than reducing processing
overhead. In that case, you could deal with duplicates using a Set<Pair>
just as you had been before, but still at least avoiding generating
reversed pairs as above.

If you're guaranteed the input won't have duplicate strings, then of
course dealing with duplicates is not necessary.

There may be more "clever" improvements you could make. But I think it's
a good idea to deal with the more basic inefficiencies in the algorithm
first.

Pete
 
F

Fencer

I rewrote it like this:
package utility;

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class MiscUtils {

public static void main(String[] args) {
Map<String, String> combos =
findCombinations(new String[]{"chebi", "kegg.compound", "chebi"});
System.out.println(combos.size());
System.out.println(combos);
}

public static Map<String, String> findCombinations(String[] strings) {
Set<String> set = new HashSet<String>(Arrays.asList(strings));

if (set.size() < strings.length) {
System.err.println("Input array contained duplicates.");
strings = (String[])set.toArray();
}

Map<String, String> combos = new LinkedHashMap<String, String>();

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combos.put(strings, strings[j]);
}
}

return combos;
}
}

but apparently I can't handle duplicates the way I try handle them,
because I get following runtime error:
Input array contained duplicates.
Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
at utility.MiscUtils.findCombinations(MiscUtils.java:23)
at utility.MiscUtils.main(MiscUtils.java:13)

This line strings = (String[])set.toArray(); is what the runtime system
is unhappy with.


- Fencer
 
S

Screamin Lord Byron

Hello, I have a large set of strings and I need to find all combinations
of two of those string and perform an algorithm on them (the algorithm
is actually an XQuery that runs over a large dataset so I won't post
that here). The combinations foo-bar and bar-foo are considered to be
equal so only one should be generated.

I wrote a static class method that takes an array of Strings and returns
all combinations in a Vector. The Vector stores a class Pair which I've
written myself too. The complete code is posted below with output from a
sample run. It seems to work, but how can it be improved and get rid of
the Pair class? Thanks!

If you need a concept of pair, then you need some class that represents
it. Why would you want to get rid of it?

What is the reason you are using a Vector (synchronization)? Why not
implement your equals() and hashcode() methods of the Pair class, and
simply use the HashSet for storing pairs? Then you wouldn't have to
worry about duplicates at all.

for (int i = 0; i < strings.length; ++i) {
for (int j = 0; j < strings.length; ++j) {

for (int j = i+1;.....
 
F

Fencer

Fencer said:
[...]
but apparently I can't handle duplicates the way I try handle them,
because I get following runtime error:
Input array contained duplicates.
Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
at utility.MiscUtils.findCombinations(MiscUtils.java:23)
at utility.MiscUtils.main(MiscUtils.java:13)

This line strings = (String[])set.toArray(); is what the runtime
system is unhappy with.

Of course it is unhappy with it. You can't do that. You need to copy
each string individually to a new array of the desired type (String[]),
or use the array returned by toArray() directly by always casting the
objects in it to String when you want to do something with them.

As far as your other changes, I think you are really barking up the
wrong tree. I don't see any compelling reason to try to avoid the Pair
class you wrote before, but certainly storing the pairs as a
LinkedHashMap doesn't seem like that would address whatever concern you
have. That's a less efficient way to store pairs of strings than just
having your own Pair class, assuming you don't need rapid retrieval of
the second string in a pair given _only_ the first string (which was not
part of your original problem description).

If you're looking to minimize the storage requirements for the pairs,
you might just keep two parallel arrays. The length is known in advance
(sum of the series), so it's just a matter of filling the array as the
loops proceed. One array stores the first string in the pair, the other
array stores the second in the pair.

Otherwise, just use an array of Pair objects.

Pete

Ok, thanks Pete!

- fencer
 
F

Fencer

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combinations.add(new Pair(strings, strings[j]));
}
}

The above does not account for the possibility of duplicate strings in
the input. In that case, you would still get a duplicated pair. That can
be resolved by removing the duplicates in the input before generating
the pairs.


Hmm, that loop correct doesn't seem correct (or I did as mistake
explaining the desired behavior). For the input a, b, c, and d it
generates the three combinations a:d, b:d, and c:d, but it should be six
combinations:
a b c d =>
a b
a c
a d
b c
b d
c d

- Fencer
 
F

Fencer

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combinations.add(new Pair(strings, strings[j]));
}
}

The above does not account for the possibility of duplicate strings in
the input. In that case, you would still get a duplicated pair. That can
be resolved by removing the duplicates in the input before generating
the pairs.


Hmm, that loop correct doesn't seem correct (or I did as mistake
explaining the desired behavior). For the input a, b, c, and d it
generates the three combinations a:d, b:d, and c:d, but it should be six
combinations:
a b c d =>
a b
a c
a d
b c
b d
c d

- Fencer


Lol nevermind that, I had introduced another error. Sorry for the noise!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,871
Messages
2,569,919
Members
46,172
Latest member
JamisonPat

Latest Threads

Top