# Unofficial quiz: Here's an algorithmic question for you

Discussion in 'Ruby' started by Hal Fulton, Oct 9, 2004.

1. ### Hal FultonGuest

This is actually related to an interesting problem given to me in college
by one of my professors.

I'm thinking of digging out that disk and resurrecting the problem as a
quiz. So I won't go into detail on it here.

The original problem:
- 100 text fragments come from four different sources.
- There are 25 from each source.
- They're now scrambled.
- The sources are not labeled or named as such (their order
does not matter).
- Do whatever textual or statistical analysis you deem appropriate,
and separate them into four buckets, approximating their original
buckets as well as you can.

The question I'm asking here is related only to the evalution of the
quality of a given result:

Given four bins e,f,g,h (the results of our guesswork) and given
the four original bins a,b,c,d -- evaluate how well we did.

For simplicity, assume these bins are just arrays of integers.

Note that:
- each bin has exactly 25 items
- the union of a,b,c,d is 0..99
- the union of e,f,g,h is also 0..99
- there is no special correspondence between a and e,
b and f, etc.

So my real questions are:

1. How would you solve the problem of evaluating the correctness
of the partitioning? This would have to be "fuzzy" in some
sense, of course -- maybe a Float answer between 0 and 1.
2. And the meta-question: Is the "correctness" ambiguous? I think
not, but I haven't examined it thoroughly.

Cheers,
Hal

Hal Fulton, Oct 9, 2004

2. ### Mikael BrockmanGuest

You may find http://arxiv.org/abs/cs.SD/0303025 interesting:

Mikael Brockman, Oct 9, 2004

3. ### Mauricio FernándezGuest

This implies that the sources have different characteristic properties ---
meaning that all the fragments from the same source are in some (non
obvious) way similar, right?

If so, your problem can be seen as a clusterization problem, which you
can solve using any of the numerous methods in the literature (say for
instance K-means).

Define a mapping from elements in your working set to a feature vector.
Compute the average feature vector for each of A={a,...,d} and B={e,...,h},
note them V_i and W_j (for i,j=1..4)
then you can define a distance as the minimum of
sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
between averages in A and B
(there are 24 such mappings)

you might want to add a term like
K*sum(sum(|W_j-elem_i|^2,i=1..25), j=1..4)
to represent the distance from the elements in each cluster to the
average.

If you use the method suggested in the other msg in the thread (the one
based on the Kolmogorov complexity), you could take
length(gzip(concat(ORIG BUCKET, sort(CANDIDATE BUCKET))))
as a distance estimation in the 1st equation, but this can fail
depending on the nature of your sources.
In which sense?

Mauricio Fernández, Oct 10, 2004
4. ### Robert KlemmeGuest

That's interesting. I took a bit less formal mathematical approach of
thinking, but I think the solution is quite similar. Here's what I'd have
done: for all permutations of e,f,g,h divide the size of the set
intersection of each pair by 25 (the set size) and then compute the average
of these values for all four pairs. Now choose the permutation of e,f,g,h
with the highest value.

As the faculty function is involved for a general solution there could be
numerous optimizaions. For example, it is reasonable to first calculate the
match indicators (intersection size / set size) for all pairs with O(n^2),
order the result buckets per original bucket with decreasing match indicator
and use that to guide the permutation. If you're lucky (i.e. if the bucket
finding algorithm worked well) every original bucket has a different
calculated bucket as first element in the ordered list and you're done
immediately.

If not, it's more difficult if you want to find the optimal permutation. If
you just want to detect failure of the bucket finding algorithm it might as
well suffice to recognize that two of the original buckets have the same
best match calculated bucket (which then both must have a match indicator <=
0.5 which seems pretty bad IMHO).

Another line of action could be to rule out pairs below a certain match
indicator (say 0.8) to reduce the number of permutations to consider.

Kind regards

robert

Robert Klemme, Oct 10, 2004