Unofficial quiz: Here's an algorithmic question for you

H

Hal Fulton

This is actually related to an interesting problem given to me in college
by one of my professors.

I'm thinking of digging out that disk and resurrecting the problem as a
quiz. So I won't go into detail on it here.

The original problem:
- 100 text fragments come from four different sources.
- There are 25 from each source.
- They're now scrambled.
- The sources are not labeled or named as such (their order
does not matter).
- Do whatever textual or statistical analysis you deem appropriate,
and separate them into four buckets, approximating their original
buckets as well as you can.

The question I'm asking here is related only to the evalution of the
quality of a given result:

Given four bins e,f,g,h (the results of our guesswork) and given
the four original bins a,b,c,d -- evaluate how well we did.

For simplicity, assume these bins are just arrays of integers.

Note that:
- each bin has exactly 25 items
- the union of a,b,c,d is 0..99
- the union of e,f,g,h is also 0..99
- there is no special correspondence between a and e,
b and f, etc.

So my real questions are:

1. How would you solve the problem of evaluating the correctness
of the partitioning? This would have to be "fuzzy" in some
sense, of course -- maybe a Float answer between 0 and 1.
2. And the meta-question: Is the "correctness" ambiguous? I think
not, but I haven't examined it thoroughly.


Cheers,
Hal
 
M

Mikael Brockman

Hal Fulton said:
This is actually related to an interesting problem given to me in college
by one of my professors.

I'm thinking of digging out that disk and resurrecting the problem as a
quiz. So I won't go into detail on it here.

The original problem:
- 100 text fragments come from four different sources.
- There are 25 from each source.
- They're now scrambled.
- The sources are not labeled or named as such (their order
does not matter).
- Do whatever textual or statistical analysis you deem appropriate,
and separate them into four buckets, approximating their original
buckets as well as you can.

The question I'm asking here is related only to the evalution of the
quality of a given result:

Given four bins e,f,g,h (the results of our guesswork) and given
the four original bins a,b,c,d -- evaluate how well we did.

For simplicity, assume these bins are just arrays of integers.

Note that:
- each bin has exactly 25 items
- the union of a,b,c,d is 0..99
- the union of e,f,g,h is also 0..99
- there is no special correspondence between a and e,
b and f, etc.

So my real questions are:

1. How would you solve the problem of evaluating the correctness
of the partitioning? This would have to be "fuzzy" in some
sense, of course -- maybe a Float answer between 0 and 1.
2. And the meta-question: Is the "correctness" ambiguous? I think
not, but I haven't examined it thoroughly.


Cheers,
Hal

You may find http://arxiv.org/abs/cs.SD/0303025 interesting:
 
M

Mauricio Fernández

This is actually related to an interesting problem given to me in college
by one of my professors.

I'm thinking of digging out that disk and resurrecting the problem as a
quiz. So I won't go into detail on it here.

The original problem:
- 100 text fragments come from four different sources.
- There are 25 from each source.
- They're now scrambled.
- The sources are not labeled or named as such (their order
does not matter).
- Do whatever textual or statistical analysis you deem appropriate,
and separate them into four buckets, approximating their original
buckets as well as you can.

This implies that the sources have different characteristic properties ---
meaning that all the fragments from the same source are in some (non
obvious) way similar, right?

If so, your problem can be seen as a clusterization problem, which you
can solve using any of the numerous methods in the literature (say for
instance K-means).
The question I'm asking here is related only to the evalution of the
quality of a given result:

Given four bins e,f,g,h (the results of our guesswork) and given
the four original bins a,b,c,d -- evaluate how well we did.

For simplicity, assume these bins are just arrays of integers.

Note that:
- each bin has exactly 25 items
- the union of a,b,c,d is 0..99
- the union of e,f,g,h is also 0..99
- there is no special correspondence between a and e,
b and f, etc.

So my real questions are:

1. How would you solve the problem of evaluating the correctness
of the partitioning? This would have to be "fuzzy" in some
sense, of course -- maybe a Float answer between 0 and 1.

mmm what about the following:

Define a mapping from elements in your working set to a feature vector.
Compute the average feature vector for each of A={a,...,d} and B={e,...,h},
note them V_i and W_j (for i,j=1..4)
then you can define a distance as the minimum of
sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
between averages in A and B
(there are 24 such mappings)

you might want to add a term like
K*sum(sum(|W_j-elem_i|^2,i=1..25), j=1..4)
to represent the distance from the elements in each cluster to the
average.

If you use the method suggested in the other msg in the thread (the one
based on the Kolmogorov complexity), you could take
length(gzip(concat(ORIG BUCKET, sort(CANDIDATE BUCKET))))
as a distance estimation in the 1st equation, but this can fail
depending on the nature of your sources.
2. And the meta-question: Is the "correctness" ambiguous? I think
not, but I haven't examined it thoroughly.

In which sense?
 
R

Robert Klemme

Mauricio Fernández said:
mmm what about the following:

Define a mapping from elements in your working set to a feature vector.
Compute the average feature vector for each of A={a,...,d} and
B={e,...,h},
note them V_i and W_j (for i,j=1..4)
then you can define a distance as the minimum of
sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
between averages in A and B
(there are 24 such mappings)

That's interesting. I took a bit less formal mathematical approach of
thinking, but I think the solution is quite similar. Here's what I'd have
done: for all permutations of e,f,g,h divide the size of the set
intersection of each pair by 25 (the set size) and then compute the average
of these values for all four pairs. Now choose the permutation of e,f,g,h
with the highest value.

As the faculty function is involved for a general solution there could be
numerous optimizaions. For example, it is reasonable to first calculate the
match indicators (intersection size / set size) for all pairs with O(n^2),
order the result buckets per original bucket with decreasing match indicator
and use that to guide the permutation. If you're lucky (i.e. if the bucket
finding algorithm worked well) every original bucket has a different
calculated bucket as first element in the ordered list and you're done
immediately.

If not, it's more difficult if you want to find the optimal permutation. If
you just want to detect failure of the bucket finding algorithm it might as
well suffice to recognize that two of the original buckets have the same
best match calculated bucket (which then both must have a match indicator <=
0.5 which seems pretty bad IMHO).

Another line of action could be to rule out pairs below a certain match
indicator (say 0.8) to reduce the number of permutations to consider.

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top