Unofficial quiz: Here's an algorithmic question for you

Discussion in 'Ruby' started by Hal Fulton, Oct 9, 2004.

  1. Hal Fulton

    Hal Fulton Guest

    This is actually related to an interesting problem given to me in college
    by one of my professors.

    I'm thinking of digging out that disk and resurrecting the problem as a
    quiz. So I won't go into detail on it here.

    The original problem:
    - 100 text fragments come from four different sources.
    - There are 25 from each source.
    - They're now scrambled.
    - The sources are not labeled or named as such (their order
    does not matter).
    - Do whatever textual or statistical analysis you deem appropriate,
    and separate them into four buckets, approximating their original
    buckets as well as you can.

    The question I'm asking here is related only to the evalution of the
    quality of a given result:

    Given four bins e,f,g,h (the results of our guesswork) and given
    the four original bins a,b,c,d -- evaluate how well we did.

    For simplicity, assume these bins are just arrays of integers.

    Note that:
    - each bin has exactly 25 items
    - the union of a,b,c,d is 0..99
    - the union of e,f,g,h is also 0..99
    - there is no special correspondence between a and e,
    b and f, etc.

    So my real questions are:

    1. How would you solve the problem of evaluating the correctness
    of the partitioning? This would have to be "fuzzy" in some
    sense, of course -- maybe a Float answer between 0 and 1.
    2. And the meta-question: Is the "correctness" ambiguous? I think
    not, but I haven't examined it thoroughly.


    Cheers,
    Hal
     
    Hal Fulton, Oct 9, 2004
    #1
    1. Advertisements

  2. You may find http://arxiv.org/abs/cs.SD/0303025 interesting:
     
    Mikael Brockman, Oct 9, 2004
    #2
    1. Advertisements

  3. This implies that the sources have different characteristic properties ---
    meaning that all the fragments from the same source are in some (non
    obvious) way similar, right?

    If so, your problem can be seen as a clusterization problem, which you
    can solve using any of the numerous methods in the literature (say for
    instance K-means).
    mmm what about the following:

    Define a mapping from elements in your working set to a feature vector.
    Compute the average feature vector for each of A={a,...,d} and B={e,...,h},
    note them V_i and W_j (for i,j=1..4)
    then you can define a distance as the minimum of
    sum(|V_i-W_a(i)|^2, i=1..4) with a(i) representing the association
    between averages in A and B
    (there are 24 such mappings)

    you might want to add a term like
    K*sum(sum(|W_j-elem_i|^2,i=1..25), j=1..4)
    to represent the distance from the elements in each cluster to the
    average.

    If you use the method suggested in the other msg in the thread (the one
    based on the Kolmogorov complexity), you could take
    length(gzip(concat(ORIG BUCKET, sort(CANDIDATE BUCKET))))
    as a distance estimation in the 1st equation, but this can fail
    depending on the nature of your sources.
    In which sense?
     
    Mauricio Fernández, Oct 10, 2004
    #3
  4. That's interesting. I took a bit less formal mathematical approach of
    thinking, but I think the solution is quite similar. Here's what I'd have
    done: for all permutations of e,f,g,h divide the size of the set
    intersection of each pair by 25 (the set size) and then compute the average
    of these values for all four pairs. Now choose the permutation of e,f,g,h
    with the highest value.

    As the faculty function is involved for a general solution there could be
    numerous optimizaions. For example, it is reasonable to first calculate the
    match indicators (intersection size / set size) for all pairs with O(n^2),
    order the result buckets per original bucket with decreasing match indicator
    and use that to guide the permutation. If you're lucky (i.e. if the bucket
    finding algorithm worked well) every original bucket has a different
    calculated bucket as first element in the ordered list and you're done
    immediately.

    If not, it's more difficult if you want to find the optimal permutation. If
    you just want to detect failure of the bucket finding algorithm it might as
    well suffice to recognize that two of the original buckets have the same
    best match calculated bucket (which then both must have a match indicator <=
    0.5 which seems pretty bad IMHO).

    Another line of action could be to rule out pairs below a certain match
    indicator (say 0.8) to reduce the number of permutations to consider.

    Kind regards

    robert
     
    Robert Klemme, Oct 10, 2004
    #4
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.