Discussion in 'Perl Misc' started by julie_smith, Jan 18, 2005.

  1. julie_smith

    julie_smith Guest

    I have an articles table in MySQL with columns such as
    id, name, author, section, creationdate, description, longmatter, etc.

    Some of them are fixed-value fields (enumerations). For example,
    section will hold news, sports, politics, etc., while description
    is a text field with any amount of arbitrary text.

    I now have 50000 articles under different sections.

    I want to implement a "similar articles" feature.
    By this I mean that when an article is shown,
    I want to display all the articles similar to it (10 per page).

    How do I calculate the similarity of one article to all 50000
    articles?

    I don't want articles from the same section only.
    Since the search result has to be very fast,
    can I create some algorithm that looks through all the fields in
    each row of the articles table and assigns a weight/checksum to
    that row?

    Then, in the similar articles part, I would display all the articles
    whose checksum is within +-5 of the checksum of the currently
    displayed article.

    Thanks in advance,

    julie_smith, Jan 18, 2005
  2. Check out the CPAN module Algorithm::Diff.
    Gunnar Hjalmarsson, Jan 18, 2005
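
    One way Algorithm::Diff might be applied here, as a rough sketch only:
    split two descriptions into words and treat the length of their longest
    common subsequence as a similarity score. The word_similarity helper and
    the sample strings are hypothetical; only LCS comes from the module.

    ```perl
    use strict;
    use warnings;
    use Algorithm::Diff qw(LCS);

    # Hypothetical helper: score two texts between 0 and 1 by the length of
    # the longest common subsequence (LCS) of their lowercased words.
    sub word_similarity {
        my ($text_a, $text_b) = @_;
        my @words_a = split ' ', lc $text_a;
        my @words_b = split ' ', lc $text_b;
        return 0 unless @words_a && @words_b;

        # In list context LCS returns the common subsequence itself.
        my @common = LCS(\@words_a, \@words_b);
        my $longer = @words_a > @words_b ? scalar(@words_a) : scalar(@words_b);
        return scalar(@common) / $longer;
    }

    # Made-up descriptions: 2 of 4 words ("election results") are shared in order.
    printf "%.2f\n", word_similarity(
        'election results announced today',
        'local election results delayed',
    );    # prints 0.50
    ```

    Running this against all 50000 rows on every page view would likely be
    too slow, so the scores would probably have to be computed in advance.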
  3. julie_smith

    Anno Siegel Guest

    Okay. Given two articles, how do you decide if they are similar?
    What you are going to do with the list of similar articles has
    no bearing on how you select them.
    First you have to tell us how to compare two individual articles, *then*
    we can talk about ways to apply this to many pairs efficiently.
    Since you mention all the different fields, I suppose they all play
    a part in deciding whether two articles are similar or not. You can't
    map that many dimensions onto a single number and have it work the
    way you want. The best you can hope for is a numeric representation
    of *each field*, which can be compared to decide if articles are similar
    with respect to one particular field. With some of the fields being
    text strings, that won't be possible for all fields either.

    Anno Siegel, Jan 18, 2005
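
    The field-by-field comparison described above could be sketched roughly
    as follows, using core Perl only. This is an illustration under
    assumptions, not a method given in the thread: the compare_articles and
    word_overlap helpers, the choice of word-set overlap for the text field,
    and the sample rows are all hypothetical. Each field gets its own score
    and the scores are deliberately not merged into one number.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: overlap of the word sets of two strings
    # (shared words / all distinct words); 0 = disjoint, 1 = same set.
    sub word_overlap {
        my ($text_a, $text_b) = @_;
        my %in_a = map { ($_ => 1) } split ' ', lc $text_a;
        my %in_b = map { ($_ => 1) } split ' ', lc $text_b;
        my $shared = grep { $in_b{$_} } keys %in_a;
        my $total  = keys(%in_a) + keys(%in_b) - $shared;
        return $total ? $shared / $total : 0;
    }

    # Hypothetical helper: compare two article rows (hash refs) one field
    # at a time and return a separate score for each field.
    sub compare_articles {
        my ($x, $y) = @_;
        my %scores = (
            section     => ($x->{section} eq $y->{section} ? 1 : 0),
            author      => ($x->{author}  eq $y->{author}  ? 1 : 0),
            description => word_overlap($x->{description}, $y->{description}),
        );
        return \%scores;
    }

    # Made-up rows standing in for two records from the articles table.
    my $first = {
        section     => 'news',
        author      => 'A. Writer',
        description => 'city council approves new budget',
    };
    my $second = {
        section     => 'politics',
        author      => 'A. Writer',
        description => 'council rejects new budget plan',
    };

    my $scores = compare_articles($first, $second);
    printf "%-12s %.2f\n", $_, $scores->{$_} for sort keys %$scores;
    ```

    How the per-field scores are then weighed against each other is the
    question that still has to be answered before any of this can be made
    fast for many pairs.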
