similar articles algorithm based on numeric indexing of all rows via columns in a table

Discussion in 'Perl Misc' started by julie_smith@operamail.com, Jan 18, 2005.

  1. Guest

    Hi,
    I have an articles table in MySQL with columns like
    id, name, author, section, creationdate, description, longmatter, etc.

    Some of them are fixed-value fields (enumerations). For example,
    section will hold news, sports, politics, etc., while description is
    a text field with any amount of arbitrary text.

    I now have 50,000 articles under different sections.

    I want to implement a "similar articles" feature. By this I mean that
    when an article is shown, I want to display all the articles similar
    to it (10 per page).

    How do I calculate the similarity of one article to all 50,000
    articles?

    I don't want articles from the same section only.
    Since the search has to be very fast, can I create an algorithm that
    looks through all the fields in each row of the articles table and
    assigns a weight/checksum to that row?

    Then, in the "similar articles" part, I would display all the articles
    whose checksum is within +-5 of the checksum of the currently
    displayed article.

    Thanks in advance,

    Julie
    , Jan 18, 2005
    #1

  2. Re: similar articles algorithm based on numeric indexing of all rows via columns in a table

    wrote:
    > I want to implement a "similar articles" feature.
    > By this I mean when an article is shown,
    > I want to display all the similar articles based on that article.(10
    > per page).
    >
    > Now how do I calculate the similarity of 1 article with all the 50000
    > articles ?
    >
    > I don't want articles from the same section only.
    > Since the search result has to be very fast,
    > Can I create some algorithm that will look through all the fields in
    > each row of the
    > articles table and assign a weight/checksum to it.


    Check out the CPAN module Algorithm::Diff.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jan 18, 2005
    #2

  3. Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > Hi,
    > I have an articles table containing columns like
    > id,name,author,section,creationdate,description,longmatter, etc.
    > I am using mysql.
    >
    > some of them are fixed value fields(enumerations)
    >
    > like->section will have news,sports,politics etc...
    >
    > while description will be a text field with any amount of arbitrary
    > text.
    >
    > now I have 50000 articles under different sections.
    >
    > I want to implement a "similar articles" feature.


    Okay. Given two articles, how do you decide if they are similar?

    > By this I mean when an article is shown,
    > I want to display all the similar articles based on that article.(10
    > per page).


    What you are going to do with the list of similar articles has
    no bearing on how you select them.

    > Now how do I calculate the similarity of 1 article with all the 50000
    > articles ?


    First you have to tell us how to compare two individual articles, *then*
    we can talk about ways to apply this to many pairs efficiently.
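
    To illustrate that order of work, here is a minimal sketch in Python
    (the thread itself shows no code, so the language is a choice made for
    the example). It assumes, purely for illustration, that two articles
    are compared by the word overlap (Jaccard index) of their description
    text. The names words, jaccard and most_similar are hypothetical, and
    the scan over every candidate is the naive way to apply the pairwise
    measure, not a fast one.

```python
import re


def words(text):
    """Return the set of lowercased alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Pairwise measure: shared words divided by all distinct words."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def most_similar(current, candidates, limit=10):
    """Rank (id, description) pairs against the current description.

    Naive approach: one pairwise comparison per candidate.
    Candidates with no overlap at all are dropped.
    """
    scored = [(jaccard(current, desc), art_id) for art_id, desc in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [art_id for score, art_id in scored[:limit] if score > 0]
```

    Once the pairwise function is settled, making the scan fast (for
    example by precomputing word sets) is a separate question.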

    > I don't want articles from the same section only.
    > Since the search result has to be very fast,
    > Can I create some algorithm that will look through all the fields in
    > each row of the
    > articles table and assign a weight/checksum to it.
    >
    > And then in the similar articles part I display all the articles with a
    > +-5 difference in checksum with the
    > current displayed article's checksum ?


    Since you mention all the different fields, I suppose they all play
    a part in deciding whether two articles are similar or not. You can't
    map that many dimensions onto a single number and have it work the
    way you want. The best you can hope for is a numeric representation
    of *each field*, which can be compared to decide whether articles are
    similar with respect to that one field. With some of the fields being
    text strings, even that won't be possible for every field.
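
    A minimal sketch of the per-field idea, in Python, under loudly stated
    assumptions: the field names come from the original post, but the
    scoring rules, the 30-day scale for dates and the WEIGHTS values are
    all hypothetical choices made for the example. Text fields are handled
    here by word overlap, which is only one of several possible stand-ins.

```python
from datetime import date

# Hypothetical weights; how much each field matters is an assumption.
WEIGHTS = {"section": 0.2, "author": 0.2, "creationdate": 0.1, "description": 0.5}


def field_scores(a, b):
    """Compare two articles field by field; each score lies in [0, 1]."""
    days_apart = abs((a["creationdate"] - b["creationdate"]).days)
    wa = set(a["description"].lower().split())
    wb = set(b["description"].lower().split())
    union = wa | wb
    return {
        "section": 1.0 if a["section"] == b["section"] else 0.0,
        "author": 1.0 if a["author"] == b["author"] else 0.0,
        # Decays with distance in time; 30 days apart scores 0.5.
        "creationdate": 1.0 / (1.0 + days_apart / 30.0),
        "description": len(wa & wb) / len(union) if union else 0.0,
    }


def similarity(a, b, weights=WEIGHTS):
    """Weighted sum of the per-field scores for one pair of articles."""
    scores = field_scores(a, b)
    return sum(weights[field] * scores[field] for field in weights)
```

    Note that the combined score is computed for a *pair* of articles at
    comparison time. It is not a per-article checksum, so two articles
    cannot be matched by looking at stored numbers that happen to be
    close together.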

    Anno
    Anno Siegel, Jan 18, 2005
    #3
