similar articles algorithm based on numeric indexing of all rows via columns in a table

julie_smith

Hi,
I have an articles table containing columns like
id, name, author, section, creationdate, description, longmatter, etc.
I am using MySQL.

Some of them are fixed-value fields (enumerations);

for example, section will have news, sports, politics, etc.,

while description will be a text field with any amount of arbitrary
text.

Now I have 50,000 articles under different sections.

I want to implement a "similar articles" feature.
By this I mean that when an article is shown,
I want to display all the similar articles based on that article (10
per page).

Now how do I calculate the similarity of one article with all the 50,000
articles?

I don't want articles from the same section only.
Since the search result has to be very fast,
can I create some algorithm that will look through all the fields in
each row of the
articles table and assign a weight/checksum to it?

And then, in the similar articles part, I display all the articles with a
checksum within +-5 of the
currently displayed article's checksum?

Thanks in advance,

Julie
 
Gunnar Hjalmarsson

> I want to implement a "similar articles" feature.
> By this I mean that when an article is shown,
> I want to display all the similar articles based on that article (10
> per page).
>
> Now how do I calculate the similarity of one article with all the 50,000
> articles?
>
> I don't want articles from the same section only.
> Since the search result has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?

Check out the CPAN module Algorithm::Diff.
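
To give a rough idea of what a sequence-comparison approach looks like, here is a minimal sketch in Python. It does not use Algorithm::Diff itself; it swaps in Python's standard difflib.SequenceMatcher, a comparable sequence-matching tool, and the function name text_similarity and the sample strings are made up for illustration. Note that this compares one pair of texts at a time, so it says nothing yet about doing that quickly across 50,000 rows.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1 describing how much of the
    word sequence of `a` also appears, in order, in `b`."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    # Compare word lists rather than characters so that the score
    # reflects shared wording, not shared letters.
    return SequenceMatcher(None, words_a, words_b).ratio()


# Hypothetical article descriptions, for illustration only.
doc1 = "the team won the final match"
doc2 = "the team lost the final match"
doc3 = "parliament votes on new budget"

print(text_similarity(doc1, doc2))  # high: most words match in order
print(text_similarity(doc1, doc3))  # low: no words in common
```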
 
Anno Siegel

> Hi,
> I have an articles table containing columns like
> id, name, author, section, creationdate, description, longmatter, etc.
> I am using MySQL.
>
> Some of them are fixed-value fields (enumerations);
>
> for example, section will have news, sports, politics, etc.,
>
> while description will be a text field with any amount of arbitrary
> text.
>
> Now I have 50,000 articles under different sections.
>
> I want to implement a "similar articles" feature.

Okay. Given two articles, how do you decide whether they are similar?

> By this I mean that when an article is shown,
> I want to display all the similar articles based on that article (10
> per page).

What you are going to do with the list of similar articles has
no bearing on how you select them.

> Now how do I calculate the similarity of one article with all the 50,000
> articles?

First you have to tell us how to compare two individual articles, *then*
we can talk about ways to apply this to many pairs efficiently.

> I don't want articles from the same section only.
> Since the search result has to be very fast,
> can I create some algorithm that will look through all the fields in
> each row of the
> articles table and assign a weight/checksum to it?
>
> And then, in the similar articles part, I display all the articles with a
> checksum within +-5 of the
> currently displayed article's checksum?

Since you mention all the different fields, I suppose they all play
a part in deciding whether two articles are similar or not. You can't
map that many dimensions onto a single number and have it work the way
you want it to. The best you can hope for is a numeric representation
of *each field*, which can be compared to decide if articles are similar
with respect to one particular field. With some of the fields being
text strings, that won't be possible for all fields either.

Anno
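
The per-field idea above could be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the weights, the choice of word-set overlap (Jaccard) for the text field, the function names, and the sample rows are not from the thread. The point is only that each field gets its own comparison and the results are combined per pair, rather than collapsing each row into one checksum beforehand.

```python
import heapq

# Hypothetical per-field weights; they would need tuning on real data.
WEIGHTS = {"section": 1.0, "author": 1.5, "description": 3.0}


def jaccard(a: str, b: str) -> float:
    """Word-set overlap of two texts, between 0 and 1."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def similarity(x: dict, y: dict) -> float:
    """Combine one comparison per field into a single pair score."""
    score = WEIGHTS["section"] * (x["section"] == y["section"])
    score += WEIGHTS["author"] * (x["author"] == y["author"])
    score += WEIGHTS["description"] * jaccard(x["description"], y["description"])
    return score


def most_similar(current: dict, articles: list, n: int = 10) -> list:
    """Return the n articles scoring highest against `current`."""
    others = (a for a in articles if a["id"] != current["id"])
    return heapq.nlargest(n, others, key=lambda a: similarity(current, a))


# Made-up rows standing in for the articles table.
articles = [
    {"id": 1, "section": "sports", "author": "ann",
     "description": "local team wins cup final"},
    {"id": 2, "section": "news", "author": "bob",
     "description": "team wins cup after dramatic final"},
    {"id": 3, "section": "politics", "author": "ann",
     "description": "budget vote delayed again"},
    {"id": 4, "section": "sports", "author": "cat",
     "description": "weather forecast for the weekend"},
]

for article in most_similar(articles[0], articles, n=2):
    print(article["id"], article["description"])
```

This scans every article on each request, which may be too slow for 50,000 rows with long text; precomputing the scores or the top matches per article offline would be one way around that.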
 
