Tore Aursand
Hi!
I have a large (more than 3,000 at the moment) set of documents in various
formats (mostly PDF and Word). I need to create a sort of (...) index of
these documents based on their similarity. I thought it would be nice to
gather some suggestions from the people in this group before I proceeded.
First of all: Converting the documents to a more sensible format (text in
my case) is not the problem. The problem is the indexing and how to store
the data which represents the similarity between the documents.
I've searched CPAN and found a few modules that are of interest,
primarily AI::Categorize and WordNet. I haven't used either of these before,
but it seems like WordNet is the most appropriate one; AI::Categorize
seems to require you to categorize some of the documents first (which I
don't have the opportunity to do).
Are there any other modules I should take a look at? Any suggestions on
how I should deal with this task? Is there anything you think I might
forget, or any traps I should look out for?
Any comments are appreciated! Thanks.