RFC: Text similarity


Tore Aursand

Hi!

I have a large (more than 3,000 at the moment) set of documents in various
formats (mostly PDF and Word). I need to create a sort of (...) index of
these documents based on their similarity. I thought it would be nice to
gather some suggestions from the people in this group before I proceeded.

First of all: Converting the documents to a more sensible format (text in
my case) is not the problem. The problem is the indexing and how to store
the data which represents the similarity between the documents.

I've done a search on CPAN and found a few modules which are of interest,
primarily AI::Categorize and WordNet. I haven't used any of these before,
but it seems like WordNet is the most appropriate one; AI::Categorize
seems to require you to categorize some of the documents first (which I
don't have the opportunity to do).

Are there any other modules I should take a look at? Any suggestions on
how I should deal with this task? Something you think I might forget?
Some traps I should look out for?

Any comments are appreciated! Thanks.
 

James Willmore

First of all: Converting the documents to a more sensible format (text in
my case) is not the problem. The problem is the indexing and how to store
the data which represents the similarity between the documents.

Just an insight or two ...

I'd use a database to store information about each document. This way,
you can use SQL to do things like count the word occurrences and create
stats on each document. Plus, you're comparing apples with apples - raw
word count with raw word count. It doesn't have to be a "real" database
(like MySQL or PostgreSQL) - it could be a Sprite or SQLite database. The
advantages to this approach are 1) you can try different options out
without having to re-parse 3,000 documents; 2) if you have more documents
to add or some to remove, a simple SQL statement or two is easier to
perform than a whole lot of re-coding or re-thinking the parsing part of
your code. In fact, you can split up the various parts of your logic into
different scripts that act as filters - one to parse the documents, one to
populate the database, and maybe a few to determine similarities. All too
often we think in terms of "once and done" when a few scripts might be a
better solution.
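
A rough, untested sketch of the kind of table I mean, using DBI with
DBD::SQLite (the table and column names are just made up for illustration):

    use strict;
    use warnings;
    use DBI;

    # Open (or create) an SQLite database file with one table that holds
    # one row per (document, word) pair and the raw count.
    my $dbh = DBI->connect('dbi:SQLite:dbname=docsim.db', '', '',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS word_count (
            doc_id INTEGER NOT NULL,
            word   TEXT    NOT NULL,
            count  INTEGER NOT NULL,
            PRIMARY KEY (doc_id, word)
        )
    });

    # Store the counts for one parsed document.
    my $sth = $dbh->prepare(q{
        INSERT OR REPLACE INTO word_count (doc_id, word, count)
        VALUES (?, ?, ?)
    });

    sub store_counts {
        my ($doc_id, $counts) = @_;    # $counts is a hashref: word => count
        $sth->execute($doc_id, $_, $counts->{$_}) for keys %$counts;
    }

    # Example: pretend we parsed document 1.
    store_counts(1, { perl => 12, module => 3, cpan => 5 });

    # From here on SQL does the stats for you, e.g. the 20 most common
    # words in a document:
    #   SELECT word, count FROM word_count
    #   WHERE doc_id = 1 ORDER BY count DESC LIMIT 20;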

I'd also look over one (or more) of the Lingua modules to establish
criteria for what to put into the database. I doubt if you want to put a
whole lot of "the" and "a" entries into the database. This would inflate
the data source to about 5 times what it needs to be. So, using something
like Lingua::StopWords might help.
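
An untested sketch of the filtering step with Lingua::StopWords
(getStopWords() hands back a hashref of stopwords per language code; I
believe 'no' is Norwegian, but check the docs):

    use strict;
    use warnings;
    use Lingua::StopWords qw(getStopWords);

    # Plain-text document on STDIN; print word => count with the
    # stopwords stripped out.  'en' is English, 'no' should be Norwegian.
    my %stop = ( %{ getStopWords('en') }, %{ getStopWords('no') } );

    my $text = do { local $/; <STDIN> };

    my %count;
    for my $word (split /\W+/, lc $text) {
        next if $word eq '' or length($word) == 1;  # drop empties, single chars
        next if $word =~ /^\d+$/;                   # drop plain numbers
        next if $stop{$word};                       # drop stopwords
        $count{$word}++;
    }

    print "$_\t$count{$_}\n"
        for sort { $count{$b} <=> $count{$a} } keys %count;

    # Note: \W also splits on the Norwegian letters under the default
    # settings, so the split needs adjusting (locale, or a hand-rolled
    # character class) for real data.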

There are Statistics modules as well. You could perform tests against
two documents and get a statistical correlation between the documents to
see *how* similar they are. I'm rusty on Statistics 101, but my thinking
is that a t-test between the two documents might be the way to go.
This may be overkill for what you want, but worth thinking about (for
maybe a minute or two :) ). There may even be something easier to do.
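
I'm not certain a t-test is even the right tool, but just to illustrate
the idea of boiling "how alike are these two documents" down to one
number, here's an untested sketch that computes a plain correlation
between two word-count profiles (made-up counts, and not statistically
rigorous):

    use strict;
    use warnings;

    # Two word-count hashes with made-up numbers.
    my %doc1 = ( perl => 10, text => 4, index => 2 );
    my %doc2 = ( perl => 7,  text => 5, word  => 3 );

    # Build two count vectors over the combined vocabulary.
    my %vocab = map { $_ => 1 } (keys %doc1, keys %doc2);
    my @words = keys %vocab;
    my @x = map { $doc1{$_} || 0 } @words;
    my @y = map { $doc2{$_} || 0 } @words;

    sub mean { my $sum = 0; $sum += $_ for @_; return $sum / @_ }

    # Pearson correlation coefficient between the two vectors.
    my ($mx, $my) = (mean(@x), mean(@y));
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $#words) {
        my ($dx, $dy) = ($x[$i] - $mx, $y[$i] - $my);
        $sxy += $dx * $dy;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
    }
    my $r = ($sxx && $syy) ? $sxy / sqrt($sxx * $syy) : 0;

    printf "correlation between doc1 and doc2: %.3f\n", $r;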

[ ... ]

Just my $0.02 :)
HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
The rhino is a homely beast, For human eyes he's not a feast.
Farewell, farewell, you old rhinoceros, I'll stare at something
less prepoceros. -- Ogden Nash
 

Michele Dondi

I have a large (more than 3,000 at the moment) set of documents in various
formats (mostly PDF and Word). I need to create a sort of (...) index of
these documents based on their similarity. I thought it would be nice to
gather some suggestions from the people in this group before I proceeded.

I know that this may seem naive, but in a popular science magazine I
read that a paper has been published about a technique that indeed
identifies the (natural) language some documents are written in by
compressing (e.g. LZW) them along with some more text from samples
taken from a bunch of different languages and comparing the different
compressed sizes. You may try some variation on this scheme...
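
A quick-and-dirty way to play with this from Perl might be Compress::Zlib
(zlib rather than LZW, but the principle should be the same): compress
each document on its own and the two concatenated, then compare the
sizes. Untested sketch:

    use strict;
    use warnings;
    use Compress::Zlib;    # provides compress()

    # "Normalized compression distance": values near 0 mean very similar,
    # values near 1 mean the compressor found nothing in common.
    sub ncd {
        my ($x, $y) = @_;
        my $cx  = length compress($x);
        my $cy  = length compress($y);
        my $cxy = length compress($x . $y);
        my ($min, $max) = $cx < $cy ? ($cx, $cy) : ($cy, $cx);
        return ($cxy - $min) / $max;
    }

    # Toy data: doc1 and doc2 are near-identical, doc3 is unrelated.
    my $doc1 = "the quick brown fox jumps over the lazy dog " x 50;
    my $doc2 = "the quick brown fox jumps over the lazy cat " x 50;
    my $doc3 = join ' ', map { int rand 10000 } 1 .. 500;

    printf "doc1 vs doc2: %.3f\n", ncd($doc1, $doc2);
    printf "doc1 vs doc3: %.3f\n", ncd($doc1, $doc3);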

I for one would be interested in the results, BTW!


Michele
 

Malcolm Dew-Jones

Tore Aursand ([email protected]) wrote:
: Hi!

: I have a large (more than 3,000 at the moment) set of documents in various
: formats (mostly PDF and Word). I need to create a sort of (...) index of
: these documents based on their similarity. I thought it would be nice to
: gather some suggestions from the people in this group before I proceeded.

: First of all: Converting the documents to a more sensible format (text in
: my case) is not the problem. The problem is the indexing and how to store
: the data which represents the similarity between the documents.

: I've done a search on CPAN and found a few modules which are of interest,
: primarily AI::Categorize and WordNet. I haven't used any of these before,
: but it seems like WordNet is the most appropriate one; AI::Categorize
: seems to require you to categorize some of the documents first (which I
: don't have the opportunity to do).

: Are there any other modules I should take a look at? Any suggestions on
: how I should deal with this task? Something you think I might forget?
: Some traps I should look out for?

: Any comments are appreciated! Thanks.

There is a Bayesian filter (not for spam); I think it's called ifile.

It helps file email into folders based on categories.

I could imagine starting the process by creating a few categories by hand,
each with one document or two or three similar documents, and then adding
documents using ifile. Each time a document doesn't have a good match in
the existing categories, create a new category.

Or do it in reverse, start with 2,999 categories (one for each document,
except the last), and take the last document (number 3,000) and try to
file it into one of the 2,999 categories. Do that for each document to
get a feel for the process, and then start merging the categories.

$0.02
 

Ala Qumsieh

Tore said:
Any comments are appreciated! Thanks.

I would suggest taking your question to the perlai mailing list. I
recall a discussion about a similar problem a while ago.

--Ala
 

ctcgag

Michele Dondi said:
I know that this may seem naive, but in a popular science magazine I
read that a paper has been published about a technique that indeed
identifies the (natural) language some documents are written in by
compressing (e.g. LZW) them along with some more text from samples
taken from a bunch of different languages and comparing the different
compressed sizes. You may try some variation on this scheme...

I've tried this in various incarnations. It works well for very short
files, but for longer files it takes some sort of preprocessing. Most
compressors either operate chunk-wise, starting over again once the
code-book is full, or have some other mechanism that compresses only
locally. So if you just append documents, and they are long, then the
compressor will have forgotten about the section of one document by the
time it gets to the corresponding part of another document.

Xho
 

Tore Aursand

I know that this may seem naive, but in a popular science magazine I
read that a paper has been published about a technique that indeed
identifies the (natural) language some documents are written in by
compressing (e.g. LZW) them along with some more text from samples taken
from a bunch of different languages and comparing the different
compressed sizes. You may try some variation on this scheme...

I really don't have the opportunity to categorize any of the documents;
everything must be 100% automatic, without human interference.

I should also point out that the text is mainly in Norwegian, but there
might be occurrences of English text (as we're talking about technical
manuals).

I for one would be interested in the results, BTW!

I will keep you updated! :)
 

Tore Aursand

I'd use a database to store information about each document.

That has already been taken care of; I will use MySQL for this, and
already have a database up and running which contains meta information
about each document (title, description and where it is stored).

The next step will be to retrieve all the words from each document, remove
obvious stopwords, and then associate each document with its words (and
how many times each word appears in that document).

Based on this information I will create a script which tries to find
similar documents based on the associated words; if two documents hold a
majority of the same words, they are bound to be similar. :)
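
One simple way to put a number on that "majority of the same words" idea
is a cosine similarity between the two word-count hashes; an untested
sketch with made-up frequencies:

    use strict;
    use warnings;

    # Cosine similarity between two word-frequency hashes:
    # 1.0 means the same word profile, 0.0 means no words in common.
    sub cosine {
        my ($h1, $h2) = @_;
        my ($dot, $n1, $n2) = (0, 0, 0);
        $n1 += $_ ** 2 for values %$h1;
        $n2 += $_ ** 2 for values %$h2;
        for my $word (keys %$h1) {
            $dot += $h1->{$word} * $h2->{$word} if exists $h2->{$word};
        }
        return 0 unless $n1 && $n2;
        return $dot / (sqrt($n1) * sqrt($n2));
    }

    # Made-up word counts for two documents:
    my %doc1 = ( motor => 12, manual => 5, service => 3 );
    my %doc2 = ( motor => 9,  manual => 2, garanti => 4 );

    printf "similarity: %.3f\n", cosine(\%doc1, \%doc2);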

The documents are in Norwegian, though, so I'm not able to rely on some of
the excellent Lingua- and Stem-modules out there. I'm aware that there
are a few modules for the Norwegian language, too, but I'm not quite sure
about their quality (or whether they rely too much on Danish, which at
least some of the modules do).

The whole application is - of course - split into more than one script:

* Processing: converting the documents to text, and converting the
text into words (and how many times each word appears).
* Inserting into the database.
* Similarity checking: a script which checks every document in the
database against all the other documents. Quite expensive, this one,
but easily run around 5 in the morning when everyone is asleep. :)
* Web frontend for querying the database (i.e. selecting/reading the
documents and letting the user choose to see related documents).

There are Statistics modules as well. You could perform tests against
two documents and get a statistical correlation between the documents
to see *how* similar they are.

Hmm. Do you have any module names? A "brief search" didn't yield any
useful hits.

I'm rusty on Statistics 101, but my thinking is that a t-test between the
two documents might be the way to go.

I don't even know what a "t-test" is, but googling for "t-test" may give
me the answer...? Or should I search for something else (specific)?

Just my $0.02 :)

Great! Thanks a lot!
 

Michele Dondi

I really don't have the opportunity to categorize any of the documents;
everything must be 100% automatic, without human interference.

Well, you may try matching limited-sized portions of the documents
(after having converted them to pure text) against each other (I mean
across documents, not within the *same* document) and average the
result over a document.
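
Something like this, perhaps (untested; the chunk size is arbitrary, and
the per-chunk measure here is just a crude shared-word ratio, so any of
the other measures mentioned in this thread could be dropped in instead):

    use strict;
    use warnings;

    use constant CHUNK_WORDS => 200;   # arbitrary chunk size

    # Split a plain-text document into chunks of CHUNK_WORDS words.
    sub chunks {
        my ($text) = @_;
        my @words = split /\s+/, $text;
        my @chunks;
        push @chunks, join ' ', splice(@words, 0, CHUNK_WORDS) while @words;
        return @chunks;
    }

    # Crude per-chunk similarity: shared distinct words over all distinct words.
    sub chunk_sim {
        my ($x, $y) = @_;
        my %a = map { lc($_) => 1 } grep { length } split /\W+/, $x;
        my %b = map { lc($_) => 1 } grep { length } split /\W+/, $y;
        my $shared = grep { $b{$_} } keys %a;
        my %union  = (%a, %b);
        my $total  = keys %union;
        return $total ? $shared / $total : 0;
    }

    # Compare every chunk of one document against every chunk of the other,
    # keep the best match per chunk, and average over the document.
    sub doc_sim {
        my ($text1, $text2) = @_;
        my @c1 = chunks($text1);
        my @c2 = chunks($text2);
        return 0 unless @c1 && @c2;
        my $sum = 0;
        for my $c (@c1) {
            my $best = 0;
            for my $d (@c2) {
                my $s = chunk_sim($c, $d);
                $best = $s if $s > $best;
            }
            $sum += $best;
        }
        return $sum / @c1;
    }

    # Usage:
    #   my $score = doc_sim($plain_text_1, $plain_text_2);   # 0 .. 1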


Just my 2x10^-12 Eur,
Michele
 

Tore Aursand

Well, you may try matching limited-sized portions of the documents
(after having converted them to pure text) against each other (I mean
across documents, not within the *same* document) and average the result
over a document.

Because there will be _a lot_ of documents - and each document can be
quite big - I have to keep the following in mind:

* Processing power is limited, so the matching must be as light-
weight as possible, but at the same time as good as possible. Yeah, I
know how that sentence sounds. :)

* Data storage is also limited; I can't store each document (and
all its contents) in the database. I can only store meta data and
data related to the task of finding related documents.

The latter brings me to the point of extracting all the words from each
document, removing single characters, stopwords and numbers, and then
storing these words (and their frequencies) in a document/word-mapped data
table. Quite simple, really.
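
For what it's worth, once that table exists the database itself can
produce a first shortlist of candidate pairs. An untested sketch against
MySQL (the connection details and the doc_word(doc_id, word, freq) table
are only an example):

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details and table name.
    my $dbh = DBI->connect('dbi:mysql:database=docsim', 'user', 'password',
                           { RaiseError => 1 });

    # Count how many distinct words each pair of documents has in common,
    # and list the most "overlapping" pairs first.
    my $sql = q{
        SELECT a.doc_id AS doc_a, b.doc_id AS doc_b, COUNT(*) AS shared_words
        FROM doc_word a
        JOIN doc_word b ON a.word = b.word AND a.doc_id < b.doc_id
        GROUP BY a.doc_id, b.doc_id
        ORDER BY shared_words DESC
        LIMIT 50
    };

    for my $row (@{ $dbh->selectall_arrayref($sql) }) {
        my ($doc_a, $doc_b, $shared) = @$row;
        print "documents $doc_a and $doc_b share $shared words\n";
    }

The raw shared-word count favours long documents, so it would need to be
normalised later, but as a nightly pre-filter it keeps the expensive
pairwise comparison down to a manageable set.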
 

James Willmore

[ ... ]
Hmm. Do you have any module names? A "brief search" didn't yield any
useful hits.


I don't even know what a "t-test" is, but googling for "t-test" may give
me the answer...? Or should I search for something else (specific)?

A t-test is a statistical test. It measures variance between two groups -
in other words, whether the results you see are likely due to chance or
whether the two groups are actually similar. If you use Google, try
searching for "t-test" (use the quotes to ensure the search gives the
proper results). The first URL
(http://trochim.human.cornell.edu/kb/stat_t.htm) gives a pretty good
explanation.

Sad thing is, I had a thought about where to go with this and lost the
thought :-( I worked some crazy hours the last few days, as well as
fighting off a cold - that may have something to do with it :-(

There are Statistics modules available. Statistics::DependantTTest is one
that will do t-tests for you. I'm not sure now if this is a good way to
go.

Whichever way you go, I'd be interested in knowing. I had tried doing
something similar some time back, but stopped the effort in favor of a
pre-made solution.

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Why did the Roman Empire collapse? What is the Latin for office
automation?
 
