RFC: Text similarity

Discussion in 'Perl Misc' started by Tore Aursand, Apr 23, 2004.

  1. Tore Aursand

    Hi!

    I have a large (more than 3,000 at the moment) set of documents in various
    formats (mostly PDF and Word). I need to create a sort of (...) index of
    these documents based on their similarity. I thought it would be nice to
    gather some suggestions from the people in this group before I proceeded.

    First of all: Converting the documents to a more sensible format (text in
    my case) is not the problem. The problem is the indexing and how to store
    the data which represents the similarity between the documents.

    I've done a search on CPAN and found a few modules which are of
    interest, primarily AI::Categorize and WordNet. I haven't used any of
    these before, but it seems like WordNet is the most appropriate one;
    AI::Categorize seems to require you to categorize some of the documents
    first (which I don't have the opportunity to do).

    Are there any other modules I should take a look at? Any suggestions on
    how I should deal with this task? Something you think I might forget?
    Some traps I should look out for?

    Any comments are appreciated! Thanks.


    --
    Tore Aursand <>
    "First get your facts; then you can distort them at your leisure."
    (Mark Twain)
     
    Tore Aursand, Apr 23, 2004
    #1

  2. On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand wrote:

    > First of all: Converting the documents to a more sensible format (text in
    > my case) is not the problem. The problem is the indexing and how to store
    > the data which represents the similarity between the documents.


    Just an insight or two ...

    I'd use a database to store information about each document. This way,
    you can use SQL to do things like count the word occurrences and create
    stats on each document. Plus, you're comparing apples with apples - raw
    word count with raw word count. It doesn't have to be a "real" database
    (like MySQL or PostgreSQL) - it could be a Sprite or SQLite database.
    The advantages to this approach are 1) you can try different options out
    without having to re-parse 3,000 documents; 2) if you have more
    documents to add or some to remove, a simple SQL statement or two is
    easier to perform than a whole lot of re-coding or re-thinking the
    parsing part of your code. In fact, you can split up the various parts
    of your logic into different scripts that act as filters - one to parse
    the documents, one to populate the database, and maybe a few to
    determine similarities. All too often we think in terms of "once and
    done" when a few scripts might be a better solution.
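
    To make that concrete, here's a rough sketch of the kind of schema I
    mean, using DBI with DBD::SQLite (the table and column names are made
    up for illustration):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # One row per (document, word) pair; SQL can do the stats later.
        my $dbh = DBI->connect('dbi:SQLite:dbname=docs.db', '', '',
                               { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS doc_word (
                doc_id INTEGER NOT NULL,
                word   TEXT    NOT NULL,
                freq   INTEGER NOT NULL,
                PRIMARY KEY (doc_id, word)
            )
        });

        # For example, the total word count per document, straight from SQL:
        my $totals = $dbh->selectall_arrayref(
            'SELECT doc_id, SUM(freq) FROM doc_word GROUP BY doc_id'
        );
        printf "document %d: %d words\n", @$_ for @$totals;

    That way the parsing script only ever writes doc_word rows, and the
    similarity scripts only ever read them.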

    I'd also look over one (or more) of the Lingua modules to establish
    criteria for what to put into the database. I doubt you want to put a
    whole lot of "the" and "a" entries into the database; that would
    inflate the data source to about five times what it needs to be. So,
    using something like Lingua::StopWords might help.
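
    For instance, a quick sketch of that kind of filtering (the word list
    is just an illustration):

        use strict;
        use warnings;
        use Lingua::StopWords qw(getStopWords);

        my $stop  = getStopWords('en');   # hashref of stopwords
        my @words = qw(the quick brown fox jumps over the lazy dog);
        my @kept  = grep { !$stop->{$_} } @words;
        print "@kept\n";   # 'the' and friends are gone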

    There are Statistics modules as well. You could perform tests against
    two documents and get a statistical correlation between them to see
    *how* similar they are. I'm rusty on Statistics 101, but my thinking
    is that a t-test between the two documents might be the way to go.
    This may be overkill for what you want, but worth thinking about (for
    maybe a minute or two :) ). There may even be something easier to do.

    [ ... ]

    Just my $0.02 :)
    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    The rhino is a homely beast, For human eyes he's not a feast.
    Farewell, farewell, you old rhinoceros, I'll stare at something
    less prepoceros. -- Ogden Nash
     
    James Willmore, Apr 23, 2004
    #2

  3. On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand <>
    wrote:

    >I have a large (more than 3,000 at the moment) set of documents in various
    >formats (mostly PDF and Word). I need to create a sort of (...) index of
    >these documents based on their similarity. I thought it would be nice to
    >gather some suggestions from the people in this group before I proceeded.


    I know that this may seem naive, but in a popular science magazine I
    read that a paper has been published about a technique that identifies
    the (natural) language documents are written in by compressing them
    (e.g. with LZW) together with sample text from a bunch of different
    languages and comparing the resulting compressed sizes. You may try
    some variation on this scheme...
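
    If anyone wants to play with the idea, here's my own rough sketch of
    it as a "normalized compression distance" using Compress::Zlib - an
    adaptation, not the paper's exact recipe:

        use strict;
        use warnings;
        use Compress::Zlib;   # exports compress() by default

        # Smaller result = more similar: if $b adds little information
        # to $a, compressing them together costs little extra.
        sub ncd {
            my ($a, $b) = @_;
            my $ca  = length compress($a);
            my $cb  = length compress($b);
            my $cab = length compress($a . $b);
            my ($min, $max) = $ca < $cb ? ($ca, $cb) : ($cb, $ca);
            return ($cab - $min) / $max;
        }

        printf "%.3f\n", ncd('da capo al fine' x 50, 'da capo al fine' x 50);
        printf "%.3f\n", ncd('da capo al fine' x 50, 'qwertyuiop asdf' x 50);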

    I for one would be interested in the results, BTW!


    Michele
    --
    you'll see that it shouldn't be so. AND, the writting as usuall is
    fantastic incompetent. To illustrate, i quote:
    - Xah Lee trolling on clpmisc,
    "perl bug File::Basename and Perl's nature"
     
    Michele Dondi, Apr 23, 2004
    #3
  4. Tore Aursand () wrote:
    : Hi!

    : I have a large (more than 3,000 at the moment) set of documents in various
    : formats (mostly PDF and Word). I need to create a sort of (...) index of
    : these documents based on their similarity. I thought it would be nice to
    : gather some suggestions from the people in this group before I proceeded.

    : First of all: Converting the documents to a more sensible format (text in
    : my case) is not the problem. The problem is the indexing and how to store
    : the data which represents the similarity between the documents.

    : I've done a search on CPAN and found a few modules which is of interest,
    : primarily AI::Categorize and WordNet. I haven't used any of these before,
    : but it seems like WordNet is the most appropriate one; AI::Categorize
    : seems to require you to categorize some of the documents first (which I
    : don't have the opportunity to do).

    : Are there any other modules I should take a look at? Any suggestions on
    : how I should deal with this task? Something you think I might forget?
    : Some traps I should look out for?

    : Any comments are appreciated! Thanks.

    There is a Bayesian filter - not for spam - that I think is called
    ifile.

    It helps file email into folders based on categories.

    I could imagine starting the process by creating a few categories by
    hand, each with one document or two or three similar documents, and
    then adding documents using ifile. Each time a document doesn't have a
    good match in the existing categories, create a new category.

    Or do it in reverse, start with 2,999 categories (one for each document,
    except the last), and take the last document (number 3,000) and try to
    file it into one of the 2,999 categories. Do that for each document to
    get a feel for the process, and then start merging the categories.
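
    A rough sketch of that kind of filing, with a toy shared-words measure
    standing in for whatever real similarity test gets used:

        use strict;
        use warnings;

        my $THRESHOLD = 0.5;   # made-up cutoff; tune to taste

        # Toy measure: fraction of distinct words the two texts share.
        sub similarity {
            my ($a, $b) = @_;
            my (%wa, %wb);
            $wa{$_} = 1 for split ' ', lc $a;
            $wb{$_} = 1 for split ' ', lc $b;
            my $shared = grep { $wb{$_} } keys %wa;
            my %union  = (%wa, %wb);
            return $shared / keys(%union);
        }

        my @documents = ('the cat sat on the mat',
                         'a cat sat on a mat',
                         'compilers translate source code');
        my @categories;   # each category is an arrayref of documents

        # File each document into the first category similar enough to
        # it; otherwise open a new category for it.
        for my $doc (@documents) {
            my $placed;
            for my $cat (@categories) {
                if (similarity($doc, $cat->[0]) >= $THRESHOLD) {
                    push @$cat, $doc;
                    $placed = 1;
                    last;
                }
            }
            push @categories, [$doc] unless $placed;
        }
        printf "%d categories\n", scalar @categories;   # 2 here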

    $0.02
     
    Malcolm Dew-Jones, Apr 23, 2004
    #4
  5. Ala Qumsieh

    Tore Aursand wrote:

    > Any comments are appreciated! Thanks.


    I would suggest taking your question to the perlai mailing list. I
    recall a discussion about a similar problem a while ago.

    --Ala
     
    Ala Qumsieh, Apr 24, 2004
    #5
  6. Guest

    Michele Dondi <> wrote:
    > On Fri, 23 Apr 2004 14:16:53 +0200, Tore Aursand <>
    > wrote:
    >
    > >I have a large (more than 3,000 at the moment) set of documents in
    > >various formats (mostly PDF and Word). I need to create a sort of (...)
    > >index of these documents based on their similarity. I thought it would
    > >be nice to gather some suggestions from the people in this group before
    > >I proceeded.

    >
    > I know that this may seem naive, but in a popular science magazine I
    > read that a paper has been published about a technique that indeed
    > identifies the (natural) language some documents are written in by
    > compressing (e.g. LZW) them along with some more text from samples
    > taken from a bunch of different languages and comparing the different
    > compressed sizes. You may try some variation on this scheme...


    I've tried this in various incarnations. It works well for very short
    files, but for longer files it takes some sort of preprocessing. Most
    compressors either operate chunk-wise, starting over again once the
    code-book is full, or have some other mechanism that compresses only
    locally. So if you just append documents, and they are long, the
    compressor will have forgotten about a section of one document by the
    time it gets to the corresponding part of the other document.

    Xho


    >
    > I for one would be interested in the results, BTW!
    >
    > Michele


    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 24, 2004
    #6
  7. Tore Aursand

    On Fri, 23 Apr 2004 21:50:52 +0200, Michele Dondi wrote:
    >> I have a large (more than 3,000 at the moment) set of documents in
    >> various formats (mostly PDF and Word). I need to create a sort of
    >> (...) index of these documents based on their similarity. I thought it
    >> would be nice to gather some suggestions from the people in this group
    >> before I proceeded.


    > I know that this may seem naive, but in a popular science magazine I
    > read that a paper has been published about a technique that indeed
    > identifies the (natural) language some documents are written in by
    > compressing (e.g. LZW) them along with some more text from samples taken
    > from a bunch of different languages and comparing the different
    > compressed sizes. You may try some variation on this scheme...


    I really don't have the opportunity to categorize any of the documents;
    everything must be 100% automatic, without human interference.

    I should also point out that the text is mainly in Norwegian, but there
    might be occurrences of English text (as we're talking about technical
    manuals).

    > I for one would be interested in the results, BTW!


    I will keep you updated! :)


    --
    Tore Aursand <>
    "First, God created idiots. That was just for practice. Then He created
    school boards." (Mark Twain)
     
    Tore Aursand, Apr 26, 2004
    #7
  8. Tore Aursand

    On Fri, 23 Apr 2004 11:46:44 -0400, James Willmore wrote:
    >> First of all: Converting the documents to a more sensible format (text
    >> in my case) is not the problem. The problem is the indexing and how to
    >> store the data which represents the similarity between the documents.


    > I'd use a database to store information about each document in.


    That has already been taken care of; I will use MySQL for this, and
    already have a database up and running which holds meta information
    about each document (title, description and where it is stored).

    The next step will be to retrieve all the words from each document,
    remove obvious stopwords, and then associate each document with its
    words (and how many times each word appears in each document).
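
    Roughly along these lines, I imagine (I use Lingua::StopWords'
    Norwegian list here only as a placeholder for whatever stopword list I
    end up trusting):

        use strict;
        use warnings;
        use Lingua::StopWords qw(getStopWords);

        my $stop = getStopWords('no');   # Norwegian stopword list

        # Turn one document's text into { word => frequency }.
        # NB: $text should be a decoded character string, so that \w
        # also matches Norwegian letters like æ, ø and å.
        sub word_counts {
            my ($text) = @_;
            my %count;
            for my $word (split /\W+/, lc $text) {
                next if $word eq '';
                next if $stop->{$word};
                $count{$word}++;
            }
            return \%count;
        }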

    Based on this information I will create a script which tries to find
    similar documents based on the associated words; if two documents hold
    a majority of the same words, they are bound to be similar. :)
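
    One way to put a number on "a majority of the same words" is cosine
    similarity over the word frequencies - just one possible choice, and
    only a sketch:

        use strict;
        use warnings;
        use List::Util qw(sum);

        # Cosine similarity between two { word => frequency } hashrefs:
        # 1 means identical distributions, 0 means no words shared.
        # E.g. cosine(word_counts($a), word_counts($b)) with the sub
        # sketched above.
        sub cosine {
            my ($a, $b) = @_;
            my $dot = sum(0, map { ($b->{$_} || 0) * $a->{$_} } keys %$a);
            my $na  = sqrt(sum(0, map { $_ * $_ } values %$a));
            my $nb  = sqrt(sum(0, map { $_ * $_ } values %$b));
            return ($na && $nb) ? $dot / ($na * $nb) : 0;
        }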

    The documents are in Norwegian, though, so I'm not able to rely on some
    of the excellent Lingua- and Stem-modules out there. I'm aware that
    there are a few modules for the Norwegian language, too, but I'm not
    quite sure about their quality (or whether they rely too much on
    Danish, which at least some of the modules do).

    The whole application is - of course - split into more than one script:

    * Processing: converting the documents to text, and converting the
    text into words (and how many times each word appears).
    * Inserting into the database.
    * Similarity checking: a script which checks every document in the
    database against all the other documents. Quite expensive, this one,
    but easily run around 5 in the morning when everyone is asleep. :)
    * Web frontend for querying the database (i.e. selecting/reading the
    documents and letting the user choose to see related documents).

    > There are Statistics modules as well. You could perform tests againist
    > two documents and get a statistically correlation between the documents
    > to see *how* similar they are.


    Hmm. Do you have any module names? A "brief search" didn't yield any
    useful hits.

    > I'm rusty on Statistics 101, but my thinking is maybe using a t-test
    > between the two documents might be the way to go.


    I don't even know what a "t-test" is, but googling for "t-test" may give
    me the answer...? Or should I search for something else (specific)?

    > Just my $0.02 :)


    Great! Thanks a lot!


    --
    Tore Aursand <>
    "Then there was the man who drowned crossing a stream with an average
    depth of six inches." (W.I.E. Gates)
     
    Tore Aursand, Apr 26, 2004
    #8
  9. On Tue, 27 Apr 2004 00:37:29 +0200, Tore Aursand <>
    wrote:

    >> I know that this may seem naive, but in a popular science magazine I
    >> read that a paper has been published about a technique that indeed
    >> identifies the (natural) language some documents are written in by
    >> compressing (e.g. LZW) them along with some more text from samples taken
    >> from a bunch of different languages and comparing the different
    >> compressed sizes. You may try some variation on this scheme...

    >
    >I really don't have the opportunity to categorize any of the documents;
    >Everything must be 100% automatic without human interference.


    Well, you may try matching limited-sized portions of the documents
    (after having converted them to pure text) against each other (I mean
    across documents, not within the *same* document) and averaging the
    result over each document.
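
    In code, very roughly - both the portion size and the per-portion
    score are up for grabs; I use shared four-character shingles here just
    to have something concrete:

        use strict;
        use warnings;
        use List::Util qw(max sum);

        my $PORTION = 1000;   # characters per portion; made-up value

        sub portions { my ($t) = @_; return $t =~ /(.{1,$PORTION})/sg }

        # Score two portions by the fraction of 4-character shingles of
        # the first that also occur in the second.
        sub portion_score {
            my ($a, $b) = @_;
            my (%sa, %sb);
            $sa{ substr $a, $_, 4 } = 1 for 0 .. length($a) - 4;
            $sb{ substr $b, $_, 4 } = 1 for 0 .. length($b) - 4;
            return 0 unless keys %sa;
            my $shared = grep { $sb{$_} } keys %sa;
            return $shared / keys(%sa);
        }

        # Best match for each portion of $x among portions of $y, averaged.
        sub doc_score {
            my ($x, $y) = @_;
            my @py = portions($y);
            return 0 unless @py;
            my @best;
            for my $p (portions($x)) {
                push @best, max(map { portion_score($p, $_) } @py);
            }
            return @best ? sum(@best) / @best : 0;
        }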


    Just my 2x10^-12 Eur,
    Michele
    --
    $\=q.,.,$_=q.print' ,\g,,( w,a'c'e'h,,map{$_-=qif/g/;chr
    }107..q[..117,q)[map+hex,split//,join' ,2B,, w$ECDF078D3'
    F9'5F3014$,$,];];$\.=$/,s,q,32,g,s,g,112,g,y,' , q,,eval;
     
    Michele Dondi, Apr 27, 2004
    #9
  10. Tore Aursand

    On Tue, 27 Apr 2004 23:12:40 +0200, Michele Dondi wrote:
    >>> I know that this may seem naive, but in a popular science magazine I
    >>> read that a paper has been published about a technique that indeed
    >>> identifies the (natural) language some documents are written in by
    >>> compressing (e.g. LZW) them along with some more text from samples
    >>> taken from a bunch of different languages and comparing the different
    >>> compressed sizes. You may try some variation on this scheme...


    >> I really don't have the opportunity to categorize any of the documents;
    >> Everything must be 100% automatic without human interference.


    > Well, you may try matching limited-sized portions of the documents
    > (after having converted them to pure text) against each other (I mean
    > across documents, not within the *same* document) and average the result
    > over a document.


    Because there will be _a lot_ of documents - where each document can be
    quite big - I have to keep in mind:

    * Processing power is limited, so the matching must be as light-
    weight as possible, but at the same time as good as possible. Yeah, I
    know how that sentence sounds. :)

    * Data storage is also limited; I can't store each document (and
    all its contents) in the database. I can only store meta data and
    data related to the task of finding related documents.

    The latter brings me to the point of extracting all the words from each
    document, removing single characters, stopwords and numbers, and then
    storing these words (and their frequencies) in a document/word-mapped
    data table. Quite simple, really.


    --
    Tore Aursand <>
    "To cease smoking is the easiset thing I ever did. I ought to know,
    I've done it a thousand times." (Mark Twain)
     
    Tore Aursand, Apr 28, 2004
    #10
  11. On Tue, 27 Apr 2004 00:37:29 +0200, Tore Aursand wrote:

    > On Fri, 23 Apr 2004 11:46:44 -0400, James Willmore wrote:

    [ ... ]

    >> There are Statistics modules as well. You could perform tests againist
    >> two documents and get a statistically correlation between the documents
    >> to see *how* similar they are.

    >
    > Hmm. Do you have any module names? A "brief search" didn't yield any
    > useful hits.
    >
    >> I'm rusty on Statistics 101, but my thinking is maybe using a t-test
    >> between the two documents might be the way to go.

    >
    > I don't even know what a "t-test" is, but googling for "t-test" may give
    > me the answer...? Or should I search for something else (specific)?


    A t-test is a statistical test. It compares the means of two groups
    relative to their variance - in other words, whether a difference you
    see could plausibly be down to chance. If you use Google, try
    searching for "t-test" (use the quotes to ensure the search gives the
    proper results). The first URL
    (http://trochim.human.cornell.edu/kb/stat_t.htm) gives a pretty good
    explanation.

    Sad thing is, I had a thought about where to go with this and lost the
    thought :-( I worked some crazy hours the last few days, as well as
    fighting off a cold - that may have something to do with it :-(

    There are Statistics modules available. Statistics::DependantTTest is one
    that will do t-tests for you. I'm not sure now if this is a good way to
    go.
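
    For the curious, the paired ("dependent") t statistic itself is simple
    enough to compute by hand - whether it's the right measure for word
    counts is another question. A sketch, no module required:

        use strict;
        use warnings;
        use List::Util qw(sum);

        # Paired t statistic for two equal-length lists of numbers
        # (e.g. counts of the same words in two documents).
        sub paired_t {
            my ($x, $y) = @_;
            my $n = @$x;
            die "need two equal-length samples\n" if $n < 2 || $n != @$y;
            my @d    = map { $x->[$_] - $y->[$_] } 0 .. $n - 1;
            my $mean = sum(@d) / $n;
            my $var  = sum(map { ($_ - $mean) ** 2 } @d) / ($n - 1);
            die "zero variance, t is undefined\n" if $var == 0;
            return $mean / sqrt($var / $n);
        }

        my @doc_a = (10, 3, 7, 1, 2);
        my @doc_b = ( 9, 2, 6, 1, 1);
        printf "t = %.3f (%d degrees of freedom)\n",
               paired_t(\@doc_a, \@doc_b), scalar(@doc_a) - 1;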

    Whichever way you go, I'd be interested in knowing. I had tried doing
    something similar some time back, but stopped the effort in favor of a
    pre-made solution.

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    Why did the Roman Empire collapse? What is the Latin for office
    automation?
     
    James Willmore, Apr 29, 2004
    #11

Similar Threads
  1. Document-Document similarity
     Fabian Leitritz, Jan 14, 2005, in forum: Java (Replies: 0, Views: 439)
  2. What are the similarity and difference b/w EJB and COM+?
     moop™, May 30, 2006, in forum: Java (Replies: 1, Views: 434; last reply: dimitar, May 30, 2006)
  3. String similarity
     Luca Montecchiani, Oct 10, 2003, in forum: Python (Replies: 0, Views: 570)
  4. (untitled)
     Replies: 1, Views: 369 (last reply: Roedy Green, Jan 7, 2008)
  5. (untitled)
     Ivan Shmakov (Replies: 3, Views: 1,228; last reply: Kari Hurtta, Feb 13, 2012)