Deduping quotations

Discussion in 'Java' started by Roedy Green, Nov 30, 2009.

  1. Roedy Green

    Roedy Green Guest

    Have you ever noticed how the quotation websites have the same
    quotations with tiny variations, or the same quote attributed to
    several different authors? Sometimes there is a short and a long
    version of the same quotation.

    I was wondering how you might detect these.


    I thought you might do it by converting all to lower case, stripping
    punctuation and normalising white space to a single space.
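
    A minimal sketch of that normalisation step, in Java. The class and
    method names are made up for illustration, and dropping apostrophes
    outright (so "it's" becomes "its") is an assumption, not something
    the post specifies.

```java
import java.util.Locale;

// Hypothetical helper: names are illustrative, not from the thread.
final class QuoteNormalizer {
    private QuoteNormalizer() {}

    /** Lower-cases, strips punctuation, and collapses whitespace to single spaces. */
    static String normalize(String quote) {
        String lower = quote.toLowerCase(Locale.ROOT);
        // Assumption: drop apostrophes so "it's" and "its" compare equal.
        String noApostrophes = lower.replace("'", "").replace("\u2019", "");
        // Replace anything that is not a letter, digit, or whitespace with a space.
        String noPunctuation = noApostrophes.replaceAll("[^\\p{L}\\p{N}\\s]+", " ");
        // Collapse runs of whitespace and trim the ends.
        return noPunctuation.replaceAll("\\s+", " ").trim();
    }
}
```

    For example, normalize("It's  a PROOF, not a half-proof!") gives
    "its a proof not a half proof".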

    Then you would remove common words.

    Then you need to match, where order matters, but precise matching does
    not. Just how would that work?
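
    One possible reading of "order matters, but precise matching does
    not" is a longest common subsequence over the remaining words. The
    sketch below assumes its input has already been lower-cased and
    stripped of punctuation. The tiny stop list, the class name, and the
    choice to divide by the shorter quote's length (so a short version
    can match a long one) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: names and the stop list are illustrative only.
final class WordSequenceMatcher {
    // Tiny sample stop list; a real one would be much longer.
    static final Set<String> COMMON_WORDS =
            Set.of("a", "an", "the", "of", "to", "and", "in", "is", "it", "that");

    private WordSequenceMatcher() {}

    /** Splits an already-normalised quote on spaces and drops common words. */
    static List<String> contentWords(String normalized) {
        List<String> words = new ArrayList<>();
        for (String w : normalized.split(" ")) {
            if (!w.isEmpty() && !COMMON_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    /** Length of the longest common subsequence of two word lists. */
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Fraction of the shorter quote's words found, in order, in the other. */
    static double similarity(List<String> a, List<String> b) {
        int shorter = Math.min(a.size(), b.size());
        return shorter == 0 ? 0.0 : (double) lcsLength(a, b) / shorter;
    }
}
```

    Two quotations whose similarity is above some cut-off (say 0.8, to be
    tuned by hand) would be flagged as probable duplicates. Words in a
    different order score low, while extra or missing words only cost a
    little.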


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I mean the word proof not in the sense of the lawyers, who set two half proofs equal to a whole one, but in the sense of a mathematician, where half proof = 0, and it is demanded for proof that every doubt becomes impossible.
    ~ Carl Friedrich Gauss
    Roedy Green, Nov 30, 2009
    #1

  2. Arne Vajhøj

    Arne Vajhøj Guest

    Roedy Green wrote:
    > Have you ever noticed how the quotation websites have the same
    > quotations with tiny variations, or the same quote attributed to
    > several different authors? Sometimes there is a short and a long
    > version of the same quotation.


    It is common.

    Poor quoting can easily spread such variations.

    > I was wondering how you might detect these.
    >
    > I thought you might do it by converting all to lower case, stripping
    > punctuation and normalising white space to a single space.
    >
    > Then you would remove common words.
    >
    > Then you need to match, where order matters, but precise matching does
    > not. Just how would that work?


    Maybe:
    - only look at the very specific words
    - convert those to a standard form
    - test if all of those are present
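
    A rough sketch of those three steps in Java. Everything concrete
    here is an assumption: "very specific" is approximated by word length
    (6 letters or more), the "standard form" is just lower-casing plus
    dropping a trailing "s", and the class name is invented. A real
    version would more likely pick rare words by corpus frequency and use
    a proper stemmer.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: the heuristics below are placeholders, not a recommendation.
final class KeywordMatcher {
    // Assumption: longer words are treated as the "very specific" ones.
    static final int MIN_LENGTH = 6;

    private KeywordMatcher() {}

    /** Crude standard form: lower case, with a trailing "s" dropped. */
    static String standardForm(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Extracts the standardised specific words of a quotation. */
    static Set<String> keywords(String quote) {
        Set<String> result = new HashSet<>();
        for (String w : quote.split("[^\\p{L}]+")) {
            if (w.length() >= MIN_LENGTH) {
                result.add(standardForm(w));
            }
        }
        return result;
    }

    /** True if every keyword of the quote with fewer keywords appears in the other. */
    static boolean sameQuote(String a, String b) {
        Set<String> ka = keywords(a);
        Set<String> kb = keywords(b);
        if (ka.isEmpty() || kb.isEmpty()) {
            return false; // nothing specific to compare
        }
        return ka.size() <= kb.size() ? kb.containsAll(ka) : ka.containsAll(kb);
    }
}
```

    Testing containment in the smaller keyword set lets a short version
    match a long version of the same quotation. Note that this variant
    ignores word order entirely.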

    Arne
    Arne Vajhøj, Nov 30, 2009
    #2

  3. Tom Anderson

    Tom Anderson Guest

    On Sun, 29 Nov 2009, Roedy Green wrote:

    > Have you ever noticed how the quotation websites have the same
    > quotations with tiny variations, or the same quote attributed to several
    > different authors? Sometimes there is a short and a long version of the
    > same quotation.
    >
    > I was wondering how you might detect these.


    > I thought you might do it by converting all to lower case, stripping
    > punctuation and normalising white space to a single space.


    Then computing edit distances between all pairs of quotations:

    http://en.wikipedia.org/wiki/Levenshtein_distance

    And reporting those with distances below a certain threshold. I would
    guess that for a database of a few hundred quotations, the analysis would
    take under five minutes - probably under one minute, and probably a matter
    of seconds.

    Lucene has an implementation of this algorithm, and i imagine it's a fast
    one. If you weren't satisfied with the speed of your own implementation
    (and it's really not difficult), you could try finding and using that.
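
    For reference, a home-grown version might look something like the
    sketch below: the standard two-row dynamic programming form of
    Levenshtein distance, plus an all-pairs loop that reports index pairs
    whose distance is below a threshold. The class and method names are
    invented, the threshold is whatever you tune it to, and the inputs are
    assumed to be normalised already.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a plain sketch, not Lucene's implementation.
final class EditDistanceDeduper {
    private EditDistanceDeduper() {}

    /** Levenshtein distance using two rolling rows of the DP table. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Returns index pairs {i, j}, i < j, whose distance is below the threshold. */
    static List<int[]> nearDuplicates(List<String> quotes, int threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < quotes.size(); i++) {
            for (int j = i + 1; j < quotes.size(); j++) {
                if (levenshtein(quotes.get(i), quotes.get(j)) < threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }
}
```

    The all-pairs loop does n(n-1)/2 comparisons, so a few hundred
    quotations means tens of thousands of distance computations. A fixed
    threshold also treats short and long quotations alike; scaling it by
    quotation length is one obvious refinement.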

    tom

    --
    No, Charlie, Tottenham Court Road is the Midlands. -- Lola, 'Kinky Boots'
    Tom Anderson, Nov 30, 2009
    #3
