Deduping quotations

Roedy Green · Nov 29, 2009

Have you ever noticed how the quotation websites have the same
quotations with tiny variations, or the same quote attributed to
several different authors? Sometimes there is a short and a long version
of the same quotation.

I was wondering how you might detect these.

I thought you might do it by converting all to lower case, stripping
punctuation and normalising white space to a single space.

Then you would remove common words.
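The clean-up steps so far could be sketched roughly as follows. This is only an illustration in Python: the function name `normalise` and the tiny `COMMON_WORDS` list are made up for the example, and a real stop-word list would be much longer.

```python
import re

# Hypothetical stop-word list, for illustration only.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "it", "that"}

def normalise(quote):
    """Lower-case, strip punctuation, collapse whitespace, drop common words."""
    text = quote.lower()
    # Remove every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # split() with no argument also collapses runs of whitespace.
    words = text.split()
    # Keep the remaining words in their original order.
    return [w for w in words if w not in COMMON_WORDS]

print(normalise("The  only thing we have to fear is... fear itself!"))
```

Returning a list rather than a set keeps the word order, which matters for the matching step. Note that stripping punctuation this way turns "don't" into "dont".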

Then you need to match, where order matters, but precise matching does
not. Just how would that work?

Arne Vajhøj · Nov 29, 2009

Roedy said:
> Have you ever noticed how the quotation websites have the same
> quotations with tiny variations, or the same quote attributed to
> several different authors? Sometimes there is a short and a long version
> of the same quotation.

It is common.

Poor quoting can easily spread such variations.

> I was wondering how you might detect these.
>
> I thought you might do it by converting all to lower case, stripping
> punctuation and normalising white space to a single space.
>
> Then you would remove common words.
>
> Then you need to match, where order matters, but precise matching does
> not. Just how would that work?

Maybe:
- only look at the very specific words
- convert those to a standard form
- test if all of those are present
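One possible reading of the three steps above, as a rough Python sketch. The names `specific_words` and `probably_same` and the `COMMON_WORDS` list are invented for the example, and "standard form" is taken here to mean nothing more than lower case with punctuation removed; stemming would be a further step.

```python
import re

# Hypothetical list of words too common to identify a quotation.
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "it",
                "that", "we", "have"}

def specific_words(quote):
    """Return the distinctive words of a quote in lower case, without punctuation."""
    words = re.sub(r"[^\w\s]", "", quote.lower()).split()
    return {w for w in words if w not in COMMON_WORDS}

def probably_same(quote_a, quote_b):
    """True if every distinctive word of the shorter quote occurs in the other."""
    a, b = specific_words(quote_a), specific_words(quote_b)
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    # An empty word set matches nothing, otherwise test for a subset.
    return bool(small) and small <= large

short = "The only thing we have to fear is fear itself."
long_ = ("So, first of all, let me assert my firm belief that the only "
         "thing we have to fear is fear itself.")
print(probably_same(short, long_))
```

Because it works on sets, this test ignores word order entirely. It does catch the short-versus-long case, since the shorter quote only has to be contained in the longer one.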

Arne

Tom Anderson · Nov 30, 2009

> Have you ever noticed how the quotation websites have the same
> quotations with tiny variations, or the same quote attributed to several
> different authors? Sometimes there is a short and a long version of the
> same quotation.
>
> I was wondering how you might detect these.
>
> I thought you might do it by converting all to lower case, stripping
> punctuation and normalising white space to a single space.

Then computing edit distances between all pairs of quotations:

http://en.wikipedia.org/wiki/Levenshtein_distance

And reporting those with distances below a certain threshold. I would
guess that for a database of a few hundred quotations, the analysis would
take under five minutes - probably under one minute, and probably a matter
of seconds.
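A home-made version of this might look something like the Python sketch below. The function names and the threshold value are made up for the example; the distance itself is the standard dynamic-programming Levenshtein algorithm, computed on characters and keeping only two rows of the table.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def near_duplicates(quotes, threshold):
    """Return (i, j, distance) for every pair with distance below threshold."""
    hits = []
    for (i, qa), (j, qb) in combinations(enumerate(quotes), 2):
        d = levenshtein(qa, qb)
        if d < threshold:
            hits.append((i, j, d))
    return hits

quotes = ["to be or not to be",
          "to be or not to bee",
          "i think therefore i am"]
print(near_duplicates(quotes, threshold=3))
```

For n quotations this makes n(n-1)/2 comparisons. A fixed threshold will not pair a short version with a much longer one, so dividing the distance by the length of the longer string may be worth trying.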

Lucene has an implementation of this algorithm, and I imagine it's a fast
one. If you weren't satisfied with the speed of your own implementation
(and it's really not difficult), you could try finding and using that.

tom
