General matimatical approach to..

I

IchBin

Info setup: I have a private project for myself because I like to
collect quotes. That is, people quotes, sayings or proverbs. I have
amassed currently 21,631 quotes over 1,255 authors of them. I am
selective in what I want. Each author has a quick biography and link to
Internet for a more complete biography at Wikipedia for further research
if I have the urge. Each quote has a possible reference and or comment.
I have a GUI with Trees with N-Tabs.. and so on.

On one tab I have a program that can be used to check for duplicates
(all or an author) that have gotten thru its initial insert duplication
check. I basically have a spinner to indicate the number of chars from 0
in the quotes that I check for exact duplication and then display them.
Then I can determine if they are true duplicates in the rest of the
quote or if the phraseology just a little different. Then do what may.

Current it works well but I feel that there must be a better
mathematical model to indicate the provability of duplication. So I can
look at only at some percent of that provability. Naturally, a quote can
be a duplicate and not have the same sequence of tokens on the quote or
even more abstract. I don't think you can ever be 100% correct in a
logic comparison because of the Language mechanics. Based on the
similarity of all tokens in the sting I think I can at least predict
what quote should be manually checked. This way there is no need to
enter any parameters to check for duplication.

I do not have a math degree but did have to take a lot a statistics
because of my degree. I have been in the computer field since collage
so my statistics is rusty. I think it can be done better this way but
can not put my finger on it. I guess I could pull out my old stat books.

I am just asking the math people if they could give me some insights or
better way ala mathematical algorithm... For the computer science side
of the house I guess it would definitional involve a tree parse at a
minimum.

Hope my question is understandable. There is no rush . I plan to at
least put the database on my site and then the all of the supporting
programs and docs. You never know their maybe someone out there with the
same interests to have a GUI to do all of that kinda stuff. Else I did
write it for myself anyway.

Thanks in Advance...
IchBin, Pocono Lake, Pa, USA
http://weconsultants.servebeer.com/JHackerAppManager
__________________________________________________________________________

'If there is one, Knowledge is the "Fountain of Youth"'
-William E. Taylor, Regular Guy (1952-)
 
R

Roedy Green

Current it works well but I feel that there must be a better
mathematical model to indicate the provability of duplication.

The biggest problem you face is the length of the quote. Sometimes
people just quote a small section of a larger quote.

So I would do is break each quote into sentences or phrases and check
each phrase for dups separately.

Next you want uniform spelling. So you want to run these through a
spellchecker before you start.

For a quick check, just do a HashMap lookup of each phrase and dump
out ones that have dups. Convert all to lower case first.

For a more elaborate check prepare a dictionary of all words used in
your quote collection. Sort alphabetically, converting everything to
lower case first. Calculate word frequency as you discard dups. Toss
the common words. Now from each quote, create a key of the
"interesting" words in alphabetical order. Check for dups with a
HashMap as before.

To get authors in standard form, print out a list of all authors
discarding dups sorting in alpha order putting the frequency on each
name. . Eyeball the list and create a replace script to convert odd
balls into the standard form. Focus your attention on the lines with
only 1 entry. Use the most common spelling as your standard.
In the old days, you did this with punch cards, putting the odd balls
behind the standard to create the replace script. I wrote a mouse
version of this for DOS many years ago.
 
A

Alex Hunsley

IchBin said:
Info setup: I have a private project for myself because I like to
collect quotes. That is, people quotes, sayings or proverbs. I have
amassed currently 21,631 quotes over 1,255 authors of them. I am
selective in what I want. Each author has a quick biography and link to
Internet for a more complete biography at Wikipedia for further research
if I have the urge. Each quote has a possible reference and or comment.
I have a GUI with Trees with N-Tabs.. and so on.

On one tab I have a program that can be used to check for duplicates
(all or an author) that have gotten thru its initial insert duplication
check. I basically have a spinner to indicate the number of chars from 0
in the quotes that I check for exact duplication and then display them.
Then I can determine if they are true duplicates in the rest of the
quote or if the phraseology just a little different. Then do what may.

Current it works well but I feel that there must be a better
mathematical model to indicate the provability of duplication.

I'm assuming you mean 'probability' here....
Yup, there are better ways. You can certainly cook up an algorithm that,
given two quotes, gives some sort of measure of similarity.

So I can
look at only at some percent of that provability. Naturally, a quote can
be a duplicate and not have the same sequence of tokens on the quote or
even more abstract. I don't think you can ever be 100% correct in a
logic comparison because of the Language mechanics.
Yup.

Based on the
similarity of all tokens in the sting I think I can at least predict
what quote should be manually checked. This way there is no need to
enter any parameters to check for duplication.

I can think of ways to do this, but first of all, can you be a bit more
specific about the sorts of ways two quotes, essentially the same, may
differ?
The sorts of things I'm thinking of are: a word might be mis-spelt.
Words may come in a different order. Sentences themselves may come in a
different order? Presumably punctuation is to be ignored.
Can you post a few examples of how a single quote might differ?

As well as having a system that gives a similarity score for two given
quotes, you may want to consider having some sort of system that reduces
the amount of comparisons you have to do.
Otherwise, given N quotes, the naive straightfoward way of comparing
every quote to every other would require N!/(N-2)!2 = N(N-1)/2
comparisons. e.g. for N=1000, you'd have 1000*999/2 = 499,500
comparisons. You can avoid doing such huge amounts of comparisons by
grouping your quotes by some sort of hashing function (that relates to
some measure of a quotes contents).
I wouldn't worry about this yet though (if at all).
I do not have a math degree but did have to take a lot a statistics
because of my degree. I have been in the computer field since collage so
my statistics is rusty. I think it can be done better this way but can
not put my finger on it. I guess I could pull out my old stat books.

Hold back on that stats idea for the moment... I think you can do this
without having to resort to any heavy stats.



[Random end note for the mathematically minded:
compare the expression for number of comparisons for N items - N(N-1)/2
- with Guass' expression for the sum of the first N integers - (N+1)N/2.]
 
T

Tom Leylan

You're basically looking for near duplicates. Could you do this with SQL
select statements? You enter the quote, significant words are isolated and
a list of the quotes with matches are generated. You could determine which
words (and how many) are significant by simply marking them as you enter the
quotation.

For example you could enter the following (note I prefaced the keywords with
%). "A %thing is not %necessarily true because %badly %uttered, nor false
because spoken %magnificently.", "Saint Augustine"

The entry routine would pick out the words I marked (and remove the markings
before saving) and search the database for quotes that contained these 5
words. You'd get a list of all quotes you had with these 5 words and could
decide to save it or discard it.

Tom
 
S

Simon

If you can afford time complexity quote1.length()*quote2.length() for pairwise
comparisons, you can try to Google for "edit distance" or "Levenshtein
Distance". That way you would also find duplicates due to spelling errors, which
are probably most common.

I suspect you would miss typos if you use hash-based methods, dictionaries or
similar approaches. Be very careful if you list terms in alphabetical order to
compare them since this is very unstable if words are missing or misspelt.

Also, in order to use the dictionary approach, and "toss the common words" as
Roedy said, you need to know what "common words" are. Maybe, have a look at
"tf-idf" (term frequency–inverse document frequency) for that. Then you can also
use data mining techniques to find clusters in your text corpus. Mayb e do a
search for "text mining".
 
G

Gordon Beaton

you need to know what "common words" are.

It might help to know that the accepted term for these is "stop
words", and there are lists available online. Use Google.

/gordon
 
M

Michele Ouellet

On one tab I have a program that can be used to check for duplicates
(all or an author) that have gotten thru its initial insert duplication
check. I basically have a spinner to indicate the number of chars from 0
in the quotes that I check for exact duplication and then display them.
Then I can determine if they are true duplicates in the rest of the
quote or if the phraseology just a little different. Then do what may.

The mathematics for decent string comparison were worked out in Information
Retrieval ages ago... Why not just use a search engine such as Lucene to:
1) Index your quotes
2) Issue the new quote as a query against this index.

This will return the nearest matches.

Lucene is excellent: fast and open-source.
http://lucene.apache.org/java/docs/index.html

Enjoy!

Michèle Ouellet
Stelvio Inc.
 
O

Oliver Wong

IchBin said:
Info setup: I have a private project for myself because I like to collect
quotes. That is, people quotes, sayings or proverbs. I have amassed
currently 21,631 quotes over 1,255 authors of them. [...]

On one tab I have a program that can be used to check for duplicates (all
or an author) that have gotten thru its initial insert duplication check.
I basically have a spinner to indicate the number of chars from 0 in the
quotes that I check for exact duplication and then display them. Then I
can determine if they are true duplicates in the rest of the quote or if
the phraseology just a little different. Then do what may.

Current it works well but I feel that there must be a better mathematical
model to indicate the provability of duplication. So I can look at only at
some percent of that provability. Naturally, a quote can be a duplicate
and not have the same sequence of tokens on the quote or even more
abstract. I don't think you can ever be 100% correct in a logic comparison
because of the Language mechanics. Based on the similarity of all tokens
in the sting I think I can at least predict what quote should be manually
checked. This way there is no need to enter any parameters to check for
duplication.

Slightly OT, as I don't have any answers beyond what others have
provided for the OP, but...

has anyone tried to actually try to come up with some sort of
intepretation of the quotes to see if they are "basically saying the same
thing"? This is mostly an issue when the author of the quote originally said
something in one language, and what they said was translated by different
people, and these translations are quoted.

For example, the translations "I disapprove of what you say, but I will
defend to the death your right to say it." and "Think for yourselves and let
others enjoy the privilege to do so too.", are believed to be referring to
the same original statement that Voltair had made in French; or similarly
"The ends justify the means." and "One must consider the final result." by
Nicolo Machiavelli.

The two "pairs of quotes" I mentioned above would probably be very
difficult to detect as being "duplicates" using merely character-based or
word-based analysis since they share almost zero words in common.

I'm thinking perhaps a parse tree could be built up of the English
sentences (assuming we have the technology to classify words as acting as
nouns, adjectives, verbs, etc.) and then perhaps using some sort of semantic
webs to group words with similar meanings together. Then take two parse
trees, and measure their "distance in meaning". Something like that. I
haven't done much research in this area, so I don't know how feasible this
is, or what's been done yet.

- Oliver
 
R

Roedy Green

The mathematics for decent string comparison were worked out in Information
Retrieval ages ago... Why not just use a search engine such as Lucene to:
1) Index your quotes
2) Issue the new quote as a query against this index.

This will return the nearest matches.

Lucene is excellent: fast and open-source.
http://lucene.apache.org/java/docs/index.html

Set say he did that, he still has to manually sit there and decide if
there are dups for every quote. He need something to find the most
likely dups for him, perhaps ordering them by likelihood, not the dups
for any particular quote, but the duplicates in his entire collection.
 
R

Roedy Green

I'm thinking perhaps a parse tree could be built up of the English
sentences (assuming we have the technology to classify words as acting as
nouns, adjectives, verbs, etc.) and then perhaps using some sort of semantic
webs to group words with similar meanings together. Then take two parse
trees, and measure their "distance in meaning". Something like that. I
haven't done much research in this area, so I don't know how feasible this
is, or what's been done yet.

you might be able to do it with a thesaurus to classify words into
meaning clusters.
 
I

IchBin

Roedy said:
you might be able to do it with a thesaurus to classify words into
meaning clusters.

Thanks so far for all your input. I am just reading them. I have to sit
down and get all of the visuals mentioned so far together. You can
appreciate the complexity based on to the degree one would want
perfection or degree of masochism based of the require medication...LOL

I have thought about all of the mentioned problems associated with the
comparison of a sentence and it abstraction that it may or may not
portray. I just need to think about it for a tad longer with all of your
suggestions. I am not tying to write some
grammatical\language\abstraction AI program. Just a little self defined
project I have taken on because of my interest. But then I guess you
have to lean that way.

Thanks in Advance...
IchBin, Pocono Lake, Pa, USA
http://weconsultants.servebeer.com/JHackerAppManager
__________________________________________________________________________

'If there is one, Knowledge is the "Fountain of Youth"'
-William E. Taylor, Regular Guy (1952-)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,578
Members
45,052
Latest member
LucyCarper

Latest Threads

Top