regex-strategy for finding *similar* words?


Christoph Pingel

Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Now I'm looking for the best strategy to match the appearance of my
thesaurus items in the texts. Do I have to build patterns from all my
thesaurus items for the expected misspellings, or is there a more
general approach possible? I tried adding '?.?' to each letter in a
thesaurus item, but this is much too weak (I get a lot of false
positives because this expression matches almost any possible
substring). Adding word boundaries helps a little, but sometimes a
concept spans two words, and this could be spelled with a space char
*or* a dash. In that case, I'm back to matching almost everything.
Any ideas?
BTW, efficiency is not absolutely required; this isn't meant to serve
real-time requests.

TIA,
best regards,
Christoph
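
To see why the "optional letter" idea over-matches, here is a hypothetical
reconstruction of the '?.?' pattern; the helper name lenient_pattern and the
sample term 'city' are made up for illustration:

import re

def lenient_pattern(term):
    # Hypothetical reconstruction of the "?.?" idea: every letter becomes
    # optional and may be followed by one arbitrary character.
    return ''.join(re.escape(ch) + '?.?' for ch in term)

pattern = lenient_pattern('city')
print(pattern)
# c?.?i?.?t?.?y?.?  -- every element is optional, so the pattern can even
# match the empty string and therefore fires almost anywhere:
print(bool(re.search(pattern, 'completely unrelated text')))   # True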
 

Peter Maas

Christoph said:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.

You could set up a list of misspelling cases, scan a word for them
(e.g. 'citti') and turn it into a regex by applying suitable misspelling
cases. But this is cumbersome. It is probably better to use a string
distance defined by the least number of operations (add, delete, replace,
exchange) needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/
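
If you just want to experiment before installing the linked module, a
minimal pure-Python sketch of the distance itself (plain dynamic
programming, nothing more) could look like this:

def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to turn string a into string b.
    prev = list(range(len(b) + 1))          # distance of "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance of a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein('citti', 'city'))   # 2 (one substitution, one deletion)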
 

Daniel Dittmar

Christoph said:
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.

There is the agrep project (http://www.tgries.de/agrep/), for which
Python bindings exist. agrep (= approximate grep) lets you specify the
number of allowed errors.

Daniel
 

Thomas Guettler

On Thu, 18 Nov 2004 13:20:08 +0100, Christoph Pingel wrote:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Hi,

You can write a method that takes a single word
and returns a normalized version.

normalize("...ies") --> "...y"

normalize("running") --> "run"

Build a big dictionary that maps each normalized word to the list of
files in which it occurs. Only add normalized words to the dictionary
(or database).

bigdict = {"foo": ["file1.txt", "file2.txt", ...]}

HTH,
Thomas
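
A minimal sketch of this idea; the normalize() rules below are purely
illustrative toy rules (a real stemmer would do a much better job), and
build_index() is a made-up helper name:

import collections
import re

def normalize(word):
    # Toy normalizer, just enough to illustrate the idea:
    # "cities" -> "city", "running" -> "run"
    word = word.lower()
    if word.endswith('ies'):
        return word[:-3] + 'y'
    if word.endswith('ing') and len(word) > 4:
        stem = word[:-3]
        return stem[:-1] if len(stem) > 2 and stem[-1] == stem[-2] else stem
    return word

def build_index(filenames):
    # Maps each normalized word to the list of files it occurs in,
    # e.g. {"foo": ["file1.txt", "file2.txt", ...]}
    bigdict = collections.defaultdict(list)
    for name in filenames:
        with open(name, encoding='utf-8') as f:
            for word in re.findall(r'[A-Za-z]+', f.read()):
                key = normalize(word)
                if name not in bigdict[key]:
                    bigdict[key].append(name)
    return dict(bigdict)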
 

Christoph Pingel

Thank you, Levenshtein distance and agrep both sound interesting;
I think I'll try both.

best,
Christoph
 

John Machin

Peter Maas said:
You could set up a list of misspelling cases, scan a word for them
(e.g. 'citti') and turn it into a regex by applying suitable misspelling
cases. But this is cumbersome. It is probably better to use a string
distance defined by the least number of operations (add, delete, replace,
exchange) needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/

Forget regexes. A string distance function is a helpful start. Firstly
you need a method of parsing your input text into words or phrases.
Secondly you need a fuzzy dictionary into which you load your
thesaurus. This fuzzy dictionary will have a method like
fuzzy_dict.approx_lookup('mispeled', 2) which would return a list of
words in the dictionary that have a (say) Levenshtein distance of 2 or
less from your input word. For algorithms for fuzzy dictionaries,
google for "Burkhard Keller tree" and "permuted lexicon".

Alternatively you could look at a spelling checker with publicly
available source and grab some ideas or even code. You might also be
interested in string distance measures other than Levenshtein distance,
e.g. the so-called Damerau distance, which also counts transpositions.

There are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

HTH,
John
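
A minimal sketch of such a fuzzy dictionary using a Burkhard-Keller tree;
the class and method names (BKTree, approx_lookup) follow the example
above but are otherwise made up, and the distance function is the plain
dynamic-programming Levenshtein:

def levenshtein(a, b):
    # Plain dynamic-programming edit distance (kept here so the sketch
    # is self-contained).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    # Fuzzy dictionary: stores words so that all entries within a given
    # edit distance of a query can be found without scanning everything.

    def __init__(self, words=()):
        self.root = None                  # node = [word, {distance: child}]
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                    # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def approx_lookup(self, word, max_dist):
        if self.root is None:
            return []
        hits, stack = [], [self.root]
        while stack:
            entry, children = stack.pop()
            d = levenshtein(word, entry)
            if d <= max_dist:
                hits.append((d, entry))
            # Triangle inequality: only subtrees whose edge label lies
            # within max_dist of d can contain further candidates.
            stack.extend(child for k, child in children.items()
                         if d - max_dist <= k <= d + max_dist)
        return sorted(hits)

fuzzy_dict = BKTree(['misspelled', 'thesaurus', 'distance'])
print(fuzzy_dict.approx_lookup('mispeled', 2))   # [(2, 'misspelled')]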
 

Christoph Pingel

John Machin said:
There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

John,

thanks for the rich input!
I found the Hyyro/Navarro algorithm, and it looks interesting.
First I'll give Python's own difflib.SequenceMatcher class a try --
the project here doesn't necessarily need the fastest algorithm, but
coding time is an issue. I found that a match ratio slightly above
0.85 captures most of my misspelling cases without producing errors.

best,
Christoph
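
For reference, a minimal sketch of that difflib approach; the helper name
thesaurus_hits is made up, and the 0.85 cutoff is the value mentioned in
the thread:

import difflib

def thesaurus_hits(term, words, cutoff=0.85):
    # SequenceMatcher caches information about seq2, so keep the fixed
    # thesaurus term there and vary the candidate words as seq1.
    sm = difflib.SequenceMatcher()
    sm.set_seq2(term.lower())
    hits = []
    for w in words:
        sm.set_seq1(w.lower())
        ratio = sm.ratio()
        if ratio >= cutoff:
            hits.append((ratio, w))
    return sorted(hits, reverse=True)

print(thesaurus_hits('thesaurus', ['thesarus', 'trousers', 'thesaurus']))
# keeps the exact and the misspelled form (ratios 1.0 and about 0.94),
# drops 'trousers'

difflib.get_close_matches() offers the same ratio-based matching in a
single call if a plain list of best candidates is enough.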
 
