regex-strategy for finding *similar* words?


Christoph Pingel

Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Now I'm looking for the best strategy to match the appearance of my
thesaurus items in the texts. Do I have to build patterns from all my
thesaurus items for the expected misspellings, or is there a more
general approach possible? I tried adding '?.?' to each letter in a
thesaurus item, but this is much too weak (I get a lot of false
positives because this expression matches almost any possible
substring). Adding word boundaries helps a little, but sometimes a
concept spans two words, and this could be spelled with a space char
*or* a dash. In that case, I'm back to matching almost everything.
Any ideas?
BTW, efficiency is not absolutely required; this isn't meant to serve
real-time requests.

TIA,
best regards,
Christoph
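
To see why the "optional letter" idea over-matches, here is a hypothetical
reconstruction of the '?.?' pattern; the helper name lenient_pattern and the
sample term 'city' are made up for illustration:

import re

def lenient_pattern(term):
    # Hypothetical reconstruction of the "?.?" idea: every letter becomes
    # optional and may be followed by one arbitrary character.
    return ''.join(re.escape(ch) + '?.?' for ch in term)

pattern = lenient_pattern('city')
print(pattern)
# c?.?i?.?t?.?y?.?  -- every element is optional, so the pattern can even
# match the empty string and therefore fires almost anywhere:
print(bool(re.search(pattern, 'completely unrelated text')))   # True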
 

Peter Maas

Christoph said:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.

You could set up a list of misspelling cases, scan a word for them
(e.g. 'citti') and turn it into a regex by applying suitable misspelling
cases. But this is cumbersome. It is probably better to use a string
distance defined by the least number of operations (add, delete, replace,
exchange) needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/
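
If you just want to experiment before installing the linked module, a
minimal pure-Python sketch of the distance itself (plain dynamic
programming, nothing more) could look like this:

def levenshtein(a, b):
    # Minimum number of single-character insertions, deletions and
    # substitutions needed to turn string a into string b.
    prev = list(range(len(b) + 1))          # distance of "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance of a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein('citti', 'city'))   # 2 (one substitution, one deletion)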
 

Daniel Dittmar

Christoph said:
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.

There is the agrep project (http://www.tgries.de/agrep/), for which
Python bindings exist. agrep (= approximate grep) lets you specify the
number of allowed errors.

Daniel
 

Thomas Guettler

On Thu, 18 Nov 2004 13:20:08 +0100, Christoph Pingel wrote:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Hi,

You can write a method that takes a single word
and returns a normalized version.

normalize("...ies") --> "...y"

normalize("running") --> "run"

Build a big dictionary that maps each normalized word to the list of
files in which it occurs. Only add normalized words to the dictionary
(or database).

bigdict = {"foo": ["file1.txt", "file2.txt", ...]}

HTH,
Thomas
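
A minimal sketch of this idea; the normalize() rules below are purely
illustrative toy rules (a real stemmer would do a much better job), and
build_index() is a made-up helper name:

import collections
import re

def normalize(word):
    # Toy normalizer, just enough to illustrate the idea:
    # "cities" -> "city", "running" -> "run"
    word = word.lower()
    if word.endswith('ies'):
        return word[:-3] + 'y'
    if word.endswith('ing') and len(word) > 4:
        stem = word[:-3]
        return stem[:-1] if len(stem) > 2 and stem[-1] == stem[-2] else stem
    return word

def build_index(filenames):
    # Maps each normalized word to the list of files it occurs in,
    # e.g. {"foo": ["file1.txt", "file2.txt", ...]}
    bigdict = collections.defaultdict(list)
    for name in filenames:
        with open(name, encoding='utf-8') as f:
            for word in re.findall(r'[A-Za-z]+', f.read()):
                key = normalize(word)
                if name not in bigdict[key]:
                    bigdict[key].append(name)
    return dict(bigdict)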
 

Christoph Pingel

Thank you, Levenshtein distance and agrep both sound interesting;
I think I'll try both.

best,
Christoph
 

John Machin

Peter Maas said:
You could set up a list of misspelling cases, scan a word for them
(e.g. 'citti') and turn it into a regex by applying suitable misspelling
cases. But this is cumbersome. It is probably better to use a string
distance defined by the least number of operations (add, delete, replace,
exchange) needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/

Forget regexes. A string distance function is a helpful start. Firstly
you need a method of parsing your input text into words or phrases.
Secondly you need a fuzzy dictionary into which you load your
thesaurus. This fuzzy dictionary will have a method like
fuzzy_dict.approx_lookup('mispeled', 2) which would return a list of
words in the dictionary that have a (say) Levenshtein distance of 2 or
less from your input word. For algorithms for fuzzy dictionaries,
google for "Burkhard Keller tree" and "permuted lexicon".

Alternatively you could look at a spelling checker with publicly
available source and grab some ideas or even code. You might also be
interested in string distance measures other than Levenshtein distance,
e.g. the so-called Damerau distance, which also counts transpositions.

There are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

HTH,
John
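
A minimal sketch of such a fuzzy dictionary using a Burkhard-Keller tree;
the class and method names (BKTree, approx_lookup) follow the example
above but are otherwise made up, and the distance function is the plain
dynamic-programming Levenshtein:

def levenshtein(a, b):
    # Plain dynamic-programming edit distance (kept here so the sketch
    # is self-contained).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    # Fuzzy dictionary: stores words so that all entries within a given
    # edit distance of a query can be found without scanning everything.

    def __init__(self, words=()):
        self.root = None                  # node = [word, {distance: child}]
        for w in words:
            self.add(w)

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                    # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def approx_lookup(self, word, max_dist):
        if self.root is None:
            return []
        hits, stack = [], [self.root]
        while stack:
            entry, children = stack.pop()
            d = levenshtein(word, entry)
            if d <= max_dist:
                hits.append((d, entry))
            # Triangle inequality: only subtrees whose edge label lies
            # within max_dist of d can contain further candidates.
            stack.extend(child for k, child in children.items()
                         if d - max_dist <= k <= d + max_dist)
        return sorted(hits)

fuzzy_dict = BKTree(['misspelled', 'thesaurus', 'distance'])
print(fuzzy_dict.approx_lookup('mispeled', 2))   # [(2, 'misspelled')]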
 

Christoph Pingel

John Machin said:
There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

John,

thanks for the rich input!
I found the Hyyro/Navarro algorithm, and it looks interesting.
First I'll give Python's own difflib.SequenceMatcher class a try --
the project here doesn't necessarily need the fastest algorithm, but
coding time is an issue. I found that a match ratio slightly above
0.85 captures most of my misspelling cases without producing errors.

best,
Christoph
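
For reference, a minimal sketch of that difflib approach; the helper name
thesaurus_hits is made up, and the 0.85 cutoff is the value mentioned in
the thread:

import difflib

def thesaurus_hits(term, words, cutoff=0.85):
    # SequenceMatcher caches information about seq2, so keep the fixed
    # thesaurus term there and vary the candidate words as seq1.
    sm = difflib.SequenceMatcher()
    sm.set_seq2(term.lower())
    hits = []
    for w in words:
        sm.set_seq1(w.lower())
        ratio = sm.ratio()
        if ratio >= cutoff:
            hits.append((ratio, w))
    return sorted(hits, reverse=True)

print(thesaurus_hits('thesaurus', ['thesarus', 'trousers', 'thesaurus']))
# keeps the exact and the misspelled form (ratios 1.0 and about 0.94),
# drops 'trousers'

difflib.get_close_matches() offers the same ratio-based matching in a
single call if a plain list of best candidates is enough.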
 
