C
Christoph Pingel
Hi all,
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.
Now I'm looking for the best strategy to match the appearence of my
thesaurus items in the texts. Do I have to build patterns from all my
thesaurus items for the expected misspellings, or is there a more
general approach possible? I tried to add '?.?' to each letter in a
thesaurus item, but this is much too weak (I get a lot of false
positives because this expression for example matches any possible
substring). Adding word boundries helps a little, but sometimes a
concept spans two words, and this could be spelled with a space char
*or* a dash. In this case, I'm back to matching almose everything.
Any ideas?
BTW, efficiency is not absolutely required, it's not meant to work
for realtime requests.
TIA,
best regards,
Christoph
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.
Now I'm looking for the best strategy to match the appearence of my
thesaurus items in the texts. Do I have to build patterns from all my
thesaurus items for the expected misspellings, or is there a more
general approach possible? I tried to add '?.?' to each letter in a
thesaurus item, but this is much too weak (I get a lot of false
positives because this expression for example matches any possible
substring). Adding word boundries helps a little, but sometimes a
concept spans two words, and this could be spelled with a space char
*or* a dash. In this case, I'm back to matching almose everything.
Any ideas?
BTW, efficiency is not absolutely required, it's not meant to work
for realtime requests.
TIA,
best regards,
Christoph