fastest way for humongous regexp search?

Tim Arnold · Nov 1, 2004

Hi,
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.
I'm trying to figure out the fastest way to do it; here's what I'm doing now
(below).

I'm still learning Python, love it, and I'm pretty sure that what I'm doing
is naive.

Thanks for taking the time to look at this,
--Tim
--------------------------------------------------------------------------------------
(1) Create one humongous regexp, compile it and cPickle it. The regexp is
like this:

import re
import cPickle

misspelled = (
'\\bjudgement\\b|' +
'\\bjudgemental\\b|' +

<snip><snip><snip>

'\\bYorksire\\b|' +
'\\bYoyages\\b')

p = re.compile(misspelled, re.I)
f = open('misspell.pat', 'w')
cPickle.dump(p,f)
f.close()
--------------------------------------------------------------------------------------
(2) Check the file(s), report the misspelling, the line number and the
actual line of text.
- only warns on multiple identical misspellings
- using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
didn't give correct results.
- running on HP Unix, Python 2.2

import cPickle

f = open('misspell.pat', 'r')
p = cPickle.load(f)
f.close()

a = open('myfile.txt').readlines()
s = 'EtaOinShrdlu'.join(a)

# Maps each misspelling to the 0-based line index of its first occurrence.
mistake = {}
for mMatch in p.findall(s):
    # Membership test, so a misspelling first seen on line 0 still counts as seen.
    if mistake.has_key(mMatch):
        print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
    else:
        mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))

for k, v in mistake.items():
    print 'Misspelling: "%s" on line number %d' % (k, v+1)
    print '%s \n' % a[v]
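
One way to avoid the 'EtaOinShrdlu' marker is to scan the file line by line, so that the line number comes from the loop itself. The following is only a minimal sketch, written in Python 3 syntax rather than the Python 2.2 mentioned above. The three-word list stands in for the full list of misspellings, and the names WORDS, PATTERN and find_misspellings are made up for illustration.

```python
import re

# Hypothetical short list standing in for the full list of misspellings.
WORDS = ["judgement", "judgemental", "Yorksire"]

# One alternation wrapped in a single pair of word boundaries;
# re.escape guards against regex metacharacters in the words.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in WORDS) + r")\b",
    re.IGNORECASE,
)

def find_misspellings(lines):
    """Return (line_number, matched_text, line) tuples, numbering from 1."""
    hits = []
    for number, line in enumerate(lines, 1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(0), line.rstrip("\n")))
    return hits

sample = ["A fine day.\n", "His Judgement was poor.\n", "We went to yorksire.\n"]
for number, word, line in find_misspellings(sample):
    print('Misspelling: "%s" on line number %d' % (word, number))
    print(line)
```

Because each line is searched separately, no join marker is needed and a repeated misspelling is reported with its own line number each time.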

Istvan Albert · Nov 1, 2004

Tim said:
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.

A much simpler way would be to just store these misspellings as a dictionary
(or set), read and split each line into words, then check whether each
of the words is in the set.

Istvan
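
A rough sketch of the set-based idea, in Python 3 syntax with the built-in set type (an assumption; the thread itself targets Python 2.2). The word list, the sample lines and the name check_lines are invented for illustration, and the lookup is case-sensitive as written.

```python
# Hypothetical misspellings; the real list would be loaded from a file.
misspellings = {"speling", "misteak"}

def check_lines(lines):
    """Yield (line_number, word) for each whitespace-separated word in the set."""
    for number, line in enumerate(lines, 1):
        for word in line.split():
            if word in misspellings:
                yield number, word

sample = ["no problems here", "one speling error", "a misteak and more speling"]
found = list(check_lines(sample))
print(found)
```

Each membership test is a single hash lookup, so the cost per word does not grow with the number of misspellings in the set.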

Tim Arnold · Nov 2, 2004

Istvan Albert said:
A much simpler way would be to just store these misspellings as a dictionary
(or set), read and split each line into words, then check whether each
of the words is in the set.

Istvan

Thanks, I didn't know that would be faster.
But I need to match against the misspellings in a case-insensitive
way--that's the reason I'm using the regular expressions.

--Tim

Richie Hindle · Nov 2, 2004

[Tim]

I've got a list of 1000 common misspellings, and I'd like to check a set
of text files for those misspellings.
[Istvan]
A much simpler way would be to just store these misspellings as a
dictionary (or set), read and split each line into words, then check
whether each of the words is in the set.
[Tim]
Thanks, I didn't know that would be faster.
But I need to match against the misspellings in a case-insensitive
way--that's the reason I'm using the regular expressions.

Make the misspelling set lower case, and convert the list of words from
the text file into lower case before comparing them:

from sets import Set
misspellings = Set(['speling', 'misteak'])
text = "Does this text contain any common speling mistakes?"
print [word for word in text.split() if word in misspellings]

['speling']

Richie Hindle · Nov 2, 2004

[me, with brain switched off]

Make the misspelling set lower case, and convert the list of words from
the text file into lower case before comparing them:

Gah! That code should read:

from sets import Set
misspellings = Set(['speling', 'misteak'])
text = "Does this text contain any common Speling Mistakes?"
print [word for word in text.lower().split() if word in misspellings]

['speling']
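
Note that split() leaves punctuation attached, so a word such as "misteak?" at the end of a sentence would not be found in the set. One possible variation, sketched here in Python 3 with the built-in set type (the sets module was removed in Python 3), pulls out runs of letters with a regular expression before the lookup. The tokenizing pattern and the names WORD_RE and misspelled_words are assumptions for illustration only.

```python
import re

misspellings = {"speling", "misteak"}

# Runs of ASCII letters and apostrophes; an assumption about what counts as a word.
WORD_RE = re.compile(r"[a-z']+")

def misspelled_words(text):
    """Return the lower-cased words of text that appear in the misspellings set."""
    return [w for w in WORD_RE.findall(text.lower()) if w in misspellings]

print(misspelled_words("Does this text contain a Speling Misteak?"))
```

The regular expression here only tokenizes; the matching against the 1000 words is still done by the set.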

Diez B. Roggisch · Nov 2, 2004

Thanks, I didn't know that would be faster.

But I need to match against the misspellings in a case-insensitive
way--that's the reason I'm using the regular expressions.

Normalize them all to lowercase. Still way faster.
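
Anyone wanting to check the speed difference on their own data could time both approaches with timeit. The sketch below uses Python 3 and entirely synthetic data (1000 made-up words and a repeated sentence), so the numbers it prints say nothing about real text files; the names by_regex and by_set are hypothetical.

```python
import re
import timeit

# Synthetic data, purely for illustration: 1000 made-up "misspellings".
words = ["wrongword%d" % i for i in range(1000)]
word_set = set(words)
pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)

text = " ".join(["some ordinary text with WrongWord42 in it"] * 200)

def by_regex():
    """Case-insensitive search with one large alternation."""
    return [m.lower() for m in pattern.findall(text)]

def by_set():
    """Lower-case the text, split on whitespace, look each word up in the set."""
    return [w for w in text.lower().split() if w in word_set]

# Both approaches should find the same words on this input.
assert by_regex() == by_set()

print("regex: %.4f s" % timeit.timeit(by_regex, number=5))
print("set:   %.4f s" % timeit.timeit(by_set, number=5))
```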
