fastest way for humongous regexp search?

Discussion in 'Python' started by Tim Arnold, Nov 1, 2004.

  1. Tim Arnold


    Hi,
    I've got a list of 1000 common misspellings, and I'd like to check a set of
    text files for those misspellings.
    I'm trying to figure out the fastest way to do it; here's what I'm doing now
    (below).

    I'm still learning Python, love it, and I'm pretty sure that what I'm doing
    is naive.

    Thanks for taking the time to look at this,
    --Tim
    ----------------------------------------------------------------------------
    (1) Create one humongous regexp, compile it and cPickle it. The regexp is
    like this:

    import re
    import cPickle

    misspelled = (
        '\\bjudgement\\b|' +
        '\\bjudgemental\\b|' +

        <snip><snip><snip>

        '\\bYorksire\\b|' +
        '\\bYoyages\\b')

    p = re.compile(misspelled, re.I)  # re.I: match case-insensitively
    f = open('misspell.pat', 'w')
    cPickle.dump(p, f)
    f.close()
    ----------------------------------------------------------------------------
    (2) Check the file(s), report the misspelling, the line number and the
    actual line of text.
    - only warns on multiple identical misspellings
    - using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
    didn't give correct results.
    - running on HP Unix, Python 2.2

    import cPickle

    f = open('misspell.pat', 'r')
    p = cPickle.load(f)
    f.close()

    a = open('myfile.txt').readlines()
    s = 'EtaOinShrdlu'.join(a)

    mistake = {}
    for mMatch in p.findall(s):
        # Test membership directly: a stored line index of 0 is falsy, so
        # mistake.get(mMatch, 0) would miss repeats of a first-line mistake.
        if mMatch in mistake:
            print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
        else:
            # Line index = number of markers before the first occurrence.
            mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))

    for k, v in mistake.items():
        print 'Misspelling: "%s" on line number %d' % (k, v + 1)
        print '%s \n' % a[v]
     
    Tim Arnold, Nov 1, 2004
    #1
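
One way to avoid both the hand-written alternation and the 'EtaOinShrdlu' marker is to build the pattern from a word list and scan the file one line at a time. What follows is only a minimal sketch, written in Python 3 syntax (an assumption; the post targets Python 2.2). The names `build_pattern` and `find_misspellings` and the sample data are hypothetical and not part of the original code.

```python
import re


def build_pattern(words):
    """Compile one case-insensitive pattern matching any listed word as a whole word."""
    # re.escape guards against words that contain regex metacharacters.
    alternation = "|".join(re.escape(word) for word in words)
    # A single pair of \b anchors around a non-capturing group replaces
    # the per-word \b...\b of the hand-written version.
    return re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)


def find_misspellings(lines, pattern):
    """Return (line_number, matched_text, line) tuples; line numbers start at 1."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample_words = ["judgement", "Yorksire"]  # hypothetical stand-in for the full list
    sample_lines = ["A snap judgement.\n", "Nothing here.\n", "Off to yorksire.\n"]
    for lineno, word, line in find_misspellings(sample_lines, build_pattern(sample_words)):
        print('Misspelling: "%s" on line number %d' % (word, lineno))
        print(line)
```

Because each match is found within its own line, the line number comes from `enumerate` rather than from counting markers in a joined string.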

  2. Tim Arnold wrote:

    > I've got a list of 1000 common misspellings, and I'd like to check a set of
    > text files for those misspellings.


    A much simpler way would be to just store these misspellings as a dictionary
    (or set), read and split each line into words, then check whether each
    of the words is in the set.

    Istvan
     
    Istvan Albert, Nov 1, 2004
    #2
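
The set-based approach suggested here might look roughly like the sketch below, which also keeps the line numbers the original question asks for. It is written in Python 3 syntax (an assumption), and `load_misspellings` and `check_lines` are hypothetical helper names.

```python
def load_misspellings(words):
    """Return the misspellings as a lowercase set for constant-time lookups."""
    return {word.lower() for word in words}


def check_lines(lines, misspellings):
    """Return (line_number, word) for every whitespace-separated word found in the set."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for word in line.split():
            # Compare in lowercase, but report the word as it appears in the text.
            if word.lower() in misspellings:
                hits.append((lineno, word))
    return hits


if __name__ == "__main__":
    known = load_misspellings(["Speling", "misteak"])  # hypothetical sample list
    for lineno, word in check_lines(["No errors here\n", "A Speling misteak\n"], known):
        print('Misspelling: "%s" on line number %d' % (word, lineno))
```

Each word costs one hash lookup, so the work grows with the size of the text rather than with the number of misspellings in the list.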

  3. Tim Arnold


    "Istvan Albert" <> wrote in message
    news:...
    > Tim Arnold wrote:
    >
    > > I've got a list of 1000 common misspellings, and I'd like to check a set
    > > of text files for those misspellings.
    >
    > A much simpler way would be to just store these misspellings as a
    > dictionary (or set), read and split each line into words, then check
    > whether each of the words is in the set.
    >
    > Istvan


    Thanks, I didn't know that would be faster.
    But I need to match against the misspellings in a case-insensitive
    way--that's the reason I'm using the regular expressions.

    --Tim
     
    Tim Arnold, Nov 2, 2004
    #3
  4. [Tim]
    > I've got a list of 1000 common misspellings, and I'd like to check a set
    > of text files for those misspellings.


    [Istvan]
    > A much simpler way would be to just store these misspellings as a
    > dictionary (or set), read and split each line into words, then check
    > whether each of the words is in the set.


    [Tim]
    > Thanks, I didn't know that would be faster.
    > But I need to match against the misspellings in a case-insensitive
    > way--that's the reason I'm using the regular expressions.


    Make the misspelling set lower case, and convert the list of words from
    the text file into lower case before comparing them:

    >>> from sets import Set
    >>> misspellings = Set(['speling', 'misteak'])
    >>> text = "Does this text contain any common speling mistakes?"
    >>> print [word for word in text.split() if word in misspellings]
    ['speling']

    --
    Richie Hindle
     
    Richie Hindle, Nov 2, 2004
    #4
  5. [me, with brain switched off]
    > Make the misspelling set lower case, and convert the list of words from
    > the text file into lower case before comparing them:


    Gah! That code should read:

    >>> from sets import Set
    >>> misspellings = Set(['speling', 'misteak'])
    >>> text = "Does this text contain any common Speling Mistakes?"
    >>> print [word for word in text.lower().split() if word in misspellings]
    ['speling']

    --
    Richie Hindle
     
    Richie Hindle, Nov 2, 2004
    #5
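
One caveat with splitting on whitespace: punctuation stays attached to the word, so "speling," or "misteak." would not be found in the set. A possible workaround, sketched here in Python 3 syntax (an assumption), is to pull the words out with a small regular expression instead. `WORD_RE` and `misspelled_words` are hypothetical names, and the pattern assumes words made of ASCII letters and apostrophes only.

```python
import re

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def misspelled_words(text, misspellings):
    """Return the lowercase words from text that appear in the misspellings set."""
    return [word for word in WORD_RE.findall(text.lower()) if word in misspellings]


if __name__ == "__main__":
    known = {"speling", "misteak"}
    print(misspelled_words("Common Speling, and one more misteak.", known))
```

The regular expression here only tokenizes the text; the lookup itself is still a plain set membership test.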
  6. > Thanks, I didn't know that would be faster.
    > But I need to match against the misspellings in a case-insensitive
    > way--that's the reason I'm using the regular expressions.


    Normalize them all to lowercase. That is still way faster.
    --
    Regards,

    Diez B. Roggisch
     
    Diez B. Roggisch, Nov 2, 2004
    #6
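
How much faster the set lookup is will depend on the word list and the text, so it may be worth measuring on real data. The harness below is a rough sketch in Python 3 syntax (an assumption), using a tiny hypothetical word list and a repeated sample sentence; the timings it prints say nothing definite about a 1000-word list.

```python
import re
import timeit

words = ["speling", "misteak", "judgement", "yorksire"]  # hypothetical sample list
text = "Does this text contain any common Speling Mistakes? " * 1000

pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, words)), re.IGNORECASE)
lookup = set(words)


def with_regex():
    """Find misspellings with one compiled, case-insensitive alternation."""
    return [match.lower() for match in pattern.findall(text)]


def with_set():
    """Find misspellings by lowercasing, splitting, and testing set membership."""
    return [word for word in text.lower().split() if word in lookup]


# Both approaches should agree on this sample before their timings are compared.
assert with_regex() == with_set()

print("regex:", timeit.timeit(with_regex, number=20))
print("set:  ", timeit.timeit(with_set, number=20))
```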