Display context snippet for search phrase match optimisation request

Discussion in 'Python' started by Follower, Oct 13, 2004.

    Hi,

    I am working on a function to return extracts from a text document
    with a specific phrase highlighted (i.e. display the context of the
    matched phrase).

    The requirements are:

    * Match should be case-insensitive, but extract should have case
    preserved.

    * The extract should show N words or characters of context on both
    sides of the match.

    * The phrase should be highlighted. i.e. Bracketed by arbitrary text
    e.g. "<b>" & "</b>".

    * The phrase is simple. e.g. "double" or "Another option"

    * Only the first N matches should be returned.

    * There will always be at least one match. (The extracts are only
    requested if another process has determined a match exists in the
    document.)

    * The size of the text document varies from a few hundred
    kilobytes to a little under 20 megabytes.
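    For concreteness, here's a toy sketch of the output format I have in
    mind (the extract() helper and the 15-character context window are
    made up for illustration only):

```python
import re

def extract(content, phrase, context=15):
    # Toy illustration: case-insensitive match, case-preserving extract,
    # match bracketed by <b>/</b>, with `context` characters either side.
    m = re.search(re.escape(phrase), content, re.IGNORECASE)
    start, end = m.span()
    return "...%s<b>%s</b>%s..." % (
        content[max(start - context, 0):start].lstrip(),
        content[start:end],
        content[end:end + context].rstrip())

print(extract("Another Option would be to use a regex here.", "another option"))
# -> ...<b>Another Option</b> would be to us...
```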

    I've found two alternative methods (included below) and was wondering
    if anyone had suggestions for improvements. One method uses the string
    "index()" method and the other uses a regular expression.

    As tested, the results are inconsistent: in "real world" testing the
    regular expression method seems to be faster most of the time, but in
    "timeit" tests it almost always seems to be considerably slower, which
    surprised me. I'm beginning to think the regular expression method is
    only faster when the matches are near the beginning of the document.
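    One hunch (unverified) for the discrepancy: getNext_A pays for a full
    content.lower() copy of the document on every call, a fixed cost
    proportional to the file size, whereas finditer() scans lazily, so
    islice() can stop almost immediately when the first few matches are
    near the start. A quick sketch of measuring that fixed cost (the
    sample string and its size are made up):

```python
import timeit

# Build ~2.5 MB of throwaway text and time the full-copy lower() call
# that getNext_A performs once per invocation.
big = "Simply a sample sentence. " * 100000

per_call = timeit.timeit(big.lower, number=5) / 5
print("%.6f sec per lower() copy of %d characters" % (per_call, len(big)))
```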

    I'm using the Python Swish-E binding
    <http://jibe.freeshell.org/bits/SwishE/> on Windows with Python 2.3.
    The purpose of the code is to display context for each found document
    (which is actually a PDF file with the content converted to text).

    In "real world" practice for a set of fifteen results there's only
    around two to five seconds difference between the two methods, so I
    should probably stop worrying about it. I really just wanted to know
    if any other approach is likely to be significantly better. And if
    not, then anyone else can feel free to use this code. :)

    --Phil.

    # ========== Test Output ============

    For the following test output:

    "getNext_A" uses the "index()" string method.
    "getNext_B" uses regular expressions.


    # 17MB file with phrase "simply"
    ------ getNext_B ------
    1.2671 sec/pass
    126.72300005

    ------ getNext_A ------
    0.5441 sec/pass
    54.4189999104

    # 17MB file with phrase "auckland"
    ------ getNext_B ------
    0.0054 sec/pass
    0.530999898911

    ------ getNext_A ------
    0.4429 sec/pass
    44.2940001488

    # 132KB file with phrase "simply"
    ------ getNext_B ------
    0.0111 sec/pass
    1.12199997902

    ------ getNext_A ------
    0.0041 sec/pass
    0.411000013351

    # 132KB file with phrase "auckland"
    ------ getNext_B ------
    0.0109 sec/pass
    1.10099983215

    ------ getNext_A ------
    0.0041 sec/pass
    0.411000013351


    # ========== Script file "test_context.py" ============

    #!/usr/bin/python
    import re
    import itertools

    FILENAME = r"17MB_document.txt"

    #PHRASE = "auckland"
    #PHRASE = "simply"
    PHRASE = "proceedings"

    MAX_MATCHES = 7

    RANGE = 40


    def getNext_A(content, phrase, limit):
        """Yield (pre, match, post) context tuples using str.index()."""
        lowContent = content.lower()
        lowPhrase = phrase.lower()
        phraseLen = len(phrase)

        idx = -1
        for matchCount in range(limit):
            try:
                idx = lowContent.index(lowPhrase, idx + 1)
            except ValueError:
                break

            yield (content[max(idx - RANGE, 0): idx].lstrip(),
                   content[idx: idx + phraseLen],
                   content[idx + phraseLen: idx + phraseLen + RANGE].rstrip())


    def getNext_B(content, phrase, limit):
        """Yield (pre, match, post) context tuples using a regular expression."""
        matcher = re.compile(phrase, re.IGNORECASE)  # TODO: Escape "phrase"?

        for match in itertools.islice(matcher.finditer(content), limit):
            start, end = match.span()
            yield (content[max(start - RANGE, 0): start].lstrip(),
                   content[start: end],
                   content[end: end + RANGE].rstrip())


    def getContext(content, phrase, func):
        """Join the highlighted extracts for up to MAX_MATCHES matches."""
        results = []
        for match in func(content, phrase, MAX_MATCHES):
            results.append("...%s<b>%s</b>%s..." % match)
        return "".join(results)


    import timeit
    import time

    if __name__ == "__main__":
        print
        content = open(FILENAME).read()

        for (f, n) in [(getNext_B, "getNext_B"), (getNext_A, "getNext_A")]:
            print "------ %s ------" % n
            ta = time.time()

            t = timeit.Timer(stmt="getContext(content, PHRASE, %s)" % n,
                             setup="from __main__ import getContext, content, "
                                   "PHRASE, getNext_A, getNext_B")
            print "%.4f sec/pass" % (t.timeit(number=100) / 100)

            print time.time() - ta

            print
            #print getContext(content, PHRASE, f)
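    On the TODO in getNext_B: if the phrase could ever contain regex
    metacharacters, re.escape() makes it safe to compile as a literal
    (assuming literal matching is always what's wanted; the phrases here
    are simple, so it may not matter in practice):

```python
import re

# Escape the phrase so metacharacters like "+" or "(" match literally
# instead of being treated as regex operators.
phrase = "C++ (simply)"
matcher = re.compile(re.escape(phrase), re.IGNORECASE)

m = matcher.search("...discussing c++ (Simply) as a phrase...")
print(m.group(0))
# -> c++ (Simply)
```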
     
    Follower, Oct 13, 2004
    #1
