Needs help with Matching Logic

Discussion in 'Perl Misc' started by Kishore, Jul 20, 2004.

  1. Kishore

    Kishore Guest

    I am comparatively new to Perl.
    I am working on logic to display snippets of matched results for a
    'keyword' from a text file, just as Google does in its search
    results.

    I have the content of the text file in the variable $file_content.
    And I have the 'keyword' in $keyword.

    I need to get a string like the one Google shows when displaying search
    results.
    When I match $keyword in $file_content, I also want to pull the 5
    words before and the 5 words after, so I can show the snippet of the file
    where the keyword match occurs.

    I searched Google Groups for a few days, but couldn't find
    anything to help me.

    I really appreciate any help I can get.

    Thanks!
    Kishore
     
    Kishore, Jul 20, 2004
    #1

  2. Paul Lalli

    Paul Lalli Guest

    On Tue, 20 Jul 2004, Kishore wrote:

    > I am comparatively new to Perl.
    > I am working on logic to display snippets of matched results for a
    > 'keyword' from a text file, just as Google does in its search
    > results.
    >
    > I have the content of the text file in the variable $file_content.
    > And I have the 'keyword' in $keyword.
    >
    > I need to get a string like the one Google shows when displaying search
    > results.
    > When I match $keyword in $file_content, I also want to pull the 5
    > words before and the 5 words after, so I can show the snippet of the file
    > where the keyword match occurs.
    >
    > I searched Google Groups for a few days, but couldn't find
    > anything to help me.
    >
    > I really appreciate any help I can get.


    How about something like:

    m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

    Using that, $1 is the series of up to five words before the match, $2 is
    the match, and $3 is the series of up to five words after the match.

    It'd probably have to be tweaked a bit to get exactly what you want, but
    it should at least give you a starting point.
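
    As a minimal sketch of how the pattern above might be used to build a
    snippet: the sample text and the bracket/ellipsis formatting below are
    made up for illustration; only $file_content, $keyword and the regex
    come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real program these would come from
# the text file and from the user's search term.
my $file_content = 'one two three four five six seven KEY eight nine ten eleven twelve thirteen';
my $keyword      = 'KEY';

my $snippet = '';
if ($file_content =~ m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/) {
    # $1 = up to five words before, $2 = the match, $3 = up to five words after
    my ($before, $match, $after) = ($1, $2, $3);
    $snippet = '...' . $before . '[' . $match . ']' . $after . '...';
}
print "$snippet\n";
```

    The captures are copied into named variables straight away, since $1,
    $2 and $3 are overwritten by the next successful match.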

    Paul Lalli
     
    Paul Lalli, Jul 20, 2004
    #2

  3. Kishore

    Kishore Guest

    Paul Lalli <> wrote in message
    > how about something like:
    >
    > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
    >
    > Using that, $1 is the series of up to five words before the match, $2 is
    > the match, and $3 is the series of up to five words after the match.
    >


    It works really great.

    Thank you very much.

    What is the colon in (?:...) for? I don't believe I saw this in the books I have
    been referring to so far.

    Thanks!
    - Kishore.
     
    Kishore, Jul 21, 2004
    #3
  4. gnari

    gnari Guest

    "Kishore" <> wrote in message
    news:...
    > Paul Lalli <> wrote in message
    > > how about something like:
    > >
    > > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
    > >

    >
    > It works really great.
    >
    > What is the colon in (?:...) for? I don't believe I saw this in the books I have
    > been referring to so far.


    (?:...)

    look up 'Extended Patterns' in
    perldoc perlre

    gnari
     
    gnari, Jul 21, 2004
    #4
  5. On 2004-07-20, Paul Lalli <> wrote:
    >
    > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
    >
    > Using that, $1 is the series of up to five words before the match, $2 is
    > the match, and $3 is the series of up to five words after the match.


    Note that if $keyword is supposed to be a plain string rather than a
    regex, you'll need to escape metacharacters in it. An easy way to do
    this is:

    m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/
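
    To see why the escaping matters, here is a small sketch with a keyword
    that contains a metacharacter. The sample text and the keyword 'a+b'
    are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical data: the keyword contains '+', which is a regex quantifier.
my $file_content = 'compare aab with a+b';
my $keyword      = 'a+b';

# Interpolated as a regex: "one or more 'a', then 'b'".
my ($as_regex) = $file_content =~ m/($keyword)/;

# Interpolated as a literal string thanks to \Q...\E.
my ($as_literal) = $file_content =~ m/(\Q$keyword\E)/;

print "as regex:   $as_regex\n";
print "as literal: $as_literal\n";

# \Q...\E has the same effect as calling quotemeta() on the string.
my $escaped = quotemeta($keyword);
print "quotemeta:  $escaped\n";
```

    Without \Q the pattern finds 'aab', which is not what the user typed.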

    Also, this regex can be optimized a bit by noting that the only way $1
    can contain less than 5 words is if the match occurs at the very
    beginning of the string. Separating that special case, we get:

    m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/

    This is noticeably faster if the first occurrence of $keyword isn't
    near the beginning, since it saves the regex engine some needless
    backtracking.
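
    A quick way to check that the rewritten pattern still captures the same
    things is to run both over a couple of strings. This sketch only checks
    agreement on two made-up samples, one with the keyword near the start
    and one further in; it does not measure speed.

```perl
use strict;
use warnings;

my $keyword   = 'KEY';
my $plain     = qr/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/;
my $optimized = qr/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/;

# Hypothetical sample strings.
my @samples = (
    'one two KEY three four',
    'one two three four five six seven KEY eight nine',
);

my $all_agree = 1;
for my $text (@samples) {
    my @a = $text =~ $plain;        # captures from the original pattern
    my @b = $text =~ $optimized;    # captures from the rewritten pattern
    $all_agree = 0 unless "@a" eq "@b";
    print "before='$a[0]' after='$a[2]'\n";
}
print $all_agree ? "patterns agree\n" : "patterns differ\n";
```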

    Also note that, if you use global matching to extract multiple
    snippets from the text, the results can be unexpected if there are
    multiple occurrences of $keyword near each other. In particular, if
    there are less than 5 words between two occurrences, the second one
    will be swallowed in the 5 words matched after the first one.

    The easiest way to fix that is to use negative look-ahead:

    m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

    Oddly enough, optimizing this regex the same way as before doesn't
    seem to help, and seems to tickle a perl bug (probably related to \G
    handling?) when used in scalar context.
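
    The swallowing described above is easy to reproduce. In this sketch
    (sample text invented, list-context /g only) the unguarded pattern
    returns one snippet for two occurrences, while the look-ahead version
    returns two.

```perl
use strict;
use warnings;

# Hypothetical text with two occurrences only two words apart.
my $file_content = 'a b c KEY d e KEY f g h i j k';
my $keyword      = 'KEY';

# In list context, /g returns the captures of every match, three per match.
my @naive = $file_content =~
    m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/g;
my @guarded = $file_content =~
    m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g;

printf "naive:   %d snippet(s)\n", scalar(@naive) / 3;
printf "guarded: %d snippet(s)\n", scalar(@guarded) / 3;
print  "naive trailing context:   '$naive[2]'\n";
print  "guarded trailing context: '$guarded[2]'\n";
```

    Note that with the guarded pattern the second snippet has no leading
    context here, because the words before it were already consumed by the
    first match.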


    Oh, and you probably want case-insensitive matching, and should
    probably allow punctuation around $keyword, something like:

    m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i

    or (optimized):

    m/((?:\w+\W+){5}|^\W*(?:\w+\W+){0,4})(\Q$keyword\E)((?:\W+\w+){0,5})/i

    or for global matching:

    m/((?:\w+\W+){0,5}?)(\Q$keyword\E)((?:\W+(?!\Q$keyword\E)\w+){0,5})/ig
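
    A small sketch of what the \w/\W change buys you when the keyword sits
    next to punctuation. The sentence is made up; /i is applied to both
    patterns so that only the word-splitting differs.

```perl
use strict;
use warnings;

# Hypothetical text: the keyword is wrapped in a quote mark and a comma.
my $file_content = 'He said: "Perl, of course!" and then left the room quietly.';
my $keyword      = 'perl';

# Whitespace-delimited words: the keyword must be surrounded by whitespace
# for any context to be picked up.
my @by_space = $file_content =~
    m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/i;

# Word characters separated by non-word characters: punctuation is allowed.
my @by_word = $file_content =~
    m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i;

print "by space: <$by_space[0]>[$by_space[1]]<$by_space[2]>\n";
print "by word:  <$by_word[0]>[$by_word[1]]<$by_word[2]>\n";
```

    With the \S/\s version the match still succeeds, but both context
    captures come back empty for this input.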

    --
    Ilmari Karonen
    If replying by e-mail, please replace ".invalid" with ".net" in address.
     
    Ilmari Karonen, Jul 21, 2004
    #5
  6. Ilmari Karonen <> writes:

    > On 2004-07-20, Paul Lalli <> wrote:
    > >
    > > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/
    > >
    > > Using that, $1 is the series of up to five words before the match, $2 is
    > > the match, and $3 is the series of up to five words after the match.

    >
    > Note that if $keyword is supposed to be a plain string rather than a
    > regex, you'll need to escape metacharacters in it. An easy way to do
    > this is:
    >
    > m/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/


    > Also note that, if you use global matching to extract multiple
    > snippets from the text, the results can be unexpected if there are
    > multiple occurrences of $keyword near each other. In particular, if
    > there are less than 5 words between two occurrences, the second one
    > will be swallowed in the 5 words matched after the first one.
    >
    > The easiest way to fix that is to use negative look-ahead:
    >
    > m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g


    Er, no, it would be easier and more idiomatic to put the third capture
    inside a lookahead.

    m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g


    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
     
    Brian McCauley, Jul 21, 2004
    #6
  7. Ben Morrow

    Ben Morrow Guest

    Quoth (Kishore):
    > Paul Lalli <> wrote in message
    > > how about something like:
    > >
    > > m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/

    >
    > What is the colon in (?:...) for? I don't believe I saw this in the books I have
    > been referring to so far.


    The construction is (?: ... ), to be contrasted with ( ... ); it modifies
    the parens so that they just group without capturing. See perldoc
    perlre or perldoc perlretut.

    [as a side note, I would *always* use /x on a regex with (?:) in, just
    because things get lost:

    /( (?: \S+\s+ ){0,5} ) ($keyword) ( (?: \s+\S+ ){0,5} )/x

    ]
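
    To make the difference concrete, here is a sketch comparing plain
    parens with (?: ... ) around the repeated group. The sample string and
    the literal KEY are invented; both patterns use /x as suggested above.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'a b KEY';

# Plain inner parens capture too, so the keyword ends up in $3.
my @plain_parens = $text =~ / ( (\S+\s+){0,5} ) (KEY) /x;

# Non-capturing inner parens only group, so the keyword stays in $2.
my @non_capturing = $text =~ / ( (?: \S+\s+ ){0,5} ) (KEY) /x;

print 'plain parens:  ', scalar(@plain_parens),  " captures\n";
print 'non-capturing: ', scalar(@non_capturing), " captures\n";
```

    One thing to keep in mind with /x: whitespace inside an interpolated
    $keyword is taken literally only if the keyword is wrapped in
    \Q...\E, since \Q escapes the spaces.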

    Ben

    --
    "If a book is worth reading when you are six, *
    it is worth reading when you are sixty." - C.S.Lewis
     
    Ben Morrow, Jul 21, 2004
    #7
  8. On 2004-07-21, Brian McCauley <> wrote:
    > Ilmari Karonen <> writes:
    >>
    >> Also note that, if you use global matching to extract multiple
    >> snippets from the text, the results can be unexpected if there are
    >> multiple occurrences of $keyword near each other. In particular, if
    >> there are less than 5 words between two occurrences, the second one
    >> will be swallowed in the 5 words matched after the first one.
    >>
    >> The easiest way to fix that is to use negative look-ahead:
    >>
    >> m/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/g

    >
    > Er, no, it would be easier and more idiomatic to put the third capture
    > inside a lookahead.
    >
    > m/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/g


    Those two don't do the same thing. With your version the snippets may
    overlap, with mine they can't. Deciding which solution is better is
    really up to the OP.
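
    A rough sketch of the difference, using an invented sample string and a
    hypothetical helper, snippets(), that glues each group of three
    captures back together:

```perl
use strict;
use warnings;

# Hypothetical text with two occurrences only two words apart.
my $file_content = 'a b c KEY d e KEY f g h i j k';
my $keyword      = 'KEY';

# Trailing context captured inside a lookahead: not consumed by the match.
my $overlapping = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/;

# Trailing context consumed, but stopping short of the next keyword.
my $disjoint = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

# Illustrative helper: run a pattern with /g and format each match.
sub snippets {
    my ($re, $text) = @_;
    my @caps = $text =~ /$re/g;    # three captures per match
    my @out;
    while (my @m = splice(@caps, 0, 3)) {
        push @out, $m[0] . '[' . $m[1] . ']' . $m[2];
    }
    return @out;
}

my @with_overlap    = snippets($overlapping, $file_content);
my @without_overlap = snippets($disjoint,    $file_content);

print "lookahead: $_\n" for @with_overlap;
print "negative:  $_\n" for @without_overlap;
```

    With the lookahead version the words "d e" show up in both snippets;
    with the negative look-ahead version each word appears in at most one.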

    --
    Ilmari Karonen
    If replying by e-mail, please replace ".invalid" with ".net" in address.
     
    Ilmari Karonen, Jul 22, 2004
    #8
  9. Kishore

    Kishore Guest

    Ilmari Karonen <> wrote in message news:<>...
    > On 2004-07-20, Paul Lalli <> wrote:
    >
    > Oh, and you probably want case-insensitive matching, and should
    > probably allow punctuation around $keyword, something like:
    >
    > m/((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i
    >


    I was having problems with punctuation.
    This code solved the problem.
    Thanks very much.
     
    Kishore, Jul 22, 2004
    #9