regexpr question is w2 taken

Discussion in 'Perl Misc' started by Richard Bell, Apr 15, 2004.

  1. Richard Bell

    Richard Bell Guest

    I'm a bit new to perl and am trying to emulate the behavior of a free
    text search engine that has a feature

    is w2 taken

    taken to mean the word 'is' within 2 words of the word 'taken' where
    the distance (2) and the words ('is', 'taken') are arbitrary.

    I've a variable that looks like this

    'one two three four and so on words seperated by spaces that goes on
    and on and on and on for a very long way'

    that I'm trying to process.

    I'm having a problem finding a regular expression that handles this
    case. Something like

    "\bis\b(what goes here){0,2}\btaken\b"

    Can someone point me in the right direction?

    I assume that $pos will point to the last character matched. Is this
    correct? How can I know the index of the first character matched? Can
    I know what '(what goes here)' matched? How? As part of this
    process, I'm trying to track what characters in the string were
    matched by a number of regular expressions by getting $pos and keeping
    a bit map of the characters matched.

    Thanks.

    Richard
     
    Richard Bell, Apr 15, 2004
    #1

  2. Paul Lalli

    Paul Lalli Guest

    On Thu, 15 Apr 2004, Richard Bell wrote:

    > I'm a bit new to perl and am trying to emulate the behavior of a free
    > text search engine that has a feature
    >
    > is w2 taken
    >
    > taken to mean the word 'is' within 2 words of the word 'taken' where
    > the distance (2) and the words ('is', 'taken') are arbitrary.
    >
    > I've a variable that looks like this
    >
    > 'one two three four and so on words seperated by spaces that goes on
    > and on and on and on for a very long way'
    >
    > that I'm trying to process.
    >
    > I'm having a problem finding a regular expression that handles this
    > case. Something like
    >
    > "\bis\b(what goes here){0,2}\btaken\b"
    >
    > Can someone point me in the right direction?
    >
    > I assume that $pos will point to the last character matched. Is this
    > correct? How can I know the index of the first character matched? Can
    > I know what '(what goes here)' matched? How? As part of this
    > process, I'm trying to track what characters in the string were
    > matched by a number of regular expressions by getting $pos and keeping
    > a bit map of the characters matched.
    >



    [untested]

    if ($string =~ /\b$first\s+((?:\w+\s+){0,2})$second\b/) {
        # Capture the whole gap; a bare (\w+\s+){0,2} keeps only the last word in $1.
        print "Found $first within two words of $second\n";
        print "Separated by '$1'\n";
    }

    This is assuming, of course, that Perl's definition of 'word' is
    acceptable to you. If not, you might want to replace the \w+\s+ above
    with something like

    [a-zA-Z[:punct:]]+\s+

    or, to just say "0, 1, or 2 sequences of non-whitespace, each followed by
    whitespace":

    \S+\s+
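
The pattern above can be wrapped in a small function so that the two words and the distance are all arbitrary, as the original question asks. This is only a sketch: `within_words` is a hypothetical name, it uses the `\S+\s+` definition of a word, it assumes "within N words" means at most N intervening words, and it only checks the first word appearing before the second.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between $first and $second when
# $second follows $first with at most $n whitespace-delimited words in
# between, and undef (in scalar context) otherwise.
sub within_words {
    my ($text, $first, $second, $n) = @_;
    # \Q...\E quotes any regex metacharacters in the search words.
    if ($text =~ /\b\Q$first\E\s+((?:\S+\s+){0,$n})\Q$second\E\b/) {
        return $1;    # empty string when the two words are adjacent
    }
    return;
}

my $text = 'the seat that is almost always taken by noon';
my $gap  = within_words($text, 'is', 'taken', 2);
print defined $gap ? "matched, gap='$gap'\n" : "no match\n";
```

To accept the words in either order, one option is to call the function a second time with `$first` and `$second` swapped.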


    Hope this helps
    Paul Lalli
     
    Paul Lalli, Apr 15, 2004
    #2

  3. Anno Siegel

    Anno Siegel Guest

    Richard Bell <> wrote in comp.lang.perl.misc:
    >
    > I'm a bit new to perl and am trying to emulate the behavior of a free
    > text search engine that has a feature
    >
    > is w2 taken
    >
    > taken to mean the word 'is' within 2 words of the word 'taken' where
    > the distance (2) and the words ('is', 'taken') are arbitrary.
    >
    > I've a variable that looks like this
    >
    > 'one two three four and so on words seperated by spaces that goes on

    ^^^^^^^^^
    "separated"

    > and on and on and on for a very long way'
    >
    > that I'm trying to process.
    >
    > I'm having a problem finding a regular expression that handles this
    > case. Something like
    >
    > "\bis\b(what goes here){0,2}\btaken\b"
    >
    > Can someone point me in the right direction?


    An approach:

    my ( $first, $last, $n) = ( 'words', 'spaces', 2);

    my $any_word = qr/\s*\b\S+/;
    print "$1\n" if /($first${any_word}{0,$n}\s*\b$last)/;

    There are at least two non-trivial problems left. One is the simplistic
    definition of "word" as a maximal sequence of non-spaces. A better
    definition of $any_word would be needed. Another is that texts come
    in lines, but you will want to match across line boundaries. Slurping
    the whole text [...]


    > I assume that $pos will point to the last character matched. Is this
    > correct?


    $pos? If you mean the pos() function, it is not correct. perldoc -f pos.
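
For illustration, a small sketch of what pos() actually reports (the sample strings are made up): after a successful scalar-context m//g match it holds the offset just past the end of the match, not the index of the last character matched, and it stays undefined after a match without /g.

```perl
use strict;
use warnings;

my $text = 'it is not taken';
my @ends;

# Each successful m//g match in scalar context updates pos($text).
while ($text =~ /\b\w+\b/g) {
    push @ends, pos($text);    # offset one past the end of this match
}
print "@ends\n";

# Without /g, pos() is not set by the match.
my $plain = 'is taken';
$plain =~ /taken/;
my $after_plain = pos($plain);
print defined $after_plain ? "pos is $after_plain\n" : "pos is undef\n";
```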

    > How can I know the index of the first character matched? Can
    > I know what '(what goes here)' matched? How? As part of this
    > process,


    You need to read up on regular expressions. These are very elementary
    questions. Look for capturing parentheses in perlre and for the
    arrays @+ and @- in perlvar.
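
A minimal sketch of those two features together, using a made-up sample string: capturing parentheses put the intervening words in $1, $-[0] and $+[0] give the start offset and the offset just past the end of the whole match, and $-[1] and $+[1] do the same for the first capture group.

```perl
use strict;
use warnings;

my $text = 'what is almost always taken';
my ($start, $end, $gap, $gap_start, $gap_end);

if ($text =~ /\bis\s+((?:\w+\s+){0,2})taken\b/) {
    ($start, $end)         = ($-[0], $+[0]);    # whole match
    $gap                   = $1;                # text captured by group 1
    ($gap_start, $gap_end) = ($-[1], $+[1]);    # offsets of group 1
    print "match spans $start to ", $end - 1, "\n";
    print "gap '$gap' spans $gap_start to ", $gap_end - 1, "\n";
}
```

The index of the last character matched is $+[0] - 1, since @+ holds offsets one past the end.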

    > I'm trying to track what characters in the string were
    > matched by a number of regular expressions by getting $pos and keeping
    > a bit map of the characters matched.


    A bit map of the characters matched? I'm not sure what you mean, but
    you may want vec() and ord(). Watch out for Unicode.
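
If the goal is one bit per character offset, set whenever some pattern matched that character, a sketch along these lines might do. It assumes that reading of "bit map"; the patterns and sample string are made up. The offsets in @- and @+ are character offsets, so the map stays aligned with the string even when a character takes several bytes.

```perl
use strict;
use warnings;

my $text    = 'it is not taken';
my $matched = '';    # bit vector, one bit per character offset

for my $re (qr/\bis\b/, qr/\btaken\b/) {
    while ($text =~ /$re/g) {
        # Mark every offset covered by this match.
        vec($matched, $_, 1) = 1 for $-[0] .. $+[0] - 1;
    }
}

# Render the bitmap under the text: 'x' for matched, '.' for unmatched.
my $map = join '', map { vec($matched, $_, 1) ? 'x' : '.' } 0 .. length($text) - 1;
print "$text\n$map\n";
```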

    Anno
     
    Anno Siegel, Apr 16, 2004
    #3
