Extracting a range of words!

Discussion in 'Perl Misc' started by vivek_12315, Dec 16, 2010.

  1. vivek_12315

    vivek_12315 Guest

    I need perl help...

    say

    my $text = qq (I confirm that sufficient information and detail have been
    reported in this technical report, that it is scientifically sound,
    and that appropriate conclusions have been included);

    I find the index of "sound".

    After that I just need the substring covering {-5, +5} WORDS around
    that indexof(sound), i.e. five words before it and five words after it.

    i.e.

    My final answer should be: report, that it is scientifically sound, and
    that appropriate conclusions have

    Is there a strategy for this, or do I have to do it in basic steps?
    vivek_12315, Dec 16, 2010
    #1

  2. vivek_12315

    Uri Guttman Guest

    >>>>> "TM" == Tad McClellan <> writes:

    TM> my $word = 'sound';
    TM> $text =~ s/.*? # leading stuff to strip
    TM> ( # $1 is stuff to keep
    TM> (\w+\W+){0,5} # 0-5 words
    TM> \b$word\b\W* # the word to search for
    TM> (\w+\W*){0,5} # 0-5 words
    TM> )
    TM> .* # trailing stuff to strip
    TM> /$1/sx;

    that was pretty much the regex i would write. but do you need the \b's
    in there? assuming $word is really \w chars, then the preceding \W will
    obviate the need for the \b. same for the trailing one.

    also i would use \s+\S+ since written words could contain apostrophes
    and some other punctuation.
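
A minimal sketch of the \S-based variant suggested above, for anyone who wants to try it. The helper name context_window, its parameters and the \Q...\E quoting are assumptions for illustration, not something posted in the thread. It treats any run of non-whitespace as a word, so apostrophes and trailing commas stay attached to their word.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return up to $n whitespace-delimited
# tokens on each side of the first token that contains $word as a whole word.
sub context_window {
    my ($text, $word, $n) = @_;
    return unless $text =~ /
        (                              # the window to keep
            (?: \S+ \s+ ){0,$n}        # up to n tokens before
            \S* \b \Q$word\E \b \S*    # the token holding the word
            (?: \s+ \S+ ){0,$n}        # up to n tokens after
        )
    /x;
    return $1;
}

my $text = 'I confirm that sufficient information and detail have been '
         . 'reported in this technical report, that it is scientifically '
         . 'sound, and that appropriate conclusions have been included';

print context_window($text, 'sound', 5), "\n";
```

Note that comments inside a /x pattern are still subject to variable interpolation, so the sketch avoids writing variable names in them.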

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
    Uri Guttman, Dec 16, 2010
    #2

  3. vivek_12315

    ccc31807 Guest

    On Dec 16, 10:52 am, vivek_12315 <> wrote:
    > $text = qq (I confirm that sufficient information and detail have been
    > reported in this technical report, that it is scientifically sound,
    > and that appropriate conclusions have been included)
    > After that I just need the substring from {-5, +5} WORDS around that
    > indexof(sound)


    Here's a pretty mindless way to do it. Split the string into an array,
    then iterate through the array looking for 'sound'. If necessary, you
    can use the word boundary markers. Then, starting five elements before
    the index of the array element you found, print eleven elements: the
    five before, the word itself, and the five after. Like this:

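    use strict;
    use warnings;

    my $text = qq (I confirm that sufficient information and detail have
    been reported in this technical report, that it is scientifically
    sound, and that appropriate conclusions have been included);
    my @text = split(' ', $text);    # split on runs of whitespace
    my $index = 0;
    foreach my $element (@text)
    {
        last if ($element =~ /\bsound\b/);    # whole word only, not 'soundly'
        $index++;
    }
    die "'sound' not found\n" if $index > $#text;
    # clamp the window so it never runs off either end of the array
    my $start = $index < 5 ? 0 : $index - 5;
    my $end   = $index + 5 > $#text ? $#text : $index + 5;
    my @report = @text[$start .. $end];
    print "@report\n";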

    CC.
    ccc31807, Dec 16, 2010
    #3
  4. vivek_12315

    Guest

    On Thu, 16 Dec 2010 07:52:35 -0800 (PST), vivek_12315 <> wrote:

    >I need perl help...
    >
    >say
    >
    >$text = qq (I confirm that sufficient information and detail have been
    >reported in this technical report, that it is scientifically sound,
    >and that appropriate conclusions have been included)
    >
    >i find the index for "sound".
    >
    >After that I just need the substring from {-5, +5} WORDS around that
    >indexof(sound)
    >
    >i.e.
    >
    >my final answer shud be = report, that it is scientifically sound, and
    >that appropriate conclusions have
    >
    >Is There a strategy or I have to do it in basic steps ?


    The only real strategy is that you have to know what WORDS are,
    which is something like parsing a language. It's not enough to split
    on spaces, so you need to define the language first.

    That means defining the relationship between punctuation and whitespace,
    the usual separators of language WORDs.
    It's not easy. Free-flowing English with bad spelling, stray punctuation,
    etc. will not make this easy. Since you have no basis for a grammar, an
    approximation is the best you can do.

    I like this one: it uses punctuation and it enforces some rules.
    But it is impossible to get it right every time.

    -sln


    use strict;
    use warnings;

    my $text = qq (I confirm that sufficient information and detail have been
    reported in this technical report, that it' is "scientifically" sound,
    and that appropriate conclusion's have been included);

    if ( $text =~ /
        (                           #1
          (                         #2
            (?:
              (?:^|\s)
              [[:punct:]]*
              \w
              [\w[:punct:]]*
              [\s[:punct:]]*
            ){0,5}
          )
          sound
          (                         #3
            (?:
              [\s[:punct:]]*
              \w
              [\w[:punct:]]*
              (?:$|\s)
            ){0,5}
          )
        )
    /x )
    {
    print <<RES;
    \r 1= '$1'\n\n
    \r 2= '$2'\n\n
    \r 3= '$3'\n
    RES
    }
    , Dec 16, 2010
    #4
  5. vivek_12315

    Guest

    On Thu, 16 Dec 2010 13:57:52 -0600, Tad McClellan <> wrote:

    >Uri Guttman <> wrote:
    >>>>>>> "TM" == Tad McClellan <> writes:

    >>
    >> TM> my $word = 'sound';
    >> TM> $text =~ s/.*? # leading stuff to strip
    >> TM> ( # $1 is stuff to keep
    >> TM> (\w+\W+){0,5} # 0-5 words
    >> TM> \b$word\b\W* # the word to search for
    >> TM> (\w+\W*){0,5} # 0-5 words
    >> TM> )
    >> TM> .* # trailing stuff to strip
    >> TM> /$1/sx;
    >>
    >> that was pretty much the regex i would write. but do you need the \b's
    >> in there? assuming $word is really \w chars, then the preceding \W will
    >> obviate the need for the \b. same for the trailing one.

    >
    >
    >It is needed for the trailing one, since the \W is zero or more,
    >and it is zero or more so that the last $word in $text can be
    >matched.
    >
    >
    >> also i would use \s+\S+ since written words could contain apostrophes
    >> and some other punctuation.

    >
    >
    >That would be an improvement, I'll use that in the future.
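
Tad's point about the trailing \b can be checked with a short test. This is only a sketch with made-up input (the string 'he slept soundly' is an assumption, not from the thread): because \W* may match nothing, the pattern without \b accepts 'sound' as a prefix of 'soundly'.

```perl
use strict;
use warnings;

my $word = 'sound';
my $text = 'he slept soundly';    # made-up input for illustration

# Without the trailing \b, \W* matches the empty string and 'soundly' slips through.
my $without_b = $text =~ /\W+$word\W*/   ? 1 : 0;
# With the trailing \b, the match fails because 'l' follows 'sound'.
my $with_b    = $text =~ /\W+$word\b\W*/ ? 1 : 0;

print "without \\b: $without_b, with \\b: $with_b\n";
# prints: without \b: 1, with \b: 0
```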


    Neither \s+\S+ nor \w+\W+ will work separately; they have to be used together.
    But since they overlap, it's impossible to use them together. This leaves
    \w plus \s plus punctuation as the foundation.

    -sln
    , Dec 16, 2010
    #5
  6. vivek_12315

    Justin C Guest

    On 2010-12-16, vivek_12315 <> wrote:
    > I need perl help...
    >
    > say
    >
    > $text = qq (I confirm that sufficient information and detail have been
    > reported in this technical report, that it is scientifically sound,
    > and that appropriate conclusions have been included)
    >
    > i find the index for "sound".
    >
    > After that I just need the substring from {-5, +5} WORDS around that
    > indexof(sound)
    >
    > i.e.
    >
    > my final answer shud be = report, that it is scientifically sound, and
    > that appropriate conclusions have
    >
    > Is There a strategy or I have to do it in basic steps ?


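    my @words = split ' ', $text;
    # indexof() is not a Perl builtin; this finds the first element containing 'sound'
    my ($i) = grep { $words[$_] =~ /\bsound\b/ } 0 .. $#words;
    die "'sound' not found\n" unless defined $i;
    my @wanted;

    for my $n ( 0 .. $#words ) {
        # keep five words on either side of position $i, plus the word itself
        push @wanted, $words[$n] if ($n >= $i - 5) && ($n <= $i + 5);
    }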
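
The same split-and-select idea can also be written with an array slice. The following is a rough sketch only: the helper name words_around and its parameters are hypothetical, and it assumes words are separated by whitespace, which the thread already notes is an approximation. The window is clamped so a match near the start or end of the text does not index outside the array.

```perl
use strict;
use warnings;
use List::Util qw(first max min);

# Hypothetical helper: split on whitespace, locate the first token containing
# $word as a whole word, and return the tokens within $n positions of it.
sub words_around {
    my ($text, $word, $n) = @_;
    my @words = split ' ', $text;
    my $i = first { $words[$_] =~ /\b\Q$word\E\b/ } 0 .. $#words;
    return unless defined $i;            # word not present
    my $from = max(0, $i - $n);          # clamp at the first word
    my $to   = min($#words, $i + $n);    # clamp at the last word
    return join ' ', @words[$from .. $to];
}

print words_around('one two three sound four', 'sound', 2), "\n";
# prints: two three sound four
```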


    Justin.

    --
    Justin C, by the sea.
    Justin C, Dec 17, 2010
    #6