Extracting a range of words!

vivek_12315

I need Perl help.

Say I have:

$text = qq (I confirm that sufficient information and detail have been
reported in this technical report, that it is scientifically sound,
and that appropriate conclusions have been included);

I find the index of "sound".

After that I just need the substring spanning {-5, +5} WORDS around
that index.

i.e.

my final answer should be: report, that it is scientifically sound, and
that appropriate conclusions have

Is there a strategy for this, or do I have to do it in basic steps?
 
Uri Guttman

TM> my $word = 'sound';
TM> $text =~ s/.*? # leading stuff to strip
TM> ( # $1 is stuff to keep
TM> (\w+\W+){0,5} # 0-5 words
TM> \b$word\b\W* # the word to search for
TM> (\w+\W*){0,5} # 0-5 words
TM> )
TM> .* # trailing stuff to strip
TM> /$1/sx;

that was pretty much the regex i would write. but do you need the \b's
in there? assuming $word is really \w chars, then the preceding \W will
obviate the need for the \b. same for the trailing one.

also i would use \s+\S+ since written words could contain apostrophes
and some other punctuation.
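
a rough sketch of that whitespace-based variant, in case it helps. the sub name context_ws and the $n parameter are made up for illustration and do not come from the quoted code. it treats any run of non-whitespace as a word, so apostrophes and attached punctuation stay with their word.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $n whitespace-delimited tokens on each
# side of the first token that contains $word as a whole word.
sub context_ws {
    my ($text, $word, $n) = @_;
    return unless $text =~ /
        ( (?: \S+ \s+ ){0,$n}        # up to n tokens before
          \S* \b \Q$word\E \b \S*    # the token holding the word, plus attached punctuation
          (?: \s+ \S+ ){0,$n}        # up to n tokens after
        )
    /x;
    return $1;
}

my $text = 'I confirm that sufficient information and detail have been reported in this technical report, that it is scientifically sound, and that appropriate conclusions have been included';

print context_ws($text, 'sound', 5), "\n";
```

the \b's are kept here because \S* can match nothing, so they are the only thing stopping 'sound' from matching inside a longer word.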

uri
 
ccc31807

Here's a pretty mindless way to do it. Split the string into an array,
then iterate through the array looking for 'sound'. If necessary, you
can use the word boundary markers. Then, starting at the index of the
array element you found, take the eleven elements starting at 'index -
5' (five on each side plus the word itself). Like this:

my $text = qq (I confirm that sufficient information and detail have
been reported in this technical report, that it is scientifically
sound, and that appropriate conclusions have been included);
my @text = split(' ', $text);    # split on runs of whitespace
my $index = 0;
foreach my $element (@text)
{
    last if ($element =~ /\bsound\b/);
    $index++;
}
die "'sound' not found\n" if $index > $#text;
# Clamp the window so it never runs off either end of the array.
my $first = $index - 5 < 0      ? 0      : $index - 5;
my $last  = $index + 5 > $#text ? $#text : $index + 5;
my @report = @text[$first .. $last];
print "@report\n";
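
One thing to watch when taking a window like this: a negative offset given to splice counts from the end of the array, so a word found in the first five positions needs the start clamped to 0. Here is a minimal sketch of the same idea wrapped in a sub. The name words_around and its parameters are just for illustration, and it assumes words are whatever sits between whitespace.

```perl
use strict;
use warnings;
use List::Util qw(first max min);

# Hypothetical helper: the $n tokens on either side of the first
# whitespace-separated token that contains $word as a whole word.
sub words_around {
    my ($text, $word, $n) = @_;
    my @tokens = split ' ', $text;
    my $i = first { $tokens[$_] =~ /\b\Q$word\E\b/ } 0 .. $#tokens;
    return unless defined $i;              # word not present
    my $from = max(0, $i - $n);            # clamp at the start
    my $to   = min($#tokens, $i + $n);     # clamp at the end
    return join ' ', @tokens[$from .. $to];
}

my $text = 'I confirm that sufficient information and detail have been reported in this technical report, that it is scientifically sound, and that appropriate conclusions have been included';

print words_around($text, 'sound', 5), "\n";
print words_around($text, 'confirm', 5), "\n";   # window is cut short on the left
```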

CC.
 
sln

The only real strategy is that you have to know what WORDS are.
It's something like parsing a language, and just splitting on spaces
is not enough. So you need to define the language first.

That means pinning down the relationship between punctuation and
whitespace, the usual separators of WORDS in a language.
It's not easy. Free-flowing, wild English with bad spelling, stray
punctuation, etc. will not make it easier. Since you have no basis for
a grammar, an approximation is the best you can do.
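
To make that concrete, here is a small sketch (the sample string and variable names are invented for illustration) showing how three common definitions of a "word" cut up the same text differently:

```perl
use strict;
use warnings;

my $sample = q{it's "scientifically" sound, isn't it?};

my @by_w  = $sample =~ /\w+/g;       # runs of word characters only
my @by_S  = split ' ', $sample;      # whitespace-delimited chunks
my @by_wq = $sample =~ /[\w']+/g;    # word characters plus apostrophes

print join('|', @by_w),  "\n";   # it|s|scientifically|sound|isn|t|it
print join('|', @by_S),  "\n";   # it's|"scientifically"|sound,|isn't|it?
print join('|', @by_wq), "\n";   # it's|scientifically|sound|isn't|it
```

The first gives seven "words", the other two give five, and none of them is obviously right for every input.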

I like the regex below; it uses punctuation and enforces some rules.
But it is impossible to get it right every time.

-sln


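use strict;
use warnings;

my $text = qq (I confirm that sufficient information and detail have been
reported in this technical report, that it' is "scientifically" sound,
and that appropriate conclusion's have been included);

if ( $text =~ /
    (                               #1 whole match
      (                             #2 up to five words before
        (?:
          (?:^|\s)                  # a word starts the text or follows whitespace
          [[:punct:]]*              # optional leading punctuation
          \w                        # at least one word character
          [\w[:punct:]]*            # rest of the word, punctuation included
          [\s[:punct:]]*            # trailing whitespace and punctuation
        ){0,5}
      )
      sound
      (                             #3 up to five words after
        (?:
          [\s[:punct:]]*
          \w
          [\w[:punct:]]*
          (?:$|\s)                  # a word ends the text or precedes whitespace
        ){0,5}
      )
    )
  /x )
{
print <<RES;
\r 1= '$1'\n\n
\r 2= '$2'\n\n
\r 3= '$3'\n
RES
}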
 
sln

The trailing \b is needed, since the \W* after $word is zero or more,
and it is zero or more so that $word can still be matched when it is
the last thing in $text.

That would be an improvement, I'll use that in the future.

Neither \s+\S+ nor \w+\W+ will work separately; they have to be used together.
But since they overlap, it's impossible to use them together. This leaves
\w plus \s plus punctuation as the foundation.
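
A quick way to see the trailing \b point for yourself. This is only a sketch, and the test strings are made up:

```perl
use strict;
use warnings;

my $word = 'sound';
my $text = 'a soundly argued report';

# Without the trailing \b, \W* may match nothing, so "sound" is
# found inside "soundly".
my $loose  = $text =~ /\W$word\W*/   ? 1 : 0;

# With the trailing \b, the match must end at a word boundary.
my $strict = $text =~ /\W$word\b\W*/ ? 1 : 0;

# The word at the very end of the text still matches, which is why
# \W* (rather than \W+) follows it.
my $at_end = 'the report is sound' =~ /\W$word\b\W*/ ? 1 : 0;

print "loose=$loose strict=$strict at_end=$at_end\n";   # loose=1 strict=0 at_end=1
```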

-sln
 
Justin C

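use List::Util qw(first max min);

my @words = split ' ', $text;
# Index of the first word containing "sound" (undef if there is none).
my $i = first { $words[$_] =~ /\bsound\b/ } 0 .. $#words;
die "word not found\n" unless defined $i;
my @wanted;

for my $n ( max(0, $i - 5) .. min($#words, $i + 5) ) {
    push @wanted, $words[$n];    # five words either side, plus the match itself
}
print "@wanted\n";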


Justin.
 
