splitting paragraph into sentences

Sandman

I've searched the docs, but I can't seem to get it right... I want to split a
paragraph into sentences, but this doesn't work:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right?
Yes, it is!";

my @list = split / ?(?=[\.\?\!])/, $string;

foreach (@list){
print "$_\n";
}

__END__
This is a sentence
. This is also a sentence
. This as well, right
? Yes, it is
!


The delimiter is kept, but to the wrong item - how do I keep it attached to the
correct item?

Or is there a special var that keeps the matched delimiter in a split()
operation, so I could do something like this:


#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right?
Yes, it is!";

my @list = split /[\.\?\!] /, $string;

foreach (@list){
print "$_$SPECIALVARIABLE\n";
}


As you may have understood, the wanted output is:

This is a sentence.
This is also a sentence.
This as well, right?
Yes, it is!
 
Gunnar Hjalmarsson

Sandman said:
I've searched the docs, but I can't seem to get it right... I want
to split a paragraph into sentences, but this doesn't work:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as
well, right? Yes, it is!";

my @list = split / ?(?=[\.\?\!])/, $string;

foreach (@list){
print "$_\n";
}

__END__
This is a sentence
. This is also a sentence
. This as well, right
? Yes, it is
!

The delimiter is kept, but to the wrong item - how do I keep it
attached to the correct item?

Try a look-behind instead. This may be what you want:

my @list = split /(?<=[.?!])\s*/, $string;

(Note that '.', '?' etc. are not special in a character class, and do
therefore not need to be escaped.)
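
To see why the look-behind helps, here is a minimal sketch that runs both patterns on a shortened sample string (the string and the variable names are only illustrative, not from the original post). A look-ahead matches just before the punctuation mark, so the mark lands at the start of the next field; a look-behind matches just after it, so the mark stays with its sentence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sample string, shorter than the one in the question.
my $string = "One. Two? Three!";

# Look-ahead: the split point is *before* each punctuation mark.
my @ahead  = split / ?(?=[.?!])/, $string;

# Look-behind: the split point is *after* each punctuation mark,
# and any whitespace that follows it is discarded.
my @behind = split /(?<=[.?!])\s*/, $string;

print join('|', @ahead),  "\n";   # One|. Two|? Three|!
print join('|', @behind), "\n";   # One.|Two?|Three!
```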
 
Sandman

Gunnar Hjalmarsson said:
Sandman said:
I've searched the docs, but I can't seem to get it right... I want
to split a paragraph into sentences, but this doesn't work:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as
well, right? Yes, it is!";

my @list = split / ?(?=[\.\?\!])/, $string;

foreach (@list){
print "$_\n";
}

__END__
This is a sentence
. This is also a sentence
. This as well, right
? Yes, it is
!

The delimiter is kept, but to the wrong item - how do I keep it
attached to the correct item?

Try a look-behind instead. This may be what you want:

my @list = split /(?<=[.?!])\s*/, $string;

(Note that '.', '?' etc. are not special in a character class, and do
therefore not need to be escaped.)

Thanks again, Gunnar. I didn't even know you could do a look-behind. I didn't
find it documented anywhere I looked.
 
Gunnar Hjalmarsson

Sandman said:
Thanks again Gunnar, I didn't even know you could do a look-behind.
I didn't find it documented anywhere I looked.

See "Extended Patterns" in "perldoc perlre".
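
As a quick orientation before reading that section, the following sketch exercises the four look-around forms it describes. The sample text and variable names are made up for illustration; only the `(?=...)`, `(?!...)`, `(?<=...)` and `(?<!...)` constructs themselves come from perlre.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to exercise each assertion.
my $text = "price: 100 USD, 200 EUR";

# (?=...)  positive look-ahead: digits that are followed by " USD"
my @usd = $text =~ /(\d+)(?= USD)/g;

# (?!...)  negative look-ahead: whole numbers NOT followed by " USD"
# (the \b stops \d+ from backtracking to a shorter run of digits)
my @not_usd = $text =~ /(\d+)\b(?! USD)/g;

# (?<=...) positive look-behind: a word that is preceded by "100 "
my @after_100 = $text =~ /(?<=100 )(\w+)/g;

# (?<!...) negative look-behind: numbers NOT preceded by ": "
my @not_first = $text =~ /(?<!: )\b(\d+)/g;

print "usd: @usd\n";
print "not usd: @not_usd\n";
print "after 100: @after_100\n";
print "not first: @not_first\n";
```

None of the assertions consumes any characters, which is why they are useful in split(): the text they test stays in the fields.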
 
Paul Lalli

Sandman said:
I've searched the docs, but I can't seem to get it right... I want to split a
paragraph into sentences, but this doesn't work:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right?
Yes, it is!";

my @list = split / ?(?=[\.\?\!])/, $string;

foreach (@list){
print "$_\n";
}

__END__
This is a sentence
. This is also a sentence
. This as well, right
? Yes, it is
!


The delimiter is kept, but to the wrong item - how do I keep it attached to the
correct item?

You haven't clearly defined the characters you actually want to split on.
The terms you want to capture are separated by one or more whitespace
characters, but only whitespace that follows a punctuation mark. This sounds
like a good job for look-behind assertions:

#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence. This as well, right? Yes, it is!";

my @list = split /(?<=[.!?])\s+/, $string;
print "$_\n" for @list;
__END__
This is a sentence.
This is also a sentence.
This as well, right?
Yes, it is!


Alternatively, you could try to define what you want to capture, rather
than what you want to throw away...

my @list = $string =~ /([^\s.!?][^.!?]*[.!?])/g;

But that's a little messier.


Paul Lalli
 
A

Anno Siegel

bowsayge said:
Sandman said to us:


I hope this helps you, but, as you will see below, determining when a
sentence ends might require you to know all of the abbreviations:

That's only the beginning of it. If an abbreviation is the last word of
a sentence, lexical analysis won't do. One would have to parse enough
of the language to understand where the sentence ends.

use strict;
use warnings;

my $string = "This is a sentence. This is also a sentence.
This as well, right?
Yes, it is! Dr. Montgomery will see you now.";

my @list = split /([\.\?\!])[\s]*/, $string;

The escapes in the character class are not necessary, [.?!] is valid.

for (my $m = 0; $m < $#list; $m += 2) {
$list[$m] .= $list[$m+1];
undef($list[$m+1]);
}

It would be easier to declare another array and collect the sentences
there. That way, you don't have to weed out undef's later. You can
also avoid all index arithmetic. Like this (untested):

my @sentences;
while ( @list ) {
push @sentences, shift( @list) . shift( @list);
}

But see below for a solution that doesn't need another variable.

foreach (@list){
next if !defined($_);
print "$_\n";
}

If you have to weed out undefined elements from a list, there's an idiom:

foreach ( grep defined, @list ) { ...

The "grep" function has many useful applications. See "perldoc -f grep"
for how it works.
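
A minimal sketch of that idiom, assuming a list that already contains undef entries (the sample data is invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative data: a list with holes, as left behind by undef().
my @list = ("One.", undef, "Two?", undef, "Three!");

# grep evaluates "defined" against $_ for each element and keeps
# only the elements for which it returns true.
my @kept = grep defined, @list;

print "$_\n" for @kept;
```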

The basis of your method is sound. You use split with capturing to
get a list that alternates between a sentence and its closing punctuation.
If I wanted to join each sentence with the punctuation in place, I'd use
splice():

my @list = split /([.?!])[\s]*/, $string;
$list[ $_] .= splice @list, $_ + 1, 1 for 0 .. @list/2 - 1;
print "$_\n" for @list;

In the unlikely case that the sequence of the sentences doesn't matter,
a hash can do the pairing:

my %h = split /([.?!])[\s]*/, $string;
print "$_$h{ $_}\n" for keys %h;

That is not a serious suggestion for the given situation, but it
is a technique worth considering if you have a list of pairs.
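
Coming back to the abbreviation problem mentioned at the top of this post, one partial workaround is to refuse to split after a known abbreviation. The sketch below is only an assumption about how that might look: the abbreviation list ("Dr.", "Mr.") and the sample string are made up, and each abbreviation gets its own fixed-length negative look-behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative input containing two abbreviations.
my $string = "Yes, it is! Dr. Montgomery will see you now. Mr. Smith is next.";

# Split on whitespace that follows sentence punctuation, unless that
# punctuation ends one of the listed abbreviations.  The list here is
# hypothetical and would have to be extended for real text.
my @sentences = split /(?<!\bDr\.)(?<!\bMr\.)(?<=[.?!])\s+/, $string;

print "$_\n" for @sentences;
```

This still fails in exactly the case described above: if an abbreviation is the last word of a sentence, the split after it is suppressed and two sentences are glued together.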

Anno
 
