splitting paragraph into sentences

Discussion in 'Perl Misc' started by Sandman, Aug 2, 2004.

  1. Sandman

    Sandman Guest

    I've searched the docs, but I can't seem to get it right... I want to split a
    paragraph into sentences, but this doesn't work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "This is a sentence. This is also a sentence. This as well, right?
    Yes, it is!";

    my @list = split / ?(?=[\.\?\!])/, $string;

    foreach (@list){
    print "$_\n";
    }

    __END__
    This is a sentence
    . This is also a sentence
    . This as well, right
    ? Yes, it is
    !


    The delimiter is kept, but to the wrong item - how do I keep it attached to the
    correct item?

    Or is there a special var that keeps the matched delimiter in a split()
    operation, so I could do something like this:


    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "This is a sentence. This is also a sentence. This as well, right?
    Yes, it is!";

    my @list = split /[\.\?\!] /, $string;

    foreach (@list){
    print "$_$SPECIALVARIABLE\n";
    }


    As you may have understood, the wanted output is:

    This is a sentence.
    This is also a sentence.
    This as well, right?
    Yes, it is!

    --
    Sandman[.net]
    Sandman, Aug 2, 2004
    #1

  2. Sandman wrote:
    > I've searched the docs, but I can't seem to get it right... I want
    > to split a paragraph into sentences, but this doesn't work:
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my $string = "This is a sentence. This is also a sentence. This as
    > well, right? Yes, it is!";
    >
    > my @list = split / ?(?=[\.\?\!])/, $string;
    >
    > foreach (@list){
    > print "$_\n";
    > }
    >
    > __END__
    > This is a sentence
    > . This is also a sentence
    > . This as well, right
    > ? Yes, it is
    > !
    >
    > The delimiter is kept, but to the wrong item - how do I keep it
    > attached to the correct item?


    Try a look-behind instead. This may be what you want:

    my @list = split /(?<=[.?!])\s*/, $string;

    (Note that '.', '?' etc. are not special in a character class, and
    therefore do not need to be escaped.)
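
    A minimal sketch of the difference, using a shortened one-line version of
    the string from the original post. The array names @ahead and @behind are
    made up for illustration. A look-ahead puts the split point in front of the
    punctuation mark, a look-behind puts it after the mark:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a sentence. This as well, right? Yes, it is!";

# Look-ahead: the split point is just BEFORE the punctuation mark,
# so each mark ends up at the start of the following piece.
my @ahead = split / ?(?=[.?!])/, $string;

# Look-behind: the split point is just AFTER the punctuation mark,
# so each mark stays attached to the sentence it ends.
my @behind = split /(?<=[.?!])\s*/, $string;

print "look-ahead:  [$_]\n" for @ahead;
print "look-behind: [$_]\n" for @behind;
```

    The brackets in the output only make leading and trailing whitespace
    visible.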

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Aug 2, 2004
    #2

  3. Sandman

    Sandman Guest

    Gunnar Hjalmarsson wrote:

    > Sandman wrote:
    > > I've searched the docs, but I can't seem to get it right... I want
    > > to split a paragraph into sentences, but this doesn't work:
    > >
    > > #!/usr/bin/perl
    > > use strict;
    > > use warnings;
    > >
    > > my $string = "This is a sentence. This is also a sentence. This as
    > > well, right? Yes, it is!";
    > >
    > > my @list = split / ?(?=[\.\?\!])/, $string;
    > >
    > > foreach (@list){
    > > print "$_\n";
    > > }
    > >
    > > __END__
    > > This is a sentence
    > > . This is also a sentence
    > > . This as well, right
    > > ? Yes, it is
    > > !
    > >
    > > The delimiter is kept, but to the wrong item - how do I keep it
    > > attached to the correct item?

    >
    > Try a look-behind instead. This may be what you want:
    >
    > my @list = split /(?<=[.?!])\s*/, $string;
    >
    > (Note that '.', '?' etc. are not special in a character class, and
    > therefore do not need to be escaped.)


    Thanks again Gunnar, I didn't even know you could do a look-behind. I didn't
    find it documented anywhere I looked.

    --
    Sandman[.net]
    Sandman, Aug 2, 2004
    #3
  4. Sandman wrote:
    > Thanks again Gunnar, I didn't even know you could do a look-behind.
    > I didn't find it documented anywhere I looked.


    See "Extended Patterns" in "perldoc perlre".
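
    For orientation, a small sketch of the four zero-width assertions described
    in that section. The sample string and the array names are invented for
    this illustration:

```perl
use strict;
use warnings;

my $string = "One. Two? Three!";

# (?=...)   look-ahead: words directly followed by a punctuation mark
my @before_mark = $string =~ /(\w+)(?=[.?!])/g;

# (?!...)   negative look-ahead: whole words NOT followed by '?'
my @not_question = $string =~ /(\w+)\b(?!\?)/g;

# (?<=...)  look-behind: words directly preceded by a mark and a space
my @after_mark = $string =~ /(?<=[.?!] )(\w+)/g;

# (?<!...)  negative look-behind: whole words NOT preceded by a mark and a space
my @not_after_mark = $string =~ /(?<![.?!] )\b(\w+)/g;

print "before mark:    @before_mark\n";
print "not a question: @not_question\n";
print "after mark:     @after_mark\n";
print "not after mark: @not_after_mark\n";
```

    None of the assertions consumes any characters, which is why the
    punctuation never ends up inside the captured words.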

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Aug 2, 2004
    #4
  5. Paul Lalli

    Paul Lalli Guest

    On Mon, 2 Aug 2004, Sandman wrote:

    > I've searched the docs, but I can't seem to get it right... I want to split a
    > paragraph into sentences, but this doesn't work:
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my $string = "This is a sentence. This is also a sentence. This as well, right?
    > Yes, it is!";
    >
    > my @list = split / ?(?=[\.\?\!])/, $string;
    >
    > foreach (@list){
    > print "$_\n";
    > }
    >
    > __END__
    > This is a sentence
    > . This is also a sentence
    > . This as well, right
    > ? Yes, it is
    > !
    >
    >
    > The delimiter is kept, but to the wrong item - how do I keep it attached to the
    > correct item?


    You haven't clearly defined the characters you actually want to split on.
    The terms you want to capture are separated by one or more whitespace
    characters, but only by whitespace that follows a punctuation mark. This
    sounds like a good job for a look-behind assertion:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "This is a sentence. This is also a sentence. This as well, right? Yes, it is!";

    my @list = split /(?<=[.!?])\s+/, $string;
    print "$_\n" for @list;
    __END__
    This is a sentence.
    This is also a sentence.
    This as well, right?
    Yes, it is!


    Alternatively, you could try to define what you want to capture, rather
    than what you want to throw away...

    # a run of non-terminator characters (commas included), then the terminator
    my @list = $string =~ /([^.!?]+[.!?])\s*/g;

    But that's a little messier.


    Paul Lalli
    Paul Lalli, Aug 2, 2004
    #5
  6. Anno Siegel

    Anno Siegel Guest

    bowsayge wrote in comp.lang.perl.misc:
    > Sandman said to us:
    >
    > > I've searched the docs, but I can't seem to get it right... I want to
    > > split a paragraph into sentences

    >
    > I hope this helps you, but, as you will see below, determining when a
    > sentence ends might require you to know all of the abbreviations:


    That's only the beginning of it. If an abbreviation is the last word of
    a sentence, lexical analysis won't do. One would have to parse enough
    of the language to understand where the sentence ends.
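
    To make the problem concrete, here is a rough sketch. The list of
    abbreviations ("Dr." and "Mr.") is a made-up, incomplete assumption, and
    each one is excluded with its own fixed-width negative look-behind. As said
    above, this still goes wrong when an abbreviation really is the last word
    of a sentence:

```perl
use strict;
use warnings;

my $string = "Yes, it is! Dr. Montgomery will see you now.";

# Naive split: any '.', '?' or '!' followed by whitespace ends a sentence,
# so "Dr." is wrongly treated as a sentence of its own.
my @naive = split /(?<=[.?!])\s+/, $string;

# Hypothetical refinement: refuse to split right after "Dr." or "Mr.".
my @better = split /(?<!\bDr\.)(?<!\bMr\.)(?<=[.?!])\s+/, $string;

print "naive:  [$_]\n" for @naive;
print "better: [$_]\n" for @better;
```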

    > use strict;
    > use warnings;
    >
    > my $string = "This is a sentence. This is also a sentence.
    > This as well, right?
    > Yes, it is! Dr. Montgomery will see you now.";
    >
    > my @list = split /([\.\?\!])[\s]*/, $string;


    The escapes in the character class are not necessary, [.?!] is valid.

    > for (my $m = 0; $m < $#list; $m += 2) {
    > $list[$m] .= $list[$m+1];
    > undef($list[$m+1]);
    > }


    It would be easier to declare another array and collect the sentences
    there. That way, you don't have to weed out undefs later. You can
    also avoid all index arithmetic. Like this (untested):

    my @sentences;
    while ( @list ) {
        my $text  = shift @list;
        my $punct = shift @list;    # undef if the last piece has no closing mark
        push @sentences, defined $punct ? $text . $punct : $text;
    }

    But see below for a solution that doesn't need another variable.

    > foreach (@list){
    > next if !defined($_);
    > print "$_\n";
    > }


    If you have to weed out undefined elements from a list, there's an idiom:

    foreach ( grep defined, @list ) { ...

    The "grep" function has many useful applications. See "perldoc -f grep"
    for how it works.
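
    A short sketch of the idiom on its own. The sample list is invented and
    only mimics the holes left behind by undef($list[$m+1]):

```perl
use strict;
use warnings;

# A list with undefined elements in every second slot.
my @list = ( "This is a sentence.", undef, "Yes, it is!", undef );

# grep sets $_ to each element in turn and keeps those for which
# "defined" (which tests $_ by default) returns true.
my @kept = grep defined, @list;

print "$_\n" for @kept;
```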

    The basis of your method is sound. You use split with a capturing group to
    get a list that alternates between a sentence and its closing punctuation.
    If I wanted to join each sentence with its punctuation in place, I'd use
    splice():

    my @list = split /([.?!])[\s]*/, $string;
    $list[ $_] .= splice @list, $_ + 1, 1 for 0 .. @list/2 - 1;
    print "$_\n" for @list;

    In the unlikely case that the sequence of the sentences doesn't matter,
    a hash can do the pairing:

    my %h = split /([.?!])[\s]*/, $string;
    print "$_$h{ $_}\n" for keys %h;

    That is not a serious suggestion for the given situation, but it
    is a technique worth considering if you have a list of pairs.
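
    A sketch of the pairing by itself, with the flat list written out by hand
    instead of coming from split. Two caveats apply: keys() returns the keys in
    no particular order (the sort below only makes the output repeatable), and
    two identical sentences would collapse into a single key:

```perl
use strict;
use warnings;

# A flat list of alternating keys and values: sentence text, then its
# closing punctuation.
my @pairs = (
    "This is a sentence",  ".",
    "This as well, right", "?",
    "Yes, it is",          "!",
);

# Assigning the list to a hash pairs element 0 with 1, 2 with 3, and so on.
my %h = @pairs;

print "$_$h{$_}\n" for sort keys %h;
```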

    Anno
    Anno Siegel, Aug 2, 2004
    #6
