Reg Exp and sentences

Discussion in 'Perl Misc' started by kjhjhjhjadsasda@urbanhabit.com, Sep 30, 2005.

  1. Guest

    Hi

    I'm trying to get a solid regular expression that identifies sentences
    from a text chunk and that throws away anything that isn't.

    Example:

    pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
    right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
    Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
    1223 sd dskj() sdkjas | asd| |sdasda sadkjasd

    Would result in:

    This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
    Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
    sdfklj sdflkjsdf lksdfj.

    E.g. something that looks for a length of more than, say, 5 words, that
    starts with an upper-case letter, can include ,()- and space, and ends
    with one of .!?

    Thanks
    M
     
    , Sep 30, 2005
    #1

  2. writes:

    > I'm trying to get a solid regular expression that identifies sentences
    > from a text chunk and that throws away anything that isn't.
    >
    > Example:
    >
    > pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
    > right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
    > Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
    > 1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
    >
    > Would result in:
    >
    > This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
    > Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
    > sdfklj sdflkjsdf lksdfj.
    >
    > E.g. something that looks for a length of more than, say, 5 words, that
    > starts with an upper-case letter, can include ,()- and space, and ends
    > with one of .!?


    What have you tried so far?

    If you need help getting started, try <http://learn.perl.org> for lots of
    useful tutorials, book suggestions, and so forth.

    Oh, and don't forget to read this group's guidelines, if you haven't yet
    done so - lots of tips and useful links there too.

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
     
    Sherm Pendley, Sep 30, 2005
    #2

  3. Dr.Ruud Guest

    wrote:

    > I'm trying to get a solid regular expression that identifies sentences
    > from a text chunk and that throws away anything that isn't.


    The sed mailing list on yahoogroups is a nice place to get free regexes.

    That list is available on gmane too:
    news://news.gmane.org/gmane.editors.sed.user

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 30, 2005
    #3
  4. Scott Bryce Guest

    wrote:

    > E.g. something that looks for a length of more than, say, 5 words, that
    > starts with an upper-case letter, can include ,()- and space, and ends
    > with one of .!?


    Hey... Would this work? I don't know. Let me think. No. I guess not.

    You may wind up tossing out complete sentences that have fewer than 5 words.

    "Besides," he said, "Not all sentences end with a period." (At least I
    don't think so.)
     
    Scott Bryce, Sep 30, 2005
    #4
  5. Guest

    It's actually fine if it excludes some sentences "by mistake"; it's hard
    to make it bulletproof, I guess.
     
    , Sep 30, 2005
    #5
  6. Matt Garrish Guest

    <> wrote in message
    news:...
    > Hi
    >
    > I'm trying to get a solid regular expression that identifies sentences
    > from a text chunk and that throws away anything that isn't.
    >
    > Example:
    >
    > pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
    > right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
    > Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
    > 1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
    >
    > Would result in:
    >
    > This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
    > Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
    > sdfklj sdflkjsdf lksdfj.
    >


    Think of how you do that as a person. You cognitively determine whether each
    word is a word and whether those words when strung together form a sentence
    that makes sense to you as a speaker of that language. Regular expressions,
    as you're hopefully aware, are not cognitive.

    Regular expressions are for matching patterns, and you do not have a pattern
    to match. You might use a regular expression to break up the sentences on
    punctuation, but you're never going to write a regular expression to
    determine what is and what isn't a "proper" sentence.
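
    As a rough illustration of the "break up the sentences on punctuation"
    idea, here is a minimal Perl sketch. The helper name split_candidates and
    the sample text are assumptions made for this example, not something from
    the thread. It only segments the text and makes no attempt to decide
    whether a piece is a proper sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into candidate sentences. A piece ends at
# terminal punctuation (. ! ?) that is followed by whitespace or end of text.
sub split_candidates {
    my ($text) = @_;
    # The lookbehind keeps the punctuation attached to the piece before it.
    my @pieces = split /(?<=[.!?])\s+/, $text;
    return grep { length } @pieces;    # drop any empty pieces
}

my $text = 'This is a proper sentence, right here. Is it? Yes!';
print "$_\n" for split_candidates($text);
```

    Every piece still has to be judged by some other rule (or by a person),
    which is the part a regular expression cannot do on its own.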

    Matt
     
    Matt Garrish, Oct 1, 2005
    #6
  7. Guest

    > Regular expressions are for matching patterns, and you do not have a pattern
    > to match. You might use a regular expression to break up the sentences on
    > punctuation, but you're never going to write a regular expression to
    > determine what is and what isn't a "proper" sentence.
    >
    > Matt


    Thanks all for the inputs.

    Surely, though, there must be a regular expression saying $whatever
    starts with A-Z, has whatever in the middle and ends with .
    (punctuation) ?

    M
     
    , Oct 1, 2005
    #7
  8. Matt Garrish Guest

    <> wrote in message
    news:...
    >> Regular expressions are for matching patterns, and you do not have a
    >> pattern
    >> to match. You might use a regular expression to break up the sentences on
    >> punctuation, but you're never going to write a regular expression to
    >> determine what is and what isn't a "proper" sentence.
    >>

    >
    > Surely, though, there must be a regular expression saying $whatever
    > starts with A-Z, has whatever in the middle and ends with .
    > (punctuation) ?
    >


    I hesitate to even write this, but...

    my $text = <<TEXT;
    I suppose this is a sentence. THisdsa askhwerjjk.vfklanf.,,dsf,, .
    "I quote, this is going to fail you in ways you may not expect!?!<<<"
    But that's not dkalkg ghdsklgklg askl my problem. Dskjdskjfn!
    99 bottles of beer in my stomach... oops where'd my sentence go?
    TEXT

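    # Non-greedy match from a capital letter or digit to the next . ! or ?
    # (/s lets . cross newlines, /g returns every match).
    foreach my $sentence ($text =~ /([A-Z0-9].*?[.!?])/gs) {
        print $sentence, "\n";
    }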

    Hopefully the above will give you some ideas as to what you're up against,
    though.

    Matt
     
    Matt Garrish, Oct 2, 2005
    #8
  9. wrote:
    > > Regular expressions are for matching patterns, and you do not have a pattern
    > > to match. You might use a regular expression to break up the sentences on
    > > punctuation, but you're never going to write a regular expression to
    > > determine what is and what isn't a "proper" sentence.
    > >
    > > Matt

    >
    > Thanks all for the inputs.
    >
    > Surely, though, there must be a regular expression saying $whatever
    > starts with A-Z, has whatever in the middle and ends with .
    > (punctuation) ?
    >
    > M


    A starting point (in Ruby):

    # Will match multiple contiguous sentences.
    re = /(?: ^ | \s )
          (
            (?:
              ["('`] *
              [A-Z]
              [- a-z \s ,;: () '`"]+
              [.?!]
              [")'`] *
              (?: \s+ | $ )
            ) +
          )
         /xm
    s = DATA.read
    s.scan( re ){ |x|
      x = x.first.strip
      if x.split.size > 4
        puts x
      end
    }

    __END__
    pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
    right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
    Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
    1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
    "I suppose this is a sentence," he said. THisdsa
    askhwerjjk.vfklanf.,,dsf,, .
    (A "sentence" at the very end.)


    Output:

    This is a proper sentence,
    right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
    Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
    "I suppose this is a sentence," he said.
    (A "sentence" at the very end.)
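
    For readers who want to stay in Perl, a rough sketch of the rule from the
    original question (starts with A-Z, limited punctuation inside, ends with
    . ! or ?, more than 5 words) might look like this. The helper name
    extract_sentences and its $min_words parameter are hypothetical. Unlike
    the Ruby version above, it counts words per sentence rather than per run
    of contiguous sentences, so short sentences are dropped even when they
    sit next to longer ones.

```perl
use strict;
use warnings;

# Hypothetical helper: return candidates that start with A-Z, contain only
# letters, whitespace and , ; : ( ) ' " -, end with . ! or ?, and have more
# than $min_words words (default 5, an assumption taken from the question).
sub extract_sentences {
    my ($text, $min_words) = @_;
    $min_words = 5 unless defined $min_words;
    my @found;
    # (?<!\S) and (?!\S) require whitespace or a text boundary on each side.
    while ($text =~ /(?<!\S)([A-Z][A-Za-z\s,;:()'"-]*[.!?])(?!\S)/g) {
        my $candidate = $1;
        $candidate =~ s/\s+/ /g;              # join wrapped lines
        my @words = split ' ', $candidate;
        push @found, $candidate if @words > $min_words;
    }
    return @found;
}

my $sample = <<'TEXT';
pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
TEXT

print "$_\n" for extract_sentences($sample);
```

    With the default threshold, the sample from the original question yields
    only its two seven-word sentences; passing a smaller $min_words keeps the
    shorter ones too. Quoted sentences, digits, and abbreviations are not
    handled, so treat it as a starting point only.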
     
    William James, Oct 2, 2005
    #9
