Reg Exp and sentences


kjhjhjhjadsasda

Hi

I'm trying to get a solid regular expression that identifies sentences
in a chunk of text and throws away anything that isn't one.

Example:

pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd

Would result in:

This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
sdfklj sdflkjsdf lksdfj.

E.g. something that looks for a length of more than, say, 5 words, that
starts with an upper case letter, can include ,()- and space, and ends
with one of .!?

Thanks
M
 

Sherm Pendley

I'm trying to get a solid regular expression that identifies sentences
in a chunk of text and throws away anything that isn't one.

Example:

pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd

Would result in:

This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
sdfklj sdflkjsdf lksdfj.

E.g. something that looks for a length of more than, say, 5 words, that
starts with an upper case letter, can include ,()- and space, and ends
with one of .!?

What have you tried so far?

If you need help getting started, try <http://learn.perl.org> for lots of
useful tutorials, book suggestions, and so forth.

Oh, and don't forget to read this group's guidelines, if you haven't yet
done so - lots of tips and useful links there too.

sherm--
 

Dr.Ruud

(e-mail address removed) wrote:
I'm trying to get a solid regular expression that identifies sentences
in a chunk of text and throws away anything that isn't one.

The sed mailing list on yahoogroups is a nice place to get free regexes.

That list is available on gmane too:
news://news.gmane.org/gmane.editors.sed.user
 

Scott Bryce

E.g. something that looks for a length of more than, say, 5 words, that
starts with an upper case letter, can include ,()- and space, and ends
with one of .!?

Hey... Would this work? I don't know. Let me think. No. I guess not.

You may wind up tossing out complete sentences that have fewer than 5 words.

"Besides," he said, "Not all sentences end with a period." (At least I
don't think so.)
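
The point about short sentences can be checked with a small Ruby sketch. The helper name long_enough and its min_words parameter are made up for illustration; the threshold of "more than 5 words" comes from the original request.

```ruby
# Hypothetical helper: keep only candidates with more than min_words words.
def long_enough(sentences, min_words = 5)
  sentences.select { |s| s.split.size > min_words }
end

# The short sentences from the post above, already split by hand.
short = ["Hey...", "Would this work?", "I don't know.", "Let me think.",
         "No.", "I guess not."]

puts long_enough(short).size   # prints 0: every one of them is dropped
```

None of these has more than three words, so a word-count filter discards all of them even though each is a complete sentence.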
 

kjhjhjhjadsasda

It's actually fine if it excludes some sentences "by mistake". It's hard
to make it bulletproof, I guess.
 

Matt Garrish

Hi

I'm trying to get a solid regular expression that identifies sentences
in a chunk of text and throws away anything that isn't one.

Example:

pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd

Would result in:

This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
sdfklj sdflkjsdf lksdfj.

Think of how you do that as a person. You cognitively determine whether each
word is a word and whether those words when strung together form a sentence
that makes sense to you as a speaker of that language. Regular expressions,
as you're hopefully aware, are not cognitive.

Regular expressions are for matching patterns, and you do not have a pattern
to match. You might use a regular expression to break up the sentences on
punctuation, but you're never going to write a regular expression to
determine what is and what isn't a "proper" sentence.
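
A minimal Ruby sketch of "breaking up the sentences on punctuation", as described above. The helper name split_on_punctuation is hypothetical, and the sample is a shortened copy of the original example text. It only cuts the text at runs of . ! ? and makes no judgment about whether a chunk is a proper sentence.

```ruby
# Hypothetical helper: cut text into chunks that end in a run of ., ! or ?.
def split_on_punctuation(text)
  text.scan(/[^.!?]+[.!?]+/).map { |chunk| chunk.strip.gsub(/\s+/, " ") }
end

sample = "pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper " \
         "sentence, right here. Hejrkjlekk werkwe wer werjlkj!"

split_on_punctuation(sample).each { |chunk| puts chunk }
```

The second chunk comes back as "dasdkasjk ** This is a proper sentence, right here." with the junk prefix still attached, which is the limitation being described: the split finds boundaries, not sentences.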

Matt
 

kjhjhjhjadsasda

Regular expressions are for matching patterns, and you do not have a pattern
to match. You might use a regular expression to break up the sentences on
punctuation, but you're never going to write a regular expression to
determine what is and what isn't a "proper" sentence.

Matt

Thanks all for the inputs.

Surely, though, there must be a regular expression saying $whatever
starts with A-Z, has whatever in the middle and ends with .
(punctuation) ?

M
 

Matt Garrish

Surely, though, there must be a regular expression saying $whatever
starts with A-Z, has whatever in the middle and ends with .
(punctuation) ?

I hesitate to even write this, but...

my $text = <<TEXT;
I suppose this is a sentence. THisdsa askhwerjjk.vfklanf.,,dsf,, .
"I quote, this is going to fail you in ways you may not expect!?!<<<"
But that's not dkalkg ghdsklgklg askl my problem. Dskjdskjfn!
99 bottles of beer in my stomach... oops where'd my sentence go?
TEXT

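# Non-greedy match from a capital letter or digit to the nearest . ! or ?
# (/s lets . cross newlines; /g in list context returns every capture).
foreach my $sentence ($text =~ /([A-Z0-9].*?[.!?])/gs) {
    print $sentence, "\n";
}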

Hopefully the above will give you some ideas as to what you're up against,
though.
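
To make two of those failure modes concrete, here is the same loose pattern as a Ruby sketch. The constant LOOSE and the helper loose_sentences are names invented for this illustration; Ruby's /m plays the role of Perl's /s.

```ruby
# Same idea as the Perl pattern above: capital letter or digit, then as
# little as possible, then the first ., ! or ?. Ruby's /m lets . match
# newlines, like Perl's /s.
LOOSE = /[A-Z0-9].*?[.!?]/m

# Hypothetical helper: return every loose match in the text.
def loose_sentences(text)
  text.scan(LOOSE)
end

p loose_sentences("THisdsa askhwerjjk.vfklanf.,,dsf,, .")
# => ["THisdsa askhwerjjk."]
p loose_sentences("99 bottles of beer in my stomach... oops where'd my sentence go?")
# => ["99 bottles of beer in my stomach."]
```

Gibberish that happens to start with a capital and hit a period is accepted, while the question after the ellipsis is lost because nothing in it starts with a capital letter or digit.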

Matt
 

William James

Thanks all for the inputs.

Surely, though, there must be a regular expression saying $whatever
starts with A-Z, has whatever in the middle and ends with .
(punctuation) ?

M

A starting point (in Ruby):

# Will match multiple contiguous sentences.
re = /(?: ^ | \s )
      (
        (?:
          ["('`] *
          [A-Z]
          [- a-z \s ,;: () '`"]+
          [.?!]
          [")'`] *
          (?: \s+ | $ )
        ) +
      )
     /xm
s = DATA.read
s.scan( re ) { |x|
  x = x.first.strip
  if x.split.size > 4
    puts x
  end
}

__END__
pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
"I suppose this is a sentence," he said. THisdsa
askhwerjjk.vfklanf.,,dsf,, .
(A "sentence" at the very end.)


Output:

This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
"I suppose this is a sentence," he said.
(A "sentence" at the very end.)
 
