Regular expression for English words

rahul

Greetings,

I am trying to match English words in a string, with whitespace as the
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence. I've made a few attempts after reading perlre/perlretut
but have not succeeded. Any help would be appreciated. Here's my
script:

#!C:\perl\bin\perl.exe

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
#if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
#if (/^\b([a-zA-Z]+)[\.]?\b$/){
#if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
if (/^\b[a-zA-Z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}
 
ioneabu

I ran it and it did what I expected. Are you trying to catch the words
with capitals mixed in the middle? It looks like a spam blocker to me.
I hear that procmail has plenty written for it already.

wana
 
ioneabu

I think this will do it:

#!/usr/bin/perl
use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}
 
A. Sinan Unur

I am trying to match english words in a string with white space(s) as
delimiter.

How can you decide which language a word is written in using a regular
expression? Maybe you mean something else?
#!C:\perl\bin\perl.exe

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This
iS a vAlid sTatement.';

foreach(split /\s+/,$ps){
#if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
#if (/^\b([a-zA-Z]+)[\.]?\b$/){
#if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
if (/^\b[a-zA-Z]+(\.?)\b$/){
print "$_: yes\n";

Please explain what you actually want to match against. Why is the
uncommented test above preferable to:

#! /usr/bin/perl

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

for ( split /\s+/, $ps ) {
/^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
}

Your code, and my code, will print yes for each of the following
'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
English.

Sinan
 
Gunnar Hjalmarsson

rahul said:
I am trying to match english words in a string with white space(s) as
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence.

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;
 
rahul

I think this will do it:

#!/usr/bin/perl
use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}

Hi,
Thanks for the response. The script prints a "no" for 'statement.',
which I am not able to figure out either. I'm trying to say that it's
OK for an English word to have a period at the end of a sentence.
I ran it and it did what I expected. Are you trying to catch the words
with capitals mixed in the middle? It looks like a spam blocker to me.
I hear that procmail has plenty written for it already.

I am just trying out a problem a friend asked me to solve for practice.
The problem statement says capitals are OK in the middle of words or
anywhere in the sentence.

-rahul
 
rahul

Gunnar said:
One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~
/(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Thanks! It works great, though it is a little complicated for me to
understand. I will try to practice more. Thanks again!

-rahul
 
rahul

A. Sinan Unur said:
How can you decide which language a word is written in using a regular
expression? Maybe you mean something else?

I did mean something else actually. Just did not know how to put it in
words.
Please explain what you actually want to match against. Why is the
uncommented test above preferable to:

#! /usr/bin/perl

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

for ( split /\s+/, $ps ) {
/^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
}

Your code, and my code, will print yes for each of the following
'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
English.

Thanks. It works, and I actually understand your code too! The statement
should have read 'match any letter in the English alphabet which
optionally ends with a period'.

-rahul
 
Glenn Jackman

At 2005-05-12 02:31PM said:
I think this will do it:

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';

foreach(split /\s+/,$ps){
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}

Hi,
Thanks for the response. the script prints a "no" for 'statement.'

It would be preferable to use [[:alpha:]] in place of [a-z] or [a-zA-Z]

my @result = qw(no yes);
my $re = qr/^[[:alpha:]]+\.?$/;
foreach (split ' ', $ps) {
print $_, ': ', $result[ /$re/ ], "\n";
}


or

my @words = $ps =~ /(?:^|(?<=\s))([[:alpha:]]+)(?=[.\s]|$)/g;
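
As a rough illustration of what that substitution changes (a sketch added for clarity, not part of the original post): for plain ASCII input the two classes agree, but [[:alpha:]] also accepts letters outside a-z and A-Z when Perl applies Unicode rules. The helper name classify and the sample tokens are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ascii_re = qr/^[a-zA-Z]+\.?$/;      # unaccented A-Z and a-z only
my $alpha_re = qr/^[[:alpha:]]+\.?$/;   # anything Perl classes as alphabetic

# Hypothetical helper: returns (ascii_result, alpha_result) as 1 or 0.
sub classify {
    my ($token) = @_;
    return ( $token =~ $ascii_re ? 1 : 0, $token =~ $alpha_re ? 1 : 0 );
}

# Label/token pairs. The last token is two Greek letters (alpha, beta),
# written as escapes so the source and the output stay plain ASCII.
my @samples = (
    [ 'vAlid',            'vAlid' ],
    [ 'sTatement.',       'sTatement.' ],
    [ 'he8re.',           'he8re.' ],
    [ 'Greek alpha+beta', "\x{3b1}\x{3b2}" ],
);

for my $sample (@samples) {
    my ($label, $token) = @$sample;
    my ($ascii, $alpha) = classify($token);
    printf "%s: [a-zA-Z] %d, [[:alpha:]] %d\n", $label, $ascii, $alpha;
}
```

The first three tokens should get the same answer from both patterns; only the Greek token should differ. If the requirement really is "letters of the English alphabet", the explicit [a-zA-Z] range may be the closer fit.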
 
ioneabu

Gunnar said:
rahul said:
I am trying to match english words in a string with white space(s) as
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence.

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Wow, you know how to do the hard stuff in the second half of the regex
chapters that I keep putting off. Like they used to say about short
but difficult proofs in the Math dept., take it home for the weekend,
find a nice grassy area on a hillside in the sun, relax and contemplate
it until it makes sense. That and a few pages from 'Programming Perl'
should do it.

wana
 
Tad McClellan

rahul said:
'match any letter in the english alphabet which
optionally ends with a period'.


/[a-z]+\.?/gi;

or maybe:

/\b[a-z]+\b\.?/gi;


(periods can appear in the _middle_ of a sentence too though...)
 
Tad McClellan

rahul said:
(e-mail address removed) wrote:
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
^^
^^
Thanks for the response. the script prints a "no" for 'statement.'


Remove the 2nd word boundary, I'm not sure why it is in there anyway.

If the string ends with \w, it is a no-op.

(the 1st word boundary is _always_ a no-op, so it shouldn't be
there either. The 1st a-z is superfluous too.)

If the string ends with period, it causes the match to fail.


if (/^[A-Z]?[a-z]+(\.?)$/){

which is something im not able to figure out either.


When $_ ends with a period, the \b falls between a non-word character
(the period, \W) and the end of the string, which also counts as \W.
But \b only matches between a word character and a non-word character,
in either order (\w\W or \W\w), so it cannot match there.
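
A small sketch of that behaviour (added for illustration, not part of the original post; the sub names are hypothetical). It checks whether \b can match at the very end of a token, then runs the letters-plus-optional-period test with and without the trailing \b.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: is there a word boundary at the very end of $str?
# \b needs a word character (\w) on exactly one side of the position,
# and the end of the string counts as a non-word position.
sub boundary_at_end {
    my ($str) = @_;
    return $str =~ /\b\z/ ? 1 : 0;
}

# The letters-plus-optional-period test, with and without a trailing \b.
sub match_with_boundary    { return $_[0] =~ /^[a-zA-Z]+\.?\b$/ ? 1 : 0 }
sub match_without_boundary { return $_[0] =~ /^[a-zA-Z]+\.?$/   ? 1 : 0 }

for my $token ('sTatement', 'sTatement.') {
    printf "%-13s boundary at end: %d, with \\b: %d, without \\b: %d\n",
        "'$token'", boundary_at_end($token),
        match_with_boundary($token), match_without_boundary($token);
}
```

For 'sTatement' all three checks should succeed. For 'sTatement.' there is no boundary at the end, so only the version without the trailing \b should match.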

im trying to say
its ok for an english word to have a period at the end of a sentence.


What about question marks?

What about exclamation marks!

Mr. Rahul has this sentence where a period is not at the end!
 
Gunnar Hjalmarsson

rahul said:
Gunnar said:
One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~

/(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Thanks! It works great except a little complicated for me to
understand. Will try and practice more.

Not sure what you mean by practice. When you see a regexp that you don't
fully understand, you can break it down in pieces and look up in
"perldoc perlre" those pieces you want to have explained. In this case,
you may also want to read about the m// operator in "perldoc perlop".
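
One way to do that breakdown, sketched here for illustration (not part of the original post), is to rewrite the same pattern with the /x modifier so that each piece can carry a comment. The variable name $sentence_re and the sub valid_sentences are made up for this example, and the test string is written on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the thread, spread out with /x so each piece can be
# commented. The logic is meant to be unchanged.
my $sentence_re = qr{
    (?: ^ | (?<=\s) )     # start of string, or just after whitespace
    [a-z]+                # first word: letters only (the i flag adds A-Z)
    (?: \s+ [a-z]+ )*     # any further words, separated by whitespace
    \.                    # the sentence has to end with a period
    (?=\s|$)              # then whitespace or the end of the string
}xi;

# Hypothetical helper: with the g flag in list context and no capture
# groups, the match returns every complete match as a list.
sub valid_sentences {
    my ($text) = @_;
    return $text =~ /$sentence_re/g;
}

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';
print join("\n", valid_sentences($ps)), "\n";
```

With this input, the only run of letters-only words that ends in a period followed by whitespace or the end of the string is the final sentence, so that is what should be printed.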
 
