Regular expression for English words

rahul · May 12, 2005

Greetings,

I am trying to match English words in a string, with whitespace as the
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence. I've made a few attempts after reading perlre/perlretut
but have not succeeded. Any help would be appreciated. Here's my
script:

#!C:\perl\bin\perl.exe

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
    #if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
    #if (/^\b([a-zA-Z]+)[\.]?\b$/){
    #if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
    if (/^\b[a-zA-Z]+(\.?)\b$/){
        print "$_: yes\n";
    }
    else {
        print "$_: no\n";
    }
}

ioneabu · May 12, 2005

I ran it and it did what I expected. Are you trying to catch the words
with capitals mixed in the middle? It looks like a spam blocker to me.
I hear that procmail has plenty written for it already.

wana

ioneabu · May 12, 2005

I think this will do it:

#!/usr/bin/perl
use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
    if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
        print "$_: yes\n";
    }
    else {
        print "$_: no\n";
    }
}

A. Sinan Unur · May 12, 2005

I am trying to match english words in a string with white space(s) as
delimiter.

How can you decide which language a word is written in using a regular
expression? Maybe you mean something else?

#!C:\perl\bin\perl.exe

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This
iS a vAlid sTatement.';

foreach(split /\s+/,$ps){
#if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
#if (/^\b([a-zA-Z]+)[\.]?\b$/){
#if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
if (/^\b[a-zA-Z]+(\.?)\b$/){
print "$_: yes\n";

Please explain what you actually want to match against. Why is the
uncommented test above preferable to:

#! /usr/bin/perl

use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

for ( split /\s+/, $ps ) {
    /^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
}

Your code, and my code, will print yes for each of the following
'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
English.

Sinan
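
A minimal sketch of the simplified test above, run over the sample string from the original post. The classify helper is a name introduced here for illustration only, and the sample string is written on one line. It shows that the pattern checks the shape of a token, not whether it is English.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 'yes' if the token is ASCII letters only,
# optionally followed by one trailing period; 'no' otherwise.
sub classify {
    my ($token) = @_;
    return $token =~ /^[a-zA-Z]+\.?$/ ? 'yes' : 'no';
}

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';

for my $token ( split ' ', $ps ) {
    print "$token: ", classify($token), "\n";
}

# The pattern only looks at characters, so a nonsense token passes too.
print "hbsjfsd: ", classify('hbsjfsd'), "\n";
```

Only the last five tokens of the sample string get "yes", and so does the nonsense token.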

Gunnar Hjalmarsson · May 12, 2005

rahul said:
I am trying to match english words in a string with white space(s) as
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence.

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

rahul · May 12, 2005

ioneabu said:
I think this will do it:

#!/usr/bin/perl
use strict;
use warnings;

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

foreach(split /\s+/,$ps){
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}

Hi,
Thanks for the response. The script prints "no" for 'statement.',
which I am not able to figure out either. I'm trying to say
it's OK for an English word to have a period at the end of a sentence.

I ran it and it did what I expected. Are you trying to catch the words
with capitals mixed in the middle? It looks like a spam blocker to me.
I hear that procmail has plenty written for it already.

I am just trying out a problem a friend asked me to solve for practice.
The problem statement said capitals are OK in the middle of words or
anywhere in the sentence.

-rahul

rahul · May 12, 2005

Gunnar said:
One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Thanks! It works great, though it is a little complicated for me to
understand. I will try to practice more. Thanks again!

-rahul

rahul · May 12, 2005

A. Sinan Unur said:
How can you decide which language a word is written in using a regular
expression? Maybe you mean something else?

I did mean something else actually. Just did not know how to put it in
words.

Please explain what you actually want to match against. Why is the
uncommented test above preferable to:

#! /usr/bin/perl

use strict;
use warnings;

use strict;
a vAlid sTatement.';

for ( split /\s+/, $ps ) {
/^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
}

Your code, and my code, will print yes for each of the following
'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
English.

Thanks. It works and I actually understand your code too! The statement
should have read 'match any letter in the English alphabet which
optionally ends with a period'.

-rahul

Glenn Jackman · May 12, 2005

At 2005-05-12 02:31PM, rahul said:
ioneabu said:

I think this will do it:

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';

foreach(split /\s+/,$ps){
if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
print "$_: yes\n";
}
else {
print "$_: no\n";
}
}

Hi,
Thanks for the response. the script prints a "no" for 'statement.'

It would be preferable to use [[:alpha:]] in place of [a-z] or [a-zA-Z]:

my @result = qw(no yes);
my $re = qr/^[[:alpha:]]+\.?$/;
foreach (split ' ', $ps) {
print $_, ': ', $result[ /$re/ ], "\n";
}

or

my @words = $ps =~ /(?:^|(?<=\s))([[:alpha:]]+)(?=[.\s]|$)/g;
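
A small sketch of where the two character classes differ, assuming Perl 5.14 or later for the /u modifier. The helper names ascii_word and alpha_word are made up for this example. For plain ASCII input both behave the same; the difference shows up on accented letters.

```perl
use strict;
use warnings;

# Hypothetical helpers comparing the two character classes.
# /u asks for Unicode rules, so [[:alpha:]] also covers accented letters.
sub ascii_word { return $_[0] =~ /^[a-zA-Z]+\.?$/    ? 1 : 0 }
sub alpha_word { return $_[0] =~ /^[[:alpha:]]+\.?$/u ? 1 : 0 }

my $word = "na\x{EF}ve";    # "naive" spelled with i-diaeresis (U+00EF)

printf "ascii_word: %d\n", ascii_word($word);    # 0: U+00EF is outside a-z
printf "alpha_word: %d\n", alpha_word($word);    # 1: U+00EF is alphabetic
```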

ioneabu · May 12, 2005

Gunnar said:
rahul said:

I am trying to match english words in a string with white space(s) as
delimiter. Additionally, I am trying to match a period at the end of a
word/sentence.

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
a vAlid sTatement.';

One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Wow, you know how to do the hard stuff in the second half of the regex
chapters that I keep putting off. Like they used to say about short
but difficult proofs in the Math dept., take it home for the weekend,
find a nice grassy area on a hillside in the sun, relax and contemplate
it until it makes sense. That and a few pages from 'Programming Perl'
should do it.

wana

Tad McClellan · May 12, 2005

rahul said:
'match any letter in the english alphabet which
optionally ends with a period'.

/[a-z]+\.?/gi;

or maybe:

/\b[a-z]+\b\.?/gi;

(periods can appear in the _middle_ of a sentence too though...)
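
A quick sketch of how the two patterns above differ when applied to a whole string in list context. The helper names and the short test string are assumptions made for this example. Without \b, the first pattern also picks up letter fragments from tokens that contain digits or underscores.

```perl
use strict;
use warnings;

# Hypothetical helpers: return every match of each pattern (list context).
sub loose_matches   { my ($s) = @_; return $s =~ /[a-z]+\.?/gi }
sub bounded_matches { my ($s) = @_; return $s =~ /\b[a-z]+\b\.?/gi }

my $text = 'he8re. a_nd ok.';

my @loose   = loose_matches($text);      # he, re., a, nd, ok.
my @bounded = bounded_matches($text);    # ok.

print "loose:   @loose\n";
print "bounded: @bounded\n";
```

Digits and underscores count as \w, so with \b the letter runs inside 'he8re' and 'a_nd' never sit on a word boundary at both ends and are skipped.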

Tad McClellan · May 12, 2005

rahul said:
ioneabu said:

if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
                            ^^
                            ^^
Thanks for the response. the script prints a "no" for 'statement.'

Remove the 2nd word boundary, I'm not sure why it is in there anyway.

If the string ends with \w, it is a no-op.

(the 1st word boundary is _always_ a no-op, so it shouldn't be
there either. The 1st a-z is superfluous too.)

If the string ends with period, it causes the match to fail.

if (/^[A-Z]?[a-z]+(\.?)$/){

which is something im not able to figure out either.

When $_ ends with a period, the \b falls between two not-word
positions: the period (\W) and the end of the string (which counts as
\W). But \b requires either word/not-word (\w\W) or not-word/word
(\W\w) in order to match.
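
The effect can be checked with a short sketch. The two helper names are invented for this example; the patterns are the original poster's test with and without the trailing \b.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern with and without the trailing \b.
sub with_trailing_b    { return $_[0] =~ /^[a-zA-Z]+\.?\b$/ ? 'yes' : 'no' }
sub without_trailing_b { return $_[0] =~ /^[a-zA-Z]+\.?$/   ? 'yes' : 'no' }

for my $token ( 'sTatement', 'sTatement.' ) {
    printf "%-11s with \\b: %-3s  without \\b: %s\n",
        $token, with_trailing_b($token), without_trailing_b($token);
}
```

Only the token ending in a period is affected: with the trailing \b it is rejected, without it both tokens are accepted.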

im trying to say
its ok for an english word to have a period at the end of a sentence.

What about question marks?

What about exclamation marks!

Mr. Rahul has this sentence where a period is not at the end!

Gunnar Hjalmarsson · May 12, 2005

rahul said:
Gunnar said:

One idea, which attempts to grab one or more valid sentences and
disregard the rest:

print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

Thanks! It works great except a little complicated for me to
understand. Will try and practice more.

Not sure what you mean by practice. When you see a regexp that you don't
fully understand, you can break it down in pieces and look up in
"perldoc perlre" those pieces you want to have explained. In this case,
you may also want to read about the m// operator in "perldoc perlop".
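
One way to do that breakdown is to rewrite the same pattern with the /x modifier, which lets whitespace and comments sit inside the regexp. This is only a sketch: the names $sentence_re and sentences are made up here, and the sample string is written on a single line.

```perl
use strict;
use warnings;

# The regexp from above, spread out with x so each piece can be commented.
my $sentence_re = qr/
    (?: ^ | (?<=\s) )     # at the start of the string, or right after whitespace
    [a-z]+                # first word: letters only
    (?: \s+ [a-z]+ )*     # any further words, each preceded by whitespace
    \.                    # the period that ends the sentence
    (?= \s | $ )          # then whitespace or end of string must follow
/xi;

# Hypothetical helper: in list context, m with the g flag returns all matches.
sub sentences { my ($s) = @_; return $s =~ /$sentence_re/g }

my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';
print join( "\n", sentences($ps) ), "\n";
```

For the sample string this prints the single sentence 'This iS a vAlid sTatement.'; every earlier token fails at one of the commented steps.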
