Regular expression for English words

Discussion in 'Perl Misc' started by rahul, May 12, 2005.

  1. rahul

    rahul Guest

    Greetings,

    I am trying to match English words in a string with whitespace as the
    delimiter. Additionally, I am trying to match a period at the end of a
    word/sentence. I've made a few attempts after reading perlre/perlretut
    but have not succeeded. Any help would be appreciated. Here's my script:

    #!C:\perl\bin\perl.exe

    use strict;
    use warnings;

    my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    a vAlid sTatement.';

    foreach (split /\s+/, $ps) {
        #if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
        #if (/^\b([a-zA-Z]+)[\.]?\b$/){
        #if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
        if (/^\b[a-zA-Z]+(\.?)\b$/) {
            print "$_: yes\n";
        }
        else {
            print "$_: no\n";
        }
    }
     
    rahul, May 12, 2005
    #1

  2. rahul

    Guest

    rahul wrote:
    > Greetings,
    >
    > I am trying to match english words in a string with white space(s) as
    > delimiter. Additionally, I am trying to match a period at the end of a
    > word/sentence. I've made a few attempts after reading perlre/perlretut
    > but have not succeeded. Any help would be appreciated. Here's my script
    > -
    >
    > #!C:\perl\bin\perl.exe
    >
    > use strict;
    > use warnings;
    >
    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > a vAlid sTatement.';
    >
    > foreach(split /\s+/,$ps){
    > #if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
    > #if (/^\b([a-zA-Z]+)[\.]?\b$/){
    > #if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
    > if (/^\b[a-zA-Z]+(\.?)\b$/){
    > print "$_: yes\n";
    > }
    > else {
    > print "$_: no\n";
    > }
    > }


    I ran it and it did what I expected. Are you trying to catch the words
    with capitals mixed in the middle? It looks like a spam blocker to me.
    I hear that procmail has plenty written for it already.

    wana
     
    , May 12, 2005
    #2

  3. rahul

    Guest

    rahul wrote:
    > Greetings,
    >
    > I am trying to match english words in a string with white space(s) as
    > delimiter. Additionally, I am trying to match a period at the end of a
    > word/sentence. I've made a few attempts after reading perlre/perlretut
    > but have not succeeded. Any help would be appreciated. Here's my script
    > -
    >
    > #!C:\perl\bin\perl.exe
    >
    > use strict;
    > use warnings;
    >
    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > a vAlid sTatement.';
    >
    > foreach(split /\s+/,$ps){
    > #if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
    > #if (/^\b([a-zA-Z]+)[\.]?\b$/){
    > #if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
    > if (/^\b[a-zA-Z]+(\.?)\b$/){
    > print "$_: yes\n";
    > }
    > else {
    > print "$_: no\n";
    > }
    > }


    I think this will do it:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    a vAlid sTatement.';

    foreach (split /\s+/, $ps) {
        if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/) {
            print "$_: yes\n";
        }
        else {
            print "$_: no\n";
        }
    }
     
    , May 12, 2005
    #3
  4. "rahul" <> wrote in news:1115920231.373399.290290@o13g2000cwo.googlegroups.com:

    > I am trying to match english words in a string with white space(s) as
    > delimiter.


    How can you decide which language a word is written in using a regular
    expression? Maybe you mean something else?

    > #!C:\perl\bin\perl.exe
    >
    > use strict;
    > use warnings;
    >
    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This
    > iS a vAlid sTatement.';
    >
    > foreach(split /\s+/,$ps){
    > #if (/^\b[^\d]*([a-zA-Z])[^\d]*(\.?)\b$/){
    > #if (/^\b([a-zA-Z]+)[\.]?\b$/){
    > #if (/^\b[a-zA-Z]+[\.]{0,1}\b$/){
    > if (/^\b[a-zA-Z]+(\.?)\b$/){
    > print "$_: yes\n";


    Please explain what you actually want to match against. Why is the
    uncommented test above preferable to:

    #! /usr/bin/perl

    use strict;
    use warnings;

    my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    a vAlid sTatement.';

    for ( split /\s+/, $ps ) {
        /^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
    }

    Your code, and my code, will print yes for each of the following
    'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
    English.

    Sinan








    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, May 12, 2005
    #4
  5. rahul wrote:
    > I am trying to match english words in a string with white space(s) as
    > delimiter. Additionally, I am trying to match a period at the end of a
    > word/sentence.


    <snip>

    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > a vAlid sTatement.';


    One idea, which attempts to grab one or more valid sentences and
    disregard the rest:

    print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;
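
    For anyone who finds the one-liner above dense, here is a minimal
    sketch of the same pattern written with the /x modifier so that each
    piece can carry a comment. The variable $sentence_re and the sub
    find_sentences are names made up for this illustration; they are not
    part of the original post.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Same test string as in the thread, written on one line here.
    my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';

    # The pattern from the post, spread out with /x (hypothetical name).
    my $sentence_re = qr/
        (?: ^ | (?<=\s) )    # start of string, or just after whitespace
        [a-z]+               # first word: letters only
        (?: \s+ [a-z]+ )*    # any number of further words, split by whitespace
        \.                   # the sentence must end with a period
        (?= \s | $ )         # and the period is followed by whitespace or the end
    /xi;

    # Hypothetical helper: return every sentence-like match in $text.
    sub find_sentences {
        my ($text) = @_;
        return $text =~ /$sentence_re/g;
    }

    print "$_\n" for find_sentences($ps);
    ```

    On the test string this should print only the last sentence, because
    every earlier token either contains a non-letter or is not followed by
    a period plus whitespace.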

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, May 12, 2005
    #5
  6. rahul

    rahul Guest

    wrote:
    > I think this will do it:
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > a vAlid sTatement.';
    >
    > foreach(split /\s+/,$ps){
    > if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
    > print "$_: yes\n";
    > }
    > else {
    > print "$_: no\n";
    > }
    > }


    Hi,
    Thanks for the response. The script prints a "no" for 'statement.',
    which is something I'm not able to figure out either. I'm trying to say
    it's OK for an English word to have a period at the end of a sentence.

    > I ran it and it did what I expected. Are you trying to catch the words
    > with capitals mixed in the middle? It looks like a spam blocker to me.
    > I hear that procmail has plenty written for it already.


    I am just trying out a problem a friend asked me to solve for practice.
    The problem statement said capitals are OK in the middle of words or
    anywhere in the sentence.

    -rahul
     
    rahul, May 12, 2005
    #6
  7. rahul

    rahul Guest

    Gunnar Hjalmarsson wrote:

    > One idea, which attempts to grab one or more valid sentences and
    > disregard the rest:
    >
    > print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

    Thanks! It works great, though it is a little complicated for me to
    understand. I will try to practice more. Thanks again!

    -rahul
     
    rahul, May 12, 2005
    #7
  8. rahul

    rahul Guest

    A. Sinan Unur wrote:

    > How can you decide which language a word is written in using a regular
    > expression? Maybe you mean something else?


    I did mean something else actually. Just did not know how to put it in
    words.

    > Please explain what you actually want to match against. Why is the
    > uncommented test above preferable to:
    >
    > #! /usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use strict;
    > use warnings;
    >
    > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > a vAlid sTatement.';
    >
    > for ( split /\s+/, $ps ) {
    > /^[a-zA-Z]+\.?$/ ? print "$_: yes\n" : print "$_: no\n";
    > }
    >
    > Your code, and my code, will print yes for each of the following
    > 'words': 'hbsjfsd skdfjh sdkfjhn'. Those 'words' are clearly not
    > English.


    Thanks. It works and I actually understand your code too! The statement
    should have read 'match any letter in the english alphabet which
    optionally ends with a period'.

    -rahul
     
    rahul, May 12, 2005
    #8
  9. At 2005-05-12 02:31PM, rahul <> wrote:
    > wrote:
    > > I think this will do it:
    > >
    > > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS a vAlid sTatement.';
    > >
    > > foreach(split /\s+/,$ps){
    > > if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
    > > print "$_: yes\n";
    > > }
    > > else {
    > > print "$_: no\n";
    > > }
    > > }

    >
    > Hi,
    > Thanks for the response. the script prints a "no" for 'statement.'


    It would be preferable to use [[:alpha:]] in place of [a-z] or [a-zA-Z]:

    my @result = qw(no yes);
    my $re = qr/^[[:alpha:]]+\.?$/;
    foreach (split ' ', $ps) {
        print $_, ': ', $result[ /$re/ ], "\n";
    }


    or

    my @words = $ps =~ /(?:^|(?<=\s))([[:alpha:]]+)(?=[.\s]|$)/g;
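
    A small sketch of why [[:alpha:]] can be the better choice: it also
    accepts letters outside the ASCII range. The sub names matches_ascii
    and matches_alpha are invented for this example, and the /u modifier
    assumes Perl 5.14 or later; it is used here so that Unicode rules apply
    regardless of how the string is stored internally.

    ```perl
    use strict;
    use warnings;

    # "caf" followed by U+00E9 (e with acute accent), written as an escape
    # so the example does not depend on the encoding of the source file.
    my $word = "caf\x{e9}";

    # Hypothetical helpers: one per character class discussed above.
    sub matches_ascii { return $_[0] =~ /^[a-zA-Z]+\.?$/     ? 'yes' : 'no' }
    sub matches_alpha { return $_[0] =~ /^[[:alpha:]]+\.?$/u ? 'yes' : 'no' }

    print 'caf\x{e9} with [a-zA-Z]:    ', matches_ascii($word), "\n";
    print 'caf\x{e9} with [[:alpha:]]: ', matches_alpha($word), "\n";
    ```

    Digits and punctuation inside a token are still rejected by both
    versions; only the set of characters that count as letters changes.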


    --
    Glenn Jackman
    NCF Sysadmin
     
    Glenn Jackman, May 12, 2005
    #9
  10. rahul

    Guest

    Gunnar Hjalmarsson wrote:
    > rahul wrote:
    > > I am trying to match english words in a string with white space(s) as
    > > delimiter. Additionally, I am trying to match a period at the end of a
    > > word/sentence.
    >
    > <snip>
    >
    > > my $ps = '1no . woRd5 he8re. a_nd n;one ,here eith!er hj.. .. This iS
    > > a vAlid sTatement.';
    >
    > One idea, which attempts to grab one or more valid sentences and
    > disregard the rest:
    >
    > print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;
    >
    > --
    > Gunnar Hjalmarsson
    > Email: http://www.gunnar.cc/cgi-bin/contact.pl


    Wow, you know how to do the hard stuff in the second half of the regex
    chapters that I keep putting off. Like they used to say about short
    but difficult proofs in the Math dept., take it home for the weekend,
    find a nice grassy area on a hillside in the sun, relax and contemplate
    it until it makes sense. That and a few pages from 'Programming Perl'
    should do it.

    wana
     
    , May 12, 2005
    #10
  11. rahul <> wrote:

    > 'match any letter in the english alphabet which
    > optionally ends with a period'.



    /[a-z]+\.?/gi;

    or maybe:

    /\b[a-z]+\b\.?/gi;


    (periods can appear in the _middle_ of a sentence too though...)
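
    To see how the two patterns above differ, here is a short sketch that
    runs both against a small sample. The sample string and the sub names
    loose_matches and bounded_matches are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helpers wrapping the two patterns from the post.
    sub loose_matches   { return $_[0] =~ /[a-z]+\.?/gi }
    sub bounded_matches { return $_[0] =~ /\b[a-z]+\b\.?/gi }

    my $text = '1no word.';

    # Without \b the letters inside "1no" are picked up as a match.
    print 'without \b: ', join(', ', loose_matches($text)), "\n";    # no, word.

    # With \b the run of letters must start and end at a word boundary,
    # so "no" (glued to the digit 1) is skipped.
    print 'with \b:    ', join(', ', bounded_matches($text)), "\n";  # word.
    ```

    Note that neither pattern is anchored, so both are meant for pulling
    matches out of a longer string rather than for validating one token.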


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, May 12, 2005
    #11
  12. rahul <> wrote:
    > wrote:



    >> if (/^\b[a-zA-Z]?[a-z]+(\.?)\b$/){
                                   ^^
    > Thanks for the response. the script prints a "no" for 'statement.'



    Remove the 2nd word boundary, I'm not sure why it is in there anyway.

    If the string ends with \w, it is a no-op.

    (the 1st word boundary is _always_ a no-op, so it shouldn't be
    there either. The 1st a-z is superfluous too.)

    If the string ends with period, it causes the match to fail.


    if (/^[A-Z]?[a-z]+(\.?)$/){


    > which is something im not able to figure out either.



    When $_ ends with a period, the \b falls between a not-word (\W) and
    a not-word character (end of string counts as \W), but \b requires
    either word/not-word (\w\W) or not-word/word (\W\w) in order to match.
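
    The effect described above can be checked with a minimal sketch. The
    sub names with_boundary and without_boundary are hypothetical, and the
    pattern is the simplified [a-zA-Z]+ form rather than the exact one
    quoted.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helpers: the same pattern with and without the final \b.
    sub with_boundary    { return $_[0] =~ /^[a-zA-Z]+\.?\b$/ ? 'yes' : 'no' }
    sub without_boundary { return $_[0] =~ /^[a-zA-Z]+\.?$/   ? 'yes' : 'no' }

    for my $token ('statement', 'statement.') {
        # After a trailing period there is no word character on either
        # side of the \b position, so the version with \b cannot match.
        printf "%-11s with \\b: %-3s  without \\b: %s\n",
            $token, with_boundary($token), without_boundary($token);
    }
    ```

    The token without a period passes both tests; the token ending in a
    period passes only once the trailing \b is removed.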


    > im trying to say
    > its ok for an english word to have a period at the end of a sentence.



    What about question marks?

    What about exclamation marks!

    Mr. Rahul has this sentence where a period is not at the end!


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, May 12, 2005
    #12
  13. rahul wrote:
    > Gunnar Hjalmarsson wrote:
    >> One idea, which attempts to grab one or more valid sentences and
    >> disregard the rest:
    >>
    >> print join "\n", $ps =~ /(?:^|(?<=\s))[a-z]+(?:\s+[a-z]+)*\.(?=\s|$)/gi;

    >
    > Thanks! It works great except a little complicated for me to
    > understand. Will try and practice more.


    Not sure what you mean by practice. When you see a regexp that you don't
    fully understand, you can break it down in pieces and look up in
    "perldoc perlre" those pieces you want to have explained. In this case,
    you may also want to read about the m// operator in "perldoc perlop".

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, May 12, 2005
    #13