Character class [\W_] clarification

Discussion in 'Perl Misc' started by Fiaz Idris, Dec 10, 2003.

  1. Fiaz Idris

    Fiaz Idris Guest

    Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W

    I know that [\W] matches [^a-zA-Z_0-9]

    From Mastering Algorithms with Perl (Page.110), I see a character class
    [\W_] that does the following

    s/[\W_]+//g

    i.e. to replace (all non-word characters and underscores) with (nothing).

    At first I couldn't understand the above, because I interpreted the
    regex by *** replacing "\W" with "^a-zA-Z_0-9" ***, which gives

    s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

    That is, to replace (non-word characters including underscore) with (nothing),
    and I thought that the last underscore was in fact unnecessary.

    My question is: where in the documentation (anywhere) does it say that
    [\W] will in fact work with the interpretation below:

    [~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
    given in (regex XXX) above?

    If this seems to be a dumb question, I apologise. But I would still like
    an explanation.
     
    Fiaz Idris, Dec 10, 2003
    #1

  2. On 9 Dec 2003 19:29:36 -0800, (Fiaz Idris) wrote:

    >Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
    >
    >I know that [\W] matches [^a-zA-Z_0-9]
    >
    >From Mastering Algorithms with Perl (Page.110), I see a character class
    >[\W_] that does the following
    >
    >s/[\W_]+//g
    >
    >i.e. to replace (all non-word character and underscore) with (nothing).
    >
    >First, I couldn't understand the above that is because I interpreted
    >above regex as *** replace "\W" with "^a-zA-Z_0-9" ***
    >
    >s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
    >
    >That is to replace (non-word characters including underscore) with (nothing)
    >and thought that the last underscore is in fact unnecessary.


    I think the underscore is considered a legal character for Perl words.

    Try this:

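    #!/usr/bin/perl
    use strict;
    use warnings;

    my $txt = '$$% 3b__c4 101 _ z42';
    my $i = $txt;
    my $j = $txt;
    my $k = $txt;
    $i =~ s/[\W_]+//g;      # strips non-word characters and underscores
    $j =~ s/([\W]|_)+//g;   # same effect, written as an alternation
    $k =~ s/[\W]+//g;       # strips non-word characters only; underscores stay
    print "txt $txt, i $i, j $j, k $k;\n";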

    >
    >My question is where in the documentation (anywhere) that says
    >the [\W] will in fact work with the interpretation as below:


    perlre, thinks I



    ---
    Use the domain skylightview (dot) com for the reply address instead.
     
    William Herrera, Dec 10, 2003
    #2

  3. Fiaz Idris

    Anno Siegel Guest

    Fiaz Idris <> wrote in comp.lang.perl.misc:
    > Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
    >
    > I know that [\W] matches [^a-zA-Z_0-9]
    >
    > From Mastering Algorithms with Perl (Page.110), I see a character class
    > [\W_] that does the following
    >
    > s/[\W_]+//g
    >
    > i.e. to replace (all non-word character and underscore) with (nothing).


    Yes, that's what it does.

    > First, I couldn't understand the above that is because I interpreted
    > above regex as *** replace "\W" with "^a-zA-Z_0-9" ***


    I don't understand what your interpretation was. Did you think it
    changes the two characters "\W" to something else? Or do you mean
    you thought it changes the behavior of "\W" for the rest of the program?

    > s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
    >
    > That is to replace (non-word characters including underscore) with (nothing)
    > and thought that the last underscore is in fact unnecessary.


    Well, it is. Any character need only appear once in a character class,
    whether negated or not.

    > My question is where in the documentation (anywhere) that says
    > the [\W] will in fact work with the interpretation as below:
    >
    > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
    > given in (regex XXX) above.


    I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
    and /\W/ match exactly the same things, as well as the redundant
    [^a-zA-Z_0-9_].
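
    A minimal sketch for checking this equivalence, assuming only ASCII
    input (code points 0..127) and no locale in effect. The helper name
    ascii_matches is made up for illustration:

```perl
use strict;
use warnings;

# Return a string of all ASCII characters (0..127) that the pattern matches.
sub ascii_matches {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr($_) } 0 .. 127;
}

my $shorthand = ascii_matches(qr/\W/);
my $spelled   = ascii_matches(qr/[^a-zA-Z_0-9]/);
my $redundant = ascii_matches(qr/[^a-zA-Z_0-9_]/);

# All three patterns should select the same set of characters.
print "same set\n"
    if $shorthand eq $spelled && $spelled eq $redundant;

# The underscore is a word character, so none of them selects it.
print "underscore excluded\n" if index($shorthand, '_') < 0;
```

    On strings containing non-ASCII characters \w can match more than
    a-zA-Z0-9_, so the spelled-out class is only an ASCII approximation.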

    Anno
     
    Anno Siegel, Dec 10, 2003
    #3
  4. Fiaz Idris <> wrote:
    > s/[\W_]+//g
    > i.e. to replace (all non-word character and underscore) with (nothing).
    >
    > First, I couldn't understand the above that is because I interpreted
    > above regex as *** replace "\W" with "^a-zA-Z_0-9" ***
    >
    > s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

    [...]


    An example to back up Fiaz's confusion:

    $s = '=-_abc_-=';
    ($c=$s) =~ s/[\W]/./g; print "$c\n";    # prints .._abc_..
    ($c=$s) =~ s/[\W_]/./g; print "$c\n";   # prints ...abc...

    Clearly [\W] is not equivalent to [\W_], so \W is not merely replaced
    with ^a-zA-Z_0-9 by Perl's regex engine.


    --
    Glenn Jackman
    NCF Sysadmin
     
    Glenn Jackman, Dec 10, 2003
    #4
  5. Fiaz Idris

    Fiaz Idris Guest

    -berlin.de (Anno Siegel) wrote in message news:<

    > > I know that [\W] matches [^a-zA-Z_0-9]


    > > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
    > > given in (regex XXX) above.

    >
    > I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
    > and /\W/ match exactly the same things, as well as the redundant
    > [^a-zA-Z_0-9_].


    Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
    for example code that shows the difference.

    [\W] does not replace the underscore, but
    [\W_] also replaces the underscore.

    Programming Perl says

    Symbol ||| Meaning ||| As Bytes
    \W ||| Non-(word character) ||| [^a-zA-Z0-9_]

    According to the above representation for [\W] I assumed

    Point 1:
    [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
    and thought that the last underscore is actually unnecessary.

    Point 2:
    But, [\W_] is actually equivalent to [~!@#$%^&*()....._]
    that is (all the characters other than [A-Za-z0-9_], plus the [_]).

    Point 2 is what actually happens when using [\W_] but the documentation
    leads you to believe [\W_] is equivalent to Point 1 and we all know that
    that is not the case by running the sample code I mentioned before.

    So, where in the docs (anywhere) is this pointed out?

    I hope I have made myself clear.
     
    Fiaz Idris, Dec 11, 2003
    #5
  6. Fiaz Idris

    Sam Holden Guest

    On 10 Dec 2003 17:37:59 -0800, Fiaz Idris <> wrote:
    > -berlin.de (Anno Siegel) wrote in message news:<
    >
    >> > I know that [\W] matches [^a-zA-Z_0-9]

    >
    >> > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
    >> > given in (regex XXX) above.

    >>
    >> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
    >> and /\W/ match exactly the same things, as well as the redundant
    >> [^a-zA-Z_0-9_].

    >
    > Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
    > for an example code that shows the difference.
    >
    > [\W] does not replace the underscore, but
    > [\W_] also replaces the underscore.
    >
    > Programming Perl says
    >
    > Symbol ||| Meaning ||| As Bytes
    > \W ||| Non-(word character) ||| [^a-zA-Z0-9_]
    >
    > According to the above representation for [\W] I assumed
    >
    > Point 1:
    > [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
    > and thought that the last underscore is actually unnecessary.


    That's a pretty silly assumption. \W matches the same things as
    matched by [^a-zA-Z_0-9] (ignoring locales for the moment).

    [AB] matches A or B, so [\W_] matches \W or _. "_" isn't matched
    by \W but is matched by _, hence it is matched by [\W_].
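
    The "A or B" reading can be checked one character at a time. A small
    sketch, with sample characters and the %report hash chosen purely for
    illustration:

```perl
use strict;
use warnings;

# For each sample character, record which of the three patterns match it:
# W = matched by \W, U = matched by a literal underscore, WU = matched by [\W_].
my %report;
for my $ch ('_', '-', 'a', '7') {
    $report{$ch} = join ' ',
        ($ch =~ /\W/    ? 'W'  : '-'),
        ($ch =~ /_/     ? 'U'  : '-'),
        ($ch =~ /[\W_]/ ? 'WU' : '-');
    print "'$ch': $report{$ch}\n";
}
```

    The third column is set exactly when the first or the second is, which
    is what a character class acting as a union of its members predicts.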

    If I squinted I might be able to see how you could think [\W_] might
    be the same as [[^a-zA-Z_0-9]_] (by treating the explanation of
    what it matches as a literal expansion). But why anyone would think
    extra characters would be magically placed inside the []s is beyond
    me...


    > Point 2:
    > But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
    > that is (all the characters other than [A-Za-z0-9_] and include the [_]).
    >
    > Point 2 is what actually happens when using [\W_] but the documentation
    > leads you to believe [\W_] is equivalent to Point 1 and we all know that
    > that is not the case by running the sample code I mentioned before.
    >
    > So, where in the docs (anywhere) that points this out.


    perldoc perlre:

    \W Match a non-"word" character

    and

    You may use "\w", "\W", "\s", "\S", "\d", and "\D" within character
    classes

    I can't see how you could possibly come to your "Point 1" interpretation.


    --
    Sam Holden
     
    Sam Holden, Dec 11, 2003
    #6
  7. Fiaz Idris

    Uri Guttman Guest

    >>>>> "FI" == Fiaz Idris <> writes:

    > -berlin.de (Anno Siegel) wrote in message news:<
    >> > I know that [\W] matches [^a-zA-Z_0-9]


    >> > [~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
    >> > given in (regex XXX) above.

    >>
    >> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
    >> and /\W/ match exactly the same things, as well as the redundant
    >> [^a-zA-Z_0-9_].


    > Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
    > for an example code that shows the difference.


    > [\W] does not replace the underscore, but
    > [\W_] also replaces the underscore.


    > Programming Perl says


    > Symbol ||| Meaning ||| As Bytes
    > \W ||| Non-(word character) ||| [^a-zA-Z0-9_]


    > According to the above representation for [\W] I assumed


    > Point 1:
    > [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
    > and thought that the last underscore is actually unnecessary.


    you have to INVERT the class for \w to get \W. so \W does NOT contain
    _. your assumption that it has 2 _ is wrong. \W has NO _ so you must add
    one if you want to match it.

    the key is to remember that \w is a char class and \W is all the other
    chars. inside brackets \W just adds those chars to the set; it does not
    negate the rest of the class, so [\W_] is not the same as [^\w_], which
    is sort of what you think it is.
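
    To see that \W inside brackets adds to the set rather than negating the
    whole class, here is a small sketch comparing [\W_] with [^\w_]. The
    sample string is made up for illustration:

```perl
use strict;
use warnings;

my $s = 'a_b-c d';

# [\W_]: non-word characters plus the underscore are removed.
(my $union = $s) =~ s/[\W_]//g;

# [^\w_]: only characters that are neither word characters nor the
# underscore are removed, so the underscore survives.
(my $negated = $s) =~ s/[^\w_]//g;

print "union:   $union\n";    # prints union:   abcd
print "negated: $negated\n";  # prints negated: a_bcd
```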

    > Point 2:
    > But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
    > that is (all the characters other than [A-Za-z0-9_] and include the [_]).



    > Point 2 is what actually happens when using [\W_] but the documentation
    > leads you to believe [\W_] is equivalent to Point 1 and we all know that
    > that is not the case by running the sample code I mentioned before.


    the docs are accurate. you misinterpreted them as point 1.

    > So, where in the docs (anywhere) that points this out.


    what you quoted from the docs points this out.

    > I hope I have made myself clear.


    yes you did. and you were wrong and the docs are correct.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Dec 11, 2003
    #7
  8. On 10 Dec 2003 17:37:59 -0800, (Fiaz Idris) wrote:

    Point 1:
    [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
    and thought that the last underscore is actually unnecessary.

    The problem is that, in a negated char class like [^a], any character you add
    to the class within those brackets, like [^ab], is added as an excluded char.
    But with the \W syntax, the 'negation' of \w is in the set of INCLUDED chars in
    the class, and is NOT continued to other chars in a bracketed character class
    containing \W.

    So, [\W] is the same as [^a-zA-Z0-9_], but
    [\W_] is the same as [^a-zA-Z0-9_]|_
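
    A hedged check of that equivalence over ASCII only (code points 0..127,
    no locale). The helper name ascii_set is hypothetical. It also compares
    against the single negated class [^a-zA-Z0-9], which should select the
    same characters under the same ASCII assumption:

```perl
use strict;
use warnings;

# Return a string of all ASCII characters (0..127) that the pattern matches.
sub ascii_set {
    my ($re) = @_;
    return join '', grep { $_ =~ $re } map { chr($_) } 0 .. 127;
}

my $class       = ascii_set(qr/[\W_]/);
my $alternation = ascii_set(qr/(?:[^a-zA-Z0-9_]|_)/);
my $single      = ascii_set(qr/[^a-zA-Z0-9]/);

print "class matches alternation\n"  if $class eq $alternation;
print "class matches single class\n" if $class eq $single;
print "underscore included\n"        if index($class, '_') >= 0;
```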

    HTH,

    --------
    perl -MCrypt::Rot13 -e "$m=new Crypt::Rot13;$m->charge('WhfgNabgureCreyUnpxre');print $m->rot13;"
     
    William Herrera, Dec 11, 2003
    #8
