Character class [\W_] clarification

Fiaz Idris

Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W

I know that [\W] matches [^a-zA-Z_0-9].

From Mastering Algorithms with Perl (page 110), I see a character class
[\W_] that does the following:

s/[\W_]+//g

i.e. it replaces (all non-word characters and underscores) with (nothing).

At first I couldn't understand this, because I interpreted the regex
as *** replace "\W" with "^a-zA-Z_0-9" ***:

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

That is, replace (non-word characters including underscore) with (nothing),
and I thought that the last underscore was in fact unnecessary.

My question is: where in the documentation (anywhere) does it say that
[\W] will in fact work with the interpretation below:

[~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
given in (regex XXX) above?

If this seems to be a dumb question, I apologise. But I would still like
an explanation.
 
William Herrera

> Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
>
> I know that [\W] matches [^a-zA-Z_0-9].
>
> From Mastering Algorithms with Perl (page 110), I see a character class
> [\W_] that does the following:
>
> s/[\W_]+//g
>
> i.e. it replaces (all non-word characters and underscores) with (nothing).
>
> At first I couldn't understand this, because I interpreted the regex
> as *** replace "\W" with "^a-zA-Z_0-9" ***:
>
> s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
>
> That is, replace (non-word characters including underscore) with (nothing),
> and I thought that the last underscore was in fact unnecessary.

I think the underscore is considered a legal character in Perl words.

Try this:

#!/usr/bin/perl
use strict;
use warnings;

my $txt = '$$% 3b__c4 101 _ z42';
my $i = $txt;
my $j = $txt;
my $k = $txt;
$i =~ s/[\W_]+//g;       # strips non-word characters and underscores
$j =~ s/([\W]|_)+//g;    # same effect, written as an alternation
$k =~ s/[\W]+//g;        # strips non-word characters only; underscores stay
print "txt $txt, i $i, j $j, k $k;\n";
# i and j are both 3bc4101z42; k is 3b__c4101_z42

> My question is: where in the documentation (anywhere) does it say that
> [\W] will in fact work with the interpretation below:

perlre, thinks I.
 
Anno Siegel

Fiaz Idris said:
> Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W
>
> I know that [\W] matches [^a-zA-Z_0-9].
>
> From Mastering Algorithms with Perl (page 110), I see a character class
> [\W_] that does the following:
>
> s/[\W_]+//g
>
> i.e. it replaces (all non-word characters and underscores) with (nothing).

Yes, that's what it does.

> At first I couldn't understand this, because I interpreted the regex
> as *** replace "\W" with "^a-zA-Z_0-9" ***:

I don't understand what your interpretation was. Did you think it
changes the two characters "\W" to something else? Or do you mean
you thought it changes the behavior of "\W" for the rest of the program?

> s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
>
> That is, replace (non-word characters including underscore) with (nothing),
> and I thought that the last underscore was in fact unnecessary.

Well, it is. Any character need only appear once in a character class,
whether negated or not.

> My question is: where in the documentation (anywhere) does it say that
> [\W] will in fact work with the interpretation below:
>
> [~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
> given in (regex XXX) above?

I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
and /\W/ match exactly the same things, as does the redundant
[^a-zA-Z_0-9_].

Anno
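
The equivalence claimed above can be checked mechanically. The following is a minimal sketch, not part of the original reply: it assumes 7-bit ASCII input and no "use locale", and the variable names are purely illustrative. It tests every ASCII character against the three patterns and counts the characters on which they disagree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three spellings discussed above; the last repeats the underscore.
my @patterns = (qr/\W/, qr/[^a-zA-Z_0-9]/, qr/[^a-zA-Z_0-9_]/);

# Count the ASCII characters on which the patterns disagree.
my $mismatches = 0;
for my $code (0 .. 127) {
    my $ch   = chr $code;
    my @hits = map { $ch =~ $_ ? 1 : 0 } @patterns;
    $mismatches++ if $hits[0] != $hits[1] || $hits[1] != $hits[2];
}

print "mismatches: $mismatches\n";
```

Under those assumptions the count should come out as zero, which is the sense in which the three patterns "match exactly the same things". Locale or Unicode settings could change what \w covers, so the check says nothing beyond ASCII.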
 
Glenn Jackman

Fiaz Idris said:
> s/[\W_]+//g
> i.e. it replaces (all non-word characters and underscores) with (nothing).
>
> At first I couldn't understand this, because I interpreted the regex
> as *** replace "\W" with "^a-zA-Z_0-9" ***:
>
> s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)
> [...]

An example to back up Fiaz's confusion:

my $s = '=-_abc_-=';
my $c;
($c = $s) =~ s/[\W]/./g;  print "$c\n";   # prints .._abc_..
($c = $s) =~ s/[\W_]/./g; print "$c\n";   # prints ...abc...

Clearly [\W] is not equivalent to [\W_], so \W is not merely replaced
with ^a-zA-Z_0-9 by Perl's regex engine.
 
Fiaz Idris

>> I know that [\W] matches [^a-zA-Z_0-9].
>> [~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
>> given in (regex XXX) above?
>
> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
> and /\W/ match exactly the same things, as does the redundant
> [^a-zA-Z_0-9_].

Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
for example code that shows the difference.

[\W] does not replace the underscore, but
[\W_] also replaces the underscore.

Programming Perl says:

Symbol   Meaning                 As Bytes
\W       Non-(word character)    [^a-zA-Z0-9_]

Based on that representation of [\W], I assumed:

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore was actually unnecessary.

Point 2:
But [\W_] is actually equivalent to [^~!@#$%^&*()....._]
that is (all the characters other than [A-Za-z0-9_], plus the [_]).

Point 2 is what actually happens when using [\W_], but the documentation
leads you to believe [\W_] is equivalent to Point 1, and we all know that
that is not the case from running the sample code I mentioned before.

So, where in the docs (anywhere) is this pointed out?

I hope I have made myself clear.
 
Sam Holden

>>> I know that [\W] matches [^a-zA-Z_0-9].
>>> [~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
>>> given in (regex XXX) above?
>>
>> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
>> and /\W/ match exactly the same things, as does the redundant
>> [^a-zA-Z_0-9_].
>
> Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
> for example code that shows the difference.
>
> [\W] does not replace the underscore, but
> [\W_] also replaces the underscore.
>
> Programming Perl says:
>
> Symbol   Meaning                 As Bytes
> \W       Non-(word character)    [^a-zA-Z0-9_]
>
> Based on that representation of [\W], I assumed:
>
> Point 1:
> [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
> and thought that the last underscore was actually unnecessary.

That's a pretty silly assumption. \W matches the same things as
[^a-zA-Z_0-9] does (ignoring locales for the moment).

[AB] matches A or B, so [\W_] matches \W or _. "_" isn't matched
by \W but is matched by _, hence it is matched by [\W_].

If I squinted I might be able to see how you could think [\W_] might
be the same as [[^a-zA-Z_0-9]_] (by treating the explanation of
what it matches as a literal expansion). But why anyone would think
extra characters would be magically placed inside the []s is beyond
me...

> Point 2:
> But [\W_] is actually equivalent to [^~!@#$%^&*()....._]
> that is (all the characters other than [A-Za-z0-9_], plus the [_]).
>
> Point 2 is what actually happens when using [\W_], but the documentation
> leads you to believe [\W_] is equivalent to Point 1, and we all know that
> that is not the case from running the sample code I mentioned before.
>
> So, where in the docs (anywhere) is this pointed out?

perldoc perlre:

\W Match a non-"word" character

and

You may use "\w", "\W", "\s", "\S", "\d", and "\D" within character
classes

I can't see how you could possibly come to your "Point 1" interpretation.
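
The "[AB] matches A or B" reading can be illustrated by collecting the characters each class matches and comparing the two sets. This is only a sketch, assuming printable ASCII input; matched_chars is a hypothetical helper written for this example and is not from perlre.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the printable ASCII characters that a pattern matches.
sub matched_chars {
    my ($re) = @_;
    return grep { $_ =~ $re } map { chr($_) } 32 .. 126;
}

my %in_W       = map { $_ => 1 } matched_chars(qr/[\W]/);
my @in_W_under = matched_chars(qr/[\W_]/);

# Characters that [\W_] matches but [\W] does not.
my @extra = grep { !$in_W{$_} } @in_W_under;

print "extra characters: @extra\n";
```

The only extra character should be the underscore: the bracketed class is the union of what \W matches and the literal "_", with nothing negated a second time.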
 
Uri Guttman

Fiaz Idris said:
>>> I know that [\W] matches [^a-zA-Z_0-9].
>>> [~`!@#$%^&*()-[]:;<,./"? ........and so on] and not the interpretation
>>> given in (regex XXX) above?
>>
>> I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
>> and /\W/ match exactly the same things, as does the redundant
>> [^a-zA-Z_0-9_].
>
> Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
> for example code that shows the difference.
> [\W] does not replace the underscore, but
> [\W_] also replaces the underscore.
> Programming Perl says:
> Symbol   Meaning                 As Bytes
> \W       Non-(word character)    [^a-zA-Z0-9_]
> Based on that representation of [\W], I assumed:
> Point 1:
> [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
> and thought that the last underscore was actually unnecessary.

you have to INVERT the class for \w to get \W. so \W does NOT contain
_. your assumption that it has 2 _ is wrong. \W has NO _ so you must add
one if you want to match it.

the key is to remember that \w is a char class and \W is all the other
chars. inside [] it is not a textual shorthand for ^a-zA-Z0-9_, which is
sort of what you think it is.

> Point 2:
> But [\W_] is actually equivalent to [^~!@#$%^&*()....._]
> that is (all the characters other than [A-Za-z0-9_], plus the [_]).
>
> Point 2 is what actually happens when using [\W_], but the documentation
> leads you to believe [\W_] is equivalent to Point 1, and we all know that
> that is not the case from running the sample code I mentioned before.

the docs are accurate. you misinterpreted them as point 1.

> So, where in the docs (anywhere) is this pointed out?

what you quoted from the docs points this out.

> I hope I have made myself clear.

yes you did. and you were wrong and the docs are correct.

uri
 
William Herrera

On 10 Dec 2003 17:37:59 -0800, Fiaz Idris wrote:

> Point 1:
> [\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
> and thought that the last underscore was actually unnecessary.

The problem is that, in a negated char class like [^a], any character you add
within the brackets, as in [^ab], is added as an excluded char.
But with the \W syntax, the 'negation' of \w is itself the set of INCLUDED
chars in the class, and the negation does NOT carry over to other chars in a
bracketed character class containing \W.

So, [\W] is the same as [^a-zA-Z0-9_], but
[\W_] is the same as [^a-zA-Z0-9_]|_

HTH,
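
The difference in how the negation is scoped can be seen by adding one character to each kind of class and watching whether the match set shrinks or grows. A small sketch, assuming ASCII-only input; the sample string and the helper strip are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete everything in $text that $re matches.
sub strip {
    my ($text, $re) = @_;
    (my $copy = $text) =~ s/$re//g;
    return $copy;
}

my $sample = 'abc_-';

# In a negated class, an added character joins the EXCLUDED set,
# so the class matches less and more of the string survives.
my $neg_a  = strip($sample, qr/[^a]/);    # keeps only "a"
my $neg_ab = strip($sample, qr/[^ab]/);   # keeps "a" and "b"

# With \W, an added character joins the INCLUDED set,
# so the class matches more and less of the string survives.
my $W       = strip($sample, qr/[\W]/);   # removes "-"
my $W_under = strip($sample, qr/[\W_]/);  # removes "-" and "_"

print "$neg_a $neg_ab $W $W_under\n";
```

Going from [^a] to [^ab] leaves more of the string intact, while going from [\W] to [\W_] leaves less, which is the asymmetry described above.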
 
