Character class [\W_] clarification

Fiaz Idris · Dec 10, 2003

Keywords: Character Class Regex Regular Expression Regular Expressions \W_ \W

I know that [\W] matches [^a-zA-Z_0-9]

From Mastering Algorithms with Perl (Page.110), I see a character class
[\W_] that does the following

s/[\W_]+//g

i.e. it replaces every run of non-word characters and underscores with nothing.

At first I couldn't understand this, because I read the
regex as *** replace "\W" with "^a-zA-Z_0-9" ***

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

That is, to replace (non-word characters including underscore) with (nothing),
and I thought that the last underscore was in fact unnecessary.

My question is: where in the documentation (anywhere) does it say that
[\W] in fact works with the interpretation below:

[~`!@#$%^&*()-[]:;<,./"? ........and so on] and not with the interpretation
given in (regex XXX) above?

If this seems to be a dumb question, I apologise. Still, I would like
an explanation.

William Herrera · Dec 10, 2003

Fiaz Idris said:

I know that [\W] matches [^a-zA-Z_0-9]

From Mastering Algorithms with Perl (Page.110), I see a character class
[\W_] that does the following

s/[\W_]+//g

i.e. to replace (all non-word character and underscore) with (nothing).

First, I couldn't understand the above that is because I interpreted
above regex as *** replace "\W" with "^a-zA-Z_0-9" ***

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

That is to replace (non-word characters including underscore) with (nothing)
and thought that the last underscore is infact unnecessary.

I think the underscore is considered a legal character for Perl words, so \w matches it and \W does not.

Try this:

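#!/usr/bin/perl
use strict;
use warnings;

my $txt = '$$% 3b__c4 101 _ z42';   # single quotes: no interpolation
my $i = $txt;
my $j = $txt;
my $k = $txt;
$i =~ s/[\W_]+//g;       # non-word chars and underscores removed
$j =~ s/([\W]|_)+//g;    # same set, written as an alternation
$k =~ s/[\W]+//g;        # non-word chars only; underscores are kept
print "txt $txt, i $i, j $j, k $k;\n";
# i and j are 3bc4101z42; k is 3b__c4101_z42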

My question is where in the documentation (anywhere) that says
the [\W] will infact work with the interpretation as below:

perlre, thinks I

Anno Siegel · Dec 10, 2003

Fiaz Idris said:
I know that [\W] matches [^a-zA-Z_0-9]

From Mastering Algorithms with Perl (Page.110), I see a character class
[\W_] that does the following

s/[\W_]+//g

i.e. to replace (all non-word character and underscore) with (nothing).

Yes, that's what it does.

First, I couldn't understand the above that is because I interpreted
above regex as *** replace "\W" with "^a-zA-Z_0-9" ***

I don't understand what your interpretation was. Did you think it
changes the two characters "\W" to something else? Or do you mean
you thought it changes the behavior of "\W" for the rest of the program?

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

That is to replace (non-word characters including underscore) with (nothing)
and thought that the last underscore is infact unnecessary.

Well, it is. Any character need only appear once in a character class,
whether negated or not.

My question is where in the documentation (anywhere) that says
the [\W] will infact work with the interpretation as below:

[~`!@#$%^&*()-[]:;<,./"? ........and so on] but not the interpretation
give in (regex XXX) above.

I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
and /\W/ match exactly the same things, as does the redundant
[^a-zA-Z_0-9_].
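
A quick way to check that claim is to run all three spellings over the ASCII range and compare what they accept. This is only a sketch: it assumes plain ASCII input and ignores locale and Unicode rules for \w, and the variable names are illustrative rather than anything from the thread.

```perl
use strict;
use warnings;

# Compare three spellings over the 128 ASCII characters only; locale and
# Unicode rules for \w are deliberately left out of this sketch.
my @ascii = map { chr($_) } 0 .. 127;

my $plain     = join '', grep { /[^a-zA-Z_0-9]/  } @ascii;
my $redundant = join '', grep { /[^a-zA-Z_0-9_]/ } @ascii;
my $escape    = join '', grep { /\W/             } @ascii;

# All three strings list the same 65 characters, and none contains "_".
print "same set\n" if $plain eq $redundant && $plain eq $escape;
```

Listing the underscore a second time inside the negated class changes nothing, because a character is either in the class or not.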

Anno

Glenn Jackman · Dec 10, 2003

Fiaz Idris said:
s/[\W_]+//g
i.e. to replace (all non-word character and underscore) with (nothing).

First, I couldn't understand the above that is because I interpreted
above regex as *** replace "\W" with "^a-zA-Z_0-9" ***

s/[^a-zA-Z_0-9_]+//g -------------->(regex XXX)

[...]

An example to back up Fiaz's confusion:

my $s = '=-_abc_-=';
my $c;
($c = $s) =~ s/[\W]/./g;  print "$c\n";   # .._abc_..  (underscores kept)
($c = $s) =~ s/[\W_]/./g; print "$c\n";   # ...abc...  (underscores replaced)

Clearly [\W] is not equivalent to [\W_], so \W is not merely replaced
with ^a-zA-Z_0-9 by Perl's regex engine.

Fiaz Idris · Dec 11, 2003

Anno Siegel said:

I'm still not sure what discrepancy you are seeing. /[^a-zA-Z_0-9]/
and /\W/ match exactly the same things, as well as the redundant
[^a-zA-Z_0-9_].

Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
for example code that shows the difference.

[\W] does not replace the underscore, but
[\W_] also replaces the underscore.

Programming Perl says

Symbol   Meaning                As Bytes
\W       Non-(word character)   [^a-zA-Z0-9_]

According to the above representation of \W, I assumed

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore was actually unnecessary.

Point 2:
But [\W_] is actually equivalent to [~!@#$%^&*()....._]
that is, all the characters other than [A-Za-z0-9_], plus the underscore itself.

Point 2 is what actually happens when using [\W_], but the documentation
leads you to believe [\W_] is equivalent to Point 1, and running the sample
code I mentioned before shows that this is not the case.

So where in the docs (anywhere) is this pointed out?

I hope I have made myself clear.

Sam Holden · Dec 11, 2003

Fiaz Idris said:

Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
for an example code that shows the difference.

[\W] does not replace the underscore, but
[\W_] also replaces the underscore.

Programming Perl says

Symbol   Meaning                As Bytes
\W       Non-(word character)   [^a-zA-Z0-9_]

According to the above representation for [\W] I assumed

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore is actually unnecessary.

That's a pretty silly assumption. \W matches the same things as
matched by [^a-zA-Z_0-9] (ignoring locales for the moment).

[AB] matches A or B. So [\W_] matches \W or _. "_" isn't matched
by \W but is matched by the literal _, hence it matches [\W_].
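
The "A or B" reading can be checked directly. The sketch below assumes ASCII-only input (locales ignored, as above) and uses made-up variable names. It collects the characters accepted by the class and the characters accepted by "\W or a literal underscore", then compares the two lists.

```perl
use strict;
use warnings;

# ASCII-only sketch: which characters does each test accept?
my @ascii = map { chr($_) } 0 .. 127;

my $class = join '', grep { /[\W_]/ } @ascii;             # the bracketed class
my $union = join '', grep { /\W/ || $_ eq '_' } @ascii;   # \W, or a literal _

print "identical\n" if $class eq $union;

# The underscore is the one character that only the second member supplies.
my $underscore_is_nonword = ('_' =~ /\W/)     ? 1 : 0;    # 0
my $underscore_in_class   = ('_' =~ /[\W_]/)  ? 1 : 0;    # 1
```

Each member of a bracketed class contributes its own set of characters, and the class matches the union of those sets.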

If I squinted I might be able to see how you could think [\W_] might
be the same as [[^a-zA-Z_0-9]_] (by treating the explanation of
what it matches as a literal expansion). But why anyone would think
extra characters would be magically placed inside the []s is beyond
me...

Point 2:
But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
that is (all the characters other than [A-Za-z0-9_] and include the [_]).

Point 2 is what actually happens when using [\W_] but the documentation
leads you to believe [\W_] is equivalent to Point 1 and we all know that
that is not the case by running the sample code I mentioned before.

So, where in the docs (anywhere) that points this out.

perldoc perlre:

\W Match a non-"word" character

and

You may use "\w", "\W", "\s", "\S", "\d", and "\D" within character
classes
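
To illustrate that second quote, here are a few classes that mix backslash escapes with each other or with a leading ^. The sample string and variable names are invented for this sketch, and it assumes plain ASCII text.

```perl
use strict;
use warnings;

my $text = 'id_42: foo-bar, 3.14';

# Each escape adds its own set of characters to the enclosing class.
(my $no_digits_or_space = $text) =~ s/[\d\s]+//g;    # id_:foo-bar,.
(my $letters_and_us     = $text) =~ s/[\W\d]+//g;    # id_foobar

# A leading ^ negates the whole class: "neither word char nor whitespace".
(my $no_punct           = $text) =~ s/[^\w\s]+//g;   # id_42 foobar 314

print "$no_digits_or_space\n$letters_and_us\n$no_punct\n";
```

Only the leading ^ applies to the class as a whole. An uppercase escape such as \W carries its negation with it and leaves the other members alone.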

I can't see how you could possibly come to your "Point 1" interpretation.

Uri Guttman · Dec 11, 2003

Fiaz Idris said:

Maybe I didn't explain my confusion clearly. See Glenn Jackman's posting
for an example code that shows the difference.

[\W] does not replace the underscore, but
[\W_] also replaces the underscore.

Programming Perl says

Symbol   Meaning                As Bytes
\W       Non-(word character)   [^a-zA-Z0-9_]

According to the above representation for [\W] I assumed

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore is actually unnecessary.

you have to INVERT the class for \w to get \W. so \W does NOT contain
_. your assumption that it has 2 _ is wrong. \W has NO _ so you must add
one if you want to match it.

the key is to remember that \w is a char class and \W is all the other
chars. inside [] it is not a textual ^a-zA-Z0-9_ that negates everything
else in the brackets, which is sort of what you think it is.
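
A small sketch of where the negation applies, using an invented sample string and variable names, with ASCII input assumed. On its own, \W accepts the same characters as the negated class [^\w]. The difference only shows once an underscore is added inside the brackets.

```perl
use strict;
use warnings;

my $s = 'a_b-c';

(my $upper_w  = $s) =~ s/\W/./g;       # a_b.c  only "-" is a non-word char
(my $neg_w    = $s) =~ s/[^\w]/./g;    # a_b.c  same characters as \W
(my $union    = $s) =~ s/[\W_]/./g;    # a.b.c  \W or a literal underscore
(my $neg_both = $s) =~ s/[^\w_]/./g;   # a_b.c  ^ excludes \w AND the extra _

print "$upper_w $neg_w $union $neg_both\n";
```

In [^\w_] the added underscore falls under the leading ^ and is excluded along with everything else. In [\W_] nothing negates the class as a whole, so the underscore is simply one more character that matches.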

Point 2:
But, [\W_] is actually equivalent to [^~!@#$%^&*()....._]
that is (all the characters other than [A-Za-z0-9_] and include the [_]).

Point 2 is what actually happens when using [\W_] but the documentation
leads you to believe [\W_] is equivalent to Point 1 and we all know that
that is not the case by running the sample code I mentioned before.

the docs are accurate. you misinterpreted them as point 1.

So, where in the docs (anywhere) that points this out.

what you quoted from the docs points this out.

I hope I have made myself clear.

yes you did. and you were wrong and the docs are correct.

uri

William Herrera · Dec 11, 2003

On 10 Dec 2003 17:37:59 -0800, Fiaz Idris wrote:

Point 1:
[\W_] is equivalent to [^a-zA-Z0-9__] ----> (two underscores)
and thought that the last underscore is actually unnecessary.

The problem is that, in a negated char class like [^a], any character you add
to the class within those brackets, like [^ab], is added as an excluded char.
But with the \W syntax, the 'negation' of \w is in the set of INCLUDED chars in
the class, and does NOT carry over to the other chars in a bracketed character
class containing \W.

So, [\W] is the same as [^a-zA-Z0-9_], but
[\W_] is the same as [^a-zA-Z0-9_]|_
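
As a sketch of that equivalence, the three spellings below should strip the same characters from the sample string used earlier in the thread. The third pattern, [^a-zA-Z0-9], is an extra that follows from the other two; it is not something a poster wrote. ASCII input is assumed, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $txt = '$$% 3b__c4 101 _ z42';

my @patterns = (
    qr/[\W_]+/,                   # the class from the book
    qr/(?:[^a-zA-Z0-9_]|_)+/,     # the alternation spelled out
    qr/[^a-zA-Z0-9]+/,            # the underscore dropped from the negated list
);

# Apply each pattern to its own copy of the text.
my @results = map { (my $copy = $txt) =~ s/$_//g; $copy } @patterns;

print "$_\n" for @results;        # each line: 3bc4101z42
```

Taking the underscore out of the negated list has the same effect as adding it next to \W, which is why the book's class removes everything except ASCII letters and digits.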

HTH,
