re Questions

Blake Adams · Jan 26, 2014

I'm pretty new to Python and understand most of the basics of Python re, but I'm stumped by some unexpected matching behavior.

If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'. However, when I run the following:

re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_']. I would expect the match to be ['z', 'C', '0', '_'].

Why does this happen?

Thanks in advance

Blake

Larry Martell · Jan 26, 2014

Because the characters [ \ ] ^ _ and ` all sit between Z and a in the
ASCII character set, so the range A-z includes them.

You need to do this:

re.findall('[A-Za-z0-9_]','^;z %C\@0~_')
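
One way to see this for yourself is to list the code points that fall between Z and a and then compare the two character classes. This is only a minimal sketch using the standard library; the names `between` and `sample` are made up for illustration, and the sample is written as a raw string so the backslash is unambiguous.

```python
import re

# Every character whose code point lies strictly between 'Z' (90) and 'a' (97).
between = [chr(cp) for cp in range(ord('Z') + 1, ord('a'))]
print(between)

# Same test string as in the question, as a raw string (literal backslash).
sample = r'^;z %C\@0~_'

# The single A-z range also sweeps up the punctuation listed above.
print(re.findall(r'[A-z0-9_]', sample))

# Two explicit ranges match only letters, digits and the underscore.
print(re.findall(r'[A-Za-z0-9_]', sample))
```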

Chris Angelico · Jan 26, 2014

Because \w is not the same as [A-z0-9_]. Quoting from the docs:

"""
\w For Unicode (str) patterns: Matches Unicode word characters; this
includes most characters that can be part of a word in any language,
as well as numbers and the underscore. If the ASCII flag is used, only
[a-zA-Z0-9_] is matched (but the flag affects the entire regular
expression, so in such cases using an explicit [a-zA-Z0-9_] may be a
better choice). For 8-bit (bytes) patterns: Matches characters
considered alphanumeric in the ASCII character set; this is equivalent
to [a-zA-Z0-9_].
"""

If you're working with a byte string, then you're close, but A-z is
quite different from A-Za-z. The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's
a literal backslash in there, btw), so it'll also catch several
non-alphabetic characters. With a Unicode string, the gap is wider
still, since \w also matches non-ASCII letters and digits. Either way,
\w means "word characters", so just go ahead and use it whenever you
want word characters.
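
The three cases described in the docs can be compared side by side. The following is a rough sketch, not a full tour of the flag; the sample string `text` is invented for illustration.

```python
import re

text = 'naïve_café 42'

# str pattern: \w is Unicode-aware, so accented letters count as word characters.
unicode_words = re.findall(r'\w+', text)
print(unicode_words)

# With re.ASCII, \w shrinks to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)

# bytes pattern: \w only matches ASCII word characters.
byte_words = re.findall(rb'\w+', text.encode('utf-8'))
print(byte_words)
```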

ChrisA

Roy Smith · Jan 26, 2014

Chris Angelico said:
The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]

I'm inclined to suggest the regex compiler should issue a warning for
this.

I've never seen a character range other than A-Z, a-z, or 0-9. Well, I
suppose A-F or a-f if you're trying to match hex digits (and some
variations on that for octal). But, I can't imagine any example where
somebody wrote A-z and it wasn't an error.

Blake Adams · Jan 26, 2014

Got it, that makes sense. Thanks for the quick reply, Larry.

Blake Adams · Jan 26, 2014

If I want to set up a match replicating the '\w' pattern I would assume that would be done with '[A-z0-9_]'. However, when I run the following:
re.findall('[A-z0-9_]','^;z %C\@0~_') it matches ['^', 'z', 'C', '\\', '0', '_']. I would expect the match to be ['z', 'C', '0', '_'].
Why does this happen?

Because \w is not the same as [A-z0-9_]. Quoting from the docs:

"""

\w For Unicode (str) patterns:Matches Unicode word characters; this

includes most characters that can be part of a word in any language,

as well as numbers and the underscore. If the ASCII flag is used, only

[a-zA-Z0-9_] is matched (but the flag affects the entire regular

expression, so in such cases using an explicit [a-zA-Z0-9_] may be a

better choice).For 8-bit (bytes) patterns:Matches characters

considered alphanumeric in the ASCII character set; this is equivalent

to [a-zA-Z0-9_].

"""

If you're working with a byte string, then you're close, but A-z is

quite different from A-Za-z. The set [A-z] is equivalent to

[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz] (that's

a literal backslash in there, btw), so it'll also catch several

non-alphabetic characters. With a Unicode string, it's quite

distinctly different. Either way, \w means "word characters", though,

so just go ahead and use it whenever you want word characters

ChrisA

Thanks Chris

Chris Angelico · Jan 26, 2014

Roy Smith said:
Chris Angelico said:
The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]

I'm inclined to suggest the regex compiler should issue a warning for
this.

I've never seen a character range other than A-Z, a-z, or 0-9. Well, I
suppose A-F or a-f if you're trying to match hex digits (and some
variations on that for octal). But, I can't imagine any example where
somebody wrote A-z and it wasn't an error.

I've used a variety of character ranges, certainly more than the 4-5
you listed, but I agree that A-z is extremely likely to be an error.
However, I've sometimes used a regex (bytes mode) to find, say, all
the ASCII printable characters - [ -~] - and I wouldn't want that
precluded. It's a bit tricky trying to figure out which are likely to
be errors and which are not, so I'd be inclined to keep things as they
are. No warnings.
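
For anyone curious, the printable-ASCII range looks roughly like this in bytes mode. This is just a sketch; the `data` value is a made-up sample.

```python
import re

# Made-up sample: control bytes, printable text, DEL (0x7f), a high byte, a newline.
data = b'\x00\x01hello, world~\x7f\xff [ok]\n'

# [ -~] spans 0x20 (space) through 0x7E (tilde), i.e. the printable ASCII characters.
runs = re.findall(rb'[ -~]+', data)
print(runs)
```

Here the range is deliberately wide, which is why a blanket warning on unusual ranges would get in the way.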

ChrisA

Mark Lawrence · Jan 26, 2014

I'm pleased to see that your question has been answered.

Now would you please read and action this
https://wiki.python.org/moin/GoogleGroupsPython to prevent us seeing the
double line spacing above, thanks.

Mark Lawrence · Jan 26, 2014

I suggest a single warning is always given: "Regular expressions can be
fickle. Have you considered using string methods?". My apologies to
regex fans if they're currently choking over their tea, coffee, cocoa,
beer, scotch, sake, ouzo or whatever.

Tim Chase · Jan 26, 2014

The set [A-z] is equivalent to
[ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz]

I'm inclined to suggest the regex compiler should issue a warning
for this.

I've never seen a character range other than A-Z, a-z, or 0-9.
Well, I suppose A-F or a-f if you're trying to match hex digits
(and some variations on that for octal). But, I can't imagine any
example where somebody wrote A-z and it wasn't an error.

I'd not object to warnings on that one literal "A-z" set, but I've
done some work with VINs¹ where the allowable character-set is A-Z and
digits, minus letters that can be hard to distinguish visually
(I/O/Q), so I've used ^[A-HJ-NPR-Z0-9]{17}$ as a first-pass filter
for VINs that were entered (often scanned, but occasionally
hand-keyed). In some environments, I've been able to intercept I/O/Q
and remap them accordingly to 1/0/0 to do the disambiguation for the
user. So I'd not want to see other character-classes touched, as
they can be perfectly legit.
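
A rough sketch of that first-pass filter plus the I/O/Q remapping might look like the following. The helper name `normalize_vin`, the uppercasing and the whitespace stripping are assumptions added for illustration, and this only checks the character set and length, not the VIN check digit.

```python
import re

# First-pass filter: exactly 17 characters from A-Z and 0-9, excluding I, O and Q.
VIN_RE = re.compile(r'^[A-HJ-NPR-Z0-9]{17}$')

# Remap the visually ambiguous letters to digits: I -> 1, O -> 0, Q -> 0.
AMBIGUOUS = str.maketrans('IOQ', '100')

def normalize_vin(raw):
    """Hypothetical helper: clean up a keyed-in VIN, or return None if it fails the filter."""
    candidate = raw.strip().upper().translate(AMBIGUOUS)
    return candidate if VIN_RE.match(candidate) else None

print(normalize_vin('1HGCM82633A004352'))  # already well-formed
print(normalize_vin('IHGCM82633AOO4352'))  # I and O typed instead of 1 and 0
print(normalize_vin('1HGCM82633A00435'))   # only 16 characters
```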

-tkc

¹ http://en.wikipedia.org/wiki/Vehicle_Identification_Number
