regexp strangeness


Dale Amon

This finds nothing:

import re
import string
card = "abcdef"
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]")
errs = DEC029.findall(card.strip("\n\r"))
print errs

This works correctly:

import re
import string
card = "abcdef"
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!)\\;\]%_>?]")
errs = DEC029.findall(card.strip("\n\r"))
print errs

They differ only in the positioning of the quoted backslash.

Just in case it is of interest to anyone.




 

Peter Otten

Dale said:
This finds nothing: ....
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]")
This works correctly: ....
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!)\\;\]%_>?]")

They differ only in the positioning of the quoted backslash.

You have to escape twice; once for Python and once for the regular
expression. Or use raw strings, denoted by an r"..." prefix:

>>> re.findall("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]", "abc")
[]
>>> re.findall("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\\\\]%_>?]", "abc")
['a', 'b', 'c']
>>> re.findall(r"[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]", "abc")
['a', 'b', 'c']
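
To make the two layers concrete, here is a minimal sketch using a shortened, made-up character class (not the DEC029 pattern) that should match a literal backslash or a literal "]". The names doubled, raw and text are illustrative only, and the code is written for Python 3, whereas the session above is Python 2.

```python
import re

# Goal (illustrative, shortened pattern): a set that matches a literal
# backslash or a literal "]".
doubled = "[\\\\\\]]"   # normal string: every regex backslash is written twice
raw = r"[\\\]]"         # raw string: backslashes reach the regex unchanged

# Both spellings hand the regex engine the same six characters: [\\\]]
same_pattern = doubled == raw

text = "a\\b]c"         # the five characters  a \ b ] c
found = re.findall(raw, text)

print(same_pattern)     # True
print(found)            # the backslash and the "]" are the two matches
```

The raw form is usually easier to check by eye, because what you type is what the regex engine sees.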

Peter
 

MRAB

Peter said:
Dale said:
This finds nothing:

import re
import string
card = "abcdef"
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]")

The regular expression you're actually providing is:
>>> print "[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]"
[^&0-9A-Z/ $*,.\-:#@'="[<(+\^!);\\]%_>?]
                                ^^^

The backslash is escaped (the "\\") and the set ends at the first "]".
errs = DEC029.findall(card.strip("\n\r"))
print errs

This works correctly:

import re
import string
card = "abcdef"
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!)\\;\]%_>?]")

The regular expression you're actually providing is:
>>> print "[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!)\\;\]%_>?]"
[^&0-9A-Z/ $*,.\-:#@'="[<(+\^!)\;\]%_>?]
                                 ^^    ^

The first "]" is escaped (the "\]") and the set ends at the second "]".
errs = DEC029.findall(card.strip("\n\r"))
print errs
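
The parsing described above can be checked directly. This is only a sketch for Python 3: first and second are raw strings holding exactly the text that the two print statements displayed, and fixed is an assumption about what was intended (backslash and "]" both escaped inside a single set). All variable names are made up for illustration.

```python
import re

# Raw triple-quoted strings, so each pattern is exactly the text that the
# print statements above displayed.
first = r"""[^&0-9A-Z/ $*,.\-:#@'="[<(+\^!);\\]%_>?]"""
second = r"""[^&0-9A-Z/ $*,.\-:#@'="[<(+\^!)\;\]%_>?]"""
# Assumed intent: backslash and "]" both escaped inside one set.
fixed = r"""[^&0-9A-Z/ $*,.\-:#@'="[<(+\^!);\\\]%_>?]"""

# first: the set closes right after the escaped backslash, so the tail
# %_>?] has to match literally after one character outside the set.
first_on_card = re.findall(first, "abcdef")
first_on_tail = re.findall(first, "a%_]")

# second: \; is only an escaped semicolon, so no backslash is in the set.
second_on_card = re.findall(second, "abcdef")
second_on_backslash = re.findall(second, "A\\B")

# fixed: lowercase letters are still reported, the backslash is accepted.
fixed_on_card = re.findall(fixed, "abcdef")
fixed_on_backslash = re.findall(fixed, "A\\B")

print(first_on_card, first_on_tail)
print(second_on_card, second_on_backslash)
print(fixed_on_card, fixed_on_backslash)
```

One thing the sketch suggests: the second pattern reports the lowercase letters as expected, but because its backslash only escapes the semicolon, a backslash on the card would be reported as an invalid character too.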

They differ only in the positioning of the quoted backslash.

Just in case it is of interest to anyone.

 

Steven D'Aprano

This finds nothing: ....
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!);\\\]%_>?]")
This works correctly: ....
DEC029 = re.compile("[^&0-9A-Z/ $*,.\-:#@'=\"[<(+\^!)\\;\]%_>?]")

They differ only in the positioning of the quoted backslash.

So you're telling us that two different regexes do different things? Gosh.
Thanks for the heads up!

BTW, when creating regexes, you may find it much easier if you use raw
strings to avoid needing to escape backslashes:

'\n' in a string is a newline escape. To get a literal backslash followed
by an n, you can write '\\n' or r'\n'.
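
A quick way to see the difference is to compare lengths. This is a small Python 3 sketch; the variable names are only for illustration.

```python
newline = '\n'     # one character: the newline escape
escaped = '\\n'    # two characters: a backslash followed by n
raw_form = r'\n'   # the same two characters, written as a raw string

print(len(newline), len(escaped), len(raw_form))   # 1 2 2
print(escaped == raw_form)                         # True
```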
 
