Python regular expression question!

unexpected · Sep 20, 2006

I'm trying to do a whole word pattern match for the term 'MULTX-'

Currently, my regular expression syntax is:

re.search(('^')+(keyword+'\\b')

where keyword comes from a list of terms. ('MULTX-' is in this list,
and hence a keyword).

My regular expression works for a variety of different keywords, but
not for 'MULTX-'. It does work for MULTX, however, so I'm thinking that
the '-' sign is being treated as a word boundary. Is there any way to
get Python to override this word boundary?

I've tried using raw strings, but the syntax is painful. My attempts
were:

re.search(('^')+("r"+keyword+'\b')
re.search(('^')+("r'"+keyword+'\b')

and then tried the even simpler:

re.search(('^')+("r'"+keyword)
re.search(('^')+("r''"+keyword)

and all of those failed for everything. Any suggestions?

Hallvard B Furuseth · Sep 20, 2006

unexpected said:
I'm trying to do a whole word pattern match for the term 'MULTX-'

Currently, my regular expression syntax is:

re.search(('^')+(keyword+'\\b')

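\b matches at the boundary between a word character (a-zA-Z0-9_) and a
non-word character, or at the start/end of the string next to a word
character. Since '-' is not a word character, that regex will match
e.g. MULTX-FOO but not MULTX- followed by a space or the end of the
string.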
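
A quick way to check this behaviour is sketched below. The helper name starts_with_keyword and the sample strings are made up for illustration; the pattern is built the same way as in the question.

```python
import re

def starts_with_keyword(keyword, text):
    # Same construction as in the question: '^' + keyword + '\\b'
    return re.search('^' + keyword + '\\b', text) is not None

# '-' is a non-word character, so \b after it needs a word character next.
print(starts_with_keyword('MULTX-', 'MULTX-FOO'))   # True
print(starts_with_keyword('MULTX-', 'MULTX- FOO'))  # False
print(starts_with_keyword('MULTX-', 'MULTX-'))      # False

# 'X' is a word character, so \b matches right before the '-'.
print(starts_with_keyword('MULTX', 'MULTX- FOO'))   # True
```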

Incidentally, in case the keyword contains regex special characters
(like '*') you may wish to escape it: re.escape(keyword).
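
As a rough illustration of why escaping matters, here is a small sketch. The keyword 'MULT*' is a made-up example, not one from the original list. The last two lines also show why the "r" + keyword attempts could not work: the r prefix only applies to string literals in source code, and '\b' in a normal string is a backspace character.

```python
import re

keyword = 'MULT*'  # hypothetical keyword containing a regex metacharacter

unescaped = '^' + keyword            # here '*' means "zero or more T"
escaped = '^' + re.escape(keyword)   # here '*' is matched literally

print(bool(re.search(unescaped, 'MULX')))    # True: matches 'MUL'
print(bool(re.search(escaped, 'MULX')))      # False
print(bool(re.search(escaped, 'MULT*-1')))   # True

# "r" + keyword just puts the letter r in front of the keyword.
print("r" + 'MULTX-')          # rMULTX-
# Backspace character vs. backslash followed by 'b'.
print(len('\b'), len(r'\b'))   # 1 2
```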

unexpected · Sep 20, 2006

Hallvard B Furuseth said:
\b matches the beginning/end of a word (characters a-zA-Z_0-9).
So that regex will match e.g. MULTX-FOO but not MULTX-.

So is there a way to get \b to include - ?

Ant · Sep 20, 2006

unexpected said:
So is there a way to get \b to include - ?

No, but you can get the behaviour you want using a negative lookahead.
The following regex acts like the end-of-word half of \b, with - treated
as a word character:

pattern = r"(?![a-zA-Z0-9_-])"

This asserts that the next character is not in the set [a-zA-Z0-9_-]
(or that the string ends here), without consuming anything. For example:

import re

p = re.compile(r".*?(?![a-zA-Z0-9_-])(.*)")
s = "aabbcc_d-f-.XXX YYY"
m = p.search(s)
print(m.group(1))

This prints:

.XXX YYY

Note that the regex recognises the '.' as the end of the word, but
doesn't use it up in the match, so it is present in the final capturing
group. Contrast it with:

p = re.compile(r".*?[^a-zA-Z0-9_-](.*)")
s = "aabbcc_d-f-.XXX YYY"
m = p.search(s)
print(m.group(1))

This prints:

XXX YYY

Note here that "[^a-zA-Z0-9_-]" still denotes the end of the word, but
this time consumes it, so it doesn't appear in the final captured group.
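
Applied to the original problem, the lookahead can take the place of the trailing \b. This is only a sketch: matches_keyword and END_OF_WORD are hypothetical names, the sample strings are invented, and re.escape is used as suggested earlier in the thread.

```python
import re

# End-of-word check with '-' counted as a word character.
END_OF_WORD = r'(?![a-zA-Z0-9_-])'

def matches_keyword(keyword, text):
    # Hypothetical helper: keyword at the start of text, as a whole word.
    pattern = '^' + re.escape(keyword) + END_OF_WORD
    return re.search(pattern, text) is not None

print(matches_keyword('MULTX-', 'MULTX- 123'))  # True
print(matches_keyword('MULTX-', 'MULTX-'))      # True (end of string)
print(matches_keyword('MULTX-', 'MULTX-FOO'))   # False
print(matches_keyword('MULTX', 'MULTX 1'))      # True
print(matches_keyword('MULTX', 'MULTX-'))       # False
```

One thing to keep in mind: because - now counts as part of a word, the keyword MULTX no longer matches text starting with MULTX-, which differs from what \b did.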

unexpected · Sep 20, 2006

Sweet! Thanks so much!
