Andy Dingley said:
I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression
(s1|s2|s3|s4|s5)
for strings of "s1" etc.
Regex compilers are themselves quite good at optimising beyond this
It turns out this problem is a little trickier, especially when one of
your strings is a leading subset of another. For instance, let's say we
are looking for comparison operators, one of <, >, =, >=, <=, or !=.
Simply concatenating with intervening '|' characters gives this regexp:
"<|>|=|<=|>=|!="
However, the leading '<' and '>' alternatives mask the later '<=', '<>',
and '>=' alternatives, and so the regexp never matches the longer options
(I was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:
['>', '=']
By moving the longer option to the front of the regexp, the longer option
is no longer masked by the shorter:
['>=']
You also can't just concatenate input strings, since it is very likely
they will contain one of the magic re symbols ()[]?*./\+, etc. So
re.escape needs to be called to add the necessary '\'s.
Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes
the input strings, before concatenating them to a valid regexp. Of
course, the simplest approach would be to just sort the symbols by
descending length, but you may have some a priori knowledge that 'X' is a
very common match, and want that option tested as early as possible. So
this method only reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols
for j,other in enumerate(symbols[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert(i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
print createREfrom(["ABC","ABCDEF","ABCGHI"]) ABCDEF|ABCGHI|ABC
print createREfrom("> < = <= >= != << >> <<< >>>".split()) \>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <=
b")
['<=']
Note, this does not do any optimization, such as collapsing
"ABCDEF|ABCGHI" to "ABC(DEF|GHI)". I think there are some recipes in the
Python cookbook for such optimization.
-- Paul