Looking for a regexp generator based on a set of known strings representative of a string set

vbfoobar

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea
 
Ant

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea

def getRegex(list_of_strings):
return ".*"

;-)
 
Andy Dingley

I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this
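Spelled out as a trivial sketch in modern Python (the escaping matters as soon as the strings contain regex metacharacters):

```python
import re

def alternation(strings):
    # Escape each literal so regex metacharacters match themselves,
    # then join the alternatives into one group.
    return "(%s)" % "|".join(re.escape(s) for s in strings)

print(alternation(["s1", "s2", "s3"]))  # (s1|s2|s3)
```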
 
bearophileHUGS

vbfoobar wrote:
I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

This may be a very simple starting point:
>>> import re
>>> strings = ["foo", "bar", "$spam"]
>>> strings2 = "|".join(re.escape(s) for s in strings)
>>> strings2
'foo|bar|\\$spam'
>>> finds = re.compile(strings2)

But I don't know how well this may work with many longer strings.
If that's not enough we can think about more complex solutions.
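One cheap hardening of this, sketched in modern Python: sort the strings by descending length before joining, so a string never shadows a longer string that it is a prefix of:

```python
import re

strings = ["<", "<=", ">", ">=", "=", "!="]
# Longest-first ordering keeps '<' from shadowing '<='.
pattern = "|".join(re.escape(s)
                   for s in sorted(strings, key=len, reverse=True))
print(re.findall(pattern, "a <= b"))  # ['<=']
```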

Bye,
bearophile
 
Paul McGuire

Andy Dingley said:
I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!="

However, the leading '<' and '>' alternatives mask the later '<=' and
'>=' alternatives, and so the regexp never matches the longer options (I
was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:
['>', '=']

By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:
['>=']
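For illustration, both orderings can be checked directly (modern-Python sketch); Python's alternation tries alternatives left to right and takes the first that matches:

```python
import re

naive   = "<|>|=|<=|>=|!="   # short alternatives first
ordered = "<=|>=|!=|<|>|="   # long alternatives first

print(re.findall(naive,   "a >= b"))  # ['>', '=']
print(re.findall(ordered, "a >= b"))  # ['>=']
```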


You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.


import re

def createREfrom( symbols ): # symbols is a list of strings
    isequal = ( lambda a,b: a == b )
    masks = ( lambda a,b: b.startswith(a) )
    i = 0
    while i < len(symbols)-1:
        cur = symbols[i]
        for j,other in enumerate(symbols[i+1:]):
            if ( isequal(other, cur) ):
                del symbols[i+j+1]
                break
            elif ( masks(cur, other) ):
                del symbols[i+j+1]
                symbols.insert(i,other)
                cur = other
                break
        else:
            i += 1
    return "|".join( [ re.escape(sym) for sym in symbols] )

>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= b")
['<=']


Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI"
to "ABC(DEF|GHI)". I think there are some recipes in the Python cookbook
for such optimization.
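For the curious, the prefix-collapsing idea can be sketched with a small trie. This is a rough illustration in modern Python (the `collapse` name and the details are mine, not a cookbook recipe): words are inserted into a trie, and the trie is emitted back as a regexp, factoring each shared prefix out of its alternatives.

```python
import re

def collapse(words):
    """Regex matching exactly the given words, with shared prefixes
    factored out, e.g. ['ABCDEF', 'ABCGHI'] -> 'ABC(?:DEF|GHI)'."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[''] = {}          # '' marks end-of-word

    def emit(node):
        ends_here = '' in node
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch != '']
        if not alts:
            return ''
        body = '|'.join(alts)
        if ends_here:          # a word may also stop at this node
            return '(?:%s)?' % body
        return alts[0] if len(alts) == 1 else '(?:%s)' % body

    return emit(trie)

print(collapse(['ABC', 'ABCDEF', 'ABCGHI']))  # ABC(?:DEF|GHI)?
```

Note that because 'ABC' is itself one of the words, the factored tail becomes optional, which also sidesteps the masking problem above.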

-- Paul
 
Diez B. Roggisch

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

For matching the exact set, of course an alternation of the strings can
be used. However, this is of limited use, because then a simple string
find would suffice.

In general, this can't be done. It might be possible that the underlying
structure of the language isn't regular at all - for example, the simple
language of palindromes isn't.

And even if it were - the search space is just too large to explore. You
could generate a bazillion matching expressions, but .* will always
match - so how do you prevent the generator from converging towards that?

If all you have are those strings, you are better off trying to infer
some structure yourself.

Diez
 
James Stroud

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea

I'm not sure of your application, but genomicists and proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
vbfoobar

James Stroud wrote:
I'm not sure of your application, but genomicists and proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

Indeed, this seems cool! Thanks for the suggestion

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

I will look at that more precisely, but at first look
it seems this is more specialized and less accessible
for mere mortals...

Thanks. This may help me.

In addition, I continue to look for other ideas, notably
because I want code that I can change myself,
and exclusively Python code.
 
James Stroud

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3

Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple more lines than it needs to be, only so it's easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, because you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences,
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.


values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
    if len(c) == 1:
        # will raise KeyError for empty column
        return c.pop()
    else:
        return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_strs)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
print rgx_str


But, if you get much more complicated than this, you will definitely
want to check out hmmer, especially if you can align your sequences.
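As a quick sanity check, the column-wise construction can be rebuilt in modern Python and tested against the inputs (same logic as above, condensed; the characters here are only alphanumerics and '-', so no escaping is needed):

```python
import re

values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

def _joiner(c):
    # single-character column: emit it literally;
    # otherwise a throw-away group of alternatives
    return c.pop() if len(c) == 1 else "(?:%s)" % "|".join(sorted(c))

# one alternative group per character column
rgx_str = "".join(_joiner(set(col)) for col in zip(*values))
exact = re.compile("^%s$" % rgx_str)

print(all(exact.match(v) for v in values))   # True
print(exact.match("xpkg-30000") is None)     # True
```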

James


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
