Looking for a regexp generator based on a set of known strings representative of a string set

vbfoobar

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea
 
Ant

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea

def getRegex(list_of_strings):
return ".*"

;-)
 
Andy Dingley

I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this
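Spelled out as a trivial sketch in modern Python (the escaping matters as soon as the strings contain regex metacharacters):

```python
import re

def alternation(strings):
    # Escape each literal so regex metacharacters match themselves,
    # then join the alternatives into one group.
    return "(%s)" % "|".join(re.escape(s) for s in strings)

print(alternation(["s1", "s2", "s3"]))  # (s1|s2|s3)
```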
 
bearophileHUGS

vbfoobar wrote:
I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

This may be a very simple starting point:
>>> import re
>>> strings = ["foo", "bar", "$spam"]
>>> strings2 = "|".join(re.escape(s) for s in strings)
>>> strings2
'foo|bar|\\$spam'
>>> finds = re.compile(strings2)

But I don't know how well this may work with many longer strings.
If that's not enough we can think about more complex solutions.
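One cheap hardening of this, sketched in modern Python: sort the strings by descending length before joining, so a string never shadows a longer string that it is a prefix of:

```python
import re

strings = ["<", "<=", ">", ">=", "=", "!="]
# Longest-first ordering keeps '<' from shadowing '<='.
pattern = "|".join(re.escape(s)
                   for s in sorted(strings, key=len, reverse=True))
print(re.findall(pattern, "a <= b"))  # ['<=']
```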

Bye,
bearophile
 
Paul McGuire

Andy Dingley said:
I am looking for python code that takes as input a list of strings
[...] and outputs the python regular expression

(s1|s2|s3|s4|s5)
for strings of "s1" etc.

Regex compilers are themselves quite good at optimising beyond this

It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:

"<|>|=|<=|>=|!="

However, the leading '<' and '>' alternatives mask the later '<=' and
'>=' alternatives, and so the regexp never matches the longer options (I
was not able to find a greediness switch that would override this). So
when searching "a >= b" we get this:
['>', '=']

By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:
['>=']
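For illustration, both orderings can be checked directly (modern-Python sketch); Python's alternation tries alternatives left to right and takes the first that matches:

```python
import re

naive   = "<|>|=|<=|>=|!="   # short alternatives first
ordered = "<=|>=|!=|<|>|="   # long alternatives first

print(re.findall(naive,   "a >= b"))  # ['>', '=']
print(re.findall(ordered, "a >= b"))  # ['>=']
```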


You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.

Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.


import re

def createREfrom( symbols ): # symbols is a list of strings
    isequal = ( lambda a,b: a == b )
    masks = ( lambda a,b: b.startswith(a) )
    i = 0
    while i < len(symbols)-1:
        cur = symbols[i]
        for j,other in enumerate(symbols[i+1:]):
            if ( isequal(other, cur) ):
                del symbols[i+j+1]
                break
            elif ( masks(cur, other) ):
                del symbols[i+j+1]
                symbols.insert(i,other)
                cur = other
                break
        else:
            i += 1
    return "|".join( [ re.escape(sym) for sym in symbols] )

>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= b")
['<=']


Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI"
to "ABC(DEF|GHI)". I think there are some recipes in the Python cookbook
for such optimization.
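For the curious, the prefix-collapsing idea can be sketched with a small trie. This is a rough illustration in modern Python (the `collapse` name and the details are mine, not a cookbook recipe): words are inserted into a trie, and the trie is emitted back as a regexp, factoring each shared prefix out of its alternatives.

```python
import re

def collapse(words):
    """Regex matching exactly the given words, with shared prefixes
    factored out, e.g. ['ABCDEF', 'ABCGHI'] -> 'ABC(?:DEF|GHI)'."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[''] = {}          # '' marks end-of-word

    def emit(node):
        ends_here = '' in node
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch != '']
        if not alts:
            return ''
        body = '|'.join(alts)
        if ends_here:          # a word may also stop at this node
            return '(?:%s)?' % body
        return alts[0] if len(alts) == 1 else '(?:%s)' % body

    return emit(trie)

print(collapse(['ABC', 'ABCDEF', 'ABCGHI']))  # ABC(?:DEF|GHI)?
```

Note that because 'ABC' is itself one of the words, the factored tail becomes optional, which also sidesteps the masking problem above.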

-- Paul
 
Diez B. Roggisch

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

For matching the exact set, of course an alternation of the strings can
be used. However, this is of limited use, because then a simple string
find would suffice.

In general, this can't be done. It might be possible that the underlying
structure of the language isn't regular at all - for example, the simple
language of palindromes isn't.

And even if it were - the search space is just too large to explore. You
could generate a bazillion matching expressions, but .* will always
match - so how do you prevent the generator from converging towards that?

If all you have are those strings, you are better off trying to infer
some structure yourself.

Diez
 
James Stroud

Hello

I am looking for Python code that takes as input a list of strings
(mostly similar, but not necessarily, and rather short: say no longer
than 50 chars) and that computes and outputs the Python regular
expression that matches these string values (not necessarily strictly;
perhaps the code is able to determine patterns, i.e. families of
strings...).

Thanks for any idea

I'm not sure of your application, but genomicists and proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
vbfoobar

James Stroud wrote:
I'm not sure of your application, but genomicists and proteomicists have
found that Hidden Markov Models can be very powerful for developing
pattern models. Perhaps have a look at "Biological Sequence Analysis" by
Durbin et al.

Also, a very cool regex based algorithm was developed at IBM:

http://cbcsrv.watson.ibm.com/Tspd.html

Indeed, this seems cool! Thanks for the suggestion

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
But I think HMMs are the way to go. Check out HMMER at WUSTL by Sean
Eddy and colleagues:

http://hmmer.janelia.org/

http://selab.janelia.org/people/eddys/

I will look at that more precisely, but at first look
it seems this is more specialized and less accessible
for mere mortals...

Thanks. This may help me.

In addition, I continue to look for other ideas, notably
because I want code that I can change myself,
and exclusively Python code.
 
James Stroud

I have tried their online Text-symbol Pattern Discovery
with these input values:

cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3

Well, in the realm of sequence analysis, it is trivial to devise a regex
for these values because they are already aligned and of fixed length.
This is a couple more lines than it needs to be, only so it's easier
to follow the logic. This uses throw-away groups to avoid bracketed
sets, because you have dashes in your items. You might need to tweak the
following code if you have characters special to regex in your sequences,
or if you want to use bracketed sets. The place to do this is in the
_joiner() function.


values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

# python 2.5 has new list comp features to shorten this test, but
# the resulting list comp can begin to look ugly if the alternatives
# are complicated
def _joiner(c):
    if len(c) == 1:
        # will raise KeyError for empty column
        return c.pop()
    else:
        return "(?:%s)" % '|'.join(c)

columns = [set(c) for c in zip(*values)]
col_strs = [_joiner(c) for c in columns]
rgx_str = "".join(col_strs)
exact_rgx_str = "^%s$" % rgx_str

# '(?:c|n)(?:p|s)(?:k|u|d)g-3(?:1|0)0(?:A|0)(?:A|B|1|0|3|2|6|8)'
print rgx_str


But, if you get much more complicated than this, you will definitely
want to check out hmmer, especially if you can align your sequences.
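As a quick sanity check, the column-wise construction can be rebuilt in modern Python and tested against the inputs (same logic as above, condensed; the characters here are only alphanumerics and '-', so no escaping is needed):

```python
import re

values = """
cpkg-30000
cpkg-31008
cpkg-3000A
cpkg-30006
nsug-300AB
nsug-300A2
cpdg-30001
nsug-300A3
""".split()

def _joiner(c):
    # single-character column: emit it literally;
    # otherwise a throw-away group of alternatives
    return c.pop() if len(c) == 1 else "(?:%s)" % "|".join(sorted(c))

# one alternative group per character column
rgx_str = "".join(_joiner(set(col)) for col in zip(*values))
exact = re.compile("^%s$" % rgx_str)

print(all(exact.match(v) for v in values))   # True
print(exact.match("xpkg-30000") is None)     # True
```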

James


--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
