Generating all permutations from a regexp

  • Thread starter =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=
  • Start date
?

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=

With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do? Obviously, for some regexps the set of matches
is unbounded (a list of everything that matches "*" would be very
unpractical), but how would you do it for simple regexps like the one
above?
 
N

Nick Craig-Wood

BJörn Lindqvist said:
With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do? Obviously, for some regexps the set of matches
is unbounded (a list of everything that matches "*" would be very
unpractical), but how would you do it for simple regexps like the one
above?

A regular expression matcher uses a state machine to match strings.

What you want to do is do a traversal through that state machine
generating all possible matches at each point.

Luckily the python regexp matcher is written in python, so you have
access to the state machine directly if you want.

Take a look at sre*.py in the python library and you might be able to
work out what to do! I had a brief look myself, and it
looked... complicated!
 
F

Fredrik Lundh

Nick said:
A regular expression matcher uses a state machine to match strings.

unless it's the kind of regular expression matcher that doesn't use a
state machine, like the one in Python.

</F>
 
P

Paul McGuire

With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do?

Here is a first cut at your problem
(http://pyparsing-public.wikispaces.com/space/showimage/invRegex.py).
I used pyparsing to identify repeatable ranges within a regex, then
attached generator-generating classes to parse actions for each type of
regex element. Some limitations:
- unbounded '*' and '+' repetition is not allowed
- only supports \d, \w, and \s macros

Here are the test cases in the program that generate the expected list
of permutations:
[A-E]
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
 
N

Nick Craig-Wood

Fredrik Lundh said:
unless it's the kind of regular expression matcher that doesn't use a
state machine, like the one in Python.

Ah! Well that is outside of my experience then... I've only really
studied the matchers produced by flex before. Apologies for the
mis-information!
 
C

Chris Johnson

BJörn Lindqvist said:
With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do? Obviously, for some regexps the set of matches
is unbounded (a list of everything that matches "*" would be very
unpractical), but how would you do it for simple regexps like the one
above?

For a very small number of characters, it would be feasible. For any
finite number of characters, it would be possible (though it wouldn't
take much to take longer than the age of the universe). For reference,
in your simple example, you have 17,576,000 matching strings.

I'm curious as to why you would wish to do this. I certainly understand
considering hard problems for their own sake, but when I formulate
them, there's always some impetus that makes me say "Huh. Now I
wonder..."
 
T

Thomas Ploch

Fredrik said:
unless it's the kind of regular expression matcher that doesn't use a
state machine, like the one in Python.

</F>

How is the matching engine implemented then?

I thought regular languages can be described by deterministic /
non-deterministic finite state machines. I am just curious ...

Thomas
 
C

Chris Johnson

Thomas said:
How is the matching engine implemented then?

I thought regular languages can be described by deterministic /
non-deterministic finite state machines. I am just curious ...

Regular expressions are described by N?DFAs, but Perl-compatible
regular expressions (which pretty much every language implements now)
are not true regular expressions. They are actually Turing complete,
and thus have features that cannot be implemented in N?DFAs.
 
P

Paul McGuire

With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do?

Here is a first cut at your problem
(http://pyparsing-public.wikispaces.com/space/showimage/invRegex.py).
I used pyparsing to identify repeatable ranges within a regex, then
attached generator-generating classes to parse actions for each type of
regex element. Some limitations:
- unbounded '*' and '+' repetition is not allowed
- only supports \d, \w, and \s macros

=====================

Download the latest version of this file. It is now importable as its own
module, with the invert method that takes a regexp string and returns a
generator that yields all the possible matching strings. This file also
includes a simple count method, which returns the number of elements
returned by a generator (as opposed to calling len(list(invert("..."))),
which generates an intermediate list just to invoke len on it).

The reg exp features that have been added are:
- alternation using '|'
- min-max repetition using {min,max} format
- '.' wildcard character

Also fixed some repetition bugs, where "foobar{2}" was treated like
"(foobar){2}" - now both cases are handled correctly.

-- Paul
 
?

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=

BJörn Lindqvist said:
With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do? Obviously, for some regexps the set of matches
is unbounded (a list of everything that matches "*" would be very
unpractical), but how would you do it for simple regexps like the one
above?

For a very small number of characters, it would be feasible. For any
finite number of characters, it would be possible (though it wouldn't
take much to take longer than the age of the universe). For reference,
in your simple example, you have 17,576,000 matching strings.

I'm curious as to why you would wish to do this. I certainly understand
considering hard problems for their own sake, but when I formulate
them, there's always some impetus that makes me say "Huh. Now I
wonder..."

I have a thousand use cases in mind. For example:

1. Generate sentences for a bot:

ipermute("(I|You|He|She|It|They) do( not)? (dis)?like (you|him|her|me|they)"):

Generates many grammatically incorrect sentences but you get the point.

2. Generate URL:s to web resources:

The following should generate URL:s to all c.l.p digests from the mail archive:

def download_clp():
year_re = "(199\d|200[0-6])"
months = ["January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"]
month_re = '(' + '|'.join(months) + ')'
fmt = "http://mail\.python\.org/pipermail/python-list/%s-%s\.txt"
url_re = fmt % (year_re, month_re)
for x in ipermute(url_re):
print "Downloading", x
code to download here

The same approach could be used to download all threads in a forum for
example, or all articles on Slashdot.

3. "Visualising" regular expressions. I think it would be easier to
write regular expressions if you could see what kind of data they
would match.

4. Port scanning:

ip_tuple = "(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
for x in ipermute("192\.168\." + "\.".join([ip_tuple] * 2)):
scan_ip(x)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top