The inverse problem: generate all instances of a regexp

Andrei Alexandrescu (See Website For Email) · Feb 15, 2006

I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

Andrei

tuser · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

Andrei

I haven't tried it, but Regexp:

arser might be useful.
See
http://search.cpan.org/~pinyan/Regexp-Parser-0.20/lib/Regexp/Parser.pm

But I think regular expressions in Perl are quite complex, so some
people (including myself) consider them to be programs in their own
right.

Your original question might trigger the old discussions in
computational theory about whether or not it can be decided at all if a
given program (i.e. regular expression) terminates in finite time.

Ingo Menger · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

This would be no fun, since the function would have to die for most
regexes (i.e. those that contain * or +).
For all ther constructs, this function would be fairly trivial, but
would return lots of strings in cases that look harmless, like for
example
\d{1,10}

A lot more intersting would be the inverse function, i.e. a function
that, given a list of strings, returns the shortest regex that matches
those (and only those) strings. I guess this one is not easily
computable.

Anno Siegel · Feb 15, 2006

Andrei Alexandrescu (See Website For Email) said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

MJD discusses this problem extensively in his _Higher Order Perl_. The
solution presented there deals with the problem of infinitely many
solutions by giving an iterator. That is a function that doesn't try
to give you all solutions at once, but returns another solution each
time you call it.

Anno

John W. Krahn · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

That is not infinite, unless you know of a file system that can use infinitely
large file names.

John

Ingo Menger · Feb 15, 2006

John said:
That is not infinite, unless you know of a file system that can use infinitely
large file names.

The functions that deal with strings and regular expressions know
nothing about filesystems and their restrictions.
Likewise, the fact that there is no computer with infinite memory tells
us nothing about the set of strings that match a given $re and if this
set is finite or not.
If a string is in the set of strings that match the $re and if that
string could be stored in all available computer memory taken together
are 2 totally different questions.

Josef Moellers · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

You do know, that your re would also match all strings that have
arbitrary pre- and postfix strings, such as
adebug/progname.c
bdebug/progname.c
cdebug/progname.c
...
debug/progname.ca
debug/progname.cb
debug/progname.cc
...
To name but a few ;-)

If you anchor your pattern, then you'r more safe:
$re = '^(debug|release)/progname\.(c|h)$'

Not being an expert in these trings, if you could somehow get a dfa for
your re, then you could recursively walk through the graph and construct
what you want.

Josef

A. Sinan Unur · Feb 15, 2006

I wonder how this could be done elegantly.

Hmmm ... Please don't bury your question in the subject line.

Subject: The inverse problem: generate all instances of a regexp

Given a regular expression $re, write a function Generate($) that
returns a list of all possible strings that could match $re. If the
set is infinite, die with an error message.

The YAPE::Regex modules might be of some help:

#!/usr/bin/perl

use strict;
use warnings;

use YAPE::Regex::Explain;

my $one = qr{(debug|release)/progname\.(c|h)};
my $two = qr{(debug/release)/.*\.(c|h)};

print YAPE::Regex::Explain->new($one)->explain, "\n",
YAPE::Regex::Explain->new($two)->explain, "\n";

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

I have to wonder, however, why you want to do this. There might be
another, more appropriate solution to the real problem you are trying to
solve.

Sinan

it_says_BALLS_on_your forehead · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

isn't this problem NP-complete? the NFA would need to match every
possible scenario. a set can be finite and still really really really
big. really.

bugbear · Feb 15, 2006

Andrei said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an error
message.

The latter sounds like the "halting problem"

BugBear

Randal L. Schwartz · Feb 15, 2006

bugbear> http://www.stonehenge.com/merlyn/LinuxMag/col04.html

And be sure to check out (in the CPAN) Inline::Spew, which came
from me writing <URL:http://www.stonehenge.com/merlyn/LinuxMag/col57.html>.

That way, you don't have to copy the code from LM-col04 carefully into your
program, and you can optimize the compilation from Spew to data structures.

print "Just another Perl hacker,"; # the original

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[email protected]> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
*** Free account sponsored by SecureIX.com ***
*** Encrypt your Internet usage with a free VPN account from http://www.SecureIX.com ***

Ingo Menger · Feb 15, 2006

bugbear said:
The latter sounds like the "halting problem"

No, its trivial.
Any regex containing {n,} describes an infinite set. Note that * is
short for {0,} and + is short for {1,}.
The rest describe finite sets. The proof is easily done by examining
all remaining regex constructs and finding either
- that the set of strings that'd match is finite (such as a character
standing for itself or a character class),
- or that the given way of combining sets is finite as long as there
are no infinite sets

It is assumed, of course, that the regex is anchored on both sites,
i.e. it has to match the whole string (/^re$/). Otherwise, it would
describe an infinite set (/^re/ or /re$/) in any case.
It is also assumed that the set of characters is finite.

Ben Bacarisse · Feb 15, 2006

The latter sounds like the "halting problem"

At first glace it looks way, way easier to me. The set of matching
strings is unbounded if the RE does not start and end with ^ and $ or if
in contains any of the forms <r>* <r>+ <r>{<n>,} such that the nested RE
<r> can actually match something (or non-zero length). In all the
examples I can think of, you can tell if an RE can't match anything.

Andrei Alexandrescu (See Website For Email) · Feb 15, 2006

John said:
That is not infinite, unless you know of a file system that can use infinitely
large file names.

Well I didn't say $re really represents a file name ;o). But, point
taken. Let me replace "infinite" with "theoretically unbounded".

Andrei

Andrei Alexandrescu (See Website For Email) · Feb 15, 2006

Josef said:
You do know, that your re would also match all strings that have
arbitrary pre- and postfix strings, such as
adebug/progname.c
bdebug/progname.c
cdebug/progname.c
...
debug/progname.ca
debug/progname.cb
debug/progname.cc
...
To name but a few ;-)

If you anchor your pattern, then you'r more safe:
$re = '^(debug|release)/progname\.(c|h)$'

Oh yes, that's what I was referring to.

Andrei

Andrei Alexandrescu (See Website For Email) · Feb 15, 2006

A. Sinan Unur said:
Hmmm ... Please don't bury your question in the subject line.

Subject: The inverse problem: generate all instances of a regexp

I think the actual subject conveys more information than "How to solve
elegantly the inverse problem: ..."

The YAPE::Regex modules might be of some help:

Thanks, I'll look into it.

I have to wonder, however, why you want to do this. There might be
another, more appropriate solution to the real problem you are trying to
solve.

Ok, here's my problem.

I am writing a "make" facility for perl. Yes, Unix "make". I find make
obtuse and limited in unexpected ways, and I'm generating many files
from within Perl itself, so I wrote a little engine that allows me to write:

use Make;

sub Compile { ... }
sub Link { ... }

Depends "aout", "main.o", \&Link;
Depends "main.o", ["main.c", "main.h"], \&Compile;

Make "aout";

Of course, the simplified example above doesn't show the power of the
engine, which shines through when it comes about automatically inferring
rules. For example, the last rule above could be inferred from:

DepPattern '(.*)\.o', ['$1.c', '$1.h'], \&Compile;

I'm quite pleased with having the power of full regexes when defining
makefile rules. Now, I wanted to do something more ambitious. I wanted
to be able to specify a pattern like:

DepPatternMulti 'debug|candidate|release',
'$multi/(.*)\.o', ['$1.c', '$1.h'], \&Compile;

The line above would be equivalent with:

DepPattern 'debug/(.*)\.o', ['$1.c', '$1.h'], \&Compile;
DepPattern 'candidate/(.*)\.o', ['$1.c', '$1.h'], \&Compile;
DepPattern 'release/(.*)\.o', ['$1.c', '$1.h'], \&Compile;

For that, I'd need to expand a possibly complex regexp into all of the
possible strings it could generate. Of course, more limited solutions
would be usable, but the "ars gratia artis" solution would involve
dealing with the full-fledged problem.

Andrei

Andrei Alexandrescu (See Website For Email) · Feb 16, 2006

Randal said:
bugbear> http://www.stonehenge.com/merlyn/LinuxMag/col04.html

And be sure to check out (in the CPAN) Inline::Spew, which came
from me writing <URL:http://www.stonehenge.com/merlyn/LinuxMag/col57.html>.

That way, you don't have to copy the code from LM-col04 carefully into your
program, and you can optimize the compilation from Spew to data structures.

The Spew module is painfully close to what I'd need. May I suggest
adding a function that generates (perhaps in an iterative manner) *all*
of the sentences that could be generated by the grammar?

Andrei

frederic · Feb 17, 2006

Andrei Alexandrescu (See Website For Email) said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an
error message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

From a theoretical point of view : the easiest way to do what you want is to
turn your regular expression into a finite automaton (see, for example,
http://lambda.uta.edu/cse5317/spring01/notes/node9.html
and nodes around). Then, enumerating the words recognized by a f.a. or
telling if it matches an infinite set of words is easy. I give a simple
implementation of it at the end of this message. Here is a simple
example of use:

my $dfa_hello = word_scanner "Hello, ";
my $dfa_bye = word_scanner "Bye, ";
my $dfa_world = word_scanner "World!";
my $dfa_earth = word_scanner "Earth!";

my $dfa = concat_scanner((union_scanner $dfa_hello, $dfa_bye),
(union_scanner $dfa_world, $dfa_earth));

# DFA matches the regexp: ((Hello, |Bye, )(World!|Earth!))
print (join '; ', (generate_words $dfa, $$dfa{INITSTATE}, {} ));
# Prints: Hello, World!; Hello, Earth!; Bye, World!; Bye, Earth!

generate_words generates the whole set of words matched, or croaks
if the set is infinite (I can explain how to rewrite it to match against,
for example, files into a directory) ; it does not use any random
enumeration : every matched word is given.

Of course, translating from a regexp syntax into a dfa using these
functions needs a parser of regexps, which is (very) tedious but is an
exercice on standard regexps, not on the reverse problem.

Here is the code:

# By Frederic Beal, on 2006/02/17.
# This code is in the public domain: do what you want of it, I shall not be
# responsible for whatever comes from it.

use strict;
use warnings;
sub generate_words($$$);

sub generate_words($$$) {
my ($dfa, $init, $visited) = @_;
# dfa: the automaton
# init: the state from which to visit
# visited: states already visited

my @words = ();

push @words, ""
if($init eq $$dfa{FINALSTATE} );

croak("Loop detected")
if exists $$visited{$init};

foreach(@{$$dfa{TRANSITIONS}}) {
next unless $$_[0] eq $init;

if(defined $$_[2]) {
my $letter = $$_[2];
my @foo = generate_words($dfa, $$_[1], { $init => 1, %$visited });
foreach(@foo) {
push @words, "$letter$_";
}
} else {
@words = ( @words, generate_words($dfa, $$_[1], { $init => 1, %$visited }) );
}
}
return @words;
}

my $last_id = 0;

sub gen_id() {
return "s" . $last_id++;
}

sub word_scanner($) {
my $word = shift;
my $init = gen_id;
my $current = $init;
my $last;
my @transitions = ();
my @states = ();

foreach(split '', $word) {
$last = $current;
$current = gen_id;
push @states, $current;
push @transitions, [$last, $current, $_];
}

return { INITSTATE => $init,
FINALSTATE => $current,
STATES => [ @states ],
TRANSITIONS => [ @transitions ] };
}

sub union_scanner($$) {
my ($dfa1, $dfa2) = @_;
my $init = gen_id;
my $final = gen_id;

return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa1{STATES}}, @{$$dfa2{STATES}}, $final ],
TRANSITIONS =>
[ [ $init, $$dfa1{INITSTATE} ],
[ $init, $$dfa2{INITSTATE} ],
[ $$dfa1{FINALSTATE}, $final ],
[ $$dfa2{FINALSTATE}, $final ],
@{$$dfa1{TRANSITIONS}},
@{$$dfa2{TRANSITIONS}} ] };
}

sub concat_scanner($$) {
my ($dfa1, $dfa2) = @_;
return { INITSTATE => $$dfa1{INITSTATE},
FINALSTATE => $$dfa2{FINALSTATE},
STATES => [ @{$$dfa1{STATES}}, @{$$dfa2{STATES}} ],
TRANSITIONS =>
[ @{$$dfa1{TRANSITIONS}},
[ $$dfa1{FINALSTATE}, $$dfa2{INITSTATE} ],
@{$$dfa2{TRANSITIONS}} ] };
}

sub star_scanner($) {
my $dfa = shift;
my $init = gen_id;
my $final = gen_id;

return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa{STATES}}, $final ],
TRANSITIONS =>
[ @{$$dfa{TRANSITIONS}},
[ $init, $final ],
[ $init, $$dfa{INITSTATE} ],
[ $$dfa{FINALSTATE}, $final ],
[ $final, $$dfa{INITSTATE} ] ] };
}

Frederic Beal · Feb 17, 2006

Andrei Alexandrescu (See Website For Email) said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an
error message.

For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:

'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'

If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.

Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)

From a theoretical point of view : the easiest way to do what you want is to
turn your regular expression into a finite automaton (see, for example,
http://lambda.uta.edu/cse5317/spring01/notes/node9.html
and nodes around). Then, enumerating the words recognized by a f.a. or
telling if it matches an infinite set of words is easy. I give a simple
implementation of it at the end of this message. Here is a simple
example of use:

my $dfa_hello = word_scanner "Hello, ";
my $dfa_bye = word_scanner "Bye, ";
my $dfa_world = word_scanner "World!";
my $dfa_earth = word_scanner "Earth!";

my $dfa = concat_scanner((union_scanner $dfa_hello, $dfa_bye),
(union_scanner $dfa_world, $dfa_earth));

# DFA matches the regexp: ((Hello, |Bye, )(World!|Earth!))
print (join '; ', (generate_words $dfa, $$dfa{INITSTATE}, {} ));
# Prints: Hello, World!; Hello, Earth!; Bye, World!; Bye, Earth!

generate_words generates the whole set of words matched, or croaks
if the set is infinite (I can explain how to rewrite it to match against,
for example, files into a directory) ; it does not use any random
enumeration : every matched word is given.

Of course, translating from a regexp syntax into a dfa using these
functions needs a parser of regexps, which is (very) tedious but is an
exercice on standard regexps, not on the reverse problem.

Here is the code:

# By Frederic Beal, on 2006/02/17.
# This code is in the public domain: do what you want of it, I shall not be
# responsible for whatever comes from it.

use strict;
use warnings;
sub generate_words($$$);

sub generate_words($$$) {
my ($dfa, $init, $visited) = @_;
# dfa: the automaton
# init: the state from which to visit
# visited: states already visited

my @words = ();

push @words, ""
if($init eq $$dfa{FINALSTATE} );

croak("Loop detected")
if exists $$visited{$init};

foreach(@{$$dfa{TRANSITIONS}}) {
next unless $$_[0] eq $init;

if(defined $$_[2]) {
my $letter = $$_[2];
my @foo = generate_words($dfa, $$_[1], { $init => 1, %$visited });
foreach(@foo) {
push @words, "$letter$_";
}
} else {
@words = ( @words, generate_words($dfa, $$_[1], { $init => 1, %$visited }) );
}
}
return @words;
}

my $last_id = 0;

sub gen_id() {
return "s" . $last_id++;
}

sub word_scanner($) {
my $word = shift;
my $init = gen_id;
my $current = $init;
my $last;
my @transitions = ();
my @states = ();

foreach(split '', $word) {
$last = $current;
$current = gen_id;
push @states, $current;
push @transitions, [$last, $current, $_];
}

return { INITSTATE => $init,
FINALSTATE => $current,
STATES => [ @states ],
TRANSITIONS => [ @transitions ] };
}

sub union_scanner($$) {
my ($dfa1, $dfa2) = @_;
my $init = gen_id;
my $final = gen_id;

return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa1{STATES}}, @{$$dfa2{STATES}}, $final ],
TRANSITIONS =>
[ [ $init, $$dfa1{INITSTATE} ],
[ $init, $$dfa2{INITSTATE} ],
[ $$dfa1{FINALSTATE}, $final ],
[ $$dfa2{FINALSTATE}, $final ],
@{$$dfa1{TRANSITIONS}},
@{$$dfa2{TRANSITIONS}} ] };
}

sub concat_scanner($$) {
my ($dfa1, $dfa2) = @_;
return { INITSTATE => $$dfa1{INITSTATE},
FINALSTATE => $$dfa2{FINALSTATE},
STATES => [ @{$$dfa1{STATES}}, @{$$dfa2{STATES}} ],
TRANSITIONS =>
[ @{$$dfa1{TRANSITIONS}},
[ $$dfa1{FINALSTATE}, $$dfa2{INITSTATE} ],
@{$$dfa2{TRANSITIONS}} ] };
}

sub star_scanner($) {
my $dfa = shift;
my $init = gen_id;
my $final = gen_id;

return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa{STATES}}, $final ],
TRANSITIONS =>
[ @{$$dfa{TRANSITIONS}},
[ $init, $final ],
[ $init, $$dfa{INITSTATE} ],
[ $$dfa{FINALSTATE}, $final ],
[ $final, $$dfa{INITSTATE} ] ] };
}

Frederic Beal · Feb 17, 2006

Frederic Beal said:
croak("Loop detected")
if exists $$visited{$init};

s/croak/die/ (too much XS recently)

Regexp slowdown	12	Sep 5, 2006
Generating all permutations from a regexp	9	Dec 22, 2006
Hard regexp problem	4	Oct 23, 2004
Building the finite state machine of a Ruby regexp	3	Aug 1, 2006
Can anyone write this recursion for simple regexp more beautifullyand clearly than the braggarts	157	Aug 29, 2009
generate all possible strings of given length given a set of characters	6	Oct 3, 2005
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012
My Regexp XML Parser -> Structured Perl Data, Cut & Paste Version, No Module's (Vol I)	43	Dec 21, 2005

The inverse problem: generate all instances of a regexp

Andrei Alexandrescu (See Website For Email)

tuser

Ingo Menger

Anno Siegel

John W. Krahn

Ingo Menger

Josef Moellers

A. Sinan Unur

it_says_BALLS_on_your forehead

bugbear

Randal L. Schwartz

Ingo Menger

Ben Bacarisse

Andrei Alexandrescu (See Website For Email)

Andrei Alexandrescu (See Website For Email)

Andrei Alexandrescu (See Website For Email)

Andrei Alexandrescu (See Website For Email)

frederic

Frederic Beal

Frederic Beal

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads