Andrei Alexandrescu (See Website For Email) said:
I wonder how this could be done elegantly. Given a regular expression
$re, write a function Generate($) that returns a list of all possible
strings that could match $re. If the set is infinite, die with an
error message.
For example, if $re = '(debug|release)/progname\.(c|h)', Generate($re)
should return four strings:
'debug/progname.c'
'debug/progname.h'
'release/progname.c'
'release/progname.h'
If, on the other hand, $re = '(debug/release)/.*\.(c|h)' then
Generate($re) should die.
Any ideas on how to do that in a clever manner? (The only way that I
think of is to parse $re and use recursion etc.)
From a theoretical point of view : the easiest way to do what you want is to
turn your regular expression into a finite automaton (see, for example,
http://lambda.uta.edu/cse5317/spring01/notes/node9.html
and nodes around). Then, enumerating the words recognized by a f.a. or
telling if it matches an infinite set of words is easy. I give a simple
implementation of it at the end of this message. Here is a simple
example of use:
my $dfa_hello = word_scanner "Hello, ";
my $dfa_bye = word_scanner "Bye, ";
my $dfa_world = word_scanner "World!";
my $dfa_earth = word_scanner "Earth!";
my $dfa = concat_scanner((union_scanner $dfa_hello, $dfa_bye),
(union_scanner $dfa_world, $dfa_earth));
# DFA matches the regexp: ((Hello, |Bye, )(World!|Earth!))
print (join '; ', (generate_words $dfa, $$dfa{INITSTATE}, {} ));
# Prints: Hello, World!; Hello, Earth!; Bye, World!; Bye, Earth!
generate_words generates the whole set of words matched, or croaks
if the set is infinite (I can explain how to rewrite it to match against,
for example, files into a directory) ; it does not use any random
enumeration : every matched word is given.
Of course, translating from a regexp syntax into a dfa using these
functions needs a parser of regexps, which is (very) tedious but is an
exercice on standard regexps, not on the reverse problem.
Here is the code:
# By Frederic Beal, on 2006/02/17.
# This code is in the public domain: do what you want of it, I shall not be
# responsible for whatever comes from it.
use strict;
use warnings;
sub generate_words($$$);
sub generate_words($$$) {
my ($dfa, $init, $visited) = @_;
# dfa: the automaton
# init: the state from which to visit
# visited: states already visited
my @words = ();
push @words, ""
if($init eq $$dfa{FINALSTATE} );
croak("Loop detected")
if exists $$visited{$init};
foreach(@{$$dfa{TRANSITIONS}}) {
next unless $$_[0] eq $init;
if(defined $$_[2]) {
my $letter = $$_[2];
my
@foo = generate_words($dfa, $$_[1], { $init => 1, %$visited });
foreach(
@foo) {
push @words, "$letter$_";
}
} else {
@words = ( @words, generate_words($dfa, $$_[1], { $init => 1, %$visited }) );
}
}
return @words;
}
my $last_id = 0;
sub gen_id() {
return "s" . $last_id++;
}
sub word_scanner($) {
my $word = shift;
my $init = gen_id;
my $current = $init;
my $last;
my @transitions = ();
my @states = ();
foreach(split '', $word) {
$last = $current;
$current = gen_id;
push @states, $current;
push @transitions, [$last, $current, $_];
}
return { INITSTATE => $init,
FINALSTATE => $current,
STATES => [ @states ],
TRANSITIONS => [ @transitions ] };
}
sub union_scanner($$) {
my ($dfa1, $dfa2) = @_;
my $init = gen_id;
my $final = gen_id;
return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa1{STATES}}, @{$$dfa2{STATES}}, $final ],
TRANSITIONS =>
[ [ $init, $$dfa1{INITSTATE} ],
[ $init, $$dfa2{INITSTATE} ],
[ $$dfa1{FINALSTATE}, $final ],
[ $$dfa2{FINALSTATE}, $final ],
@{$$dfa1{TRANSITIONS}},
@{$$dfa2{TRANSITIONS}} ] };
}
sub concat_scanner($$) {
my ($dfa1, $dfa2) = @_;
return { INITSTATE => $$dfa1{INITSTATE},
FINALSTATE => $$dfa2{FINALSTATE},
STATES => [ @{$$dfa1{STATES}}, @{$$dfa2{STATES}} ],
TRANSITIONS =>
[ @{$$dfa1{TRANSITIONS}},
[ $$dfa1{FINALSTATE}, $$dfa2{INITSTATE} ],
@{$$dfa2{TRANSITIONS}} ] };
}
sub star_scanner($) {
my $dfa = shift;
my $init = gen_id;
my $final = gen_id;
return { INITSTATE => $init,
FINALSTATE => $final,
STATES => [ $init, @{$$dfa{STATES}}, $final ],
TRANSITIONS =>
[ @{$$dfa{TRANSITIONS}},
[ $init, $final ],
[ $init, $$dfa{INITSTATE} ],
[ $$dfa{FINALSTATE}, $final ],
[ $final, $$dfa{INITSTATE} ] ] };
}