RegExp as Finite State Machine

Georg Jaehnig · Dec 30, 2008

Hi,

I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

But the formalism of FSMs allows much more than RegExps seem to allow.
Intersection for instance:

pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
pattern2 = new RegExp( "(a|b)" ); // contains a,b

// This would be nice:
pattern3 = RegExp.intersect( pattern1, pattern2 )
// pattern3 = /"(a|b)"/

So do you maybe know a library with those functions? If not, is there
any way to access the internal representation of the RegExp object to
implement e.g. an intersection algorithm?

[1] http://en.wikipedia.org/wiki/Finite_state_machine

Stevo · Dec 30, 2008

Georg said:
I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

What ? Which language does a state machine represent ? Greek ? When did
the dictionary-eating types (that like to take a simple concept that can
be described in a paragraph or two, but instead turn it into a 5-page
marketing-speak document) decide to add the word Finite to state
machines just so they can make it a TLA ? It's always been a plain state
machine in my experience. None of the state machines I ever programmed
"described a regular language". They all handled various input events
and made some choices based on the current state, resulting in some
output and sometimes a change in state.

When did RegExp become a state machine ? It's a regular expression
search [and replace] function.

But the formalism of FSMs allows much more than RegExps seem to allow.

What ? Oh it's the formalism that does it, I never realized it was the
formalism. Is it's RegExps informalism (or is it unformalisticness?)
that lets it down ?

Intersection for instance:

pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
pattern2 = new RegExp( "(a|b)" ); // contains a,b

// This would be nice:
pattern3 = RegExp.intersect( pattern1, pattern2 )
// pattern3 = /"(a|b)"/

So do you maybe know a library with those functions? If not, is there
any way to access the internal representation of the RegExp object to
implement e.g. an intersection algorithm?

[1] http://en.wikipedia.org/wiki/Finite_state_machine

I very much doubt you'll find anything like that.

Thomas 'PointedEars' Lahn · Jan 1, 2009

Georg said:
I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

A regular expression *engine* must be an FSM, because the formal
language it accepts is regular -- Regular Expressions.

AFAIK, a RegEx engine is implemented either as a DFA or an NFA. See
also: Friedl, Jeffrey E. F.; Mastering Regular Expressions, Second
Edition, chapter 4 "The Mechanics of Expression Processing", O'Reilly,
1997 (you can read that sample chapter online, for free); maybe it is
also in the Third Edition, O'Reilly, 2006.

The Specification of ECMAScript RegExp processing can be found in the
ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
about page 160. There are at least two implementations of it that are
Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
WebKit).

But the formalism of FSMs allows much more than RegExps seem to allow.

Of course. The former is an formal automaton, the latter is the
implementation of a formal language.

Intersection for instance:

pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
pattern2 = new RegExp( "(a|b)" ); // contains a,b

// This would be nice:
pattern3 = RegExp.intersect( pattern1, pattern2 )
// pattern3 = /"(a|b)"/

So do you maybe know a library with those functions?

My RegExp library, but it is not online yet.

If not, is there
any way to access the internal representation of the RegExp object to
implement e.g. an intersection algorithm?

(I'll answer anyway if you don't mind.)

An FSM reads a string of characters from an input alphabet. So if you
want to do with RegExp *objects* what you can do with a modified FSM,
you have to use the *string* representation of the object.

If you had cared to read [ES3F], you would have known that all
ECMAScript values have such a representation. RegExp objects, in
particular, inherit a property, â€˜sourceâ€™, which evaluates to it, and a
method, toString(), which returns it. (I am not completely sure as to
the availability of the former yet. It is specified, but I have only
tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
this iPhone 3G [firmware version 2.2 (5G77)].)

Trimming and concatenating the results, and passing them to the RegExp
factory/constructor is left as an exercise to the reader.

HTH

PointedEars

Stevo · Jan 1, 2009

Thomas said:
Georg said:

I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

Click to expand...

A regular expression *engine* must be an FSM, because the formal
language it accepts is regular -- Regular Expressions.

You found a kindred spirit Georg. One that speaks your language.

Of course. The former is an formal automaton, the latter is the
implementation of a formal language.

PointedEars

Happy New Year Thomas, You're my hero

Lasse Reichstein Nielsen · Jan 4, 2009

Thomas 'PointedEars' Lahn said:
Georg said:

I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

Click to expand...

A regular expression *engine* must be an FSM, because the formal
language it accepts is regular -- Regular Expressions.

It can be a FSM, but it doesn't have to be one. For the "regular"
expressions in Javascript, it can't be, since they recognize
non-regular languages (but only because of back-references).

Classsical counter-example of regularness:
/^(11+)\1+$/
which recognizes composite numbers (non-primes) in unary notation.

The original definition of Javascript regular expressions was based on
the backtracking implementations in Netscape and IE. The specification
is given specifically in terms of when to backtrack, which makes it
harder to implement the specified semantics without backtracking (e.g.,
if there are more ways to match, the specification is to pick the "first"
one, not the longest one as in other regexp-implementations).

Of the current crop of browsers, AFAIK none of them use a FSM
based implementation of regular expressions.

The Specification of ECMAScript RegExp processing can be found in the
ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
about page 160. There are at least two implementations of it that are
Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
WebKit).

The one in JavaScriptCore is currently being rewritten from scratch.
The original version is an adaption of PCRE (Perl-regexps) to Javascript
regexps, and can't really be recommended as an example.
The new version is not feature complete yet.

/L

Thomas 'PointedEars' Lahn · Jan 4, 2009

Lasse said:
Thomas 'PointedEars' Lahn said:

Georg said:

I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

Click to expand...

A regular expression *engine* must be an FSM, because the formal
language it accepts is regular -- Regular Expressions.

Click to expand...

It can be a FSM, but it doesn't have to be one.

True, it could also be an alternating finite automaton or a read-only Turing
machine. (I'd like to see that!)

For the "regular" expressions in Javascript, it can't be, since they recognize
non-regular languages (but only because of back-references).

Classsical counter-example of regularness:
/^(11+)\1+$/
which recognizes composite numbers (non-primes) in unary notation.

In which way is that proof that ECMAScript Regular Expressions are not a
regular language?

Of the current crop of browsers, AFAIK none of them use a FSM
based implementation of regular expressions.

So sure are you, despite other engines that also support backreferences
(like egrep) have been demonstrated to use FSM implementations?

<http://oreilly.com/catalog/regex/chapter/ch04.html>

PointedEars

Lasse Reichstein Nielsen · Jan 4, 2009

Thomas 'PointedEars' Lahn said:
Lasse Reichstein Nielsen wrote:

In which way is that proof that ECMAScript Regular Expressions are not a
regular language?

The language is not regular (It's composite, the language of "1"'s of
prime length, is not regular. This is failry easily proved by assuming
it's regular, and using pumping to derive a contradition).

The expression is a valid Javascript regular expression that recognizes
a non-regular language, so Javascript regexps are not restricted to the
regular languages (and therefore not possible to implement using only
a finite state machine).

So sure are you, despite other engines that also support backreferences
(like egrep) have been demonstrated to use FSM implementations?
<http://oreilly.com/catalog/regex/chapter/ch04.html>

Referring to this?
"You might, however, notice that GNU's version of egrep does support
backreferences. It does so by having two complete engines under the
hood! It first uses a DFA engine to see whether a match is likely,
and then uses an NFA engine (which supports the full flavor,
including backreferences) to confirm the match."

I'm only talking about the current regexp implementations because I
have actually been looking at them.

What I probably /should/ have also said, is that by FSM I mean a
deterministic machine (i.e., a DFA, one that is run linearly over the
input). A nondeterministic machine can use any number of tricks to
implement the nondeterminism, including backtracking.

/L

Thomas 'PointedEars' Lahn · Jan 4, 2009

Lasse said:
The language is not regular (It's composite, the language of "1"'s of
prime length, is not regular. This is failry easily proved by assuming
it's regular, and using pumping to derive a contradition).

Ah, but I think you confuse the language that a regular expression can
match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
language, L_r. In order to prove that the ECMAScript Regular Expression
language is not regular, you have to apply the pumping lemma for regular
languages to the expression itself, not to the strings it could match.

The expression is a valid Javascript regular expression that recognizes
a non-regular language, so Javascript regexps are not restricted to the
regular languages (and therefore not possible to implement using only
a finite state machine).

Sorry, I don't think you are correct here.

Referring to this?
"You might, however, notice that GNU's version of egrep does support
backreferences. It does so by having two complete engines under the
hood! It first uses a DFA engine to see whether a match is likely,
and then uses an NFA engine (which supports the full flavor,
including backreferences) to confirm the match."

Yes, exactly.

I'm only talking about the current regexp implementations because I
have actually been looking at them.

We'll see.

What I probably /should/ have also said, is that by FSM I mean a
deterministic machine (i.e., a DFA, one that is run linearly over the
input). A nondeterministic machine can use any number of tricks to
implement the nondeterminism, including backtracking.

But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
also called "nondeterministic finite state machines", are definitely a
subset of Finite State Machines (FSMs).

<http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine>

PointedEars

Stevo · Jan 5, 2009

Thomas said:
Ah, but I think you confuse the language that a regular expression can
match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
language, L_r. In order to prove that the ECMAScript Regular Expression
language is not regular, you have to apply the pumping lemma for regular
languages to the expression itself, not to the strings it could match.

Anyone else brave enough to admit this is way over their head? Check out
these two links. The first one might as well be in Japanese for the
amount of sense it makes to me:

http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine#Properties_of_NFA-.CE.B5

This one initially looks less complicated, but after reading the whole
section I'm left open-mouthed and saying "Huh?" to myself...

http://en.wikipedia.org/wiki/Regular_languages#Deciding_whether_a_language_is_regular

I'm not sure if this newsgroup will consider everything after the # on
those links as part of the links. We'll see.

Lasse Reichstein Nielsen · Jan 5, 2009

Thomas 'PointedEars' Lahn said:
Lasse Reichstein Nielsen wrote:
[/^(11+)\1+$/]

Ah, but I think you confuse the language that a regular expression can
match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
language, L_r.

Probably. The latter is much less interesting (and definitly context free)

The language recognized by the regexp above is actually not even
context-free. For a language over an alphabet with only one element,
the pumping lemma for context free languages is equivalent to the
one for regular languages. It's probably Type-1.
IIRC, it's known to be in (NP intersect co-NP).

In order to prove that the ECMAScript Regular Expression
language is not regular, you have to apply the pumping lemma for regular
languages to the expression itself, not to the strings it could match.

It's even simpler than that. The regular expression language has
matched parentheses to an arbitrary depth. That alone is enough to
rule out being regular.

Sorry, I don't think you are correct here.

I should be, unless there is a flaw in the logic.

If all Javascript regexps can be implemented using only finite state
machines, then all the languages recognized by them are regular (a FSM
can only recognize a regular language).

Since the language of /^(11+)\1+$/ is not regular, and is a Javascript
regexp, there is at least one language that cannot be implemented using
a FSM.

...

But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
also called "nondeterministic finite state machines", are definitely a
subset of Finite State Machines (FSMs).

That is definitly true. However, we don't have the hardware yet to
implement NFA's directly, so we have to emulate the nondeterminism
somehow. The typical way is to implement a nondeterministic choice by
trying one branch and setting up a backtrack point to try the other in
case the first fails. This gets you a stack of backtrack points, which
means that the implementation no longer uses a finite amount of
space. The space used can depend on the input, not only the automaton.
This is why I don't think of this implementation as really being a
finite state machine, although it is emulating one at a certain
level of abstraction.

There are other ways to implement a NFA that doesn't use a stack,
but instead determinizes the DFA (possibly lazily during execution).
This can instead lead to exponential space blow-up, which is still
finite, but not always manageable.

/L

Georg Jaehnig · Jan 5, 2009

Hi,

thanks for your answer!

MyRegExplibrary, but it is not online yet.

Can't wait to see it!

If you had cared to read [ES3F], you would have known that all
ECMAScript values have such a representation.RegExpobjects, in
particular, inherit a property, ‘source’, which evaluates to it, and a
method, toString(), which returns it. (I am not completely sure as to
the availability of the former yet. It is specified, but I have only
tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
this iPhone 3G [firmware version 2.2 (5G77)].)

Trimming and concatenating the results, and passing them to theRegExp
factory/constructor is left as an exercise to the reader.

Well, I would know how to implement the union operation in this way:

var reA = new RegExp("a");
var reB = new RegExp("b");
var reUnion = new RegExp( "(" + reA.source + "|" + reB.source + ")" );

But how would you do intersection?

One problem I would like to solve (and until now I think it needs
intersection) is to check, if a string is a prefix of a regex. For
instance:

var re = new RegExp("abc|bc");
var s = "a";

My idea is to create a new RegExp concatening s and .*:
var sRe = new RegExp(s + ".*");
and see if the intersection of re and sRe is not empty.

But if there's a simpler way maybe I wouldn't need intersection.

Lasse Reichstein Nielsen · Jan 5, 2009

.... (RegExp composition) ...

But how would you do intersection?

One problem I would like to solve (and until now I think it needs
intersection) is to check, if a string is a prefix of a regex. For
instance:

var re = new RegExp("abc|bc");
var s = "a";

I.e., whether there exists a suffix such that re.test(s+suffix)?
That's a hard problem in general. Very hard, actually. I'm guessing
exponential space (but most problems with regexps are that hard
in their full generality).

My idea is to create a new RegExp concatening s and .*:
var sRe = new RegExp(s + ".*");
and see if the intersection of re and sRe is not empty.

But if there's a simpler way maybe I wouldn't need intersection.

I doubt there is. If you just had to create a regexp that matched
the intersection, you could use look-ahead:
/(?=a)(?:abc|bc)/
That creates a regexp for the intersection of the two languages,
but you still need to figure out whether there is a string that is
matched by that. That's the hard part.

/L

Michael Wojcik · Jan 5, 2009

Stevo said:
[discussion of FSMs and regular languages]

Click to expand...

Anyone else brave enough to admit this is way over their head?

Why does that require bravery?

Some people have studied computer science. Others have not. This is
not a moral failing.

Apparently you have not. We could have guessed that from your first
post to this thread, so it's not clear to me why you continued to
provide us with evidence.

I'm not sure if this newsgroup will consider everything after the # on
those links as part of the links. We'll see.

The newsgroup has no opinions about syntax. Usenet propagation
mechanisms such as NNTP have syntactic rules, but do not care about
"links" (URLs) in the message body. Usenet user agents may attempt to
detect URLs in the message body, but that's an application feature
which depends entirely on the user agent in question.

Thomas 'PointedEars' Lahn · Jan 6, 2009

Georg said:
thanks for your answer!

You are welcome.

Can't wait to see it!

So do I

If you had cared to read [ES3F], you would have known that all
ECMAScript values have such a representation.RegExpobjects, in
particular, inherit a property, â€˜sourceâ€™, which evaluates to it, and a
method, toString(), which returns it. (I am not completely sure as to
the availability of the former yet. It is specified, but I have only
tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
this iPhone 3G [firmware version 2.2 (5G77)].)

Trimming and concatenating the results, and passing them to theRegExp
factory/constructor is left as an exercise to the reader.

Click to expand...

Well, I would know how to implement the union operation in this way:

var reA = new RegExp("a");
var reB = new RegExp("b");
var reUnion = new RegExp( "(" + reA.source + "|" + reB.source + ")" );

But how would you do intersection?

Given your initial problem (I'm declaring identifiers here as you should
have) --

var pattern1 = new RegExp("(a|b|c)"); // contains a,b,c
var pattern2 = new RegExp("(a|b)"); // contains a,b

// This would be nice:
var pattern3 = RegExp.intersect( pattern1, pattern2 )
// pattern3 = /"(a|b)"/

-- simple intersection would be fairly easy:

RegExp.prototype.intersect = function(pattern2) {
/* Rule out invalid values */
if (!pattern2 || pattern2.constructor != RegExp) return null;

var
s = this.source.replace(/^$?([^)]*)$?$/, "$1"),
s2 = pattern2.source.replace(/^$?([^)]*)$?$/, "$1");

/* Register all parts within alternation of this pattern */
var a = s.split("|"), o = {};
for (var i = a.length; i--

o[a] = true;

/* Register all parts within alternation of pattern2 */
var a2 = s2.split("|"), o2 = {};
for (i = a2.length; i-- o2[a2] = true;

/* Compose the new alternation out of common parts */
var hOP = (function() {
if (typeof Object.prototype.hasOwnProperty == "function")
{
return function(o, p) {
return o.hasOwnProperty(p);
};
}
else
{
// suffices *here*
return function(o, p) {
return typeof o[p] != "undefined"
&& typeof o.constructor.prototype[p] == "undefined";
};
}
})();

a = [];
for (var p in o)
{
if (hOP(o2, p)) a.push(p);
}

return new RegExp("(" + a.join("|") + ")");
};

var pattern1 = /(a|b|c)/;
var pattern2 = /(a|b)/;
var pattern3 = pattern1.intersect(pattern2);

// yields /(a|b)/ or /(b|a)/ (implementation-dependent)
console.log(pattern3);

One problem I would like to solve (and until now I think it needs
intersection) is to check, if a string is a prefix of a regex. [...]

Click to expand...

See Lasse's answer.

PointedEars

Georg Jaehnig · Jan 6, 2009

Hello,

I.e., whether there exists a suffix such that re.test(s+suffix)?
That's a hard problem in general. Very hard, actually. I'm guessing
exponential space

I don't think so. "s + any suffix" is a regular language because it
can be described as a common regex: /s.*/

Checking if "s + any suffix" matches a regex is actually checking if
the intersection of /s.*/ and the regex is not empty. Intersection of
2 regexes is intersection of 2 finite state machines which needs
quadratic space (no. of states of FSA1 times no. of states of FSA2).

Checking if a FSA describes only the empty language is only checking
if it has no final states.

I doubt there is. If you just had to create a regexp that matched
the intersection, you could use look-ahead:
/(?=a)(?:abc|bc)/

Cool, lookahead really seems to model intersection, so let's have a
nice function:

RegExp.prototype.intersect = function( pattern2 ) {
return new RegExp( '(?=' + this.source + ')(?=' + pattern2.source +
')' );
}

That creates a regexp for the intersection of the two languages,
but you still need to figure out whether there is a string that is
matched by that. That's the hard part.

I see. Ah, too bad! Because if the RegExp object would be implemented
as a FSA, technically it should be really easy (see above - just
checking if there are final states).

Georg Jaehnig · Jan 6, 2009

Hi,

-- simple intersection would be fairly easy:

RegExp.prototype.intersect = function(pattern2) {
/* Rule out invalid values */
if (!pattern2 || pattern2.constructor != RegExp) return null;

var
s = this.source.replace(/^$?([^)]*)$?$/, "$1"),
s2 = pattern2.source.replace(/^$?([^)]*)$?$/, "$1");

/* Register all parts within alternation of this pattern */
var a = s.split("|"), o = {};
for (var i = a.length; i-- o[a] = true;

/* Register all parts within alternation of pattern2 */
var a2 = s2.split("|"), o2 = {};
for (i = a2.length; i-- o2[a2] = true;

/* Compose the new alternation out of common parts */
var hOP = (function() {
if (typeof Object.prototype.hasOwnProperty == "function")
{
return function(o, p) {
return o.hasOwnProperty(p);
};
}
else
{
// suffices *here*
return function(o, p) {
return typeof o[p] != "undefined"
&& typeof o.constructor.prototype[p] == "undefined";
};
}
})();

a = [];
for (var p in o)
{
if (hOP(o2, p)) a.push(p);
}

return new RegExp("(" + a.join("|") + ")");
};

var pattern1 = /(a|b|c)/;
var pattern2 = /(a|b)/;
var pattern3 = pattern1.intersect(pattern2);

// yields /(a|b)/ or /(b|a)/ (implementation-dependent)
console.log(pattern3);

This function seems to work only for alternation regexes but gives
weird results on simple cases like this:

var pattern1 = /./;
var pattern2 = /a/;
var pattern3 = pattern1.intersection(pattern2)

document.write( pattern3.source + " -- " + pattern3.test("b") );

I would want /a/ -- false
but it gives () -- true

Thomas 'PointedEars' Lahn · Jan 6, 2009

Georg said:
Thomas 'PointedEars' Lahn wrote:
[...]

-- simple intersection would be fairly easy:

RegExp.prototype.intersect = function(pattern2) {
/* Rule out invalid values */
if (!pattern2 || pattern2.constructor != RegExp) return null;

var
s = this.source.replace(/^$?([^)]*)$?$/, "$1"),
s2 = pattern2.source.replace(/^$?([^)]*)$?$/, "$1");

/* Register all parts within alternation of this pattern */
var a = s.split("|"), o = {};
for (var i = a.length; i-- o[a] = true;

/* Register all parts within alternation of pattern2 */
var a2 = s2.split("|"), o2 = {};
for (i = a2.length; i-- o2[a2] = true;

/* Compose the new alternation out of common parts */
var hOP = (function() {
if (typeof Object.prototype.hasOwnProperty == "function")
{
return function(o, p) {
return o.hasOwnProperty(p);
};
}
else
{
// suffices *here*
return function(o, p) {
return typeof o[p] != "undefined"
&& typeof o.constructor.prototype[p]== "undefined";
};
}
})();

a = [];
for (var p in o)
{
if (hOP(o2, p)) a.push(p);
}

return new RegExp("(" + a.join("|") + ")");
};

var pattern1 = /(a|b|c)/;
var pattern2 = /(a|b)/;
var pattern3 = pattern1.intersect(pattern2);

// yields /(a|b)/ or /(b|a)/ (implementation-dependent)
console.log(pattern3);

Click to expand...

This function seems to work only for alternation regexes but gives
weird results on simple cases like this:

var pattern1 = /./;
var pattern2 = /a/;
var pattern3 = pattern1.intersection(pattern2)

document.write( pattern3.source + " -- " + pattern3.test("b") );

I would want /a/ -- false

This method is a quick hack that handles *simple* strings (hence my
saying "simple intersection"); it was not intended as or is supposed
to be a complete solution.

Happy coding.

but it gives () -- true

Click to expand...

Of course, the empty string matches everything.

PointedEars

Lasse Reichstein Nielsen · Jan 6, 2009

Georg Jaehnig said:
Hello,

I don't think so. "s + any suffix" is a regular language because it
can be described as a common regex: /s.*/

Checking if "s + any suffix" matches a regex is actually checking if
the intersection of /s.*/ and the regex is not empty. Intersection of
2 regexes is intersection of 2 finite state machines which needs
quadratic space (no. of states of FSA1 times no. of states of FSA2).

I think you are right.

I was thinking of a problem, sometimes called EQREX, of determining
whether two regular expressions define the same languge. That one is
EXP-SPACE complete, but this problem is actually simpler (you only
need to find one matching string to prove the property, not check
something for all strings).

You can stay with nondeterministic automatons (NFA) and still solve
the problem, and for a regexp, you can generate a NFA that is
proportional in size to the regexp (the corresponding DFA could be
exponential, though).

Whether an NFA accepts any string is solvable by a simple algorithm
(perhaps as simple as finding the shortest path from the start state
to all other states, and see if you hit an accept state). At worst
it's quadratic, and you'll even find a shortest matched string

Checking if a FSA describes only the empty language is only checking
if it has no final states.

The direct product construction will create all the product states,
including all possible accept states. You need to see if they are
reachable (or prune non-reachable states and see if any are left).
But definitly doable.

RegExp.prototype.intersect = function( pattern2 ) {
return new RegExp( '(?=' + this.source + ')(?=' + pattern2.source +
')' );
}

No need to use look-ahead on the second pattern.

You might need to rename any back-references in the second pattern, if
both patterns contain captures.

Apart from that, it should work.

/L

Thomas 'PointedEars' Lahn · Jan 6, 2009

kangax said:
Thomas said:

[`RegExp.prototype.intersect` implementation]

Click to expand...

It might be worth mentioning that this implementation will not work with
some of the escaped characters (i.e. "|", "(" and ")") [1] as well as
with nested groups [2]:

Hence "*simple* intersection".

[1] `/(a|b|c)/` and `/(a|b\|c)/` produce `/(a|c)/` instead of a proper `a`

I don't see how this can be accomplished with using .source.split
(/.../) since we don't have negative lookbehind (?<!) in ECMAScript
implementations with which you could exclude `\|' as a delimiter; so
it probably needs to be solved with RegExp-based string parsing.

[2] `/(a|b)/` and `/(a|b(xx)?)/` produce `/()/` instead of, at least, `/a/`

I'm looking forward to your suggestions.

PointedEars

Lasse Reichstein Nielsen · Jan 6, 2009

Lasse Reichstein Nielsen said:
No need to use look-ahead on the second pattern.

You might need to rename any back-references in the second pattern, if
both patterns contain captures.

Apart from that, it should work.

Ahem, with a few caveats.

The resulting regexp test if both of the original regexps match an input
string *starting at the same point*.

The regexp is unanchored if the original patters are, and then it can
match at any point of the input.

They don't necessarily match the same substring at that point.
I.e.
/(?=ab)a/
will match "cab", because one string matches "a" and the other "ab".

That might or might not be what you want.
It's reasonable if you consider "cab" to be in the "language of the
regexp" for both patterns ("a" and "ab"). However, in that case it
would be wrong that /(?=c)ab/ doesn't match. You would need to allow
for different starting points: /(?=[^]*c)[^]*ab/ (where [^] is just the
shortest way to say "any character"

.

If you don't want "cab" to be in the "intersection" of /a/ and /ab/,
then you should probably anchor both patterns to begin with, i.e.,
/^ab$/ and /^a$/)

It all boils down to which definition of "intersection" and "language
of a regexp" that is used.

/L 'details.devil != undefined'

Building the finite state machine of a Ruby regexp	3	Aug 1, 2006
regexp finite state machine?	0	Mar 8, 2005
finite state machine : transitions in absence of external events	2	Sep 16, 2009
State machine builder: looking for feedback	0	Dec 30, 2007
Looking for Python Finite State Machine Implementations	0	Jun 24, 2005
FSM (Finite State Machine) Generator - Open Source	1	Feb 3, 2004
FSM (Finite State Machine) Generator - Open Source	10	Feb 3, 2004
FSM (Finite State Machine) Generator - Open Source	2	Feb 3, 2004

RegExp as Finite State Machine

Georg Jaehnig

Stevo

Thomas 'PointedEars' Lahn

Stevo

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn

Stevo

Lasse Reichstein Nielsen

Georg Jaehnig

Lasse Reichstein Nielsen

Michael Wojcik

Thomas 'PointedEars' Lahn

Georg Jaehnig

Georg Jaehnig

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn

Lasse Reichstein Nielsen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads