RegExp as Finite State Machine

Discussion in 'Javascript' started by Georg Jaehnig, Dec 30, 2008.

  1. Hi,

    I wonder if it's possible to have Finite State Machines (FSM) [1] in
    Javascript. Basically, they must be already implemented because a
    RegExp is a FSM , both describe a regular language.

    But the formalism of FSMs allows much more than RegExps seem to allow.
    Intersection for instance:

    pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
    pattern2 = new RegExp( "(a|b)" ); // contains a,b

    // This would be nice:
    pattern3 = RegExp.intersect( pattern1, pattern2 )
    // pattern3 = /"(a|b)"/

    So do you maybe know a library with those functions? If not, is there
    any way to access the internal representation of the RegExp object to
    implement e.g. an intersection algorithm?

    [1] http://en.wikipedia.org/wiki/Finite_state_machine

    --
    Georg | http://serchilo.net - command the web
     
    Georg Jaehnig, Dec 30, 2008
    #1
    1. Advertising

  2. Georg Jaehnig

    Stevo Guest

    Georg Jaehnig wrote:
    > I wonder if it's possible to have Finite State Machines (FSM) [1] in
    > Javascript. Basically, they must be already implemented because a
    > RegExp is a FSM , both describe a regular language.


    What ? Which language does a state machine represent ? Greek ? When did
    the dictionary-eating types (that like to take a simple concept that can
    be described in a paragraph or two, but instead turn it into a 5-page
    marketing-speak document) decide to add the word Finite to state
    machines just so they can make it a TLA ? It's always been a plain state
    machine in my experience. None of the state machines I ever programmed
    "described a regular language". They all handled various input events
    and made some choices based on the current state, resulting in some
    output and sometimes a change in state.

    When did RegExp become a state machine ? It's a regular expression
    search [and replace] function.

    > But the formalism of FSMs allows much more than RegExps seem to allow.


    What ? Oh it's the formalism that does it, I never realized it was the
    formalism. Is it's RegExps informalism (or is it unformalisticness?)
    that lets it down ?

    > Intersection for instance:
    >
    > pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
    > pattern2 = new RegExp( "(a|b)" ); // contains a,b
    >
    > // This would be nice:
    > pattern3 = RegExp.intersect( pattern1, pattern2 )
    > // pattern3 = /"(a|b)"/
    >
    > So do you maybe know a library with those functions? If not, is there
    > any way to access the internal representation of the RegExp object to
    > implement e.g. an intersection algorithm?
    >
    > [1] http://en.wikipedia.org/wiki/Finite_state_machine


    I very much doubt you'll find anything like that.
     
    Stevo, Dec 30, 2008
    #2
    1. Advertising

  3. Georg Jaehnig wrote:
    > I wonder if it's possible to have Finite State Machines (FSM) [1] in
    > Javascript. Basically, they must be already implemented because a
    > RegExp is a FSM , both describe a regular language.


    A regular expression *engine* must be an FSM, because the formal
    language it accepts is regular -- Regular Expressions.

    AFAIK, a RegEx engine is implemented either as a DFA or an NFA. See
    also: Friedl, Jeffrey E. F.; Mastering Regular Expressions, Second
    Edition, chapter 4 "The Mechanics of Expression Processing", O'Reilly,
    1997 (you can read that sample chapter online, for free); maybe it is
    also in the Third Edition, O'Reilly, 2006.

    The Specification of ECMAScript RegExp processing can be found in the
    ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
    about page 160. There are at least two implementations of it that are
    Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
    WebKit).

    > But the formalism of FSMs allows much more than RegExps seem to allow.


    Of course. The former is an formal automaton, the latter is the
    implementation of a formal language.

    > Intersection for instance:
    >
    > pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
    > pattern2 = new RegExp( "(a|b)" ); // contains a,b
    >
    > // This would be nice:
    > pattern3 = RegExp.intersect( pattern1, pattern2 )
    > // pattern3 = /"(a|b)"/
    >
    > So do you maybe know a library with those functions?


    My RegExp library, but it is not online yet.

    > If not, is there
    > any way to access the internal representation of the RegExp object to
    > implement e.g. an intersection algorithm?


    (I'll answer anyway if you don't mind.)

    An FSM reads a string of characters from an input alphabet. So if you
    want to do with RegExp *objects* what you can do with a modified FSM,
    you have to use the *string* representation of the object.

    If you had cared to read [ES3F], you would have known that all
    ECMAScript values have such a representation. RegExp objects, in
    particular, inherit a property, ‘source’, which evaluates to it, and a
    method, toString(), which returns it. (I am not completely sure as to
    the availability of the former yet. It is specified, but I have only
    tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
    this iPhone 3G [firmware version 2.2 (5G77)].)

    Trimming and concatenating the results, and passing them to the RegExp
    factory/constructor is left as an exercise to the reader.


    HTH

    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 1, 2009
    #3
  4. Georg Jaehnig

    Stevo Guest

    Thomas 'PointedEars' Lahn wrote:
    > Georg Jaehnig wrote:
    >> I wonder if it's possible to have Finite State Machines (FSM) [1] in
    >> Javascript. Basically, they must be already implemented because a
    >> RegExp is a FSM , both describe a regular language.

    >
    > A regular expression *engine* must be an FSM, because the formal
    > language it accepts is regular -- Regular Expressions.


    You found a kindred spirit Georg. One that speaks your language.

    >> But the formalism of FSMs allows much more than RegExps seem to allow.

    >
    > Of course. The former is an formal automaton, the latter is the
    > implementation of a formal language.
    >
    > PointedEars


    Happy New Year Thomas, You're my hero :)
     
    Stevo, Jan 1, 2009
    #4
  5. Thomas 'PointedEars' Lahn <> writes:

    > Georg Jaehnig wrote:
    >> I wonder if it's possible to have Finite State Machines (FSM) [1] in
    >> Javascript. Basically, they must be already implemented because a
    >> RegExp is a FSM , both describe a regular language.

    >
    > A regular expression *engine* must be an FSM, because the formal
    > language it accepts is regular -- Regular Expressions.


    It can be a FSM, but it doesn't have to be one. For the "regular"
    expressions in Javascript, it can't be, since they recognize
    non-regular languages (but only because of back-references).

    Classsical counter-example of regularness:
    /^(11+)\1+$/
    which recognizes composite numbers (non-primes) in unary notation.

    The original definition of Javascript regular expressions was based on
    the backtracking implementations in Netscape and IE. The specification
    is given specifically in terms of when to backtrack, which makes it
    harder to implement the specified semantics without backtracking (e.g.,
    if there are more ways to match, the specification is to pick the "first"
    one, not the longest one as in other regexp-implementations).

    Of the current crop of browsers, AFAIK none of them use a FSM
    based implementation of regular expressions.

    > The Specification of ECMAScript RegExp processing can be found in the
    > ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
    > about page 160. There are at least two implementations of it that are
    > Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
    > WebKit).


    The one in JavaScriptCore is currently being rewritten from scratch.
    The original version is an adaption of PCRE (Perl-regexps) to Javascript
    regexps, and can't really be recommended as an example.
    The new version is not feature complete yet.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 4, 2009
    #5
  6. Lasse Reichstein Nielsen wrote:
    > Thomas 'PointedEars' Lahn <> writes:
    >> Georg Jaehnig wrote:
    >>> I wonder if it's possible to have Finite State Machines (FSM) [1] in
    >>> Javascript. Basically, they must be already implemented because a
    >>> RegExp is a FSM , both describe a regular language.

    >> A regular expression *engine* must be an FSM, because the formal
    >> language it accepts is regular -- Regular Expressions.

    >
    > It can be a FSM, but it doesn't have to be one.


    True, it could also be an alternating finite automaton or a read-only Turing
    machine. (I'd like to see that!)

    > For the "regular" expressions in Javascript, it can't be, since they recognize
    > non-regular languages (but only because of back-references).
    >
    > Classsical counter-example of regularness:
    > /^(11+)\1+$/
    > which recognizes composite numbers (non-primes) in unary notation.


    In which way is that proof that ECMAScript Regular Expressions are not a
    regular language?

    > Of the current crop of browsers, AFAIK none of them use a FSM
    > based implementation of regular expressions.


    So sure are you, despite other engines that also support backreferences
    (like egrep) have been demonstrated to use FSM implementations?

    <http://oreilly.com/catalog/regex/chapter/ch04.html>


    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 4, 2009
    #6
  7. Thomas 'PointedEars' Lahn <> writes:

    > Lasse Reichstein Nielsen wrote:


    >> Classsical counter-example of regularness:
    >> /^(11+)\1+$/
    >> which recognizes composite numbers (non-primes) in unary notation.

    >
    > In which way is that proof that ECMAScript Regular Expressions are not a
    > regular language?


    The language is not regular (It's composite, the language of "1"'s of
    prime length, is not regular. This is failry easily proved by assuming
    it's regular, and using pumping to derive a contradition).

    The expression is a valid Javascript regular expression that recognizes
    a non-regular language, so Javascript regexps are not restricted to the
    regular languages (and therefore not possible to implement using only
    a finite state machine).

    >> Of the current crop of browsers, AFAIK none of them use a FSM
    >> based implementation of regular expressions.


    > So sure are you, despite other engines that also support backreferences
    > (like egrep) have been demonstrated to use FSM implementations?
    > <http://oreilly.com/catalog/regex/chapter/ch04.html>


    Referring to this?
    "You might, however, notice that GNU's version of egrep does support
    backreferences. It does so by having two complete engines under the
    hood! It first uses a DFA engine to see whether a match is likely,
    and then uses an NFA engine (which supports the full flavor,
    including backreferences) to confirm the match."

    I'm only talking about the current regexp implementations because I
    have actually been looking at them.

    What I probably /should/ have also said, is that by FSM I mean a
    deterministic machine (i.e., a DFA, one that is run linearly over the
    input). A nondeterministic machine can use any number of tricks to
    implement the nondeterminism, including backtracking.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 4, 2009
    #7
  8. Lasse Reichstein Nielsen wrote:
    > Thomas 'PointedEars' Lahn <> writes:
    >> Lasse Reichstein Nielsen wrote:
    >>> Classsical counter-example of regularness:
    >>> /^(11+)\1+$/
    >>> which recognizes composite numbers (non-primes) in unary notation.

    >> In which way is that proof that ECMAScript Regular Expressions are not a
    >> regular language?

    >
    > The language is not regular (It's composite, the language of "1"'s of
    > prime length, is not regular. This is failry easily proved by assuming
    > it's regular, and using pumping to derive a contradition).


    Ah, but I think you confuse the language that a regular expression can
    match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
    but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
    language, L_r. In order to prove that the ECMAScript Regular Expression
    language is not regular, you have to apply the pumping lemma for regular
    languages to the expression itself, not to the strings it could match.

    > The expression is a valid Javascript regular expression that recognizes
    > a non-regular language, so Javascript regexps are not restricted to the
    > regular languages (and therefore not possible to implement using only
    > a finite state machine).


    Sorry, I don't think you are correct here.

    >>> Of the current crop of browsers, AFAIK none of them use a FSM
    >>> based implementation of regular expressions.

    >>
    >> So sure are you, despite other engines that also support backreferences
    >> (like egrep) have been demonstrated to use FSM implementations?
    >> <http://oreilly.com/catalog/regex/chapter/ch04.html>

    >
    > Referring to this?
    > "You might, however, notice that GNU's version of egrep does support
    > backreferences. It does so by having two complete engines under the
    > hood! It first uses a DFA engine to see whether a match is likely,
    > and then uses an NFA engine (which supports the full flavor,
    > including backreferences) to confirm the match."


    Yes, exactly.

    > I'm only talking about the current regexp implementations because I
    > have actually been looking at them.


    We'll see.

    > What I probably /should/ have also said, is that by FSM I mean a
    > deterministic machine (i.e., a DFA, one that is run linearly over the
    > input). A nondeterministic machine can use any number of tricks to
    > implement the nondeterminism, including backtracking.


    But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
    also called "nondeterministic finite state machines", are definitely a
    subset of Finite State Machines (FSMs).

    <http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine>


    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 4, 2009
    #8
  9. Georg Jaehnig

    Stevo Guest

    Thomas 'PointedEars' Lahn wrote:
    > Lasse Reichstein Nielsen wrote:
    >> Thomas 'PointedEars' Lahn <> writes:
    >>> Lasse Reichstein Nielsen wrote:
    >>>> Classsical counter-example of regularness:
    >>>> /^(11+)\1+$/
    >>>> which recognizes composite numbers (non-primes) in unary notation.
    >>> In which way is that proof that ECMAScript Regular Expressions are not a
    >>> regular language?

    >> The language is not regular (It's composite, the language of "1"'s of
    >> prime length, is not regular. This is failry easily proved by assuming
    >> it's regular, and using pumping to derive a contradition).

    >
    > Ah, but I think you confuse the language that a regular expression can
    > match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
    > but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
    > language, L_r. In order to prove that the ECMAScript Regular Expression
    > language is not regular, you have to apply the pumping lemma for regular
    > languages to the expression itself, not to the strings it could match.


    Anyone else brave enough to admit this is way over their head? Check out
    these two links. The first one might as well be in Japanese for the
    amount of sense it makes to me:

    http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine#Properties_of_NFA-.CE.B5


    This one initially looks less complicated, but after reading the whole
    section I'm left open-mouthed and saying "Huh?" to myself...

    http://en.wikipedia.org/wiki/Regular_languages#Deciding_whether_a_language_is_regular


    I'm not sure if this newsgroup will consider everything after the # on
    those links as part of the links. We'll see.
     
    Stevo, Jan 5, 2009
    #9
  10. Thomas 'PointedEars' Lahn <> writes:

    > Lasse Reichstein Nielsen wrote:


    [/^(11+)\1+$/]

    > Ah, but I think you confuse the language that a regular expression can
    > match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
    > but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
    > language, L_r.


    Probably. The latter is much less interesting (and definitly context free)

    The language recognized by the regexp above is actually not even
    context-free. For a language over an alphabet with only one element,
    the pumping lemma for context free languages is equivalent to the
    one for regular languages. It's probably Type-1.
    IIRC, it's known to be in (NP intersect co-NP).

    > In order to prove that the ECMAScript Regular Expression
    > language is not regular, you have to apply the pumping lemma for regular
    > languages to the expression itself, not to the strings it could match.


    It's even simpler than that. The regular expression language has
    matched parentheses to an arbitrary depth. That alone is enough to
    rule out being regular.

    >> The expression is a valid Javascript regular expression that recognizes
    >> a non-regular language, so Javascript regexps are not restricted to the
    >> regular languages (and therefore not possible to implement using only
    >> a finite state machine).

    >
    > Sorry, I don't think you are correct here.


    I should be, unless there is a flaw in the logic.

    If all Javascript regexps can be implemented using only finite state
    machines, then all the languages recognized by them are regular (a FSM
    can only recognize a regular language).

    Since the language of /^(11+)\1+$/ is not regular, and is a Javascript
    regexp, there is at least one language that cannot be implemented using
    a FSM.

    ...

    >> What I probably /should/ have also said, is that by FSM I mean a
    >> deterministic machine (i.e., a DFA, one that is run linearly over the
    >> input). A nondeterministic machine can use any number of tricks to
    >> implement the nondeterminism, including backtracking.

    >
    > But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
    > also called "nondeterministic finite state machines", are definitely a
    > subset of Finite State Machines (FSMs).


    That is definitly true. However, we don't have the hardware yet to
    implement NFA's directly, so we have to emulate the nondeterminism
    somehow. The typical way is to implement a nondeterministic choice by
    trying one branch and setting up a backtrack point to try the other in
    case the first fails. This gets you a stack of backtrack points, which
    means that the implementation no longer uses a finite amount of
    space. The space used can depend on the input, not only the automaton.
    This is why I don't think of this implementation as really being a
    finite state machine, although it is emulating one at a certain
    level of abstraction.

    There are other ways to implement a NFA that doesn't use a stack,
    but instead determinizes the DFA (possibly lazily during execution).
    This can instead lead to exponential space blow-up, which is still
    finite, but not always manageable.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 5, 2009
    #10
  11. Hi,

    thanks for your answer!

    On 1 Jan., 12:52, Thomas 'PointedEars' Lahn <>
    wrote:
    > > So do you maybe know a library with those functions?

    >
    > MyRegExplibrary, but it is not online yet.


    Can't wait to see it! :)

    > If you had cared to read [ES3F], you would have known that all
    > ECMAScript values have such a representation.RegExpobjects, in
    > particular, inherit a property, ‘source’, which evaluates to it, and a
    > method, toString(), which returns it. (I am not completely sure as to
    > the availability of the former yet. It is specified, but I have only
    > tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
    > this iPhone 3G [firmware version 2.2 (5G77)].)
    >
    > Trimming and concatenating the results, and passing them to theRegExp
    > factory/constructor is left as an exercise to the reader.


    Well, I would know how to implement the union operation in this way:

    var reA = new RegExp("a");
    var reB = new RegExp("b");
    var reUnion = new RegExp( "(" + reA.source + "|" + reB.source + ")" );

    But how would you do intersection?

    One problem I would like to solve (and until now I think it needs
    intersection) is to check, if a string is a prefix of a regex. For
    instance:

    var re = new RegExp("abc|bc");
    var s = "a";

    My idea is to create a new RegExp concatening s and .*:
    var sRe = new RegExp(s + ".*");
    and see if the intersection of re and sRe is not empty.

    But if there's a simpler way maybe I wouldn't need intersection.

    --
    Georg | http://serchilo.net - command the web
     
    Georg Jaehnig, Jan 5, 2009
    #11
  12. Georg Jaehnig <> writes:

    .... (RegExp composition) ...

    > But how would you do intersection?


    > One problem I would like to solve (and until now I think it needs
    > intersection) is to check, if a string is a prefix of a regex. For
    > instance:
    >
    > var re = new RegExp("abc|bc");
    > var s = "a";


    I.e., whether there exists a suffix such that re.test(s+suffix)?
    That's a hard problem in general. Very hard, actually. I'm guessing
    exponential space (but most problems with regexps are that hard
    in their full generality).

    > My idea is to create a new RegExp concatening s and .*:
    > var sRe = new RegExp(s + ".*");
    > and see if the intersection of re and sRe is not empty.


    > But if there's a simpler way maybe I wouldn't need intersection.


    I doubt there is. If you just had to create a regexp that matched
    the intersection, you could use look-ahead:
    /(?=a)(?:abc|bc)/
    That creates a regexp for the intersection of the two languages,
    but you still need to figure out whether there is a string that is
    matched by that. That's the hard part.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 5, 2009
    #12
  13. Stevo wrote:
    >> [discussion of FSMs and regular languages]

    >
    > Anyone else brave enough to admit this is way over their head?


    Why does that require bravery?

    Some people have studied computer science. Others have not. This is
    not a moral failing.

    Apparently you have not. We could have guessed that from your first
    post to this thread, so it's not clear to me why you continued to
    provide us with evidence.

    > I'm not sure if this newsgroup will consider everything after the # on
    > those links as part of the links. We'll see.


    The newsgroup has no opinions about syntax. Usenet propagation
    mechanisms such as NNTP have syntactic rules, but do not care about
    "links" (URLs) in the message body. Usenet user agents may attempt to
    detect URLs in the message body, but that's an application feature
    which depends entirely on the user agent in question.

    --
    Michael Wojcik
    Micro Focus
    Rhetoric & Writing, Michigan State University
     
    Michael Wojcik, Jan 5, 2009
    #13
  14. Georg Jaehnig wrote:
    > thanks for your answer!


    You are welcome.

    > Thomas 'PointedEars' Lahn wrote:
    >>> So do you maybe know a library with those functions?

    >> MyRegExplibrary, but it is not online yet.

    >
    > Can't wait to see it! :)


    So do I :)

    >> If you had cared to read [ES3F], you would have known that all
    >> ECMAScript values have such a representation.RegExpobjects, in
    >> particular, inherit a property, ‘source’, which evaluates to it, and a
    >> method, toString(), which returns it. (I am not completely sure as to
    >> the availability of the former yet. It is specified, but I have only
    >> tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
    >> this iPhone 3G [firmware version 2.2 (5G77)].)
    >>
    >> Trimming and concatenating the results, and passing them to theRegExp
    >> factory/constructor is left as an exercise to the reader.

    >
    > Well, I would know how to implement the union operation in this way:
    >
    > var reA = new RegExp("a");
    > var reB = new RegExp("b");
    > var reUnion = new RegExp( "(" + reA.source + "|" + reB.source + ")" );
    >
    > But how would you do intersection?


    Given your initial problem (I'm declaring identifiers here as you should
    have) --

    var pattern1 = new RegExp("(a|b|c)"); // contains a,b,c
    var pattern2 = new RegExp("(a|b)"); // contains a,b

    // This would be nice:
    var pattern3 = RegExp.intersect( pattern1, pattern2 )
    // pattern3 = /"(a|b)"/

    -- simple intersection would be fairly easy:

    RegExp.prototype.intersect = function(pattern2) {
    /* Rule out invalid values */
    if (!pattern2 || pattern2.constructor != RegExp) return null;

    var
    s = this.source.replace(/^\(?([^)]*)\)?$/, "$1"),
    s2 = pattern2.source.replace(/^\(?([^)]*)\)?$/, "$1");

    /* Register all parts within alternation of this pattern */
    var a = s.split("|"), o = {};
    for (var i = a.length; i--;) o[a] = true;

    /* Register all parts within alternation of pattern2 */
    var a2 = s2.split("|"), o2 = {};
    for (i = a2.length; i--;) o2[a2] = true;

    /* Compose the new alternation out of common parts */
    var hOP = (function() {
    if (typeof Object.prototype.hasOwnProperty == "function")
    {
    return function(o, p) {
    return o.hasOwnProperty(p);
    };
    }
    else
    {
    // suffices *here*
    return function(o, p) {
    return typeof o[p] != "undefined"
    && typeof o.constructor.prototype[p] == "undefined";
    };
    }
    })();

    a = [];
    for (var p in o)
    {
    if (hOP(o2, p)) a.push(p);
    }

    return new RegExp("(" + a.join("|") + ")");
    };

    var pattern1 = /(a|b|c)/;
    var pattern2 = /(a|b)/;
    var pattern3 = pattern1.intersect(pattern2);

    // yields /(a|b)/ or /(b|a)/ (implementation-dependent)
    console.log(pattern3);

    > One problem I would like to solve (and until now I think it needs
    > intersection) is to check, if a string is a prefix of a regex. [...]


    See Lasse's answer.


    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 6, 2009
    #14
  15. Hello,

    On 5 Jan., 22:44, Lasse Reichstein Nielsen <>
    wrote:
    > I.e., whether there exists a suffix such that re.test(s+suffix)?
    > That's a hard problem in general. Very hard, actually. I'm guessing
    > exponential space


    I don't think so. "s + any suffix" is a regular language because it
    can be described as a common regex: /s.*/

    Checking if "s + any suffix" matches a regex is actually checking if
    the intersection of /s.*/ and the regex is not empty. Intersection of
    2 regexes is intersection of 2 finite state machines which needs
    quadratic space (no. of states of FSA1 times no. of states of FSA2).

    Checking if a FSA describes only the empty language is only checking
    if it has no final states.

    > > My idea is to create a new RegExp concatening s and .*:
    > > var sRe = new RegExp(s + ".*");
    > > and see if the intersection of re and sRe is not empty.
    > > But if there's a simpler way maybe I wouldn't need intersection.

    >
    > I doubt there is. If you just had to create a regexp that matched
    > the intersection, you could use look-ahead:
    >   /(?=a)(?:abc|bc)/


    Cool, lookahead really seems to model intersection, so let's have a
    nice function:

    RegExp.prototype.intersect = function( pattern2 ) {
    return new RegExp( '(?=' + this.source + ')(?=' + pattern2.source +
    ')' );
    }

    > That creates a regexp for the intersection of the two languages,
    > but you still need to figure out whether there is a string that is
    > matched by that. That's the hard part.


    I see. Ah, too bad! Because if the RegExp object would be implemented
    as a FSA, technically it should be really easy (see above - just
    checking if there are final states).

    --
    Georg | http://serchilo.net - command the web
     
    Georg Jaehnig, Jan 6, 2009
    #15
  16. Hi,

    On 6 Jan., 01:21, Thomas 'PointedEars' Lahn <>
    wrote:
    > -- simple intersection would be fairly easy:
    >
    >   RegExp.prototype.intersect = function(pattern2) {
    >     /* Rule out invalid values */
    >     if (!pattern2 || pattern2.constructor != RegExp) return null;
    >
    >     var
    >       s = this.source.replace(/^\(?([^)]*)\)?$/, "$1"),
    >       s2 = pattern2.source.replace(/^\(?([^)]*)\)?$/, "$1");
    >
    >     /* Register all parts within alternation of this pattern */
    >     var a = s.split("|"), o = {};
    >     for (var i = a.length; i--;) o[a] = true;
    >
    >     /* Register all parts within alternation of pattern2 */
    >     var a2 = s2.split("|"), o2 = {};
    >     for (i = a2.length; i--;) o2[a2] = true;
    >
    >     /* Compose the new alternation out of common parts */
    >     var hOP = (function() {
    >       if (typeof Object.prototype.hasOwnProperty == "function")
    >       {
    >         return function(o, p) {
    >           return o.hasOwnProperty(p);
    >         };
    >       }
    >       else
    >       {
    >         // suffices *here*
    >         return function(o, p) {
    >           return typeof o[p] != "undefined"
    >                  && typeof o.constructor.prototype[p] == "undefined";
    >         };
    >       }
    >     })();
    >
    >     a = [];
    >     for (var p in o)
    >     {
    >       if (hOP(o2, p)) a.push(p);
    >     }
    >
    >     return new RegExp("(" + a.join("|") + ")");
    >   };
    >
    >   var pattern1 = /(a|b|c)/;
    >   var pattern2 = /(a|b)/;
    >   var pattern3 = pattern1.intersect(pattern2);
    >
    >   // yields /(a|b)/ or /(b|a)/ (implementation-dependent)
    >   console.log(pattern3);


    This function seems to work only for alternation regexes but gives
    weird results on simple cases like this:

    var pattern1 = /./;
    var pattern2 = /a/;
    var pattern3 = pattern1.intersection(pattern2)

    document.write( pattern3.source + " -- " + pattern3.test("b") );

    I would want /a/ -- false
    but it gives () -- true

    --
    Georg
     
    Georg Jaehnig, Jan 6, 2009
    #16
  17. Georg Jaehnig wrote:
    > Thomas 'PointedEars' Lahn wrote:
    > [...]
    > > -- simple intersection would be fairly easy:
    > >
    > >   RegExp.prototype.intersect = function(pattern2) {
    > >     /* Rule out invalid values */
    > >     if (!pattern2 || pattern2.constructor != RegExp) return null;
    > >
    > >     var
    > >       s = this.source.replace(/^\(?([^)]*)\)?$/, "$1"),
    > >       s2 = pattern2.source.replace(/^\(?([^)]*)\)?$/, "$1");
    > >
    > >     /* Register all parts within alternation of this pattern */
    > >     var a = s.split("|"), o = {};
    > >     for (var i = a.length; i--;) o[a] = true;
    > >
    > >     /* Register all parts within alternation of pattern2 */
    > >     var a2 = s2.split("|"), o2 = {};
    > >     for (i = a2.length; i--;) o2[a2] = true;
    > >
    > >     /* Compose the new alternation out of common parts */
    > >     var hOP = (function() {
    > >       if (typeof Object.prototype.hasOwnProperty == "function")
    > >       {
    > >         return function(o, p) {
    > >           return o.hasOwnProperty(p);
    > >         };
    > >       }
    > >       else
    > >       {
    > >         // suffices *here*
    > >         return function(o, p) {
    > >           return typeof o[p] != "undefined"
    > >                  && typeof o.constructor.prototype[p]== "undefined";
    > >         };
    > >       }
    > >     })();
    > >
    > >     a = [];
    > >     for (var p in o)
    > >     {
    > >       if (hOP(o2, p)) a.push(p);
    > >     }
    > >
    > >     return new RegExp("(" + a.join("|") + ")");
    > >   };
    > >
    > >   var pattern1 = /(a|b|c)/;
    > >   var pattern2 = /(a|b)/;
    > >   var pattern3 = pattern1.intersect(pattern2);
    > >
    > >   // yields /(a|b)/ or /(b|a)/ (implementation-dependent)
    > >   console.log(pattern3);

    >
    > This function seems to work only for alternation regexes but gives
    > weird results on simple cases like this:
    >
    >   var pattern1 = /./;
    >   var pattern2 = /a/;
    >   var pattern3 = pattern1.intersection(pattern2)
    >
    > document.write( pattern3.source + " -- " + pattern3.test("b") );
    >
    > I would want /a/ -- false


    This method is a quick hack that handles *simple* strings (hence my
    saying "simple intersection"); it was not intended as or is supposed
    to be a complete solution.

    Happy coding.

    > but it gives () -- true


    Of course, the empty string matches everything.


    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 6, 2009
    #17
  18. Georg Jaehnig <> writes:

    > Hello,
    >
    > On 5 Jan., 22:44, Lasse Reichstein Nielsen <>
    > wrote:
    >> I.e., whether there exists a suffix such that re.test(s+suffix)?
    >> That's a hard problem in general. Very hard, actually. I'm guessing
    >> exponential space

    >
    > I don't think so. "s + any suffix" is a regular language because it
    > can be described as a common regex: /s.*/
    >
    > Checking if "s + any suffix" matches a regex is actually checking if
    > the intersection of /s.*/ and the regex is not empty. Intersection of
    > 2 regexes is intersection of 2 finite state machines which needs
    > quadratic space (no. of states of FSA1 times no. of states of FSA2).


    I think you are right.

    I was thinking of a problem, sometimes called EQREX, of determining
    whether two regular expressions define the same languge. That one is
    EXP-SPACE complete, but this problem is actually simpler (you only
    need to find one matching string to prove the property, not check
    something for all strings).

    You can stay with nondeterministic automatons (NFA) and still solve
    the problem, and for a regexp, you can generate a NFA that is
    proportional in size to the regexp (the corresponding DFA could be
    exponential, though).

    Whether an NFA accepts any string is solvable by a simple algorithm
    (perhaps as simple as finding the shortest path from the start state
    to all other states, and see if you hit an accept state). At worst
    it's quadratic, and you'll even find a shortest matched string :)

    > Checking if a FSA describes only the empty language is only checking
    > if it has no final states.


    The direct product construction will create all the product states,
    including all possible accept states. You need to see if they are
    reachable (or prune non-reachable states and see if any are left).
    But definitly doable.

    > RegExp.prototype.intersect = function( pattern2 ) {
    > return new RegExp( '(?=' + this.source + ')(?=' + pattern2.source +
    > ')' );
    > }


    No need to use look-ahead on the second pattern.

    You might need to rename any back-references in the second pattern, if
    both patterns contain captures.

    Apart from that, it should work.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 6, 2009
    #18
  19. kangax wrote:
    > Thomas 'PointedEars' Lahn wrote:
    > > [`RegExp.prototype.intersect` implementation]

    >
    > It might be worth mentioning that this implementation will not work with
    > some of the escaped characters (i.e. "|", "(" and ")") [1] as well as
    > with nested groups [2]:


    Hence "*simple* intersection".

    > [1] `/(a|b|c)/` and `/(a|b\|c)/` produce `/(a|c)/` instead of a proper `a`


    I don't see how this can be accomplished with using .source.split
    (/.../) since we don't have negative lookbehind (?<!) in ECMAScript
    implementations with which you could exclude `\|' as a delimiter; so
    it probably needs to be solved with RegExp-based string parsing.

    > [2] `/(a|b)/` and `/(a|b(xx)?)/` produce `/()/` instead of, at least, `/a/`


    I'm looking forward to your suggestions.


    PointedEars
     
    Thomas 'PointedEars' Lahn, Jan 6, 2009
    #19
  20. Lasse Reichstein Nielsen <> writes:

    > Georg Jaehnig <> writes:


    >> RegExp.prototype.intersect = function( pattern2 ) {
    >> return new RegExp( '(?=' + this.source + ')(?=' + pattern2.source +
    >> ')' );
    >> }

    >
    > No need to use look-ahead on the second pattern.
    >
    > You might need to rename any back-references in the second pattern, if
    > both patterns contain captures.
    >
    > Apart from that, it should work.


    Ahem, with a few caveats.

    The resulting regexp test if both of the original regexps match an input
    string *starting at the same point*.

    The regexp is unanchored if the original patters are, and then it can
    match at any point of the input.

    They don't necessarily match the same substring at that point.
    I.e.
    /(?=ab)a/
    will match "cab", because one string matches "a" and the other "ab".

    That might or might not be what you want.
    It's reasonable if you consider "cab" to be in the "language of the
    regexp" for both patterns ("a" and "ab"). However, in that case it
    would be wrong that /(?=c)ab/ doesn't match. You would need to allow
    for different starting points: /(?=[^]*c)[^]*ab/ (where [^] is just the
    shortest way to say "any character" :).

    If you don't want "cab" to be in the "intersection" of /a/ and /ab/,
    then you should probably anchor both patterns to begin with, i.e.,
    /^ab$/ and /^a$/)

    It all boils down to which definition of "intersection" and "language
    of a regexp" that is used.

    /L 'details.devil != undefined'
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Jan 6, 2009
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. deejayfred
    Replies:
    0
    Views:
    554
    deejayfred
    Oct 2, 2003
  2. SomeDude
    Replies:
    3
    Views:
    3,159
    arant
    Aug 14, 2006
  3. Inderkal
    Replies:
    8
    Views:
    1,260
    rickman
    Dec 9, 2004
  4. Gilles Lacroix
    Replies:
    3
    Views:
    178
    Gilles Lacroix
    Aug 3, 2006
  5. Walter Roberson

    regexp finite state machine?

    Walter Roberson, Mar 8, 2005, in forum: Perl Misc
    Replies:
    0
    Views:
    150
    Walter Roberson
    Mar 8, 2005
Loading...

Share This Page