No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match
using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):
start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)
Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.
The DFA of course could never backtrack but is there a DFA for this regex?
Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):
{start,a_star,any_star}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star}
),
{any_star}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star}
),
{a_star,any_star,final}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star,final}
),
{any_star,final}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star,final}
)
The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.
If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.
-mental