Parens Matching

Chris Uppal · Feb 17, 2007

Ken said:
I probably should have been more transparent in my original post. I'm not
trying to test a string for balanced parens. I need it to be a match
termination condition. In other words, I have a fairly complex pattern.
So it tells when a match is done. I need to say the match is done if the
character pattern is played out or the match will play out before I
balance parens.

It sounds to me as if this is a mis-application of regular expressions (IMO
nearly all applications of regular expressions in "real" programming languages
/are/ mis-applications).

Would it be easier to use bracket matching to delimit the sub-sequence you want
with real code and then (if regexes still seem like a good idea) use regexes to
pick it apart ?

Alternatively, use regexes to "tokenise" the input (identifying (, ), and the
other stuff you are interested in) and then use real code to traverse the
tokens (words) using bracket-counting to keep track of where you are in the
input stream.

I first wrote this in C#. .Net has extended RegExp that lets you swap
named groups on condition.

<shudder/>

-- chris

Daniel Pitts · Feb 18, 2007

Hmm, I don't think your definition is correct.
Regex isn't always implementedwith a DSM, it might use NDA, which
still doesn't solve it.

Click to expand...

What do you mean by NDA? Non-deterministic automaton?

In either case, the following is a deterministic state machine solves
the problem.

Click to expand...

private static boolean isBalancedCount(String string) {
int parens = 0;
for (char c : string.toCharArray()) {
if (c == '(') {
++parens;
}
if (c == ')' && --parens < 0) {
return false;
}
}
return parens == 0;
}

Click to expand...

Nope, it doesn't solve the problem. The int type in Java has finite
capacity - see Integer.MAX_VALUE - so more accurately this a finite
deterministic state machine. I'm actually trying to remember if
'deterministic state machine' usually implies the 'finite' part in
common usage. A look at wikipedia seems to suggest that, but I wasn't
looking at it for long...

So, anyway, I can think of an input that causes the above code to fail -
and hence problem not solved.

To give a simplified example of this, suppose we have a Java where
Integer.MAX_VALUE = 1, and .MIN_VALUE = -2. So an int can have 4
distinct values for int. Assuming this, would would your code save about
the following inputs?

(((( - this would be a false positive
(((((((( - another false positive
((())) - this would be a false negative

Similarly, your code above in the world of real Java would give a false
positive for the input:

'(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]

where n is any positive integer.

Basically, true parens matching is scuppered, because of the need for
infinite memory!

lex

(((( and (((((((( both give the correct output.
((())) also results in the correct output. Hmm, are you sure you
analyzed my algorithm correctly?

As for the case of:
'(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]

You're talking about at least a 4 gigabyte string. As a mater of
practicality, I find that type of input to be unreasonable, but it
would be simple enough to change the code to hand off to a different
algorithm:

private static boolean isBalancedCount(String string) {
int parens = 0;
for (char c : string.toCharArray()) {
if (c == '(' && parens++ == Integer.MAX_INT) {
// We can't handle the trueth, but isBalancedRe can!
return isBalancedRe(string);
}
if (c == ')' && --parens < 0) {
return false;
}
}
return parens == 0;
}

Alex Hunsley · Feb 18, 2007

Daniel said:
To give a simplified example of this, suppose we have a Java where
Integer.MAX_VALUE = 1, and .MIN_VALUE = -2. So an int can have 4
distinct values for int. Assuming this, would would your code save about
the following inputs?

(((( - this would be a false positive
(((((((( - another false positive
((())) - this would be a false negative

Similarly, your code above in the world of real Java would give a false
positive for the input:

'(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]

where n is any positive integer.

Basically, true parens matching is scuppered, because of the need for
infinite memory!

lex

Click to expand...

(((( and (((((((( both give the correct output.
((())) also results in the correct output. Hmm, are you sure you
analyzed my algorithm correctly?

I wasn't talking about your code running under normal Java, I was
talking about a pretend one with very small Integer.MAX_VALUE etc. To
quote myself: "To give a simplified example of this, suppose we have a
Java where ... [stuff] ... assuming this, what would your code [say]
about the following inputs?" I was making a simplification to make a
more accessible toy example where it was easy to play with the cases
where it would produce the wrong answers for 'extreme' cases.

As for the case of:
'(' * [(Integer.MAX_VALUE - Integer.MIN_VALUE + 1) * n]

You're talking about at least a 4 gigabyte string. As a mater of
practicality, I find that type of input to be unreasonable, but it
would be simple enough to change the code to hand off to a different
algorithm:

I think we're coming back to the issue mentioned before about theory
versus practicality: I do know that your code works for all practical
purposes - it's correct for all practical purposes - as opposed to the
computer science theoretic sense in which bracket matching can't be done
(by a regular expression or by a FSA or by a finite computer).

I'm just being too theoretic for my own good, is all....

lex

Alex Hunsley · Feb 18, 2007

Chris said:
I would understand "non-deterministic state machine" to mean the
non-deterministic variant of finite state automaton). And the deterministic
and non-deterministic versions of FSAs are equivalent in power.

Are you sure about that? As far as I can recall, a non-deterministic FSA
(as that is what I am meaning) can solve the bracket matching problem,
but a deterministic one can't. Put it this way - how would you design a
deterministic FSA that did bracket matching on arbitrary length input?
You can't, if it is going to remain finite... And it's easy enough to do
with a non-deterministic FSA.

Is it that *some* non-deterministic FSA can be 'unfolded' into
deterministic ones, but not all, perhaps?

lex

Do you use the
term to mean something other form of automaton ?

No, FSA is right, that's basically what I'm thinking of...

Alex Hunsley · Feb 18, 2007

Chris said:
You can do it in constant space for arbitrary length inputs /if/ you are
allowed to use the input as working space too -- just replace '()' by '' until
there's nothing left.

Yay! That's the Turing Machine way. Overwrite the tape...
There was something perversely satisfying about 'writing' Turing
machines at uni to solve problems like multiplication of two numbers (in
binary). The first and original "virtual machine"!

I love the idea that it wouldn't be hard to build an actual (finite!)
Turing machine out of bits of something or other, and have toyed with
that idea at various times.

Patricia Shanahan · Feb 18, 2007

Alex said:
Are you sure about that? As far as I can recall, a non-deterministic FSA
(as that is what I am meaning) can solve the bracket matching problem,
but a deterministic one can't. Put it this way - how would you design a
deterministic FSA that did bracket matching on arbitrary length input?
You can't, if it is going to remain finite... And it's easy enough to do
with a non-deterministic FSA.

How do you do it with a non-deterministic FSA?

Is it that *some* non-deterministic FSA can be 'unfolded' into
deterministic ones, but not all, perhaps?

No, ANY non-deterministic FSA can be unfolded into a deterministic one.

The simple, brute force, approach is to create a DFSA whose states are
the sets of states of the NDFSA. It has a transition from x to y on
input symbol s if, and only if, y is the set of possible states of the
NDFSA after input s when in some state in x.

The DFSA's start state is the set of start states of the NDFSA. Its
accept states are those states that contain at least one of the NDFSA's
accept states.

The DFSA tracks the set of states the NDFSA could reach, from any start
state, given the input so far.

Patricia

Alex Hunsley · Feb 19, 2007

Patricia said:
How do you do it with a non-deterministic FSA?

Here's one I just sketched. Does it look right?

http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0

But I can't see how this can be made into a DFSA because from what I
sketched for that you end up with a big 'ladder' of states that
stretches off to infinity (hence non finite).

No, ANY non-deterministic FSA can be unfolded into a deterministic one.

The simple, brute force, approach is to create a DFSA whose states are
the sets of states of the NDFSA. It has a transition from x to y on
input symbol s if, and only if, y is the set of possible states of the
NDFSA after input s when in some state in x.

The DFSA's start state is the set of start states of the NDFSA. Its
accept states are those states that contain at least one of the NDFSA's
accept states.

The DFSA tracks the set of states the NDFSA could reach, from any start
state, given the input so far.

Thanks for the reminder of all that!
lex

Alex Hunsley · Feb 19, 2007

Daniel said:
Or, to get down right esoteric:
private static boolean isBalancedCount(String string) {
int parens = 0;
for (char c : string.toCharArray()) {
if ((c == '(' && ++ parens < 0) || c==')' && --parens < 0)
{
return false;
}
}
return parens == 0;
}

CafeBabe points to those who know why that works the way it does.

I saw CafeBabe scavenging some fields for DeadBeef last night! Poor babe....
lex

Chris Uppal · Feb 19, 2007

Alex said:
Here's one I just sketched. Does it look right?

http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0

I don't think that does what you intend. For instance, if it sees input string
")" then it'll end up in states {A,B}. The ')' will take it from the start
state, A, to state A, and also (non-deterministically) to state B. Since A is
accepting that means it will accept the input ")" -- which is rather noticeably
not a string of balanced, properly nested, brackets ;-)

But I can't see how this can be made into a DFSA because from what I
sketched for that you end up with a big 'ladder' of states that
stretches off to infinity (hence non finite).

It is equivalent to a DFA with two states, I'll call 'em S0 and S1.
S0 is the start state and is accepting.
There are four transitions as follows (I hope the notation is self-evident):
S0 -- ) --> S0
S0 -- ( --> S1
S1 -- ( --> S1
S1 -- ) --> S0

That DFA was generated and minimised mechanically, btw. It is always possible
to transform an NFA into an equivalent DFA, and what's more it can be done by a
fairly simple algorithm -- no human insight required ;-) The key point is
that, since the NFA has -- by definition -- a finite set of states, the set of
/sets/ of states is also finite. And by constructing a DFA with a state
corresponding to each (non-empty) set of NFA states, you can translate the NFA
into (typically rather bigger) DFA. If you'd like more details then you could
start with the Wikipeadia article:
http://en.wikipedia.org/wiki/Powerset_construction
(I haven't read it, so I don't know how clear, or even correct, it is, but it
does have convincing-looking pictures ;-) Or try Google:
nfa dfa construction

The algorithm yields a 3-state FSA with 6 transitions, and then minimising that
produces the above.

-- chris

Patricia Shanahan · Feb 19, 2007

Alex said:
Here's one I just sketched. Does it look right?

http://farm1.static.flickr.com/136/395180058_ea4d66f451.jpg?v=0

It accepts any string whose last character is ")", regardless of whether
the parentheses are balanced or not. There is an edge for ")" leading to
the accepting state A from both itself and B.

But I can't see how this can be made into a DFSA because from what I
sketched for that you end up with a big 'ladder' of states that
stretches off to infinity (hence non finite).

Each rung on the ladder can be represented by a set of states of the
NDFSA, and the next set of states depends only on the current set of
states and the input.

Patricia

Christian · Feb 19, 2007

Patricia said:
It accepts any string whose last character is ")", regardless of whether
the parentheses are balanced or not. There is an edge for ")" leading to
the accepting state A from both itself and B.

the power set of any finite set is finite!
http://en.wikipedia.org/wiki/Powerset_construction
basically the same for any automat or machine ... non deterministic and
deterministic allways have the same power ..
even on a touring machine ... non-deterministic may just be faster...

Oliver Wong · Feb 21, 2007

I disagree, because a proof (by construction) exists which shows that
any non-deterministic state machine can be converted to a determinisitc
state machine. If the NDSM has K states, then DSM will have 2^K states. If
you allow for infinite NDSM, it only seems fair that you also allow for
infinite DSMs.

What do you mean by NDA? Non-deterministic automaton?

Probably. I'll come back to this in a moment later in this post.

Nope, it doesn't solve the problem. The int type in Java has finite
capacity - see Integer.MAX_VALUE - so more accurately this a finite
deterministic state machine. I'm actually trying to remember if
'deterministic state machine' usually implies the 'finite' part in common
usage. A look at wikipedia seems to suggest that, but I wasn't looking at
it for long...

The usually, when someone talks about deterministic automatons and
non-deterministic automatons, they mean finite ones. So the usual
abbreviations are DFA and NDFA. We can, of course, talk about infinite
automatons, but there seems to be less literature on that topic (perhaps
because it's equivalent to, but more complex than, some other machine, e.g.
a push down stack machine or a Turing Machine or something, I don't know).

So if you're talking about finite DAs and Nodes, then they cannot solve
the "balance parentheses" problem. If you're talking about infinite ones,
then if a NDA can solve it, so can a DA. And if a NDA cannot solve it,
neither can a DA. The two are equivalent in power.

The interests in NDAs is that they take up less "code" to represent
(recall that a K state NDA translates into a 2^K state DA), *and* (and
here's the part that makes them interesting to study nowadays) a quantum
computer can run NDAs "natively", whereas a classical computer would have to
emulate the NDA by converting it (either implicitly or explicitly) into a
DA, and gain that 2^K performance penalty factor.

- Oliver

Daniel Pitts · Feb 21, 2007

I disagree, because a proof (by construction) exists which shows that
any non-deterministic state machine can be converted to a determinisitc
state machine. If the NDSM has K states, then DSM will have 2^K states. If
you allow for infinite NDSM, it only seems fair that you also allow for
infinite DSMs.

Probably. I'll come back to this in a moment later in this post.

The usually, when someone talks about deterministic automatons and
non-deterministic automatons, they mean finite ones. So the usual
abbreviations are DFA and NDFA. We can, of course, talk about infinite
automatons, but there seems to be less literature on that topic (perhaps
because it's equivalent to, but more complex than, some other machine, e.g.
a push down stack machine or a Turing Machine or something, I don't know).

So if you're talking about finite DAs and Nodes, then they cannot solve
the "balance parentheses" problem. If you're talking about infinite ones,
then if a NDA can solve it, so can a DA. And if a NDA cannot solve it,
neither can a DA. The two are equivalent in power.

The interests in NDAs is that they take up less "code" to represent
(recall that a K state NDA translates into a 2^K state DA), *and* (and
here's the part that makes them interesting to study nowadays) a quantum
computer can run NDAs "natively", whereas a classical computer would have to
emulate the NDA by converting it (either implicitly or explicitly) into a
DA, and gain that 2^K performance penalty factor.

- Oliver

Its important to say that an NDA with K states has an equivalent DA
with at most 2^K states. The set of all NDA contains the set of all
Deterministic Automota. Determinism is a property of the stateful
automata.

For example, if you have the simple regex "abcd", the NDA and DA
produce the same graph, since it is only possible to be in one state
at a time. (s)->a->b->c->d->(e)

Whereas "a[bc]d" produces the NDA
(s)->a
/ \
b c
\ /
d->(e)
and the DA
(s)->a->b and c->d->(e)

The DA actually has fewer "states" than the NDA inthis case (as b and
c are combined)

Christian · Feb 21, 2007

Daniel said:
I disagree, because a proof (by construction) exists which shows that
any non-deterministic state machine can be converted to a determinisitc
state machine. If the NDSM has K states, then DSM will have 2^K states. If
you allow for infinite NDSM, it only seems fair that you also allow for
infinite DSMs.

Probably. I'll come back to this in a moment later in this post.

The usually, when someone talks about deterministic automatons and
non-deterministic automatons, they mean finite ones. So the usual
abbreviations are DFA and NDFA. We can, of course, talk about infinite
automatons, but there seems to be less literature on that topic (perhaps
because it's equivalent to, but more complex than, some other machine, e.g.
a push down stack machine or a Turing Machine or something, I don't know).

So if you're talking about finite DAs and Nodes, then they cannot solve
the "balance parentheses" problem. If you're talking about infinite ones,
then if a NDA can solve it, so can a DA. And if a NDA cannot solve it,
neither can a DA. The two are equivalent in power.

The interests in NDAs is that they take up less "code" to represent
(recall that a K state NDA translates into a 2^K state DA), *and* (and
here's the part that makes them interesting to study nowadays) a quantum
computer can run NDAs "natively", whereas a classical computer would have to
emulate the NDA by converting it (either implicitly or explicitly) into a
DA, and gain that 2^K performance penalty factor.

- Oliver

Click to expand...

Its important to say that an NDA with K states has an equivalent DA
with at most 2^K states. The set of all NDA contains the set of all
Deterministic Automota. Determinism is a property of the stateful
automata.

For example, if you have the simple regex "abcd", the NDA and DA
produce the same graph, since it is only possible to be in one state
at a time. (s)->a->b->c->d->(e)

Whereas "a[bc]d" produces the NDA
(s)->a
/ \
b c
\ /
d->(e)
and the DA
(s)->a->b and c->d->(e)

The DA actually has fewer "states" than the NDA inthis case (as b and
c are combined)

for me important would be to say that java's regexp seem not to be
equivalent to regexp in science (which have the same power as DFA or NFA
or epsilon-FA)

javas regexp allow you to backreference something you mached earlier
which might make them as powerful as a ContexFree-grammar/PDA (may be I
don't know, just a guess)

so it seems for me currently not proven that you can't do this parens
matching...
but this seems to be rather scientific intrest...there are better ways
to do it...

Oliver Wong · Feb 21, 2007

[long discussion snipped]

Right.

Right, that's what I intended to convey.

[...]

Christian said:
for me important would be to say that java's regexp seem not to be
equivalent to regexp in science (which have the same power as DFA or NFA
or epsilon-FA)

Most modern languages's regexp implementation seem to be derived from
Perl's regexp implementation, which was already more powerful than
"classical" regular expressions, DFAs and NFAs.

- Oliver

Gordon Beaton · Feb 22, 2007

Most modern languages's regexp implementation seem to be derived
from Perl's regexp implementation, which was already more powerful
than "classical" regular expressions, DFAs and NFAs.

Most Perl regular expressions can be rewritten using "classical"
regexps, IIUC it's only backreferences that can't.

On the other hand, the Perl (Java, Python, Ruby, etc) regexp
implementation is extremely *slow*:

http://swtch.com/~rsc/regexp/regexp1.html

/gordon

Chris Uppal · Feb 22, 2007

Gordon said:
On the other hand, the Perl (Java, Python, Ruby, etc) regexp
implementation is extremely *slow*:

http://swtch.com/~rsc/regexp/regexp1.html

Thank you! I had come across that page recently, and was just wondering how I
was going to find it again to mention here...

(One caveat: he says that the NFA-based approaches are "little-known" -- that
seems something of an overstatement to me. I doubt if there are many people
with an interest in the /implementation/ of regexps who don't know about
NFA/DFA construction, at least in general terms. If all he means is that most
/users/ of Perl-style regexps don't realise that there's a performance price
for the complicated hacks, then I have no disagreement.)

-- chris

Gordon Beaton · Feb 22, 2007

Thank you! I had come across that page recently, and was just
wondering how I was going to find it again to mention here...

For me it was exactly the opposite - I saw it on the erlang-questions
list just yesterday and had to figure out where I had seen *this*
thread in order to post it...

/gordon

Oliver Wong · Feb 22, 2007

Daniel Pitts said:
For example, if you have the simple regex "abcd", the NDA and DA
produce the same graph, since it is only possible to be in one state
at a time. (s)->a->b->c->d->(e)

Whereas "a[bc]d" produces the NDA
(s)->a
/ \
b c
\ /
d->(e)
and the DA
(s)->a->b and c->d->(e)

The DA actually has fewer "states" than the NDA inthis case (as b and
c are combined)

I was thinking about these comments a bit more, and I realized I don't
really understand your notation. In particular, I have no idea how to read
your DA. What you call an NDA looks like a DA to me. That is,

(s)->a
/ \
b c
\ /
d->(e)

looks like a *deterministic* finite automaton, in that there's only ever at
most 1 possible transition you can take given the next input token and the
current state. To make it an NDA, you'd probably need something like:

(s)->a
/|\
/ | \
b b c
| | |
h i j
\ | /
\|/
d->(e)

(which implements the regexp: "a(bh|bi|cj)d")

where if you're at the top state, and you see a b, you don't know whether to
take the left one or the middle one.

But even then, I think (but cannot recall precisely) that the way the
terms NFA and DFA were defined to me, every DFA is considered to also be an
NFA (and so perhaps the N should stand for "not necessarily deterministic"
rather than "non-determistic"), but not every NFA is also a DFA.

- Oliver

Chris Uppal · Feb 22, 2007

Oliver said:
I was thinking about these comments a bit more, and I realized I don't
really understand your notation. In particular, I have no idea how to read
your DA.

I think Daniel's notation is non-standard, but it does make seem to make sense.
What he's doing is removing the states themselves from the picture, leaving
only the trails of input words that have been recognised.

So in:

Whereas "a[bc]d" produces the NDA
(s)->a
/ \
b c
\ /
d->(e)

the first 'a' is not a state label, but marks the fact that we have accepted an
'a'. The actual state we are in has no name (in this notation) but is sort of
hidden "just after" the label 'a'. From that state we can transition by
accepting 'b' to an anonymous state hidden behind the label 'b', or can
transition by accepting 'c' to the anonymous state hidden behind the 'c' label.
And from either of those states we can accept 'd' to go to the corresponding
state, and since that is a final state, we have matched the input. And the NFA
does indeed recognise a(b|c)d.

I don't think the notation covers all possible cases, for all it is nice and
clear for the cases it /does/ cover. The problems comes when one state has two
or more incoming edges with different labels (including epsilon transitions):

(S0)---x---> (S1)
| |
a y
| |
v |
(S2)---b--->[S3]

(I hope my ASCII art is legible). S0 is the start state, from which we can
transition via 'x' to S1, or via 'a' to S2. From S1 and S2 we can transition
via 'b' or 'y' (respectively) to S3, the accepting state. Thus it recognises
(ab|xy). I don't think that can be represented in Daniel's notation without
changing the definition of the NFA itself.

I don't know whether every regexp can be represented as /some/ NFA in Daniel's
notation. I haven't been able to think of a counter example yet, although the
NFA is sometimes more awkward than one which can be written in the classic
diagram format.

But even then, I think (but cannot recall precisely) that the way the
terms NFA and DFA were defined to me, every DFA is considered to also be
an NFA (and so perhaps the N should stand for "not necessarily
deterministic" rather than "non-determistic"), but not every NFA is also
a DFA.

It depends on how you see it. Every DFA corresponds trivially to an NFA with
the same structure, so in that sense the DFAs are a subset of the NFAs. But
the formal definition of a DFA is different from that of a NFA, since it has a
/single/' state at any one time rather than a set of states. And of course
type Set<State> is not assignable to, or from, type State.

-- chris

FAQ 4.23 How do I find matching/nesting anything?	0	Apr 2, 2011
regex extension to handle matching parens?	5	Jan 23, 2009
Regexp for parsing?	6	Nov 19, 2007
JavaScript Challenge: Validating Email Addresses	1	Oct 6, 2023
Two expression variant of ()	13	Mar 30, 2012
Contents of an exception object	95	May 28, 2014
help with regex matching multiple %e	0	Mar 3, 2011
Simple web framework - improvements to makefile	0	Feb 1, 2023

Parens Matching

Chris Uppal

Daniel Pitts

Alex Hunsley

Alex Hunsley

Alex Hunsley

Patricia Shanahan

Alex Hunsley

Alex Hunsley

Chris Uppal

Patricia Shanahan

Christian

Oliver Wong

Daniel Pitts

Christian

Oliver Wong

Gordon Beaton

Chris Uppal

Gordon Beaton

Oliver Wong

Chris Uppal

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads