Roedy Green said:
please define DFA and NFA and where newbies can learn more.
Right, sorry. DFA and NFA stands for Deterministic Finite Automaton and
Nondeterministric Finite Automaton respectively. One typically studies them
in the field of Theory of Computation, where it is explored what
computational models we have, and how powerful they are. Roughly, Regular
Languages, which is what Regular Expressions can recognize, is the weakest,
then comes Context-Free Languages (which are recognized by Context-Free
Grammars), then finally there's the Decidable languages (which are
recognized by Turing machines).
The most popular book, it seems, for studying this topic is
"Introduction to the Theory of Computation" ISBN 053494728X by Michael
Sipser. That's the book I used, at least, and it gave the algorithm for
converting *FAs into regular expressions, which I later translated to Java
and posted on SourceForge as mentioned earlier in this thread.
To get a formal explanation of what DFAs and NFAs are, you'd probably be
better off reading Sipser's book. Here I'll give a very informal overview
which thus may use the wrong terminology, or be technically incorrect in
some details, for the sake of brevity. The general concept should be correct
though.
A DFA is typically represented using a state diagram, where states are
represented by circles, and transitions are represented by labelled arrows
that start at some circle, and point to another (or the same) circle. The
label on the arrow indicates what character should be read at that point. So
here's my attempt at ASCII art representing a simple state diagram:
A
|--|
| | B
V | |----------|
V |
||====|| |----| ----|
|| S0 || | S1 | | A
||====|| |----| <---|
B ^
|--------------|
So, hopefully, as you can see, this DFA has 2 states, S0 and S1, and has 4
edges. Traditionally, "accepting states" are drawn in a double circle, so
here S0 is an accepting state, but S1 is not. You can have more than one
accepting state. You also have to define a start state, and AFAIK, there's
no standard notation for this, so I'll just say S0 is a start state. You can
only have 1 accepting state.
DFAs are machines for accepting strings. So given some string, for each
character in that string, you should find the arrow that comes from the
state you're currently in, with the label matching the character you're
currently processing to determine what state you should go into next.
The above DFA would accept the string "BAAB", but it would not accept
"BAAA". Let's walk through the strings to see what happens. First, the
string "BAAB". You start at the start state, which is S0, and the first
character you see is B. According to the diagram, you should move into state
S1. The next character is A. There's an arrow coming from S1 labelled with
"A" which goes right back to S1, so we consume the character "A" and we stay
in state S1. Next we see another "A". The same thing happens again, where we
consume the "A" and we remain in state S1. Then next character is "B",
which, according to the diagram, means we should move to state S0. Now this
is the end of the string, so we take a look at whether we are in an
accepting state or not. We're in S0, which is an accepting state, so the
string "BAAB" is accepted.
If you try the same thing for "BAAA", you should start in S0, then move to
S1, then you'll keep staying in S1 as you consume all the "A"s. Finally,
when the string ends, you'll still be in S1, which is not an accepting
state, so the string "BAAA" is rejected.
Sipser gives a proof in the book which shows that DFAs are exactly as
powerful as regular expressions, so any string you can match using a regular
expression, you can also match using a DFA.
The OP's original request was for a regular expression which could accept a
string which could not start nor end with a space, but which could contain
the space character inside if it were a single space, but not a sequence of
two spaces in a row or more.
Now I don't know how to write a regular expression for that off the top of
my head, but I can draw a DFA for that pretty easily. In this case, I'll
give the DFA in the form of a transition table:
Non-Space Character Space Character Accepting
S0 Transition to S0 Transition to S3 False
S1 Transition to S1 Transition to S2 True
S2 Transition to S1 Transition to S3 False
S3 Transition to S3 Transition to S3 False
S0 is the start state. State S3 is known as a "trap state", which is just an
informal term to mean once you enter S3, you can never get out, and S3 is
made to be non-accepting. That way, the moment you see something which makes
you sure that the string cannot be accepted, you can transition to S3, where
the DFA will just eat up the rest of the string and then, when it reaches
the end of the string, it'll reject it.
When you start at S0, I know that the string cannot start with a space, so
if I see a space character, I immediately transition to S3, which will trap
the DFA into a failure. Otherwise, I'll see a non-space character, which
takes me to S1.
S1 is an accepting state, because I assume strings of length 1 are okay. S0
could have been an accepting state too, which would have meant strings of
length 0 are okay, but I didn't know the OP's requirements, so I just made
an assumption here and chose to make it non-accepting. In S1, if we keep
seeing non-space characters, we're happy and we keep eating them up, staying
in S1, where we can terminate and accept at any time.
But what if we see a space character? A single space character all by itself
is allowed, but there's if you see one space character, there's a danger
that there might be a second space character coming up, which is not allowed
in the strings we're interested in. So I create "danger" state, called s2.
If you're in s1, and you see a space, you're moved into "danger mode".
In S2, if you see another non-space character, then the danger is gone, and
you can safely go back to S1. However, if you see a second space character,
then you know this string is guaranteed to be invalid, which is why it
transitions to S3.
Note that S2 was not made accepting. Why? Because the OP's original problem
description said that the string cannot end in a space, but if we're in S2,
we just saw a space, so if the string ends there, we should reject it.
Finally S3 always transitions to itself no matter what. That's the nature of
trap states; they're like black holes which gobble up everything in their
path.
The algorithm for transforming a DFA into a regular expression is not
difficult, but it'll probably significantly lengthen this already very long
post (and requires many diagrams to properly explain, and I'm not feeling
very ASCII-artsy right now). As such, if you're interested, I recommend you
buy Sipser's book. Alternative, you can download my implementation from
SourceForge and take a look at the source code (I believe it's less than 1
page long), and try to decypher it yourself. The URL again, if you're
interested, is
http://sourceforge.net/projects/gnfa2re/
- Oliver