regexp for legal JNDI name (covering check for 2 space sequence)

D

davidmichaelkarr

I found I needed to build a regexp for a legal JNDI name. Tracing
through the schemas, it appears that a JNDI name is supposed to be a
"xsd:token", which has a curious definition. From the definition, it
doesn't have leading or trailing spaces, or carriage returns, line
feeds, or tabs. The odd part is that it cannot have sequences of TWO
or more spaces, but single spaces are allowed.

I'm not concerned about the leading/trailing spaces. The two-space
thing is what I can't figure out. In general, how do you specify a
regexp to match a string that specifies a substring to NOT find in the
candidate string?

I could always separate this into a regexp test and then an "indexOf()"
test, but I'd prefer to put it all into the regexp.
 
J

Joan

I found I needed to build a regexp for a legal JNDI name.
Tracing
through the schemas, it appears that a JNDI name is supposed to
be a
"xsd:token", which has a curious definition. From the
definition, it
doesn't have leading or trailing spaces, or carriage returns,
line
feeds, or tabs. The odd part is that it cannot have sequences
of TWO
or more spaces, but single spaces are allowed.

I'm not concerned about the leading/trailing spaces. The
two-space
thing is what I can't figure out. In general, how do you
specify a
regexp to match a string that specifies a substring to NOT find
in the
candidate string?

I could always separate this into a regexp test and then an
"indexOf()"
test, but I'd prefer to put it all into the regexp.

What do you have so far?
 
O

Oliver Wong

The easy part that doesn't cover the 2 space sequence:

"^[^\n\r\t]+$"

How about you define a character class that covers all the legal
characters except for 2 spaces, called it :alphanum: (since I assume that
the set of alphanumerical characters falls in this category). Also, I'll
assume \s is the character to mean a space.

^:alphanum:(:alphanum:|\s:alphanum:)*$

This reads: a non-space character, followed by any number of: (a
non-space character OR a space followed by a non-space character)

BTW, the trick I usually use is to first draw a DFA or NFA, then convert
it to a Regular Expression. I've written a program in Java to do this
conversion automatically for me, although in simple cases like this (where
the DFA only had 2 nodes), it's easier to just do the conversion by hand.

You can download my program at: http://sourceforge.net/projects/gnfa2re/

- Oliver
 
R

Roedy Green

I'm not concerned about the leading/trailing spaces. The two-space
thing is what I can't figure out. In general, how do you specify a
regexp to match a string that specifies a substring to NOT find in the
candidate string?

Java has (?!X)

I have not tried the Java version out , but in other regex systems
you have to find something then add -- but NOT this.
 
D

davidmichaelkarr

Very intriguing. That lead me to a working expression.

I looked at your project. I assume the only thing I can see is the
GNFA.java source file. There don't appear to be any other artifacts. I
would experiment with it, but I don't have a clue how to use it.
 
O

Oliver Wong

Roedy Green said:
please define DFA and NFA and where newbies can learn more.

Right, sorry. DFA and NFA stands for Deterministic Finite Automaton and
Nondeterministric Finite Automaton respectively. One typically studies them
in the field of Theory of Computation, where it is explored what
computational models we have, and how powerful they are. Roughly, Regular
Languages, which is what Regular Expressions can recognize, is the weakest,
then comes Context-Free Languages (which are recognized by Context-Free
Grammars), then finally there's the Decidable languages (which are
recognized by Turing machines).

The most popular book, it seems, for studying this topic is
"Introduction to the Theory of Computation" ISBN 053494728X by Michael
Sipser. That's the book I used, at least, and it gave the algorithm for
converting *FAs into regular expressions, which I later translated to Java
and posted on SourceForge as mentioned earlier in this thread.

To get a formal explanation of what DFAs and NFAs are, you'd probably be
better off reading Sipser's book. Here I'll give a very informal overview
which thus may use the wrong terminology, or be technically incorrect in
some details, for the sake of brevity. The general concept should be correct
though.

A DFA is typically represented using a state diagram, where states are
represented by circles, and transitions are represented by labelled arrows
that start at some circle, and point to another (or the same) circle. The
label on the arrow indicates what character should be read at that point. So
here's my attempt at ASCII art representing a simple state diagram:

A
|--|
| | B
V | |----------|
V |
||====|| |----| ----|
|| S0 || | S1 | | A
||====|| |----| <---|
B ^
|--------------|

So, hopefully, as you can see, this DFA has 2 states, S0 and S1, and has 4
edges. Traditionally, "accepting states" are drawn in a double circle, so
here S0 is an accepting state, but S1 is not. You can have more than one
accepting state. You also have to define a start state, and AFAIK, there's
no standard notation for this, so I'll just say S0 is a start state. You can
only have 1 accepting state.

DFAs are machines for accepting strings. So given some string, for each
character in that string, you should find the arrow that comes from the
state you're currently in, with the label matching the character you're
currently processing to determine what state you should go into next.

The above DFA would accept the string "BAAB", but it would not accept
"BAAA". Let's walk through the strings to see what happens. First, the
string "BAAB". You start at the start state, which is S0, and the first
character you see is B. According to the diagram, you should move into state
S1. The next character is A. There's an arrow coming from S1 labelled with
"A" which goes right back to S1, so we consume the character "A" and we stay
in state S1. Next we see another "A". The same thing happens again, where we
consume the "A" and we remain in state S1. Then next character is "B",
which, according to the diagram, means we should move to state S0. Now this
is the end of the string, so we take a look at whether we are in an
accepting state or not. We're in S0, which is an accepting state, so the
string "BAAB" is accepted.

If you try the same thing for "BAAA", you should start in S0, then move to
S1, then you'll keep staying in S1 as you consume all the "A"s. Finally,
when the string ends, you'll still be in S1, which is not an accepting
state, so the string "BAAA" is rejected.

Sipser gives a proof in the book which shows that DFAs are exactly as
powerful as regular expressions, so any string you can match using a regular
expression, you can also match using a DFA.

The OP's original request was for a regular expression which could accept a
string which could not start nor end with a space, but which could contain
the space character inside if it were a single space, but not a sequence of
two spaces in a row or more.

Now I don't know how to write a regular expression for that off the top of
my head, but I can draw a DFA for that pretty easily. In this case, I'll
give the DFA in the form of a transition table:

Non-Space Character Space Character Accepting
S0 Transition to S0 Transition to S3 False
S1 Transition to S1 Transition to S2 True
S2 Transition to S1 Transition to S3 False
S3 Transition to S3 Transition to S3 False

S0 is the start state. State S3 is known as a "trap state", which is just an
informal term to mean once you enter S3, you can never get out, and S3 is
made to be non-accepting. That way, the moment you see something which makes
you sure that the string cannot be accepted, you can transition to S3, where
the DFA will just eat up the rest of the string and then, when it reaches
the end of the string, it'll reject it.

When you start at S0, I know that the string cannot start with a space, so
if I see a space character, I immediately transition to S3, which will trap
the DFA into a failure. Otherwise, I'll see a non-space character, which
takes me to S1.

S1 is an accepting state, because I assume strings of length 1 are okay. S0
could have been an accepting state too, which would have meant strings of
length 0 are okay, but I didn't know the OP's requirements, so I just made
an assumption here and chose to make it non-accepting. In S1, if we keep
seeing non-space characters, we're happy and we keep eating them up, staying
in S1, where we can terminate and accept at any time.

But what if we see a space character? A single space character all by itself
is allowed, but there's if you see one space character, there's a danger
that there might be a second space character coming up, which is not allowed
in the strings we're interested in. So I create "danger" state, called s2.
If you're in s1, and you see a space, you're moved into "danger mode".

In S2, if you see another non-space character, then the danger is gone, and
you can safely go back to S1. However, if you see a second space character,
then you know this string is guaranteed to be invalid, which is why it
transitions to S3.

Note that S2 was not made accepting. Why? Because the OP's original problem
description said that the string cannot end in a space, but if we're in S2,
we just saw a space, so if the string ends there, we should reject it.

Finally S3 always transitions to itself no matter what. That's the nature of
trap states; they're like black holes which gobble up everything in their
path.

The algorithm for transforming a DFA into a regular expression is not
difficult, but it'll probably significantly lengthen this already very long
post (and requires many diagrams to properly explain, and I'm not feeling
very ASCII-artsy right now). As such, if you're interested, I recommend you
buy Sipser's book. Alternative, you can download my implementation from
SourceForge and take a look at the source code (I believe it's less than 1
page long), and try to decypher it yourself. The URL again, if you're
interested, is http://sourceforge.net/projects/gnfa2re/

- Oliver
 
O

Oliver Wong

Very intriguing. That lead me to a working expression.

I looked at your project. I assume the only thing I can see is the
GNFA.java source file. There don't appear to be any other artifacts. I
would experiment with it, but I don't have a clue how to use it.

Yes, the project originated because I wanted to convert relatively large
DFAs (with 20+ states) into a regular expression for a compiler-related
project I was working on. I googled for an existing program to do this, but
could find none, so I quickly whipped one together and posted it on
SourceForge for the next person who wanted to do something similar. As such,
there hasn't been much effort to make it "user friendly".

It does have some JavaDoc written to explain how to use it, but the
comments probably assume you're familiar with DFAs, NFAs and GNFAs (perhaps
by reading Michael Sipser's book which I mention in another branch of this
thread). Essentially, you have to create an N by N array which acts as a
transition table (I describe what a transition table is in that same other
post). The class then crunches the transition table down into a regular
expression.

Towards the bottom of the code, there's a "public static void main"
method which shows an example usage.

- Oliver
 
R

Roedy Green

If you try the same thing for "BAAA", you should start in S0, then move to
S1, then you'll keep staying in S1 as you consume all the "A"s. Finally,
when the string ends, you'll still be in S1, which is not an accepting
state, so the string "BAAA" is rejected.

You can code these things quite neatly with each state represented by
an enum constant. Each enum constant implements a next method like
this:

// presuming we are in this state, and the input is c,
// which state should we go into next?
// State is the state enum.

State next ( char c )
{
switch( c )
{
case 'A':
return S1;
case 'B':
case 'C':
case 'D':
return S2;
default:
return S3;
}
}

Since Java does not have case ranges, spelling out every character
individually can get tedious. So I handle this with a categorise
method that categorises characters by groups with an category enum,
e.g. UPPER_CASE DIGITS ACCENTED SLASH STAR COLON ADDITION,
then next works with the categories in its switch.
This code is remarkably fast.

categorise works with a combination of ifs to determine range band and
switches for fine tuning.

All the JDisplay colourised listings on my side were parsed this way.
I wrote finite state automatons to parse snippets of Java, HTML, SQL
and BAT file language. I handle XML with the HTML colouriser.
 
A

Alan Moore

I found I needed to build a regexp for a legal JNDI name. Tracing
through the schemas, it appears that a JNDI name is supposed to be a
"xsd:token", which has a curious definition. From the definition, it
doesn't have leading or trailing spaces, or carriage returns, line
feeds, or tabs. The odd part is that it cannot have sequences of TWO
or more spaces, but single spaces are allowed.

I'm not concerned about the leading/trailing spaces. The two-space
thing is what I can't figure out. In general, how do you specify a
regexp to match a string that specifies a substring to NOT find in the
candidate string?

I could always separate this into a regexp test and then an "indexOf()"
test, but I'd prefer to put it all into the regexp.

Match anything that's not one of the whitespace characters, or a space
if it's not followed by another space (using negative lookahead):

^(?:[^ \r\n\t]+| (?! ))+$

I assume formfeeds and vertical tabs are also verboten, so you should
be able to use \S instead of the negated character class. Here's a
regex that also covers the leading/trailing spaces issue:

^\S(?:\S+| (?! ))*(?<=\S)$

This requires the target to contain one non-whitespace character,
followed by zero or more non-WS characters or single spaces. Then it
uses lookbehind to ensure that the final character is non-WS.
 
O

Oliver Wong

[This post intentionally does not quote anything from any prior messages]

I just saw your Finite State page at
http://mindprod.com/jgloss/finitestate.html and under NFAs, you say "NFAs
don't always give the same answer." This is not what the "Nonderterministic"
in "Nonderministic Finite Automata" refers to; at least, that's not what it
means under the traditional usage of the term "NFA".

Sipser gives a (very) formal definition of NFA in his book, and under
this definition, whether or not a given NFA accepts a given input is
completely deterministic.

NFAs are akin to hyper-parallel machines (Quantum computers are
reminiscent of NFAs, but I'm don't know enough about quantum computing to
say much more than that) in that rather than making a choice to take one or
the other branch of a computational path, an NFA can take both branches at
the same time to see which one will yield a meaningful answer.

- Oliver
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,020
Latest member
GenesisGai

Latest Threads

Top