regexp(ing) Backus-Naurish expressions ...

Discussion in 'Java' started by qwertmonkey@syberianoutpost.ru, Mar 10, 2013.

  1. Guest

    I need to set up some code's running context via properties files, and I want
    to make sure that users don't get too playful messing with them, because that
    could alter results greatly and in unexpected ways (which they most probably
    won't be able to make sense of, and then they would bother the hell out of you).
    ~
    So I must sanity-check the running parameters, whether they are entered via the
    command prompt or taken as defaults from the properties files.
    ~
    I am telling you all of that because you may know of libraries that do such
    a thing.
    ~
    I think one possible way to do that is via a regexp, which should match all
    the options included in the test array aISAr.
    ~
    One of the problems I am having is that if you enter as options, say, [true|t],
    the matcher matches just the "t" of "true", and I want "true" to be
    actually matched. Another is that, say, " true " should be matched, as well
    as "false [ nix |mac| windows ] line.separator" ...
    ~
    Any ideas you would share?
    ~
    thanks,
    lbrtchx
    ~
    ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ TEST CODE ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // __
    public class RegexMatches02Test{
    // __
    public static void main( String args[] ){
    String aRegEx;
    String aIS;
    Pattern Ptrn;
    Matcher Mtchr;
    int iCnt, iMtxStart, iMtxEnd;
    // __
    aRegEx = "^\\s*[true|false|t|f]{1}\\s*\\[";
    aRegEx = "^\\s*[true|false|t|f]{1}";
    aRegEx = "^\\s*[true|false|t|f]{1}\\s*";
    aRegEx = "^\\s*[true|false t|f]{1}\\s*";

    // __
    String[] aISAr = new String[]{
    " true[a|b |c ] q"
    , " true [a|b |c ] q"
    , "true [a|b |c ] q"
    , "true[a|b|c] b"
    , "true[a|b|c]q"
    , "False[ y | n | q ] q"
    , "false[nix|windows|mac]line.separator"
    , "false [ nix |mac| windows ] line.separator"
    , "T[y|n]q"
    , "T[y]"
    , "false"
    , "faLse"
    , "true"
    , "TrUe"
    , "F"
    , "T"
    };
    int iISArL = aISAr.length, i = 0;
    // __
    boolean IsLoop;
    Ptrn = Pattern.compile(aRegEx, Pattern.CASE_INSENSITIVE);

    System.err.println("// __ matching pattern: |" + aRegEx + "|");

    Mtchr = Ptrn.matcher(aISAr[i]); // get a matcher object for the first test string
    IsLoop = (i < iISArL);
    while(IsLoop){
    System.err.println("// __ |" + i + "|" + aISAr[i] + "|");
    iCnt = 0;
    // __
    while(Mtchr.find()){
    iMtxStart = Mtchr.start();
    iMtxEnd = Mtchr.end();
    System.err.println("|" + iCnt + "|" + iMtxStart + "|" + iMtxEnd + "|" +
    aISAr[i].substring(iMtxStart, iMtxEnd) + "|");
    ++iCnt;
    }// (Mtchr.find())
    System.err.println("~");
    // __
    ++i;
    IsLoop = (i < iISArL);
    if(IsLoop){ Mtchr.reset(aISAr[i]); } // reuse the matcher on the next test string
    }// while(IsLoop)
    }
    }
    , Mar 10, 2013
    #1

  2. Arne Vajhøj Guest

    On 3/9/2013 9:27 PM, wrote:
    > I need to set up some code's running context via properties files and I want
    > to make sure that users don't get too playful messing with them, because that
    > could alter results greatly and in unexpected ways (they must probably won't
    > be able to make sense of and then they would bother the hell out of you)
    > ~
    > So, I must do some sanity check the running parameters if entered via the
    > command prompt or if the defaults are used from the properties files
    > ~
    > I am telling you all of that because you many know of libraries to do such
    > thing
    > ~
    > I think one possible way to do that is via a regexp, which should match all
    > the options included in the test array aISAr
    > ~
    > One of the problems I am having is that if you enter as options say [true|t],
    > the matcher would match just the "t" of "true" and I want for "true" to be
    > actually matched another one is that, say, " true ", should be matched, as well
    > as "false [ nix |mac| windows ] line.separator" ...
    > ~
    > Any ideas you would share?


    I would do it as:
    - switch from properties to XML
    - define a schema for the XML with strict restrictions on data
    - let the application parse that with a validating parser and
    read it into some config object, this will ensure that required
    information is there and that the data types are correct
    - let the application apply business validation rules in Java code
    on the config objects - this will ensure that the various
    information is consistent
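
    A minimal sketch of the schema-validation step in the list above, using the
    JDK's javax.xml.validation API. The schema, the element names (config,
    verbose, os) and the class name ConfigSchemaSketch are assumptions made up
    for illustration; a real configuration would need its own schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class ConfigSchemaSketch {
    // Hypothetical schema: <config> holds a boolean flag and an enumerated OS name.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='config'><xs:complexType><xs:sequence>"
      + "<xs:element name='verbose' type='xs:boolean'/>"
      + "<xs:element name='os'><xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:enumeration value='nix'/>"
      + "<xs:enumeration value='mac'/>"
      + "<xs:enumeration value='windows'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if the XML text conforms to the schema above, false otherwise.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<config><verbose>true</verbose><os>mac</os></config>"));
        System.out.println(isValid("<config><verbose>maybe</verbose><os>mac</os></config>"));
    }
}
```

    The schema only covers data types and allowed values; cross-field
    consistency rules would still be checked in Java code, as the last item of
    the list says.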

    Arne
    Arne Vajhøj, Mar 10, 2013
    #2

  3. On 3/9/2013 8:27 PM, wrote:
    > One of the problems I am having is that if you enter as options say [true|t],
    > the matcher would match just the "t" of "true" and I want for "true" to be
    > actually matched another one is that, say, " true ", should be matched, as well
    > as "false [ nix |mac| windows ] line.separator" ...


    Do you know the syntax of Java's regular expressions? See
    <http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html>.

    In short, anything contained within square brackets is considered to be
    a set of characters to match on, so [true|t] succeeds if the character
    it's matching against is a t, r, u, e, or |. The syntax you probably
    wanted was (true|t), which would either match the string "true" or the
    string "t".
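
    A small sketch of that difference, run against one of the test strings
    from the original post. The class and method names (ClassVersusGroup,
    firstMatch) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClassVersusGroup {
    // Returns the first case-insensitive match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = " true [a|b |c ] q";
        // Character class: exactly ONE character out of t, r, u, e, |
        System.out.println("|" + firstMatch("^\\s*[true|t]", input) + "|");
        // Alternation group: the string "true" or the string "t"
        System.out.println("|" + firstMatch("^\\s*(true|t)", input) + "|");
    }
}
```

    The first call matches only " t", the second matches " true". Java tries
    alternatives from left to right, so writing the group as (t|true) would
    again stop after the "t".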

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer 🐧, Mar 10, 2013
    #3
  4. Stefan Ram Guest

    writes:
    > I am telling you all of that because you many know of libraries to do such
    >thing


    The config class can be seen as a bean, and then bean
    validation can be applied, possibly (I never used that).

    http://docs.oracle.com/javaee/6/tutorial/doc/gircz.html

    > One of the problems I am having is that if you enter as options say [true|t],
    >the matcher would match just the "t" of "true" and I want for "true" to be


    (?:true|t(?=[^r][^u][^e]))

    (sketch, untested)
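
    A short sketch that exercises the lookahead pattern above and compares it
    with a word-boundary variant. The \b variant is a swapped-in alternative,
    not something proposed in this post, and the class name
    LookaheadVersusBoundary is made up for illustration.

```java
import java.util.regex.Pattern;

final class LookaheadVersusBoundary {
    // The lookahead sketch from the post, anchored at the start of the input.
    static final Pattern LOOKAHEAD =
        Pattern.compile("^\\s*(?:true|t(?=[^r][^u][^e]))", Pattern.CASE_INSENSITIVE);

    // Alternative (assumption): require a word boundary after the keyword.
    static final Pattern BOUNDARY =
        Pattern.compile("^\\s*(?:true|false|t|f)\\b", Pattern.CASE_INSENSITIVE);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"true", "T[y]", "T", "trux"}) {
            System.out.println("|" + s + "| lookahead=" + found(LOOKAHEAD, s)
                + " boundary=" + found(BOUNDARY, s));
        }
    }
}
```

    The lookahead needs three characters after the "t", so a bare "T" is
    rejected by it, while the boundary variant accepts "T" and still rejects
    "trux".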
    Stefan Ram, Mar 10, 2013
    #4
  5. Roedy Green Guest

    Roedy Green, Mar 10, 2013
    #5
  6. markspace Guest

    On 3/9/2013 6:27 PM, wrote:

    > One of the problems I am having is that if you enter as options say [true|t],
    > the matcher would match just the "t" of "true" and I want for "true" to be
    > actually matched another one is that, say, " true ", should be matched, as well
    > as "false [ nix |mac| windows ] line.separator" ...
    > ~
    > Any ideas you would share?
    > ~



    Based on your syntax example and your title, why bother with
    "Backus-Naurish"? Java has full parser generators.

    http://www.antlr.org/
    markspace, Mar 10, 2013
    #6
  7. On 10.03.2013 15:57, Roedy Green wrote:
    > On Sun, 10 Mar 2013 02:27:32 +0000 (UTC),
    > wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> Any ideas you would share?

    >
    > Regexes are quite limited.


    I beg to differ: it's amazing what you can do with them. Modern regex
    engines in particular are usually much more powerful than what is needed
    for the class of regular languages.

    > When you bang into their limits you can
    > write a finite state machine or use a parser.


    What limitations would make me want to write a FSM instead by hand?

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Mar 10, 2013
    #7
  8. Stefan Ram Guest

    Robert Klemme <> writes:
    >What limitations would make me want to write a FSM instead by hand?


    It is a natural idea that the user may input simple
    arithmetic expressions with numeric literals, basic
    arithmetics, parentheses and algebraic signs when the
    program asks for a numeric value.
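
    A minimal recursive-descent sketch of such an input evaluator. Arbitrarily
    nested parentheses are the part that a single classical regular expression
    cannot describe. The grammar and the class name ArithmeticSketch are
    assumptions chosen for illustration.

```java
final class ArithmeticSketch {
    private final String src;
    private int pos;

    private ArithmeticSketch(String src) {
        this.src = src;
    }

    // Evaluates the whole text or throws IllegalArgumentException on bad input.
    static double eval(String text) {
        ArithmeticSketch p = new ArithmeticSketch(text);
        double value = p.expr();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (true) {
            if (accept('+')) v += term();
            else if (accept('-')) v -= term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (true) {
            if (accept('*')) v *= factor();
            else if (accept('/')) v /= factor();
            else return v;
        }
    }

    // factor := ('+' | '-') factor | '(' expr ')' | number
    private double factor() {
        if (accept('+')) return factor();
        if (accept('-')) return -factor();
        if (accept('(')) {
            double v = expr();
            if (!accept(')')) throw new IllegalArgumentException("missing ')' at " + pos);
            return v;
        }
        return number();
    }

    // number := digits with an optional decimal point
    private double number() {
        skipSpaces();
        int start = pos;
        while (pos < src.length()
                && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    // Consumes c (after optional whitespace) if it is the next character.
    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        System.out.println(eval("-(2 + 3) * 4"));
    }
}
```

    Each grammar rule becomes one method, and the nesting depth of the
    parentheses is tracked by the Java call stack rather than by the pattern.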
    Stefan Ram, Mar 10, 2013
    #8
  9. Roedy Green Guest

    Examples where regexes run out of steam:
    parsing Java, HTML, BAT language ... to do syntax colouring.
    screen scraping, where what you want can appear in arbitrary order, be
    missing, or enclosed in a variety of delimiters.

    creating code to simulate the output of forms. You have to do it in
    stages. You pick out a string then you pick out strings of that


    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Software gets slower faster than hardware gets faster.
    ~ Niklaus Wirth (born: 1934-02-15 age: 79) Wirth's Law
    Roedy Green, Mar 10, 2013
    #9
  10. Roedy Green Guest

    On Sun, 10 Mar 2013 22:39:22 +0100, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >What limitations would make me want to write a FSM instead by hand?


    Compacting out nugatory space in HTML would be another example.

    Though they are quite complicated, I find FSMs very easy to write, and
    they almost always work first time. You can narrow your thinking to a
    tiny case and ignore the big picture quite safely.

    In contrast, I find my regexes (of any complexity) nearly always have
    some unexpected behaviour, often one that does not show up immediately.

    The other complicating factor is I use three different regex schemes
    in a day: Java, Funduc and SlickEdit. I keep borrowing syntax from
    a scheme other than the one I am using. Some day I will
    have to write replacements that use Java syntax.
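
    A rough sketch of the kind of hand-written FSM meant here, applied to the
    whitespace-compaction case mentioned at the top of this post. It is a
    simplification under stated assumptions: it knows nothing about <pre>,
    comments, scripts, or a '>' inside an attribute value, and the names
    (SpaceCompactorSketch, State) are made up.

```java
final class SpaceCompactorSketch {
    // TEXT: inside ordinary text; SPACE: inside a whitespace run; TAG: inside <...>
    private enum State { TEXT, SPACE, TAG }

    // Collapses each whitespace run outside tags to one space; tags are copied verbatim.
    static String compact(String html) {
        StringBuilder out = new StringBuilder(html.length());
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (state == State.TAG) {
                out.append(c);
                if (c == '>') state = State.TEXT;       // tag closed
            } else if (c == '<') {
                out.append(c);
                state = State.TAG;                      // tag opened
            } else if (Character.isWhitespace(c)) {
                if (state == State.TEXT) out.append(' '); // first whitespace of a run
                state = State.SPACE;                    // later ones are dropped
            } else {
                out.append(c);
                state = State.TEXT;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compact("<p>a  \n  b</p>   <p class=\"x  y\">c</p>"));
    }
}
```

    Each branch looks at one state and one character only, which is the
    "narrow your thinking to a tiny case" property described above.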
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Software gets slower faster than hardware gets faster.
    ~ Niklaus Wirth (born: 1934-02-15 age: 79) Wirth's Law
    Roedy Green, Mar 10, 2013
    #10
  11. On 10.03.2013 23:21, Stefan Ram wrote:
    > Robert Klemme <> writes:
    >> What limitations would make me want to write a FSM instead by hand?

    >
    > It is a natural idea that the user may input simple
    > arithmetic expressions with numeric literals, basic
    > arithmetics, parentheses and algebraic signs when the
    > program asks for a numeric value.


    I am sorry but you are not answering the question.

    Cheers

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Mar 11, 2013
    #11
  12. On 10.03.2013 23:54, Roedy Green wrote:
    > Examples where regexes run out of steam:


    I never said you can do everything with regexps. You said they are "quite
    limited" to which I responded "I beg to differ: it's amazing what you
    can do with them." I think you are talking completely past me.

    > parsing Java, HTML, BAT language ... to do syntax colouring.


    For that you need a context free parser anyway and would not create a
    FSM by hand.

    > screen scraping, where what you want can appear in arbiter orders, be
    > missing, or enclosed in a variety of delimiters.


    Still, I haven't seen a single reason to create a FSM by hand.

    > creating code to simulate the output of forms. You have to do it in
    > stages. You pick out a string then you pick out strings of that


    Regexps are for _parsing_ and not for _generating_.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Mar 11, 2013
    #12
  13. On 11.03.2013 00:24, Roedy Green wrote:
    > On Sun, 10 Mar 2013 22:39:22 +0100, Robert Klemme
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> What limitations would make me want to write a FSM instead by hand?

    >
    > Compacting out nugatory space in HTML would be another example.


    There are tools for processing tag based languages. Why would I want to
    create a FSM by hand for that?

    > Though they are quite complicated, I find FSMs very easy to write, and
    > they almost always work first time. You can narrow your thinking to a
    > tiny case and ignore the big picture quite safely.


    Certainly you can write FSMs for a lot of things. But you were claiming
    that a manual FSM should be used instead of a regexp engine; so the
    question remains unanswered: why would anyone create a FSM by hand for
    parsing?

    > In contrast, I find my regexes (of any complexity) nearly always have
    > some unexpected behaviour, often than does not show up immediately.


    Well, that certainly depends on your familiarity with the tool. To me
    this sounds suspiciously like NIH syndrome. I am so familiar with using
    regular expressions of various kinds that it would not occur to me to
    start writing a FSM for parsing by hand. That is such a waste of time.

    > The other complicating factor is I use three different regex schemes
    > in a day: Java, Funduc and SlickEdit. I keep borrowing syntax from
    > one of the other schemes than the one I am using.


    And how exactly do you implement a FSM in SlickEdit?

    > Some day I will
    > have to write replacements that use Java syntax.


    Not sure what you mean by that.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Mar 11, 2013
    #13
  14. Arne Vajhøj Guest

    On 3/11/2013 4:08 PM, Robert Klemme wrote:
    > On 11.03.2013 00:24, Roedy Green wrote:
    >> On Sun, 10 Mar 2013 22:39:22 +0100, Robert Klemme
    >> <> wrote, quoted or indirectly quoted
    >> someone who said :
    >>
    >>> What limitations would make me want to write a FSM instead by hand?

    >>
    >> Compacting out nugatory space in HTML would be another example.

    >
    > There are tools for processing tag based languages. Why would I want to
    > create a FSM by hand for that?
    >
    >> Though they are quite complicated, I find FSMs very easy to write, and
    >> they almost always work first time. You can narrow your thinking to a
    >> tiny case and ignore the big picture quite safely.

    >
    > Certainly you can write FSMs for a lot of things. But you were claiming
    > that a manual FSM should be used instead of a regexp engine; so the
    > question remains unanswered: why would anyone create a FSM by hand for
    > parsing?


    It sounds cool to claim to do so in a usenet thread!

    :)

    >> The other complicating factor is I use three different regex schemes
    >> in a day: Java, Funduc and SlickEdit. I keep borrowing syntax from
    >> one of the other schemes than the one I am using.

    >
    > And how exactly do you implement a FSM in SlickEdit?
    >
    >> Some day I will
    >> have to write replacements that use Java syntax.

    >
    > Not sure what you mean by that.


    I think he is talking about writing a plugin with a 100%
    Java compatible regex syntax.

    Arne
    Arne Vajhøj, Mar 11, 2013
    #14
  15. On 03/11/2013 09:59 PM, Arne Vajhøj wrote:
    > On 3/11/2013 4:08 PM, Robert Klemme wrote:


    >> Certainly you can write FSMs for a lot of things. But you were claiming
    >> that a manual FSM should be used instead of a regexp engine; so the
    >> question remains unanswered: why would anyone create a FSM by hand for
    >> parsing?

    >
    > It sounds cool to claim to do so in a usenet thread!
    >
    > :)


    You've got a point there!

    Cheers

    robert
    Robert Klemme, Mar 11, 2013
    #15
  16. On 3/10/2013 5:54 PM, Roedy Green wrote:
    > Examples where regexes run out of steam:
    > parsing Java, HTML, BAT language ... to do syntax colouring.


    Actually, all of those examples fall under the category of lexing, which
    is very easy to do with regular expressions; the python equivalent of
    flex uses regular expressions internally to do the lexing. Basically,
    what you'd have to do is this:

    1. For each token, compute the regex that matches the token and enclose
    it in a named capturing group
    2. Combine the token regexes into a single regex using disjunctions
    3. Run the large regex on the input string by continually finding
    matches until it runs out of them.
    4. For each match, use the named capturing group to do actions for that
    part of the input string.
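
    The four steps above can be sketched in Java with named capturing groups
    (available since Java 7). The token set and the names used here
    (RegexLexerSketch, NUMBER, IDENT, OP, SPACE) are assumptions for
    illustration, not the lexer discussed in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexLexerSketch {
    // Step 1: one regex per token, keyed by the name of its capturing group.
    private static final String[][] TOKENS = {
        {"NUMBER", "\\d+"},
        {"IDENT",  "[A-Za-z_]\\w*"},
        {"OP",     "[-+*/=()]"},
        {"SPACE",  "\\s+"},
    };

    // Step 2: combine the token regexes into a single alternation.
    private static final Pattern LEXER = buildPattern();

    private static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (String[] token : TOKENS) {
            if (sb.length() > 0) sb.append('|');
            sb.append("(?<").append(token[0]).append('>').append(token[1]).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns the tokens of input as "KIND:text" strings, skipping whitespace tokens.
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {                         // step 3: keep finding matches
            for (String[] token : TOKENS) {        // step 4: see which group matched
                if (m.group(token[0]) != null) {
                    if (!token[0].equals("SPACE")) {
                        out.add(token[0] + ":" + m.group());
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("x1 = (42 + y) * 7"));
    }
}
```

    Note that find() silently skips characters that no token matches; a real
    lexer would compare m.start() with the end of the previous match and
    report the gap as an error.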

    > screen scraping, where what you want can appear in arbiter orders, be
    > missing, or enclosed in a variety of delimiters.


    ([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ \t\r\n()<>,:;@["])+

    That is an example of a production regular expression I use specifically
    for tokenizing. Note in particular that I am matching two separate kinds
    of string literals ("foo" and [foo]). The hard part here is that I'm
    dealing with an idiot language that made comment-parsing context-free,
    but I decided to say "to hell with this" and ignore that fact, banking
    that it's a rare edge case I never have to deal with.

    Granted, such large regular expressions can become extremely unwieldy
    (said regex is actually composed out of about five lines of code plus
    detailed comments above each part explaining what it does), but it's
    still very simple to do in a regex.

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer 🐧, Mar 11, 2013
    #16
  17. Stefan Ram Guest

    Joshua Cranmer 🐧 <> writes:
    >On 3/10/2013 5:54 PM, Roedy Green wrote:
    >>parsing Java

    >Actually, all of those examples fall under the category of lexing,


    Parsing is not lexing, usually parsing comes after lexing.
    Stefan Ram, Mar 11, 2013
    #17
  18. Eric Sosman Guest

    On 3/11/2013 6:00 PM, Joshua Cranmer 🐧 wrote:
    > [...]
    > ([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
    > \t\r\n()<>,:;@["])+
    >
    > That is an example of a production regular expression I use specifically
    > for tokenizing. [...]


    As Ed Post noted nearly thirty years ago:

    It has been observed that a TECO command sequence
    more closely resembles transmission line noise
    than readable text.
    -- "Real Programmers Don't Use PASCAL"

    Nobody I know of uses TECO any more, but regexes satisfy
    people's craving for gibberish.

    --
    Eric Sosman
    Eric Sosman, Mar 11, 2013
    #18
  19. On 3/11/2013 6:31 PM, Eric Sosman wrote:
    > On 3/11/2013 6:00 PM, Joshua Cranmer 🐧 wrote:
    >> [...]
    >> ([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
    >> \t\r\n()<>,:;@["])+
    >>
    >> That is an example of a production regular expression I use specifically
    >> for tokenizing. [...]

    >
    > As Ed Post noted nearly thirty years ago:
    >
    > It has been observed that a TECO command sequence
    > more closely resembles transmission line noise
    > than readable text.
    > -- "Real Programmers Don't Use PASCAL"
    >
    > Nobody I know of uses TECO any more, but regexes satisfy
    > people's craving for gibberish.


    $ edit/teco z.z
    %Can't find file "Z.Z"
    %Creating new file
    *ex$$

    :)

    (sorry - the only thing I know about TECO is how to exit)

    Arne
    Arne Vajhøj, Mar 11, 2013
    #19
  20. Eric Sosman Guest

    On 3/11/2013 6:40 PM, Arne Vajhøj wrote:
    > On 3/11/2013 6:31 PM, Eric Sosman wrote:
    >>[...]
    >> Nobody I know of uses TECO any more, but regexes satisfy
    >> people's craving for gibberish.

    >
    > $ edit/teco z.z
    > %Can't find file "Z.Z"
    > %Creating new file
    > *ex$$
    >
    > :)
    >
    > (sorry - the only thing I know about TECO is how to exit)


    Perhaps the most important lesson of all! ;-)

    --
    Eric Sosman
    Eric Sosman, Mar 12, 2013
    #20
