Parsing a Boolean expression easy?

Discussion in 'Java' started by cbongior@stny.rr.com, Aug 15, 2005.

  1. Guest

    String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";

    I would like to parse the above text into its components in an easy
    and preferably native Java library fashion. I mean, I can implement a
    custom parser, but it would be a little ugly.

    Ultimately, I would like the following tokens:

    1) Christian bongiorno
    2) AND
    3) Joe
    4) OR
    5) Electrical, plumbing

    With StringTokenizer I can correctly get the quoted phrases, but it
    doesn't separate the unquoted words. So

    StringTokenizer tokens = new StringTokenizer(
        "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"", "\"");

    produces

    1) Christian bongiorno
    2) AND Joe OR
    3) Electrical, plumbing

    As you guessed, this is for text searching. Also, no third-party
    libraries: it has to be all core Java.

    Ideas?
     
    , Aug 15, 2005
    #1

  2. Oliver Wong Guest


    There's a difference between parsing and tokenizing. A lot of the time
    when people say parsing, they mean tokenizing (which is why the string
    tokenizer solves their problem). The problem you're describing is actual,
    real parsing.

    If you don't want to use third-party tools, then you'll just have to write
    a parser by hand. Look up "recursive descent parsing". You may also want to
    post future questions about this project to comp.compilers to learn more
    about parsing theory.
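
    For readers unfamiliar with the term, here is a minimal sketch of what a
    hand-written recursive descent parser for this kind of query might look
    like. It is an illustration, not something posted in the thread. The
    class name BooleanQueryParser, the grammar, and the choice to let AND
    bind tighter than OR are all assumptions; the original question does not
    specify precedence. The parser returns a fully parenthesized string so
    the grouping is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; grammar (an assumption, AND binds tighter than OR):
//   expr    := term ( "OR" term )*
//   term    := operand ( "AND" operand )*
//   operand := quoted phrase | bare word
class BooleanQueryParser {
    // A quoted phrase (group 1, quotes excluded) or a run of non-space, non-quote chars (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|([^\\s\"]+)");

    private final List<String> tokens = new ArrayList<>();
    private final List<Boolean> quoted = new ArrayList<>();
    private int pos = 0;

    private BooleanQueryParser(String input) {
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            boolean isPhrase = m.group(1) != null;
            tokens.add(isPhrase ? m.group(1) : m.group(2));
            quoted.add(isPhrase);
        }
    }

    // Parses the whole input and returns a fully parenthesized rendering of the tree.
    static String parse(String input) {
        BooleanQueryParser p = new BooleanQueryParser(input);
        String tree = p.parseExpr();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return tree;
    }

    private String parseExpr() {
        String left = parseTerm();
        while (isKeyword("OR")) {
            pos++; // consume OR
            left = "(" + left + " OR " + parseTerm() + ")";
        }
        return left;
    }

    private String parseTerm() {
        String left = parseOperand();
        while (isKeyword("AND")) {
            pos++; // consume AND
            left = "(" + left + " AND " + parseOperand() + ")";
        }
        return left;
    }

    private String parseOperand() {
        if (pos >= tokens.size() || isKeyword("AND") || isKeyword("OR")) {
            throw new IllegalArgumentException("Expected a search term at token " + pos);
        }
        return "[" + tokens.get(pos++) + "]";
    }

    // A quoted "AND" is treated as a search term, not as an operator.
    private boolean isKeyword(String kw) {
        return pos < tokens.size() && !quoted.get(pos) && tokens.get(pos).equals(kw);
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        // prints: (([christian bongiorno] AND [Joe]) OR [Electrical, plumbing])
        System.out.println(parse(testText));
    }
}
```

    Each grammar rule becomes one method, which is the defining trait of
    recursive descent. Supporting parentheses or NOT would mean adding a
    case to parseOperand; that is left out here to keep the sketch short.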

    - Oliver
     
    Oliver Wong, Aug 15, 2005
    #2

  3. shakah Guest

    How about something with regular expressions, e.g.:

    jc@soyuz:~/tmp$ cat bparse.java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class bparse {
        public static void main(String[] asArgs) {
            // The first argument is the regex; the remaining arguments are inputs.
            Pattern p = Pattern.compile(asArgs[0]);
            System.out.println(" regex: '" + asArgs[0] + "'");
            for (int i = 1; i < asArgs.length; ++i) {
                String sExpr = asArgs[i];
                System.out.println("input str: '" + sExpr + "'");
                Matcher m = p.matcher(sExpr);
                // Print each successive match of the regex in this input.
                while (m.find()) {
                    System.out.println(" match: '" + m.group() + "'");
                }
            }
        }
    }

    jc@soyuz:~/tmp$ java bparse '("[^"]*"|AND|OR|[A-Za-z0-9]+)' \
        "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""
    regex: '("[^"]*"|AND|OR|[A-Za-z0-9]+)'
    input str: '"christian bongiorno" AND Joe OR "Electrical, plumbing"'
    match: '"christian bongiorno"'
    match: 'AND'
    match: 'Joe'
    match: 'OR'
    match: '"Electrical, plumbing"'

    ?
     
    shakah, Aug 15, 2005
    #3
  4. Guest

    I was SURE a regular expression could do it, but my regexp skills SUCK!
    As an aside, Linux interprets the command line differently than Windows.
    Windows turned those command-line args into, like, 8 separate arguments.
    So I adapted, and it works!

    Thanks

    Christian

    http://christian.bongiorno.org/resume.pdf
     
    , Aug 15, 2005
    #4
  5. Guest

    One question, though: in the results, is it possible to easily throw out
    the " " around a quoted part?

    So, instead of
    match: '"christian bongiorno"'

    I would get
    match: 'christian bongiorno'
     
    , Aug 15, 2005
    #5
  6. shakah Guest

    wrote:
    > One question though? In the results, is it possible to easily throw out
    > the " " around a quoted part?
    >
    > so...
    > instead of
    > match: '"christian bongiorno"'
    >
    > I get
    > match: 'christian bongiorno'


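    How about:

    while (m.find()) {
        String sMatch = sExpr.substring(m.start(), m.end());
        // Strip the surrounding quotes from a quoted phrase; the length
        // check guards against a match that is a single quote character.
        if (sMatch.length() >= 2 && sMatch.startsWith("\"") && sMatch.endsWith("\"")) {
            sMatch = sMatch.substring(1, sMatch.length() - 1);
        }
        System.out.println(" match: '" + sMatch + "'");
    }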
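
    Another option, sketched here as an illustration rather than as part of
    the original reply, is to let the regex do the stripping with capturing
    groups: one group for the text between the quotes and one for a bare
    word. The class name QuoteStrippingTokenizer and the tokenize method
    are made up for this example, and the pattern is a variation on the
    one used earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not from the thread.
class QuoteStrippingTokenizer {
    // Group 1: the text inside a quoted phrase, without the quotes.
    // Group 2: a bare alphanumeric word such as AND, OR, or Joe.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|([A-Za-z0-9]+)");

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : tokenize(testText)) {
            System.out.println(" match: '" + token + "'");
        }
    }
}
```

    Note that after stripping, a quoted "AND" and a bare AND look the same,
    so code that needs to tell them apart should check which group matched
    before discarding that information.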
     
    shakah, Aug 15, 2005
    #6
  7. Roedy Green Guest

    On 15 Aug 2005 12:47:59 -0700, wrote or quoted :

    >As you guessed, this is for text searching. Also, No 3rd party
    >libraries. But be all core Java


    Here is how you could implement an el cheapo tokenizer.

    Use a regex to split on spaces, teaching it to ignore spaces inside
    quotes.

    Use a HashMap of defined words and keywords mapping to an enum that
    classifies them.

    Look up each word to see if it is magic, e.g. "and" or "or".

    You now have an array of tokens that identify their general class.
    That is a lot easier to parse, especially if you use a postfix
    notation.
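
    The steps above could be sketched roughly as follows. This is one
    possible reading of the description, not code from the original post:
    the class name CheapTokenizer, the Kind enum, the split pattern, and
    the decision to match keywords case-insensitively are all assumptions.
    The split pattern treats whitespace as a separator only when an even
    number of quote characters follows it, which is a common way of
    skipping spaces inside balanced quotes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical example class, not from the thread.
class CheapTokenizer {
    enum Kind { AND, OR, TERM }

    // The "magic" words and the class each one maps to.
    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("AND", Kind.AND);
        KEYWORDS.put("OR", Kind.OR);
    }

    // Whitespace followed by an even number of quotes, i.e. whitespace outside quotes.
    private static final String SPLIT = "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Returns each token rendered as KIND:text, with quotes removed from phrases.
    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        for (String raw : input.trim().split(SPLIT)) {
            if (raw.isEmpty()) {
                continue; // happens only for blank input
            }
            boolean quoted = raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"");
            String text = quoted ? raw.substring(1, raw.length() - 1) : raw;
            // Quoted text is always a search term; bare words are looked up
            // case-insensitively (an assumption) and default to TERM.
            Kind kind = quoted
                    ? Kind.TERM
                    : KEYWORDS.getOrDefault(raw.toUpperCase(Locale.ROOT), Kind.TERM);
            out.add(kind + ":" + text);
        }
        return out;
    }

    public static void main(String[] args) {
        String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : classify(testText)) {
            System.out.println(token);
        }
    }
}
```

    The lookahead rescans the rest of the string at every candidate split
    point, so this is fine for short search queries but would be slow on
    long inputs, and it assumes the quotes are balanced.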

    The other approach is to use a parser generator, which will be much
    easier than you imagine. See http://mindprod.com/jgloss/parser.html
     
    Roedy Green, Aug 15, 2005
    #7
  8. Guest

    Thanks, I was thinking that something in the regex could do it. I've
    already worked around it with some logic.
     
    , Aug 16, 2005
    #8
