Parsing a Boolean expression easy?

cbongior

String testText = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";

I would like to parse the above text into its components in an easy
and preferably native Java library fashion. I mean, I can implement a
custom parser, but it would be a little ugly.

Ultimately, I would like the following tokens:

1) christian bongiorno
2) AND
3) Joe
4) OR
5) Electrical, plumbing

With StringTokenizer I can correctly get the quoted phrases, but it
doesn't separate the non-quoted words. So

StringTokenizer tokens = new StringTokenizer("\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"", "\"");

produces

1) christian bongiorno
2) AND Joe OR
3) Electrical, plumbing

As you guessed, this is for text searching. Also, no 3rd-party
libraries; it has to be all core Java.

Ideas?
 
Oliver Wong

There's a difference between parsing and tokenizing. A lot of the time
when people say parsing, they mean tokenizing (which is why the string
tokenizer solves their problem). The problem you're describing is actual,
real parsing.

If you don't want to use 3rd-party tools, then you'll just have to write
a parser by hand. Look up "recursive descent parsing". You may also want to
try posting future questions on this project to comp.compilers to learn more
about parsing theory.
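
To give a rough idea of what a hand-written recursive descent parser looks like, here is a minimal sketch in core Java. Everything specific in it is an assumption made for illustration, not something stated in this thread: the class name BooleanQueryParser, the grammar (AND binds tighter than OR, no parentheses, no NOT), and the bracketed text it prints for the parse tree. Unterminated quotes are silently skipped by the token regex, which a real implementation would probably want to report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class. Assumed grammar:
//   expr   := term ( "OR" term )*
//   term   := factor ( "AND" factor )*
//   factor := quoted phrase | bare word
class BooleanQueryParser {
    // A token is either a complete quoted phrase or a run of non-space, non-quote characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|[^\\s\"]+");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private BooleanQueryParser(String input) {
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // quotes are kept so that "AND" (quoted) is not an operator
        }
    }

    // Parses the input and returns a textual form of the parse tree.
    public static String parse(String input) {
        BooleanQueryParser p = new BooleanQueryParser(input);
        String tree = p.expr();
        if (p.pos < p.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + p.tokens.get(p.pos));
        }
        return tree;
    }

    // Consumes the next token if it is exactly the given keyword.
    private boolean accept(String keyword) {
        if (pos < tokens.size() && tokens.get(pos).equals(keyword)) {
            pos++;
            return true;
        }
        return false;
    }

    private String expr() {
        String left = term();
        while (accept("OR")) {
            left = "OR(" + left + ", " + term() + ")";
        }
        return left;
    }

    private String term() {
        String left = factor();
        while (accept("AND")) {
            left = "AND(" + left + ", " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Expected an operand at end of input");
        }
        String t = tokens.get(pos);
        if (t.equals("AND") || t.equals("OR")) {
            throw new IllegalArgumentException("Expected an operand but found " + t);
        }
        pos++;
        // Strip the surrounding quotes from a phrase; bare words are used as-is.
        String text = t.startsWith("\"") ? t.substring(1, t.length() - 1) : t;
        return "[" + text + "]";
    }

    public static void main(String[] args) {
        // prints: OR(AND([christian bongiorno], [Joe]), [Electrical, plumbing])
        System.out.println(parse("\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""));
    }
}
```

Each grammar rule becomes one method, and precedence falls out of which method calls which. In a real search you would build node objects instead of strings, but the shape of the code stays the same.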

- Oliver
 
shakah

How about something with regular expressions, e.g.:

jc@soyuz:~/tmp$ cat bparse.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class bparse {
    public static void main(String[] asArgs) {
        // First argument is the regex; every later argument is an input string.
        Pattern p = Pattern.compile(asArgs[0]);
        System.out.println(" regex: '" + asArgs[0] + "'");
        for (int i = 1; i < asArgs.length; ++i) {
            String sExpr = asArgs[i];
            System.out.println("input str: '" + sExpr + "'");
            Matcher m = p.matcher(sExpr);
            // find() advances to each successive match in the input.
            while (m.find()) {
                System.out.println(
                    " match: '"
                    + sExpr.substring(m.start(), m.end()) + "'");
            }
        }
    }
}

jc@soyuz:~/tmp$ java bparse '("[^"]*"|AND|OR|[A-Za-z0-9]+)' "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\""
regex: '("[^"]*"|AND|OR|[A-Za-z0-9]+)'
input str: '"christian bongiorno" AND Joe OR "Electrical, plumbing"'
match: '"christian bongiorno"'
match: 'AND'
match: 'Joe'
match: 'OR'
match: '"Electrical, plumbing"'

?
 
cbongior

I was SURE a regular expression could do it, but my regexp skills SUCK!
As an aside, Linux interprets the command line differently than Windows.
Windows turned those command-line args into, like, 8 separate arguments.
So I adapted, but it works!

Thanks

Christian

http://christian.bongiorno.org/resume.pdf
 
cbongior

One question, though: in the results, is it possible to easily throw out
the " " around a quoted part?

so...
instead of
match: '"christian bongiorno"'

I get
match: 'christian bongiorno'
 
shakah

How about:

while (m.find()) {
    String sMatch = sExpr.substring(m.start(), m.end());
    // Drop the surrounding quotes from a quoted phrase.
    if (sMatch.startsWith("\"") && sMatch.endsWith("\"")) {
        sMatch = sMatch.substring(1, sMatch.length() - 1);
    }
    System.out.println(" match: '" + sMatch + "'");
}
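
Another option, if you would rather let the regex do the stripping, is to put capture groups inside the pattern and read the group instead of the whole match. This is just a sketch along the same lines as the pattern used earlier in the thread; the class name QuoteStrippingTokens and the tokens helper are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class showing capture groups used to drop the quotes.
class QuoteStrippingTokens {
    // Group 1 captures the text inside a pair of quotes; group 2 captures a bare word.
    private static final Pattern TOKEN =
        Pattern.compile("\"([^\"]*)\"|([A-Za-z0-9]+)");

    public static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (String token : tokens(input)) {
            System.out.println(" match: '" + token + "'");
        }
    }
}
```

One thing to watch: once the quotes are gone you can no longer tell a quoted "AND" from the operator AND, so if that matters, check m.group(1) != null before you throw the information away.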
 
Roedy Green

Here is how you could implement an el cheapo tokenizer.

Use a regex to split on spaces, teaching it to ignore spaces inside
quotes.

Use a HashMap of defined words and keywords mapping to an enum that
classifies them.

Look up each word to see if it is magic, e.g. AND or OR.

You now have an array of tokens that identify their general class.
That is a lot easier to parse, especially if you use a postfix
notation.
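
The steps above could be sketched roughly as follows. This is only one possible reading of them: the names CheapTokenizer, Kind and Token are invented for the example, the keyword lookup is case-sensitive by assumption, and the split regex relies on the quotes being balanced (it splits on spaces that are followed by an even number of quote characters).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class: split outside quotes, then classify via a HashMap.
class CheapTokenizer {
    enum Kind { AND, OR, TERM }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ":'" + text + "'";
        }
    }

    // The "magic" words and the class each one maps to.
    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("AND", Kind.AND);
        KEYWORDS.put("OR", Kind.OR);
    }

    // One or more spaces followed by an even number of quotes, i.e. spaces outside quotes.
    private static final String SPLIT = " +(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        for (String piece : input.trim().split(SPLIT)) {
            if (piece.isEmpty()) {
                continue; // happens only for blank input
            }
            if (piece.length() >= 2 && piece.startsWith("\"") && piece.endsWith("\"")) {
                // A quoted phrase is always a search term, even if it says AND or OR.
                out.add(new Token(Kind.TERM, piece.substring(1, piece.length() - 1)));
            } else {
                out.add(new Token(KEYWORDS.getOrDefault(piece, Kind.TERM), piece));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "\"christian bongiorno\" AND Joe OR \"Electrical, plumbing\"";
        for (Token t : tokenize(input)) {
            System.out.println(t);
        }
    }
}
```

Because quoted phrases are classified before the map lookup, a phrase like "AND" in quotes stays a plain search term.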

The other approach is to use a parser generator, which will be much
easier than you imagine. See http://mindprod.com/jgloss/parser.html
 
cbongior

Thanks, I was thinking that something in the regex could do it. I've
already logicked my way around it.
 
