Boolean search query parsing (hierarchical)

R

Robert Watkins

Has anyone ever run into code that will parse a Boolean query into a
tree structure? I'm working on a search application, one of the features
of which is that it should be able to store a set of queries, each of
which can refer to other queries, i.e.:

1. cat OR dog
2. apples AND oranges AND (NOT bananas)
3. fred NEAR barney
4. #1 OR #2
5. #4 AND #3

Users should be able to edit and delete queries, so that, using the
above example, if one were to delete query #3, query #5 would have to
change also (becoming "#4").

The given example is rather simple, but a set of queries can run upwards
of 300 individual queries, of any level of complexity. And when a query
is deleted, all queries that reference it must change to accommodate the
deletion, and -- here's the fun part -- these queries must be
reformatted so that the Boolean syntax is not broken. For example, if we
start with the query element:

n. bottles AND (#3 AND (NOT #2)) OR (plates AND (#2 OR #4))

and we delete #2, we must end up with:

n. bottles AND #3 OR (plates AND #4)

The example may be convoluted, but I think you can see what I mean.

I've been playing around with Jakarta's Lucene 1.2, which is intended as
a complete full-text search solution, but the QueryParserTokenManager
class allows me to fiddle with query elements quite nicely. However,
it's still at only one dimension, if you will, and I think (perhaps
spuriously) that query reformatting might be easier with some sort of
hierarchical structure. And while, yes, it would be possible to write
this myself, I'd rather see if someone smarter has done it already! I've
also been fiddling with Lucene 1.3-RC2, trying, for example to override
the toString() method of the BooleanQuery class, but it's become quite a
mess, and I don't feel like I've moved any closer to a solution.

Help! (and thank you)
-- Robert
 
K

Kai Grossjohann

I would use AntLR or JavaCUP or JavaCC or another parser generator to
write a parser for this stuff. Then you get abstract syntax trees
which are much easier to handle.

Hm.

Originally, I thought that it is easy to replace deleted queries with
a boolean constant, but I guess that both #1 OR #2 and #1 AND #2
should become #1 when you delete #2. So that approach won't work.

But still, you can just walk up the tree from the deleted node and
simlify as you go along, or write a recursive function to do it in the
top-down fashion.

For the recursive alternative, the idea would be to say that
simplify(a AND b) is the same as simplify(a) AND simplify(b), and so
on. You then construct some special cases: simplify(a AND <deleted>)
is simplify(a) and so on.

Kai
 
R

Robert Watkins

Danke, Herr Kai --

Using a parser generator is precisely what I was trying to avoid. More
precisely, I was trying to avoid having to write the grammar for Boolean
queries. That I haven't been able to find a grammar for Boolean queries on
the Net has surprised me. Perhaps it is too much like a Computer Science
course exercise! (And, unfortunately, I never studies CS.)

However, despite having made some progress using Lucene, it does seem as if
I will have to sit down and try to penetrate the documentation for JavaCC
(or similar). If this doesn't addle my brain, I will take your suggestions
about a simplify() method into accounnt.

Thank you,
-- Robert
 
E

EJP

The grammar for Boolean queries is about 8 lines long and can be found
in any compiler text book or as part of most language specifications.
 
R

Robert Watkins

EJP said:
The grammar for Boolean queries is about 8 lines long and can be found
in any compiler text book or as part of most language specifications.
Well, EJP, it was mighty kind of you to take the trouble to post those 8
lines... Oh wait, you didn't!

If I had a compiler textbook at hand, I probably wouldn't need to post to a
newsgroup, but guess what: I don't. So if you thought you were being
helpful, let me clarify something: you weren't. But if you were simply
trying to be mean and to petty: you hit it on the nose.

Hint: next time don't bother.
 
R

Robert Watkins

Chris said:
What is your application, exactly? If Lucene doesn't cut it, try
Dieselpoint.
http://dieselpoint.com

Thanks, Chris, but we already use Verity as our primary search engine. I
was just trying to use the parsing functionality built into Lucene to place
a Boolean query into some sort of data structure that would allow the query
to be editied while maintaining proper syntax.

I eventually bit the proverbial bullet and wrote a grammar using javacc. It
seems to work (i.e. I haven't been able to come up with a query that breaks
it) but as it's my first grammar, I'm not entirely confident that I've used
the right approach.

-- Robert
 
R

Robert Watkins

I would use AntLR or JavaCUP or JavaCC or another parser generator to
write a parser for this stuff. Then you get abstract syntax trees
which are much easier to handle.

Indeed, I did as you suggested and wrote a grammar using javacc. It took me
some time to figure things out, but I think I finally got it. If you want
to see the fruits of my labour, I've posted the logic part of the grammar
in a posting entitled "Boolean grammar (javacc)".

Thanks again,
-- Robert
 
E

EJP

Robert said:
Well, EJP, it was mighty kind of you to take the trouble to post those 8
lines... Oh wait, you didn't!

If I had a compiler textbook at hand, I probably wouldn't need to post to a
newsgroup, but guess what: I don't. So if you thought you were being
helpful, let me clarify something: you weren't. But if you were simply
trying to be mean and to petty: you hit it on the nose.

Hint: next time don't bother.

Not trying to be mean, but it's better to get them from a textbook than
from my aging memory. This is worth bothering about, ditto the original
question.
 
K

Kai Grossjohann

Robert Watkins said:
Danke, Herr Kai --

Using a parser generator is precisely what I was trying to avoid. More
precisely, I was trying to avoid having to write the grammar for Boolean
queries. That I haven't been able to find a grammar for Boolean queries on
the Net has surprised me. Perhaps it is too much like a Computer Science
course exercise! (And, unfortunately, I never studies CS.)

Let's see now... I'll use EBNF, I hope that can be translated fairly
straightforwardly to javacc...

Query ::= Disjunction
Disjunction ::= Conjunction
| Disjunction "OR" Conjunction
Conjunction ::= Negation
| Conjunction "AND" Negation
Negation ::= Term
| "NOT" Term
Term ::= Word
| "(" Disjunction ")"
Word ::= < sequence of characters >

Now what remains is to construct a tree from this. The idea behind
that is to make objects for each nonterminal of this grammar. So you
have a class Disjunction, a class Conjunction, and so on. They all
inherit from a general Query superclass. And the class Disjunction,
for example, would have two members left and right for the two
arguments of the "OR" operator.

Does this make sense at all?

Kai
 
R

Robert Watkins

(e-mail address removed) (Kai Grossjohann) wrote in
Robert Watkins said:
Danke, Herr Kai --

Using a parser generator is precisely what I was trying to avoid.
[ snipped ]

Let's see now... I'll use EBNF, I hope that can be translated fairly
straightforwardly to javacc...
[ snipped ]

Does this make sense at all?
Kai --

What you sent makes perfect sense ... now. Over the past weekend I worked
proctically non-stop (one meal a day and about 4-5 hours sleep at night)
and came up with a grammar and AST using javacc and jjtree. You can see
that grammar (at least the logic part) in a posting to this group
entitled "Boolean grammar (javacc)". I got a response by e-mail on
Wednesday to that posting in which another kind soul told me how I could
convert what I had come up with (which was particlaly LL(2)) into a fully
LL(1) grammar. I've now implemented those changes and it's working
wonderfully.

It may get more interesting, however, as the project on which I'm working
might demand that I allow for fielded queries. At the moment, there are
two terminals in my grammar: WORDs (words, wuoted phrases, etc.) and
QREFs (references to other queries). Fielded searches, however, cannot
accept QREFs but can be defined by a QREF. The problem (which I haven't
begun to think about in earnest) will be to define some sort of Query
(from your example) that can be recognised as part of a fielded search
but not conflict with the Query class that can include QREFs. Even if the
project doesn't require this, I may try it as an exercise.

Thanks again,
-- Robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top