Google-like query tokenizer

aaronfude

Hi,

Is there a Java utility that can tokenize a Google-like query?
Meaning that tokens are separated by spaces unless grouped with
parentheses. Can the StringTokenizer do this?

Very many thanks in advance!

Aaron Fude
 
Bart Cremers

This can be easily achieved using regular expressions:


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuerySplit {
    private static final String query =
            "test \"one two three\" more testing \"one two\" done";

    // A double-quoted phrase, or a run of non-whitespace characters.
    private static final String regex = "\"[^\"]*\"|\\S+";

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(query);

        while (matcher.find()) {
            String toSearch = matcher.group();
            // Strip the surrounding quotes from a phrase token. The length
            // check keeps a lone, unclosed '"' from breaking substring().
            if (toSearch.length() >= 2
                    && toSearch.startsWith("\"") && toSearch.endsWith("\"")) {
                toSearch = toSearch.substring(1, toSearch.length() - 1);
            }
            System.out.println(toSearch);
        }
    }
}


Regards,

Bart
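
The pattern above groups on double quotes, whereas the question asked about parentheses. A minimal sketch of the same regex idea adapted to parentheses might look like the following. The class name ParenQuerySplit and the tokenize method are hypothetical, and the sketch assumes groups are not nested and that an opening parenthesis starts a token (for example, follows a space).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; not part of the original post.
class ParenQuerySplit {
    // Either a parenthesised group with no inner parentheses (contents
    // captured in group 1), or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\(([^()]*)\\)|\\S+");

    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(query);
        while (matcher.find()) {
            // Group 1 is non-null only when the parenthesised alternative matched.
            String grouped = matcher.group(1);
            tokens.add(grouped != null ? grouped.trim() : matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("test (one two three) more (one two) done")) {
            System.out.println(token);
        }
    }
}
```

An unbalanced parenthesis simply falls through to the non-whitespace alternative, so it stays attached to the neighbouring word rather than raising an error.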
 
bugbear

If you also wish to handle stuff like '|' (OR),
quotes (for phrases), and the '+' and '-'
stuff, you'll need quite a "complete" little
parser implementation.

BugBear
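
As a rough illustration of the first step of such a parser, here is a sketch of a lexer that classifies tokens as terms, quoted phrases, or the '|' operator, and records a leading '+' or '-'. The names QueryLexer, Token, Kind, and lex are hypothetical, and the token rules are assumptions rather than Google's actual syntax. It only tokenizes; deciding precedence and grouping would still need a parser on top.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; the token rules are assumptions.
class QueryLexer {
    enum Kind { TERM, PHRASE, OR }

    static final class Token {
        final Kind kind;
        final String text;
        final String modifier; // "+", "-", or "" when absent

        Token(Kind kind, String text, String modifier) {
            this.kind = kind;
            this.text = text;
            this.modifier = modifier;
        }

        @Override
        public String toString() {
            return kind + " " + modifier + "[" + text + "]";
        }
    }

    // Group 1: a bare '|'. Otherwise group 2 is an optional '+'/'-' prefix,
    // followed by a quoted phrase (group 3) or a bare word (group 4).
    private static final Pattern TOKEN =
            Pattern.compile("(\\|)|([+-]?)(?:\"([^\"]*)\"|([^\\s\"|]+))");

    static List<Token> lex(String query) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(new Token(Kind.OR, "|", ""));
            } else if (m.group(3) != null) {
                tokens.add(new Token(Kind.PHRASE, m.group(3), m.group(2)));
            } else {
                tokens.add(new Token(Kind.TERM, m.group(4), m.group(2)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : lex("+java -\"string tokenizer\" lexer | parser")) {
            System.out.println(token);
        }
    }
}
```

An unclosed quote character matches none of the alternatives and is skipped in this sketch.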
 
Oliver Wong

I believe that StringTokenizer on its own can't do it, though you could
use StringTokenizer as part of an implementation of a state machine to
achieve what you want.

Can the parentheses be nested? E.g. is this legal: "a ( b c ( d e ) f
g ) h i"?

- Oliver
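
One way the state-machine idea could look, as a sketch only: StringTokenizer is asked to return its delimiters, and a depth counter serves as the state, so nested parentheses stay inside the outermost group. The class name NestedGroupTokenizer and the tokenize method are hypothetical. The sketch assumes only spaces separate tokens, that a stray ')' is ignored, and that an unclosed group is emitted as-is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical class name; a sketch under the assumptions stated above.
class NestedGroupTokenizer {
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        StringBuilder group = new StringBuilder();
        int depth = 0; // state: number of currently open parentheses

        // returnDelims = true, so each space and parenthesis comes back
        // as its own one-character piece.
        StringTokenizer pieces = new StringTokenizer(query, " ()", true);
        while (pieces.hasMoreTokens()) {
            String piece = pieces.nextToken();
            if (piece.equals("(")) {
                if (depth > 0) {
                    group.append(piece); // inner '(' belongs to the outer group
                }
                depth++;
            } else if (piece.equals(")")) {
                if (depth == 0) {
                    continue; // stray ')' outside any group: ignored here
                }
                depth--;
                if (depth == 0) {
                    tokens.add(group.toString().trim()); // outermost group closed
                    group.setLength(0);
                } else {
                    group.append(piece);
                }
            } else if (depth > 0) {
                group.append(piece); // words and spaces inside a group
            } else if (!piece.equals(" ")) {
                tokens.add(piece); // a bare word outside any group
            }
        }
        if (depth > 0) {
            tokens.add(group.toString().trim()); // unclosed group: emit what was collected
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [a, b c ( d e ) f g, h, i]
        System.out.println(tokenize("a ( b c ( d e ) f g ) h i"));
    }
}
```

The inner group is kept verbatim inside the outer token, so a caller could run tokenize again on that token to descend one level.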
 
Alan Meyer

There are a number of lex/yacc-like implementations in
Java. Google for "lex yacc java" to find them.

Lex and yacc are ancient UNIX compiler construction tools.

Lex generates a lexical analyzer that breaks input into tokens.
Yacc is a parser generator: it builds a parser from production
rules defining a syntax.

If you've never used them or heard of them, you'll find
they have a significant learning curve. But once mastered,
they allow you to build very complicated parsers for all
kinds of different syntactic rules using very little code.

Alan
 
Martin Gregorie

I'd suggest you look at Coco/R, which you can find at:

http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

Coco/R is available for several languages including Java. I've used the
Java version to develop a parser for POSIX C code generation
conditionals (#if and friends for the C speakers) and found it worked
well and is somewhat easier to use than lex and yacc.

Its biggest benefit is that it's a single code generator that generates
both the tokeniser and the parser. The documentation, supplied in PDF
format, is pretty good too.

Another benefit is that the code skeletons for its generated tokeniser
and parser classes can be easily modified if you're reasonably
competent. The standard generated code assumes input is from a file, but
I needed to be able to process a string. Making that change was trivial.
 
Andrew Lampert

An alternative to Coco/R, which has already been suggested, is JavaCC - see
http://javacc.dev.java.net. It is well implemented and supported, with a
large community of users. I've used it in the distant past (about 5
years ago) and it suited my needs perfectly for building a reasonably
complex parser.

Cheers,
Andrew
 