Scanner annoyances

markspace · Jul 19, 2009

I was looking at using the Scanner class for some string parsing. I'd
like to be able to pull out tokens (with a matching regex) from a
string. The long term goal is to make a little recursive decent parser,
so just greping the token out of the string globally won't do. I need
to check syntax.

The way Scanner class uses its delimiter seems spastic, however. For
example, to parse a string like "abc+def", hasMatch( "\\w+" ) won't
detect the token "abc" at the start of the string, unless the delimiter
includes a literal "+". The string "abc" must be delimited, I can't
just use the default whitespace delimiter.

This seems just plain wrong to me. Is there something I'm missing?
Right now I'm forced to set the delimiter to the inverse set of
characters that I want each time I use hasNext(Pattern). It's really
awkward. Perhaps someone can point out a better way to use Scanner, or
a better way of doing this.

Here's a little test program which works correctly. Switch the
commented statements around as indicated to see the "broken" behavior.

package fubar;
import java.util.Scanner;

public class ScannerTest {

public static void main( String[] args )
{
final String WORD = "[a-z]+";
String[] testVectors = {
"22.22",
"11.11mno",
"asd33.333xyz",
};
String[] testDelimiters = {
"\\z",
"\\b",
"\\s*",
"\\s+",
};

for( String s : testVectors ) {
for( String d : testDelimiters )
{
Scanner scan = new Scanner( s );
// Uncomment the next two lines for broken behavior
// System.out.print( "Scan: " + s + " with delim: "+ d+" " );
// scan.useDelimiter( d );
// then remove the two uses of "useDelimiter" inside the
// for loop
for( int max = 0; scan.hasNext() && max++ < 20; )
{
// comment or remove next line for broken behavior
scan.useDelimiter( "[^a-z]" );
if( scan.hasNext( WORD ) )
{
System.out.print( scan.next( WORD )+" " );
}
// comment or remove next line for broken behavior
scan.useDelimiter( "[^\\.0-9\\-]");
if( scan.hasNextBigDecimal() )
{
System.out.print( scan.nextBigDecimal() + " " );
}
}
System.out.println();
}
}
}

}

John B. Matthews · Jul 19, 2009

markspace said:
I was looking at using the Scanner class for some string parsing.
I'd like to be able to pull out tokens (with a matching regex) from a
string. The long term goal is to make a little recursive decent
parser, so just greping the token out of the string globally won't
do. I need to check syntax.

The way Scanner class uses its delimiter seems spastic, however. For
example, to parse a string like "abc+def", hasMatch( "\\w+" ) won't
detect the token "abc" at the start of the string, unless the
delimiter includes a literal "+". The string "abc" must be
delimited, I can't just use the default whitespace delimiter.

I had better luck using StreamTokenizer [1] to implement a recursive
descent parser [2]. By default, the scanner [3] in StreamTokenizer
recognizes numbers as double values, but resetSyntax() can be used to
treat digits as ordinary characters. It should possible to convert a
numeric string to BigDecimal.

Alternatively, it might be worthwhile to write a custom BDTokenizer that
recognizes BigDecimal tokens.

[...]

[1]<http://java.sun.com/javase/6/docs/api/java/io/StreamTokenizer.html>
[2]<http://sites.google.com/site/drjohnbmatthews/enumerated-functions>
[3]<http://en.wikipedia.org/wiki/Lexical_analysis>

markspace · Jul 19, 2009

John said:
It should possible to convert a
numeric string to BigDecimal.

The Scanner example I posted recognizes BigDecimal just fine. I just
have to set the delimiter each time before I call "nextBigDecimal".

What I was hoping for was to discover some way where I didn't have to
set a delimiter each time I look for a different token.

Unless I'm missing the point of your post completely (beside the "roll
your own bit")....

John B. Matthews · Jul 19, 2009

markspace said:
The Scanner example I posted recognizes BigDecimal just fine. I just
have to set the delimiter each time before I call "nextBigDecimal".

What I was hoping for was to discover some way where I didn't have to
set a delimiter each time I look for a different token.

Ah, Scanner's default delimiters are defined by Character.isWhitespace()
and your testDelimiters have whitespace defined in Pattern, but your
testVectors appear to have none of those symbols between WORDS and
BigDecimal numbers. I don't see how you can avoid the delimiters you're
using.

Unless I'm missing the point of your post completely (beside the "roll
your own bit")....

I had trouble with Scanner and found StreamTokenizer more familiar. I
see now that Scanner recognizes a rich variety of tokens, so it may be a
better choice. I'm guessing a Scanner might be used in extending
StreamTokenizer to recognize BigDecimal, perhaps as TT_BIGDEC.

Would it be possible to see your grammar?

markspace · Jul 19, 2009

John said:
Would it be possible to see your grammar?

It's just a project I'm working on for my own edification. The grammar
is ordinary algebraic expressions: + - * / % along with parenthesis for
grouping, unary operators + and -, words for identifiers and BigDecimals
as literals.

I've got a grammar worked out, though I was going to see if I could find
a standard one for algebra just to compare against. I wanted to try to
implement it first, which is turning into a much larger PITA than I was
hoping.

Hopefully this wont be too embarrassing:

Expression := GroupBinaryExpression | OpExpression | UnaryExpression

GroupBinaryExpression := GroupExpression | GroupExpression Binary2H

GroupExpression := '(' Expression ')'

UnaryExpression := ('+'|'-') ( GroupExpression | OpExpression )

OpExpression := TerminalExpression | TerminalExpression Binary2H

Binary2H := binOp (TerminalExpression | UnaryExpression |
GroupExpression) Binary2H*

TerminalExpression := Ident | BigDecimal

binOp := '+' | '-' | '*' | '/' | '%'

Ident := re:[a-zA-Z_]\w*

BigDecimal := java.math.BigDecimal

Tom Anderson · Jul 19, 2009

It's just a project I'm working on for my own edification. The grammar
is ordinary algebraic expressions: + - * / % along with parenthesis for
grouping, unary operators + and -, words for identifiers and BigDecimals
as literals.

Funnily enough, i'm doing exactly the same thing at the moment. I had a
look at StringTokenizer and StreamTokenizer, but not Scanner because i've
never used it before, and decided to write my own lexer. Here it is:

http://urchin.earth.li/~twic/Code/MCalc/TokenString.java

It's pretty simple. The methods and constants from Operator, NameToken,
etc should all be pretty self-explanatory.

You use it like this:

String s = "(1 / frequency) + 4 * 4 / log(amplitude * 2)";
for (Token tok: new TokenString(s)) {
// etc
}

It currently only does positive integer literals, but handling negatives
and floats would just be a matter of changing the block starting "if
(Character.isDigit(ch)) {".

I've got a grammar worked out, though I was going to see if I could find a
standard one for algebra just to compare against. I wanted to try to
implement it first, which is turning into a much larger PITA than I was
hoping.

Hopefully this wont be too embarrassing:

Expression := GroupBinaryExpression | OpExpression | UnaryExpression
GroupBinaryExpression := GroupExpression | GroupExpression Binary2H
GroupExpression := '(' Expression ')'
UnaryExpression := ('+'|'-') ( GroupExpression | OpExpression )
OpExpression := TerminalExpression | TerminalExpression Binary2H
Binary2H := binOp (TerminalExpression | UnaryExpression | GroupExpression) Binary2H*
TerminalExpression := Ident | BigDecimal
binOp := '+' | '-' | '*' | '/' | '%'
Ident := re:[a-zA-Z_]\w*
BigDecimal := java.math.BigDecimal

How are you planning on doing operator precedence, if at all? The standard
approaches are either to write it into the grammar, which you aren't
doing, or use this:

http://en.wikipedia.org/wiki/Shunting_yard_algorithm

Which is pretty hard to apply to an AST. There are tree-based approaches
to precedence, i believe, but they sound complicated to me.

tom

John B. Matthews · Jul 19, 2009

markspace said:
John said:

Would it be possible to see your grammar?

Click to expand...

It's just a project I'm working on for my own edification. The
grammar is ordinary algebraic expressions: + - * / % along with
parenthesis for grouping, unary operators + and -, words for
identifiers and BigDecimals as literals.

I've got a grammar worked out, though I was going to see if I could
find a standard one for algebra just to compare against. I wanted to
try to implement it first, which is turning into a much larger PITA
than I was hoping.

Hopefully this wont be too embarrassing:

Expression := GroupBinaryExpression | OpExpression | UnaryExpression

GroupBinaryExpression := GroupExpression | GroupExpression Binary2H

GroupExpression := '(' Expression ')'

UnaryExpression := ('+'|'-') ( GroupExpression | OpExpression )

OpExpression := TerminalExpression | TerminalExpression Binary2H

Binary2H := binOp (TerminalExpression | UnaryExpression |
GroupExpression) Binary2H*

TerminalExpression := Ident | BigDecimal

binOp := '+' | '-' | '*' | '/' | '%'

Ident := re:[a-zA-Z_]\w*

BigDecimal := java.math.BigDecimal

If I may make two observations: First, all the non-unary operators
appear to have the same precedence. It's common to distinguish among
unary, multiplicative, and additive operators [1], each having
progressively lower precedence [2]:

expression := ['+'|'-'] term {('+'|'-') term} end

term := factor {('*'|'/'|'%') factor} end

factor := ident | number | '(' expression ')' end

ident := re:[a-zA-Z_]\w* end

number := defined in Scanner end

Second, this grammar does not appear to require whitespace between
identifiers and numbers, as one never follows the other without an
intervening operator.

One plan of attack would be to extend StreamTokenizer to recognize
BigDecimal, perhaps using Scanner. Then, adapt my example [3] to return
BigDecimal, rather than double. Finally, add the % operation to term().

Or you could just exec() bc [4]. Or you could bind to mpfr/gmp [5],
for a real PITA.

[1]<http://en.wikipedia.org/wiki/Recursive_descent_parser>
[2]<http://java.sun.com/docs/books/tutorial/java/nutsandbolts/operators.html>
[3]<http://sites.google.com/site/drjohnbmatthews/enumerated-functions>
[4]<http://www.gnu.org/software/bc/>
[5]<http://www.mpfr.org/> <http://gmplib.org/>

markspace · Jul 19, 2009

Tom said:
How are you planning on doing operator precedence, if at all? The
standard approaches are either to write it into the grammar, which you
aren't doing, or use this:

http://en.wikipedia.org/wiki/Shunting_yard_algorithm

Which is pretty hard to apply to an AST. There are tree-based approaches
to precedence, i believe, but they sound complicated to me.

I was planning on using something like your Shunting yard algorithm, or
some other form of precedence parser. I hadn't worked that bit out yet.

Shunting yard is interesting because I could store the result of that
for faster (?) evaluation later. Hmmm.

I hadn't thought about writing precedence into the grammar. What I've
got now is just a syntax parser. I could I suppose, but it seems
simpler to break it into two steps.

markspace · Jul 19, 2009

John said:
If I may make two observations: First, all the non-unary operators
appear to have the same precedence. It's common to distinguish among
unary, multiplicative, and additive operators [1], each having
progressively lower precedence [2]:

Thanks for pointing that out. I hadn't thought about baking precedence
into the parser. I seemed easier to worry about it later after I'd got
the syntax parsed. I'll think about switching that.

Second, this grammar does not appear to require whitespace between
identifiers and numbers, as one never follows the other without an
intervening operator.

True. The only real sticking point right now is the pattern that
Scanner uses for BigDecimal. That pattern includes floating point too,
for some reason, which means it tends to look for things like 3.333e+10,
which I don't want. That's a little monkey wrench I'll have to think about.

One plan of attack would be to extend StreamTokenizer to recognize
BigDecimal, perhaps using Scanner. Then, adapt my example [3] to return
BigDecimal, rather than double. Finally, add the % operation to term().

Yeah that's an idea. I might give this some more thought. Thanks for
those links too, they're useful.

John B. Matthews · Jul 19, 2009

markspace said:
I was planning on using something like your Shunting yard algorithm, or
some other form of precedence parser. I hadn't worked that bit out yet.

Shunting yard is interesting because I could store the result of that
for faster (?) evaluation later. Hmmm.

I hadn't thought about writing precedence into the grammar. What I've
got now is just a syntax parser. I could I suppose, but it seems
simpler to break it into two steps.

It's a trade-off. An article [1] cited in the link Tom referenced
above explains in more detail: "The classic solution ... create a
new nonterminal for each level of precedence ... it has a few
drawbacks: 1) The size of the code is proportional to the number of
precedence levels. 2) The speed of the algorithm is proportional to the
number of precedence levels. 3) The number of precedence levels is
built in."

For grammars with many levels of precedence, the author proposes an
alternative called "Precedence Climbing".

[1]<http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm>

Tom Anderson · Jul 19, 2009

I was planning on using something like your Shunting yard algorithm, or some
other form of precedence parser. I hadn't worked that bit out yet.

Shunting yard is interesting because I could store the result of that
for faster (?) evaluation later. Hmmm.

I concur with your "?". Are you thinking of storing the reverse-Polish
form? I'm not even materialising that, but turning it on the fly into an
AST - i can evaluate the AST directly. All the other approaches produce
ASTs, so they have the same end result.

I hadn't thought about writing precedence into the grammar. What I've
got now is just a syntax parser. I could I suppose, but it seems
simpler to break it into two steps.

It seems that way. But if you're building a tree ignoring precedence, then
i think the job of turning into a precedence-sound tree is somewhat
tricky. If you're just trying to parse arithmetic, rather than a more
complex language, then a recursive descent parser may be a chainsaw when
what you need is an electric drill (the electric drill being Shunting
yard or precedence climbing).

Actually, i'd suggest you use precedence climbing rather than shunting
yard. We can compare results when we're done!

tom

John B. Matthews · Jul 20, 2009

markspace said:
John B. Matthews wrote: [...]
The only real sticking point right now is the pattern that Scanner
uses for BigDecimal. That pattern includes floating point too, for
some reason, which means it tends to look for things like 3.333e+10,
which I don't want. That's a little monkey wrench I'll have to think
about.

Ironically, when I was looking for examples, more than a few were
devoted to adding that very feature to StreamTokenizer, e.g.

<http://www.resplendent.com/StlFileParser.java>

[...]

Help in hangman game	1	Jul 24, 2023
My Status, Ciphertext	2	Nov 28, 2023
Scanner Bug?	10	Dec 30, 2009
Need Help: Program to Accept 2 Matrices and Show their Sum	0	Aug 21, 2022
Can't solve problems! please Help	0	Sep 26, 2022
School Project	1	Dec 8, 2022
Scanner input problem	2	May 25, 2008
What is the most astounding C++ syntax construct?	0	Dec 22, 2022

Scanner annoyances

markspace

John B. Matthews

markspace

John B. Matthews

markspace

Tom Anderson

John B. Matthews

markspace

markspace

John B. Matthews

Tom Anderson

John B. Matthews

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads