markspace
I was looking at using the Scanner class for some string parsing. I'd
like to be able to pull out tokens (with a matching regex) from a
string. The long-term goal is to make a little recursive descent parser,
so just grepping the token out of the string globally won't do. I need
to check syntax.
The way the Scanner class uses its delimiter seems erratic, however. For
example, to parse a string like "abc+def", hasNext( "\\w+" ) won't
detect the token "abc" at the start of the string unless the delimiter
includes a literal "+". The string "abc" must be delimited; I can't
just use the default whitespace delimiter.
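A minimal sketch of the behavior described above, for anyone who wants to reproduce it. The class name DelimiterDemo and the helper wordAtStart are made up for illustration; only Scanner, useDelimiter and hasNext(String) come from java.util.Scanner. As far as I can tell, hasNext(String) tests the whole next token, so with the default delimiter the token is all of "abc+def".

```java
import java.util.Scanner;

public class DelimiterDemo {
    // Hypothetical helper: reports whether the scanner's next complete
    // token matches \w+. A null delimiter keeps the default (whitespace).
    static boolean wordAtStart(String input, String delimiter) {
        Scanner scan = new Scanner(input);
        if (delimiter != null) {
            scan.useDelimiter(delimiter);
        }
        return scan.hasNext("\\w+");
    }

    public static void main(String[] args) {
        // Default delimiter: the next token is "abc+def", which is not all word characters.
        System.out.println(wordAtStart("abc+def", null));   // prints false
        // Delimiter is a literal "+": the next token is "abc".
        System.out.println(wordAtStart("abc+def", "\\+"));  // prints true
    }
}
```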
This seems just plain wrong to me. Is there something I'm missing?
Right now I'm forced to set the delimiter to the inverse set of
characters that I want each time I use hasNext(Pattern). It's really
awkward. Perhaps someone can point out a better way to use Scanner, or
a better way of doing this.
Here's a little test program which works correctly. Switch the
commented statements around as indicated to see the "broken" behavior.
package fubar;

import java.util.Scanner;

public class ScannerTest {
    public static void main( String[] args )
    {
        final String WORD = "[a-z]+";
        String[] testVectors = {
            "22.22",
            "11.11mno",
            "asd33.333xyz",
        };
        String[] testDelimiters = {
            "\\z",
            "\\b",
            "\\s*",
            "\\s+",
        };
        for( String s : testVectors ) {
            for( String d : testDelimiters )
            {
                Scanner scan = new Scanner( s );
                // Uncomment the next two lines for broken behavior
                // System.out.print( "Scan: " + s + " with delim: "+ d+" " );
                // scan.useDelimiter( d );
                // then remove the two uses of "useDelimiter" inside the
                // for loop
                for( int max = 0; scan.hasNext() && max++ < 20; )
                {
                    // comment or remove next line for broken behavior
                    scan.useDelimiter( "[^a-z]" );
                    if( scan.hasNext( WORD ) )
                    {
                        System.out.print( scan.next( WORD )+" " );
                    }
                    // comment or remove next line for broken behavior
                    scan.useDelimiter( "[^\\.0-9\\-]");
                    if( scan.hasNextBigDecimal() )
                    {
                        System.out.print( scan.nextBigDecimal() + " " );
                    }
                }
                System.out.println();
            }
        }
    }
}
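For comparison, here is one possible way to pull tokens off the front of a string without any delimiter, using java.util.regex directly instead of Scanner. This is only a sketch under assumptions, not a claim about how Scanner is meant to be used: the class name LookingAtTokenizer, the tokenize method and the NUMBER pattern are all made up for illustration. Matcher.region and Matcher.lookingAt are the real API calls; lookingAt anchors the match at the start of the region, which is the "is there a token right here?" check a recursive descent parser needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookingAtTokenizer {
    // Hypothetical token patterns, loosely mirroring the word and
    // decimal cases in the test program above.
    static final Pattern WORD = Pattern.compile("[a-z]+");
    static final Pattern NUMBER = Pattern.compile("-?[0-9]+(\\.[0-9]+)?");

    // Splits the input into consecutive WORD/NUMBER tokens, failing on
    // the first position where neither pattern matches.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher[] matchers = { WORD.matcher(input), NUMBER.matcher(input) };
        int pos = 0;
        while (pos < input.length()) {
            Matcher hit = null;
            for (Matcher m : matchers) {
                m.region(pos, input.length());
                // lookingAt() must match starting exactly at pos.
                if (m.lookingAt()) {
                    hit = m;
                    break;
                }
            }
            if (hit == null) {
                throw new IllegalArgumentException(
                    "Unexpected character at index " + pos);
            }
            tokens.add(hit.group());
            pos = hit.end(); // both patterns consume at least one character
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("asd33.333xyz")); // prints [asd, 33.333, xyz]
        System.out.println(tokenize("11.11mno"));     // prints [11.11, mno]
    }
}
```

A string like "abc+def" makes this sketch throw at index 3, since "+" matches neither pattern; a real parser would add a pattern for operators.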