Scanner Bug?

markspace

Hi all,

The following code demonstrates a bug in the java.util.Scanner class, I
think. It creates a large file, then attempts to read in the same file
with a Scanner using a delimiter of "\z".

This doesn't work. Only a part of the file is read (the first 1024
bytes). The result is that the comparison operation fails. I can
manually inspect the file created and it does have the correct number of
strings -- 16384. This code finds the last string at number 284.

Am I doing something wrong? Or should I report this to Sun?


<output>
run:
fileContents.length()=1024
Exception in thread "main" java.lang.RuntimeException: Result was: 28,
expected 283
at scannerbug.ScannerBug.testContents(ScannerBug.java:55)
at scannerbug.ScannerBug.main(ScannerBug.java:30)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
</output>
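
For what it's worth, the numbers in that output are consistent with a cutoff at exactly 1024 characters. The sketch below is not part of the original program; the class name TruncationPoint and the method firstChars are made up for illustration. It rebuilds the same content in memory and looks at where the 1024th character falls.

```java
public class TruncationPoint
{
    private static final int TEST_FILE_SIZE = 16 * 1024;

    // Builds "0\n1\n2\n..." in memory, the same content makeTestSource
    // writes to disk, and returns only the first 'limit' characters.
    static String firstChars( int limit )
    {
        StringBuilder sb = new StringBuilder();
        for( int i = 0; i < TEST_FILE_SIZE; i++ )
        {
            sb.append( i ).append( '\n' );
        }
        return sb.substring( 0, limit );
    }

    public static void main( String[] args )
    {
        String prefix = firstChars( 1024 );
        // Print the last ten characters with newlines made visible.
        System.out.println( prefix.substring( prefix.length() - 10 )
                .replace( "\n", "\\n" ) );
    }
}
```

The truncated text ends with 282, a newline, and then just the first two digits of 283, which is why nextInt() hands back 28 where 283 was expected.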


<sscce>

package scannerbug;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;


public class ScannerBug
{

private static final String FILE = "test.txt";
private static final int TEST_FILE_SIZE = 16 * 1024;

public static void main( String[] args )
throws Exception
{
makeTestSource( FILE );
String fileContents = new Scanner( new File( FILE ) ).useDelimiter(
"\\z" ).next();
System.err.println( "fileContents.length()=" +
fileContents.length() );
testContents( fileContents );
}

private static void makeTestSource( String string )
throws IOException
{
BufferedWriter bw = new BufferedWriter( new FileWriter( string ) );
for( int i = 0; i < TEST_FILE_SIZE; i++ )
{
bw.write( Integer.toString( i ) );
bw.write( '\n' );
}
bw.close();
}

private static void testContents( String string )
{
Scanner scanner = new Scanner( string );
for( int i = 0; i < TEST_FILE_SIZE; i++ )
{
if( scanner.hasNextInt() )
{
int result = scanner.nextInt();
if( i != result )
{
throw new RuntimeException( "Result was: " + result
+ ", expected " + i );
}
}else
{
throw new RuntimeException(
"Ran out of ints in string at pos "
+ i );
}
}
}
}
</sscce>
 
markspace

Peter said:
The only thing that would prevent it from doing that is an exception
during reading. Why an exception would occur, I have no idea. But have
you checked the return value from Scanner.ioException() when you detect
the invalid input, to see if there was in fact one?


I think this is a good idea. Unfortunately, ioException() returns null
and the input is still only 1024 bytes long. Oh well.

<snippet>
public static void main( String[] args )
throws Exception
{
makeTestSource( FILE );
Scanner scanner = new Scanner( new File( FILE ) ).useDelimiter(
"\\z" );
String fileContents = scanner.next();
System.err.println( scanner.ioException() );

System.err.println( "fileContents.length()=" +
fileContents.length() );
testContents( fileContents );
}
</snippet>

<output>
run:
null
fileContents.length()=1024
Exception in thread "main" java.lang.RuntimeException: Result was: 28,
expected 283
at scannerbug.ScannerBug.testContents(ScannerBug.java:52)
at scannerbug.ScannerBug.main(ScannerBug.java:27)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
</output>
 
Arne Vajhøj

markspace said:
The following code demonstrates a bug in the java.util.Scanner class, I
think. It creates a large file, then attempts to read in the same file
with a Scanner using a delimiter of "\z".

This doesn't work. Only a part of the file is read (the first 1024
bytes). The result is that the comparison operation fails. I can
manually inspect the file created and it does have the correct number of
strings -- 16384. This code finds the last string at number 284.

Am I doing something wrong? Or should I report this to Sun?

A quick glance at the source shows a default buffer size of 1024.

For some reason that buffer does not get extended.

Arne
 
John B. Matthews

markspace said:
The following code demonstrates a bug in the java.util.Scanner class,
I think. It creates a large file, then attempts to read in the same
file with a Scanner using a delimiter of "\z".

This doesn't work. Only a part of the file is read (the first 1024
bytes). The result is that the comparison operation fails. I can
manually inspect the file created and it does have the correct number
of strings -- 16384. This code finds the last string at number 284.

As Pete suggested, I get normal results using other delimiters, e.g. \Z,
\e or \00. I see that \Z matches "The end of the input but for the final
terminator, if any." In contrast, \z matches "The end of the input."
I wondered if 1024 might be the length of an underlying buffer, but
wrapping the file in a large BufferedInputStream didn't change anything.
The Scanner's ioException() method reports null.

markspace said:
Am I doing something wrong?

I don't think so.

markspace said:
Or should I report this to Sun?

I had fun reporting this NetBeans bug:

<http://sites.google.com/site/trashgod/profile>
 
markspace

John said:
As Pete suggested, I get normal results using other delimiters, e.g. \Z,
\e or \00. I see that \Z matches "The end of the input but for the final
terminator, if any." In contrast, \z matches "The end of the input."


\Z works correctly for me also, but it does discard the final newline,
which is undesired. (I have a larger program with different tests, which
fails testing due to the missing newline, even though the SSCCE I posted
succeeds). \00 works ("\\00") but seems dangerous, if a NUL happens to
appear in the input stream. Same for \e.

I think that \z should work the same as \Z, except for the final
newline. That it doesn't seems to be a bug.
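
To make the \Z versus \z difference concrete, here is a small sketch using java.util.regex directly rather than Scanner, so the 1024-character problem doesn't come into it. The class name EndAnchors and the helper firstMatchIndex are just names picked for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndAnchors
{
    // Returns the index at which 'regex' first matches in 'input',
    // or -1 if it never matches.
    static int firstMatchIndex( String regex, String input )
    {
        Matcher m = Pattern.compile( regex ).matcher( input );
        return m.find() ? m.start() : -1;
    }

    public static void main( String[] args )
    {
        String input = "1\n2\n"; // four chars, ends with a line terminator
        System.out.println( "\\Z first matches at index "
                + firstMatchIndex( "\\Z", input ) );
        System.out.println( "\\z first matches at index "
                + firstMatchIndex( "\\z", input ) );
    }
}
```

On this input \Z first matches at index 3, just before the trailing newline, while \z only matches at index 4, the true end. A token delimited by \Z therefore stops short of the final newline, which is the discarded newline mentioned above.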


Incidentally, my current solution is below. ;)


import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;

/**
*
* @author Brenden
*/
public final class UnboundedCharSeq
implements CharSequence
{

private static final int BUF_SIZE = 1024;
final ArrayList<char[]> buffers;
final int startOffset;
final int endOffset;

int hashcode;

public UnboundedCharSeq( File f )
throws FileNotFoundException, IOException
{
this( new FileInputStream( f ) );
}

public UnboundedCharSeq( InputStream ins )
throws IOException
{
// Decodes using the platform default charset.
InputStreamReader inr = new InputStreamReader( ins );
// Counts decoded chars (not bytes) read so far.
int totalChars = 0;
ArrayList<char[]> buf = new ArrayList<char[]>();

try
{
char[] charBuf = new char[BUF_SIZE];
buf.add( charBuf );
int retVal;
int pos = 0;
int len = charBuf.length;

// Fill the current chunk; start a new one each time it is full.
while( (retVal = inr.read( charBuf, pos, len )) >= 0 )
{
totalChars += retVal;
pos += retVal;
len -= retVal;
if( len == 0 )
{
charBuf = new char[BUF_SIZE];
buf.add( charBuf );
pos = 0;
len = charBuf.length;
}
}
}finally
{
inr.close();
}

buffers = buf;
startOffset = 0;
endOffset = totalChars;
}

private UnboundedCharSeq( ArrayList<char[]> buf, int start, int end )
{
buffers = buf;
startOffset = start;
endOffset = end;
}

public int length()
{
return endOffset - startOffset;
}

public char charAt( int index )
{
checkIndex( index );
index += startOffset;
return buffers.get( index / BUF_SIZE )[index % BUF_SIZE];
}

@Override
public CharSequence subSequence( int start, int end )
{
checkIndex( start );
if( end < start || end > length() ) {
throw new IndexOutOfBoundsException( "end index: " + end
+ " must be ["+start+".." + length() + "]" );
}
start += startOffset;
end += startOffset;
return new UnboundedCharSeq( buffers, start, end );
}

private void checkIndex( int index )
{
if( index >= length() || index < 0 )
{
throw new IndexOutOfBoundsException( "index: " + index
+ " must be [0.." + length() + ")" );
}
}

@Override
public boolean equals( Object obj )
{
if( !(obj instanceof UnboundedCharSeq ) ) {
return false;
}
return contentEquals( (CharSequence) obj );
}

@Override
public int hashCode()
{
if( hashcode == 0 ) {
hashcode = length() * 31 + 17;
for( int i = 0; i < length(); i++ ) {
hashcode = hashcode * 37 + charAt( i );
}
}
return hashcode;
}

@Override
public String toString()
{
char[] temp = new char[length()];

// Copy char by char; the data may span several chunks.
for( int i = 0; i < length(); i++ ) {
temp[i] = charAt( i );
}
return new String( temp );
}

public boolean contentEquals( CharSequence cs ) {
if( cs.length() != length() ) {
return false;
}

for( int i = 0; i < length(); i++ ) {
if( charAt( i ) != cs.charAt( i ) ) {
return false;
}
}
return true;
}
}
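
If a plain String is acceptable instead of a chunked CharSequence, a simpler route is to drain a Reader into a StringBuilder. This is only a sketch of an alternative, not what the class above does; ReadAll and readFully are hypothetical names, and it costs an extra copy of the data when toString() is called.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAll
{
    // Reads every remaining char from 'in' and closes it afterwards.
    static String readFully( Reader in ) throws IOException
    {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try
        {
            int n;
            while( (n = in.read( buf )) >= 0 )
            {
                sb.append( buf, 0, n );
            }
        }
        finally
        {
            in.close();
        }
        return sb.toString();
    }

    public static void main( String[] args ) throws IOException
    {
        String text = "1\n2\n3\n";
        // A StringReader stands in for a file here; the final newline survives.
        System.out.println( readFully( new StringReader( text ) ).equals( text ) );
    }
}
```

For a file, the same method could be handed a java.io.FileReader or an InputStreamReader wrapped around a FileInputStream.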
 
Knute Johnson

markspace said:
Hi all,

The following code demonstrates a bug in the java.util.Scanner class, I
think. It creates a large file, then attempts to read in the same file
with a Scanner using a delimiter of "\z".

This doesn't work. Only a part of the file is read (the first 1024
bytes). The result is that the comparison operation fails. I can
manually inspect the file created and it does have the correct number of
strings -- 16384. This code finds the last string at number 284.

\$ will work just fine or \z?m or \Z?m for that matter.

One difference between Java and Perl is that the Perl docs don't describe
the end of line/string metacharacters in the same language the Java docs
use. Java calls \z end of input. I don't really know if that was meant to
be anything different or not.

I'll leave it to you or someone else to test Perl for similar behavior
but I suspect you have detected a bug.
 
markspace

Knute said:
\$ will work just fine or \z?m or \Z?m for that matter.

Yes, these do work. ?m confuses me though, is that supposed to be
"multiline mode?" I thought it went in parentheses like this: "(?m)".
 
Roedy Green

markspace said:
with a Scanner using a delimiter of "\z".

In the days of CP/M and DOS \z was a magic EOF character. Perhaps your
OS is still treating it as such. Try a different char.
 
Eric Sosman

Roedy said:
In the days of CP/M and DOS \z was a magic EOF character. Perhaps your
OS is still treating it as such. Try a different char.

Have you confused \z with CTRL-Z, ASCII SUB, 0x1A?

In any event, the Scanner doesn't treat \z as a literal
character, but as the source for a Pattern. As far as the
Pattern is concerned, \z is a two-character sequence that
stands for the condition "end of input" and not for any
particular character value. (Indeed, \z *cannot* match a
character!)
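
A quick way to see this is to run \z against text that actually contains a CTRL-Z. The sketch below uses only java.util.regex; the class name ZeroWidthEnd and its two helpers are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroWidthEnd
{
    // CTRL-Z, ASCII SUB.
    static final char SUB = (char) 0x1A;

    // Index where 'regex' first matches in 'input', or -1.
    static int firstMatchStart( String regex, String input )
    {
        Matcher m = Pattern.compile( regex ).matcher( input );
        return m.find() ? m.start() : -1;
    }

    // Number of chars consumed by the first match, or -1 if none.
    static int firstMatchLength( String regex, String input )
    {
        Matcher m = Pattern.compile( regex ).matcher( input );
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main( String[] args )
    {
        String withSub = "ab" + SUB + "cd"; // SUB sits at index 2
        System.out.println( firstMatchStart( "\\z", withSub ) );
        System.out.println( firstMatchLength( "\\z", withSub ) );
        System.out.println( Pattern.matches( "\\z", String.valueOf( SUB ) ) );
    }
}
```

The first match starts at index 5, the end of the string, not at the SUB character at index 2; it consumes zero characters; and a string consisting of a lone SUB does not match \z as a whole.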
 
Knute Johnson

markspace said:
Yes, these do work. ?m confuses me though, is that supposed to be
"multiline mode?"

Yes.

markspace said:
I thought it went in parentheses like this: "(?m)".

After reading the Perl book again, I'm not sure why it works without the
() and really don't understand why it doesn't work with them.

I think I'm just going to put this one into the unknowable box.
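
One possible explanation, which nobody in the thread confirmed: in java.util.regex, \$ is an escaped literal dollar sign, and \z?m parses as an optional \z followed by a literal letter m. Neither character occurs in the test file, so such a delimiter would simply never match, and Scanner would presumably hand back everything up to end of input. That would also fit "(?m)\z" not helping, since that form still relies on a real \z. The sketch below only checks the regex side of that reading; LiteralDelimiters and its helpers are made-up names.

```java
import java.util.regex.Pattern;

public class LiteralDelimiters
{
    // True if 'regex' matches the whole of 'input'.
    static boolean matchesWhole( String regex, String input )
    {
        return Pattern.matches( regex, input );
    }

    // True if 'regex' matches anywhere inside 'input'.
    static boolean occursIn( String regex, String input )
    {
        return Pattern.compile( regex ).matcher( input ).find();
    }

    public static void main( String[] args )
    {
        String sample = "123\n456\n"; // digits and newlines, like the test file

        // Both patterns match a single ordinary character.
        System.out.println( matchesWhole( "\\$", "$" ) );
        System.out.println( matchesWhole( "\\z?m", "m" ) );

        // Neither pattern matches anywhere in numeric test data.
        System.out.println( occursIn( "\\$", sample ) );
        System.out.println( occursIn( "\\z?m", sample ) );
    }
}
```

If this reading is right, those delimiters "work" only as long as the input contains no $ or m, which carries the same risk noted earlier for \00 and \e.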
 
