Scanner Bug?

Discussion in 'Java' started by markspace, Dec 30, 2009.

  1. markspace

    markspace Guest

    Hi all,

    The following code demonstrates a bug in the java.util.Scanner class, I
    think. It creates a large file, then attempts to read in the same file
    with a Scanner using a delimiter of "\z".

    This doesn't work. Only a part of the file is read (the first 1024
    bytes). The result is that the comparison operation fails. I can
    manually inspect the file created and it does have the correct number of
    strings -- 16384. This code finds the last string at number 284.

    Am I doing something wrong? Or should I report this to Sun?


    <output>
    run:
    fileContents.length()=1024
    Exception in thread "main" java.lang.RuntimeException: Result was: 28,
    expected 283
    at scannerbug.ScannerBug.testContents(ScannerBug.java:55)
    at scannerbug.ScannerBug.main(ScannerBug.java:30)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)
    </output>


    <sscce>

    package scannerbug;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Scanner;


    public class ScannerBug
    {

        private static final String FILE = "test.txt";
        private static final int TEST_FILE_SIZE = 16 * 1024;

        public static void main( String[] args )
                throws Exception
        {
            makeTestSource( FILE );
            String fileContents = new Scanner( new File( FILE ) )
                    .useDelimiter( "\\z" ).next();
            System.err.println( "fileContents.length()="
                    + fileContents.length() );
            testContents( fileContents );
        }

        private static void makeTestSource( String string )
                throws IOException
        {
            BufferedWriter bw = new BufferedWriter( new FileWriter( string ) );
            for( int i = 0; i < TEST_FILE_SIZE; i++ )
            {
                bw.write( Integer.toString( i ) );
                bw.write( '\n' );
            }
            bw.close();
        }

        private static void testContents( String string )
        {
            Scanner scanner = new Scanner( string );
            for( int i = 0; i < TEST_FILE_SIZE; i++ )
            {
                if( scanner.hasNextInt() )
                {
                    int result = scanner.nextInt();
                    if( i != result )
                    {
                        throw new RuntimeException( "Result was: " + result
                                + ", expected " + i );
                    }
                }else
                {
                    throw new RuntimeException(
                            "Ran out of ints in string at pos "
                            + i );
                }
            }
        }
    }
    </sscce>
     
    markspace, Dec 30, 2009
    #1

  2. markspace

    markspace Guest

    Peter Duniho wrote:

    > The only thing that would prevent it from doing that is an exception
    > during reading. Why an exception would occur, I have no idea. But have
    > you checked the return value from Scanner.ioException() when you detect
    > the invalid input, to see if there was in fact one?



    I think this is a good idea. Unfortunately, ioException() returns null
    and the input is still only 1024 bytes long. Oh well.

    <snippet>
    public static void main( String[] args )
            throws Exception
    {
        makeTestSource( FILE );
        Scanner scanner = new Scanner( new File( FILE ) )
                .useDelimiter( "\\z" );
        String fileContents = scanner.next();
        System.err.println( scanner.ioException() );

        System.err.println( "fileContents.length()="
                + fileContents.length() );
        testContents( fileContents );
    }
    </snippet>

    <output>
    run:
    null
    fileContents.length()=1024
    Exception in thread "main" java.lang.RuntimeException: Result was: 28,
    expected 283
    at scannerbug.ScannerBug.testContents(ScannerBug.java:52)
    at scannerbug.ScannerBug.main(ScannerBug.java:27)
    Java Result: 1
    BUILD SUCCESSFUL (total time: 0 seconds)
    </output>
     
    markspace, Dec 30, 2009
    #2

  3. Arne Vajhøj

    Arne Vajhøj Guest

    On 30-12-2009 13:09, markspace wrote:
    > The following code demonstrates a bug in the java.util.Scanner class, I
    > think. It creates a large file, then attempts to read in the same file
    > with a Scanner using a delimiter of "\z".
    >
    > This doesn't work. Only a part of the file is read (the first 1024
    > bytes). The result is that the comparison operation fails. I can
    > manually inspect the file created and it does have the correct number of
    > strings -- 16384. This code finds the last string at number 284.
    >
    > Am I doing something wrong? Or should I report this to Sun?


    A quick glance at the source shows a default buffer size of 1024.

    For some reason that buffer does not get extended.

    Arne
     
    Arne Vajhøj, Dec 30, 2009
    #3
  4. In article <hhg510$hi9$-september.org>,
    markspace <> wrote:

    > The following code demonstrates a bug in the java.util.Scanner class,
    > I think. It creates a large file, then attempts to read in the same
    > file with a Scanner using a delimiter of "\z".
    >
    > This doesn't work. Only a part of the file is read (the first 1024
    > bytes). The result is that the comparison operation fails. I can
    > manually inspect the file created and it does have the correct number
    > of strings -- 16384. This code finds the last string at number 284.


    As Pete suggested, I get normal results using other delimiters, e.g. \Z,
    \e or \00. I see that \Z matches "The end of the input but for the final
    terminator, if any." In contrast, \z matches "The end of the input."
    I wondered if 1024 might be the length of an underlying buffer, but
    wrapping the file in a large BufferedInputStream didn't change anything.
    The Scanner's ioException() method reports null.
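
    To see the difference between the two anchors in isolation, here is a
    minimal sketch that applies java.util.regex directly to an in-memory
    string instead of going through a Scanner. The class name EndAnchorDemo
    and the helper firstMatchStart are made up for illustration. It only
    shows the documented semantics quoted above; it does not reproduce the
    1024-character truncation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EndAnchorDemo
{
    // Returns the index of the first match of regex in input, or -1 if none.
    static int firstMatchStart( String regex, String input )
    {
        Matcher m = Pattern.compile( regex ).matcher( input );
        return m.find() ? m.start() : -1;
    }

    public static void main( String[] args )
    {
        String input = "abc\n";   // four chars, ends with a line terminator

        // \Z can match just before the final line terminator (index 3).
        System.out.println( "\\Z first matches at "
                + firstMatchStart( "\\Z", input ) );

        // \z only matches at the very end of the input (index 4).
        System.out.println( "\\z first matches at "
                + firstMatchStart( "\\z", input ) );
    }
}
```

    This is also why a token delimited by \Z loses the trailing newline
    while one delimited by \z should keep it.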

    > Am I doing something wrong?


    I don't think so.

    > Or should I report this to Sun?


    I had fun reporting this NetBeans bug:

    <http://sites.google.com/site/trashgod/profile>

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Dec 30, 2009
    #4
  5. markspace

    markspace Guest

    John B. Matthews wrote:
    >
    > As Pete suggested, I get normal results using other delimiters, e.g. \Z,
    > \e or \00. I see that \Z matches "The end of the input but for the final
    > terminator, if any." In contrast, \z is matches "The end of the input."



    \Z works correctly for me also, but it does discard the final newline,
    which is undesired. (I have a larger program with different tests, which
    fails testing due to the missing newline, even though the SSCCE I posted
    succeeds). \00 works ("\\00") but seems dangerous, if a NUL happens to
    appear in the input stream. Same for \e.

    I think that \z should work the same as \Z, except for the final
    newline. That it doesn't seems to be a bug.


    Incidentally, my current solution is below. ;)


    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;

    /**
     * A CharSequence backed by a list of fixed-size char buffers.
     *
     * @author Brenden
     */
    public final class UnboundedCharSeq
            implements CharSequence
    {

        private static final int BUF_SIZE = 1024;
        final ArrayList<char[]> buffers;
        final int startOffset;
        final int endOffset;

        // Cached hash; 0 means "not computed yet".
        int hashcode;

        public UnboundedCharSeq( File f )
                throws FileNotFoundException, IOException
        {
            this( new FileInputStream( f ) );
        }

        public UnboundedCharSeq( InputStream ins )
                throws IOException
        {
            // Decodes using the platform default charset.
            InputStreamReader inr = new InputStreamReader( ins );
            int totalChars = 0;
            ArrayList<char[]> buf = new ArrayList<char[]>();

            try
            {
                char[] charBuf = new char[BUF_SIZE];
                buf.add( charBuf );
                int retVal;
                int pos = 0;
                int len = charBuf.length;

                while( (retVal = inr.read( charBuf, pos, len )) >= 0 )
                {
                    totalChars += retVal;
                    pos += retVal;
                    len -= retVal;
                    if( len == 0 )
                    {
                        // Current buffer is full; start a new one.
                        charBuf = new char[BUF_SIZE];
                        buf.add( charBuf );
                        pos = 0;
                        len = charBuf.length;
                    }
                }
            }finally
            {
                inr.close();
            }

            buffers = buf;
            startOffset = 0;
            endOffset = totalChars;
        }

        // Used by subSequence(); shares the buffers instead of copying.
        private UnboundedCharSeq( ArrayList<char[]> buf, int start, int end )
        {
            buffers = buf;
            startOffset = start;
            endOffset = end;
        }

        @Override
        public int length()
        {
            return endOffset - startOffset;
        }

        @Override
        public char charAt( int index )
        {
            checkIndex( index );
            index += startOffset;
            return buffers.get( index / BUF_SIZE )[index % BUF_SIZE];
        }

        @Override
        public CharSequence subSequence( int start, int end )
        {
            // start == length() is legal here (it yields an empty
            // sequence), so checkIndex() would be too strict.
            if( start < 0 || start > end || end > length() ) {
                throw new IndexOutOfBoundsException( "range: [" + start
                        + ".." + end + ") must lie within [0.."
                        + length() + "]" );
            }
            start += startOffset;
            end += startOffset;
            return new UnboundedCharSeq( buffers, start, end );
        }

        private void checkIndex( int index )
        {
            if( index >= length() || index < 0 )
            {
                throw new IndexOutOfBoundsException( "index: " + index
                        + " must be [0.." + length() + ")" );
            }
        }

        @Override
        public boolean equals( Object obj )
        {
            if( !(obj instanceof UnboundedCharSeq ) ) {
                return false;
            }
            return contentEquals( (CharSequence) obj );
        }

        @Override
        public int hashCode()
        {
            if( hashcode == 0 ) {
                hashcode = length() * 31 + 17;
                for( int i = 0; i < length(); i++ ) {
                    hashcode = hashcode * 37 + charAt( i );
                }
            }
            return hashcode;
        }

        @Override
        public String toString()
        {
            char[] temp = new char[length()];

            for( int i = 0; i < length(); i++ ) {
                temp[i] = charAt( i );
            }
            return new String( temp );
        }

        public boolean contentEquals( CharSequence cs ) {
            if( cs.length() != length() ) {
                return false;
            }

            for( int i = 0; i < length(); i++ ) {
                if( charAt( i ) != cs.charAt( i ) ) {
                    return false;
                }
            }
            return true;
        }
    }
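
    A simpler alternative, which nobody in the thread proposed: if all that
    is needed is the file's contents as a String, a plain Reader loop that
    appends to a StringBuilder avoids both Scanner and a custom
    CharSequence. A minimal sketch, assuming the platform default charset
    is acceptable; the names ReadWholeFile and readAll are made up for
    illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

final class ReadWholeFile
{
    // Reads the whole file into a String using the platform default charset.
    static String readAll( File f ) throws IOException
    {
        Reader in = new InputStreamReader( new FileInputStream( f ) );
        try
        {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            // read() returns -1 at end of stream.
            while( (n = in.read( buf )) >= 0 )
            {
                sb.append( buf, 0, n );
            }
            return sb.toString();
        }finally
        {
            in.close();
        }
    }

    public static void main( String[] args ) throws IOException
    {
        // Write the same kind of test data as the SSCCE, to a temp file.
        File f = File.createTempFile( "readall", ".txt" );
        f.deleteOnExit();
        StringBuilder expected = new StringBuilder();
        for( int i = 0; i < 16 * 1024; i++ )
        {
            expected.append( i ).append( '\n' );
        }
        FileWriter w = new FileWriter( f );
        try
        {
            w.write( expected.toString() );
        }finally
        {
            w.close();
        }

        String contents = readAll( f );
        System.out.println( "length=" + contents.length() );
        System.out.println( "matches=" + contents.equals( expected.toString() ) );
    }
}
```

    Unlike the \Z delimiter, this keeps the final newline, at the cost of
    holding the whole file in one contiguous String.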
     
    markspace, Dec 30, 2009
    #5
  6. On 12/30/2009 10:09 AM, markspace wrote:
    > Hi all,
    >
    > The following code demonstrates a bug in the java.util.Scanner class, I
    > think. It creates a large file, then attempts to read in the same file
    > with a Scanner using a delimiter of "\z".
    >
    > This doesn't work. Only a part of the file is read (the first 1024
    > bytes). The result is that the comparison operation fails. I can
    > manually inspect the file created and it does have the correct number of
    > strings -- 16384. This code finds the last string at number 284.


    \$ will work just fine or \z?m or \Z?m for that matter.

    Java and Perl differ here, in that the Perl docs don't describe the end
    of line/string metacharacters in the same language the Java docs use.
    Java calls \z end of input. I don't really know if that was meant to be
    anything different or not.

    I'll leave it to you or someone else to test Perl for similar behavior
    but I suspect you have detected a bug.

    --

    Knute Johnson
    email s/nospam/knute2010/
     
    Knute Johnson, Dec 31, 2009
    #6
  7. markspace

    markspace Guest

    Knute Johnson wrote:

    >
    > \$ will work just fine or \z?m or \Z?m for that matter.


    Yes, these do work. ?m confuses me though; is that supposed to be
    "multiline mode"? I thought it went in parentheses, like this: "(?m)".
     
    markspace, Dec 31, 2009
    #7
  8. Roedy Green

    Roedy Green Guest

    On Wed, 30 Dec 2009 10:09:32 -0800, markspace <>
    wrote, quoted or indirectly quoted someone who said :

    >with a Scanner using a delimiter of "\z".


    In the days of CP/M and DOS \z was a magic EOF character. Perhaps your
    OS is still treating it as such. Try a different char.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    If you give someone a program, you will frustrate them for a day; if you teach them how to program, you will frustrate them for a lifetime.
     
    Roedy Green, Jan 1, 2010
    #8
  9. Eric Sosman

    Eric Sosman Guest

    On 1/1/2010 12:06 AM, Roedy Green wrote:
    > On Wed, 30 Dec 2009 10:09:32 -0800, markspace<>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >> with a Scanner using a delimiter of "\z".

    >
    > In the days of CPM and DOS \z was a magic EOF character. Perhaps your
    > OS it still treating it as such. Try a different char.


    Have you confused \z with CTRL-Z, ASCII SUB, 0x1A?

    In any event, the Scanner doesn't treat \z as a literal
    character, but as the source for a Pattern. As far as the
    Pattern is concerned, \z is a two-character sequence that
    stands for the condition "end of input" and not for any
    particular character value. (Indeed, \z *cannot* match a
    character!)
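
    A minimal sketch of that point, using java.util.regex on a string that
    actually contains a CTRL-Z (0x1A) character. The class name
    ZeroWidthDemo and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZeroWidthDemo
{
    // Returns { start, end } of the first match, or null if there is none.
    static int[] firstMatch( String regex, String input )
    {
        Matcher m = Pattern.compile( regex ).matcher( input );
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main( String[] args )
    {
        // Five chars: 'a', 'b', CTRL-Z (0x1A), 'c', 'd'.
        String input = "ab" + (char) 0x1A + "cd";

        // \z ignores the embedded CTRL-Z and matches the empty string
        // at the end of the input: start == end == 5.
        int[] z = firstMatch( "\\z", input );
        System.out.println( "\\z   -> " + z[0] + ".." + z[1] );

        // \x1A is the pattern that matches the CTRL-Z character itself,
        // consuming one char: start == 2, end == 3.
        int[] sub = firstMatch( "\\x1A", input );
        System.out.println( "\\x1A -> " + sub[0] + ".." + sub[1] );
    }
}
```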

    --
    Eric Sosman
     
    Eric Sosman, Jan 1, 2010
    #9
  10. Arne Vajhøj

    Arne Vajhøj Guest

    Arne Vajhøj, Jan 2, 2010
    #10
  11. On 12/31/2009 12:10 PM, markspace wrote:
    > Knute Johnson wrote:
    >
    >>
    >> \$ will work just fine or \z?m or \Z?m for that matter.

    >
    > Yes these do work. ?m confuses me though, is that supposed to be
    > "multiline mode?"


    Yes

    > I thought it went in parenthesis like this: "(?m)".

    After reading the Perl book again, I'm not sure why it works without the
    () and really don't understand why it doesn't work with them.

    I think I'm just going to put this one into the unknowable box.
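
    For reference, the documented ways to turn on multiline mode in
    java.util.regex are the embedded flag "(?m)" at the start of the
    pattern or the Pattern.MULTILINE argument to Pattern.compile(). A
    minimal sketch of what the flag changes for $, run on an in-memory
    string rather than through a Scanner; the class name MultilineFlagDemo
    and the helper countMatches are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineFlagDemo
{
    // Counts how many times the pattern matches in input.
    static int countMatches( Pattern p, String input )
    {
        Matcher m = p.matcher( input );
        int count = 0;
        while( m.find() )
        {
            count++;
        }
        return count;
    }

    public static void main( String[] args )
    {
        String input = "a\nb";   // two lines, no trailing newline

        // Without the flag, $ matches only at the end of the input.
        System.out.println( countMatches(
                Pattern.compile( "$" ), input ) );                      // 1

        // With the embedded flag, $ also matches before the newline.
        System.out.println( countMatches(
                Pattern.compile( "(?m)$" ), input ) );                  // 2

        // Passing the flag to compile() has the same effect.
        System.out.println( countMatches(
                Pattern.compile( "$", Pattern.MULTILINE ), input ) );   // 2
    }
}
```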

    --

    Knute Johnson
    email s/nospam/knute2010/
     
    Knute Johnson, Jan 3, 2010
    #11