Think I May Have Found a Bug with wc -l

KevinSimonson · Mar 19, 2011

I'm wondering if I may have found a bug with either my system's Java
interpreter or my system's Cygwin Unix emulation. I've written a
program, shown below, that produces a file that gives a
different number of lines when passed as an argument to Unix's "wc -l"
than when passed to an extremely simple Java program that uses the
<Scanner> class's <hasNextLine()> and <nextLine()> methods to count how
many lines a file has.

I've also written a Java program that uses the same two methods to
simply output to the screen each line from the input file, and have
displayed its output next to the output of the Unix command "cat" to
show the difference between those two outputs. Note the blank line
between the two lines of text in the Java program's output.

So apparently Unix's "wc -l" and "cat" treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>'s
<nextLine()> treats those three characters as two line separators.
Why the difference?

Kevin Simonson

###################################################################################

Script started on Sat Mar 19 11:11:56 2011
sh-4.1$ pwd
/cygdrive/c/Users/kvnsmnsn/Java/WclBug
sh-4.1$ ls -F
JvCat.class JvWcl.class WclBreaker.class WclBroken
JvCat.java JvWcl.java WclBreaker.java
sh-4.1$ cat WclBreaker.java
import java.io.FileWriter;
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.IOException;

public class WclBreaker
{
  public static void main ( String[] arguments)
  {
    if (arguments.length == 3)
    { try
      { PrintWriter breaker
          = new PrintWriter
              ( new BufferedWriter( new FileWriter( arguments[ 0])));
        // Two carriage returns and a new-line separate the two strings.
        breaker.println( arguments[ 1] + "\r\r\n" + arguments[ 2]);
        breaker.close();
      }
      catch (IOException excptn)
      { System.err.println
          ( "Couldn't open file \"" + arguments[ 0] + "\" for output!");
      }
    }
    else
    { System.out.println( "Usage is");
      System.out.println
        ( " java WclBreaker <broken-file> <first-string> <second-string>");
    }
  }
}
sh-4.1$ cat JvWcl.java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class JvWcl
{
  public static void main ( String[] arguments)
  {
    if (0 < arguments.length)
    { int lineCount;
      Scanner scnnr;
      for (int arg = 0; arg < arguments.length; arg++)
      { try
        { scnnr = new Scanner( new File( arguments[ arg]));
          for (lineCount = 0; scnnr.hasNextLine(); lineCount++)
          { scnnr.nextLine();
          }
          scnnr.close();
          // Report the file just counted, not always the first argument.
          System.out.println( lineCount + " " + arguments[ arg]);
        }
        catch (FileNotFoundException excptn)
        { System.err.println
            ( "JvWcl: " + arguments[ arg] + ": No such file or directory");
        }
      }
    }
    else
    { System.out.println( "Usage is\n java JvWcl <file-name>+");
    }
  }
}
sh-4.1$ cat JvCat.java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class JvCat
{
  public static void main ( String[] arguments)
  {
    if (0 < arguments.length)
    { Scanner scnnr;
      int arg = 0;
      try
      { while (arg < arguments.length)
        { scnnr = new Scanner( new File( arguments[ arg++]));
          while (scnnr.hasNextLine())
          { System.out.println( scnnr.nextLine());
          }
          scnnr.close();
        }
      }
      catch (FileNotFoundException excptn)
      { System.err.println
          ( "Couldn't find file \"" + arguments[ arg - 1] + "\"!");
      }
    }
    else
    { System.out.println( "Usage is\n java JvCat <file-name>+");
    }
  }
}
sh-4.1$ java WclBreaker AbcDef.Txt Abc Def
sh-4.1$ wc -l AbcDef.Txt
2 AbcDef.Txt
sh-4.1$ java JvWcl AbcDef.Txt
3 AbcDef.Txt
sh-4.1$ cat AbcDef.Txt
Abc
Def
sh-4.1$ java JvCat AbcDef.Txt
Abc

Def
sh-4.1$ exit
exit

Script done on Sat Mar 19 11:13:55 2011

Daniele Futtorovic · Mar 19, 2011

So apparently Unix's "wc -l" and "cat" treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>'s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.
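
For anyone who wants to see the two counts side by side without going through a file, here is a minimal sketch. It assumes the file written by WclBreaker contains "Abc\r\r\nDef\r\n" (the trailing "\r\n" is an assumption about what println appends on Windows), and the class and method names are made up for illustration. One method counts new-line characters, which is what wc -l reports; the other counts lines the way JvWcl does.

```java
import java.util.Scanner;

class LineCountSketch
{
    // Counts '\n' characters, which is what wc -l reports.
    static int countNewlines(String text)
    {
        int count = 0;
        for (int i = 0; i < text.length(); i++)
        {
            if (text.charAt(i) == '\n')
            {
                count++;
            }
        }
        return count;
    }

    // Counts lines the way JvWcl does, with hasNextLine() and nextLine().
    static int countScannerLines(String text)
    {
        int count = 0;
        try (Scanner scanner = new Scanner(text))
        {
            while (scanner.hasNextLine())
            {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        // Assumed file contents: the two strings plus println's separator.
        String contents = "Abc\r\r\nDef\r\n";
        System.out.println("newline count: " + countNewlines(contents));     // 2
        System.out.println("Scanner lines: " + countScannerLines(contents)); // 3
    }
}
```

The 2 and 3 agree with the transcript in the original post.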

KevinSimonson · Mar 19, 2011

So apparently Unix's "wc -l" and "cat" treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>'s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.

But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_. So why is the <Scanner> object's <nextLine()> after
the first line giving me an empty line?

Kevin Simonson

Eric Sosman · Mar 19, 2011

[...]
wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".
[...]

But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_. So why is the <Scanner> object's <nextLine()> after
the first line giving me an empty line?

You seem to have overlooked the | in the regular expression.
When confronted with the input "\r\r\n", the second part of the
regex matches the initial "\r". The Scanner presumably consumes
that matched character and presents the remaining "\r\n" to the
regex, which then matches both characters with its first part.
So the Scanner sees "\r\r\n" as two separators, "\r" and "\r\n".
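
The matching order described above can be checked directly with java.util.regex. This is only a sketch: it applies the pattern quoted in this thread to "Abc\r\r\nDef" with Matcher.find(), which is not necessarily how Scanner is implemented internally. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorMatchSketch
{
    // The line-break pattern quoted earlier in the thread.
    static final Pattern LINE_BREAK =
        Pattern.compile("\r\n|[\n\r\u2028\u2029\u0085]");

    // Returns each separator found in text, with CR and LF made visible.
    static List<String> findSeparators(String text)
    {
        List<String> found = new ArrayList<>();
        Matcher matcher = LINE_BREAK.matcher(text);
        while (matcher.find())
        {
            found.add(matcher.group().replace("\r", "\\r").replace("\n", "\\n"));
        }
        return found;
    }

    public static void main(String[] args)
    {
        // Prints [\r, \r\n]: the lone "\r" first, then "\r\n" as a pair.
        System.out.println(findSeparators("Abc\r\r\nDef"));
    }
}
```

At the first "\r" the "\r\n" alternative fails because the next character is another "\r", so the character class takes it; the remaining "\r\n" is then matched by the first alternative.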

Peter J. Holzer · Mar 19, 2011

So apparently Unix's "wc -l" and "cat" treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>'s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.

But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_.

True, but Daniele wrote "\r\n|[\n\r\u2028\u2029\u0085]" (note the extra
"|" in the middle).

The first "\r" matches /[\n\r\u2028\u2029\u0085]/.
Then the remaining "\r\n" matches /\r\n/.

hp

Mike Schilling · Mar 20, 2011

KevinSimonson said:
I'm wondering if I may have found a bug with either my system's Java
interpreter or my system's Cygwin Unix emulation. I've written a
program whose contents I show below that produces a file that gives a
different number of lines when passed as an argument to Unix’s “wc –l”
and when passed to an extremely simple Java program that uses the
<Scanner> class’ <hasNextLine()> and <nextLine()> methods to count how
many lines a file has.
I’ve also written a Java program that uses the same two methods to
simply output to the screen each line from the input file, and have
displayed its output next to the output of the Unix command “cat” to
show the difference between those two outputs. Note the blank line
between the two lines of text.
So apparently Unix’s “wc –l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

Because that's not a sensible combination of characters for a text file.
In Windows format, a CR should always be followed by an LF. In Unix format,
a CR should not appear in a text file. There isn’t a single best way to
handle the situation.
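
As an illustration of one possible policy (not a recommendation, and not what wc or nextLine() do), a program that wants "\r\r\n" treated as a single separator could give Scanner its own delimiter. The sketch below assumes a delimiter of any run of carriage returns followed by a new-line; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LenientLinesSketch
{
    // Splits text into lines, treating any run of carriage returns that is
    // followed by a new-line as a single separator. This is one possible
    // policy chosen for illustration only.
    static List<String> readLines(String text)
    {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text))
        {
            scanner.useDelimiter("\r*\n");
            while (scanner.hasNext())
            {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args)
    {
        // Prints [Abc, Def]
        System.out.println(readLines("Abc\r\r\nDef\r\n"));
    }
}
```

With this delimiter a carriage return that is not followed by a new-line stays inside the token, which is a different trade-off again.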

Lawrence D'Oliveiro · Mar 20, 2011

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

Arne Vajhøj · Mar 20, 2011

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

(NEL is an EBCDIC related thingy. I don't know what uses LS and PS)

Arne
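
A small sketch of what that means in practice, assuming the pattern quoted in this thread is what nextLine() uses: a string whose only separators are NEL, LS and PS still comes back as separate lines. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class UnicodeSeparatorSketch
{
    // Collects the lines that hasNextLine()/nextLine() report for text.
    static List<String> scannerLines(String text)
    {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text))
        {
            while (scanner.hasNextLine())
            {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args)
    {
        // NEL, LS and PS are the only separators in this string.
        String text = "a\u0085b\u2028c\u2029d";
        // Prints [a, b, c, d]
        System.out.println(scannerLines(text));
    }
}
```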

Lawrence D'Oliveiro · Mar 20, 2011

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Lew · Mar 20, 2011

Well, there's always these guys:

http://msdn.microsoft.com/en-us/library/dd374110(v=vs.85).aspx

If "Lawrence" had the wit and industry to do even the most cursory search on
his own he would very, very quickly have found

http://en.wikipedia.org/wiki/Newline

and thus had the answer to his ostensible question.

Of course we know his real motivation had little or nothing to do with gaining
knowledge.

Daniele Futtorovic · Mar 20, 2011

Arne Vajhøj said:

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from
different OSses, and to it, a line break is something that
matches the pattern: "\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line
separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Lawrence D'Oliveiro · Mar 20, 2011

Daniele Futtorovic said:
Arne Vajhøj said:

On 19-03-2011 22:39, Lawrence D'Oliveiro wrote:

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from
different OSses, and to it, a line break is something that
matches the pattern: "\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line
separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Who?

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

Daniele Futtorovic · Mar 20, 2011

Daniele Futtorovic said:

Arne Vajhøj wrote:

On 19-03-2011 22:39, Lawrence D'Oliveiro wrote:

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from
different OSses, and to it, a line break is something that
matches the pattern: "\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line
separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Who?

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

Let me ROFL.

Arved Sandstrom · Mar 20, 2011

Daniele Futtorovic said:

Arne Vajhøj wrote:

On 19-03-2011 22:39, Lawrence D'Oliveiro wrote:

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from
different OSses, and to it, a line break is something that
matches the pattern: "\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line
separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Who?

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

The Unicode Standard from the Unicode Consortium is the convention. I'm
guessing it's reasonably accepted these days.

Why bust on a technology for having the temerity to use part of a standard?

I hear an implied statement here, though, that it's somehow not kosher
for Java to impose any conventions. Why is that exactly?

AHS

Arne Vajhøj · Mar 20, 2011

Daniele Futtorovic said:

Arne Vajhøj wrote:

On 19-03-2011 22:39, Lawrence D'Oliveiro wrote:

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from
different OSses, and to it, a line break is something that
matches the pattern: "\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line
separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Who?

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

They are defined in Unicode.

Java supports Unicode.

The conclusion seems obvious.

Arne

Arne Vajhøj · Mar 20, 2011

Well, there's always these guys:

http://msdn.microsoft.com/en-us/library/dd374110(v=vs.85).aspx

Actually I don't think they really use them.

They just write the mandatory bla bla when describing
Unicode.

Arne

Arne Vajhøj · Mar 20, 2011

Daniele Futtorovic wrote:

Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

(NEL is an EBCDIC related thingy. I don't know what uses LS and PS)

Let me ask again: who uses them?

Read!!

Arne

Volker Borchert · Mar 21, 2011

Lawrence said:
Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

It's just following the "be strict in what you generate, be generous
in what you accept" rule.

Lew · Mar 21, 2011

Volker said:
It's just following the "be strict in what you generate, be generous
in what you accept" rule.

Plus it's blazingly obvious from the cited material upthread that Java is
handling an accepted, even documented, convention. There were references not
written by the Java folks, e.g., Leif's to

http://msdn.microsoft.com/en-us/library/dd374110(v=vs.85).aspx

so "Lawrence" really should have figured that one out on his own, given that
the question was already answered here before he asked it.
