Think I May Have Found a Bug with wc -l

KevinSimonson

I'm wondering whether I've found a bug in either my system's Java
interpreter or my system's Cygwin Unix emulation. I've written a
program, shown below, that produces a file for which Unix’s “wc -l”
reports a different number of lines than an extremely simple Java
program that uses the <Scanner> class’s <hasNextLine()> and
<nextLine()> methods to count the lines in a file.

I’ve also written a Java program that uses the same two methods to
print each line of the input file to the screen, and I've shown its
output next to the output of the Unix command “cat” so you can see the
difference. Note the blank line between the two lines of text in the
Java program's output.

So apparently Unix’s “wc -l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator, while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

Kevin Simonson


###################################################################################

Script started on Sat Mar 19 11:11:56 2011
sh-4.1$ pwd
/cygdrive/c/Users/kvnsmnsn/Java/WclBug
sh-4.1$ ls -F
JvCat.class JvWcl.class WclBreaker.class WclBroken
JvCat.java JvWcl.java WclBreaker.java
sh-4.1$ cat WclBreaker.java
import java.io.FileWriter;
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.IOException;

public class WclBreaker
{
  public static void main ( String[] arguments)
  {
    if (arguments.length == 3)
    { try
      { PrintWriter breaker
                      = new PrintWriter
                          ( new BufferedWriter( new FileWriter( arguments[ 0])));
        // Two carriage returns and a new-line go between the two strings.
        breaker.println( arguments[ 1] + "\r\r\n" + arguments[ 2]);
        breaker.close();
      }
      catch (IOException excptn)
      { System.err.println
          ( "Couldn't open file \"" + arguments[ 0] + "\" for output!");
      }
    }
    else
    { System.out.println( "Usage is");
      System.out.println
        ( "  java WclBreaker <broken-file> <first-string> <second-string>");
    }
  }
}
sh-4.1$ cat JvWcl.java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class JvWcl
{
  public static void main ( String[] arguments)
  {
    if (0 < arguments.length)
    { int lineCount;
      Scanner scnnr;
      for (int arg = 0; arg < arguments.length; arg++)
      { try
        { scnnr = new Scanner( new File( arguments[ arg]));
          for (lineCount = 0; scnnr.hasNextLine(); lineCount++)
          { scnnr.nextLine();
          }
          scnnr.close();
          // Report the count next to the file it belongs to.
          System.out.println( lineCount + " " + arguments[ arg]);
        }
        catch (FileNotFoundException excptn)
        { System.err.println
            ( "JvWcl: " + arguments[ arg] + ": No such file or directory");
        }
      }
    }
    else
    { System.out.println( "Usage is\n  java JvWcl <file-name>+");
    }
  }
}
sh-4.1$ cat JvCat.java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class JvCat
{
  public static void main ( String[] arguments)
  {
    if (0 < arguments.length)
    { Scanner scnnr;
      int arg = 0;
      try
      { while (arg < arguments.length)
        { scnnr = new Scanner( new File( arguments[ arg++]));
          while (scnnr.hasNextLine())
          { System.out.println( scnnr.nextLine());
          }
          scnnr.close();
        }
      }
      catch (FileNotFoundException excptn)
      { System.err.println
          ( "Couldn't find file \"" + arguments[ arg - 1] + "\"!");
      }
    }
    else
    { System.out.println( "Usage is\n  java JvCat <file-name>+");
    }
  }
}
sh-4.1$ java WclBreaker AbcDef.Txt Abc Def
sh-4.1$ wc -l AbcDef.Txt
2 AbcDef.Txt
sh-4.1$ java JvWcl AbcDef.Txt
3 AbcDef.Txt
sh-4.1$ cat AbcDef.Txt
Abc
Def
sh-4.1$ java JvCat AbcDef.Txt
Abc

Def
sh-4.1$ exit
exit

Script done on Sat Mar 19 11:13:55 2011
 
Daniele Futtorovic

So apparently Unix’s “wc -l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.
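
For anyone who wants to reproduce the two counts without Cygwin, here is a minimal sketch. It assumes the file holds exactly the characters "Abc\r\r\nDef\r\n", which is what WclBreaker would write on a platform where println appends "\r\n". The class and method names are made up for illustration, and countNewlines only imitates what "wc -l" counts; it says nothing about how wc is implemented.

```java
import java.util.Scanner;

public class LineCountDemo {
    // Imitates "wc -l": counts '\n' characters and nothing else.
    static int countNewlines(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                count++;
            }
        }
        return count;
    }

    // Counts lines the way JvWcl does, with hasNextLine()/nextLine().
    static int countScannerLines(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String contents = "Abc\r\r\nDef\r\n"; // assumed file contents
        System.out.println("newline count: " + countNewlines(contents));
        System.out.println("Scanner lines: " + countScannerLines(contents));
    }
}
```

The two numbers should agree with the 2 and 3 in the transcript above.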
 
KevinSimonson

So apparently Unix’s “wc -l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.

But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_. So why is the <Scanner> object's <nextLine()> after
the first line giving me an empty line?

Kevin Simonson
 
Eric Sosman

[...]
wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".
[...]
But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_. So why is the <Scanner> object's <nextLine()> after
the first line giving me an empty line?

You seem to have overlooked the | in the regular expression.
When confronted with the input "\r\r\n", the second part of the
regex matches the initial "\r". The Scanner presumably consumes
that matched character and presents the remaining "\r\n" to the
regex, which then matches both characters with its first part.
So the Scanner sees "\r\r\n" as two separators, "\r" and "\r\n".
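
A small sketch of that matching sequence, using java.util.regex directly on the pattern quoted in this thread. The class and method names are hypothetical, and this only shows how the regex itself carves up the input; it doesn't claim to reproduce Scanner's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorMatchDemo {
    // The line-break pattern as quoted in this thread.
    static final Pattern SEPARATOR =
        Pattern.compile("\r\n|[\n\r\u2028\u2029\u0085]");

    // Returns each successive match, with CR and LF spelled out
    // as \r and \n so they are visible when printed.
    static List<String> separatorsIn(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SEPARATOR.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group()
                             .replace("\r", "\\r")
                             .replace("\n", "\\n"));
        }
        return found;
    }

    public static void main(String[] args) {
        // Two carriage returns followed by a new-line.
        System.out.println(separatorsIn("\r\r\n"));
    }
}
```

The first alternative can't match at the first character because that "\r" is followed by another "\r", so the character class takes it; the first alternative then matches the remaining two characters.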
 
Peter J. Holzer

So apparently Unix’s “wc -l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

And you call that a bug with wc -l?

wc will split on the unix line separator, that is \n. Java's Scanner is
apparently intended to recognise input from different OSses, and to it,
a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

What you're looking at is two different tools doing things differently.
There's no 'bug' in either of them. They're just different.

But two consecutive carriage returns followed by a new-line _don't_
match "\r\n[\n\r\u2028\u2029\u0085]" twice; in fact it doesn't even
match it _once_.

True, but Daniele wrote "\r\n|[\n\r\u2028\u2029\u0085]" (note the extra
"|" in the middle).

The first "\r" matches /[\n\r\u2028\u2029\u0085]/.
Then the remaining "\r\n" matches /\r\n/.

hp
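
To check the effect of the "|" directly, here is a rough sketch that tries both spellings of the pattern against two carriage returns and a new-line. The class, field and method names are invented for the example.

```java
import java.util.regex.Pattern;

public class MissingBarDemo {
    // The pattern with the "|" between the two alternatives.
    static final Pattern WITH_BAR =
        Pattern.compile("\r\n|[\n\r\u2028\u2029\u0085]");

    // The same text with the "|" left out: now it demands "\r\n"
    // followed by one more line-break character.
    static final Pattern WITHOUT_BAR =
        Pattern.compile("\r\n[\n\r\u2028\u2029\u0085]");

    // True if the pattern matches anywhere in the text.
    static boolean occursIn(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String input = "\r\r\n";
        System.out.println("with |:    " + occursIn(WITH_BAR, input));
        System.out.println("without |: " + occursIn(WITHOUT_BAR, input));
    }
}
```

Without the "|" there is nothing in the input for the character class to match after the "\r\n", so that version finds no match at all, just as Kevin said.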
 
Mike Schilling

KevinSimonson said:
So apparently Unix’s “wc -l” and “cat” treat a series of two carriage
returns followed by a new-line as one line separator while <Scanner>’s
<nextLine()> treats those three characters as two line separators.
Why the difference?

Because that's not a sensible combination of characters for a text file.
In Windows format, a CR should always be followed by an LF. In Unix format,
a CR should not appear in a text file. There isn’t a single best way to
handle the situation.
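
To illustrate that this is a policy choice, here is a sketch of one alternative policy: only "\n" ends a line, and any carriage returns sitting directly in front of it are thrown away. This is purely an assumption made for the example (the class and method names are made up too), not something any of the tools in this thread is documented to do.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientLines {
    // One possible policy: split on '\n' only, then drop any run of
    // carriage returns left at the end of each piece.
    static List<String> splitOnLineFeed(String text) {
        List<String> lines = new ArrayList<>();
        for (String piece : text.split("\n")) {
            lines.add(piece.replaceAll("\\r+\\z", ""));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed contents of the file from the original post.
        System.out.println(splitOnLineFeed("Abc\r\r\nDef\r\n"));
    }
}
```

Under this policy the file from the original post comes out as two lines, which happens to agree with "wc -l". A reader that treats a lone "\r" as an old Mac-style line ending would be just as defensible and would give three.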
 
Lawrence D'Oliveiro

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?
 
Arne Vajhøj

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

(NEL is an EBCDIC-related thingy. I don't know who uses LS and PS.)

Arne
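
Here is a quick sketch showing that Scanner does break lines on those three characters, next to BufferedReader.readLine(), which is documented to recognise only "\n", "\r" and "\r\n". The class and method names are hypothetical, and the sample string is invented for the test.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

public class UnicodeSeparatorDemo {
    // Counts lines with Scanner.hasNextLine()/nextLine().
    static int scannerLines(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    // Counts lines with BufferedReader.readLine().
    static int readerLines(String text) {
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new StringReader(text))) {
            while (reader.readLine() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // NEL, LS and PS between four one-letter "lines".
        String text = "a\u0085b\u2028c\u2029d";
        System.out.println("Scanner:        " + scannerLines(text));
        System.out.println("BufferedReader: " + readerLines(text));
    }
}
```

So even within the Java library, which characters count as a line break depends on which class you ask.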
 
Lawrence D'Oliveiro

Daniele Futtorovic said:
Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

Let me ask again: who uses them?
 
Daniele Futtorovic

Lawrence D'Oliveiro said:
Let me ask again: who uses them?

Correct answer to the wrong question: someone.

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?
 
Lawrence D'Oliveiro

Daniele Futtorovic said:
Correct answer to the wrong question: someone.

Who?

Correct question: can you, as the designer of a multi-platform tool,
safely assume no one who matters ever does or will?

Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?
 
Daniele Futtorovic

Lawrence D'Oliveiro said:
Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

Let me ROFL.
 
Arved Sandstrom

Lawrence D'Oliveiro said:
Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

The Unicode Standard from the Unicode Consortium is the convention. I'm
guessing it's reasonably accepted these days.

Why bust on a technology for having the temerity to use part of a standard?

I hear an implied statement here, though, that it's somehow not kosher
for Java to impose any conventions. Why is that exactly?

AHS
 
Arne Vajhøj

Lawrence D'Oliveiro said:
Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

They are defined in Unicode.

Java supports Unicode.

The conclusion seems obvious.

Arne
 
Arne Vajhøj

Daniele Futtorovic said:

Java's Scanner is apparently intended to recognise input from different
OSses, and to it, a line break is something that matches the pattern:
"\r\n|[\n\r\u2028\u2029\u0085]".

Why? Who uses those additional Unicode characters as line separators?

\u0085 = NEL = new line
\u2028 = LS = line separator
\u2029 = PS = paragraph separator

Seems logical to me.

(NEL is an EBCDIC-related thingy. I don't know who uses LS and PS.)

Let me ask again: who uses them?

Read!!

Arne
 
Volker Borchert

Lawrence said:
Let me ask again in a different way: is Java just trying to handle an
accepted convention, or is it trying to impose one?

It's just following the "be strict in what you generate, be generous
in what you accept" rule.
 
Lew

Volker said:
It's just following the "be strict in what you generate, be generous
in what you accept" rule.

Plus it's blazingly obvious from the cited material upthread that Java is
handling an accepted, even documented, convention. There were references not
written by the Java folks, e.g., Leif's to

http://msdn.microsoft.com/en-us/library/dd374110(v=vs.85).aspx

so "Lawrence" really should have figured that one out on his own, given that
the question was already answered here before he asked it.
 