Find the number of lines in a text file

Martin Gregorie

Tor said:
Exception: If it is known the file has a set line (record) size in
bytes, and the line separator is known, then the number of lines =
file.size()/(recordSize+separatorSize)

Depends what operating system you're dealing with and how the JVM
implementation gets file size from it. Some operating systems return
block_size * blocks_in_file as the file size, rather than the space
occupied by the file contents.
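For reference, a minimal sketch of the fixed-record computation Tor
describes, assuming File.length() really does report the exact byte count
(which is precisely the caveat above):

import java.io.File;

public class LineCount {
    // Assumes every line is exactly recordSize bytes plus a fixed-size separator.
    public static long countLines(File file, long recordSize, long separatorSize) {
        return file.length() / (recordSize + separatorSize);
    }
}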
 
Simon

Martin said:
Depends what operating system you're dealing with and how the JVM
implementation gets file size from it. Some operating systems return
block_size * blocks_in_file as the file size, rather than the space
occupied by the file contents.

I wasn't aware of this. This implies that creating a byte buffer with
"new byte[file.length()]" to read the file contents into memory is not a good
idea. Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
fill the array, because the array will always be too large and never too small.

Do you have an example where File.length() does not return the actual filesize?

Cheers,
Simon
 
Martin Gregorie

Simon said:
Martin said:
Depends what operating system you're dealing with and how the JVM
implementation gets file size from it. Some operating systems return
block_size * blocks_in_file as the file size, rather than the space
occupied by the file contents.

I wasn't aware of this. This implies that creating a byte buffer with
"new byte[file.length()]" to read the file contents into memory is not a good
idea.

It's not so bad from that point of view because the buffer can be at
most blocksize-1 (or clustersize-1 for a FAT32 partition) bytes too big.

Simon said:
Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
fill the array, because the array will always be too large and never too small.

That's true, but if the file is a serial file it will normally have an
end marker. Looking for it works, though it's scarcely portable.

Simon said:
Do you have an example where File.length() does not return the actual filesize?

The classic example was from the dark ages before the MS FAT filing
system introduced clustering to get round disk size limitations. Text
files were always terminated with ^Z and reading past that until EOF was
returned picked up all the garbage left over from the last file that
used that block. That's why old DOS programs check for ^Z or EOF!
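A sketch of that old convention: count lines until either EOF or the ^Z
(0x1A) terminator, whichever comes first. The stopping rule is the point
here, not a portable API:

import java.io.FileInputStream;
import java.io.IOException;

public class CtrlZLineCount {
    public static int countLines(String path) throws IOException {
        int lines = 0;
        try (FileInputStream in = new FileInputStream(path)) {
            int b;
            while ((b = in.read()) >= 0 && b != 0x1A) {  // stop at EOF or ^Z
                if (b == '\n') {
                    lines++;
                }
            }
        }
        return lines;
    }
}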

Maybe somebody who knows the innards of current MS NTFS filing systems
can say what they do.

I have a FAT32 filing system I can look at later: right now it's being
backed up from Linux. I can tell you (I looked) that file lengths in
FAT32 filing systems are correctly reported by Linux but I can't
remember what Win95/98/ME does.

If you want to buffer a complete file, the safest way is probably to
append lines to a StringBuffer (or do the equivalent with a growable
byte buffer) and take no notice of File.length() except as information.
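A sketch of that approach with bytes, growing the buffer as data arrives
instead of trusting File.length() (ByteArrayOutputStream stands in for the
"equivalent with bytes"):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class WholeFileReader {
    public static byte[] readWholeFile(String path) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try (FileInputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(chunk)) >= 0) {
                out.write(chunk, 0, n);   // append only what was actually read
            }
        }
        return out.toByteArray();         // sized by the data, not by File.length()
    }
}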

Here are other ways I know to get file lengths that do not match the
amount of data in the file:

- File.length() returns an "unspecified" value if the file is a
directory. To me this says either that data files are scanned to
determine their length or that the OS is asked how long the file is and
its reply is returned without further checks. Either way the value is
most likely platform-dependent.

- In UNIX or Linux all files, including directories, have a length, but
the length of a directory is usually longer than the data it contains
because directories are not sequential files.

- Similarly, you can put gaps in a UNIX/Linux file by doing the
following (sketched in pseudocode; a Java version follows below):
create the file
seek to (n * 1000) - 1
write 1 byte        /* force the file to be n * 1000 bytes long */
seek to 0
write 'n' bytes     /* write at the start of the file */
seek to n * 100     /* leaving a gap of n * 99 bytes */
write 'n' bytes     /* write in the middle of the file */
close the file

Of course, this is exactly what a database manager does.
A directory listing will report the file size as n * 1000 but the last
899 * n bytes will be junk.
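The promised Java version, using the standard RandomAccessFile class, with
setLength() in place of the seek-and-write-one-byte step (file name and
sizes are illustrative; whether the gap is actually stored sparsely on disk
depends on the file system):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SparseFileDemo {
    public static void main(String[] args) throws IOException {
        int n = 10;
        try (RandomAccessFile f = new RandomAccessFile("sparse.dat", "rw")) {
            f.setLength(n * 1000);   // force the file to be n * 1000 bytes long
            f.seek(0);
            f.write(new byte[n]);    // write at the start of the file
            f.seek(n * 100);         // leaving a gap of n * 99 bytes
            f.write(new byte[n]);    // write in the middle of the file
            System.out.println("length = " + f.length());  // prints 10000
        }
    }
}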

The bottom line is that, unless you know for sure that the file was
created with sequential writes *and* that the OS always returns a file
length that's accurate to the exact byte, then doing anything except
reading through the file is deeply suspect.
 
Tor Iver Wilhelmsen

Simon said:
I wasn't aware of this. This implies that creating a byte buffer
with "new byte[file.length()]" to read the file contents into memory
is not a good idea. Even worse, you won't even get an
ArrayIndexOutOfBoundsException when you fill the array, because the
array will always be too large and never too small.

Yes, but of course you ALWAYS!!!!! check the return value of
read(byte[]) to see how many bytes were actually read, so it's
NEVER!!!! an issue if you write your code correctly. :)
 
Simon

Tor said:
Simon said:
I wasn't aware of this. This implies that creating a byte buffer
with "new byte[file.length()]" to read the file contents into memory
is not a good idea. Even worse, you won't even get an
ArrayIndexOutOfBoundsException when you fill the array, because the
array will always be too large and never too small.

Yes, but of course you ALWAYS!!!!! check the return value of
read(byte[]) to see how many bytes were actually read, so it's
NEVER!!!! an issue if you write your code correctly. :)

Yes, of course, but it doesn't help :)
If File.length() returned the length of the actual contents, I could do the
following. Assume I implement a helper method

public static byte[] getFileContents(File file);

that is supposed to do the obvious thing for a method with this
name. In the implementation, I could create a byte array of length
file.length(), initialise the offset into the array to 0, and make repeated
calls to read(byteArray, offset, byteArray.length - offset), incrementing the
offset by the return value while it is >= 0 and stopping when it is -1. This
would all be correct. However, only if File.length() were "correct" (in my
sense) could I assume that byteArray now contains the file's contents and
return it. As it is, I would still have to create a new byte array of length
"offset" and copy the old array into the new one.
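A sketch of the loop Simon describes, including the defensive copy that an
over-reporting File.length() forces at the end (structure and names are
illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class FileUtil {
    public static byte[] getFileContents(File file) throws IOException {
        byte[] byteArray = new byte[(int) file.length()];
        int offset = 0;
        try (FileInputStream in = new FileInputStream(file)) {
            int n;
            // Keep reading until EOF or until the buffer is full.
            while (offset < byteArray.length
                    && (n = in.read(byteArray, offset, byteArray.length - offset)) >= 0) {
                offset += n;
            }
        }
        // If File.length() were exact, byteArray could be returned as-is.
        // Because it may over-report, trim to the bytes actually read.
        return (offset == byteArray.length) ? byteArray : Arrays.copyOf(byteArray, offset);
    }
}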

Cheers,
Simon
 
Chris Uppal

Martin said:
Maybe somebody who knows the innards of current MS NTFS filing systems
can say what they do.

I don't think any version of Windows has ever had any difficulty supplying the
correct size for a file (except maybe for 32-bit limits on integers -- but
that's a different issue).

If you want to buffer a complete file, the safest way is probably to
append bytes or lines to a StringBuffer or to do the equivalent with
bytes and don't take any notice of the File.length() except as
information.

That is probably true, but not so much because the file length may be wrong (I
don't know of any system where it could be, but I don't know much about Java on
small devices or mainframe-ish machines), as because the file size may change
between when you measure it and when you've finished reading.

- Similarly, you can put gaps in a UNIX/Linux file by doing the
following:
create the file
seek to n * 1000 /* force the file to be large */
seek to 0
write 'n' bytes /* write at the start of the file */
seek to n * 100 /* leaving a gap of n * 99 bytes
write 'n' bytes /* write in the middle of the file */
close the file

That doesn't create a file with a length other than what it claims. The size
of the file is precisely as specified -- in this case it might claim there were
10,000 bytes in the file and that is precisely what you'll read from it (in
Java, C, or any other language). It's just that the on-disk representation is
optimised to have some "holes" in it -- but that's not visible or relevant to
the application programmer any more than the fact that a file on Windows may be
stored on-disk in compressed form.

-- chris
 
Martin Gregorie

Chris said:
That doesn't create a file with a length other than what it claims. The size
of the file is precisely as specified -- in this case it might claim there were
10,000 bytes in the file and that is precisely what you'll read from it (in
Java, C, or any other language).

Of course.

Chris said:
It's just that the on-disk representation is
optimised to have some "holes" in it -- but that's not visible or relevant to
the application programmer any more than the fact that a file on Windows may be
stored on-disk in compressed form.

Au contraire. The holes are decidedly relevant if you try to read the
file sequentially without understanding its format or that it may
contain holes.

I've seen this done, not as the artificial example I described, but by
the Sculptor 4GL which can create a file to hold a nominated number of
records. Doing this helps performance by preventing the file extending
and fragmenting as records are added. That DB also created "holes" by
writing zeros to deleted records. Again, big trouble if you don't
understand what you're reading.
 
Chris Uppal

Martin said:
Au contraire. The holes are decidedly relevant if you try to read the
file sequentially without understanding its format or that it may
contain holes.

Yes indeed. If you tar/cpio/gzip/zip up a sparse file then it'll stop being
sparse (barring odd GNU extensions to tar). I think the same applies to cp and
so on. But the only "problem" is that you'll end up with a file where
(potentially large) stretches of nuls are represented on-disk as lots of
nul-bytes -- the /semantics/ of the file are identical, but the compression
has been lost.

And, of course, a clever file copy would preserve those holes -- or would even
introduce them in files which had been created with explicit nul-bytes.

-- chris
 
Martin Gregorie

Chris said:
Yes indeed. If you tar/cpio/gzip/zip up a sparse file then it'll stop being
sparse (barring odd GNU extensions to tar). I think the same applies to cp and
so on. But the only "problem" is that you'll end up with a file where
(potentially large) stretches of nuls are represented on-disk as lots of
nul-bytes -- the /semantics/ of the file are identical, but the compression
has been lost.

And, of course, a clever file copy would preserve those holes -- or would even
introduce them in files which had been created with explicit nul-bytes.

I think we're in agreement - copying or compressing a sparse file should
retain its sparseness while (hopefully) defragmenting the file if it's in
a file system that can have physically fragmented files (MS FAT and OS/9
RBF file systems spring to mind). The same should apply if you use the
value returned by the File.length() method to allocate a buffer and
block the file image into it.

However, we seem to have both strayed somewhat from what I thought the
OP was asking: namely about using the file length as an aid to
extracting the data from a file, which is not a good idea IMO.
 
John W. Kennedy

Martin said:
The classic example was from the dark ages before the MS FAT filing
system introduced clustering to get round disk size limitations. Text
files were always terminated with ^Z and reading past that until EOF was
returned picked up all the garbage left over from the last file that
used that block. That's why old DOS programs check for ^Z or EOF!

That was a CP/M restriction that carried over into DOS 1.0's version of
BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed
it, but by then it was too late.
 
Martin Gregorie

John said:
That was a CP/M restriction that carried over into DOS 1.0's version of
BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed
it, but by then it was too late.

IIRC I also ran into it with flavors of Borland C under DOS 4.2
(shudder) and 5.0.
 
