Simon said:
Martin said:
Depends what operating system you're dealing with and how the JVM
implementation gets file size from it. Some operating systems return
block_size * blocks_in_file as the file size, rather than the space
occupied by the file contents.
I wasn't aware of this. This implies that creating a byte buffer with
"new byte[file.length()]" to read the file contents into memory is not a good
idea.
It's not so bad from that point of view because the buffer can be at
most blocksize-1 (or clustersize-1 for a FAT32 partition) bytes too big.
Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
fill the array, because the array will always be too large and never too small.
That's true, but if the file is a serial file it will normally have an
end marker. Looking for it works, though it's scarcely portable.
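For the oversized-buffer case discussed above, a minimal Java sketch might look like the following. It treats the reported length purely as a hint and trusts only the byte counts returned by read(). The class and method names (HintedReader, readWithHint) are made up for illustration, and the code assumes, as stated in this thread, that the reported length is never smaller than the real contents.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class HintedReader {
    /**
     * Reads up to reportedLength bytes and returns only the bytes that were
     * actually read. Hypothetical helper; reportedLength would typically
     * come from File.length(), which returns a long.
     */
    static byte[] readWithHint(InputStream in, long reportedLength) throws IOException {
        if (reportedLength > Integer.MAX_VALUE - 8) {
            throw new IOException("file too large for a single array");
        }
        byte[] buf = new byte[(int) reportedLength];
        int filled = 0;
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n == -1) {
                break; // real end of data, possibly before the buffer is full
            }
            filled += n;
        }
        // Assumption from the thread: the hint is never too small, so any
        // data beyond buf.length is not looked for here.
        return Arrays.copyOf(buf, filled);
    }
}
```

The returned array has the length of the data actually read, so any slack at the end of the original buffer is dropped rather than silently treated as file contents.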
Do you have an example where File.length() does not return the actual filesize?
The classic example was from the dark ages before the MS FAT filing
system introduced clustering to get round disk size limitations. Text
files were always terminated with ^Z and reading past that until EOF was
returned picked up all the garbage left over from the last file that
used that block. That's why old DOS programs check for ^Z or EOF!
Maybe somebody who knows the innards of current MS NTFS filing systems
can say what they do.
I have a FAT32 filing system I can look at later: right now it's being
backed up from Linux. I can tell you (I looked) that file lengths in
FAT32 filing systems are correctly reported by Linux but I can't
remember what Win95/98/ME does.
If you want to buffer a complete file, the safest way is probably to
append lines to a StringBuffer, or to do the equivalent with bytes, and
to treat File.length() as nothing more than information.
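As a rough sketch of "the equivalent with bytes", something like the following could be used. It never consults File.length() and simply reads until read() reports end of stream. ByteArrayOutputStream is my choice for the growing buffer, and the names WholeFileReader and readFully are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class WholeFileReader {
    /** Reads everything from the stream until end of stream is reported. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // keep only the bytes actually read
        }
        return out.toByteArray();
    }

    /** Convenience overload for a file on disk. */
    static byte[] readFully(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return readFully(in);
        }
    }
}
```

The length of the returned array is then the number of bytes the OS actually delivered, whatever File.length() would have said.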
Here are other ways I know to get file lengths that do not match the
amount of data in the file:
- File.length() returns an "unspecified" value if the file is a
directory. To me this says either that data files are scanned to
determine their length or that the OS is asked how long the file is and
its reply is returned without further checks. Either way the value is
most likely platform-dependent.
- In UNIX or Linux all files, including directories, have a length, but
the length of a directory is usually longer than the data it contains
because directories are not sequential files.
- Similarly, you can put gaps in a UNIX/Linux file by doing the
following:
    create the file
    set the file length to n * 1000  /* force the file to be large */
    seek to 0
    write 'n' bytes                  /* write at the start of the file */
    seek to n * 100                  /* leaving a gap of n * 99 bytes */
    write 'n' bytes                  /* write in the middle of the file */
    close the file
Of course, this is exactly what a database manager does.
A directory listing will report the file size as n * 1000 but the last
899 * n bytes hold nothing the program wrote (on UNIX/Linux such gaps
read back as zeros).
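The gap-file steps above can be sketched in Java roughly as follows. This is only an illustration: the class and method names (GapFileDemo, writeWithGaps) are made up, and RandomAccessFile.setLength() is my choice for forcing the file size. Note that the Java documentation leaves the contents of the extended portion undefined.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

class GapFileDemo {
    /**
     * Writes n bytes at offset 0 and n bytes at offset n * 100, then forces
     * the file length to n * 1000. Returns the number of bytes of real data
     * written, which is 2 * n.
     */
    static long writeWithGaps(File file, int n) throws IOException {
        byte[] payload = new byte[n];
        Arrays.fill(payload, (byte) 'x');
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0);                 // start from an empty file
            raf.write(payload);               // data at the start
            raf.seek((long) n * 100);         // skip ahead, leaving a gap
            raf.write(payload);               // data in the middle
            raf.setLength((long) n * 1000);   // force the file to be large
        }
        return 2L * n;
    }
}
```

After calling writeWithGaps(file, 10), file.length() reports 10000 although only 20 bytes were ever written, which is the mismatch described above.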
The bottom line is that, unless you know for sure that the file was
created with sequential writes *and* that the OS always returns a file
length that's accurate to the exact byte, then doing anything except
reading through the file is deeply suspect.