File.lastModified *extremely* slow ?

Robert Mischke

I'm writing a backup program that needs to check a lot of file sizes
and timestamps. Reading the directory tree itself is reasonably fast:
about 4 seconds for a test tree with 4,000 directories and about
40,000 files.

But when I add code to read the timestamps of the files, it gets
extremely slow: from 4 seconds without to 28 (!) seconds with calling
File.lastModified once per file. The actual class that gets
constructed from each found File object looks like this:


public class SingleFile
{
    public String filename;
    public long timestamp;
    public long size;

    public SingleFile(File fileObject)
    {
        // with this line alone, scanning takes 4 seconds
        filename = fileObject.getName().intern();

        // the critical line which adds about 24 seconds:
        timestamp = fileObject.lastModified();

        // this line adds another 2 seconds (30 sec. total)
        size = fileObject.length();
    }

    //...
}

I made sure that each File object passed to the constructor is
actually a file, not a directory. I'm using the Sun SDK 1.4.2_03
virtual machine.

Has anyone encountered this before? A web search turned up nothing;
any hints and ideas are much appreciated.


Thanks,
Robert
 
Tony Morris

----- Original Message -----
From: "Robert Mischke" <[email protected]>
Newsgroups: comp.lang.java.programmer
Sent: Sunday, August 29, 2004 7:34 PM
Subject: File.lastModified *extremely* slow ?

<snip>

I ran a test on a 2x1GHz (dual) CPU machine with 2.5 GB RAM using Sun J2SE
SDK 1.4.2_04 on Windows 2003 Enterprise Edition.
I removed the (probably erroneous) call to intern() the name of the file -
why do you do this? Is it necessary?
I found that for 100000 (10^5) calls, it took 47ms to call getName(), 6094ms
to call lastModified(), and 8578ms to call length(). These times were
consistent (only slight deviation) for various files over several runs.

I believe that this is reasonable performance and nothing out of the
ordinary.

What exactly are you trying to achieve?


// Test case
import java.io.File;

class X
{
    public static void main(String[] args)
    {
        File f = new File(args[0]);
        long times = Long.parseLong(args[1]);

        long t1 = System.currentTimeMillis();

        for (int i = 0; i < times; i++)
        {
            String filename = f.getName();
        }

        long t2 = System.currentTimeMillis();

        for (int i = 0; i < times; i++)
        {
            long timestamp = f.lastModified();
        }

        long t3 = System.currentTimeMillis();

        for (int i = 0; i < times; i++)
        {
            long size = f.length();
        }

        long t4 = System.currentTimeMillis();

        System.out.println(t2 - t1);   // getName() total, ms
        System.out.println(t3 - t2);   // lastModified() total, ms
        System.out.println(t4 - t3);   // length() total, ms
    }
}


Tony Morris
http://xdweb.net/~dibblego/
 
Chris Uppal

Tony said:
I ran a test on a 2x1GHz (dual) CPU machine with 2.5 GB RAM using Sun J2SE
SDK 1.4.2_04 on Windows 2003 Enterprise Edition.

But your test repeatedly gets the data for the /same/ file, so this will
almost certainly end up testing the OS's caches more than anything else. For
a realistic test you have to loop over lots of different files.

-- chris
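Chris's point can be turned into a test. Here is a sketch (written in 1.4-era Java to match the thread; the root directory is taken from the command line) that stats every file in a real tree instead of hitting one cached entry repeatedly:

```java
import java.io.File;

// Walks a directory tree and stats every regular file, so the OS cannot
// serve every lastModified()/length() call from a single cached entry.
class ScanBench
{
    static long files = 0;
    static long totalSize = 0;

    static void scan(File dir)
    {
        File[] entries = dir.listFiles();
        if (entries == null) return;               // unreadable directory
        for (int i = 0; i < entries.length; i++)
        {
            if (entries[i].isDirectory())
            {
                scan(entries[i]);
            }
            else
            {
                files++;
                entries[i].lastModified();         // the call under test
                totalSize += entries[i].length();
            }
        }
    }

    public static void main(String[] args)
    {
        long t0 = System.currentTimeMillis();
        scan(new File(args[0]));
        long t1 = System.currentTimeMillis();
        System.out.println(files + " files, " + totalSize + " bytes, "
                + (t1 - t0) + " ms");
    }
}
```

Run once to warm the OS caches and once again to see the cached figure; the difference between the two runs is roughly the cost Chris is describing.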
 
Chris Uppal

Robert said:
But when I add code to read the timestamps of the files, it gets
extremely slow: from 4 seconds without to 28 (!) seconds with calling
File.lastModified once per file

This sounds like an OS effect to me. Or, to be more precise, it sounds like
an effect of the design of the filesystem. What OS are you running on?

When the OS scans a directory looking for files, it will (in a typical
implementation) be scanning over a disk block, reading a list of {filename,
location} pairs (plus some housekeeping data). The filename is the name of the
file; the location is the place on disk where the real data for the file is
kept. If the last-modified date is not stored "in-line", as it were, with that
housekeeping data (and I don't think it usually is -- it /can't/ be on UNIX),
then fetching that information will require an extra disk access for each file
that you scan. (A disk seek and a disk read in the worst case, but with luck
the various caches in the OS and disk controller will reduce the impact of
this.)

Assume you are scanning over 40K files. Assume that the worst case happens
and checking the last-modified date requires an extra seek and read for each
of them. Assume an average time of 10 msec for a seek-and-read (just for
example; I don't know what a typical time would be these days, but it'll be
something of that general order). Under those assumptions you can see that
getting the last-modified dates would add approximately 400 seconds to the
execution time! (So it looks as if the OS's caches are working quite well for
you in this case.)

-- chris
 
Jacob

Tony said:
I ran a test on a 2x1GHz (dual) CPU machine with 2.5 GB RAM using Sun J2SE
SDK 1.4.2_04 on Windows 2003 Enterprise Edition.
I removed the (probably erroneous) call to intern() the name of the file -
why do you do this? Is it necessary?
I found that for 100000 (10^5) calls, it took 47ms to call getName(), 6094ms
to call lastModified(), and 8578ms to call length(). These times were
consistent (only slight deviation) for various files over several runs.

I believe that this is reasonable performance and nothing out of the
ordinary.

I ran your test program on a small Linux laptop
with the following results:

1000000 x
getName() - 101ms
lastModified() - 1085ms
length() - 984ms
 
Sudsy

Robert said:
I'm writing a backup program that needs to check a lot of file sizes
and timestamps. Reading the directory tree itself is reasonably fast:
about 4 seconds for a test tree with 4,000 directories and about
40,000 files.

But when I add code to read the timestamps of the files, it gets
extremely slow: from 4 seconds without to 28 (!) seconds with calling
File.lastModified once per file. The actual class that gets
constructed from each found File object looks like this:
<snip>

I was intrigued, so I set up a simple test to walk through a filesystem
and invoke stat(2) on every file. I'm running Linux on an AMD XP1800+
and the filesystem resides on a Maxtor 40MB drive rotating @ 7,200 RPM.
Multiple runs exhibited a time of approximately 1.5 seconds for 44,411
files. That works out to roughly 30 microseconds/file.
But I note that you failed to mention your platform.
 
Chris Uppal

Sudsy said:
I was intrigued, so I set up a simple test to walk through a filesystem
and invoke stat(2) on every file. I'm running Linux on an AMD XP1800+
and the filesystem resides on a Maxtor 40MB drive

Did you really mean only 40MB ? If so then the chances are the whole thing is
cached in OS ram.

Even if not then there's a good chance that the "Multiple runs" will get all
the relevant data into cache -- I'd be less than impressed by an OS that
/failed/ to do so....

-- chris
 
Sudsy

Chris said:
Did you really mean only 40MB ? If so then the chances are the whole thing is
cached in OS ram.

Oops! I meant 40 GB. And I ported the program to Java to get some
comparable numbers. While the results showed frequent (2 runs out
of 10) anomalies, it took about 3.45 seconds to "walk" the filesystem
and retrieve the last modified times.
So we're talking roughly 30 us for native C vs 80 us for Java. You
still retrieve the information in about 3.5 seconds, which leads me
to the conclusion that the OP is running a platform other than
Linux.
Maybe a change is in order? :)
 
Robert Mischke

Tony Morris said:
I removed the (probably erroneous) call to intern() the name of the file -
why do you do this? Is it necessary?

I use it to conserve memory, which might become a bottleneck with more
files. In my test case (40,000 files), using intern() reduced the peak
memory usage from about 20 MB to only 5 MB, at the price of a slight
runtime increase from 4000 to 4200 msec. Sounds like a good trade-off
to me :)

I was a little surprised myself at how big the saving was (as I used a
"real-world" directory tree, not some dummy files with identical
names). So far, I see no reason to not use intern(). In my
application, it's perfectly ok that the String objects are shared.
 
Robert Mischke

Chris Uppal said:
This sounds like an OS effect to me. Or, to be more precise, it sounds like
an effect of the design of the filesystem. What OS are you running on?

You're right - I should have mentioned that. It's Windows 2000, and
the data was read from a FAT32 partition.
When the OS scans a directory looking for files, it will (in a typical
implementation) be scanning over a disk block, reading a list of {filename,
location} pairs (plus some housekeeping data). The filename is the name of the
file; the location is the place on disk where the real data for the file is
kept. If the last-modified date is not stored "in-line", as it were, with that
housekeeping data (and I don't think it usually is -- it /can't/ be on UNIX),
then fetching that information will require an extra disk access for each file
that you scan. (A disk seek and a disk read in the worst case, but with luck
the various caches in the OS and disk controller will reduce the impact of
this.)

I see what you mean, but AFAIK, FAT32 stores the modification date in
the meta data. A hint that the problem is Java-specific is that a
similar Delphi application (the one I'm porting to Java) needs about
4-5 seconds to read everything - including the file sizes and
timestamps. When scanning for files in Delphi, you always get a data
structure called TWin32FindData, which contains all the info.

As you wrote, the cause might be the overhead of actually *searching*
for each file, as opposed to reading a whole directory (with
File.listFiles()) in the order the entries are stored on the disk. But then
I'm wondering: shouldn't the Java API provide a way to perform
something like "get all available info about this directory" for cases
like this? I'm thinking of something like listFiles(), but with
extended File objects that store/cache this data as they receive it.

The problem is that this level of performance is a show-stopper
for my application. I was prepared to take some runtime penalty
compared to native I/O programming, but an increase of >500% is just
too much... and I would rather not resort to platform-specific JNI
code.

Robert
 
Robert Mischke

Sudsy said:
So we're talking roughly 30 us for native C vs 80 us for Java. You
still retrieve the information in about 3.5 seconds which leads me
to the conclusion that the OP is running a platform other than
Linux.

Well done, Sherlock :) Win2000/FAT32, as written above.

Maybe a change is in order? :)

Well... to prepare for such a change, I'm porting my applications from
Delphi to Java in the first place ;)

(Ok, the fact that I *like* to code in Java might play a role, too.)


Robert
 
Chris Uppal

Robert said:
I see what you mean, but AFAIK, FAT32 stores the modification date in
the meta data. A hint that the problem is Java-specific is that a
similar Delphi application (the one I'm porting to Java) needs about
4-5 seconds to read everything - including the file sizes and
timestamps. When scanning for files in Delphi, you always get a data
structure called TWin32FindData, which contains all the info.
OK.


As you wrote, the cause might be the overhead of actually *searching*
for each file, as opposed to reading a whole directory (with
File.listFiles()) in the order the entries are stored on the disk. But then
I'm wondering: shouldn't the Java API provide a way to perform
something like "get all available info about this directory" for cases
like this? I'm thinking of something like listFiles(), but with
extended File objects that store/cache this data as they receive it.

It does sound reasonable when you look at it from a Windows programmer's
perspective. Unfortunately it looks like pointless complication from a UNIX
programmer's perspective, since there is no underlying API that it maps onto,
and it would seem that Sun's Java programmers think like UNIXers ;-)

I can't really offer any help, I'm afraid. It looks as if you are stuck with
it.

The problem is that this kind of performance is kind of a show-stopper
for my application.

Is it really a show stopper ? I can see that if the time had grown from (say)
around 1/2 second to around 1/2 a minute then that might make an application
completely unusable. But the time was already too long (at 4 seconds) to keep
a user waiting without the usual paraphernalia of progress bars, wait cursors,
etc, so can't they just wait longer ? (Assuming that you don't want to, or
can't, redesign your program so that it doesn't need all the timestamps
up-front.)

However...
I was prepared to take some runtime penalty,
compared to native i/o programming, but an increase of >500% is just
too much... and I would rather not resort to platform-specific JNI
code.

If the main problem only occurs on Windows (as Sudsy's post indicates that it
isn't a problem in Linux -- although I'm not sure that I'd take one test as
representative), then the non-cross-platform nature of JNI becomes much less of
a problem: use JNI on Windows and "normal" Java everywhere else.

-- chris
 
Mark Thornton

Robert said:
You're right - I should have mentioned that. It's Windows 2000, and
the data was read from a FAT32 partition.
<snip>

The much delayed JSR203 (http://www.jcp.org/en/jsr/detail?id=203) might
provide a solution when it is eventually completed. Until then using
NTFS instead of FAT32 might help (the directory is sorted by name, so
locating an entry in a large directory is faster). You could also try
using JNI.

Mark Thornton
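The much-delayed JSR 203 did eventually ship, as the java.nio.file package in Java 7. For readers on a modern JDK, a single-pass tree walk that receives each file's size and timestamp together with its directory entry looks roughly like this sketch:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

class NioScan {
    // Walk the tree once; visitFile() receives the attributes that were
    // read while the directory itself was scanned -- no extra stat call
    // per file, which is exactly what the thread is asking for.
    static long countFiles(Path root) throws IOException {
        final long[] n = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                n[0]++;
                // size and timestamp come for free with the attrs object:
                // attrs.size(), attrs.lastModifiedTime()
                return FileVisitResult.CONTINUE;
            }
        });
        return n[0];
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countFiles(Paths.get(args[0])) + " files");
    }
}
```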
 
John C. Bollinger

Robert said:
I use it to conserve memory, which might become a bottleneck with more
files. In my test case (40,000 files), using intern() reduced the peak
memory usage from about 20 MB to only 5 MB, at the price of a slight
runtime increase from 4000 to 4200 msec. Sounds like a good trade-off
to me :)

If memory is a potential bottleneck then you may be shooting yourself in
the foot. When you intern() a String, you ensure that a copy of that
String is retained in the VM for the remainder of the VM's lifetime; for
40000 Strings that's probably not too bad, but a real filesystem may
have many, many more files.
I was a little surprised myself at how big the saving was (as I used a
"real-world" directory tree, not some dummy files with identical
names). So far, I see no reason to not use intern(). In my
application, it's perfectly ok that the String objects are shared.

The only time it shouldn't be OK to share String objects is when it is
necessary for an == comparison to fail for two Strings containing the same
sequence of characters. That is rare. The main purpose of intern() is
precisely to make such comparisons succeed, in view of the fact that an
== comparison is much faster than an equals() evaluation.

If your peak memory usage is so much higher without intern() then that
likely signals a problem in the part of your program that you spelled
"//...". E.g. you may be unnecessarily creating duplicates of the
filename Strings. Any way around, though, having a heap full of Strings
that are eligible for GC is a lot better than having a heap full that
are not eligible and never will be.


John Bollinger
(e-mail address removed)
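One way to get intern()-style sharing without the VM-lifetime pool John warns about (my sketch, with a hypothetical class name -- not something spelled out in the thread) is a scan-private dedup map. Duplicates share one String instance, but the whole pool, and every string in it, becomes eligible for GC the moment the scan drops its reference:

```java
import java.util.HashMap;
import java.util.Map;

// Process-private string pool: canonical() returns one shared instance
// per distinct character sequence, like intern(), but the pool itself is
// ordinary garbage-collectable heap data.
class NamePool {
    private final Map<String, String> pool = new HashMap<String, String>();

    public String canonical(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;     // reuse the stored copy, drop the duplicate
        }
        pool.put(s, s);          // first sighting becomes the canonical copy
        return s;
    }
}
```

In SingleFile's constructor this would replace `fileObject.getName().intern()` with `pool.canonical(fileObject.getName())` for some pool passed in by the scanner.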
 
John C. Bollinger

Robert said:
I'm writing a backup program that needs to check a lot of file sizes
and timestamps. Reading the directory tree itself is reasonably fast:
about 4 seconds for a test tree with 4,000 directories and about
40,000 files.

But when I add code to read the timestamps of the files, it gets
extremely slow: from 4 seconds without to 28 (!) seconds with calling
File.lastModified once per file. The actual class that gets

[...]

I can't explain your observations, but I ran my own test just now (Java
1.4.2 / Win2000 SP4 / Athlon (classic) 1.4 GHz / 25 GB FAT32 filesystem
/ 7200 RPM disk), and my results don't match up very well with yours. I
created a simple FS scanner that recursively traverses a directory tree.
In phase 1 of the test the scanner invokes getName() on each regular
file; in phase 2 (separate test runs) it invokes both getName() and
lastModified(). I observed a big speedup (ca. 2.5x) from the first run
to all subsequent ones, presumably because of the OS caches being
loaded, so that run was excluded from the statistics.

I found that for 101208 files in 4882 directories, the scan took an
average of 12.9 seconds without lastModified(), and 15.6 seconds with
lastModified(), for a difference of roughly 21%. [Note that JVM startup
time is _not_ factored out of those numbers.]

I am left to conclude that the lastModified() method is not the root of
your problem.


John Bollinger
(e-mail address removed)
 
DrAcKe

The first thing that comes to my mind after reading this post is that
your Windows partition may need a "defrag".

By3z, DrAcKe

PS Sorry for my bad english
 
John C. Bollinger

Michael said:

That's an interesting article; thanks for pointing it out. The
article's premise is slightly flawed, but the code (the first example,
at least) works. I will therefore have to retrench, and say that when
you intern() a String you *may* ensure that a copy is retained for the
remainder of the VM's lifetime. Whether or not that is in fact
ensured depends on the String implementation [not on the VM
implementation, as the JW article's author mistakenly implied].


John Bollinger
(e-mail address removed)
 
Mark Thornton

John said:
That's an interesting article; thanks for pointing it out. The
article's premise is slightly flawed, but the code (the first example,
at least) works. I will therefore have to retrench, and say that when
you intern() a String you *may* ensure that a copy is retained for the
remainder of the VM's lifetime. Whether or not that is in fact
ensured depends on the String implementation [not on the VM
implementation, as the JW article's author mistakenly implied].

JVMs do not have to follow the Java source code when implementing
standard classes and methods. For example, the java.lang.Math class
appears to delegate to StrictMath, but many JVMs will replace that
implementation with (much faster) intrinsic methods.

Mark Thornton
 
Robert Mischke

John C. Bollinger said:
If memory is a potential bottleneck then you may be shooting yourself in
the foot. When you intern() a String, you ensure that a copy of that
String is retained in the VM for the remainder of the VM's lifetime; for
40000 Strings that's probably not too bad, but a real filesystem may
have many, many more files.

I read conflicting info on this point, so I ran some tests myself:
interning a lot of strings which have some overlap (i.e. duplicates),
and then releasing all external references to them. With the standard
Sun JVM, they were garbage collected just fine. IIRC, there's a class
called WeakHashMap in the API, which might just be how it works
internally.

But thanks for pointing it out, even if seems not to be the case here.

Robert
 
