The first 10 files

Wojtek

Using:

int max = 10;
int count = 0;

for (File thisFile : aDir.listFiles())
{
    doSomething(thisFile);

    if (++count >= max)
        break;
}

gives me the first ten files in aDir. But if aDir contains 30K files,
then the listFiles() will run for a long time as it builds an array for
the 30K files.

Is there a way to have Java only get the first "max" files?
 
Roedy Green

Is there a way to have Java only get the first "max" files?

Not without JNI.
--
Roedy Green Canadian Mind Products http://mindprod.com
The first 90% of the code accounts for the first 90% of the development time.
The remaining 10% of the code accounts for the other 90% of the development
time.
~ Tom Cargill Ninety-ninety Law
 
John B. Matthews

Wojtek <[email protected]> said:
Using:

int max = 10;
int count = 0;

for (File thisFile : aDir.listFiles())
{
    doSomething(thisFile);

    if (++count >= max)
        break;
}

gives me the first ten files in aDir. But if aDir contains 30K files,
then the listFiles() will run for a long time as it builds an array
for the 30K files.

Is there a way to have Java only get the first "max" files?

In a GUI context, one approach uses a SwingWorker to query the file
system in the background and update a TableModel in the worker's
process() method. A complete example is examined here:

<http://codereview.stackexchange.com/q/4446/6692>
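For reference, a minimal sketch of that idea (not the code from the
linked example; the class and variable names here are illustrative,
and it assumes Java 7's NIO.2 is available): the worker enumerates
the directory lazily off the Event Dispatch Thread, and process()
appends rows to the model on the EDT.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import javax.swing.*;
import javax.swing.table.DefaultTableModel;

public class FileTableWorker {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(new Runnable() {
            @Override
            public void run() {
                final DefaultTableModel model =
                        new DefaultTableModel(new Object[] { "Name" }, 0);
                JFrame f = new JFrame("Files");
                f.add(new JScrollPane(new JTable(model)));
                f.setSize(400, 300);
                f.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
                f.setVisible(true);

                new SwingWorker<Void, Path>() {
                    @Override
                    protected Void doInBackground() throws IOException {
                        // Enumerate lazily off the EDT; publish() feeds process()
                        try (DirectoryStream<Path> stream =
                                Files.newDirectoryStream(Paths.get("."))) {
                            for (Path p : stream) {
                                publish(p);
                            }
                        }
                        return null;
                    }

                    @Override
                    protected void process(List<Path> chunk) {
                        // Runs on the EDT, so it is safe to touch the model
                        for (Path p : chunk) {
                            model.addRow(new Object[] { p.getFileName() });
                        }
                    }
                }.execute();
            }
        });
    }
}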

Although it may be beyond your control, you should also critically
assess a design having tens of thousands of files in a single
directory.
 
Arved Sandstrom

Using:

int max = 10;
int count = 0;

for (File thisFile : aDir.listFiles())
{
    doSomething(thisFile);

    if (++count >= max)
        break;
}

gives me the first ten files in aDir. But if aDir contains 30K files,
then the listFiles() will run for a long time as it builds an array for
the 30K files.

Is there a way to have Java only get the first "max" files?
One way of doing it, which you can find by Googling but should occur to
you if you read the File Javadocs carefully, is below.

I've run this in an IDE with the working directory set to where I
touched a few hundred files.

The files returned will not be in any order; OTOH you didn't indicate
what you meant by "first".

AHS

-------------------------
package org.ahs.files;

import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;

public class ShortFileList {

    final int maxFiles = 30;

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: ShortFileList <limit>");
            System.exit(1);
        }
        // let NFE throw
        int limit = Integer.parseInt(args[0]);

        File testDir = new File(".");
        File[] files = testDir.listFiles(new MyFileFilter(limit));
        System.out.println(files.length);
        System.out.println(Arrays.asList(files));
    }

    static class MyFileFilter implements FileFilter {

        int maxFiles;

        public MyFileFilter(int maxFiles) {
            this.maxFiles = maxFiles;
        }

        @Override
        public boolean accept(File pathname) {
            return maxFiles-- > 0;
        }
    }
}
 
Arved Sandstrom

One way of doing it, which you can find by Googling but should occur to
you if you read the File Javadocs carefully, is below.

I've run this in an IDE with the working directory set to where I
touched a few hundred files.

The files returned will not be in any order; OTOH you didn't indicate
what you meant by "first".

AHS

-------------------------
package org.ahs.files;

import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;

public class ShortFileList {

    final int maxFiles = 30;

Ignore this variable; it is left over from an earlier experimental version.
 
Arne Vajhøj

One way of doing it, which you can find by Googling but should occur to
you if you read the File Javadocs carefully, is below.
int limit = Integer.parseInt(args[0]);
File testDir = new File(".");
File[] files = testDir.listFiles(new MyFileFilter(limit));

static class MyFileFilter implements FileFilter {

    int maxFiles;

    public MyFileFilter(int maxFiles) {
        this.maxFiles = maxFiles;
    }

    @Override
    public boolean accept(File pathname) {
        return maxFiles-- > 0;
    }
}

If the problem is as described by the OP, then that must be the
correct solution.

"will run for a long time as it builds an array for the 30K files"

does not happen with this solution.

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Arne
 
Robert Klemme

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Kind regards

robert
 
Arne Vajhøj

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

A String[] with 30K elements should be blazing fast
compared to anything that hits a disk.

And it would read all filenames and call the filter for
each of them.

Arne
 
Eric Sosman

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...
 
Jim Janney

Wojtek said:
Using:

int max = 10;
int count = 0;

for (File thisFile : aDir.listFiles())
{
    doSomething(thisFile);

    if (++count >= max)
        break;
}

gives me the first ten files in aDir. But if aDir contains 30K files,
then the listFiles() will run for a long time as it builds an array
for the 30K files.

Is there a way to have Java only get the first "max" files?

As Roedy says, in pure Java there's no way to avoid reading the entire
directory, whether it builds the entire array or not. And if you want
them in any particular order it's necessary to read them all and sort
them anyway.
 
Wojtek

John B. Matthews wrote:
Although it may be beyond your control, you should also critically
assess a design having tens of thousands of files in a single
directory.

Well of course.

The directory holds files which are uploaded by external events. If
there are a lot of events between application runs, then the number of
files can indeed reach large numbers.

Since this is happening on a server, and you can potentially have many
hundreds of people accessing at the same time (each with their own
directory), I was hoping to be able to "stage" file processing.

The:

public boolean accept(File pathname) {
    return maxFiles-- > 0;
}

in FileFilter is interesting, but the file system nevertheless still
runs through the entire directory. Maybe FileFilter needs:

public boolean abort(File pathname);

Hmm, maybe I need a timed background process to move files to "holding"
directories which will be limited to a small number of files.
 
Jim Janney

Wojtek said:
John B. Matthews wrote:

Well of course.

The directory holds files which are uploaded by external events. If
there are a lot of events between application runs, then the number of
files can indeed reach large numbers.

Since this is happening on a server, and you can potentially have many
hundreds of people accessing at the same time (each with their own
directory), I was hoping to be able to "stage" file processing.

The:

public boolean accept(File pathname) {
    return maxFiles-- > 0;
}

in FileFilter is interesting, but the file system nevertheless still
runs through the entire directory. Maybe FileFilter needs:

public boolean abort(File pathname);

Hmm, maybe I need a timed background process to move files to
"holding" directories which will be limited to a small number of
files.

You could run a command in a subprocess. On a Unix system

ls -U | head -n 10

should run quickly (-U tells it not to sort). Not sure how to do that
on Windows.
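
For what it's worth, a minimal sketch of wiring that up from Java.
This is Unix-only and assumes an sh on the PATH; the class name and
the hard-coded count are illustrative.

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public class FirstFilesViaLs {
    public static void main(String[] args)
            throws IOException, InterruptedException {
        // "ls -U" skips sorting; "head" ends the pipeline after ten lines
        ProcessBuilder pb =
                new ProcessBuilder("sh", "-c", "ls -U | head -n 10");
        pb.directory(new File("."));  // the directory to list
        pb.redirectErrorStream(true);
        Process p = pb.start();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line);  // one file name per line
        }
        p.waitFor();
    }
}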
 
Eric Sosman

On 26.01.2013 19:26, Arne Vajhøj wrote:

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...

Will it?

Necessarily. As far as listFiles() knows, the FileFilter
might accept the very last File object given to it. Therefore,
listFiles() cannot fail to present that very last File -- and
every other File -- for inspection.
It is plausible that the implementation of listFiles() uses an OS API that
enumerates files one at a time. On Windows, getting the first file of the
enumeration is faster than asking for all the files at once.
Meh.

Indeed, I suppose one could throw an exception from the FileFilter accept()
method to interrupt enumeration, if that's how listFiles() is implemented.
That would avoid the need to enumerate more than the needed number of
actual files.

It would also avoid the burden of returning anything from
listFiles() -- like, say, the array of accepted files ...

A seriously hackish approach might be to do the processing
of the files within the FileFilter itself, treating it as a
"visit this File" callback instead of as a predicate. Then if
the FileFilter threw an exception after processing the first N
files -- well, they'd already have been processed, and you were
going to ignore the listFiles() return value anyhow, so ...
But, as I said, that's pretty seriously hackish.
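
For concreteness, a sketch of that hack (hypothetical names
throughout; whether the throw actually saves any work depends
entirely on how listFiles() is implemented, which is the very
question at issue):

import java.io.File;
import java.io.FileFilter;

public class HackishFirstN {

    // Unchecked exception used purely as a "stop enumerating" signal
    static class StopEnumeration extends RuntimeException {}

    public static void main(String[] args) {
        File aDir = new File(".");
        try {
            aDir.listFiles(new FileFilter() {
                private int remaining = 10;

                @Override
                public boolean accept(File pathname) {
                    System.out.println(pathname.getName()); // the real work happens here
                    if (--remaining <= 0) {
                        throw new StopEnumeration(); // abandon listFiles() mid-scan
                    }
                    return false; // the result array is ignored anyway
                }
            });
        } catch (StopEnumeration expected) {
            // At most ten files were processed before the scan was cut short.
        }
    }
}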
Of course, this is all implementation-dependent and, since it's not
explicitly documented, it could change at any time anyway.

The performance implications of retrieving information on 30K
files from the O/S are undocumented, true. But the necessity of
retrieving that information is deducible from what *is* documented.
But unless you've
actually examined the implementation details for listFiles(), it's not a
foregone conclusion that the technique of using a FileFilter offers no way
to improve latency.

Maybe this is the disconnect: I understood the O.P.'s concern as
"It's doing three thousand times too much work," not as "It takes
three thousand times as long as it should just to get to the first
File instance." Either way, though, I think a FileFilter (used in a
non-hackish way) cannot reduce either the total work or the latency.
Observe that listFiles() cannot return anything at all until it has
built the entire array of accepted files; Java's arrays have no way
to say "I hold five elements now, but might grow."
All that said, I think John Matthews' comment about the question of what
30K files are doing in a single directory in the first place is perhaps one
of the more useful points in this topic. One doesn't always have control
over that, of course...but if one does, it's certainly worth rethinking
that aspect of the design. There are reasons other than code latency to
avoid so many files in a single directory.

Yeah. The O.P. said something about external processes dumping
files into the directory, possibly dumping many between (widely-
spaced?) executions of his program. That seems odd to me, though,
because if there's a backlog of thirty thousand it seems odd to want
to reduce it by only ten ...

If he's stuck with this overall design, though, I think the
walkFileTree() method of java.nio.file.Files would be a cleaner way
to proceed. His FileVisitor could return FileVisitResult.TERMINATE
after it had seen ten files, and that would be that. No hacks.
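
Something like this minimal sketch (names illustrative); a maxDepth
of 1 keeps the walk to the one directory, and TERMINATE ends it:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.EnumSet;

public class FirstTenVisitor {
    public static void main(String[] args) throws IOException {
        // maxDepth 1 restricts the walk to the starting directory itself
        Files.walkFileTree(Paths.get("."),
                EnumSet.noneOf(FileVisitOption.class), 1,
                new SimpleFileVisitor<Path>() {
                    private int seen = 0;

                    @Override
                    public FileVisitResult visitFile(Path file,
                            BasicFileAttributes attrs) {
                        System.out.println(file.getFileName()); // stand-in for real work
                        return ++seen >= 10
                                ? FileVisitResult.TERMINATE // and that would be that
                                : FileVisitResult.CONTINUE;
                    }
                });
    }
}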
 
Arved Sandstrom

On 1/26/2013 4:15 PM, Robert Klemme wrote:
On 26.01.2013 19:26, Arne Vajhøj wrote:

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...

Will it?

Necessarily. As far as listFiles() knows, the FileFilter
might accept the very last File object given to it. Therefore,
listFiles() cannot fail to present that very last File -- and
every other File -- for inspection.
[ SNIP ]

I'd have to agree. A simple test shows this to be the case, but your
reasoning precludes having to run such a test in the first place.

My code "gets' the first N files from listFiles(), for some definition
of "first", but it certainly doesn't only get N files from the OS.

Based on Wojtek's later post, I'd be examining the entire problem in
more detail before arriving at a decent solution. I don't think most of
the problem pertaining to offering reasonable batches of files to a Java
program for processing is something that I'd address in Java anyway.

AHS
 
Arne Vajhøj

John B. Matthews wrote:

Well of course.

The directory holds files which are uploaded by external events. If
there are a lot of events between application runs, then the number of
files can indeed reach large numbers.

Since this is happening on a server, and you can potentially have many
hundreds of people accessing at the same time (each with their own
directory), I was hoping to be able to "stage" file processing.

No matter how and why these files end up there, you should consider
spreading them out over multiple directories.
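
A minimal sketch of one way to do that, assuming Java 7's NIO.2 and a
mover that runs between application runs ("inbox" and "spread" are
hypothetical paths, and the 256-bucket hash is just one possible
scheme):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SpreadFiles {
    public static void main(String[] args) throws IOException {
        Path inbox = Paths.get("inbox");  // flat directory being filled
        Path root = Paths.get("spread");  // bucketed layout

        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inbox)) {
            for (Path file : stream) {
                // Derive a bucket from the name so the mapping is stable
                String name = file.getFileName().toString();
                String bucket = String.format("%02x", name.hashCode() & 0xFF);
                Path dir = root.resolve(bucket);
                Files.createDirectories(dir);
                Files.move(file, dir.resolve(name),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}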

Arne
 
Arne Vajhøj

On 26.01.2013 19:26, Arne Vajhøj wrote:

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...

Will it?

It is plausible that the implementation of listFiles() uses an OS API that
enumerates files one at a time. On Windows, getting the first file of the
enumeration is faster than asking for all the files at once.

Indeed, I suppose one could throw an exception from the FileFilter accept()
method to interrupt enumeration, if that's how listFiles() is implemented.
That would avoid the need to enumerate more than the needed number of
actual files.

Of course, this is all implementation-dependent and, since it's not
explicitly documented, it could change at any time anyway. But unless you've
actually examined the implementation details for listFiles(), it's not a
foregone conclusion that the technique of using a FileFilter offers no way
to improve latency.

It is a foregone conclusion that the posted code that Eric commented
on would read all files, because it did not throw an exception.

Code with a different logic could behave differently.

Arne
 
Arne Vajhøj

[...]
Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...

Will it?

Necessarily. As far as listFiles() knows, the FileFilter
might accept the very last File object given to it. Therefore,
listFiles() cannot fail to present that very last File -- and
every other File -- for inspection.

Except in the way I already noted, you mean.

Except if the code was different from the code he was
commenting on.
[...]
Indeed, I suppose one could throw an exception from the FileFilter accept()
method to interrupt enumeration, if that's how listFiles() is implemented.
That would avoid the need to enumerate more than the needed number of
actual files.

It would also avoid the burden of returning anything from
listFiles() -- like, say, the array of accepted files ...

As you've already agreed, it is possible for the FileFilter implementation
to store the results itself, obviating any need for the listFiles() method
to return successfully.

If it works (which is not assured...it depends on how listFiles() is
implemented in the first place), then yes, maybe it's a bit of a kludge.
But it's an easier, more portable kludge than writing some JNI-based
component and would in fact get the job done.

Sometimes, when the library you're using doesn't provide exactly the
features you need, you wind up with a kludge. Oh well...shit happens.

If JNI is used then at least it is straightforward logic.

Utilizing Java library classes in a way that they were not intended
to be used, based on an assumption about the underlying
implementation, is not straightforward logic.
I'm not saying it's a great solution. But it's a far cry from a conclusion
that it simply cannot be done with the Java API as it exists now.

There is no way in the Java API to ensure that it will be done.

We just find it likely that the implementation will work
that way.

Arne
 
Arne Vajhøj

On Sat, 26 Jan 2013 17:06:07 -0500, Eric Sosman wrote:

On 1/26/2013 4:15 PM, Robert Klemme wrote:
On 26.01.2013 19:26, Arne Vajhøj wrote:

But I am a bit skeptical about whether a String[] with 30K elements
is really the bottleneck.

If the real bottleneck is the OS calls to get next file, then
a filter like this will not help.

Why?

Because the listFiles() method will fetch the information
for all 30K files from the O/S, will construct 30K File objects
to represent them, and will submit all 30K File objects to the
FileFilter, one by one. The FileFilter will (very quickly)
reject 29.99K of the 30K Files, but ...

Will it?

Necessarily. As far as listFiles() knows, the FileFilter
might accept the very last File object given to it. Therefore,
listFiles() cannot fail to present that very last File -- and
every other File -- for inspection.
[ SNIP ]

I'd have to agree. A simple test shows this to be the case, but your
reasoning precludes having to run such a test in the first place.

My code "gets' the first N files from listFiles(), for some definition
of "first", but it certainly doesn't only get N files from the OS.

Based on Wojtek's later post, I'd be examining the entire problem in
more detail before arriving at a decent solution. I don't think most of
the problem pertaining to offering reasonable batches of files to a Java
program for processing is something that I'd address in Java anyway.

If the OP happens to be on Java 7, then I would suggest using:

java.nio.file.Files.newDirectoryStream(dir)

It is a straightforward way of getting the first N files.

And it is as likely as the exception hack not to read
all filenames from the OS.

Arne
 
Knute Johnson

Using:

int max = 10;
int count = 0;

for (File thisFile : aDir.listFiles())
{
    doSomething(thisFile);

    if (++count >= max)
        break;
}

gives me the first ten files in aDir. But if aDir contains 30K files,
then the listFiles() will run for a long time as it builds an array for
the 30K files.

Is there a way to have Java only get the first "max" files?

import java.io.*;
import java.nio.file.*;

public class FileSystemsTest {
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        Path dir = FileSystems.getDefault().getPath(".");
        int i = 10;
        // DirectoryStream is lazy: entries are read as the iterator advances
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path path : stream) {
                System.out.println(path.getFileName());
                if (--i <= 0)
                    break;
            }
        }
        long stop = System.currentTimeMillis();
        System.out.println(stop - start);
    }
}

30003 files in the directory, almost 1.7GB of files, Windows XP, Java 7,
and it takes 16 ms to run. Somebody else should try this out on their
computer to see if it runs as fast.

..
..
..
01/26/2013 05:46 PM 58,890 9998.txt
01/26/2013 05:46 PM 58,890 9999.txt
01/26/2013 06:31 PM 1,316 FileSystemsTest.class
01/26/2013 06:29 PM 636 FileSystemsTest.java
01/26/2013 05:44 PM 650 MakeFiles.java
30003 File(s) 1,766,702,602 bytes
2 Dir(s) 49,387,085,824 bytes free

C:\Documents and Settings\Knute Johnson\bigdirectory>java FileSystemsTest
0.txt
1.txt
10.txt
100.txt
1000.txt
10000.txt
10001.txt
10002.txt
10003.txt
10004.txt
16

C:\Documents and Settings\Knute Johnson\bigdirectory>
 
