Stuck on a caching issue.

  • Thread starter Inertia_sublimation
  • Start date
I

Inertia_sublimation

Hello everyone. Im developing a program the requires heavy usage of
recursive folder size computation, and need to speed it up immensely.

I thought the use of a cache to store the result of recursively
getting the size of a folder would be ideal, but am having a problem
with it: Its not working.

The size is recomputed every time the sizeRecursive() method is called
on a CachedDirectory.

Also, please forgive the amount of code I'm posting, I think all of it
is pertinant though.

Note the comments in DirectoryCache.CachedDirectory.sizeRecursive()'s
for loop, In order to work on the Directorys contained in the
Directory[] as CachedDirectorys, I have to wrap them in a new instance
of CachedDirectory. I don't know how to get around the lastComputed
field of the new CachedDirectory being set to -1 (forcing the new
CachedDirectory to recompute its value instead of looking in the cache
for it).
Please help!

PS - Is there any where I can put a line break in the declaration of a
generic Collection? The lines where I declare them at the top of the
class get kinda long, and they look ugly.

-------------------Class DirectoryCache-------------------
import java.util.*;
import java.io.*;

/**
* Provides one place where CachedDirectories can be
* instantiated, that all use the same cache.
* To instantiate a CachedDirectory from this cache
* use this code:
* DirectoryCache cache = new DirectoryCache();
* CachedDirectory dir = cache.new CachedDirectory(...);
*/

public class DirectoryCache {
// The storage facility:
private static HashMap<Directory, Long> dirSizes = new
HashMap<Directory, Long>();

/**
* This class subclasses Directory to add caching of the result of
* calling sizeRecursive().
*/
public class CachedDirectory extends Directory {
/** The last time the recusrive size was computed. */
private long lastComputed = -1;

/** This is provided as an alternative to
* typing DirectoryCache.this */
private final DirectoryCache cache = DirectoryCache.this;

public CachedDirectory (final String pathName)
throws NotADirectoryException {
super(pathName);
}

public CachedDirectory(final Directory f) throws IOException {
super(f);
}

public long sizeRecursive() throws IOException {
/* If this CachedDirectory is NOT in the cache,
* or if it was modified: */
if (!cache.dirSizes.containsKey(this)
|| lastModified() > lastComputed) {
// Debug output:
if (lastModified() > lastComputed)
System.out.println(this.toString()
+ " was modified.");
if (!cache.dirSizes.containsKey(this))
System.out.println(this.toString()
+ " is not cached.");

// Recompute the recursive size of this directory:

lastComputed = lastModified();

// Get the size of all files in this directory:
long newSize = size();

// Get all directories in this CachedDirectory:
Directory[] dirs = listDirectories();

for (Directory dir : dirs) {
/* TODO: Check if there's a better way to do this
* that doesn't involve creating a new instance.
* since this "fools" the equals() method...
* or does it?
*
* After some thought, becasuse its a new instance
* of CachedDirectory, its lastComputed variable
* is -1, meaning it will always recompute its
* size "manually". This is a bad thing, and
* I dont know how to fix it.
*/
dir = new CachedDirectory(dir);
newSize += dir.sizeRecursive();
}

// Cache and return new size:
cache.dirSizes.put(this, newSize);
return newSize;
} else { // This is in the cache and hasn't been modified.
System.out.println(this.toString() + " is cached.");
return cache.dirSizes.get(this);
}
}
}
}
-------------------Class Directory-------------------
import java.io.*;
import java.util.*;

/**
* This class represents a directory, othewise known as a folder.
* It provides a method of differentiating between Files and
Directories
* contained within the folder. Breaking from the java.io.File
"standard",
* this class provides a method of getting files and folders contained
within
* a folder seperately.
*/
public class Directory {
/** Internal representation of this directory. */
private final File representation;

/** The last time that the files and dirs arrays were refreshed.
*/
private long lastRefreshed = -1;

/** Contains the files stored in this directory. */
private File[] files;

/** Contains the directories stored in this directory. */
private Directory[] dirs;

public Directory(String pathname) throws NotADirectoryException {
this(new File(pathname));
}

public Directory(File f) throws NotADirectoryException {
if (f.exists() && f.isFile())
throw new NotADirectoryException(f);
representation = f;
}

/**
* Constructs a new Directory object pointing to the same
* data on disk.
* @todo Make constructors call each other (via this()).
*/
public Directory(Directory dir) {
/*
* TODO: Try to make all constructors call each other.
* right now, I cant call this(dir.representation)
* since I need to catch the NotADirectoryException.
*/
representation = dir.representation;
}

/**
* If recursive is false, gets the sum of all the sizes
* of the files contained in this Directory.
*/
public long size() throws IOException {
//System.out.println("Computing size for " + toString());
long size = 0;
File[] files = listFiles();
//System.out.println("File[] files == " +
Arrays.toString(files));
for (File file : files) {
size += file.length();
}
return size;
}

/**
* This method gets the sum of all the sizes of
* the files contained within this Directory,
* including those contained within the directories
* within this directory.
*/
public long sizeRecursive() throws IOException {
long size = size();

Directory[] directories = listDirectories();
//System.out.println("Directory[] directories == " +
Arrays.toString(directories));
for (Directory dir : directories) {
//System.out.println("Directory currently scanning == " +
dir);
size += dir.sizeRecursive();
}
return size;
}

/**
* Repopulates the arrays containing the files and directories
* contained within this directory.
*/
private void refreshContentArrays() throws IOException {
lastRefreshed = representation.lastModified();
File[] contents = representation.listFiles();

// contents shouldnt be null unless this directory doesn't
exist.
if (contents == null) {
this.dirs = null;
this.files = null;
return;
}

// A directory, at most, has contents.length files/directories
in it.
List<File> newFiles = new
ArrayList<File>(contents.length);
List<Directory> newDirs = new
ArrayList<Directory>(contents.length);
for (File item : contents) {
// TODO: Why does this need to be here?
if (item.exists() && !isSymbolicLink(item)) {
if (item.isFile()) {
newFiles.add(item);
} else { // Item is directory
try {
newDirs.add(new Directory(item));
} catch (NotADirectoryException nade) {
nade.printStackTrace();
}
// NotADirectoryException shouldn't happen.
}
}
}
// Store new data in the fields:
this.dirs = new Directory[newDirs.size()];
this.files = new File[newFiles.size()];
newDirs.toArray(dirs);
newFiles.toArray(files);
}

/**
* Lists all the directories contained within this directory.
* Note: This doesn't recurse into the directories.
* The special .. (parent) and . (this) directories are excluded.
* Returns null if this directory does not exist.
* @todo Move content array validity check to setLastModified().
*/
public Directory[] listDirectories() throws IOException {
// Make sure contents arrays are up-to-date.
// TODO: Check if I can just override setLastModified() to do
this.
long lastModified = representation.lastModified();
if (lastModified > lastRefreshed) {
refreshContentArrays();
}

Directory[] ret = null;
if (dirs != null) {
// Cant pass a refrence to the field, since
// its not immutable (can I use an UnmodifiableSet?)
ret = new Directory[dirs.length];
System.arraycopy(dirs, 0, ret, 0, dirs.length);
}
return ret;
}

/**
* Returns all the files contained within this directory.
* Contrary to java.io.File's listFiles() method, this returns
* only the *files* in this directory, not the directories.
*/
public File[] listFiles() throws IOException {
// Make sure contents arrays are up-to-date.
// TODO: Check if I can just override setLastModified() to do
this.
long lastModified = representation.lastModified();
if (lastModified > lastRefreshed) {
refreshContentArrays();
}

File[] ret = null;
if (files != null) {
// Cant pass a refrence to the field, since
// its not immutable (can I use an UnmodifiableList?)
ret = new File[files.length];
System.arraycopy(files, 0, ret, 0, files.length);
}
return ret;
}

private static boolean isSymbolicLink(File f) throws IOException {
return !f.getAbsolutePath().equals(f.getCanonicalPath());
}

/**
* @see java.io.File#lastModified()
*/
public long lastModified() {
return representation.lastModified();
}

/**
* @see java.io.File#exists()
*/
public boolean exists() {
return representation.exists();
}

/**
* @see java.io.File#mkdir()
*/
public boolean mkdir() {
return representation.mkdir();
}

/**
* @see java.io.File#delete()
*/
public boolean delete() {
return representation.delete();
}

/**
* @see java.io.File#toString()
*/
public String toString() {
return representation.toString();
}

/**
* @see java.io.File#hashCode()
*/
public int hashCode() {
return representation.hashCode();
}

/**
* @see java.io.File#equals(Object)
*/
public boolean equals(Object o) {
return representation.equals(o);
}

public int compareTo(Directory pathname) {
return representation.compareTo(pathname.representation);
}

/**
* @see java.io.File#getCanonicalPath()
*/
public String getCanonicalPath() throws IOException {
return representation.getCanonicalPath();
}

/**
* @see java.io.File#getPath()
*/
public String getPath() {
return representation.getPath();
}
}
-------------------Class NotADirectoryException-------------------
import java.io.File;

/**
* This exception is thrown when a Directory is constructed
* via a File that already exists as a file on the storage media.
* @todo See if this should instead subclass RuntimeException.
*/
public class NotADirectoryException extends Exception {
private final File file;
public NotADirectoryException(File file) {
this.file = file;
}

public NotADirectoryException(String filePath) {
this(new File(filePath));
}

public String getMessage() {
return file.toString() + " is not a directory.";
}
}
 
A

Alex Hunsley

Inertia_sublimation said:
Also, please forgive the amount of code I'm posting, I think all of it
is pertinant though.

Just one thing for now: post compilable code. What you posted isn't
compilable, as it has been line-broken and I can see comments spill over
onto the next line etc.
Also, when posting a pretty large bit of code, consider placing it on a
web server or ftp server if you have one!
 
P

Paul Lutus

Inertia_sublimation said:
Hello everyone. Im developing a program the requires heavy usage of
recursive folder size computation, and need to speed it up immensely.

I thought the use of a cache to store the result of recursively
getting the size of a folder would be ideal, but am having a problem
with it: Its not working.

First, your code is way too large for the task at hand. The problem can be
solved in 80 lines. I know, I posted an example for you earlier, but to no
apparent effect.

Second, you cannot cache your result from scan to scan, because there may be
changes in the filesystem between scans. So throw out the idea of a cache.
Each time you scan the directory tree, you have to scan each and every
item, period.

Please consider doing this with the simplest possible embodiment, by
abandoning any illusions that you can avoid a complete rescan each time a
scan is required.

Again, as a reminder:

// DirSum.java

import java.io.*;
import java.util.*;

public class DirSum {

DirSum(String path) {
scanSimple(new File(path));
}

private boolean isSymLink(File f)
{
try {
return !f.getCanonicalPath().equals(f.getAbsolutePath());
}
catch (java.io.IOException e) {
return false;
}
}

public DirData scanSimple(File path)
{
DirData data = new DirData();
File[] list = path.listFiles();
if(list != null) {
Arrays.sort(list);
for(int i = 0; i < list.length;i++) {
File f = list;
// don't follow symlinks
if(f.isDirectory()) {
if(isSymLink(f)) {
data.symlinks++;
}
else {
data.add(scanSimple(f));
}
}
else {
data.files++;
data.length += f.length();
}
}
}
System.out.println(path + ": " + data);
return data;
}


public static void main(String args[])
{
if(args.length == 0) {
System.out.println("usage: directory to be scanned.");
}
else {
new DirSum(args[0]);
}
}
}

class DirData {
public int symlinks = 0;
public long files = 0;
public long length = 0;
public long directories = 0;

public void add(DirData d)
{
symlinks += d.symlinks;
files += d.files;
length += d.length;
directories += d.directories;
}
public String toString()
{
return files + " files, " + directories + " directories, "
+ symlinks + " symbolic links, " + length + " bytes.";
}
}
 
I

Inertia_sublimation

Alex Hunsley said:
Just one thing for now: post compilable code. What you posted isn't
compilable, as it has been line-broken and I can see comments spill over
onto the next line etc.
Also, when posting a pretty large bit of code, consider placing it on a
web server or ftp server if you have one!

Thanks for the tip. Im still new to usenet, Ill keep that in mind and
post links to a downloadable copy of the source as soon as I can.
 
I

Inertia_sublimation

Paul Lutus said:
First, your code is way too large for the task at hand. The problem can be
solved in 80 lines. I know, I posted an example for you earlier, but to no
apparent effect.

Second, you cannot cache your result from scan to scan, because there may be
changes in the filesystem between scans. So throw out the idea of a cache.
Each time you scan the directory tree, you have to scan each and every
item, period.

For my purposes, I assume the filesystem doesn't change. What Im
trying to make is a utility that allows me to know a good estimate of
the sizes of each directory. I dont care about a few transient bytes
that change every so often; the user is responsible for not making
considerable changes to the filesystem (ex. deleting a huge amount of
data) while the program is running. Even if he/she does, I can provide
an option to recache those directories that changed.

Can your solution be adapted to allow for caching of directory data?
If so, it would be ideal, and I would gladly throw out my original
code :)

Thanks for your replies!
 
P

Paul Lutus

Inertia_sublimation said:
For my purposes, I assume the filesystem doesn't change.

Okay, in that case, you only have to scan the filesystem once, and
subsequently you can read the stored data. So put your tree data in a tree
data structure, one structurally synchronized with the tree search code,
and after the scan, you only have to read the tree, not the filesystem.

This is not difficult at all. But if your purpose is to have a program to
discover the size of the filesystem once, you might as well read it and
exit, without storing all the file-by-file data. If on the other hand you
need to read the filesystem repeatedly and accurately, then you do not want
to store the prior read data, so there is no point storing the tree state.

In neither of these scenarios is there a need to store the entire tree.
What Im
trying to make is a utility that allows me to know a good estimate of
the sizes of each directory.

I provided that, directory by directory, in my short program.
I dont care about a few transient bytes
that change every so often; the user is responsible for not making
considerable changes to the filesystem (ex. deleting a huge amount of
data) while the program is running. Even if he/she does, I can provide
an option to recache those directories that changed.

Can your solution be adapted to allow for caching of directory data?

Please ask yourself how you plan to use the cached data. There are any
number of ways to store the data (for example you could store the data that
is now printed by my program after each directory-level scan), but I
heaven't heard how you plan to use the data.
If so, it would be ideal, and I would gladly throw out my original
code :)

Get out the shovels. :)

In all seriousness, tell me how you plan to use the data. What do you need
to know, when, at what detail level?
 
A

Alex Hunsley

Inertia_sublimation said:
Thanks for the tip. Im still new to usenet, Ill keep that in mind and
post links to a downloadable copy of the source as soon as I can.

Thanks for listening!
Generally, putting small-ish bits of code in usenet posts is fine. I
would say your post was sort of mid-way between the "put code in the
post" and "upload somewhere and give a link" camps.

alex
 
I

Inertia_sublimation

Paul Lutus said:
Okay, in that case, you only have to scan the filesystem once, and
subsequently you can read the stored data. So put your tree data in a tree
data structure, one structurally synchronized with the tree search code,
and after the scan, you only have to read the tree, not the filesystem.

This is not difficult at all. But if your purpose is to have a program to
discover the size of the filesystem once, you might as well read it and
exit, without storing all the file-by-file data. If on the other hand you
need to read the filesystem repeatedly and accurately, then you do not want
to store the prior read data, so there is no point storing the tree state.

In neither of these scenarios is there a need to store the entire tree.


I provided that, directory by directory, in my short program.


Please ask yourself how you plan to use the cached data. There are any
number of ways to store the data (for example you could store the data that
is now printed by my program after each directory-level scan), but I
heaven't heard how you plan to use the data.


Get out the shovels. :)

In all seriousness, tell me how you plan to use the data. What do you need
to know, when, at what detail level?

I think the simplest way to describe what Im trying to do is to run
through what would happen if a user started my theoretical program.

The program starts out asking the user for a root directory to browse
in. After selecting one (C:) the program gets all directories 1 level
within C: (for example: C:\Windows, C:\Documents and Settings,
C:\RECYCLER, C:\Program Files) I call these directories Level 1. The
size of the contents in each of these Level 1 directories is then
calculated recursively. The user is presented with a pie chart
displaying the sizes of each Level 1 directory as a fraction of the
total space of C:\.

The user then clicks on one of the slices of the pie chart, say, the
one representing C:\Program Files. The program gets all directories 1
level within C:\Program Files (C:\Program Files\Foo, C:\Program
Files\Bar, C:\Program Files\Mozilla, etc), call these level 2. The
program then calculates the size of the contents of each of these
Level 2 directories recursively.

This is where caching will help the situation. Because the program
needed to visit every directory within every directory in C:\, it
already visited all the directories in C:\Program Files.

The user can continue browsing deeper into the directories in
C:\Program Files, or he/she can go back up to any level at any time.

Ah, and btw, I was able to get my code uploaded, here are links to it:
<http://ameritech.net/users/linuxgateway/Directory.java>
<http://ameritech.net/users/linuxgateway/DirectoryCache.java>
<http://ameritech.net/users/linuxgateway/NotADirectoryException.java>
 
P

Paul Lutus

Inertia_sublimation wrote:

/ ...
This is where caching will help the situation. Because the program
needed to visit every directory within every directory in C:\, it
already visited all the directories in C:\Program Files.

The user can continue browsing deeper into the directories in
C:\Program Files, or he/she can go back up to any level at any time.

My posted program already does all this, at every level. All you have to do
is store the results that my program prints.

Please think about this. How do you know the actual total for the root
directory (in your case, C:\)? You do this by visiting all the nested
directories and totaling their contents. This means you have already
visited all those directories, and you already have a total number for them
-- all you have to do is save the result.

If the user clicks on a specified subdirectory, all you have to do is recall
the stored total for that directory and its subdirectories. What you do not
have to do is rescan anything.

The required addition to my program would be a Vector (or ArrayList), to
which you would add entries for each visited directory, recursively. You
could create a utility class that contained a total number plus a vector
containing all the visited directories for that level. Each visited
directory would get an instance of this utility class added to the vector
for the higher level.

This would create a tree data structure, similar to the original filesystem
tree, that would contain all the directories and the totals for each. Each
node in the tree would be an instance of your utility class. Each utility
class instance would contain a Vector or ArrayList, plus a number that
represents the total size of all the files beneath that level.

This is not at all difficult, but you need to think it through.
 
I

Inertia_sublimation

Paul Lutus said:
The required addition to my program would be a Vector (or ArrayList), to
which you would add entries for each visited directory, recursively. You
could create a utility class that contained a total number plus a vector
containing all the visited directories for that level. Each visited
directory would get an instance of this utility class added to the vector
for the higher level.
By "total number" do you mean size?
 
P

Paul Lutus

By "total number" do you mean size?

I'll say it differently. I mean a number that represents the sum of the
lengths of all the files you encountered during your scan. The total number
for the root directory should be the grand total of the lengths of all the
files on the entire scanned directory tree -- e.g. the sum of all the
individual sums from the subdirectories.

Just as my original program does it, but with a browsable tree structure at
the end of the process. Like this:

********************************************************************

// DirSum.java

import java.io.*;
import java.util.*;

public class DirSum {

private boolean isSymLink(File f) {
try {
return !f.getCanonicalPath().equals(f.getAbsolutePath());
}
catch (java.io.IOException e) {
return false;
}
}

public DirData scanSimple(File path,boolean showResult) {
DirData data = new DirData();
File[] list = path.listFiles();
if(list != null) {
Arrays.sort(list);
for(int i = 0; i < list.length;i++) {
File f = list;
// don't follow symlinks
if(f.isDirectory()) {
if(isSymLink(f)) {
data.addSymlink();
}
else {
data.addDirectory(scanSimple(f,showResult));
}
}
else {
data.addFile(f);
}
}
}
if(showResult) {
System.out.println(path + ": " + data);
}
return data;
}

public static void main(String args[]) {
if(args.length == 0) {
System.out.println("usage: directory to be scanned.");
}
else {
DirSum s = new DirSum();
DirData d = s.scanSimple(new File(args[0]),false);
System.out.println("Total for " + args[0] + ": " + d);
}
}
}

class DirData {
private long symlinks = 0;
private long files = 0;
private long length = 0;
private long directories = 1;
private ArrayList subDirs = new ArrayList();

public final ArrayList getSubs() {
return subDirs;
}

public void addFile(File f) {
files++;
length += f.length();
}

public void addSymlink() {
symlinks++;
}

public long getFiles() {
return files;
}

public long getLength() {
return length;
}

public long getDirs() {
return directories;
}

public long getSymlinks() {
return symlinks;
}

public void addDirectory(DirData d) {
subDirs.add(d);
directories += d.getDirs();
files += d.getFiles();
length += d.getLength();
}

public String toString() {
return files + " files, " + directories + " dirs, " + symlinks
+ " symlinks, " + length + " bytes.";
}
}

********************************************************************

In this line:

DirData d = s.scanSimple(new File(args[0]),false);

What is returned is an instance of the DirData class that contains within it
a nested ArrayList of all the visited directories, arranged as a tree
structure. You can traverse the tree to get the information you need.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top