Best way to store a large number of files?

heather.fraser

Hello everybody,

I am creating an Image library application with Java which will
store several million files on the file system.
Metadata describing the images will be stored in a database, but
I think it's probably faster if the actual image files are stored
on the file system with a reference stored in the database.

As I understand it, storing all of the files in one single
directory would become slow in look-ups. And so I am thinking
of giving each image a 10-digit number and placing the image
in a directory structure such as this ~

/1/2/3/4/5/6/7/8/9/x.png

For example, the image numbered 2749282749 would be stored as 9.png
in the subdirectory /2/7/4/9/2/8/2/7/4
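In code, the mapping I have in mind would look something like this (just a sketch; the class and method names are my own invention):

```java
// Sketch: map a zero-padded 10-digit image id to a nested path,
// one digit per directory level, with the last digit as the file name.
public class ImagePaths {
    public static String pathFor(long id) {
        String digits = String.format("%010d", id); // e.g. "2749282749"
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < digits.length() - 1; i++) {
            path.append('/').append(digits.charAt(i));
        }
        return path.append('/')
                   .append(digits.charAt(digits.length() - 1))
                   .append(".png")
                   .toString();
    }
}
```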

Is it really that simple? Are there any caveats that I should
be aware of?

thank you very much,

Heather
 
Mark Thornton

Hello everybody,

I am creating an Image library application with Java which will
store several million files on the file system.
Metadata describing the images will be stored in a database, but
I think it's probably faster if the actual image files are stored
on the file system with a reference stored in the database.

As I understand it, storing all of the files in one single
directory would become slow in look-ups.

It depends on the file system in use. This problem does occur for FAT
but not for NTFS for example.
And so I am thinking
of giving each image a 10-digit number and placing the image
in a directory structure such as this ~

A better approach might be to use MessageDigest to compute a hash of the
file and use that to derive the file path and name. This would result in
identical files being located in the same place. You should also
experiment with the number of 'digits' to use at each level; one is
probably too few, two or three is likely to be more efficient. Otherwise
the approach is reasonable and is used by a number of applications.
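A minimal sketch of that idea, assuming SHA-1 and two hex digits for each of two directory levels (the class and method names here are only illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a storage path from a SHA-1 digest of the file's
// contents. Identical files hash to the same path, so duplicates
// naturally share one location.
public class HashedPaths {
    public static String pathFor(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        // Two levels of two hex digits each, e.g. "/a9/99/a99...png"
        return "/" + hex.substring(0, 2) + "/" + hex.substring(2, 4)
                + "/" + hex + ".png";
    }
}
```

Two hex digits per level gives 256 entries per directory, which is in the range worth experimenting with.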

Mark Thornton
 
Kenneth P. Turvey

On Sat, 08 Oct 2005 06:50:21 -0700, heather.fraser wrote:

[Snip]
/1/2/3/4/5/6/7/8/9/x.png

For example, the image numbered 2749282749 would be stored as 9.png
in the subdirectory /2/7/4/9/2/8/2/7/4

Is it really that simple? Are there any caveats that I should
be aware of?

This is pretty much exactly how many news servers store articles in the
filesystem. You don't need that many levels of directories, though. Your
design will work fine under Unix; I can't speak for other platforms, but
I would expect it to be fine there too.
 
Roedy Green

Is it really that simple? Are there any caveats that I should
be aware of?

Create your directory structure first. You can't create a file without
the directory structure in place.

It is primarily Windows 98 and its FAT file system that have trouble
with long linear searches of directories. Your scheme has 10
directories per level and 10 leaf files per directory.

You might try your code with 100, 256 or 1000 per node to find the
optimal efficiency, perhaps even making the arity a platform
configurable option.
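A sketch of what a configurable arity might look like; the id is simply decomposed in base `arity`, and the names here are only illustrative:

```java
// Sketch: derive a nested path from a numeric id with a configurable
// number of entries per directory level (the arity), so fan-outs of
// 10, 100, 256 or 1000 can be compared on the target platform.
// The full id is kept in the leaf file name to avoid confusion.
public class ArityPaths {
    public static String pathFor(long id, int arity, int levels) {
        long[] parts = new long[levels];
        long rest = id;
        for (int i = levels - 1; i >= 0; i--) {
            parts[i] = rest % arity; // one base-`arity` digit per level
            rest /= arity;
        }
        StringBuilder path = new StringBuilder();
        for (long p : parts) {
            path.append('/').append(p);
        }
        return path.append('/').append(id).append(".png").toString();
    }
}
```

With arity 1000 and two levels, ten-digit ids spread over up to a million directories with the remaining ids as leaves.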

You are probably best to put the entire index name in the leaf file
name to avoid confusion, especially if files are copied about.
 
Jon Martin Solaas

Mark said:
It depends on the file system in use. This problem does occur for FAT
but not for NTFS for example.

Are you joking? NTFS is better at handling large numbers of files, but
it surely becomes a problem when the number is large enough.
 
Mark Thornton

Jon said:
Are you joking? NTFS is better at handling large numbers of files, but
it surely becomes a problem when the number is large enough.

Given that NTFS uses a tree structure for directories, it won't have any
more trouble than the hierarchy of directories proposed by the OP.
It certainly is happy with many thousands of entries in a directory. For
Linux fans I think ReiserFS has similar properties.

Mark Thornton
 
Drazen Gemic

A better approach might be to use MessageDigest to compute a hash of the
file and use that to derive the file path and name.

A similar approach is used by Squid, a caching proxy. It creates hash
codes from URLs. Rest assured that it can store and access files quickly
and deal with millions of files without any effort.

DG
 
Andrey Kuznetsov

It depends on the file system in use. This problem does occur for FAT
but not for NTFS for example.

but with Java you will get HUGE problems in this case - think about
File#list().
 
Andrey Kuznetsov

Create your directory structure first. You can't create a file without
the directory structure in place.

you can create missing directories with mkdirs()
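For example (a sketch; the class and method names are mine):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Sketch: create any missing parent directories on demand before
// writing, so the whole tree need not be built up front.
public class EnsureDirs {
    public static void write(File target, byte[] data) throws IOException {
        File parent = target.getParentFile();
        if (parent != null) {
            parent.mkdirs(); // creates the whole chain of missing directories
        }
        FileOutputStream out = new FileOutputStream(target);
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }
}
```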
 
Mark Thornton

Andrey said:
but with java you will get HUGE problems in this case - think about
File#list().

The OP's task may not need to use the 'list' method. We also hope that
JSR-203 will eventually provide a way around this problem.

Mark Thornton
 
Mike Schilling

Roedy Green said:
Create your directory structure first. You can't create a file without
the directory structure in place.

It is primarily Windows 98 and its FAT file system that have trouble
with long linear searches of directories. Your scheme has 10
directories per level and 10 leaf files per directory.

You might try your code with 100, 256 or 1000 per node to find the
optimal efficiency, perhaps even making the arity a platform
configurable option.

You are probably best to put the entire index name in the leaf file
name to avoid confusion, especially if files are copied about.

This is an excellent point. Likewise, it helps if the directory is
damaged and the file has to be recovered via a disk repair utility.
 
Mark Thornton

Roedy said:
What will JSR-203 provide?

Better means to access file system information, including bulk access to
file properties. Unfortunately it has now been delayed until Dolphin (JDK 7).
 
heather.fraser

Thank you to everybody who has replied and offered advice. It is so much
easier to begin development when I feel confident that I am going in the
right direction; otherwise I often waste time looking over my shoulder
and wondering whether there isn't a better way.

I shall try the 1000 files per directory approach and see how it goes.


Mike said:
This is an excellent point. Likewise, it helps if the directory is
damaged and the file has to be recovered via a disk repair utility.


This more than anything puts my mind at rest. I was very worried about
migrating the images some time in the future and mixing up several
images with the same name (for example, 9.png).

Oh, what a simple solution. How did it never occur to me before?

Thank you all again,

Heather
 
