Best way to store a large number of files?

Discussion in 'Java' started by heather.fraser@gmail.com, Oct 8, 2005.

  1. Guest

    Hello everybody,

    I am creating an Image library application with Java which will
    store several million files on the file system.
    Metadata describing the images will be stored in a database, but
    I think it's probably faster if the actual image files are stored
    on the file system with a reference stored in the database.

    As I understand it, storing all of the files in one single
    directory would become slow in look-ups. And so I am thinking
    of giving each image a 10-digit number and placing the image
    in a directory structure such as this:

    /1/2/3/4/5/6/7/8/9/x.png

    For example, if an image is 2749282749.png then the image 9.png
    will be placed in the subdirectory /2/7/4/9/2/8/2/7/4

    Is it really that simple? Are there any caveats that I should
    be aware of?

    thank you very much,

    Heather
    , Oct 8, 2005
    #1

  2. wrote:
    > Hello everybody,
    >
    > I am creating an Image library application with Java which will
    > store several million files on the file system.
    > Metadata describing the images will be stored in a database, but
    > I think it's probably faster if the actual image files are stored
    > on the file system with a reference stored in the database.
    >
    > As I understand it, storing all of the files in one single
    > directory would become slow in look-ups.


    It depends on the file system in use. This problem does occur for FAT
    but not for NTFS for example.

    >And so I am thinking
    > of giving each image a 10-digit number and placing the image
    > in a directory structure such as this:


    A better approach might be to use MessageDigest to compute a hash of the
    file and use that to derive the file path and name. This would result in
    identical files being located in the same place. You should also
    experiment with the number of 'digits' to use at each level; one is
    probably too few, two or three is likely to be more efficient. Otherwise
    the approach is reasonable and is used by a number of applications.
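
    A minimal sketch of the hashing idea above, assuming SHA-256 as the
    digest and two hex characters per directory level. The class and method
    names (HashedPath, sha256Hex, pathFor) are made up for illustration and
    do not come from any library.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a storage path from a file's content hash.
final class HashedPath {

    /** Hex-encodes bytes, two lowercase characters per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Returns the SHA-256 digest of the content as 64 hex characters. */
    static String sha256Hex(byte[] content) {
        try {
            return toHex(MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds root/aa/bb/<full hash>.<extension>, taking charsPerLevel hex
     * characters of the hash for each of the given number of levels.
     */
    static Path pathFor(Path root, byte[] content, String extension,
                        int levels, int charsPerLevel) {
        String hex = sha256Hex(content);
        if (levels < 0 || charsPerLevel < 1 || levels * charsPerLevel > hex.length()) {
            throw new IllegalArgumentException("levels * charsPerLevel must fit in the hash");
        }
        Path dir = root;
        for (int i = 0; i < levels; i++) {
            dir = dir.resolve(hex.substring(i * charsPerLevel, (i + 1) * charsPerLevel));
        }
        // The full hash stays in the file name, so the file is identifiable on its own.
        return dir.resolve(hex + "." + extension);
    }

    public static void main(String[] args) {
        byte[] data = "example image bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(pathFor(Paths.get("images"), data, "png", 2, 2));
    }
}
```

    Identical content always maps to the same path, which is what makes
    duplicates land in the same place. Two hex characters per level gives
    256 entries per directory; that figure is only a starting point for
    experiments.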

    Mark Thornton
    Mark Thornton, Oct 8, 2005
    #2

  3. On Sat, 08 Oct 2005 06:50:21 -0700, heather.fraser wrote:

    [Snip]
    > /1/2/3/4/5/6/7/8/9/x.png
    >
    > For example, if an image is 2749282749.png then the image 9.png
    > will be placed in the subdirectory /2/7/4/9/2/8/2/7/4
    >
    > Is it really that simple? Are there any caveats that I should
    > be aware of?


    This is pretty much exactly how many news servers store articles in the
    filesystem. You don't need that many levels of directories though. Your
    design will work fine under Unix. I can't speak for other platforms, but
    I would expect it to be fine.

    --
    Kenneth P. Turvey
    http://kt.squeakydolphin.com (not much there yet)
    Phone: (314) 255-2199
    Kenneth P. Turvey, Oct 8, 2005
    #3
  4. Roedy Green Guest

    On 8 Oct 2005 06:50:21 -0700, wrote or quoted
    :

    >Is it really that simple? Are there any caveats that I should
    >be aware of?


    Create your directory structure first. You can't create a file without
    the directory structure in place.

    It is primarily Windows 98 and its FAT file system that has troubles
    with long linear searches of directories. Your scheme has 10
    directories per level and 10 leaf files per directory.

    You might try your code with 100, 256 or 1000 per node to find the
    optimal efficiency, perhaps even making the arity a platform
    configurable option.

    You are probably best to put the entire index name in the leaf file
    name to avoid confusion, especially if files are copied about.
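
    A sketch of how the points above might fit together, assuming 10-digit
    numeric IDs as in the original post. ShardedStore and its methods are
    hypothetical names. digitsPerLevel is the configurable arity (1 gives 10
    entries per directory, 2 gives 100, 3 gives 1000), the full ID is kept
    in the leaf file name, and the directories are created before the file
    is written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical store that spreads numbered files over a directory tree.
final class ShardedStore {
    private static final long MAX_ID = 9_999_999_999L; // largest 10-digit ID

    private final Path root;
    private final int digitsPerLevel;

    ShardedStore(Path root, int digitsPerLevel) {
        if (digitsPerLevel < 1 || digitsPerLevel > 10) {
            throw new IllegalArgumentException("digitsPerLevel must be 1..10");
        }
        this.root = root;
        this.digitsPerLevel = digitsPerLevel;
    }

    /** Computes the path for an ID; touches no files. */
    Path pathFor(long id, String extension) {
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException("id must have at most 10 digits");
        }
        String name = String.format("%010d", id); // zero-pad to 10 digits
        // The last chunk of digits only distinguishes files inside a leaf
        // directory, so it is not turned into a directory of its own.
        int dirDigits = name.length() - digitsPerLevel;
        int first = dirDigits % digitsPerLevel;
        int start = 0;
        int end = (first == 0) ? digitsPerLevel : first; // top level may be shorter
        Path dir = root;
        while (start < dirDigits) {
            dir = dir.resolve(name.substring(start, end));
            start = end;
            end += digitsPerLevel;
        }
        return dir.resolve(name + "." + extension); // full ID in the leaf name
    }

    /** Creates any missing directories, then writes the file. */
    Path store(long id, String extension, byte[] data) throws IOException {
        Path target = pathFor(id, extension);
        Files.createDirectories(target.getParent());
        return Files.write(target, data);
    }

    public static void main(String[] args) {
        ShardedStore store = new ShardedStore(Paths.get("images"), 3);
        // On a Unix-like system this prints images/2/749/282/2749282749.png
        System.out.println(store.pathFor(2749282749L, "png"));
    }
}
```

    With digitsPerLevel set to 1 the same ID maps to
    images/2/7/4/9/2/8/2/7/4/2749282749.png, which is the original layout
    with the full number in the file name.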
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
    Roedy Green, Oct 9, 2005
    #4
  5. Mark Thornton wrote:
    > wrote:
    >
    >> Hello everybody,
    >>
    >> I am creating an Image library application with Java which will
    >> store several million files on the file system.
    >> Metadata describing the images will be stored in a database, but
    >> I think it's probably faster if the actual image files are stored
    >> on the file system with a reference stored in the database.
    >>
    >> As I understand it, storing all of the files in one single
    >> directory would become slow in look-ups.

    >
    >
    > It depends on the file system in use. This problem does occur for FAT
    > but not for NTFS for example.


    Are you joking? NTFS is better at handling large numbers of files, but
    it surely becomes a problem when the number is large enough.

    --
    jon martin solaas
    Jon Martin Solaas, Oct 9, 2005
    #5
  6. Jon Martin Solaas wrote:
    > Mark Thornton wrote:
    >
    >> wrote:
    >>
    >>> Hello everybody,
    >>>
    >>> I am creating an Image library application with Java which will
    >>> store several million files on the file system.
    >>> Metadata describing the images will be stored in a database, but
    >>> I think it's probably faster if the actual image files are stored
    >>> on the file system with a reference stored in the database.
    >>>
    >>> As I understand it, storing all of the files in one single
    >>> directory would become slow in look-ups.

    >>
    >>
    >>
    >> It depends on the file system in use. This problem does occur for FAT
    >> but not for NTFS for example.

    >
    >
    > Are you joking? NTFS is better at handling large numbers of files, but
    > it surely becomes a problem when the number is large enough.
    >


    Given that NTFS uses a tree structure for directories, it won't have any
    more problem than using the hierarchy of directories proposed by the OP.
    It certainly is happy with many thousands of entries in a directory. For
    Linux fans I think ReiserFS has similar properties.

    Mark Thornton
    Mark Thornton, Oct 9, 2005
    #6
  7. Drazen Gemic Guest

    > A better approach might be to use MessageDigest to compute a hash of the
    > file and use that to derive the file path and name. This would result in


    A similar approach is used by Squid, a caching proxy. It derives a hash
    code from each URL. Rest assured that it stores and accesses files
    quickly and deals with millions of files without any effort.

    DG
    Drazen Gemic, Oct 9, 2005
    #7
  8. >> As I understand it, storing all of the files in one single
    >> directory would become slow in look-ups.

    >
    > It depends on the file system in use. This problem does occur for FAT but
    > not for NTFS for example.


    But with Java you will get HUGE problems in this case: think about
    File#list().

    --
    Andrey Kuznetsov
    http://uio.imagero.com Unified I/O for Java
    http://reader.imagero.com Java image reader
    http://jgui.imagero.com Java GUI components and utilities
    Andrey Kuznetsov, Oct 9, 2005
    #8
  9. Andrey Kuznetsov, Oct 9, 2005
    #9
  10. Andrey Kuznetsov wrote:
    >>>As I understand it, storing all of the files in one single
    >>>directory would become slow in look-ups.

    >>
    >>It depends on the file system in use. This problem does occur for FAT but
    >>not for NTFS for example.

    >
    >
    > But with Java you will get HUGE problems in this case: think about
    > File#list().
    >


    The OP's task may not need to use the 'list' method. We also hope that
    JSR-203 will eventually provide a way around this problem.
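
    For readers on Java 7 or later: JSR-203 shipped as the java.nio.file
    package, and Files.newDirectoryStream iterates over a directory's
    entries instead of returning every name in one array the way
    File#list() does. A minimal sketch, with LazyListing and countMatching
    as made-up names for illustration:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper showing iteration over a large directory.
final class LazyListing {

    /** Counts the entries of dir that match the glob, one entry at a time. */
    static long countMatching(Path dir, String glob) throws IOException {
        long count = 0;
        // The stream holds an open directory handle, so close it when done.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : entries) {
                count++; // real code would process 'entry' here
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(countMatching(dir, "*.png") + " PNG files in " + dir);
    }
}
```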

    Mark Thornton
    Mark Thornton, Oct 9, 2005
    #10
  11. "Roedy Green" <> wrote in
    message news:...
    > On 8 Oct 2005 06:50:21 -0700, wrote or quoted
    > :
    >
    >>Is it really that simple? Are there any caveats that I should
    >>be aware of?

    >
    > Create your directory structure first. You can't create a file without
    > the directory structure in place.
    >
    > It is primarily Windows 98 and its FAT file system that has troubles
    > with long linear searches of directories. Your scheme has 10
    > directories per level and 10 leaf files per directory.
    >
    > You might try your code with 100, 256 or 1000 per node to find the
    > optimal efficiency, perhaps even making the arity a platform
    > configurable option.
    >
    > You are probably best to put the entire index name in the leaf file
    > name to avoid confusion, especially if files are copied about.


    This is an excellent point. Likewise, if the directory is damaged, and the
    file is found via a disk repair utility.
    Mike Schilling, Oct 9, 2005
    #11
  12. Roedy Green Guest

    On Sun, 09 Oct 2005 19:58:19 GMT, Mark Thornton
    <> wrote or quoted :

    >JSR-203 will eventually provide a way around this problem.


    What will JSR-203 provide?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
    Roedy Green, Oct 10, 2005
    #12
  13. Roedy Green wrote:
    > On Sun, 09 Oct 2005 19:58:19 GMT, Mark Thornton
    > <> wrote or quoted :
    >
    >
    >>JSR-203 will eventually provide a way around this problem.

    >
    >
    > What will JSR-203 provide?


    Better means to access file system information including bulk access to
    file properties. Unfortunately it has now been delayed until Dolphin (JDK7).
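
    On Java 7 and later, the file properties mentioned above are exposed
    through BasicFileAttributes, and Files.walkFileTree hands each file's
    attributes to the visitor along with its path. A sketch that totals the
    size of a directory tree; TreeSize and totalBytes are hypothetical
    names:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: sums file sizes using the attributes supplied by the walk.
final class TreeSize {

    /** Returns the total size in bytes of all regular files under root. */
    static long totalBytes(Path root) throws IOException {
        final long[] total = {0}; // array so the anonymous class can update it
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    total[0] += attrs.size(); // size comes with the attributes
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(totalBytes(root) + " bytes under " + root);
    }
}
```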
    Mark Thornton, Oct 10, 2005
    #13
  14. Guest

    Thank you to everybody who has replied and offered advice. It is so much
    easier to begin development when I feel confident that I am going in the
    right direction; otherwise I often waste time looking over my shoulder
    and wondering whether there isn't a better way.

    I shall try the 1000 files per directory approach and see how it goes.


    Mike Schilling wrote:
    > "Roedy Green" wrote:
    > > You are probably best to put the entire index name in the leaf file
    > > name to avoid confusion, especially if files are copied about.

    >
    > This is an excellent point. Likewise, if the directory is damaged, and the
    > file is found via a disk repair utility.



    This more than anything puts my mind at rest. I was very worried about
    migrating the images some time in the future and mixing up several
    images with the same name (for example, 9.png).

    Oh, what a simple solution to this. How did it never occur to me before?

    Thank you all again,

    Heather
    , Oct 10, 2005
    #14