HardDrive backed ArrayList


opalpa

Is there an ArrayList that uses file system instead of RAM?

That is, when I add an item, which is Serializable, I would like the
ArrayList to serialize item to file system and keep a reference to
where it put item. When I get the item I want it to be read from disk,
de-serialized, and passed to me.

The reason is I have over a gigabyte of data in ArrayList and it will
be over two gigabytes within a couple of months. The problem with
using this much space is that java process size is not allowed to grow
that large.

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

opalpa

Hmm, on further thought, the data structure appears simple to write; I
need only three operations. What do folks think of:


package opalpa.util;

import java.util.*;
import java.io.*;

public final class DiskList {
    // One temp file per element; only the File references stay in RAM.
    private final ArrayList<File> data = new ArrayList<File>();

    public synchronized void add(Serializable so) throws IOException {
        File f = File.createTempFile("disklist.", "dlo");
        f.deleteOnExit();
        opalpa.world.output.object(f, so);      // serialize the object to f
        data.add(f);
    }

    public Iterator<Serializable> iterator() {
        return new ExtractingIterator(data.iterator());
    }

    public synchronized void shuffle() {
        Collections.shuffle(data);
    }

    // Reads each element back from disk as the caller iterates.
    private class ExtractingIterator implements Iterator<Serializable> {
        private final Iterator<File> it;

        public ExtractingIterator(Iterator<File> it) {
            this.it = it;
        }

        public Serializable next() {
            try {
                return opalpa.world.sense.object(it.next());   // deserialize
            } catch (Exception e) {
                throw new RuntimeException("cannot read written data", e);
            }
        }

        public boolean hasNext() {
            return it.hasNext();
        }

        public void remove() {
            it.remove();
        }
    }
}


opalpa.world.output.object and opalpa.world.sense.object write and read
serializable objects. I am going to verify that this code works
shortly. Cheers
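
Those two helpers are not shown in the thread; here is a minimal sketch of
what they might look like, assuming they simply wrap standard Java
serialization (the class and method names come from the post, the bodies
are a guess; the two classes would live in separate source files):

package opalpa.world;

import java.io.*;

// Hypothetical stand-in for the writer helper referenced above.
public class output {
    public static void object(File f, Serializable so) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f));
        try {
            out.writeObject(so);   // write the whole object graph to the file
        } finally {
            out.close();
        }
    }
}

// Hypothetical stand-in for the reader helper referenced above.
public class sense {
    public static Serializable object(File f)
            throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(f));
        try {
            return (Serializable) in.readObject();   // read it back from disk
        } finally {
            in.close();
        }
    }
}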

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

Dimitri Maziuk

(e-mail address removed) opalinski from opalpaweb sez:
Lord have mercy. Creating hundreds of thousands of files is a bad
idea. At least a slow one. Back to the drawing board...

You could go for the textbook solution with RandomAccessFile
-- although even if your filesystem can handle files over
2GB, seek times are likely to kill you at some point.

100,000s of small files don't have the seek time problem.
If you store them in a directory tree, with 3 levels of
subdirectories you can keep it reasonable.

I'd stick the whole thing into a postgres table instead,
myself.

Dima
 

opalpa

HermitCrab has more functionality than I am looking for. I do not
need to copy, move, resize records, nor have data persist from
execution to execution. The article brings up important points for
consideration. I am probably going to split across several physical
files and then use random access within those files.
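
A minimal sketch of that approach (a hypothetical class, not from the
thread): serialize each object to bytes, append the bytes to a
RandomAccessFile, and keep only the offset and length in memory. Splitting
across several physical files would just add a file number to each index
entry.

import java.io.*;
import java.util.*;

// Sketch: one growing data file plus an in-memory index of (offset, length).
final class RandomAccessStore {
    private static final class Entry {
        final long offset;
        final int length;
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final RandomAccessFile file;
    private final List<Entry> index = new ArrayList<Entry>();

    RandomAccessStore(File f) throws IOException {
        this.file = new RandomAccessFile(f, "rw");
    }

    // Serialize the object, append it to the file, remember where it went.
    synchronized int add(Serializable so) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(so);
        oos.close();
        byte[] bytes = bos.toByteArray();

        long offset = file.length();
        file.seek(offset);
        file.write(bytes);
        index.add(new Entry(offset, bytes.length));
        return index.size() - 1;            // the position acts as the key
    }

    // Seek to the recorded offset and deserialize the bytes found there.
    synchronized Serializable get(int i) throws IOException, ClassNotFoundException {
        Entry e = index.get(i);
        byte[] bytes = new byte[e.length];
        file.seek(e.offset);
        file.readFully(bytes);
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
        try {
            return (Serializable) ois.readObject();
        } finally {
            ois.close();
        }
    }

    synchronized int size() {
        return index.size();
    }
}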


Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

opalpa

100,000s of small files don't have the seek time problem.
If you store them in a directory tree, with 3 levels of
subdirectories you can keep it reasonable.

Interesting. How many subdirectories on each level?
I'd stick the whole thing into a postgres table instead,
myself.

I'll see how simple it is to set up and get a [primary key, byte array]
table up. I hope that the data ceiling per table is big, like 4GB
or larger. I'll look into this solution. Thanks.
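
For what it's worth, a bare-bones JDBC sketch of such a table (the table
name, column names, and connection details here are invented for
illustration; the table is assumed to be created as
CREATE TABLE disklist (id INTEGER PRIMARY KEY, data BYTEA)):

import java.sql.*;

// Sketch: store serialized objects as byte arrays keyed by an integer id.
final class PostgresStore {
    private final Connection conn;

    PostgresStore(String url, String user, String password) throws SQLException {
        this.conn = DriverManager.getConnection(url, user, password);
    }

    void put(int id, byte[] serialized) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO disklist (id, data) VALUES (?, ?)");
        ps.setInt(1, id);
        ps.setBytes(2, serialized);
        ps.executeUpdate();
        ps.close();
    }

    byte[] get(int id) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT data FROM disklist WHERE id = ?");
        ps.setInt(1, id);
        ResultSet rs = ps.executeQuery();
        byte[] bytes = rs.next() ? rs.getBytes(1) : null;
        rs.close();
        ps.close();
        return bytes;
    }
}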

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

Roedy Green

Interesting. How many subdirectories on each level?

The problem with thousands of small files is the overhead of opening
them. If the data are stored in a large flat file or a database, you
don't have the open overhead.

Opens used to work with a linear search in Windows 98. Now they use a
hashmap of some sort so the penalty of a directory with a large
number of files is much lower. However, opening a file is still a
non-trivial operation.
 

Dimitri Maziuk

Roedy Green sez:
The problem with thousands of small files is the overhead of opening
them. If the data are stored in a large flat file or a database, you
don't have the open overhead.

Yes, there is that.
Opens used to work with a linear search in Windows 98. Now they use a
hashmap of some sort so the penalty of a directory with a large
number of files is much lower.

Right. On unix opendir()/readdir() is still linear AFAIK -- hence the
suggestion to use subdirectories. (With enough levels to keep the number
of directory entries down to about 1000, and the path to a file is constructed
like a hash key.)

Dima
 

Martin Gregorie

Roedy said:
The problem with thousands of small files is the overhead of opening
them. If the data are stored in a large flat file or a database, you
don't have the open overhead.

Opens used to work with a linear search in Windows 98. Now they use a
hashmap of some sort so the penalty of a directory with a large
number of files is much lower. However, opening a file is still a
non-trivial operation.
Using hundreds of thousands of tiny files can also waste a non-trivial
amount of disk space.

In the Windows/DOS world if you have a 120 GB vfat partition the
smallest space a file can occupy is 32K because disk space is allocated
in 64-block clusters. A block is, of course, 512 bytes. If you fill such
a partition with files which each contain between 1 and 32K bytes you'll
run out of space at around 3.9 million files. If the file is only a few
hundred bytes you'll take a read performance hit too because the OS
never reads part of a cluster.

That's less of a problem with a *nix filing system because a 4K block
size is typical, but it hasn't gone away and you still need to allow a
bit of space for the per-file indexing overheads. At least the
performance overhead isn't as great.
 

Dimitri Maziuk

(e-mail address removed) opalinski from opalpaweb sez:
Interesting. How many subdirectories on each level?

A common rule of thumb is between 1 and 10 thousand entries in
each directory. Guesstimate how many files you will eventually
have and figure out the number of levels from there. (Path names
should be constructed in the code like hash keys so you don't
need to list contents of each directory.)
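
A small sketch of that last point (a hypothetical helper, not from the
thread): derive the directory names from a hash of the key, so a file's
location is computable without ever listing a directory:

import java.io.File;

// Sketch: map a numeric key to root/dir1/dir2/key.dlo, ~1000 entries per level.
final class HashedPath {
    static File pathFor(File root, long key) {
        int h = (int) (key ^ (key >>> 32)) & 0x7fffffff;  // non-negative hash
        int level1 = h % 1000;                            // first level, 0..999
        int level2 = (h / 1000) % 1000;                   // second level, 0..999
        File dir = new File(new File(root, String.valueOf(level1)),
                            String.valueOf(level2));
        dir.mkdirs();                                     // create levels on demand
        return new File(dir, key + ".dlo");
    }
}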

The actual details are non-portable black magic: they are
filesystem-specific, they depend on implementation and intended
usage, etc.

Anyway, as Roedy pointed out, the trade-off with lots of small
files (no matter how you store them) is the time it takes to
open and close each.

Dima
 

Oliver Wong

Interesting. How many subdirectories on each level?

Depending on whether "3" counts the files themselves as a level:
18 * 18 * 18 * 18 = 104'976
47 * 47 * 47 = 103'823

- Oliver
 

Andrew McDonagh

Is there an ArrayList that uses file system instead of RAM?

That is, when I add an item, which is Serializable, I would like the
ArrayList to serialize item to file system and keep a reference to
where it put item. When I get the item I want it to be read from disk,
de-serialized, and passed to me.

The reason is I have over a gigabyte of data in ArrayList and it will
be over two gigabytes within a couple of months. The problem with
using this much space is that java process size is not allowed to grow
that large.

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/

With that size of data, I'd either:

1) use a single structured file (binary, XML, etc.) which stores the data.

2) use a DBMS (MySQL, etc.) to store the data.

I'd then create an ArrayList implementation backed by reading/writing
from/to either of those 2 choices.
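
One way that might be shaped (a sketch; RecordStore is a made-up interface
standing in for whichever of the two backends is chosen):

import java.io.Serializable;
import java.util.AbstractList;

// Hypothetical backend: a structured file, a DBMS table, etc.
interface RecordStore {
    void add(Serializable so);
    Serializable get(int index);
    int size();
}

// List view over the store, so callers can use it like any other List.
final class StoreBackedList extends AbstractList<Serializable> {
    private final RecordStore store;

    StoreBackedList(RecordStore store) {
        this.store = store;
    }

    public Serializable get(int index) {
        return store.get(index);        // read through to the backend
    }

    public boolean add(Serializable so) {
        store.add(so);                  // write through to the backend
        return true;
    }

    public int size() {
        return store.size();
    }
}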
 

Roedy Green

A common rule of thumb is between 1 and 10 thousand entries in
each directory.

Another thing to consider. Are humans going to be browsing these
directories? If so you want to keep each directory under 1000 files
and preferably under 100 for easy JFileChooser browsing.
 

Dimitri Maziuk

Roedy Green sez:
Another thing to consider. Are humans going to be browsing these
directories? If so you want to keep each directory under 1000 files
and preferably under 100 for easy JFileChooser browsing.

Which is precisely why I said "intended usage". 10^3 seems OK here for
command-line tools on relatively modern hardware. Adjust as needed for
GUI browsing, slow computers, etc.

There is an argument that these files/names should be unreadable by humans
because humans will FUBAR the data and crash your whole system if you let
them. If you subscribe to that POV, storing all files in one directory is
the way to go: if JFileChooser OOM's on it, it's a Good Thing(tm).
Of course, in that scenario you'll probably want to store an md5 hash for
each file someplace where humans can't get to it and keep it in sync with
all your writes and handle the case where it doesn't match and so on...

Dima
 

Thomas Weidenfeller

Is there an ArrayList that uses file system instead of RAM?

That is, when I add an item, which is Serializable, I would like the
ArrayList to serialize item to file system and keep a reference to
where it put item. When I get the item I want it to be read from disk,
de-serialized, and passed to me.

IMHO you are off to a false start here.

a) You do not want an ArrayList, because you want a file instead of the
array part, don't you? So why start with an ArrayList? The List
interface might be a better start, or ...

b) ... [Array]List is maybe also not what you want. The API deals with
references, and only stores references. But you want to store the whole
object, and a get will not return a reference to the original object,
but a new one from the deserialization.

Anyhow, you probably want a kind of light OO database. If you really
want to implement it from scratch you probably have to look into the
B-Tree stuff (and variants) for some efficient file usage.
The reason is I have over a gigabyte of data in ArrayList and it will
be over two gigabytes within a couple of months. The problem with
using this much space is that java process size is not allowed to grow
that large.

Really sounds like you are in need of some kind of database.

/Thomas
 

Tom Fredriksen

Interesting. How many subdirectories on each level?

You could possibly try ReiserFS, on a partition, or perhaps Berkeley DB, a
local file "db". ReiserFS is designed to hold large numbers of small
files and be extremely fast. That way perhaps the only thing you need to
hold in memory is the index system. Or, you could just use a Red-Black
tree or a hash tree to index the files for you. Or something to that effect.

If there is any pattern in the data access you could perhaps have a
memory cache with it.
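
A common sketch of such a cache, using LinkedHashMap's access-order mode to
evict the least recently used entries (the capacity below is made up):

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep the N most recently used deserialized objects in memory.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);         // true = iterate in access order
        this.maxEntries = maxEntries;
    }

    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;     // evict once the cap is exceeded
    }
}

// e.g. Map<Integer, Serializable> cache = new LruCache<Integer, Serializable>(10000);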

Btw, if you need the speed, then stay away from any general network-related
solutions or any backend local server solution; only use memory
and file. They will eat away at your performance gain. The first by net
latency with a factor of 10-100, the second by concurrency latency and
load with a factor of 3-6.

/tom
 

opalpa

Good day all,

Leisurely, before stepping into a weekend and brief vacation, I'll note the
thread's ideas and the actions they engendered:

* xcollections
the fitting project seems to have died; looked at the website, nothing to
download, I wrote no code

* ToXgene (general category: make XML files)
making XML files appears to add overhead; I did not download
anything, wrote no code

* HermitCrab
not yet reality, has the right idea; got me thinking of solutions
that are part of the executable and efficiently use disk space as storage;
the best I could think of was: A) to use random access files myself, or
B) to ask Roedy "how much for a subset of HermitCrab operations?"

* PostgreSQL
nice DB -- good documentation, good tools, and SQL; I downloaded,
installed, read documentation, and wrote code. Local DB was fifty
times slower than working in memory. The program already took a long
time to run and fifty times slower was not acceptable to clientele.

* Berkeley DB JE
used it today; with a large cache enabled (750MB), it was fast! Very
good documentation. It appears to be a HermitCrab implementation plus
some. (A minimal usage sketch appears below.)

For now Berkeley DB JE appears to fit the particular situation I am in
best of all alternatives noted above. I will explore (aka "play
around") more starting Thursday.

Thank you all for your suggestions. Many of the ideas were not on my
radar. I enjoy leisurely contemplating alternatives, evaluating each,
and choosing what appears to be rational for a given situation. You
have all helped in the most difficult part of analytics -- the
introduction of the possible.

Good evening,

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

Roedy Green

* HermitCrab
not yet reality, has the right idea; got me thinking of solutions
that are part of the executable and efficiently use disk space as storage;
the best I could think of was: A) to use random access files myself, or
B) to ask Roedy "how much for a subset of HermitCrab operations?"

The reason you might want a HermitCrab rather than a full database is
simplicity. You don't have to install a database with your simple
little app. It would be just another few classes. If int keys are
sufficient, it can be very quick. You save the full overhead of a
database when all you really wanted was variable length records.

You could do a read-only version with a very small amount of code
overhead.

Of course you don't get transactions, recovery, multiple indexes or
any of the other goodies that come with a full database.

There are some SQL databases now that can be embedded with the app.
Some are RAM only, but I think there are a few that persist. The
other catch with SQL is that most of them require a royalty to distribute,
or their JDBC drivers do.

see http://mindprod.com/jgloss/sqlvendors.html
 
