Stefan said:
This should be a common need. Yet I am not aware of anything
like it in Java SE. What is the most common (pure Java)
solution to it?
I would like to have an implementation of java.util.Map,
which is constructed with an int »m« and a java.io.File »f«.
It will use no more than »m« bytes of memory, but »swap« out
(the least often used) entries to the file »f« when they no
longer fit into the given memory size.
Not too long ago I wound up needing to implement a program that would
store and retrieve large numbers of images, in an environment that
precluded the use of a DBMS (which would've been overkill anyway).
I basically created an on-disk HashMap quite simply: each image was
read into a BufferedImage from a stream that had a suitable
DigestInputStream decorator attached to it before being wrapped in an
ImageInputStream. Out comes the SHA-1 hash of the image data, as well
as the image itself.
Then the SHA-1 is used to construct a path and filename, mkdirs is used
on the path, and one ImageIO.write later, it's stored.
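Sketched very roughly, the store path looked something like the code
below. The class and method names, the PNG re-encode, and the two-level
directory split are only illustrative here; the real code differed in
detail.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import javax.imageio.ImageIO;

    public final class ImageStore {

        private final File root;

        public ImageStore(File root) { this.root = root; }

        // Reads the image while a DigestInputStream computes SHA-1 over the
        // bytes consumed, then files the image away under that hash.
        public String store(InputStream in)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            DigestInputStream din = new DigestInputStream(in, sha1);
            BufferedImage image = ImageIO.read(din); // wraps din in an ImageInputStream
            String hash = toHex(sha1.digest());
            File target = pathFor(hash);
            target.getParentFile().mkdirs();
            ImageIO.write(image, "png", target);
            return hash;                             // the key used everywhere else
        }

        // Illustrative layout, e.g. root/ab/cdef0123....png -- the split
        // just keeps individual directories small.
        private File pathFor(String hash) {
            return new File(new File(root, hash.substring(0, 2)),
                            hash.substring(2) + ".png");
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }
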
Retrieval was equally simple, since the SHA-1s were then used as the
unique identifiers for the images throughout the program and could
easily be turned into a path and filename for use this time with
ImageIO.read. Delete was just as simple. The only tricky part of CRUD
for this program was the "U" part, since altering an image would alter
its hash, and thus its ID. (Sometimes the ID had to be determined from a
copy of the image, to avoid duplications that could waste many
gigabytes.)
Fortunately, alterations were much rarer than storage and retrieval, and
it wasn't hard to have a higher level of abstraction that had its own
identifiers, and that mapped to the SHA-1 that might have changed.
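Given a hash, retrieval and delete were just the reverse path
construction (same illustrative layout as in the store sketch above):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public final class ImageLookup {

        private final File root;

        public ImageLookup(File root) { this.root = root; }

        public BufferedImage load(String hash) throws IOException {
            File f = pathFor(hash);
            return f.isFile() ? ImageIO.read(f) : null; // null if never stored
        }

        public boolean delete(String hash) {
            return pathFor(hash).delete();
        }

        // Must mirror the layout the store used: root/ab/cdef0123....png
        private File pathFor(String hash) {
            return new File(new File(root, hash.substring(0, 2)),
                            hash.substring(2) + ".png");
        }
    }
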
FWIW, I've also implemented a disk-based deque supporting O(1) additions
and removals at both ends (using sequentially numbered files as a
backing "DiskArrayList", each file holding up to N entries). The
constant-factor overhead is large, since a linear scan of up to N
entries may be needed to reach the last entry in the deque (the last
entry in the last file), and since I/O is much slower than RAM
anyway. This disk-deque was needed for some application that had to
store potentially millions of items and had to run where there wasn't
much RAM but there was plenty of disk. Speed of access to the deque was of
little concern because the tasks done would have been I/O bound anyway,
and saving tens of megabytes of RAM usage at run-time was worth it. In
fact, the application needed many of these deques, potentially running
into gigs of total memory use, so the only two alternatives to having
the deques themselves live on disk were having them be serialized to disk
and loaded on demand (which would make things much slower -- reading a
30 meg file to do a few things and then writing a 30 meg file is a lot
slower than reading a few-KB file to do a few things and then writing a
few-KB file!) or having them all in RAM and letting the system page like
crazy (equally slow, and would make the UI unresponsive at times).
Acceptable UI responsiveness under these circumstances dictated the need
for a disk-based data structure. (Think "kiosk".)
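A bare-bones sketch of the deque idea, simplified to one entry per
numbered file (the real thing packed up to N entries per file, persisted
its head/tail counters, and had the transaction handling described
below):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.NoSuchElementException;

    // Simplified illustration only: one entry per numbered file, and the
    // head/tail counters are not persisted across restarts.
    public final class DiskDeque {

        private final File dir;
        private long head;      // index of the first entry
        private long tail;      // index one past the last entry

        public DiskDeque(File dir) {
            this.dir = dir;
            dir.mkdirs();
        }

        public void addFirst(String value) throws IOException { write(--head, value); }

        public void addLast(String value) throws IOException { write(tail++, value); }

        public String removeFirst() throws IOException {
            if (head == tail) throw new NoSuchElementException();
            return removeAt(head++);
        }

        public String removeLast() throws IOException {
            if (head == tail) throw new NoSuchElementException();
            return removeAt(--tail);
        }

        public long size() { return tail - head; }

        private void write(long index, String value) throws IOException {
            Files.write(fileFor(index).toPath(),
                        value.getBytes(StandardCharsets.UTF_8));
        }

        private String removeAt(long index) throws IOException {
            File f = fileFor(index);
            String value = new String(Files.readAllBytes(f.toPath()),
                                      StandardCharsets.UTF_8);
            f.delete();
            return value;
        }

        private File fileFor(long index) {
            return new File(dir, "entry-" + index); // sequentially numbered files
        }
    }
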
The disk-deque had some notion of transactions and commits, too, mainly
to maintain data integrity; if the power went out in mid-operation the
deque wouldn't be corrupted, though some changes to it might not have
been saved. The disk-hashmap-of-images had a weaker version of this: it
saved the image to a temp file and then renamed it when done, so the
image appearing in the hashmap under its hash happened atomically as far
as any external process was concerned. A power failure in mid-save would
result in the image not having been added, rather than a truncated
version being in the map in the hash bucket for the un-truncated
version. (Eeeuw.) So in both cases operations might have to be retried
after an inconvenient crash or power failure, but the data would not be
corrupted: a weak protection of data integrity and some weak atomicity.
When you mission-critically need full ACID, though, you need a proper
database engine, even with all the added complexity and overhead that
implies.
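For what it's worth, the save-to-a-temp-file-then-rename trick looks
roughly like this with java.nio.file (the real code may simply have used
File.renameTo; either way, how atomic the rename really is depends on
the filesystem):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import javax.imageio.ImageIO;

    public final class AtomicImageSave {

        // Writes to a temp file in the target's directory, then renames it over
        // the final name, so readers see either no image or the whole image.
        public static void save(BufferedImage image, File target) throws IOException {
            File dir = target.getParentFile();
            dir.mkdirs();
            // Same directory, so the rename never crosses a filesystem boundary.
            File tmp = File.createTempFile("img", ".tmp", dir);
            try {
                ImageIO.write(image, "png", tmp);
                // ATOMIC_MOVE throws if the filesystem cannot rename atomically.
                Files.move(tmp.toPath(), target.toPath(),
                           StandardCopyOption.ATOMIC_MOVE);
            } finally {
                tmp.delete();   // no-op if the move already succeeded
            }
        }
    }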