Stefan said:
This should be a common need. Yet I am not aware of anything
like it in Java SE. What is the most common (pure Java)
solution to it?
I would like to have an implementation of java.util.Map,
which is constructed with an int »m« and a java.io.File »f«.
It will use no more than »m« bytes of memory, but »swap« out
(the least often used) entries to the file »f« when they no
longer fit into the given memory size.
Not too long ago I wound up needing to implement a program that would
store and retrieve large numbers of images, in an environment that
precluded the use of a DBMS (which would've been overkill anyway).
I basically created an on-disk HashMap quite simply: each image was
read into a BufferedImage from a stream that had a suitable
DigestInputStream decorator attached to it before being wrapped in an
ImageInputStream. Out comes the SHA-1 hash of the image data, as well
as the image itself.
Then the SHA-1 is used to construct a path and filename, mkdirs is used
on the path, and one ImageIO.write later, it's stored.
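Sketched very roughly, the store path looked something like the code
below. The class and method names, the PNG re-encode, and the two-level
directory split are only illustrative here; the real code differed in
detail.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import javax.imageio.ImageIO;

    public final class ImageStore {

        private final File root;

        public ImageStore(File root) { this.root = root; }

        // Reads the image while a DigestInputStream computes SHA-1 over the
        // bytes consumed, then files the image away under that hash.
        public String store(InputStream in)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            DigestInputStream din = new DigestInputStream(in, sha1);
            BufferedImage image = ImageIO.read(din); // wraps din in an ImageInputStream
            String hash = toHex(sha1.digest());
            File target = pathFor(hash);
            target.getParentFile().mkdirs();
            ImageIO.write(image, "png", target);
            return hash;                             // the key used everywhere else
        }

        // Illustrative layout, e.g. root/ab/cdef0123....png -- the split
        // just keeps individual directories small.
        private File pathFor(String hash) {
            return new File(new File(root, hash.substring(0, 2)),
                            hash.substring(2) + ".png");
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }
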
Retrieval was equally simple, since the SHA-1s were then used as the
unique identifiers for the images throughout the program and could
easily be turned into a path and filename for use this time with
ImageIO.read. Delete was just as simple. The only tricky part of CRUD
for this program was the "U" part, since altering an image would alter
its hash, and thus its ID. (Sometimes the ID had to be determined from a
copy of the image, to avoid duplications that could waste many
gigabytes.)
Fortunately, alterations were much rarer than storage and retrieval, and
it wasn't hard to have a higher level of abstraction that had its own
identifiers, and that mapped to the SHA-1 that might have changed.
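Given a hash, retrieval and delete were just the reverse path
construction (same illustrative layout as in the store sketch above):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public final class ImageLookup {

        private final File root;

        public ImageLookup(File root) { this.root = root; }

        public BufferedImage load(String hash) throws IOException {
            File f = pathFor(hash);
            return f.isFile() ? ImageIO.read(f) : null; // null if never stored
        }

        public boolean delete(String hash) {
            return pathFor(hash).delete();
        }

        // Must mirror the layout the store used: root/ab/cdef0123....png
        private File pathFor(String hash) {
            return new File(new File(root, hash.substring(0, 2)),
                            hash.substring(2) + ".png");
        }
    }
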
FWIW, I've also implemented a disk-based deque supporting O(1) additions
and removals at both ends (using sequentially numbered files as a
backing "DiskArrayList", each file holding up to N entries). The
constant-factor overhead is large, since a linear scan of up to N
entries may be needed to reach the last entry in the deque (the last
entry in the last file), and since I/O is much slower than RAM
anyway. This disk-deque was needed for some application that had to
store potentially millions of items and had to run where there wasn't
much RAM but there was plenty of disk. Speed of access to the deque was of
little concern because the tasks done would have been I/O bound anyway,
and saving tens of megabytes of RAM usage at run-time was worth it. In
fact, the application needed many of these deques, potentially running
into gigs of total memory use, so the only two alternatives to having
the deques themselves live on disk were having them be serialized to disk
and loaded on demand (which would make things much slower -- reading a
30 meg file to do a few things and then writing a 30 meg file is a lot
slower than reading a few-KB file to do a few things and then writing a
few-KB file!) or having them all in RAM and letting the system page like
crazy (equally slow, and would make the UI unresponsive at times).
Acceptable UI responsiveness under these circumstances dictated the need
for a disk-based data structure. (Think "kiosk".)
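A bare-bones sketch of the deque idea, simplified to one entry per
numbered file (the real thing packed up to N entries per file, persisted
its head/tail counters, and had the transaction handling described
below):

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.NoSuchElementException;

    // Simplified illustration only: one entry per numbered file, and the
    // head/tail counters are not persisted across restarts.
    public final class DiskDeque {

        private final File dir;
        private long head;      // index of the first entry
        private long tail;      // index one past the last entry

        public DiskDeque(File dir) {
            this.dir = dir;
            dir.mkdirs();
        }

        public void addFirst(String value) throws IOException { write(--head, value); }

        public void addLast(String value) throws IOException { write(tail++, value); }

        public String removeFirst() throws IOException {
            if (head == tail) throw new NoSuchElementException();
            return removeAt(head++);
        }

        public String removeLast() throws IOException {
            if (head == tail) throw new NoSuchElementException();
            return removeAt(--tail);
        }

        public long size() { return tail - head; }

        private void write(long index, String value) throws IOException {
            Files.write(fileFor(index).toPath(),
                        value.getBytes(StandardCharsets.UTF_8));
        }

        private String removeAt(long index) throws IOException {
            File f = fileFor(index);
            String value = new String(Files.readAllBytes(f.toPath()),
                                      StandardCharsets.UTF_8);
            f.delete();
            return value;
        }

        private File fileFor(long index) {
            return new File(dir, "entry-" + index); // sequentially numbered files
        }
    }
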
The disk-deque had some notion of transactions and commits, too, mainly
to maintain data integrity; if the power went out in mid-operation the
deque wouldn't be corrupted, though some changes to it might not have
been saved. The disk-hashmap-of-images had a weaker version of this: it
saved the image to a temp file and then renamed it when done, so the
image appearing in the hashmap under its hash happened atomically as far
as any external process was concerned. A power failure in mid-save would
result in the image not having been added, rather than a truncated
version being in the map in the hash bucket for the un-truncated
version. (Eeeuw.) So in both cases operations might have to be retried
after an inconvenient crash or power failure, but the data would not be
corrupted: a weak protection of data integrity and some weak atomicity.
When you mission-critically need full ACID, though, you need a proper
database engine, even with all the added complexity and overhead that
implies.
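For what it's worth, the save-to-a-temp-file-then-rename trick looks
roughly like this with java.nio.file (the real code may simply have used
File.renameTo; either way, how atomic the rename really is depends on
the filesystem):

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import javax.imageio.ImageIO;

    public final class AtomicImageSave {

        // Writes to a temp file in the target's directory, then renames it over
        // the final name, so readers see either no image or the whole image.
        public static void save(BufferedImage image, File target) throws IOException {
            File dir = target.getParentFile();
            dir.mkdirs();
            // Same directory, so the rename never crosses a filesystem boundary.
            File tmp = File.createTempFile("img", ".tmp", dir);
            try {
                ImageIO.write(image, "png", tmp);
                // ATOMIC_MOVE throws if the filesystem cannot rename atomically.
                Files.move(tmp.toPath(), target.toPath(),
                           StandardCopyOption.ATOMIC_MOVE);
            } finally {
                tmp.delete();   // no-op if the move already succeeded
            }
        }
    }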