Searching a disk-backed Map

Discussion in 'Java' started by Stefan Ram, Aug 18, 2009.

  1. Stefan Ram

    Stefan Ram Guest

    This should be a common need. Yet I am not aware of anything
    like it in Java SE. What is the most common (pure Java)
    solution to it?

    I would like to have an implementation of java.util.Map,
    which is constructed with an int »m« and a java.io.File »f«.

    It will use no more than »m« bytes of memory, but »swap« out
    (the least often used) entries to the file »f«, when they do
    not fit into the given memory size anymore.
    Stefan Ram, Aug 18, 2009
    #1
    1. Advertising

  2. Stefan Ram

    Albert Guest

    Stefan Ram a écrit :
    > This should be a common need. Yet I am not aware of anything
    > like it in Java SE. What is the most common (pure Java)
    > solution to it?
    >
    > I would like to have an implementation of java.util.Map,
    > which is constructed with an int »m« and a java.io.File »f«.
    >
    > It will use no more than »m« bytes of memory, but »swap« out
    > (the least often used) entries to the file »f«, when they do
    > not fit into the given memory size anymore.
    >


    just to make you renounce, search for a feature request about a
    release() or free() method for the "MappedByteBuffer" class in the sun
    bugs database.
    Albert, Aug 18, 2009
    #2
    1. Advertising

  3. Stefan Ram

    markspace Guest

    Stefan Ram wrote:
    > This should be a common need. Yet I am not aware of anything
    > like it in Java SE. What is the most common (pure Java)
    > solution to it?
    >
    > I would like to have an implementation of java.util.Map,
    > which is constructed with an int »m« and a java.io.File »f«.
    >
    > It will use no more than »m« bytes of memory, but »swap« out
    > (the least often used) entries to the file »f«, when they do
    > not fit into the given memory size anymore.
    >



    I was going to the say the same thing as Patricia: the map is called
    "Derby". Search for JavaDB.

    <http://java.sun.com/developer/technicalArticles/J2SE/Desktop/javadb/>

    There's no point in actually implementing such a specialized thing as a
    literal Map when there's a better, more general solution already available.
    markspace, Aug 18, 2009
    #3
  4. Stefan Ram

    Roedy Green Guest

    On 18 Aug 2009 12:09:58 GMT, -berlin.de (Stefan Ram)
    wrote, quoted or indirectly quoted someone who said :

    > This should be a common need. Yet I am not aware of anything
    > like it in Java SE. What is the most common (pure Java)
    > solution to it?
    >
    > I would like to have an implementation of java.util.Map,
    > which is constructed with an int »m« and a java.io.File »f«.
    >
    > It will use no more than »m« bytes of memory, but »swap« out
    > (the least often used) entries to the file »f«, when they do
    > not fit into the given memory size anymore.


    I rolled my own something similar. It is not that much code to handle
    the random quotations you see on my website.

    What you could do in write a class that uses a HashMap internally just
    to hold the keys and objects that are offsets in a sequential file.

    When you build the Map, you write the objects out with writeUTF or
    writeObject, and record the size/offset of the stream before the
    write.

    Then to lookup in the Map, you look up the key, get the offset, seek
    and do a read. You don't even need to know the length.

    This is pretty fast, especially when the drive/OS does read caching.
    If you wanted to make it even faster, you could put the objects in a
    NIO memory mapped file, but that limits your file size. You also
    might put the file on fast flash drive.

    I would write such a beast to your specs for $50 US.


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    http://thecovemovie.com : The Cove: a documentary about Japan's secret atrocities against dolphins.
    Roedy Green, Aug 18, 2009
    #4
  5. Stefan Ram

    Roedy Green Guest

    On Tue, 18 Aug 2009 05:42:09 -0700, Patricia Shanahan <>
    wrote, quoted or indirectly quoted someone who said :

    >
    >Have you considered putting the data in a database instead, and using
    >java.sql to access it? The data structures and algorithms that Java uses
    >for in-memory maps are not very suitable for disk-based maps. Database
    >managers use structures and algorithms designed for the job.


    Other advantages:

    1. They don't fail entirely when RAM gets tighter. With an in RAM key
    lookup, you must have the ram for all the keys.

    2. You can improve their performance just by throwing more ram at
    them. They can use it for both keys and objects.

    3. If your indexing needs become more complex, you can just add. You
    don't have to start over from scratch.

    The disadvantages:

    1. Usually you must separately install, start/stop the SQL engine.
    This might conflict with other apps.

    2. You have the overhead of all of SQL when you may need only a tiny
    fraction of it. That RAM would have been better used with more
    specific code.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    http://thecovemovie.com : The Cove: a documentary about Japan's secret atrocities against dolphins.
    Roedy Green, Aug 18, 2009
    #5
  6. Stefan Ram

    Stefan Ram Guest

    Roedy Green <> writes:
    >Then to lookup in the Map, you look up the key, get the offset, seek
    >and do a read. You don't even need to know the length.


    I have read your ideas on how to implement this beast, and
    they are nearly identical to what I had in mind before I
    read your suggestion but after I wrote my OP (I made it up
    in the time between). So, maybe I might go this way.
    Stefan Ram, Aug 18, 2009
    #6
  7. Stefan Ram

    Stefan Ram Guest

    Roedy Green <> writes:
    >I would write such a beast to your specs for $50 US.

    ¯¯¯¯¯
    This is funny, because just yesterday I read an article
    about brain research that found that all people (even blind
    people) have different areas in the brain to represent
    animals versus non-animal things, independent of how they
    look like.

    Then, yesterday, I posted my thoughts in »de.comp.lang.misc«
    about how the brain sometimes uses the animal area and
    sometimes the things area to represent abstract concepts in
    programming. For example, a program looks like an animal to
    the brain, while data looks like a thing.

    Some people then made fun of my post, but now I read your
    choice of the word »beast«, which supports my idea.
    Stefan Ram, Aug 18, 2009
    #7
  8. Stefan Ram

    Stefan Ram Guest

    Roedy Green <> writes:
    >1. Usually you must separately install, start/stop the SQL engine.
    >This might conflict with other apps.


    I have this crazy idea to write applications that are based
    on Java SE only and not require any additional library.
    I know that this idea might not be very pragmatic or reasonable.

    /If/ Derby would finally be included in Java SE, I would
    love to use it.
    Stefan Ram, Aug 18, 2009
    #8
  9. Stefan Ram

    Roedy Green Guest

    On Tue, 18 Aug 2009 13:02:36 -0700, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >1. They don't fail entirely when RAM gets tighter. With an in RAM key
    >lookup, you must have the ram for all the keys.
    >
    >2. You can improve their performance just by throwing more ram at
    >them. They can use it for both keys and objects.
    >
    >3. If your indexing needs become more complex, you can just add. You
    >don't have to start over from scratch.


    I expound on this theme at
    http://mindprod.com/jgloss/sql.html#ALTERNATIVES
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    http://thecovemovie.com : The Cove: a documentary about Japan's secret atrocities against dolphins.
    Roedy Green, Aug 18, 2009
    #9
  10. Stefan Ram

    Roedy Green Guest

    On 18 Aug 2009 20:21:08 GMT, -berlin.de (Stefan Ram)
    wrote, quoted or indirectly quoted someone who said :

    > Some people then made fun of my post, but now I read your
    > choice of the word »beast«, which supports my idea.


    I have a more elaborate version for frequently changing data called a
    Hermit Crab file. It was inspired by the metaphor of a hermit crab
    seeking out ever larger shells as it grows.

    see http://mindprod.com/jgloss/hermitcrab.html

    As RAM has become less expensive, the need for a light weight indexed
    record lookup has waned. You can just use a big hammer -- SQL.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    http://thecovemovie.com : The Cove: a documentary about Japan's secret atrocities against dolphins.
    Roedy Green, Aug 18, 2009
    #10
  11. Stefan Ram

    Lew Guest

    Roedy Green wrote:
    > 2. You have the overhead of all of SQL when you may need only a tiny
    > fraction of it.  That RAM would have been better used with more
    > specific code.


    <http://db.apache.org/derby/>
    > Derby has a small footprint -- about 2 megabytes
    > for the base engine and embedded JDBC driver.


    That's not a whole lot of "overhead" in the Java world. That RAM
    would have to be really, really, really better used with custom code
    to justify the effort, bug-hunting, maintenance and headache of not
    using a database engine.

    --
    Lew
    Lew, Aug 18, 2009
    #11
  12. Stefan Ram

    Lew Guest

    Stefan Ram wrote:
    >   I have this crazy idea to write applications that are based
    >   on Java SE only and not require any additional library.
    >   I know that this idea might not be very pragmatic or reasonable.
    >
    >   /If/ Derby would finally be included in Java SE, I would
    >   love to use it.


    Well, Derby does come standard with the Java SE JDK. Does that count
    as "included"?

    --
    Lew
    Lew, Aug 18, 2009
    #12
  13. "Roedy Green" <> wrote in
    message news:
    >
    > I expound on this theme at
    > http://mindprod.com/jgloss/sql.html#ALTERNATIVES


    A heads up!

    In that great document you write: "It is missing features you would expect
    such as the ability to traverse forward and back in result sets. "

    It is nowadays possible to create "random access" result sets.

    PreparedStatement stmt = conn.prepareStatement(
    "SELECT * FROM orders WHERE customer_id = ? ORDER BY order_time",
    ResultSet.TYPE_SCROLL_INSENSITIVE,
    ResultSet.CONCUR_READ_ONLY, ResultSet.CLOSE_CURSORS_AT_COMMIT);

    The TYPE_SCROLL_INSENSITIVE makes it. You can then

    resultSet.absolute(offset);

    I have used this with MySQL (while my app is supposed to work on any SQL db
    *)), and it seems to work and is fast anough. I use it to get a query, and
    to display it on 'screen' some lines at a time I just seek whereever I need.
    For each screenful I make the query again and again, and seek to the correct
    screen. This is in a Web application where I do not store any state between
    the screenfuls.


    *) The ResultSet has a getType() method, which tells if it supports this. If
    it has 'downgraded' the type, a slower approach may be needed.

    if (resultSet.getType() == ResultSet.TYPE_FORWARD_ONLY)
    {
    // The driver could not give us a scrollable ResultSet
    // but downgraded it to a TYPE_FORWARD_ONLY
    for (int i = 0; i < beginIndex; i++)
    {
    resultSet.next();
    }
    }
    else
    {
    // Move to the position one before the begin (for next() in
    // the loop will advance.
    resultSet.absolute(beginIndex);
    }
    Donkey Hottie, Aug 18, 2009
    #13
  14. Stefan Ram

    Tom Anderson Guest

    On Tue, 18 Aug 2009, Patricia Shanahan wrote:

    > Stefan Ram wrote:
    >> This should be a common need. Yet I am not aware of anything
    >> like it in Java SE. What is the most common (pure Java)
    >> solution to it?
    >>
    >> I would like to have an implementation of java.util.Map,
    >> which is constructed with an int ?m? and a java.io.File ?f?.
    >>
    >> It will use no more than ?m? bytes of memory, but ?swap? out
    >> (the least often used) entries to the file ?f?, when they do
    >> not fit into the given memory size anymore.

    >
    > Have you considered putting the data in a database instead, and using
    > java.sql to access it? The data structures and algorithms that Java uses
    > for in-memory maps are not very suitable for disk-based maps. Database
    > managers use structures and algorithms designed for the job.


    'The job' in question being relational data access. Stefan doesn't want
    that, he wants to do stores and lookups by key, and nothing else (well,
    that and removals, and iteration - but i would imagine the priority is
    fast storage and lookup). Yes, this is a subset of what you can do with a
    relational data store, but it's quite possible that an implementation
    which does keyed storage and nothing else will do it faster and more
    efficiently.

    The prototype of this is the ancient unix DBM:

    http://en.wikipedia.org/wiki/Dbm

    And its zillions of successors (which include java wrappers and pure java
    implementations - although i don't know if any have a Map interface).
    These are (generally held to be) quite a bit faster than an RDBMS.

    It's worth stressing, for anyone who hasn't fully grokked this, that
    RDBMSs are *not* fast. Not in the slightest. This is why none of the huge
    web players - Google, Yahoo, Facebook, etc - use them in their
    high-throughput apps. The reason RDBMSs are popular nonetheless is because
    they're fast enough for most purposes, they're reliable and trustworthy,
    and they're highly, exceptionally, *insanely* flexible.

    tom

    --
    The square-jawed homunculi of Tommy Hilfiger ads make every day an
    existential holocaust. -- Scary Go Round
    Tom Anderson, Aug 18, 2009
    #14
  15. Stefan Ram

    Tom Anderson Guest

    On Tue, 18 Aug 2009, Roedy Green wrote:

    > On 18 Aug 2009 20:21:08 GMT, -berlin.de (Stefan Ram)
    > wrote, quoted or indirectly quoted someone who said :
    >
    >> Some people then made fun of my post, but now I read your
    >> choice of the word ?beast?, which supports my idea.

    >
    > I have a more elaborate version for frequently changing data called a
    > Hermit Crab file. It was inspired by the metaphor of a hermit crab
    > seeking out ever larger shells as it grows.
    >
    > see http://mindprod.com/jgloss/hermitcrab.html
    >
    > As RAM has become less expensive, the need for a light weight indexed
    > record lookup has waned. You can just use a big hammer


    Or maybe a really fat armadillo?

    tom

    --
    The square-jawed homunculi of Tommy Hilfiger ads make every day an
    existential holocaust. -- Scary Go Round
    Tom Anderson, Aug 18, 2009
    #15
  16. Stefan Ram

    Tom Anderson Guest

    On Tue, 18 Aug 2009, Tom Anderson wrote:

    > On Tue, 18 Aug 2009, Patricia Shanahan wrote:
    >
    >> Stefan Ram wrote:
    >>> This should be a common need. Yet I am not aware of anything
    >>> like it in Java SE. What is the most common (pure Java)
    >>> solution to it?
    >>>
    >>> I would like to have an implementation of java.util.Map,
    >>> which is constructed with an int ?m? and a java.io.File ?f?.
    >>>
    >>> It will use no more than ?m? bytes of memory, but ?swap? out
    >>> (the least often used) entries to the file ?f?, when they do
    >>> not fit into the given memory size anymore.

    >>
    >> Have you considered putting the data in a database instead, and using
    >> java.sql to access it? The data structures and algorithms that Java uses
    >> for in-memory maps are not very suitable for disk-based maps. Database
    >> managers use structures and algorithms designed for the job.

    >
    > 'The job' in question being relational data access. Stefan doesn't want that,
    > he wants to do stores and lookups by key, and nothing else (well, that and
    > removals, and iteration - but i would imagine the priority is fast storage
    > and lookup). Yes, this is a subset of what you can do with a relational data
    > store, but it's quite possible that an implementation which does keyed
    > storage and nothing else will do it faster and more efficiently.


    And if you don't believe me - how about Oracle?

    http://www.oracle.com/technology/products/berkeley-db/je/index.html

    Relational databases are the most sophisticated tool available to the
    developer for data storage and analysis. Most persisted object data is
    never analyzed using ad-hoc SQL queries; it is usually simply retrieved
    and reconstituted as Java objects. The overhead of using a sophisticated
    analytical storage engine is wasted on this basic task of object
    retrieval. The full analytical power of the relational model is not
    required to efficiently persist Java objects. In many cases, it is
    unnecessary overhead. In contrast, Berkeley DB Java Edition does not have
    the overhead of an ad-hoc query language like SQL, and so does not incur
    this penalty.

    The result is faster storage, lower CPU and memory requirements, and a
    more efficient development process.

    That software is freeware; if i was going to implement a disk-backed map,
    it's where i'd start.

    tom

    --
    The square-jawed homunculi of Tommy Hilfiger ads make every day an
    existential holocaust. -- Scary Go Round
    Tom Anderson, Aug 18, 2009
    #16
  17. Stefan Ram wrote:
    > This should be a common need. Yet I am not aware of anything
    > like it in Java SE. What is the most common (pure Java)
    > solution to it?
    >
    > I would like to have an implementation of java.util.Map,
    > which is constructed with an int »m« and a java.io.File »f«.
    >
    > It will use no more than »m« bytes of memory, but »swap« out
    > (the least often used) entries to the file »f«, when they do
    > not fit into the given memory size anymore.


    If you look at
    http://mdr.netbeans.org/source/browse/mdr/src/org/netbeans/mdr/persistence/btreeimpl/ ,
    you'll find a Java implementation of a persistent map a co-worker and
    I wrote many years ago. I don't recall all of its features (and I'm
    too lazy to set up a CVS client and pull down the source to look at
    it), but basically

    * You can create, delete, update, and retrieve keyed records
    * The file is single-user but transactional -- none of the changes are
    saved until you issue a commit.
    * You can adjust the size of the cache. I think all changed records
    are help in memory, but unchanged ones can come and go
    * The index is a btree

    If you ever used a Btree repository in the old Forte application
    development system, it'll probably be clear that this code is a Java
    port of the same underlying technology.

    At the time I handed it off, it had a full set of unit tests, but one
    of the most annoying things about working on NetBeans was that I could
    never convince anyone else that unit tests were an essential part of
    the source code, as opposed to something to hand off to QA in case
    they felt like running them sometime. As a result, while the source
    code eventually got checked into this repository, the tests have
    apparently been lost.
    Mike Schilling, Aug 19, 2009
    #17
  18. Stefan Ram wrote:
    > This should be a common need. Yet I am not aware of anything
    > like it in Java SE. What is the most common (pure Java)
    > solution to it?
    >
    > I would like to have an implementation of java.util.Map,
    > which is constructed with an int »m« and a java.io.File »f«.
    >
    > It will use no more than »m« bytes of memory, but »swap« out
    > (the least often used) entries to the file »f«, when they do
    > not fit into the given memory size anymore.


    Not too long ago I wound up needing to implement a program that would
    store and retrieve large numbers of images, in an environment that
    precluded the use of a DBMS (which would've been overkill anyway).

    I basically created an on-disk HashMap quite simply: images were
    retrieved into a BufferedImage from a stream, to which a suitable
    DigestInputStream decorator had been attached before wrapping in
    ImageInputStream. Out comes the SHA-1 hash of the image, as well as the
    image itself.

    Then the SHA-1 is used to construct a path and filename, mkdirs is used
    on the path, and one ImageIO.write later, it's stored.

    Retrieval was equally simple, since the SHA-1s were then used as the
    unique identifiers for the images throughout the program and could
    easily be turned into a path and filename for use this time with
    ImageIO.read. Delete was equally simple. The only tricky part of CRUD
    for this prog was the "U" part, since altering an image would alter its
    hash, and thus its ID. (Sometimes ID had to be determined from a copy of
    the image, to avoid duplications that could waste many gigabytes.)
    Fortunately, alterations were much rarer than storage and retrieval, and
    it wasn't hard to have a higher level of abstraction that had its own
    identifiers, and that mapped to the SHA-1 that might have changed.

    FWIW, I've also implemented a disk-based deque supporting O(1) additions
    and removals at both ends (using sequentially numbered files as a
    backing "DiskArrayList", each file holding up to N entries). The
    constant factor overhead is large, since a linear search of up to N
    entries may be needed to work with the last entry in the queue, which is
    the last entry in the last file, and since I/O is much slower than RAM
    anyway. This disk-deque was needed for some application that had to
    store potentially millions of items and had to run where there wasn't
    much RAM, but was plenty of disk. Speed of access to the deque was of
    little concern because the tasks done would have been I/O bound anyway,
    and saving tens of megabytes of RAM usage at run-time was worth it. In
    fact, the application needed many of these deques, potentially running
    into gigs of total memory use, so the only two alternatives to having
    the deques themselves live on disk was having them be serialized to disk
    and loaded on demand (which would make things much slower -- reading a
    30 meg file to do a few things and then writing a 30 meg file is a lot
    slower than reading a few-KB file to do a few things and then writing a
    few-KB file!) or having them all in RAM and letting the system page like
    crazy (equally slow, and would make the UI unresponsive at times).
    Acceptable UI responsiveness under these circumstances dictated the need
    for a disk-based data structure. (Think "kiosk".)

    The disk-deque had some notion of transactions and commits, too, mainly
    to maintain data integrity; if the power went out in mid-operation the
    deque wouldn't be corrupted, though some changes to it might not have
    been saved. The disk-hashmap-of-images had a weak version that saved the
    image to a temp file and then renamed it when it was done, so the image
    appearing in the hashmap under its hash would happen atomically as far
    as any external process was concerned. A power failure in mid-save would
    result in the image not having been added, rather than a truncated
    version being in the map in the hash bucket for the un-truncated
    version. (Eeeuw.) So in both cases operations would have to be retried
    if there was an inconvenient crash or power failure but not corruption
    of data; a weak protection of data integrity and some weak atomicity.

    When you mission-critically need full ACID, though, you need a proper
    database engine, even with all the added complexity and overhead that
    implies.
    Gerald Murdock, Aug 19, 2009
    #18
  19. Stefan Ram

    Roedy Green Guest

    On Wed, 19 Aug 2009 00:51:01 +0300, "Donkey Hottie"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >In that great document you write: "It is missing features you would expect
    >such as the ability to traverse forward and back in result sets. "
    >
    >It is nowadays possible to create "random access" result sets.


    "Knowledge keeps no better than fish."
    ~ Alfred North Whitehead

    Thanks. Now fixed.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    http://thecovemovie.com : The Cove: a documentary about Japan's secret atrocities against dolphins.
    Roedy Green, Aug 19, 2009
    #19
  20. Stefan Ram

    Arne Vajhøj Guest

    Patricia Shanahan wrote:
    > Stefan Ram wrote:
    >> This should be a common need. Yet I am not aware of anything
    >> like it in Java SE. What is the most common (pure Java)
    >> solution to it?
    >>
    >> I would like to have an implementation of java.util.Map,
    >> which is constructed with an int »m« and a java.io.File »f«.
    >>
    >> It will use no more than »m« bytes of memory, but »swap« out
    >> (the least often used) entries to the file »f«, when they do
    >> not fit into the given memory size anymore.
    >>

    >
    > Have you considered putting the data in a database instead, and using
    > java.sql to access it? The data structures and algorithms that Java uses
    > for in-memory maps are not very suitable for disk-based maps. Database
    > managers use structures and algorithms designed for the job.


    Map is an interface and does not really specify algorithms or
    data structures. 15 seconds looking at Map indicates that most
    of the stuff could easily be implemented on top of a database.
    Only entrySet, keySet and values would require non-trivial code.

    But that is probably an unimportant question. The important question
    is: would having a database backed map provide any benefits over
    just using the database directly. I can not see any. A Map is
    more general but having memory access and disk access being
    transparent is a double edged sword.

    Arne
    Arne Vajhøj, Aug 22, 2009
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Dave Kawczynski
    Replies:
    1
    Views:
    3,786
    Dave Kawczynski
    Apr 19, 2004
  2. GG
    Replies:
    4
    Views:
    546
  3. Wendy Smoak

    cached backed or self-emptying map

    Wendy Smoak, Apr 29, 2005, in forum: Java
    Replies:
    2
    Views:
    443
    Wendy Smoak
    Apr 29, 2005
  4. Replies:
    5
    Views:
    1,286
    Scott Ellsworth
    Aug 11, 2005
  5. Replies:
    0
    Views:
    728
Loading...

Share This Page