Caching a large file - 2nd post

Chris

I posted this a couple of days ago, but got no responses and it didn't show
up in Google groups, so it might have gotten lost along the way. If not,
sorry for the redundancy...


I need extremely fast random access to an extremely large file. The file
could be 100 gigabytes in size; the machine will have at most 2 or 4 GB of
memory. Certain parts of the file will be accessed more frequently than
others. The question is, what is the most efficient way to handle caching?

I could organize the file in pages, and write some kind of LRU cache to hold
the most active pages in JVM memory. I'm thinking, though, that it might be
simpler and more efficient to let the operating system handle it through
normal disk caching.
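
For what it's worth, the page-cache approach I have in mind would look
roughly like this (just a sketch; the page size, cache size, and file name
are invented for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a page-based LRU cache held in JVM memory over a
// RandomAccessFile. PAGE_SIZE and MAX_PAGES are invented numbers.
public class PageCache {
    private static final int PAGE_SIZE = 64 * 1024;   // 64 KB pages
    private static final int MAX_PAGES = 4096;        // ~256 MB of cache

    private final RandomAccessFile file;

    // access-ordered LinkedHashMap evicts the least recently used page
    private final Map<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(MAX_PAGES, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                return size() > MAX_PAGES;
            }
        };

    public PageCache(String fileName) throws IOException {
        this.file = new RandomAccessFile(fileName, "r");
    }

    // returns the page containing the given offset, reading it on a miss
    public synchronized byte[] getPage(long offset) throws IOException {
        Long pageNo = Long.valueOf(offset / PAGE_SIZE);
        byte[] page = pages.get(pageNo);
        if (page == null) {
            page = new byte[PAGE_SIZE];
            file.seek(pageNo.longValue() * PAGE_SIZE);
            file.readFully(page);  // assumes the page lies fully inside the file
            pages.put(pageNo, page);
        }
        return page;
    }
}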

The advantages of letting the OS do the caching would be: 1) I won't have
to write caching code; 2) the cache memory will be owned by the operating
system and not the JVM, which means the user won't have to worry about
memory configuration; 3) the operating system can release disk cache memory
when it's needed for some other process; and 4) the operating system's disk
caching algorithms are probably much smarter than anything I could write.

The disadvantages are 1) accessing parts of the file still has to go through
the Java IO code, which will be slow compared to direct memory access, and
2) I don't know if this really works.

Thoughts?

Also -- does anyone know how Linux allocates memory to the disk cache? Does
it use all available memory? Or is there some way to configure it?
 
Will Hartung

Chris said:
I need extremely fast random access to an extremely large file. The file
could be 100 gigabytes in size; the machine will have at most 2 or 4 GB of
memory. Certain parts of the file will be accessed more frequently than
others. The question is, what is the most efficient way to handle caching?

I could organize the file in pages, and write some kind of LRU cache to hold
the most active pages in JVM memory. I'm thinking, though, that it might be
simpler and more efficient to let the operating system handle it through
normal disk caching.

If the files are that large, and you have a sufficiently sized sample, I'd
just go with FileChannels and ByteBuffers, and let the OS deal with the
details.
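
Something along these lines, maybe (just a rough sketch; the file name and
the 8 KB record size are made up for the example):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of random-access reads through a FileChannel, leaving it to the
// OS page cache to decide what stays in memory.
public class ChannelRead {
    public static void main(String[] args) throws IOException {
        FileChannel channel =
            new RandomAccessFile("/data/huge.dat", "r").getChannel();

        ByteBuffer buf = ByteBuffer.allocateDirect(8192);  // one "record"
        long offset = 12345L * 8192;            // wherever you need to read

        while (buf.hasRemaining()) {
            // positional read: does not disturb the channel's own position
            int n = channel.read(buf, offset + buf.position());
            if (n < 0) {
                break;                          // hit end of file
            }
        }
        buf.flip();
        // ... pull whatever you need out of buf ...

        channel.close();
    }
}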

HOWEVER, depending on what you're doing with that data it may well behoove
you to cache the RESULTS of what you derive from the data.

You don't mention the application, but you basically have two costs. One is
physical I/O, and the other is actually processing the data.

Take this extreme example. Say that you are simply calculating checksums for
blocks of read-only data. It is obvious that you should simply cache the
resulting checksums rather than rereading and recalculating the checksum
every time. It doesn't matter if the pages are cached; you still have to pay
the price of the calculation.

That kind of thing can readily be cached in an LRU, if that is practical.
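
For instance, roughly (a sketch only; the cache size is arbitrary, and CRC32
just stands in for whatever you actually derive from the data):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Sketch of caching the derived result (here a CRC) keyed by block number,
// so a block already seen never gets recalculated.
public class ChecksumCache {
    private static final int MAX_ENTRIES = 100000;

    // access-ordered LinkedHashMap drops the least recently used entry
    private final Map<Long, Long> checksums =
        new LinkedHashMap<Long, Long>(MAX_ENTRIES, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Long, Long> e) {
                return size() > MAX_ENTRIES;
            }
        };

    // block is the raw data for blockNo; it is only checksummed on a miss
    public synchronized long checksum(long blockNo, byte[] block) {
        Long cached = checksums.get(Long.valueOf(blockNo));
        if (cached != null) {
            return cached.longValue();
        }
        CRC32 crc = new CRC32();
        crc.update(block);
        checksums.put(Long.valueOf(blockNo), Long.valueOf(crc.getValue()));
        return crc.getValue();
    }
}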

Finally, before doing anything, set up some tests and do some benchmarks. Do
the simplest thing first, then try the more advanced tricks in test
environments that mimic what the app actually does, and go from there. Don't
pre-optimize before you know where the problems are.

Regards,

Will Hartung
([email protected])
 
Roedy Green

The disadvantages are 1) accessing parts of the file still has to go through
the Java IO code, which will be slow compared to direct memory access, and
2) I don't know if this really works.

Thoughts?

Cascading buffering systems is a bit of a waste. You want to put all
your space at one level.

Think about what would happen if you wrote 10 disk "optimisers" and
chained them, giving them each 1/10 of available ram to play with to
cache frequently used data. What happens? Each cache tends to fill
up with the same records. You are best off using just one; then you don't
have any duplicates in your cache.

Since the OS is already caching disk records, it does not pay to
duplicate its efforts. Better to give the extra RAM back to it.
Further, it has a more global view of what to keep in cache.

This is not completely true, since a small auxiliary cache can still be
helpful: it will typically be quite a bit faster than going through the
OS-level cache.

If it really matters, try it several ways and see in practice which
ACTUALLY works best. It is so easy to convince yourself why some
method has to be faster than another, but there are so many factors,
you can easily be wrong.
 
marcus

I love to read what roedy writes and translate it into how I would say
things.

Write the application in the simplest, most basic manner possible, then
benchmark it on numerous systems and find out where the bottlenecks
really are. Why waste your time solving a potential problem? Spend
your effort solving real problems and optimize on later revs.

BTW, the hard drive (RAID, SAN) has a cache as well, which is blazingly
fast. I'm thinking an end user who can afford 100 gigs of storage for
your application understands things like this.

Lastly, is a single file really the best place to store 100 GB? If it is
structured data, wouldn't a database serve your needs better? Even a
trivial one you write and manage yourself? A set of files, or trees of
files? Frankly, in my ignorance I cannot imagine 100 GB in a single file
serving any reasonable purpose. My thought is always to store data in
the manner in which I anticipate it will be used. (Roedy help me out here!)

Post-lastly, I believe the optimizations required to solve your problem
(for a single file) probably vary widely depending on whether your
end user is using a Unix-like file system or any of the various
Windows file systems, or something more exotic. Some file systems are
built for throughput, others for searching, others for transparency
over multiple types of underlying hardware. Some customers tweak their
boxes for fast I/O, some for massive file transfers. All of this will
affect performance, IMHO. If you don't know your end user, I would
leave this one alone.

-- clh
 
ak

The Bulk IO people say that they achieve speedups for array reading/writing
of over 60x. The latest version of Unified I/O gives speedups for array
reading/writing between 80x and 130x. However, I haven't yet tested it with
really HUGE files.
 
Markus Schaber

Hi, Chris,

Also -- does anyone know how Linux allocates memory to the disk cache?
Does it use all available memory? Or is there some way to configure
it?

Linux uses (nearly) all available memory, but it also utilizes some more
sophisticated tricks, e.g. it writes out pages that were not accessed
for some time so that it can quickly free them when memory is needed.

In /proc, there are some entries where you can tune some parameters, e.g.
the minimum number of pages that Linux tries to keep free (so it doesn't
have to wait for purging when allocating), or the swappiness (the
likelihood of Linux swapping out long-unused data to make more space for
disk buffers).

Googling or reading the lkml archive should give you details on this.

HTH,
Markus
 
Roedy Green

My thought is always to store data in
the manner in which I anticipate it will be used. (Roedy help me out here!)

Generally people asking questions that imply a big budget have to be
tight-lipped about what they are doing. Usually the best you can do
is answer their questions literally. Of course the more they can
divulge about what they are doing the more imaginative and
broad-ranging the responses can be.
 
Chris

Roedy Green said:
Generally people asking questions that imply a big budget have to be
tight-lipped about what they are doing. Usually the best you can do
is answer their questions literally. Of course the more they can
divulge about what they are doing the more imaginative and
broad-ranging the responses can be.

You're right; I can't really say what the app is. But think search engine:
large amounts of data that need to be searched quickly. Disk seek time and
I/O dominate.

It is a database-like app, but a traditional SQL database is much too slow
(yes, we've benchmarked it). The file format is database-like, but mostly
read-only.

Maybe these extra details will trigger a few ideas...
 
Chris

And one other thing: having 100 separate 1 GB files would *definitely* be
more manageable. I just don't think that 100 smaller operations could
possibly be as fast as one large operation. But we will do the benchmarks.
 
Roedy Green

Maybe these extra details will trigger a few ideas...

Here's a little free brainstorming for you...

Silicon Graphics years ago developed special disk controllers to do
some processing as the data was flying by the heads. Perhaps the idea
may be revived for use by the giant search engines.

In the olden days, we used to always think in terms of cylinders and
tracks, and rotational latency, and would plot the trajectories of the
disk heads to do each transaction. You would then place files at
absolute positions and interleave files to minimize head motion. Disk
I/O works much better on cylinder and track boundaries.

You would place different files on different drives so that heads
could be seeking while i/o was happening on other drives. Back then
only one drive at a time could be transferring.

Since the physical drives are SO slow relatively, and the need for
speed so acute, perhaps it is time to go back to these old ways.

You might look into my idea of Marthas -- continuous defragging and write
ordering that keeps data arranged LRU by track. I envisioned them in
hardware, but you could do them in software. See
http://mindprod.com/jgloss/defragger.html

The basic idea is that a background task keeps moving dull data to the
not-so prime real estate, freeing it up for fresh writes. You might
even use a two armed disk, with one arm almost stationary reading and
writing active data, and the other shuffling data in and out of the
prime real estate. The background arm elevators back and forth like a
bus, picking up and dropping off tracks, transporting them to
locations more fitting their LRU status. RAM tracks the physical
location of all the logical tracks. The problem is crashing. You lose
track of everything if you are not careful.

I have been pushing this idea of marthas since around 1985. I am
surprised it still has not shown up in high end disk controllers. I
sent it to all the disk makers years ago.

I'm not sure what happens in a modern disk if you do a full track
read. Is it smart enough to start transferring right away, or is it
stupid and waits until sector 0 comes around? Perhaps someone buying
trainloads of disks could prod the disk controller makers to add this
cleverness if it is not there already. You might convince them to add
hardware marthaing as well.

I assume that search engines work by having massive amounts of RAM,
and each server on the farm looks after its corner of the universe.
The big job is to farm the request to the servers that have the info
in RAM and then collect it back together.

Google tends to keep feeding back the same sites for all queries. This
makes their caching job easier.

You might design a search engine around 64-bit java and use a
humongous address space with all data memory-mapped. Then you let the
OS deal with the caching, and the GC deal with keeping active objects
densely packed in the virtual address space.

Since the data is mostly read-only, you could afford a fair bit of
compression. see http://mindprod.com/projects/supercompressor.html


See http://mindprod.com/projects/htmlcompactor.html. Google could
spider faster if pages were more compact. "Bribe" websites to keep
their pages compact and properly dated.

enough for now...
 
Roedy Green

You might look into my idea of Marthas -- continuous defragging and write
ordering that keeps data arranged LRU by track. I envisioned them in
hardware, but you could do them in software.

see http://mindprod.com/jgloss/martha.html

here is an essay I wrote back in the OS/2 days on the idea.

Martha, a new class of disk accelerator software

Martha Stewart is an American TV homemaker whose idea of a
good time is to sort and organize her attic.

So I name this new class of disk accelerator software
after her:

A Martha is a program that runs at low priority in the
background. It examines the LAST ACCESS date of HPFS
files. It moves files that have not been used in a long
time to the outer two edges of the disk drive. This frees
up space near the centre. As a side effect, it moves
recently accessed files toward the centre. The net effect
is reducing the head travel needed to access the current
working set of files, which might have the equivalent
effect of a three times faster hard disk.

It may be difficult to implement an effective Martha
because I don't think HPFS makes any provision to move a
file in active use by another process. However, even a
Martha that only worked while everyone else slept could
still be very effective.

Hardware Implementation of a Martha

Not to worry. This idea is prior art in the public
domain. I have posted this idea electronically many times
over the last decade. The BIX electronic conference
maintains a record of all postings. The idea needs a
small processor in the SCSI disk drive and a CMOS memory.
Its operation is totally transparent to the operating
system.

For simplicity of explanation, I will presume there are an
equal number of sectors in each track. Each time you
access a track, you "timestamp" it in an array indexed by
logical track number and stored in CMOS. This is not an
actual time of day, just an incremental counter.

Whenever the disk is idle, you physically swap tracks so
that tracks that have not been accessed in a long time are
near the outside edges, and ones accessed recently are
near the middle. You maintain an index by logical track
number translating to physical track number in CMOS. You
also store the logical track number on each track on disk
along with the timestamp so that the CMOS table can be
reconstructed in case it is damaged. You can make the
thing crash proof by always making the copy before you
update the CMOS table.
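
(In software terms, the bookkeeping amounts to roughly this; a toy sketch
in Java rather than controller firmware, with the track count and the
"swap while idle" policy invented for illustration:

// Toy sketch of the per-track bookkeeping a Martha would keep.
public class MarthaTable {
    private static final int TRACKS = 100000;       // invented track count

    private final long[] lastAccess = new long[TRACKS]; // per logical track
    private final int[] logicalToPhysical = new int[TRACKS];
    private long clock = 0;                          // incremental counter

    public MarthaTable() {
        for (int i = 0; i < TRACKS; i++) {
            logicalToPhysical[i] = i;                // identity to start
        }
    }

    // called on every read or write of a logical track
    public void touch(int logicalTrack) {
        lastAccess[logicalTrack] = ++clock;
    }

    // called while the disk is idle: a cold track migrates toward the
    // edges and a hot track toward the middle (the data move is elided)
    public void swap(int coldLogical, int hotLogical) {
        int tmp = logicalToPhysical[coldLogical];
        logicalToPhysical[coldLogical] = logicalToPhysical[hotLogical];
        logicalToPhysical[hotLogical] = tmp;
    }

    public int physicalTrack(int logicalTrack) {
        return logicalToPhysical[logicalTrack];
    }
}
)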

The net effect is to migrate the rarely accessed junk
dynamically to the outer edges. This means that in
practice the disk head will most of the time only have to
traverse a very small part of the entire travel. No
matter how the disk is logically organized, it moves
toward an efficient layout.

You can even swap tracks on the fly by having a low
priority background process clear out old junk from the
central tracks and schedule any full track write directly
to the freed central tracks, then post a "free" to its old
position in the CMOS table.

I'm sure you can handle the implementation details,
complications and refinements. The key is, the disk could
be three times faster without any improvement to
mechanicals or changes to the operating system drivers.
It would work better than a "Martha" utility optimising a
single partition since the Adaptec method would work
globally across multiple partitions on the same physical
disk even if they ran under differing operating systems.

Most users are ZIP file packrats and have a ton of
material rarely accessed. Even active applications have
huge amounts of material rarely accessed -- such as help
files or initialization code. The effects may be much
more spectacular in practice than you would hope for.

In any case, they would make you look extremely good on
any benchmark!
 
Liz

There is a commercial defrag package whose name starts
with a P (can't remember it right now) that puts less-used
data at the beginning/end of the disk.
 
Liz

Chris said:
I need extremely fast random access to an extremely large file. The file
could be 100 gigabytes in size; the machine will have at most 2 or 4 GB of
memory. Certain parts of the file will be accessed more frequently than
others. The question is, what is the most efficient way to handle caching?

Bell Labs / Lucent Tech. has a product that is perfect for you. It is called
DataBlitz. I don't know the cost, but it's not free.
 
Roedy Green

There is a commercial defrag package whose name starts
with a P (can't remember it right now) that puts less-used
data at the beginning/end of the disk.

see http://mindprod.com/jgloss/defragger.html

You are probably thinking of PerfectDisk.

What I am suggesting is a continuously running low-priority defragger
that is not actually a defragger so much as a cacher that keeps data
in bands by time of last use. It would happily scatter files all over
if parts of them were highly used but other parts were not. It
maintains last-use data on a per-track basis.

If you can keep sufficient free prime real estate, all writes can go
to it. You don't move the arms to the nether regions of the disk for
a write; you effectively move the data to prime real estate on every
write. Think of a mostly-write database: its arms would almost never
move to write if the background task could keep sufficient free space
in the prime real estate.
 
Roedy Green

A seek is a seek is a seek, 1/2 rotation time.

A seek is the time to move the head to the correct cylinder, typically
a 1/3 full-span seek, plus 1/2 rotation of rotational latency.

If you do full track reads and the disks have what IBM used to call
Rotational Positional Sensing, you can start reading almost right
away, simply by reading the sectors in a slightly different order.
I don't know if this exists in today's desktop machines.
 
Michael Borgwardt

Sounds like a perfect opportunity to make use of the memory-mapped files
in java.nio.
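
Roughly like this, say (a sketch only; note a single MappedByteBuffer tops
out at 2 GB, so for a 100 GB file you would map windows of it -- the file
name, offset, and window size here are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of mapping a window of the big file rather than the whole thing,
// since one MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes.
// The file is assumed to be at least windowStart + windowSize bytes long.
public class MappedWindow {
    public static void main(String[] args) throws IOException {
        FileChannel channel =
            new RandomAccessFile("/data/huge.dat", "r").getChannel();

        long windowStart = 40L * 1024 * 1024 * 1024;  // 40 GB into the file
        long windowSize  = 512L * 1024 * 1024;        // map 512 MB of it

        MappedByteBuffer window =
            channel.map(FileChannel.MapMode.READ_ONLY, windowStart, windowSize);

        // reads are relative to the start of the window; the OS pages the
        // data in and out behind the scenes
        byte b = window.get(123456);
        System.out.println(b);

        channel.close();
    }
}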
 
