Extremely large Hashtable

anthony

Has anyone created an extremely large Hashtable? I need to create a
simple lookup table (key/value) of some information.

I'm assuming that if it is in memory it will be faster than looking
this value up in a database.

The problem is I have about 4,000,000 rows of data. Assuming I have
enough memory to hold it, will a hashtable break being that large? Will
it be fast? Each row will only be about 10k in size.

Is there a better way?

Thanks,
AJS
 
Betty

anthony said:
Has anyone created an extremely large Hashtable? I need to create a
simple lookup table (key/value) of some information.

I'm assuming that if it is in memory it will be faster than looking
this value up in a database.

The problem is I have about 4,000,000 rows of data. Assuming I have
enough memory to hold it, will a hashtable break being that large? Will
it be fast? Each row will only be about 10k in size.

Is there a better way?

Thanks,
AJS
Surely you jest: 4,000,000 * 10,000 = 40,000,000,000 bytes.
 
Chris Uppal

anthony said:
The problem is I have about 4,000,000 rows of data. Assuming I have
enough memory to hold it, will a hashtable break being that large?

There's no reason why a hashtable with that many entries should not function
perfectly well. The critical question is whether the hash function is good
enough to give a reasonable spread among so many entries. I'd guess that the
chances are that it is.
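For illustration, a minimal sketch of such a table in Java, pre-sized so it never
has to rehash while the four million entries are loaded (the String key and value
types are just placeholders; java.util.Hashtable accepts the same two constructor
arguments if you prefer it over HashMap):

    import java.util.HashMap;
    import java.util.Map;

    public class BigLookup {
        public static void main(String[] args) {
            int expectedEntries = 4000000;
            // Capacity chosen so that 4,000,000 entries stay under the default
            // 0.75 load factor, avoiding any rehashing while the table is filled.
            Map<String, String> table =
                new HashMap<String, String>((int) (expectedEntries / 0.75f) + 1, 0.75f);
            table.put("some name", "some address");
            System.out.println(table.containsKey("some name")); // prints: true
        }
    }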

But...

Each row will only be about 10k in size.

So you are talking about 40 GBytes of memory in total? Unless you have very
specialised requirements -- and have the cash to buy specialised machinery to
implement those requirements -- that sounds a /little/ on the large side to
me...

I'm almost certain that you'd be better off using a proper database. They are
optimised for handling large amounts of data without using unreasonable (large,
yes, but not unreasonable) amounts of memory.

-- chris
 
Eric Sosman

anthony said:
Has anyone created an extremely large Hashtable? I need to create a
simple lookup table (key/value) of some information.

I'm assuming that if it is in memory it will be faster than looking
this value up in a database.

The problem is I have about 4,000,000 rows of data. Assuming I have
enough memory to hold it, will a hashtable break being that large? Will
it be fast? Each row will only be about 10k in size.

Four megarows times ten kilobytes per row equals forty
gigabytes. Are you sure you have that much RAM available?
If you don't, you'll just trade database I/O for swap I/O.

Where does this 40GB come from? If you can read it at
100 MB/s (from a well-tuned RAID stripe, say), it'll take
about seven minutes to pull in all the data and load the
table. Divide this seven minutes by the expected savings
per query to get the number of queries you must process in
order to recoup the startup time.
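As a rough back-of-envelope version of that break-even calculation (the 5 ms
saving per query is purely an assumed figure for illustration, not something
from the thread):

    public class BreakEven {
        public static void main(String[] args) {
            long dataBytes = 4000000L * 10000L;        // about 40 GB of row data
            double loadSeconds = dataBytes / 100.0e6;  // about 400 s at 100 MB/s, roughly seven minutes
            double savedPerQuery = 0.005;              // assumed: 5 ms saved per query versus a database hit
            long breakEven = Math.round(loadSeconds / savedPerQuery);
            System.out.println(breakEven + " queries to recoup the load time"); // about 80,000
        }
    }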

Note that the 10 KB row size is irrelevant to the table's
performance (unless it means that the keys' equals() and
hashCode() methods are slow). The table contains only the
references to the objects (that's another 7 GB, at a guess),
not the objects' data fields.

The above are just a few miscellaneous thoughts -- maybe
if you described your problem in more detail someone would be
able to comment more particularly.
 
anthony

Would an in-memory database product be any better, or would it be just
about the same?

Basically you can think of it like a phone book system, keyed on a
name. So I have a name and an address. I would use the system to
verify that the name is a known name. I need it to run as fast as
humanly possible. Would a simple array be faster (assuming I could
replace the key with something numeric)?

AJS
 
Betty

anthony said:
Would an in-memory database product be any better, or would it be just
about the same?

Basically you can think of it like a phone book system, keyed on a
name. So I have a name and an address. I would use the system to
verify that the name is a known name. I need it to run as fast as
humanly possible. Would a simple array be faster (assuming I could
replace the key with something numeric)?

AJS

An in-memory database would need at least as much memory unless
you can find one that pages to disk. Of course you would want to
create an index so you don't have to sequentially search the key
attribute ;-)
 
Eric Sosman

anthony said:
Would an in-memory database product be any better, or would it be just
about the same?

Basically you can think of it like a phone book system, keyed on a
name. So I have a name and an address. I would use the system to
verify that the name is a known name. I need it to run as fast as
humanly possible. Would a simple array be faster (assuming I could
replace the key with something numeric)?

"As fast as humanly possible ..." Give me "all the
money in the world" and I'll build it for you.

In other words, you've got to get more specific about
your requirements. How fast do you need the response to
be? To put it another way, if one millisecond is too slow
how much money are you prepared to spend to reduce it to
a microsecond? To a nanosecond? To an attosecond (which
will require much more than "all the money in the world")?
 
anthony

I hear you, but I mean: within the confines of the language, what would
be the fastest way of accessing in-memory data for such a large list of
values?

AJS
 
Boudewijn Dijkstra

Eric Sosman said:
Four megarows times ten kilobytes per row equals forty
gigabytes. Are you sure you have that much RAM available?
If you don't, you'll just trade database I/O for swap I/O.

I believe the OP was talking about keeping the hashtable in memory, not the
entire database.

Eric Sosman said:
Where does this 40GB come from? If you can read it at
100 MB/s (from a well-tuned RAID stripe, say), it'll take
about seven minutes to pull in all the data and load the
table. Divide this seven minutes by the expected savings
per query to get the number of queries you must process in
order to recoup the startup time.

Right. Mind that, to binary-search for such a record in such a database at a
load of 75%, it takes about 39 disk accesses, each with a compare of two 10 kB
blocks. Subtract from this about 4/3 disk accesses plus a hashcode
computation to get the expected savings.

Eric Sosman said:
Note that the 10 KB row size is irrelevant to the table's
performance (unless it means that the keys' equals() and
hashCode() methods are slow). The table contains only the
references to the objects (that's another 7 GB, at a guess),
not the objects' data fields.

Four million times 8 bytes (4-byte hashcode, 4-byte index) at a load of 75% is
still less than 40 MB. What is your guess based on?
 
Eric Sosman

anthony said:
I hear you, but I mean: within the confines of the language, what would
be the fastest way of accessing in-memory data for such a large list of
values?

STILL no specifics ... Well, I'll do what I can.

"Within the confines of the language" -- well, you've
already said you want to use a Hashtable, so that's that.
HashMap might be better -- or not; depends on what you're
doing, and I still don't know enough.

40 GB of "raw data" plus a few more GB of object
references and other such "metadata" -- this implies a
JVM with 64-bit addressing.

You'll also need about 64 GB of RAM to hold all that
data, metadata, your classes, the JVM, and the O/S. The
machine isn't going to do much of anything besides serving
up the data for you.

You'll probably need a 64-bit O/S to manage that much
memory -- it's possible for an O/S to manage more memory
than it can address, but that style of thing seems to have
fallen out of favor. Solaris, AIX, some Linux distros,
maybe that brand-new Windows (if you can find a 64-bit
JVM for it). Dunno about Mac; dunno about *BSD; not sure
about zLinux. OS/2 fans need not apply.

Alternatively, you could break up the data into, say,
sixteen chunks of a quarter-million rows each (2.5 GB)
and spread the load across sixteen machines with 4 GB of
RAM running 32-bit JVMs and the O/S of your choice. A
seventeenth machine could field the query and route it,
or you could submit each query to all sixteen servers and
"batch" the results. This could be pretty fast if you
tie the machines together in something like an Infiniband
fabric.

Still not fast enough? Okay: Deploy a hundred such
machines each handling 40,000 rows, and architect the
routing for high parallelism. Or use a hundred machines
each handling 400,000 rows (so each row appears on ten
different machines) so you can route each query to the
least-busy machine that can satisfy it.

Still not enough? No problem: Break open the piggy
bank, and Sun or IBM or SGI or Cray or somebody will build
you a great big computing grid populated with the biggest,
baddest iron they make. It'll help if you live near a
hydroelectric or nuclear power plant; this solution may
require a little more electricity than ordinary office
wiring can supply.

Do you see yet, anthony, why "What's the fastest" is the
wrong question? You've removed all other constraints, so
you get answers that may be in some sense "right" but are
in no sense "useful" -- I think I probably exceeded your
likely budget several paragraphs ago. The right question
is something like "How do I keep the average (or worst case,
or 90th percentile) response below X milliseconds, and can
it be done with fewer than Y systems on a budget of Z?"
That's something people can deal sensibly with -- but these
"The sky's the limit" questions get nowhere, and rapidly.
 
David McDivitt

Subject: Re: Extremely large Hashtable
Date: Fri, 06 May 2005 15:32:47 -0400

Eric Sosman said:
"As fast as humanly possible ..." Give me "all the
money in the world" and I'll build it for you.

In other words, you've got to get more specific about
your requirements. How fast do you need the response to
be? To put it another way, if one millisecond is too slow
how much money are you prepared to spend to reduce it to
a microsecond? To a nanosecond? To an attosecond (which
will require much more than "all the money in the world")?

Use a database and keep the connection hot, possibly instantiating the
driver and connection object within the lookup class. This is not a good
practice but it would allow for minimum overhead. If a database error
occurred, destroy the objects, recreate, and try again. This would be the
fastest possible solution. Beyond that you may need to optimize the database
for fastest access, but it is doubtful that would be an issue. You would
need a close method to destroy everything to be called at application
shutdown.
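A bare-bones sketch of that "hot connection" idea; the JDBC URL, the
credentials, and the names(name) table are hypothetical placeholders, not
anything specified in the thread:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class NameLookup {
        private final Connection conn;
        private final PreparedStatement query;

        public NameLookup(String url, String user, String password) throws SQLException {
            // Connection and statement are created once and kept hot.
            conn = DriverManager.getConnection(url, user, password);
            query = conn.prepareStatement("SELECT 1 FROM names WHERE name = ?");
        }

        /** Returns true if the name is known. */
        public synchronized boolean isKnown(String name) throws SQLException {
            query.setString(1, name);
            ResultSet rs = query.executeQuery();
            try {
                return rs.next();
            } finally {
                rs.close();
            }
        }

        /** Call once at application shutdown. */
        public synchronized void close() throws SQLException {
            query.close();
            conn.close();
        }
    }

On a database error the caller would discard this object, build a new one, and
retry, as described above.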
 
John C. Bollinger

anthony said:
I hear you, but I mean: within the confines of the language, what would
be the fastest way of accessing in-memory data for such a large list of
values?

If you have keys with good hash codes then a HashMap is probably among
the best choices for lookup performance on in-memory data. The problem
is that you have too much data to keep it all in memory at once on any
conceivable 32-bit computer, so the question is probably moot. If you
have available a 64-bit computer with >~ 50GB of RAM, a 64-bit OS, and a
64-bit JVM then the situation might be different, but in that case you
should also have adequate expertise in-house that you wouldn't need to
pose your question here. (And I note that the time to load the whole
behemoth into memory from disk would be very substantial, too -- I hope
an application startup time measured in tens of minutes is acceptable.)

What you are postulating is silly. Management of such large amounts of
data is what traditional databases are _for_.
 
David McDivitt

Subject: Re: Extremely large Hashtable
Date: 6 May 2005 13:36:01 -0700

anthony said:
I hear you, but I mean: within the confines of the language, what would
be the fastest way of accessing in-memory data for such a large list of
values?

That would be a binary search of already sorted keys. Probably no more than
seven or eight tests would be required to find any key. An alternative would
be to use a flat file having presorted fixed length records. Instead of
accessing memory you would read whatever offset into the file, which would
be essentially the same thing, and it would only take seven or eight reads
to get the desired record. To update you simply place a new flat file on the
server, created by another process. If I really needed speed and infinite
size I would do it this way.
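A rough sketch of that flat-file approach, assuming fixed 10 kB records sorted
by a fixed-width key field at the start of each record (the key width and
charset are made-up details for illustration):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FlatFileLookup {
        static final int RECORD_LEN = 10 * 1024;  // 10 kB per row, as in the thread
        static final int KEY_LEN = 64;            // assumed fixed-width, space-padded key field

        /** Binary search over the sorted, fixed-length records. */
        public static boolean contains(RandomAccessFile file, String key) throws IOException {
            byte[] buf = new byte[KEY_LEN];
            long lo = 0;
            long hi = file.length() / RECORD_LEN - 1;
            while (lo <= hi) {
                long mid = (lo + hi) / 2;
                file.seek(mid * RECORD_LEN);      // one read per probe
                file.readFully(buf);
                int cmp = new String(buf, "US-ASCII").trim().compareTo(key);
                if (cmp == 0) return true;
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return false;
        }
    }

As the follow-ups below point out, with 4,000,000 records this takes on the
order of 22 probes rather than seven or eight.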
 
Murat Tasan

David said:
That would be a binary search of already sorted keys. Probably no more than
seven or eight tests would be required to find any key.

In the original post, it was said that this will be used to search for
existence in the database, indicating that probably a good number of
searches will fail. This means that the average case for a search will
trickle down near the leaves of this binary search.

log(4e6) / log(2) = 22 (approximately).

Thus, generally, around 22 comparisons will be needed. Now, to decide
which is the better solution, one needs to examine the hashtable properties.
How long does it take to compute the hash function? Then, how are
collisions resolved? Assuming chaining, you need to find the average
length of the lists.

If the function takes x operations (where each operation is about the
cost of a comparison), and the average length of each chain will be y,
then the average number of comparisons for a successful search will be x +
(y / 2).

Although, in this case, since search will probably involve many
non-existent entries in the database, the average hashtable cost will be
closer to x operations.

If x << 22, then use a hashtable.

Store the actual records on disk and keep references only in the table.

If each disk reference is 4B, and each key is 4B, then the size of the
table is approximately 4e6 * (4B + 4B) = 32e6B, or 32MB. Actually 32MB
is certainly a lower bound, and in my experience the overhead of the
structures is often larger than the size of the stored data. But assuming
512MB or even 1GB of main memory (a reasonable assumption these days),
this is certainly a feasible project.

Now, if you are getting a lot of positive hits (i.e. searches that match),
you will probably be bottlenecked by disk I/O. But that's a whole 'nother
story...

murat
 
Wibble

If your keys are actually names, consider using SOUNDEX. It
makes for very tight key compression and it's tolerant of
spelling errors.
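For example, assuming Apache Commons Codec is on the classpath (an assumption,
not something mentioned above), its Soundex encoder maps similar-sounding names
to the same four-character code:

    import org.apache.commons.codec.language.Soundex;

    public class SoundexDemo {
        public static void main(String[] args) {
            Soundex soundex = new Soundex();
            System.out.println(soundex.soundex("Robert")); // R163
            System.out.println(soundex.soundex("Rupert")); // R163, same code for a similar-sounding name
        }
    }

The flip side is that different names can share a code, so an exact membership
check would still need the full names stored somewhere behind the Soundex key.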
 
Chris Uppal

David said:
That would be a binary search of already sorted keys. Probably no more
than seven or eight tests would be required to find any key.

If a sufficiently good hash function is available, and space is not a problem,
then a big hash table can /always/ beat a big binary lookup. Hashtables can be
made O(1) (with more-or-less any desired constant factor, trading space for
speed), whereas binary lookup is O(log n).

-- chris
 
Chris Uppal

John said:
(And I note that the time to load the whole
behemoth into memory from disk would be very substantial, too -- I hope
an application startup time measured in tens of minutes is acceptable.)

You could store file-offsets in the main hashtable, and only load the row data
from the backing file on-demand. If you also arrange some sort of LRU to evict
row data at need then you could even manage to run the thing on a 32-bit box.
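A rough sketch of that arrangement (the row length, cache size, and file layout
are invented for illustration): a HashMap from key to file offset, a
RandomAccessFile holding the row data, and an access-ordered LinkedHashMap doing
duty as the LRU cache.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OffsetTable {
        private static final int ROW_LEN = 10 * 1024;  // 10 kB rows, as in the thread
        private static final int CACHE_SIZE = 10000;   // rows kept in memory at once (assumed)

        private final Map<String, Long> offsets = new HashMap<String, Long>(); // key -> file offset
        private final RandomAccessFile file;
        private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(CACHE_SIZE, 0.75f, true) {       // access-order LRU
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > CACHE_SIZE;
                }
            };

        public OffsetTable(RandomAccessFile file) {
            this.file = file;
        }

        public void index(String key, long offset) {
            offsets.put(key, offset);
        }

        /** Returns the row for the key, reading it from disk only on a cache miss. */
        public synchronized byte[] lookup(String key) throws IOException {
            Long offset = offsets.get(key);
            if (offset == null) {
                return null;                 // unknown key
            }
            byte[] row = cache.get(key);
            if (row == null) {
                row = new byte[ROW_LEN];
                file.seek(offset.longValue());
                file.readFully(row);
                cache.put(key, row);
            }
            return row;
        }
    }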

Needless to say, for most purposes, that would just be a slower, and more
difficult, way of getting only half the benefits of using a real database.
However, there /are/ circumstances where roll-your-own can provide substantial
benefits, but they are pretty rare (although, oddly enough, I've had to do it
twice).

-- chris
 
William Brogden

anthony said:
Would an in-memory database product be any better, or would it be just
about the same?

Basically you can think of it like a phone book system, keyed on a
name. So I have a name and an address. I would use the system to
verify that the name is a known name. I need it to run as fast as
humanly possible. Would a simple array be faster (assuming I could
replace the key with something numeric)?

AJS

If you just need to verify that the name is known, look into the
extremely compact way that spell checkers store words. Essentially
you would have a spell checker full of names.

Bill
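One compact membership structure in that spirit is a Bloom filter (my
suggestion, not necessarily the spell-checker technique Bill has in mind). A
sketch assuming Google Guava is available: with 4,000,000 names and a 0.1%
false-positive rate the filter needs only around 7 MB, but a "yes" answer is
only probably right, so positive hits may still want a confirming lookup.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class NameFilter {
        public static void main(String[] args) {
            // 4,000,000 expected names at a 0.1% false-positive rate: roughly 7 MB of bits.
            BloomFilter<String> names = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 4000000, 0.001);
            names.put("John Smith");
            System.out.println(names.mightContain("John Smith"));  // true
            System.out.println(names.mightContain("Jane Doe"));    // almost certainly false
        }
    }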
 
Eric Sosman

David said:
That would be a binary search of already sorted keys. Probably no more than
seven or eight tests would be required to find any key.

lg(4000000) ~= 22, hence an average of ~21 probes for
a successful search, ~22 for worst-case success or for an
unsuccessful search.

David said:
An alternative would
be to use a flat file having presorted fixed length records. Instead of
accessing memory you would read whatever offset into the file, which would
be essentially the same thing, and it would only take seven or eight reads
to get the desired record.

To choose one of 4000000 items in "seven or eight" probes,
each probe needs to produce 6.7-8.8 outcomes. An eight-way
tree would do the job in ~6.3 probes average, ~7.3 worst case,
which more or less fits your description -- except that this
is the first time I've seen an eight-way tree described as a
"flat file!"

David said:
To update you simply place a new flat file on the
server, created by another process. If I really needed speed and infinite
size I would do it this way.

Seven or eight I/O's per lookup -- or twenty-one if you use
binary search -- is not "speed" in my book. It'll surely take
at least as long (probably longer) than the database the O.P.
already rejected as too slow. (Well, not exactly: He rejected
it as "not the fastest;" we still don't know how much speed he
actually needs.)
 
