performance and memory usage.

A

arnold

Hi

I created a test program to try the speed and size management of the JDK
HashMap. With regards to this I was wondering if anybody has any
comments on whether this is an appropriate way of testing this
performance, i.e. how realistic is the results with respect to caching
influence of data, heap management etc. The code is simple, but would it
produce a correct result. (the program finishes in 11 seconds on my
machine). the code is attached.

Secondly, I was wondering about the memory (heap) usage in java for this
program. It needs at least 256 MB of heap to run to finish. cmd:

java -Xms32m -Xmx256m -cp target/classes/ App

Using -Xmx128m causes OutOfMemoryError.

When calculating on the size of the datastructure i find approx that

2M Dto's = 16MB
2M Integer's for map key = 8MB
2M Integer's for ArrayList = 8MB

Thats a total of 32MB plus some megabytes for the objects,
datastructures, JVM and so on. So probably about 48-64MB in total.
Thats a completely different number from 128 MB or 256 MB.

Does anybody have any comments on why the program requires that much memory?

arnie

*****************

/* Create 2 million Dto objects with random data and insert into HashMap
also add key to an iterable list */
public void test1() {

int size = 2000000;
Map<Integer, Dto> hmap = new HashMap<Integer, Dto> (size);
Collection<Integer> alist = new ArrayList<Integer> (size);
Random r = new Random();

Dto data;
for (int c=size; c>0; c--) {

data = new Dto();
data.ip= r.nextInt(2000000000);

hmap.put(data.ip, data);
alist.add(data.ip);
}

int c = 1;
for (Integer d : alist) {
hmap.get(d);
}
}


/* Simple data placeholder */
public class Dto
{
public int ip;
public int serialnum;

}
 
C

Chris Smith

arnold said:
java -Xms32m -Xmx256m -cp target/classes/ App

Using -Xmx128m causes OutOfMemoryError.

When calculating on the size of the datastructure i find approx that

2M Dto's = 16MB
2M Integer's for map key = 8MB
2M Integer's for ArrayList = 8MB

First of all, your Dto class requires 8 bytes just for fields alone.
Add to that at LEAST 8 bytes for an object header. That's 32MB at a
minimum. The ArrayList needs four bytes per entry just for a pointer to
the Integer object... that's another 8 MB. The Integer object itself
has 4 bytes of data, and again at least 8 bytes of object header, so
that's 24 MB. The HashMap data storage is complicated to estimate, but
even completely ignoring the bucket pointers and structure, you still
have 4 bytes for a pointer to the Dto and 4 bytes for a pointer to the
integer... plus, since the field of Dto is of type int, there's a
separate boxing conversion for the HashMap and a new set of Integer
objects, so that's another 8 MB + 8 MB + 24 MB = 40 MB. Total so far:
104 MB, and that's assuming minimal object header size and a magical
hashmap that requires no overhead for internal data structures.

Of course, you need some spare memory for any dynamic memory allocation
algorithm. For a free list (C++-structured), that would be about 8
bytes per object, or at least an extra 48 MB. For a more realistic
garbage collected system, this varies a lot more... but a good rule of
thumb is to multiply the memory requirement by two if you want decent
performance, meaning at least an extra 104 MB.

So, this easily accounts for running out of memory with 128 MB. In
fact, it's quite likely that you're memory-starving the garbage
collector with 256 MB, and would do better time-wise to provide a little
more.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
A

arnold

Chris said:
So, this easily accounts for running out of memory with 128 MB. In
fact, it's quite likely that you're memory-starving the garbage
collector with 256 MB, and would do better time-wise to provide a little
more.

Thats actually quite terrible memory utilisation. In that case I have
some questions:

1- Is there a more efficient way of storing this in memory, so you could
store up to, say, 100 million objects?
2- Is there a method of creating simple data placeholders which are not
objects, similar to structs in c/c++.
3- Are there any methods of making the data structures themselves more
memory efficient.

In C I could write this so the program would not require more than
approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays
with the pointers to the structures containing the data).

To me its seems that you should not really use java for in-memory
storage, except for smaller data sets.

arnie
 
R

Roedy Green

Does anybody have any comments on why the program requires that much memory?

One problem is your HashMap size is wrong. You don't specify the
number of elements you want to hold, but the capacity. You must allow
some slop at least enough to account for the load factor.. It ran out
of room and had to double the size temporarily holding both old and
new.

See http://mindprod.com/jgloss/hashmap.html
 
C

Chris Smith

arnold said:
Thats actually quite terrible memory utilisation.

It's not all that bad; you're just storing a lot of data. In any case,
it's a realtively small constant multiple of what you'd get from any
other language.
In that case I have some questions:

1- Is there a more efficient way of storing this in memory, so you could
store up to, say, 100 million objects?
3- Are there any methods of making the data structures themselves more
memory efficient.

There are a few things you could do, if you're really desperate. For
example, you can save a lot of space by re-implementing HashMap and
ArrayList to use primitive int data instead of pointers to objects of
class Integer.
2- Is there a method of creating simple data placeholders which are not
objects, similar to structs in c/c++.
No.

In C I could write this so the program would not require more than
approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays
with the pointers to the structures containing the data).

No, you couldn't. You're still ignoring the overhead for memory
allocation data structures. That overhead exists in C as well as in
Java.

The only difference is that in Java, objects carry an additional 8 bytes
or so of object header, and that the HashMap and ArrayList classes work
with Object and not int (solvable, as I mentioned, if you are willing to
re-implement those classes).
To me its seems that you should not really use java for in-memory
storage, except for smaller data sets.

You should, ideally, use a database for storing large data sets.
Ideally, you wouldn't implement your own database anyway, in any
language.

It's definitely possible to implement a database in Java if you so
desire, but you wouldn't store everything in object-oriented data
structures; instead, you'd use memory mapped files and java.nio.Buffer
to access the data. In-memory data structures are designed to be
convenient and capable for manipulating working data, not to be an
efficient means of data storage.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
T

tom fredriksen

Chris said:
There are a few things you could do, if you're really desperate. For
example, you can save a lot of space by re-implementing HashMap and
ArrayList to use primitive int data instead of pointers to objects of
class Integer.

I suppose there are some libraries which has alredy done these
implementations.

It's not all that bad; you're just storing a lot of data. In any case,
it's a realtively small constant multiple of what you'd get from any
other language.



No, you couldn't. You're still ignoring the overhead for memory
allocation data structures. That overhead exists in C as well as in
Java.

how so? Yes, its a lot of data, but the point is that the data is only
about 32MB of pure data, the rest of the approx 150MB are internal data
for the structures and objects etc (I tested the limits). That yields a
utilisation percentage of 15%, which is quite terrible.

In C what you need is a hash array of 4 bytes per entry, and a second
array for iteratiing the keys of 4 bytes per entry. then there is the
hash data which is 8 bytes per entry. Thats 4 + 4 + 8 * 2million which
is 32MB, and that includes all intrinsic data the strucures need. If I
am missing something please let me know.
(the hash structure might need some more space (i dont know the actual
implementation of the function), but its most likely not more than an
additional 4-8 bytes per entry, and that would take us up to 48MB.
You should, ideally, use a database for storing large data sets.
Ideally, you wouldn't implement your own database anyway, in any
language.

Databases and network operations are slow and costly (in money for
customer solutions), plus you then have an additional 2-3 programming
tiers, in addition the customer gets unnecessary system maintenance.

I am a bit sick of the notion in todays programming world that
everything can only be solved by another heavyweight system which is
excruciately complex to use and costs an arm and a leg in money and
maintenance. Thats just a waste of money and time. Most problems can be
solved by much simpler solutions (which are just as adaptable, scalable
and possibly contains inherent maintenance), than the
tomcat/struts/IBM/BEA/ORACLE/SAP type solutions. Considered BerkleyDB?
JavaSpaces? Hashmap with ReiserFS? The problem is that most people dont
realise this, so there are'nt many good or well known solutions for such
systems.

arnie
 
A

arnold

Chris said:
There are a few things you could do, if you're really desperate. For
example, you can save a lot of space by re-implementing HashMap and
ArrayList to use primitive int data instead of pointers to objects of
class Integer.

I suppose there are some libraries which has alredy done these
implementations.

It's not all that bad; you're just storing a lot of data. In any case,
it's a realtively small constant multiple of what you'd get from any
other language.
approx 48MB. (where approx 32MB is the real data and 16 MB is the arrays
with the pointers to the structures containing the data).
No, you couldn't. You're still ignoring the overhead for memory
allocation data structures. That overhead exists in C as well as in Java.


how so? Yes, its a lot of data, but the point is that the data is only
about 32MB of pure data, the rest of the approx 150MB are internal data
for the structures and objects etc (I tested the limits). That yields a
utilisation percentage of 15%, which is quite terrible.

In C what you need is a hash array of 4 bytes per entry, and a second
array for iteratiing the keys of 4 bytes per entry. then there is the
hash data which is 8 bytes per entry. Thats 4 + 4 + 8 * 2million which
is 32MB, and that includes all intrinsic data the strucures need. If I
am missing something please let me know.
(the hash structure might need some more space (i dont know the actual
implementation of the function), but its most likely not more than an
additional 4-8 bytes per entry, and that would take us up to 48MB.
storage, except for smaller data sets.
You should, ideally, use a database for storing large data sets.
Ideally, you wouldn't implement your own database anyway, in any language.


Databases and network operations are slow and costly (in money for
customer solutions), plus you then have an additional 2-3 programming
tiers, in addition the customer gets unnecessary system maintenance.

I am a bit sick of the notion in todays programming world that
everything can only be solved by another heavyweight system which is
excruciately complex to use and costs an arm and a leg in money and
maintenance. Thats just a waste of money and time. Most problems can be
solved by much simpler solutions (which are just as adaptable, scalable
and possibly contains inherent maintenance), than the
tomcat/struts/IBM/BEA/ORACLE/SAP type solutions. Considered BerkleyDB?
JavaSpaces? Hashmap with ReiserFS? The problem is that most people dont
realise this, so there are'nt many good or well known solutions for such
systems.

arnie
 
A

arnold

Roedy said:
One problem is your HashMap size is wrong. You don't specify the
number of elements you want to hold, but the capacity. You must allow
some slop at least enough to account for the load factor.. It ran out
of room and had to double the size temporarily holding both old and
new.

See http://mindprod.com/jgloss/hashmap.html


Theoretically it might be true, but it made no practical difference. I
tried increasing the size of both the Map and the Collection by a
multiple of 2 and 4 and decreased the heap size, it made no difference.
Both this change and the original code needed approx 180MB of heap to work.

arnie
 
C

Chris Smith

tom fredriksen said:
I suppose there are some libraries which has alredy done these
implementations.

Well, maybe. For integers, yes there probably are. You can save much
more space by inlining the custom data structure into the hash table
entries as well, and obviously that would not pre-exist.

Assuming you want to just inline ints... for ArrayList, the complexity
of an extra dependency on some third-party code is probably not worth
saving the couple dozen lines of code to do this on your own. The
HashMap will be more complex, and may be worth the new dependency.

Dynamic memory management is not magical in C. It uses data structures
to work, just like anything else. Those data structures are typically
stored in the bytes directly before the memory address that's returned
to you from a call to malloc. You are assuming that when you write
"malloc(8)", that requires 8 bytes of memory. In reality, it requires
about twice that amount, depending on the malloc implementation.
Ignoring this makes your comparison unreliable.

Incidentally, I'm still ignoring malloc overhead from memory
fragmentation. This overhead won't show up in a simple test program,
but will become larger over time in a long-running application,
generally asymptotically approaching a constant proportion of the total
memory requirement. Modern malloc implementations are pretty good, so
if you wanted to include this, you'd generally assume this memory is
about 20% or so of the actual memory in use. This is worth mentioning,
because this fragmentation would not be present in Java, in which
defragmentation is generally a side-effect of garbage collection.
(the hash structure might need some more space (i dont know the actual
implementation of the function), but its most likely not more than an
additional 4-8 bytes per entry, and that would take us up to 48MB.

Okay, let's get more specific.

I'll assume you use a straight-forward implementation of a general-
purpose hash table written in C. A hash table will require a set of
buckets, and a linked list of data for each bucket. On a 32-bit
machine, that's about 8 MB divided by the load factor, for the bucket
heads. The linked list for the data will contain a pointer to the data
itself, a pointer to the key, and a pointer to the next piece... plus
possibly a cached hash value, but we'll ignore that by assuming you've
chosen an implementation that opts for mem0ory compactness instead of
performance. That is an extra twelve bytes per entry of visible
overhead, plus about eight bytes per entry of memory management
overhead. You'll need to dynamically allocate space for the keys, so
that's an extra 4 bytes of visible memory, plus 8 bytes of allocation
overhead. Assuming a load factor of 2, that brings us to 4 MB
+ (n * 20) + (n * 12), which is a total of 68 MB of overhead for the
hash table.

You can eliminate about 32 MB of that 68 MB by writing your own hash
table, in which case you could inline the data and keys into your hash
entry structure and save the pointers and dynamically allocated keys.
Similar savings can be had by using C++ templates, and would be expected
in TR1's STL hash structure for example.

So we have an additional overhead of either 68 MB or 36 MB, depending on
your implementation choices. The total for C is either 100 MB or 68 MB,
not counting some negligible other sources of memory requirement.

So let's take the answer to be 68 MB. In Java, if you similarly write
your own data structures to inline the data, you'd use an additional 16
MB on top of that, for a total of 84 MB. Leaving space for the garbage
collector to work, you might expect 128 MB to be sufficient (remember
that we already accounted for about eight bytes per object that aren't
needed in a Java garbage collector since there's generally no explicit
free list, so we don't need to double it).

Or let's take the answer to be 100 MB if you don't want to write your
own hash table. In Java, you'll get an additional 48 MB, for a total of
148 MB. Make that about 200 MB for reasonable garbage collector
performance (same note about not needing the overhead to keep a free
list).

Or, if you want to use an ArrayList of Integer in Java instead of
managing your own int[], then the memory requirement goes up to 188 MB,
or maybe 256 MB for reasonable gc performance. There has been no C
equivalent proposed for such a design, but it would probably require
about 124 MB of memory (similar methods).

So, overall, it appears that Java requires about twice the memory of C
for all possible design alternatives. C++ shows a considerable
advantage, though, in that you can use STL data structures vector and
TR1's unordered_map, and get the 68 MB result without having to re-
implement anything.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
R

Roedy Green

There are a few things you could do, if you're really desperate. For
example, you can save a lot of space by re-implementing HashMap and
ArrayList to use primitive int data instead of pointers to objects of
class Integer.

One goofy thing about HashMap vs the way it used to be done is there
are three objects for each entry, a key, an object and a piece of glue
to tied them together with overflow chains.

In the olden days you, might have left a slot in the object itself for
the chains, and used a key embedded in the object. You then have only
one object per item.

The problem with that sort of old-fashioned implementation is you
must think ahead and leave a slot in your objects for HashSet to use,
and you can't put the object in more than one HashSet.
 
M

megagurka

arnold skrev:
Thats actually quite terrible memory utilisation. In that case I have
some questions:

1- Is there a more efficient way of storing this in memory, so you could
store up to, say, 100 million objects?
2- Is there a method of creating simple data placeholders which are not
objects, similar to structs in c/c++.
3- Are there any methods of making the data structures themselves more
memory efficient.

Try these primitive type collection libraries:

http://pcj.sourceforge.net/
http://trove4j.sourceforge.net/

/JN
 
A

arnold

Chris said:
Okay, let's get more specific.

Please see attached code sample.

First of, I've implemented this using the C language idioms, which one
should do instead of trying to copy the java or OO idioms.

Secondly, you should be aware of one thing about this implementation.
That is that the hash table implementation uses character keys, not
integer keys as my java implementation. In order to accomodate this I
needed to create a string represenation of the key, So this adds a
modest size to the program. Also I dont know how much memory the table
allocates for a 2mill entry table, show that should add some extra
compared to HashMap, All in all its still smaller than your optimal C++
impl. and about 3-4 times smaller than my java impl. (With 4 Gigs of
memory this program would be able to store 125 million entries, while
the java version would store 48 million entries)

FYI, this program took 3.18 seconds to complete the full implementation,
where as my java version needed 11 seconds.

Here are the memory stats for this program. The VIRT column is the most
representative, allthough it includes code and libraries sizes. You can
use the DATA column if you like, though.

Test 1: No datastructures allocated,

PID VIRT RES SHR SWAP CODE DATA COMMAND
24125 1432 280 1400 1152 8 1424 hash

Test 2: Allocating an iteration array of 2M entries

PID VIRT RES SHR SWAP CODE DATA COMMAND
24134 9248 316 1400 8932 8 9240 hash

Test 3: Allocating the hashtable with no data

PID VIRT RES SHR SWAP CODE DATA COMMAND
24573 32688 30m 1400 1100 8 31m hash

Test 4: Full implementation

PID VIRT RES SHR SWAP CODE DATA COMMAND
24753 63972 61m 1400 1096 8 62m hash




************ code **************


#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <search.h>


/* Placeholder for data */
typedef struct dto {
int ip;
int serialnum;
} dto_t;


int main(int argc, char **argv)
{
int *key = NULL; /* The key array for iterating the hash */
int size = 2000000; /* The number of entries*/

if (!hcreate((size_t) size)) {
printf("Error creating hashtable\n");
exit(0);
}

int r = 0;
dto_t *data = NULL;
ENTRY e, *ef;
char r_str[20];

/* Create data and insert into hash and key array */
key = malloc(sizeof(int) * size);
for(int c=0; c<size; c++ ) {

/* A random number between 1 and 2billions */
r = 1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));

key[c] = r;
sprintf(r_str, "%i", r);
e.key = r_str;

data = malloc(sizeof(data));
data->ip = r;
e.data = (dto_t *)data;

hsearch(e, ENTER);
}

/* Iterate the key array and retrieve hash data */
for(int c=0; c<size; c++ ) {

sprintf(r_str, "%i", key[c]);
e.key = r_str;
ef = hsearch(e, FIND);
if(ef)
data = (dto_t *) ef->data;
}

return(0);
}
 
M

megagurka

arnold skrev:
Please see attached code sample.

First of, I've implemented this using the C language idioms, which one
should do instead of trying to copy the java or OO idioms.

AFAICT there is nothing that stops you from implementing the same data
structure in Java as you do in C.

/JN
 
A

arnold

megagurka said:
arnold skrev:



AFAICT there is nothing that stops you from implementing the same data
structure in Java as you do in C.

Thats true, but in a real scenario it would be relevant for the hashmap
only. I dont know how much more efficient it would be though. (the
arraylist is of no interrest, since it only used for testing. In any
case it can be implemented as a pure int array.)

/arnie
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,074
Latest member
StanleyFra

Latest Threads

Top