Any suggestions for handling data of huge dimension in Java?


Simon

Dear All,

Good day. Regarding the subject, I am doing a research simulation using
Java in Eclipse Galileo. My laptop is a Dell Studio XPS 1645 with an i7
processor and 4 GB RAM. When running the Java source code, I have set
Run Configurations > Arguments > VM Arguments > -Xmx1024M
-XX:MaxPermSize=128M and also assign the object to null when it is no
longer needed. However, I keep facing Java heap problems. Most of the
time, I am using HashMap<String,Double> and StringBuilder to hold the
data. The dimension of my data is around 5000 columns (or features) x 50
classes x 1000 files, and I need to extract that data into one file
for classification purposes. Therefore, are there any suggestions or
articles to help me cope with such a problem?

Your concern is highly appreciated.

Thanks.

regards,
Ng.
 

Eric Sosman

Simon wrote:

Dear All,

Good day. Regarding the subject, I am doing a research simulation using
Java in Eclipse Galileo. My laptop is a Dell Studio XPS 1645 with an i7
processor and 4 GB RAM. When running the Java source code, I have set
Run Configurations > Arguments > VM Arguments > -Xmx1024M
-XX:MaxPermSize=128M and also assign the object to null when it is no
longer needed. However, I keep facing Java heap problems. Most of the
time, I am using HashMap<String,Double> and StringBuilder to hold the
data. The dimension of my data is around 5000 columns (or features) x 50
classes x 1000 files, and I need to extract that data into one file
for classification purposes. Therefore, are there any suggestions or
articles to help me cope with such a problem?

Find a way to "classify" incrementally, so you don't need to hold
all 250,000,000 key/value pairs in memory at the same time. Sorry I
can't be more specific, but I'm unable to guess what your data looks
like or what "classify" means.
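
But the general shape would be something like this: process one input
file at a time and append its results to the combined output, so only a
single file's worth of data is ever on the heap. The input directory,
the output file name, and the extractFeatures helper below are all
invented, since I can't guess your actual format:

import java.io.*;
import java.util.*;

public class IncrementalExtractor {

    public static void main(String[] args) throws IOException {
        File[] inputs = new File("data").listFiles();   // hypothetical input directory
        if (inputs == null) {
            throw new FileNotFoundException("no such directory: data");
        }
        // Append each file's results to one combined file instead of
        // accumulating all 250,000,000 entries in memory first.
        try (PrintWriter out = new PrintWriter(new BufferedWriter(
                new FileWriter("combined.csv", true)))) {
            for (File f : inputs) {
                // Only this one file's features are in memory at any moment.
                Map<String, Double> features = extractFeatures(f);
                for (Map.Entry<String, Double> e : features.entrySet()) {
                    out.println(f.getName() + "," + e.getKey() + "," + e.getValue());
                }
                // 'features' goes out of scope here, so it can be collected
                // before the next file is read.
            }
        }
    }

    // Hypothetical per-file extraction; replace with your own parsing.
    private static Map<String, Double> extractFeatures(File f) throws IOException {
        Map<String, Double> m = new HashMap<String, Double>();
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = in.readLine()) != null) {
                // parse 'line' into a feature name and value here
            }
        }
        return m;
    }
}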
 

Nigel Wade

Simon wrote:

Dear All,

Good day. Regarding the subject, I am doing a research simulation using
Java in Eclipse Galileo. My laptop is a Dell Studio XPS 1645 with an i7
processor and 4 GB RAM. When running the Java source code, I have set
Run Configurations > Arguments > VM Arguments > -Xmx1024M
-XX:MaxPermSize=128M and also assign the object to null when it is no
longer needed. However, I keep facing Java heap problems. Most of the
time, I am using HashMap<String,Double> and StringBuilder to hold the
data. The dimension of my data is around 5000 columns (or features) x 50
classes x 1000 files, and I need to extract that data into one file
for classification purposes. Therefore, are there any suggestions or
articles to help me cope with such a problem?

What is the problem? You haven't actually stated you have a problem,
only a task to perform. What problem are you actually seeing?

If you are running out of memory (OOM errors) then determine whether
it's the heap or PermGen that is exhausted (do you have any evidence
that PermGen needs increasing?). To determine whether you are keeping
objects beyond their sell-by date, use a profiler. There's one in
NetBeans which will do the basics, but I don't know Eclipse that well.
It's all too easy to fail to release every reference to an object, thus
preventing it from being GC'd, even when you think you have released
them all.

If you are not hanging on to objects unnecessarily, you can try to
increase the heap size to the maximum allowed by your system's memory
limits. If you still cannot fit your data into the system then you need
better hardware or a different OS (a 64-bit OS would allow more
per-process virtual memory), or a different algorithm (one which doesn't
try to hold all your data in memory at the same time).
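
As a cheap first check before reaching for a profiler, you can log the
heap numbers yourself at interesting points; the VM flag mentioned in
the comment is a standard HotSpot option:

public class HeapStats {
    // Call this at interesting points, e.g. after each file is processed.
    public static void print(String label) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        // Running with -XX:+HeapDumpOnOutOfMemoryError in the VM arguments
        // additionally writes a heap dump you can open in a memory analyser
        // when the OutOfMemoryError finally happens.
        System.err.printf("%s: used=%d MB, committed=%d MB, max=%d MB%n",
                label, used >> 20, rt.totalMemory() >> 20, rt.maxMemory() >> 20);
    }
}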
 

Martin Gregorie

If you really do have a lot of data then using a DBMS might be a better
bet. You could at least pre-process the bulk data there to boil it down
to something that fits more comfortably into Java.

Sounds like the way to go: it sidesteps the memory issues and avoids the
need for merging or summarising the data.
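
A rough sketch of that route, using the embedded H2 database purely as
one example of a JDBC-accessible DBMS; the table layout and column names
are invented for illustration:

import java.sql.*;

public class DbPreprocess {
    public static void main(String[] args) throws Exception {
        Class.forName("org.h2.Driver");   // embedded H2, used as an example only
        try (Connection con = DriverManager.getConnection("jdbc:h2:./featuredb", "sa", "")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS raw(fname VARCHAR, clazz VARCHAR, " +
                           "feature VARCHAR, val DOUBLE)");
            }
            // Bulk-load rows as you parse each input file ...
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO raw VALUES (?, ?, ?, ?)")) {
                ins.setString(1, "file0001");   // dummy values for illustration
                ins.setString(2, "classA");
                ins.setString(3, "feature42");
                ins.setDouble(4, 0.37);
                ins.addBatch();
                ins.executeBatch();
            }
            // ... then let the database do the boiling down.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT clazz, feature, AVG(val) FROM raw GROUP BY clazz, feature")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "," + rs.getString(2)
                            + "," + rs.getDouble(3));
                }
            }
        }
    }
}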
 

markspace

3. buy RAM. It is cheaper than you would imagine


Lease RAM, it's even cheaper that way.

EC2 (Amazon Elastic Compute Cloud) will let you lease a fairly hefty
64-bit machine for pennies per hour. If you have really large computing
problems that a small desktop won't handle, I think some form of grid
computing is the way to go.

http://aws.amazon.com/ec2/

8 CPUs + 15 GB RAM + a 64-bit OS for $0.68 per hour. Can't beat that.
Even larger instances are available if needed.


EC2 will also let you run big-data tools like MapReduce and Hadoop. I
haven't tried these, just read about them, but they look darn
interesting for certain data problems.

Any chance you (Simon/Ng) can tell us more about what your problem set
is and what the data looks like?
 

Lew

Simon said:
Good day. Regarding the subject, I am doing a research simulation by
using java [sic] in eclipse [sic] galileo [sic]. My laptop is dell [sic] studio [sic] xps [sic] 1645 with
i-7 processor and 4gb ram. When running the java [sic] source codes, I have
set the Run Configurations> Arguments> VM Arguments> -Xmx1024M -
XX:MaxPermSize=128M and also assign the object to null when it is no

A false tactic. You cannot assign an object to 'null', first of all. You can
only assign 'null' to a reference. That's actually very important, because if
you have a second reference to the same object, setting the first one to
'null' is just self-delusion.

Generally speaking, setting references to 'null' is undesirable and
potentially harmful, according to Java gurus such as Brian Goetz. Except
for the very few places where it's needed, the best it does is give you a
false sense of security while preventing you from properly scoping your
references and your object lifetimes.
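
A tiny illustration of the point, with invented names:

import java.util.HashMap;
import java.util.Map;

public class NullingDemo {
    private static final Map<String, double[]> cache = new HashMap<String, double[]>();

    public static void main(String[] args) {
        double[] big = new double[5000000];
        cache.put("features", big);

        big = null;  // Only clears the local reference; the array is still
                     // strongly reachable through 'cache', so it cannot be
                     // garbage collected. Nulling the local bought nothing.

        // The real fix is to stop holding it at all (or scope it narrowly):
        cache.remove("features");  // now the array can actually be collected
    }
}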
longer needed. However, i [sic] keep facing java [sic] heap problem. Most of the

That's either because you aren't releasing all references to objects when you
need to (likely), or you simply have too much data for a gig of heap.
time, i [sic] am using HashMap<String,Double> and StringBuilder to hold the

http://sscce.org/

Vague descriptions mean the problem is always in the part you don't show.
data. Dimension of my data is around 5000 columns (or features) x 50
classes x 1000 files, and I need to extract that data into one file
for classification purpose. Therefore, is there any suggestions or
articles for me to cope with such problem?

If you're extracting into a file from some place presumably not originally in
Java heap RAM, why are you loading it all into memory at once?

Please construct and provide an SSCCE.

http://sscce.org/

Otherwise you're like a patient asking his doctor, "I have something wrong.
Can you prescribe me something?"

We need solid data from you.

http://sscce.org/
 

Arne Vajhøj

Simon wrote:

Good day. Regarding the subject, I am doing a research simulation using
Java in Eclipse Galileo. My laptop is a Dell Studio XPS 1645 with an i7
processor and 4 GB RAM. When running the Java source code, I have set
Run Configurations > Arguments > VM Arguments > -Xmx1024M
-XX:MaxPermSize=128M and also assign the object to null when it is no
longer needed. However, I keep facing Java heap problems. Most of the
time, I am using HashMap<String,Double> and StringBuilder to hold the
data. The dimension of my data is around 5000 columns (or features) x 50
classes x 1000 files, and I need to extract that data into one file
for classification purposes. Therefore, are there any suggestions or
articles to help me cope with such a problem?

Listen to Roedy and Lew.

It is a lot cheaper to run this on more powerful software and hardware
than to develop a programmatic solution.

Arne
 

Arne Vajhøj

For hosting a database? For your *pagefile*?! Are you crazy? Those things
have a limited number of rewrites for each bit. Hundreds, to be sure, or
even thousands, but you'll wear them out pretty quickly putting anything
on it that changes very frequently. I wouldn't be too shocked if putting
your swap partition on an SSD wore out the SSD in a matter of *weeks*.

That is how the situation was 15 years ago.

But support for many thousands of writes, combined with new
wear-levelling techniques that spread the load, has changed that
dramatically.

SSDs will last for years even under heavy load.

For people who like facts, http://en.wikipedia.org/wiki/Ssd
has references 55-58 with all the details.

For people who are not interested in the details, the simple fact
that SSDs are widely used for many purposes, from database
transaction-log disks to laptop disks, should convince them that the
technology is reliable for more than weeks.

Arne
 

Martin Gregorie

On Thu, 24 Mar 2011 10:06:46 -0500, Leif Roar Moldskred wrote:

(another anti-Wesson diatribe)
The stuff you snipped was correct. Flash memory tends to last longer than
most people think. One of the keys to long life with fairly volatile data
on flash is to use a piece of flash that's several times the size of the
amount of data you're storing. As an example, if you append 40 bytes to a
log file once every second, a 2 GB flash device will survive about 10
years, assuming a typical life of 100,000 write/erase cycles per cell.
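
Roughly, the arithmetic behind that estimate, assuming wear levelling
spreads erases evenly and that each small append costs one full
erase-block write; the 512 KB block size here is my assumption, not a
figure from the post:

public class FlashEndurance {
    public static void main(String[] args) {
        long capacityBytes   = 2L * 1024 * 1024 * 1024; // 2 GB device
        long eraseBlockBytes = 512 * 1024;              // assumed erase-block size
        long cyclesPerBlock  = 100000;                  // typical write/erase endurance

        long blocks      = capacityBytes / eraseBlockBytes;  // 4096 blocks
        long totalErases = blocks * cyclesPerBlock;          // ~4.1e8 erases
        // One 40-byte append per second is roughly one block erase per second:
        double years = totalErases / (60.0 * 60 * 24 * 365);
        System.out.printf("Roughly %.0f years%n", years);    // ~13 years
    }
}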

If you're considering a seldom-written SD card that's used to load data
or programs into devices that need updating or synchronising, the flash
memory may well outlast the packaging, since the contacts on an SD card
are only rated for 20,000 insertions.

But don't just take my word for this: take the trouble to find out
exactly how flash works, read the specifications of good quality SD cards
and do the maths yourself.
 

Arne Vajhøj

You're talking about high-end, expensive units, not consumer-grade flash
drives. Roedy was suggesting buying cheap flash drives, not high-grade
industrial-strength server components.

Leif linked to a site that calculated that continuous, maximum-speed
writing onto a drive would not wear it out for a half-century -- but it
assumed a 64GB drive from the high end of the write-endurance range. A
cheap consumer grade flash drive sized to fit your pagefile would be an
8GB USB stick rated, if you were very lucky, for 100,000 writes per bit.
Pagefile activity would use those writes up pretty damn quickly. At
maximum write speed it would last 1/160th as long as the expensive 64GB 2-
million-write drive -- so, less than four months. For realistic pagefile
usage it might last a couple of years. If it's a more typical 10,000-
write USB stick it won't last more than six to twelve weeks under the
same usage, hence my original estimate of "weeks".

The topic was SSD disks, not USB sticks.

And they are not high-end, expensive SSD disks.

You can buy them in cheap consumer PCs.

In fact, the OLPC was based on such technology.

Arne
 

Arne Vajhøj

Yes, this is well known, but you're omitting to mention the fact that
using a piece of flash that's several times as big makes that piece of
flash several times as expensive, and Roedy was advocating the use of
flash as a *cheap* option.

They cost 1-2 USD per GB in PC quality.

The original poster needed a bit more than a 32-bit OS and JVM
could provide.

If he needs 20 GB, then that is 20-40 USD.

I consider that cheap.

Arne
 

Eric Sosman

On Thu, 24 Mar 2011 10:06:46 -0500, Leif Roar Moldskred wrote:

(another anti-Wesson diatribe)

*sigh*

This is getting old.

Disputing someone's assertion does not constitute an attack
upon that person, only upon the assertion.

Treating a correction as a personal attack -- *that's* old.
 

Eric Sosman

A correction *is* a personal attack, since it implies that some person
needed correcting, and thus that that person has a negative trait.

Ignorance is value-neutral; we are all born with it. Willfully
refusing the offered cure is not value-neutral.
In fact, it's considerably nastier than simply calling that person a
stupidhead or something. Out and out namecalling is easily seen for what
it is; someone reading a post from Leif saying "Ken Wesson is a
stupidhead" would probably take it with a very large grain of salt.

He disputed your assertion, and cited an authority who also
disagrees with you. He didn't even call you wrong; he was polite and
took issue with the assertion itself, not with its utterer. This you
called an "anti-Wesson diatribe."
Said disproof of his claims is in other recent posts to this thread, by
the way. The capsule summary: there's cheap consumer-grade flash and
there's expensive industrial-strength flash. To get significant lifetime
out of a flash based swap partition would require the latter, and in a
much larger capacity than the nominal size of the swap partition to boot,
which pretty much trashes Roedy's claim that you can do that on the cheap.

This latest assertion has also been disputed in this thread, but
I haven't seen any authority cited. So I'll offer a citation:

http://www.newegg.com/Store/SubCategory.aspx?SubCategory=636&name=SSD

(Don't overlook the free shipping.)
Leif's trick was a simple, but unfortunately effective one -- quietly
ignore the fact that Roedy was talking about cheap flash and point to the
characteristics of the high-end stuff as "disproving" me. It's a
variation on what logicians call "equivocation".

And you, sir, are a variation on what I call a "plonkee."
 
