Lots of Data

RJ

I've got the following problem. I'm reading tab-separated text files
into an array of objects, each object representing one line. Each object
contains another array in which the individual tab-separated values are
stored. When I try to read more than 40,000 rows with more than 80
columns, I get an OutOfMemoryError.
What can I do to have quick access to that data and save memory, so
that my application will also work with lots of data?
 
Marco Schmidt

Steve Horsley:
One thing that I find works well here is to share all common Strings that
you read from a file, so that if you find many instances of the word "yes"
in the file, you simply store a reference to the existing String "yes". I
do this by checking every String I retrieve against a HashMap (storing the
String as both key and value). If the String is already in the HashMap, I
use that one instead.
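
For illustration, a minimal sketch of that HashMap-based pool; the class and
method names are made up:

<code>
import java.util.HashMap;
import java.util.Map;

public class StringPool {
    // Canonical copies of the strings seen so far; key and value are the same object.
    private final Map pool = new HashMap();

    public String canonicalize(String s) {
        String existing = (String) pool.get(s);
        if (existing != null) {
            return existing; // reuse the String object already stored
        }
        pool.put(s, s);      // first time we see this text
        return s;
    }
}
</code>

Run every cell value through canonicalize() while parsing, and the 40,000 rows
share one String object per distinct value instead of one per cell.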

Isn't that what intern() is for?

Regards,
Marco
 
Thomas Weidenfeller

What can I do to have quick access to that data and save memory, so
that my application will also work with lots of data?

E.g. use a database instead. Or compress, or cleverly encode your data
if you really need it in memory.

/Thomas
 
dhek bhun kho

(e-mail address removed) (Thomas Weidenfeller), Tue, 08 Jul 2003
13:20:19 +0000:
E.g. use a database instead. Or compress, or cleverly encode your data
if you really need it in memory.

/Thomas

If the data is stored with an explicit ordering, then you can get away with
implementing some form of cache and accessing the file as a
java.io.RandomAccessFile.

It takes a bit more thought to get the coding done, but it reduces memory
usage, and random access is fast. 40,000 lines of text isn't that much.

You can do a binary search if the data is sorted, or use something like
Knuth-Morris-Pratt to search through the files for a specific key (if the
keys aren't too small).

I did try KMP once, and it was quite easy to code; search times were
relatively small for a single key (<10 ms on a cheap ATA-100 IDE drive, in a
file with about 50,000 variable-length records).
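
To make the RandomAccessFile idea concrete, here is a rough sketch (class and
method names are made up) that scans the file once to record where every line
starts, and then seeks back to any line on demand; with a sorted key column, a
binary search over these offsets gives the lookup mentioned above:

<code>
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final RandomAccessFile file;
    private final List offsets = new ArrayList(); // byte offset of each line, as Long

    public LineIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long pos = 0;
        // One sequential pass: remember where every line starts, keep nothing else.
        while (file.readLine() != null) {
            offsets.add(new Long(pos));
            pos = file.getFilePointer();
        }
    }

    public int lineCount() {
        return offsets.size();
    }

    // Random access: seek to the recorded offset and re-read that single line.
    public String line(int n) throws IOException {
        file.seek(((Long) offsets.get(n)).longValue());
        return file.readLine();
    }
}
</code>

Only one long per line stays in memory; each access rereads the line from disk,
and the OS file cache usually keeps the hot parts resident anyway.
(RandomAccessFile.readLine() assumes one byte per character, so a real version
needs charset handling.)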

Greets
Bhun.
 
Thomas Weidenfeller

dhek bhun kho said:
If the data is stored with an explicit ordering, then you can get away with
implementing some form of cache and accessing the file as a
java.io.RandomAccessFile.

It takes a bit more thought to get the coding done, but it reduces memory
usage, and random access is fast. 40,000 lines of text isn't that much.

I know. I just didn't want to tell the original poster that his idea of
first reading in everything into memory is a !@#$%^&* idea. Instead I
wanted to provide an off-the-shelf solution. Of course, writing your
own little special-purpose "database" and search code is also an
option.

I see this "files? what the heck are files? Give me memory!" attitude
more and more often, also in this group. E.g. recently someone posted
his source code to copy the contents of one directory to another
directory. For each file, the code was along the line of:

* get length of file (which is a "long")

* cast length to "int" (which can still be a rather large number)

* allocate byte array of that (truncated) length

* read the complete file (or what remains after the truncation)
into the array

* write the array to the destination file

There isn't much more that one could do wrong here, is there? I guess it is
a generation thing. I am not that old :) but I had to work with memory
sizes that today we all use up for a simple String before the first cup
of tea in the morning.
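
For contrast, a copy loop that never holds more than a small, fixed buffer in
memory (just a sketch, not the code from that posting):

<code>
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    public static void copy(String from, String to) throws IOException {
        InputStream in = new FileInputStream(from);
        OutputStream out = new FileOutputStream(to);
        try {
            byte[] buffer = new byte[8192]; // fixed size, independent of the file length
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}
</code>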

/Thomas
 
dhek bhun kho

(e-mail address removed) (Thomas Weidenfeller), Wed, 09 Jul 2003
07:06:15 +0000:
I know. I just didn't want to tell the original poster that his idea of
first reading in everything into memory is a !@#$%^&* idea. Instead I
wanted to provide an off-the-shelf solution. Of course, writing your
own little special-purpose "database" and search code is also an
option.

HSQLDB (at sourceforge) provides JDBC access to a non-networked database
server if you want a quick fix.
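
A minimal sketch of that quick fix, assuming the HSQLDB jar is on the
classpath; the file and table names are made up, and as far as I remember
CACHED tables are the HSQLDB flavour that keeps rows on disk rather than
entirely in memory:

<code>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HsqlImport {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");
        // "file" mode runs the database inside this VM; no server process is needed.
        Connection con = DriverManager.getConnection("jdbc:hsqldb:file:rowdata", "sa", "");

        Statement st = con.createStatement();
        st.execute("CREATE CACHED TABLE records (id INTEGER, c1 VARCHAR(255), c2 VARCHAR(255))");

        PreparedStatement insert = con.prepareStatement("INSERT INTO records VALUES (?, ?, ?)");
        insert.setInt(1, 1);
        insert.setString(2, "yes");
        insert.setString(3, "no");
        insert.executeUpdate();

        ResultSet rs = st.executeQuery("SELECT c1, c2 FROM records WHERE id = 1");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}
</code>

Loading the tab-separated file then becomes one INSERT per row inside a loop,
and lookups become SELECTs instead of array scans.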

[--the following is quite off-topic--]

:D Just a quick question: does memory mapping put the whole file into
memory, or does it just keep a part of the file memory-resident? (I haven't
tried this yet.) Otherwise memory mapping the file would be a good
solution. The nio.* interfaces are a bit awkward to me at times.

A bonus of the off-the-shelf solutions is that, on the one hand, they work
well for getting things done initially. On the other hand, they tend to drive
the hardware and environment requirements up to stupendous amounts of RAM and
software just to get it done. It's a bit strange to use a real
database server just because the programmer does not know how to manipulate
large files.

And in some cases the speed gain is actually fake. For example, I code on a
512 MB RAM machine, but I really can't require users to have the same
environment. And once the OS itself starts to swap things in and out of
memory, you are actually experiencing a much worse slowdown in EVERYTHING
you do with the computer.

Like a few weeks ago there was a posting about how to code a servlet
doing a MIME/multipart upload. It's not that hard once you understand what
is actually coming over the connection (just dump the output coming in
and "you save a nickel!"). I have the code lying around somewhere...

I just dislike the fact that there is this general attitude on Usenet
that once you post something that's not 100% working, your 'reputation'
is totally bummed out if somebody googles you. It wasn't always like this.
It's so irritating that there is almost this requirement to post using an
alias, and never post anything that might be wrong. I really can't
remember signing an SLA with anybody.
I see this "files? what the heck are files? Give me memory!" attitude
more and more often, also in this group. E.g. recently someone posted
his source code to copy the contents of one directory to another
directory. For each file, the code was along the line of:

* get length of file (which is a "long")

* cast length to "int" (which can still be a rather large number)

* allocate byte array of that (truncated) length

* read the complete file (or what remains after the truncation)
into the array

* write the array to the destination file

There isn't much more that one could do wrong here, is there? I guess it is
a generation thing. I am not that old :) but I had to work with memory
sizes that today we all use up for a simple String before the first cup
of tea in the morning.

It has nothing to do with generation differences. Like, I am 24, turning
25 next week; it's more like a dependency syndrome. Some major vendor
is making the majority of us API addicts. In the end the real knowledge of
how to solve a problem is lost.

APIs are fun and a real blessing in most cases, but once you cross
the line where you can't solve a problem without using very high-level
APIs, what good are you as a programmer? (Don't flame me about this; I
don't consider myself the überprogrammer at all; some of those folks are
hanging around in the top 1000 rankings of www.topcoder.com.)

I wouldn't consider myself any better than an Office XP user who talks to
his machine. (Just imagine everyone in an office department talking to
their machine: "We saved one nickel! We saved one nickel!" I would apply
for a permanent position at a sanatorium immediately.)

Off-topic:

My wishlist for the future development of the JDK is:

1) faster startup time of the JRE
2) better memory management of the JRE
3) being able to tweak the initial environment of the JRE
(anybody noticed that the JRE 1.4 now fires up several extra threads for
the filesystem preferences thing?)
4) reduced memory usage of Swing
5) overall speed improvement of Swing
6) improved HTML support in Swing
7) modularized on-demand installation of the JRE

It's definitely not generics or overloaded return types (gosh). I don't see it as
beneficial to the JRE to keep adding new language constructs. While it may be
more correct, the programming that you need to do can be done using the
current means.

The API is nearing a googol in the number of classes, while less time is
spent on things an average end-user really wants: speed and reduced
memory usage, better (default) multimedia support, and integration with the
operating system. A side effect of adding things to the API is that you
can't remove them afterwards, since people expect the stuff to be present.

Take generics, for example: programmers really want it, because it
makes the language more 'advanced'. But it's actually just a feature to
ease our laziness. A good editor/IDE with template support can do exactly
the same. Does the end user notice the fact that you are using generics?
No. He's still looking at the sometimes brain-dead and slow-to-start user
interface.

Greets
Bhun.
 
Thomas Weidenfeller

dhek bhun kho said:
:D Just a quick question: does memory mapping put the whole file into
memory, or does it just keep a part of the file memory-resident? (I haven't
tried this yet.)

It depends on the JVM implementation, which in turn depends on the OS
supporting it, and also on the hardware for an efficient implementation. For
cross-platform reasons the Java API spec does not guarantee that you
get real memory-mapped I/O.

If it is done as it should be done, then only the parts of the file
that you really access are transferred into memory (in some page-size
chunks). Everything else is just some meta data in the memory
management unit and the OS. But MappedByteBuffer also has the load()
method, which worries me very much, because it is supposed to load the
complete data into main memory at once.

On Unix, memory-mapped I/O is "just" an application of the normal
virtual memory management. It is the same mechanism that is used to
page in and out programs and program data, and it is highly optimized.
For memory-mapped I/O the file is associated with some virtual address
space - just some meta data is set up. When this virtual memory is
accessed at some time, it triggers a page-fault. That page-fault
triggers the loading of that page from the file into main memory, and
the access to the data is then satisfied from that page of main
memory.

What I would like to see for the nio FileChannel is a method to test if
real memory-mapped I/O is supported, and also some method/constant
which tells me the used page size (which is OS dependent).
Otherwise memory mapping the file would be a good
solution.

Well, if you have to sequentially search through your file, then sooner
or later all parts of the file will be paged into main memory (and some
might be paged out if main memory gets low). So you "only" gain some
speed because of the more efficient I/O operations. But, if you have a
more clever file structure (e.g. an index at the beginning of the
file), and record sizes which are multiples or integer fractions of the
page size, you can gain considerable speed in file access, especially
for large files.
The nio.*-interfaces are a bit awkward at times to me.

Indeed, they are. E.g. I am still trying to find out if it is "better"
to map the complete file at once, or to create new MappedByteBuffers
for each region I want to access.
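
For what it's worth, the per-region variant looks like this sketch (file name,
offset and window size are made up); only the mapped window is backed by pages
that get faulted in on access:

<code>
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class RegionMapping {
    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("records.dat", "r");
        FileChannel channel = raf.getChannel();

        long offset = 64L * 1024; // start of the region of interest (made up)
        long length = 16L * 1024; // size of the window (made up)

        // Maps only this window of the file, not the whole file.
        MappedByteBuffer region =
                channel.map(FileChannel.MapMode.READ_ONLY, offset, length);

        byte first = region.get(0); // index 0 is relative to the start of the window
        System.out.println("first byte of region: " + first);

        channel.close();
        raf.close();
    }
}
</code>

Whether many small mappings beat one big mapping is exactly the open question
above; either way the buffer is only unmapped when the garbage collector gets
to it, which is another of the awkward corners of nio.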
I just dislike the fact that there is this general attitude on Usenet
that once you post something that's not 100% working, your 'reputation'
is totally bummed out if somebody googles you.

I don't have a reputation, so I don't have to care ;-) I am more
worried about the whining that starts when you tell people the sad
truth, e.g. that something they are trying is just stupid.
My wishlist for the future development of the JDK is:

1) faster startup time of the JRE
2) better memory management of the JRE

Please start with Swing :)
Take generics, for example: programmers really want it,

Don't count me in on that one. I am burned by C++ (yes, I know, Java
generics are not templates, are much better, yada, yada, yada, but I
still don't want them).

/Thomas
 
dhek bhun kho

Hey Thomas (or anybody else)

(e-mail address removed) (Thomas Weidenfeller), Wed, 09 Jul 2003
07:06:15 +0000:
I know. I just didn't want to tell the original poster that his idea of
first reading in everything into memory is a !@#$%^&* idea. Instead I
wanted to provide an off-the-shelf solution. Of course, writing your
own little special-purpose "database" and search code is also an
option.

I was talking to somebody on JDC
(http://forum.java.sun.com/thread.jsp?forum=4&thread=420695&start=0&range=30#1868195)
about multiple instances of serialised data in a single file, and was
wondering if anybody could provide some input:
into the previously allocated space. And a small
buffer could be used to copy the rest of the file backward.

I see that deletion does not have to be much of an issue per se.

But the thing is: we are talking about serialised data, and thus
variable-length 'records'.

Another problem is modification. There is no way to guarantee that a
modified object will fit back into its own 'bucket'. That is, when using
the approach of storing multiple serialised objects in a single file,
every modification of a serialised object implies a deletion and an addition.

The only thing that is easy to do when you have a single file with
multiple objects is to add. And even that requires a hack of adding a
deletion flag.

Searching this file for a specific instance will also be awkward. For one
thing, if you want to do fast searching, you would have to resort to
pattern matching, as the data is not stored in any particular order and
has variable length.

The more I think of it, the nastier it gets. Deletion is but the start of the
problem. I am not saying it won't work, but all the extra work to get it
working just to have an aesthetically nice solution (a single file) makes me
think it is the result of not thinking clearly.

Maybe somebody has done it like this already, I really don't know. I
haven't seen an implementation like this yet. I thought about what dub
said. I'd like to know what anybody has to say about the following:

Code a class that will handle the save/delete/find methods.

<code>
import java.io.Serializable;
import java.util.Comparator;

// NotLocked, AlreadyLocked and ObjectNotFound are application-defined exceptions.
public abstract class PersistenceManager {
    // Factory method; a concrete subclass is picked per backing store.
    public static PersistenceManager getInstance(Object object) {
        throw new UnsupportedOperationException("not implemented yet");
    }

    public abstract void save(Serializable object) throws NotLocked;
    public abstract void delete(Serializable object) throws NotLocked;
    public abstract void lock(Serializable object) throws AlreadyLocked, ObjectNotFound;
    public abstract void unlock(Serializable object) throws NotLocked;
    public abstract Serializable find(Class serializableClass, Comparator query);
    public abstract Serializable find(Class serializableClass, String md5hash);
}
</code>

It's a singleton so it should be able to handle concurrent access to the
backing store. This way the PersistenceManager can do some form of
rudimentary caching (LRU, FIFO, whatever) and row locking in a single VM.

The basic idea is clear, I hope. I'm not totally sure about the lock and
unlock methods (deadlocks/starvation).

A simple scenario would be something like this:
<code>
File baseDirectory = ...; // like the namespaces in SQL
...
public void save(Serializable object)
    throws NotLocked
{
    File classDirectory;

    classDirectory = new File(baseDirectory, object.getClass().getName());
    if (!classDirectory.exists()) classDirectory.mkdirs();
    // .. calculate key ..
    // .. check if this object has been locked, lock the key ..
    // .. serialise and save the object ..
    // .. unlock ..
}
</code>

To get back on topic: the major benefits of doing it like this, with a
single file per object instance, are:

- deletion is a sequence like lock/delete file/unlock.
- saving is lock/create file/unlock.
- sequential search should not incur extra overhead (but is not synchronized).

One catch I can see now is that it requires the Serializable class to
have some kind of unique id, meaning all objects should be unique.

This way everything is still accessible using random access methods, but
with the filesystem acting as 'one big file'. On the other hand, if I am
correct, the OS itself usually does the caching of file access, so you don't
need to go to great lengths to do caching (hopefully).

And what I find most important: memory usage of this approach should be
really low, since it will only use the memory of a single object plus
some serialisation overhead.

I am just not sure what to do about the indexing of the serialised objects. I
am planning to use the filenames as the key to an entry. Should I extend
the Serializable contract to enforce the presence of a key value? I can't
rely on the hashCode() method. Or should I do something like an MD5/SHA
hash on the byte representation of the serialised object? That should
provide unique key values, shouldn't it? They would be consistent across
several executions of the application.
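
A sketch of the hashing idea, assuming the serialised byte stream is stable
across runs (which holds as long as the class's serialised form does not
change); class and method names are made up:

<code>
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ObjectKeys {
    // Serialise the object and return the MD5 of the resulting bytes as a hex string.
    public static String md5Key(Serializable object)
            throws IOException, NoSuchAlgorithmException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(object);
        out.close();

        byte[] digest = MessageDigest.getInstance("MD5").digest(bytes.toByteArray());

        StringBuffer hex = new StringBuffer();
        for (int i = 0; i < digest.length; i++) {
            int b = digest[i] & 0xff;
            if (b < 16) hex.append('0');
            hex.append(Integer.toHexString(b));
        }
        return hex.toString(); // short enough to use directly as a filename
    }
}
</code>

The catch is that the hash identifies content, not identity: two objects that
serialise to the same bytes get the same key, and modifying an object changes
its key, so an update becomes delete-old-file/write-new-file, much like the
single-file case above.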

I am planning on implementing this; can anybody provide some useful input
beforehand if this is doomed to fail?

Greets
Bhun.
 
