Compress int array, again

Kevin

(I searched through this group, but found little about this.)

In my program, there are a lot of arrays of type int[]. They are usually
100 - 1000 in length, with the value of each entry being a normal
integer. These int[] need to be stored in memory for fast access by
other code.

I heard (actually, our manager wants it) that there are ways to save
some memory by storing these int[] in "compressed" form. Could anyone
point me to some example code for this?

I want it to be "lossless", and to have a range similar to int (our
applications do not require the exact range of int).

Thanks a lot and happy holidays!
 
Roedy Green

I want it to be "lossless", and to have a range similar to int (our
applications do not require the exact range of int).

You could serialise the array and GZIP it. GZIP will look for
duplicates and replace them with a common token.

See http://mindprod.com/applets/fileio.html for how.

Tell it you want to serialise objects to a byte array with
compression. The catch is you can't use it in compressed form. You
have to fluff it up again first.
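
A minimal sketch of that in-memory round trip, purely for illustration
(the class name GzipIntArray is invented, not from Roedy's site):

import java.io.*;
import java.util.zip.*;

// Sketch: serialise an int[] through GZIP into a byte array held in memory,
// and "fluff it up again" on demand. No disk I/O involved.
public class GzipIntArray {

    static byte[] compress(int[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos =
                new ObjectOutputStream(new GZIPOutputStream(bos))) {
            oos.writeObject(data);      // writes the array plus serialisation header
        }
        return bos.toByteArray();       // the compressed form, in memory
    }

    static int[] decompress(byte[] compressed)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return (int[]) ois.readObject();
        }
    }
}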
 
Alun Harford

Kevin said:
(I searched through this group, but found little about this.)

In my program, there are a lot of arrays of type int[]. They are usually
100 - 1000 in length, with the value of each entry being a normal
integer. These int[] need to be stored in memory for fast access by
other code.

I heard (actually, our manager wants it) that there are ways to save
some memory by storing these int[] in "compressed" form. Could anyone
point me to some example code for this?

I want it to be "lossless", and to have a range similar to int (our
applications do not require the exact range of int).

Well, I guess you could Huffman-code each array byte-by-byte (or maybe
there's a better way) and keep an index that tells you where element x
starts (where x is a multiple of some integer n), but I doubt a small
array of ints is going to compress very well (unless you frequently use
the same or similar integers, I guess).
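
One ready-made way to try this: java.util.zip.Deflater can be limited to
Huffman coding via setStrategy(Deflater.HUFFMAN_ONLY), so the byte-by-byte
idea needs no hand-written coder. A minimal sketch, with the class and
method names invented for illustration:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch: Huffman-only compression of one int[] via java.util.zip.
public class HuffmanSketch {
    static byte[] huffmanOnly(int[] values) {
        byte[] raw = new byte[values.length * 4];
        for (int i = 0; i < values.length; i++) {       // big-endian int -> 4 bytes
            raw[4 * i]     = (byte) (values[i] >>> 24);
            raw[4 * i + 1] = (byte) (values[i] >>> 16);
            raw[4 * i + 2] = (byte) (values[i] >>> 8);
            raw[4 * i + 3] = (byte)  values[i];
        }
        Deflater deflater = new Deflater();
        deflater.setStrategy(Deflater.HUFFMAN_ONLY);    // Huffman codes only, no LZ77 matching
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished())
            bos.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return bos.toByteArray();
    }
}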

Alun Harford
 
Knute Johnson

Kevin said:
(I searched through this group, but found little about this.)

In my program, there are a lot of arrays of type int[]. They are usually
100 - 1000 in length, with the value of each entry being a normal
integer. These int[] need to be stored in memory for fast access by
other code.

I heard (actually, our manager wants it) that there are ways to save
some memory by storing these int[] in "compressed" form. Could anyone
point me to some example code for this?

I want it to be "lossless", and to have a range similar to int (our
applications do not require the exact range of int).

Thanks a lot and happy holidays!

I would think that any compression scheme is going to take longer to
compress and uncompress than just reading the full-size integers. I
don't think that Java runs on any 16-bit processors, so an integer
access is just one hardware instruction. I think your manager is just
wasting his time and yours.
 
Kevin

Thanks a lot for the replies.

Since I need to keep the int[] in memory, I think we cannot gzip them
out to disk.


I did some tests using a naive approach:

For each int[] array, I convert each int to 4 bytes, giving a byte[] of
4 times the length.

I compress this byte[] using Java's Deflater class, and store the
compressed byte array in memory.

For my data, before compression the int[]s take about 170 MB of memory;
after compression the byte[]s take about 52 MB. The compression time is
a killer, about 4-5 minutes. But since this can be considered a
"preprocessing" step of my data and done offline, it does not seem to be
a major issue for my application at hand.

I am not sure about the decompression time at run time, though. But
since my code does not need to access all the int[]s at once (it only
accesses 10-50 of them at a time), maybe the run time will be
acceptable, I guess.

Of course, we have to decompress each int[] array in order to access
even one entry in it. But since my code uses each int[] as a whole each
time, this does not seem to be a problem here.
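
A minimal sketch of that round trip, assuming default Deflater settings
(the class name and buffer size are illustrative, not Kevin's actual code):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch of the approach described above: int[] -> byte[] -> Deflater,
// and Inflater back again when an array is needed.
public class DeflatedIntArray {

    static byte[] compress(int[] values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * 4);
        for (int v : values) bb.putInt(v);              // 4 bytes per int
        Deflater deflater = new Deflater();
        deflater.setInput(bb.array());
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished())
            bos.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return bos.toByteArray();
    }

    static int[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished())
            bos.write(buf, 0, inflater.inflate(buf));
        inflater.end();
        ByteBuffer bb = ByteBuffer.wrap(bos.toByteArray());
        int[] values = new int[bb.remaining() / 4];
        for (int i = 0; i < values.length; i++)
            values[i] = bb.getInt();
        return values;
    }
}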

I just wonder whether the approach above is the best method we can get?
 
Roedy Green

Since I need to keep the int[] in memory, I think we cannot gzip them
out to disk.

You can GZip to a ByteArrayOutputStream and then to a byte array. You
need not do any disk I/O.

But as I said, getting something out is a very expensive operation.
 
Erik Andreas Brandstadmoen

Kevin said:
In my program, there are a lot of arrays of type int[]. They are usually
100 - 1000 in length, with the value of each entry being a normal
integer. These int[] need to be stored in memory for fast access by
other code.

I heard (actually, our manager wants it) that there are ways to save
some memory by storing these int[] in "compressed" form. Could anyone
point me to some example code for this?

I'm sorry, but my curiosity just has to come forward: what on earth are
you doing with this, and have you investigated whether some of Java's
built-in collection classes can do some of the work for you?

There might of course be very good reasons for doing what you do, but
I'm just curious. And in answer to your question, of course you can
store things in different ways (e.g. compressed) in memory, but I wonder
if it's a better idea to keep what you don't need on disk and read it
when you need it, because reading from disk is not necessarily slower
than unzipping your in-memory data structures. Depending on hardware, of
course (or maybe not? I'm not very good at compression theory).

Whichever solution you choose, you will of course always be trading
performance for reduced memory usage.

Regards, Erik Brandstadmoen
 
Knute Johnson

Kevin said:
Thanks a lot for the replies.

Since I need to keep the int[] in memory, I think we cannot gzip them
out to disk.


I did some tests using a naive approach:

For each int[] array, I convert each int to 4 bytes, giving a byte[] of
4 times the length.

I compress this byte[] using Java's Deflater class, and store the
compressed byte array in memory.

For my data, before compression the int[]s take about 170 MB of memory;
after compression the byte[]s take about 52 MB. The compression time is
a killer, about 4-5 minutes. But since this can be considered a
"preprocessing" step of my data and done offline, it does not seem to be
a major issue for my application at hand.

I am not sure about the decompression time at run time, though. But
since my code does not need to access all the int[]s at once (it only
accesses 10-50 of them at a time), maybe the run time will be
acceptable, I guess.

Of course, we have to decompress each int[] array in order to access
even one entry in it. But since my code uses each int[] as a whole each
time, this does not seem to be a problem here.

I just wonder whether the approach above is the best method we can get?

Does your data use the full range of an int? Could you store your data
in a short?

If your average array size is 550 ints and you have 170 MB of
uncompressed arrays, then that means you have approximately 77,272 arrays?

If it takes 4 to 5 minutes to compress your data, uncompressing all of
it will take considerable time too. Even if the decompression is spread
out over time, the net result is the same.

The fastest access will be an int[] in memory. The second fastest will
probably be off a hard disk with good buffering. Most hard drives have
multiple-megabyte buffers these days. The other thing you might look at
is MappedByteBuffer.
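
A minimal sketch of the MappedByteBuffer route; the file name follows the
test program later in the thread, and the offset and length values are
made up for illustration:

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: memory-map the big file of ints and pull one array's worth
// on demand; the OS pages in only the regions actually touched.
public class MappedInts {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("ran.dat", "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer map =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            IntBuffer ints = map.asIntBuffer();  // view the mapping as ints

            int offset = 1_000_000;              // element index where an array starts
            int length = 550;                    // that array's length in ints
            int[] array = new int[length];
            ints.position(offset);
            ints.get(array);                     // copy one array out of the mapping
        }
    }
}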
 
Kevin

Thanks again.

Yes, the data has many, many int[] arrays. Basically, each array records
counts of a data structure similar to a "data cube" in data mining.

The application is an interactive, GUI-based one. Each int[] array
basically corresponds to a data matrix to be manipulated and displayed
on screen. The application requires that when the user clicks the mouse,
the software displays the visualization immediately. For this reason, it
is not possible to read the required int[] arrays in from disk on the fly*.
One visualization screen usually only needs 1 to 100 int[] arrays, which
means that though we can have many, many int[] arrays, at any one time
we only need to access 1-100 of them (kind of random, since we never
know which ones the user wants to use next). I think uncompressing 1-100
arrays (most of the time only 1-5) on the fly should be OK in terms of
speed. So I choose to store the compressed ones in memory, and
decompress only the necessary ones on the fly. (Note: the application
accesses these int[] arrays as read-only. No changes will be made to them.)

*There is another way, I think: segment these int[] arrays into many
files, and read in only one file on the fly, which should also be fast
enough (reading one big file is slow, about 8 seconds as I tested). But
since the user may not want to see many files for one project, this is
less attractive.

Also, though 170 MB of data in memory is not that big, as we have 1 GB
of memory on the machines, other parts of the application will use large
amounts of memory too. Later, it is also possible we will have more data
(more int[] arrays). So it is better to reduce their memory usage
somehow.

(Using 64-bit Java is another way, but we do not have it for this
project.)
 
Knute Johnson

Kevin said:
Thanks again.

Yes, the data has many, many int[] arrays. Basically, each array records
counts of a data structure similar to a "data cube" in data mining.

The application is an interactive, GUI-based one. Each int[] array
basically corresponds to a data matrix to be manipulated and displayed
on screen. The application requires that when the user clicks the mouse,
the software displays the visualization immediately. For this reason, it
is not possible to read the required int[] arrays in from disk on the fly*.

I think you are seriously underestimating the performance of disk I/O.
I wrote a test program to read an array of between 100 and 1000 ints out
of a file that holds 170 MB of ints. Even with my slow computer it
averages 10 ms or less per array read. I don't think 10 ms would be too
long to wait.

import java.io.*;
import java.util.*;

public class test10 {
    public static void main(String[] args) throws Exception {
        /* creates a 170 MB file of ints
        RandomAccessFile file = new RandomAccessFile("ran.dat", "rw");
        for (int i = 0; i < 42500000; i++)
            file.writeInt(i);
        file.close();
        */
        int iterations = 10000;
        Random random = new Random(System.currentTimeMillis());
        RandomAccessFile file = new RandomAccessFile("ran.dat", "r");
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            int arrayLength = random.nextInt(900) + 100;   // 100-999 ints
            int[] buf = new int[arrayLength];
            int where = random.nextInt(42500000);          // random starting element
            file.seek(where * 4L);                         // 4 bytes per int
            try {
                for (int j = 0; j < arrayLength; j++)
                    buf[j] = file.readInt();
            } catch (EOFException eofe) {                  // array may run off the end
                System.out.println(eofe);
            }
        }
        file.close();
        long end = System.currentTimeMillis();
        System.out.println((end - start) / 1.0 / iterations);  // average ms per array

        /* this is slightly faster than above
        int iterations = 10000;
        Random random = new Random(System.currentTimeMillis());
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            FileInputStream fis = new FileInputStream("ran.dat");
            BufferedInputStream bis = new BufferedInputStream(fis);
            DataInputStream dis = new DataInputStream(bis);
            int arrayLength = random.nextInt(900) + 100;
            int[] buf = new int[arrayLength];
            int where = random.nextInt(42500000);
            dis.skip(where * 4L);
            try {
                for (int j = 0; j < arrayLength; j++)
                    buf[j] = dis.readInt();
            } catch (EOFException eofe) {
                System.out.println(eofe);
            }
            dis.close();
        }
        long end = System.currentTimeMillis();
        System.out.println((end - start) / 1.0 / iterations);
        */
    }
}
 
Roedy Green

I think you are seriously underestimating the performance of disk I/O.
I wrote a test program to read an array of between 100 and 1000 ints out
of a file that holds 170 MB of ints.

You want to arrange things so you can do the read in one physical I/O.
You might store your arrays one after the other in a random access file,
keeping in RAM a record of where each array starts. If you need a new
array, just tack it on the end.

See http://mindprod.com/projects/hermitcrab.html
for how to do it if you require the ability to update and change the
length of the arrays on the fly.
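
A minimal sketch of that layout, assuming the arrays are written once and
then treated as read-only (the class name ArrayStore and the index arrays
are invented for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: arrays stored back to back in one file; a small in-RAM index of
// (start offset, length) lets each array be fetched with one seek + one read.
public class ArrayStore {
    private final RandomAccessFile file;
    private final long[] offsets;   // byte offset where array i starts
    private final int[] lengths;    // number of ints in array i

    ArrayStore(RandomAccessFile file, long[] offsets, int[] lengths) {
        this.file = file;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    int[] read(int i) throws IOException {
        byte[] raw = new byte[lengths[i] * 4];
        file.seek(offsets[i]);
        file.readFully(raw);                        // one contiguous read
        int[] values = new int[lengths[i]];
        for (int j = 0; j < values.length; j++)     // big-endian bytes -> int
            values[j] = ((raw[4 * j] & 0xff) << 24)
                      | ((raw[4 * j + 1] & 0xff) << 16)
                      | ((raw[4 * j + 2] & 0xff) << 8)
                      |  (raw[4 * j + 3] & 0xff);
        return values;
    }
}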
 
Kevin

Thanks so much.
I think I was still under the impression, from the old days, that Java
has slow I/O.
Your post gives me a nice hint on this.
Thank you, and now I know what I can do with my code.
 
Alun Harford

Kevin said:
Thanks again.

Yes, the data has many, many int[] arrays. Basically, each array records
counts of a data structure similar to a "data cube" in data mining.

The application is an interactive, GUI-based one. Each int[] array
basically corresponds to a data matrix to be manipulated and displayed
on screen. The application requires that when the user clicks the mouse,
the software displays the visualization immediately. For this reason, it
is not possible to read the required int[] arrays in from disk on the fly*.
One visualization screen usually only needs 1 to 100 int[] arrays, which
means that though we can have many, many int[] arrays, at any one time
we only need to access 1-100 of them (kind of random, since we never
know which ones the user wants to use next). I think uncompressing 1-100
arrays (most of the time only 1-5) on the fly should be OK in terms of
speed. So I choose to store the compressed ones in memory, and
decompress only the necessary ones on the fly. (Note: the application
accesses these int[] arrays as read-only. No changes will be made to them.)

*There is another way, I think: segment these int[] arrays into many
files, and read in only one file on the fly, which should also be fast
enough (reading one big file is slow, about 8 seconds as I tested). But
since the user may not want to see many files for one project, this is
less attractive.

Also, though 170 MB of data in memory is not that big, as we have 1 GB
of memory on the machines, other parts of the application will use large
amounts of memory too. Later, it is also possible we will have more data
(more int[] arrays). So it is better to reduce their memory usage
somehow.

Why not read the data in from disk (you should be able to know where in the
file the int[] that you want is located, so you don't need to read the whole
(large) file) and leave soft references to the arrays that you've already
read in?
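
A minimal sketch of that soft-reference caching idea (the cache class is
invented, and ArrayStore is the hypothetical on-disk store sketched
earlier in the thread):

import java.io.IOException;
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Sketch: keep arrays already read from disk behind SoftReferences, so the
// GC can reclaim them under memory pressure and we simply re-read on a miss.
public class SoftArrayCache {
    private final ArrayStore store;  // hypothetical on-disk store from above
    private final Map<Integer, SoftReference<int[]>> cache = new HashMap<>();

    SoftArrayCache(ArrayStore store) {
        this.store = store;
    }

    int[] get(int i) throws IOException {
        SoftReference<int[]> ref = cache.get(i);
        int[] values = (ref == null) ? null : ref.get();
        if (values == null) {                       // never read, or reclaimed by GC
            values = store.read(i);
            cache.put(i, new SoftReference<>(values));
        }
        return values;
    }
}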

Alun Harford
 
