Creating and Accessing Very Big Arrays in C

H

Herbert Rosenau

Perhaps if you explained what you want to do, we can suggest ways of
doing it without requiring ridiculous amounts of memory.

I think this is very wise. Thanks for everyone's comments so far.

I'm effectively creating massive 7 dimensional hypermatricies in C.
These matrices are of type Char and only contain boolean values (1 or
0). They are incredibly sparse (I know that there are sparse matrix
techniques, but I haven't found any good implementations in C yet -
can you recommend any?).

I'll write in pseudo code, but I hope you get the gist.

Lets say I want to create a large 7D hypermatrix:

n = 100 ;
m = new boolean_matrix[n][n][n][n][n][n][n] ;

The problem is, this takes up way too much memory.

Hint: as you use only 1 bit of a variable you may simply use unsigned
int and therin 16, 32 or 64 bit. Or to make something easier to prot
use an array of char and use each of the 8 bits guaranteed by the
standard to reduce the number of vatiables needed.
To save memory, alternatively, I could create a 1D array that stores
n^7 elements:

Why does you think a nxnxnxnxnxnxnxn array costs more memory than an
array of size
n*n*n*n*n*n*n?
m = new Boolean_matrix[n*n*n*n*n*n*n] ;

When you needs C++ go to the C++ group. new is no C operator.
But n^7 (where n = 100) is bigger than the larges INT (2^32).

In C, it seems that I can create an array with a size contained within
DOUBLE number of elements, but I can't access the array once created,
hence my original problem:

Look at the definition of malloc on your documentation of your
compiler. I doin't think that malloc is defined as
void*malloc(long|double n); but as void*malloc(size_t n);
-------------------------------------------------------

I've managed to create arrays using DOUBLE data types, but when I try
to access the array, the compiler complains that the number is not an
INT, i.e.


// this succeeds
char* anArray = (char*) calloc( 999999999999999999 , sizeof(char) ) ;

No. You casts the result of malloc and that may enter you the lands of
undefined behavior.

Are here only idiots who are unable to readf what gets multiple times
in a week and multiple thousend times/year written here?
// this does not compile because the number 1234567891234584 is not an
INT
anArray[1234567891234584] = 1 ;

----------------------------------------------------------

There is certainly mileage in using other data storage classes to
represent my sparse matrices, i.e. hashtable, trees, etc - but I
haven't found any that are intuitive and/or efficient.

If anyone has any suggestions , I'd be sincerely grateful

Daniel
Ask the documentation of your system how many bytes a single
application can occupay. malloc will never give you even that amount
but something less.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 Deutsch ist da!
 
J

jmcgill

Ancient_Hacker said:
Nowdays memory is selling for about $60 a gigabyte

Memory that fits the machines that can address that kind of space
doesn't cost $60...

Maybe something in the grid world.

Maybe this problem can be addressed like a distributed.net project.
 
D

Dann Corbit

Ancient_Hacker said:
Good point. I think we were talking about the limit on the amount of
memory allocatable for this guy's array.

He wants a 100x100x100x100x100x100x100 array of boolean and is a bit
fuzzy on what this entails.

100,000,000,000,000 bits = 12,500,000,000,000 bytes

It's not impossible. He will have to use virtual memory on a 64 bit system.

Here is one terabyte for $575 (unwrap lines to examine):
http://www.compuplus.com/i-LaCie-Big-Disk-Extreme-10-
Terabyte-with-Triple-Interface-USB-20-FireWire-400-FireWire-
800-Hard-Drive-300797U-1005838~.html?sid=05gf807gg4xi7rw

So the disk would cost him about $8000.

If the array is sparse, he might not need anything so extreme.
 
K

Keith Thompson

Herbert Rosenau said:
No. You casts the result of malloc and that may enter you the lands of
undefined behavior.

No, the cast itself doesn't cause undefined behavior. If the code is
valid without the cast, it's valid with the cast. The cast *can* mask
certain errors, such as failing to #include <stdlib.h> or using a C++
compiler, and it's therefore a bad idea, but we need to be clear about
*why* it's a bad idea.

The 999999999999999999 argument is a problem, as I said in another
followup. sizeof(char) is 1 by definition. calloc() sets the
allocated memory to all-bits-zero, which is likely to be a waste of
time.
 
K

Keith Thompson

jmcgill said:
Memory that fits the machines that can address that kind of space
doesn't cost $60...

Maybe something in the grid world.

Maybe this problem can be addressed like a distributed.net project.

Maybe, but the OP has made it clear that the array is sparse. Trying
to figure out how to allocate the whole thing is a waste of time; he
needs to figure out a compact representation for his sparse array.

(If the full array were a reasonable size, representing the whole
thing, holes and all, might be a good approach; the resulting code
could be easier to write and quicker to execute than something that
has to traverse a more complex data structure. The meaning of
"reasonable size" varies over time, as memories get larger, faster,
and cheaper. But we're not yet to the point where an array of 10**14
elements is reasonable.
 
A

Ancient_Hacker

Dann said:
100,000,000,000,000 bits = 12,500,000,000,000 bytes

It's not impossible. He will have to use virtual memory on a 64 bit system.

Virtual memory is a particularly poor performer when used with sparse
arrays.
Poor as in easily a million times slower than real RAM. If we assume a
page size of 2MB (x86 large), 1GB of real RAM, and 1% sparse matrix,
every bit randomly accessed is likely to cause a page fault ( there's
only once chance in 550 it's in RAM). Seeking over 550 GB and reading
2MB to just access one bit is a huge performance hit.


He's either gotta use sparse matrix techniques, remove a dimension or
two, or shrink the extent of each dimension, as noted in my prevous
post.
 
D

Dann Corbit

Ancient_Hacker said:
Virtual memory is a particularly poor performer when used with sparse
arrays.
Poor as in easily a million times slower than real RAM. If we assume a
page size of 2MB (x86 large), 1GB of real RAM, and 1% sparse matrix,
every bit randomly accessed is likely to cause a page fault ( there's
only once chance in 550 it's in RAM). Seeking over 550 GB and reading
2MB to just access one bit is a huge performance hit.


He's either gotta use sparse matrix techniques, remove a dimension or
two, or shrink the extent of each dimension, as noted in my prevous
post.

Right. Hence my comment about sparse arrays below. I had not read the
original post and only responded to the content that you see.
 
O

Old Wolf

Keith said:
So the previous question can be changed to: why would one ever use a
suffixed *integer* constant?

One example is an argument to a variadic function, but that's probably
fairly rare. In most contexts, numeric values will be implicitly
converted to the required type.

When it's used in an expression that needs wider
precision, or needs to be unsigned, eg:

long x = 1000L * 1000;

unsigned int x = a * 256u + b;
 
J

jmcgill

Keith said:
Maybe, but the OP has made it clear that the array is sparse. Trying
to figure out how to allocate the whole thing is a waste of time; he
needs to figure out a compact representation for his sparse array.

Now I'm following the thread in hopes of learning something of the
techniques for dealing with sparse matrices. I even located the box
that might contain a couple of linear algebra books just in case.
 
K

Keith Thompson

jmcgill said:
Now I'm following the thread in hopes of learning something of the
techniques for dealing with sparse matrices. I even located the box
that might contain a couple of linear algebra books just in case.

To review: you were originally trying to create a 7-dimensional array,
with the length of each array being 100. That's 1e14 elements, which
is way beyond what's feasible -- but only a few of those elements will
actually be used.

The most obvious solution that springs to mind is a hash table, a data
structure that maps a key to a value. Construct the key from the 7
indices. Our own CBFalconer provides a freeware hash table package;
see <http://cbfalconer.home.att.net/download/index.htm>. (I haven't
used it myself, but Chuck's reputation leads me to assume it's good
code.)

A simpler solution is a simple linear array of structures, where each
structure contains the 7 indices and the data element. But this
requires you to search the array (either linearly or using a binary
search), and is likely to be less efficient than a hash table. Or you
can use an AVL tree, or whatever data structure strikes your fancy.

If there's some consistent pattern to the set of elements that you're
using, you might be able to take advantage of that. For example, if
the 2nd, 3rd, 4th, 5th, and 6th indices are always zero, you can use a
2-dimensional array; that's an absurd example, but you get the idea.
 
J

jmcgill

Keith said:
To review: you were originally trying to create a 7-dimensional array,
with the length of each array being 100. That's 1e14 elements, which
is way beyond what's feasible -- but only a few of those elements will
actually be used.

The most obvious solution that springs to mind is a hash table, a data
structure that maps a key to a value.

It's the only solution that I came up with, and it sprang to mind
immediately. Need to learn to trust my instincts I guess.
 
R

Richard Bos

Al Balmer said:
That one I get.


This one I don't. Why would you need a cast there?

This seems OK (at least the compiler doesn't complain):

int compare(const void *arg1, const void *arg2)
{
const int *warg1 = arg1;
const int *warg2 = arg2;

return *warg1 - *warg2;
}

You don't need one if you write it like this. You do need one (well,
two, actually) if you write it like this:

int compare(const void *arg1, const void *arg2)
{
return *(const int *)arg1 - *(const int *)arg2;
}

Some people prefer one way; some the other. Both are valid, a good C
programmer is familiar with both, and the latter needs the casts.

Richard
 
K

Keith Thompson

You don't need one if you write it like this. You do need one (well,
two, actually) if you write it like this:

int compare(const void *arg1, const void *arg2)
{
return *(const int *)arg1 - *(const int *)arg2;
}

Some people prefer one way; some the other. Both are valid, a good C
programmer is familiar with both, and the latter needs the casts.

Note that this comparison function fails if the arguments are, for
example, INT_MIN and INT_MAX. Better:

int compare(const void *arg1, const void *arg2)
{
if (*(int *)arg1) < *(int *)arg2) {
return -1;
}
else if (*(int *)arg1) < *(int *)arg2) {
return 1;
}
else {
return 0;
}
}
 
D

djhulme

Are here only idiots who are unable to readf what gets multiple times
in a week and multiple thousend times/year written here?

Derogatory remarks aren't useful. I can assure you that I'm not an
idiot, and I'd appreciate it if you'd talk to me (and anyone else
for that matter) as if you're talking to someone who is relatively
competent, in my case a Computer Scientist.
What is an "INT"? Remember that C is case-sensitive. There is, as
you know, a built-in type called "int".
It's "int", not "INT".

I know. For clarity I was trying to distinguish between discussing what
could be considered typos (e.g, int) and datatypes (e.g, INT). Sorry
for the confusion.
m = new Boolean_matrix[n*n*n*n*n*n*n] ;

When you needs C++ go to the C++ group. new is no C operator.

That's why I said that I was writing in Pseudo code - I don't
want to get bogged down in the specific code - my question was very
general.
The most obvious solution that springs to mind is a hash table, a data
structure that maps a key to a value. Construct the key from the 7
indices. Our own CBFalconer provides a freeware hash table package;
see <http://cbfalconer.home.att.net/download/index.htm>. (I haven't
used it myself, but Chuck's reputation leads me to assume it's good
code.)

Once again, thanks for everyone's comments.

I've implemented a hashtable, but it is very slow. In fact, my
'n' isn't always 100, in some cases it may be 1000 or more. I
definitely think that sparse matrix representations are the way
forward, but haven't found any good (multidimensional) ones as yet.
Why do you think that would save memory? You're still storing the
same number of elements.

As for memory differences between storing multidimensional arrays
versus a one dimensional array with the same number of elements, I
don't know the specifics, but it has something to do with how much
memory a pointer takes up.

For instance (forgive me if I don't get the calculations exact, this
is just for illustration), if I wanted to store a 100x100x100 3D array
of 'chars', a 1D array would allocate 1,000,000 bytes. If I created
a 3D array, I would need to have 100 pointers, with each pointer
pointing to 100 pointers, which each point to 100bytes:

assuming a pointer is 32bits = 4bytes

100x(4x100)x(4x100))) = 16,000,000bytes
16 times more memory than a 1D matrix with the same number of elements.

If anyone knows of any good (simple and optimised) sparse matrix
libraries in C, or anyone has any better ideas to solve my problem,
I'd be very grateful to hear them.

Thanks again,

Daniel

PS I have 3gig of memory and probably the maximum 7D hypermatrix of
chars I will need to create is when my 'n' is about 2000. My
matrices are probably 75% sparse (75% zeros too 25% ones) .
 
P

Philip Potter

I know. For clarity I was trying to distinguish between discussing what
could be considered typos (e.g, int) and datatypes (e.g, INT). Sorry
for the confusion.

No worries. Just remember that in the C world it's *less* confusing to use
lowercase builtin types, since that's the only form the compiler is
guaranteed to accept, and hence the form everyone is used to.
As for memory differences between storing multidimensional arrays
versus a one dimensional array with the same number of elements, I
don't know the specifics, but it has something to do with how much
memory a pointer takes up.

For instance (forgive me if I don't get the calculations exact, this
is just for illustration), if I wanted to store a 100x100x100 3D array
of 'chars', a 1D array would allocate 1,000,000 bytes. If I created
a 3D array, I would need to have 100 pointers, with each pointer
pointing to 100 pointers, which each point to 100bytes:

assuming a pointer is 32bits = 4bytes

100x(4x100)x(4x100))) = 16,000,000bytes
16 times more memory than a 1D matrix with the same number of elements.

You're a bit confused here.

char c1d[1000000];
This allocates 10^6 bytes (well, chars).

char c3d[100][100][100];
This *also* allocates 10^6 bytes.
There are no pointers here. This is an array-of-arrays-of-arrays-of-char.
Please look at the comp.lang.c FAQ, question 6.2.

If you were doing this using pointers:

char *c1d = malloc(1000000);

This allocates sizeof(char *) + 1000000 * sizeof(char).

int i,j;
char ***c3d = malloc(100 * sizeof(*c3d));
for(i=0; i<100; i++)
c3d = malloc(100 * sizeof(*c3d));
for(i=0; i<100; i++)
for(j=0; j<100; j++)
c3d[j] = malloc(100 * sizeof(*c3d[j]);

This allocates:
1 * sizeof(char ***) for c3d itself
100 * sizeof(char **) for c3d[] (first malloc)
100 * 100 * sizeof(char *) for c3d[][] (second set of mallocs)
100 * 100 * 100 * sizeof(char) for c3d[][][] (third set of mallocs)

assuming sizeof(any pointer) == 4 * sizeof(char), the total is:
4 + 400 + 40000 + 1000000
= 1040404 bytes. Not much bigger than that 1d array, really.
PS I have 3gig of memory and probably the maximum 7D hypermatrix of
chars I will need to create is when my 'n' is about 2000. My
matrices are probably 75% sparse (75% zeros too 25% ones) .

That's not particularly sparse. Storing all the bits would be more efficient
than storing the positions of all of the set bits, which is what a sparse
representation would do.

I think you could get much better space saving from packing 8 bits into 1
byte. However, you're still a bit stymied when you have to store 2000^7
bits. Is there any information in the data you could take advantage of? Do
the bits tend to come in groups, and could you take advantage of this?

Philip

PS You may remember me as "the guy with the long hair" from lectures. :)
 
R

Richard Heathfield

Keith Thompson said:

Note that this comparison function fails

[...because it uses subtraction...]
if the arguments are, for
example,

[...pointers to objects with the values...]
INT_MIN and INT_MAX. Better:

int compare(const void *arg1, const void *arg2)
{
if (*(int *)arg1) < *(int *)arg2) {
return -1;
}
else if (*(int *)arg1) < *(int *)arg2) {
return 1;
}
else {
return 0;
}
}

You're quite right - but I would prefer this:

int compare(const void *vi1, const void *vi2)
{
const int *i1 = vi1;
const int *i2 = vi2;
return (*i1 > *i2) - (*i1 < *i2);
}

partly because it's shorter, partly because it's simpler, partly because
it's easier to read (admittedly only if you are accustomed to the idiom it
uses!), and partly because it doesn't cast away constness.
 
K

Keith Thompson

Richard Heathfield said:
Keith Thompson said:

Note that this comparison function fails

[...because it uses subtraction...]
if the arguments are, for
example,

[...pointers to objects with the values...]
INT_MIN and INT_MAX. Better:

int compare(const void *arg1, const void *arg2)
{
if (*(int *)arg1) < *(int *)arg2) {
return -1;
}
else if (*(int *)arg1) < *(int *)arg2) {
return 1;
}
else {
return 0;
}
}

You're quite right - but I would prefer this:

int compare(const void *vi1, const void *vi2)
{
const int *i1 = vi1;
const int *i2 = vi2;
return (*i1 > *i2) - (*i1 < *i2);
}

partly because it's shorter, partly because it's simpler, partly because
it's easier to read (admittedly only if you are accustomed to the idiom it
uses!), and partly because it doesn't cast away constness.

Agreed, and the use of the result of the "<" and ">" operators is a
nice touch -- clever, but not quite *too* clever. I suppose I didn't
see the need to leave the "const" in the pointer casts, since the
result was just being immediately dereferenced.

I might prefer this:

int compare(const void *vi1, const void *vi2)
{
const int i1 = *(int*)vi1;
const int i2 = *(int*)vi2;
return (i1 > i2) - (i1 < i2);
}

It does have casts, which yours doesn't (a point in your favor), but
it uses them to extract the int values, which are what we're really
interested in.

(This is an alternative; it's not a clearly superior one.)
 
E

ena8t8si

tgur said:
Could you explain, why casting void pointers is bad?

Casting in general is bad. Writing a cast where it
isn't necessary, because conversion would take care
of it, is bad. The reasons are casting can introduce
errors or disguise other errors. There are some cases
where casting is really necessary, but most new C
programmers cast more than they need to (and unfortunately
sometimes not when they should). Casting a void pointer
is almost never necessary, and in most cases where
a function won't compile without a cast the function
can be rewritten so an assignment takes care of the
conversion without needing to write the cast.
 
A

Al Balmer

You don't need one if you write it like this. You do need one (well,
two, actually) if you write it like this:

int compare(const void *arg1, const void *arg2)
{
return *(const int *)arg1 - *(const int *)arg2;
}

Some people prefer one way; some the other. Both are valid, a good C
programmer is familiar with both, and the latter needs the casts.
I'll tell the secret <g>. The reason I often do it my way is that I do
a good deal of upgrading legacy code to ISO-compatible. When I need to
introduce void * parameters, the method I outlined lets me change the
function definition line, add a line for each parameter, and not touch
the rest of the function.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top