raj said:
int m[1000][1000];
bool m[1000][1000]; // assuming I use C++
char m[1000][1000];
unsigned char m[1000][1000];
Like many people, you make the mistake of assuming that bools occupy a
single bit (note also that C99 includes a bool type much like the C++
one). There is a difficult-to-predict performance tradeoff here that
will vary from machine to machine. Here's a comparison of the advantages
and disadvantages:
int m[1000][1000];
This stores a single value in each word, consuming at least one million
words. Random access requires only the address calculation plus a single
read/write operation on most machines, with no need for bit operations.
That makes this alternative good for random access on machines with weak
bit-level capabilities. If the array is going to be traversed
sequentially on a machine with a data cache, though, it will produce
many more cache misses than the other approaches.
char m[1000][1000];
This stores a single value in each byte, so that you only need about a
million bytes. If you traverse the array in order on a machine with a
data cache, you'll get far fewer cache misses with this version, and on
byte-addressable machines with instructions to fetch bytes, no bit
operations are needed on top of address calculation. On word-addressable
machines with no special instructions for loading bytes, however, this
may turn out to compile into something as slow as the bit array below.
It also still takes 8 times as much space as the bit array, and has
about 8 times as many cache misses when traversed in order on machines
with data caches.
#define INT_BITS (sizeof(int)*8)
#define CEIL_DIV(x,y) (((x)+(y)-1)/(y))
unsigned int m[1000][CEIL_DIV(1000,INT_BITS)];
#define M_GET(r,c) (m[(r)][(c)/INT_BITS] & (1u << ((c) % INT_BITS)))
#define M_SET(r,c) (m[(r)][(c)/INT_BITS] |= (1u << ((c) % INT_BITS)))
#define M_CLEAR(r,c) (m[(r)][(c)/INT_BITS] &= ~(1u << ((c) % INT_BITS)))
This creates an array of bits and some macros for accessing it. Although
the division and modulus operations will probably compile to bit
operations, each random access to a non-constant index pair will require
on most machines at least two ANDs and two shifts that the word version
doesn't. Accesses to fixed or partially fixed indexes are faster, since
the compiler can precompute the index and bit mask (shown here for an
INT_BITS of 32):
M_GET(40,65)
(m[(40)][(65)/INT_BITS] & (1u << ((65) % INT_BITS)))
m[40][65/32] & (1u << (65 % 32))
m[40][2] & (1u << 1)
m[40][2] & 2
Sequential access actually can yield very good performance, since you
can do that with a loop like this:
unsigned int current_word, i, j, mask;
for (i = 0; i < 1000; i++) {
    for (j = 0; j < CEIL_DIV(1000, INT_BITS); j++) {
        current_word = m[i][j];
        for (mask = 1; mask != 0; mask <<= 1) {
            unsigned int current_bit = current_word & mask; /* get the current bit */
            current_word |= mask;                           /* set the current bit */
            current_word &= ~mask;                          /* clear the current bit */
        }
        m[i][j] = current_word;
    }
}
This amounts to two bit operations and 2/INT_BITS word accesses per
element on average, which is often faster than the word-per-element
version; on machines with caches, it also cuts cache misses by a factor
of about INT_BITS.
Even if you only do random access, the bit array reduces your space
requirements by a factor of INT_BITS, so a few extra cycles per access
are justified in many applications.