vectorizor
Hi all,
I am writing an image processing algorithm that goes across an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixel, the output is computed as the sum of
the input and its previous neighbour. In the horizontal pass (i.e.
row-wise), the previous neighbour is the pixel on the left. This is
followed by a vertical pass, in which the previous neighbour is the
pixel on top.
I quickly ran into a problem with the column-wise pass: processing
pixels vertically was _very_ slow (0.31 s vs. 0.009 s for the
horizontal pass). I have since learned that this is due to cache
misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.
It was suggested that I try a block processing approach, whereby the
image is stored by blocks rather than by rows. A block is typically
32x32, so in memory the first line of a block is followed by its
second line, then its third, and so on, until we move on to the next
block and the pattern repeats. This has the advantage of ensuring
locality: both horizontal and vertical neighbouring pixels stay close
in memory, so there should be fewer cache misses, and the horizontal
and vertical passes should be approximately equally fast.
This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for the
horizontal pass. For an image stored in blocks, a horizontal pass
takes 0.030 s, whereas with the original memory layout (row after
row, let's call it linear), the horizontal pass took only 0.009 s,
over 3 times faster! This cannot be due to cache misses, since
locality is good in both cases; rather, the code to access a pixel is
more costly in the block case. Compare the following, copied from my
implementation:
// linear access to a pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * (width)) + (col))

// block access to a pixel
#define PXL(img, width, col, row) \
    ((img) + (((row) & ~BLOCK_Y_MASK) * (width)) \
           + (((col) >> BLOCK_X_SHIFT) * BLOCK_SIZE) \
           + (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) \
           + ((col) & BLOCK_X_MASK))
with

enum
{
    BLOCK_X_SHIFT = 5,
    BLOCK_Y_SHIFT = 5,
    BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
    BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
    BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
    BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
    BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};
Accessing a pixel is thus obviously much more expensive, and I have
not been able to speed things up despite all my efforts. Does anyone
have ideas or suggestions that might help, or know of any good
references discussing this problem? Thanks in advance,
Angus