Block processing

V

vectorizor

Hi all,

I am writing an image processing algorithm that goes across an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixels, the output is computed as the sum of
the input and its previous neighbourg. In the horizontal pass (i.e.
row-wise), the previous neighbourg is defined as the pixel on the
left. This is followed by a vertical pass, whereby the previous
neighbourg is defined as the pixel on top.

I quickly encountered a problem with the column-wise path, as
processing pixel in a vertical fashion was _very_ slow (0.31 vs. 0.009
for the horizontal). I since learned that it was due to cache misses.
Indeed, the image is stored row by row, so vertically-neighbouring
pixels are actually far away in memory, which entails all kind of
speed problems.

I was suggested to try a block processing approach, whereby the image
is stored by blocks rather than by rows. A block is typically 32x32,
so in memory, the first line of the block is followed by the second
line of the block, followed by the third, .... until we move on to the
next block, and the pattern is repeated. This has the advantage of
ensuring locality, i.e. both horizontal and vertical neighboring
pixels will be close, hence there should be less cache misses. Both
horizontal and vertical passes should be approximately as fast.

This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for
horizontal pass. For an image structured in blocks, an horizontal pass
is 0.030 sec. whereas with the original memory structure (row after
row, lets call it linear), the horizontal pass was only 0.009, 3 times
faster! This cannot be due to cache misses, as locality is pretty good
in both cases. The code to access a pixel is more costly in the case
of block processing. Compare the following, which is copied from my
implementation:

// linear access to pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * width) +
(col))

// block access to pixel
#define PXL(img, width, col, row) \
((img) + (((row) & (~BLOCK_Y_MASK)) * (width)) + (((col) >>
BLOCK_X_SHIFT) * BLOCK_SIZE) \
+ (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) + ((col) &
BLOCK_X_MASK))

with enum
{
BLOCK_X_SHIFT = 5,
BLOCK_Y_SHIFT = 5,
BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};

Accessing a pixel is thus obviously much slower, and I was not able to
speed things up despite all my efforst. Does someone have any ideas or
suggestions that may help, or know of any good references discussing
the problem? Thanks in advance

Angus
 
V

Victor Bazarov

I am writing an image processing algorithm that goes across an image
first row-wise and then column-wise. [..]

I quickly encountered a problem with the column-wise path, as
processing pixel in a vertical fashion was _very_ slow (0.31 vs. 0.009
for the horizontal). I since learned that it was due to cache misses.
Indeed, the image is stored row by row, so vertically-neighbouring
pixels are actually far away in memory, which entails all kind of
speed problems.

I was suggested to try a block processing approach, [..]

This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for
horizontal pass. For an image structured in blocks, an horizontal pass
is 0.030 sec. whereas with the original memory structure (row after
row, lets call it linear), the horizontal pass was only 0.009, 3 times
faster! This cannot be due to cache misses, as locality is pretty good
in both cases. The code to access a pixel is more costly in the case
of block processing. Compare the following, which is copied from my
implementation:

// linear access to pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * width) +
(col))

// block access to pixel
#define PXL(img, width, col, row) \
((img) + (((row) & (~BLOCK_Y_MASK)) * (width)) + (((col) >>
BLOCK_X_SHIFT) * BLOCK_SIZE) \
+ (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) + ((col) &
BLOCK_X_MASK))

with enum
{
BLOCK_X_SHIFT = 5,
BLOCK_Y_SHIFT = 5,
BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};

Accessing a pixel is thus obviously much slower, and I was not able to
speed things up despite all my efforst. Does someone have any ideas or
suggestions that may help, or know of any good references discussing
the problem? Thanks in advance

What strikes me as wrong here is that you do so much more arithmetic
in PXL than in PIXEL. It should be pretty much the same.

And why are you using macros? You should be writing block as an object
(just like an image), and process it separately. Each block should have
a boundary of the pixels of its surrounding blocks (for contiguousness)
and processing should be done within one block, then move to the next
block.

I think you should consult with comp.graphics.algorithms.

V
 
J

Jacek Dziedzic

Hi all,

I am writing an image processing algorithm that goes across an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixels, the output is computed as the sum of
the input and its previous neighbourg. In the horizontal pass (i.e.
row-wise), the previous neighbourg is defined as the pixel on the
left. This is followed by a vertical pass, whereby the previous
neighbourg is defined as the pixel on top.

I quickly encountered a problem with the column-wise path, as
processing pixel in a vertical fashion was _very_ slow (0.31 vs. 0.009
for the horizontal). I since learned that it was due to cache misses.
Indeed, the image is stored row by row, so vertically-neighbouring
pixels are actually far away in memory, which entails all kind of
speed problems.

I was suggested to try a block processing approach, whereby the image
is stored by blocks rather than by rows. A block is typically 32x32,
so in memory, the first line of the block is followed by the second
line of the block, followed by the third, .... until we move on to the
next block, and the pattern is repeated. This has the advantage of
ensuring locality, i.e. both horizontal and vertical neighboring
pixels will be close, hence there should be less cache misses. Both
horizontal and vertical passes should be approximately as fast.

This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for
horizontal pass. For an image structured in blocks, an horizontal pass
is 0.030 sec. whereas with the original memory structure (row after
row, lets call it linear), the horizontal pass was only 0.009, 3 times
faster! This cannot be due to cache misses, as locality is pretty good
in both cases. The code to access a pixel is more costly in the case
of block processing. Compare the following, which is copied from my
implementation:

// linear access to pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * width) +
(col))

// block access to pixel
#define PXL(img, width, col, row) \
((img) + (((row) & (~BLOCK_Y_MASK)) * (width)) + (((col) >>
BLOCK_X_SHIFT) * BLOCK_SIZE) \
+ (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) + ((col) &
BLOCK_X_MASK))

with enum
{
BLOCK_X_SHIFT = 5,
BLOCK_Y_SHIFT = 5,
BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};

Accessing a pixel is thus obviously much slower, and I was not able to
speed things up despite all my efforst. Does someone have any ideas or
suggestions that may help, or know of any good references discussing
the problem? Thanks in advance

<OT>
Try with a different block size than 32. Sometimes the cache
suffers penalties when the data is aligned on certain boundaries
that are in tune with the cache's hashing method. Perhaps, by
chance, 32 is this magic value? You could try with blocks of
different size.

Other than that, perhaps it would be beneficial to store a
transpose of your original pixel map in a copy? That way you
would only have to go column-wise once (when creating this
transpose), and then work row-wise all the time later on.
</OT>

All this is, however, beyond the scope of this ng.

HTH,
- J.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top