vectorizor

Hi all,

I am writing an image processing algorithm that goes over an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixel, the output is computed as the sum of
the input and its previous neighbour. In the horizontal pass (i.e.
row-wise), the previous neighbour is the pixel on the left. This is
followed by a vertical pass, in which the previous neighbour is the
pixel on top.
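
In rough code, the two passes of this toy version would look something
like the following. This is only the illustrative operation, not my
actual algorithm, and I am assuming float pixels stored row by row with
separate input and output buffers:

// Toy version of the two passes (illustration only, not my real code);
// float pixels, row-major storage, separate input and output buffers.
void horizontal_pass(float *out, const float *in, int width, int height)
{
    for (int row = 0; row < height; row++) {
        out[row * width] = in[row * width];  // first pixel has no left neighbour
        for (int col = 1; col < width; col++)
            out[row * width + col] = in[row * width + col]
                                   + in[row * width + col - 1];
    }
}

void vertical_pass(float *out, const float *in, int width, int height)
{
    // scanning down each column of the image
    for (int col = 0; col < width; col++) {
        out[col] = in[col];  // first row has no neighbour on top
        for (int row = 1; row < height; row++)
            out[row * width + col] = in[row * width + col]
                                   + in[(row - 1) * width + col];
    }
}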

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.

It was suggested that I try a block-based approach, in which the image
is stored by blocks rather than by rows. A block is typically 32x32,
so in memory the first line of a block is followed by its second line,
then its third, and so on, until we move on to the next block and the
pattern repeats. This has the advantage of ensuring locality: both
horizontally and vertically neighbouring pixels stay close in memory,
so there should be fewer cache misses, and the horizontal and vertical
passes should be approximately equally fast.

This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for the
horizontal pass. For an image structured in blocks, a horizontal pass
takes 0.030 s, whereas with the original memory layout (row after row,
let's call it linear) the horizontal pass took only 0.009 s, three
times faster! This cannot be due to cache misses, as locality is good
in both cases. Rather, the code to access a pixel is more costly in
the block-based case. Compare the following, copied from my
implementation:

// linear access to pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * (width)) + (col))

// block access to pixel
#define PXL(img, width, col, row) \
    ((img) + (((row) & (~BLOCK_Y_MASK)) * (width)) \
           + (((col) >> BLOCK_X_SHIFT) * BLOCK_SIZE) \
           + (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) \
           + ((col) & BLOCK_X_MASK))

with

enum
{
    BLOCK_X_SHIFT = 5,
    BLOCK_Y_SHIFT = 5,
    BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
    BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
    BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
    BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
    BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};
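
For reference, here is roughly how the horizontal pass looks with each
layout, using the macros above (a simplified sketch of my loops, not my
exact code; I assume float pixels, an in-place update, and a width that
is a multiple of BLOCK_WIDTH as the blocked layout requires):

// Simplified sketch, not my exact code: in-place horizontal pass,
// going right to left so the left neighbour read is still the
// original value.
void hpass_linear(float *img, int width, int height)
{
    for (int row = 0; row < height; row++)
        for (int col = width - 1; col > 0; col--)
            *PIXEL(img, width, row, col) += *PIXEL(img, width, row, col - 1);
}

void hpass_blocked(float *img, int width, int height)
{
    for (int row = 0; row < height; row++)
        for (int col = width - 1; col > 0; col--)
            *PXL(img, width, col, row) += *PXL(img, width, col - 1, row);
}

With PIXEL the address is one multiply and two additions per access;
with PXL every access also pays for the masks and shifts, which is
where the extra time appears to go.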

Accessing a pixel is thus obviously much slower, and I have not been
able to speed things up despite all my efforts. Does anyone have any
ideas or suggestions that may help, or know of any good references
discussing this problem? Thanks in advance.

Angus
 
Eric Sosman

vectorizor said:

Hi all,

I am writing an image processing algorithm that goes over an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixel, the output is computed as the sum of
the input and its previous neighbour. In the horizontal pass (i.e.
row-wise), the previous neighbour is the pixel on the left. This is
followed by a vertical pass, in which the previous neighbour is the
pixel on top.

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.
[...]

This isn't really a C question -- most questions about
speed are not C questions -- but I'll risk the rath of the
wrighteous and try a partial answer anyhow.

The biggest payoff is likely to come from combining the
two passes into one, perhaps by interleaving. For example,
you might make a horizontal pass on rows 0 and 1 and then
the first step of a vertical pass (just rows 0 and 1), then
horizontal on row 2 and a vertical step for rows 1 and 2,
and so on. This may increase the amount of work you are
able to do per cache fill.
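
In rough code the combined pass might look like this (untested; I am
assuming float pixels, an in-place update, and the illustrative "add
the previous neighbour" operation from your post, and the names are
mine):

#include <stdlib.h>

/* Untested sketch: one sweep over the image, doing the horizontal step
 * on each row and then immediately the vertical step against the row
 * above.  prev_h holds the *horizontal* result of the previous row,
 * which is what the vertical step needs. */
static void combined_pass(float *img, int width, int height)
{
    float *prev_h = malloc((size_t)width * sizeof *prev_h);
    if (prev_h == NULL)
        return;

    for (int r = 0; r < height; r++) {
        float *row = img + (size_t)r * width;

        /* horizontal step, right to left so the left neighbour read
         * is still the original input value */
        for (int c = width - 1; c > 0; c--)
            row[c] += row[c - 1];

        /* vertical step, saving this row's horizontal result for the
         * row below before it is overwritten */
        for (int c = 0; c < width; c++) {
            float h = row[c];
            if (r > 0)
                row[c] = h + prev_h[c];
            prev_h[c] = h;
        }
    }
    free(prev_h);
}

The idea is that each row is brought into the cache once and both
steps are done while it is still resident.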

You might also get a benefit from "padding" the image
so the rows are not separated by exact multiples of the cache
line size. A 640x480 image, for example, has a row length
that is likely to be a multiple of the cache line size, and
this could lead to cache thrashing. Embedding the 640x480
area in a 642x480 area (and ignoring the extra two columns)
might gain speed.
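
Concretely, something like this (just a sketch; the sizes are the
640x480 example above, and I am assuming float pixels):

#include <stdlib.h>

/* Sketch only: allocate the image with a row stride slightly wider
 * than the visible width, so successive rows are not an exact multiple
 * of the cache line size apart.  Pixel (row, col) then lives at
 * img[row * STRIDE + col]; the PAD extra columns are simply ignored. */
enum { WIDTH = 640, HEIGHT = 480, PAD = 2, STRIDE = WIDTH + PAD };

float *alloc_padded_image(void)
{
    return malloc((size_t)STRIDE * HEIGHT * sizeof(float));
}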

For further ideas, see if you can find a newsgroup with
"graphics" or "image" in its name.
 
Malcolm McLean

vectorizor said:
[image processing]

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.

It was suggested that I try a block-based approach, in which the image
is stored by blocks rather than by rows.

Firstly, it is surprising that scanning the image is the bottleneck in
your program. You ought to be able to scan a large area of arbitrary
memory very quickly indeed, even with sub-optimal cache usage.

You are largely wasting your time trying to speed things up by
reorganising images into blocks. Yes, you can improve read times, but
at the cost of extra processing to manage the blocks, and you add
extra complexity to the code. Finally, you probably have to reformat
anyway for the next stage of processing. At the end you have a complex
and bug-ridden program that runs only slightly faster, if at all.

One thing that is not a waste of time is to try padding the rows,
either to an exact power of two, so the multiply can become a shift,
or to a size that is not an exact power of two, to help the cache
along. Then profile to see which is faster.
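
For example (names illustrative only):

/* Illustrative only: if each row is padded out to 1024 pixels, the row
 * offset becomes a shift instead of a multiply.  Whether the
 * power-of-two stride helps or hurts the cache is exactly what you
 * should profile. */
enum { ROW_SHIFT = 10, ROW_STRIDE = 1 << ROW_SHIFT };

#define PIXEL_POW2(img, row, col) ((img) + ((row) << ROW_SHIFT) + (col))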

Another thing you can try is

int block[32][32];

getblock(img, block, x, y);
processblock(block);
writeblock(img, block, 32, 32);

i.e. storing the portion of the image you are working on in a local
array, processing it there, and writing the results back. This will
only pay off if processblock() does a reasonable amount of work per
block.
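
A getblock() along these lines would do (untested sketch; it assumes
int pixels as in the snippet above and that the 32x32 window lies
entirely inside the image):

/* Untested sketch: copy the 32x32 window whose top-left corner is
 * (x, y) out of a row-major image of the given width into the local
 * block.  writeblock() would do the same copy in the other direction. */
void getblock(const int *img, int width, int block[32][32], int x, int y)
{
    for (int r = 0; r < 32; r++)
        for (int c = 0; c < 32; c++)
            block[r][c] = img[(y + r) * width + (x + c)];
}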
 
CBFalconer

Eric said:
.... snip ...

This isn't really a C question -- most questions about
speed are not C questions -- but I'll risk the rath of the
wrighteous and try a partial answer anyhow.

rath? I am roth. :)

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 
vectorizor

Eric Sosman said:

The biggest payoff is likely to come from combining the
two passes into one, perhaps by interleaving. For example,
you might make a horizontal pass on rows 0 and 1 and then
the first step of a vertical pass (just rows 0 and 1), then
horizontal on row 2 and a vertical step for rows 1 and 2,
and so on. This may increase the amount of work you are
able to do per cache fill.

That would require an extra buffer to store the result of the vertical
pass, and I can't do that.

Eric Sosman said:
You might also get a benefit from "padding" the image
so the rows are not separated by exact multiples of the cache
line size. A 640x480 image, for example, has a row length
that is likely to be a multiple of the cache line size, and
this could lead to cache thrashing. Embedding the 640x480
area in a 642x480 area (and ignoring the extra two columns)
might gain speed.

I do not understand; I thought having the row length be a multiple of
the cache line size was a good thing?
 
Eric Sosman

vectorizor said:
That would require an extra buffer to store the result of the vertical
pass, and I can't do that.

You need space for *one* image row. The idea is to do
the vertical "pass" as a series of row-by-row steps that
interleave between the horizontal passes.
I do not understand; I thought having the row length be a multiple of
the cache line size was a good thing?

Might be, might not. My thought was that to calculate the
end value for a specific pixel you need its original value and
those of the three pixels just above it, just to its left, and
diagonally up and left. If the row length is a multiple of the
cache line size, the two vertically-adjacent pixels might compete
for the same cache location and thus cause thrashing. If the
length is one greater than a line size multiple, the up-and-left
pixel might compete with the central pixel. Hence my suggestion
of two padding slots.

But I may be talking through my hat, making suppositions
about "caches" and "line sizes" and "memory latencies" and
"speed" and other such matters that are not part of the C
language. I repeat my earlier suggestion (which I now think
should probably have been the entirety of my response): see if you can
find a newsgroup with "graphics" or "image" in its name.
 
