vectorizor

Hi all,

I am writing an image processing algorithm that goes over an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixel, the output is computed as the sum of
the input and its previous neighbour. In the horizontal pass (i.e.
row-wise), the previous neighbour is the pixel on the left. This is
followed by a vertical pass, in which the previous neighbour is the
pixel on top.
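
In rough code, the two passes of this toy version would look something
like the following. This is only the illustrative operation, not my
actual algorithm, and I am assuming float pixels stored row by row with
separate input and output buffers:

// Toy version of the two passes (illustration only, not my real code);
// float pixels, row-major storage, separate input and output buffers.
void horizontal_pass(float *out, const float *in, int width, int height)
{
    for (int row = 0; row < height; row++) {
        out[row * width] = in[row * width];  // first pixel has no left neighbour
        for (int col = 1; col < width; col++)
            out[row * width + col] = in[row * width + col]
                                   + in[row * width + col - 1];
    }
}

void vertical_pass(float *out, const float *in, int width, int height)
{
    // scanning down each column of the image
    for (int col = 0; col < width; col++) {
        out[col] = in[col];  // first row has no neighbour on top
        for (int row = 1; row < height; row++)
            out[row * width + col] = in[row * width + col]
                                   + in[(row - 1) * width + col];
    }
}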

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.

It was suggested that I try a block-based approach, in which the image
is stored by blocks rather than by rows. A block is typically 32x32,
so in memory the first line of a block is followed by its second line,
then its third, and so on, until we move on to the next block and the
pattern repeats. This has the advantage of ensuring locality: both
horizontally and vertically neighbouring pixels stay close in memory,
so there should be fewer cache misses, and the horizontal and vertical
passes should be approximately equally fast.

This is all well and good, but things are not that perfect in
practice. Accessing pixels is now significantly slower, even for the
horizontal pass. For an image structured in blocks, a horizontal pass
takes 0.030 s, whereas with the original memory layout (row after row,
let's call it linear) the horizontal pass took only 0.009 s, three
times faster! This cannot be due to cache misses, as locality is good
in both cases. Rather, the code to access a pixel is more costly in
the block-based case. Compare the following, copied from my
implementation:

// linear access to pixel
// img: pointer to start of array in memory
#define PIXEL(img, width, row, col) ((img) + ((row) * (width)) + (col))

// block access to pixel
#define PXL(img, width, col, row) \
    ((img) + (((row) & (~BLOCK_Y_MASK)) * (width)) \
           + (((col) >> BLOCK_X_SHIFT) * BLOCK_SIZE) \
           + (((row) & BLOCK_Y_MASK) << BLOCK_X_SHIFT) \
           + ((col) & BLOCK_X_MASK))

with

enum
{
    BLOCK_X_SHIFT = 5,
    BLOCK_Y_SHIFT = 5,
    BLOCK_X_MASK = (1 << BLOCK_X_SHIFT) - 1,
    BLOCK_Y_MASK = (1 << BLOCK_Y_SHIFT) - 1,
    BLOCK_WIDTH = (1 << BLOCK_X_SHIFT),
    BLOCK_HEIGHT = (1 << BLOCK_Y_SHIFT),
    BLOCK_SIZE = BLOCK_WIDTH * BLOCK_HEIGHT
};
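
For reference, here is roughly how the horizontal pass looks with each
layout, using the macros above (a simplified sketch of my loops, not my
exact code; I assume float pixels, an in-place update, and a width that
is a multiple of BLOCK_WIDTH as the blocked layout requires):

// Simplified sketch, not my exact code: in-place horizontal pass,
// going right to left so the left neighbour read is still the
// original value.
void hpass_linear(float *img, int width, int height)
{
    for (int row = 0; row < height; row++)
        for (int col = width - 1; col > 0; col--)
            *PIXEL(img, width, row, col) += *PIXEL(img, width, row, col - 1);
}

void hpass_blocked(float *img, int width, int height)
{
    for (int row = 0; row < height; row++)
        for (int col = width - 1; col > 0; col--)
            *PXL(img, width, col, row) += *PXL(img, width, col - 1, row);
}

With PIXEL the address is one multiply and two additions per access;
with PXL every access also pays for the masks and shifts, which is
where the extra time appears to go.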

Accessing a pixel is thus obviously much slower, and I have not been
able to speed things up despite all my efforts. Does anyone have any
ideas or suggestions that may help, or know of any good references
discussing this problem? Thanks in advance.

Angus
 
Eric Sosman

vectorizor said:

Hi all,

I am writing an image processing algorithm that goes over an image
first row-wise and then column-wise. For illustration purposes,
imagine that for every pixel, the output is computed as the sum of
the input and its previous neighbour. In the horizontal pass (i.e.
row-wise), the previous neighbour is the pixel on the left. This is
followed by a vertical pass, in which the previous neighbour is the
pixel on top.

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.
[...]

This isn't really a C question -- most questions about
speed are not C questions -- but I'll risk the rath of the
wrighteous and try a partial answer anyhow.

The biggest payoff is likely to come from combining the
two passes into one, perhaps by interleaving. For example,
you might make a horizontal pass on rows 0 and 1 and then
the first step of a vertical pass (just rows 0 and 1), then
horizontal on row 2 and a vertical step for rows 1 and 2,
and so on. This may increase the amount of work you are
able to do per cache fill.
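
In rough code the combined pass might look like this (untested; I am
assuming float pixels, an in-place update, and the illustrative "add
the previous neighbour" operation from your post, and the names are
mine):

#include <stdlib.h>

/* Untested sketch: one sweep over the image, doing the horizontal step
 * on each row and then immediately the vertical step against the row
 * above.  prev_h holds the *horizontal* result of the previous row,
 * which is what the vertical step needs. */
static void combined_pass(float *img, int width, int height)
{
    float *prev_h = malloc((size_t)width * sizeof *prev_h);
    if (prev_h == NULL)
        return;

    for (int r = 0; r < height; r++) {
        float *row = img + (size_t)r * width;

        /* horizontal step, right to left so the left neighbour read
         * is still the original input value */
        for (int c = width - 1; c > 0; c--)
            row[c] += row[c - 1];

        /* vertical step, saving this row's horizontal result for the
         * row below before it is overwritten */
        for (int c = 0; c < width; c++) {
            float h = row[c];
            if (r > 0)
                row[c] = h + prev_h[c];
            prev_h[c] = h;
        }
    }
    free(prev_h);
}

The idea is that each row is brought into the cache once and both
steps are done while it is still resident.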

You might also get a benefit from "padding" the image
so the rows are not separated by exact multiples of the cache
line size. A 640x480 image, for example, has a row length
that is likely to be a multiple of the cache line size, and
this could lead to cache thrashing. Embedding the 640x480
area in a 642x480 area (and ignoring the extra two columns)
might gain speed.
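
Concretely, something like this (just a sketch; the sizes are the
640x480 example above, and I am assuming float pixels):

#include <stdlib.h>

/* Sketch only: allocate the image with a row stride slightly wider
 * than the visible width, so successive rows are not an exact multiple
 * of the cache line size apart.  Pixel (row, col) then lives at
 * img[row * STRIDE + col]; the PAD extra columns are simply ignored. */
enum { WIDTH = 640, HEIGHT = 480, PAD = 2, STRIDE = WIDTH + PAD };

float *alloc_padded_image(void)
{
    return malloc((size_t)STRIDE * HEIGHT * sizeof(float));
}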

For further ideas, see if you can find a newsgroup with
"graphics" or "image" in its name.
 
Malcolm McLean

vectorizor said:
[image processing]

I quickly encountered a problem with the column-wise pass, as
processing pixels in a vertical fashion was _very_ slow (0.31 s vs.
0.009 s for the horizontal). I have since learned that it was due to
cache misses. Indeed, the image is stored row by row, so vertically
neighbouring pixels are actually far apart in memory, which entails
all kinds of speed problems.

It was suggested that I try a block-based approach, in which the image
is stored by blocks rather than by rows.

Firstly, it is surprising that scanning the image is the bottleneck in
your program. You ought to be able to scan a large area of arbitrary
memory very quickly indeed, even with sub-optimal cache usage.

You are largely wasting your time trying to speed things up by
reorganising images into blocks. Yes, you can improve read times, but
at the cost of extra processing to manage the blocks, and you add
extra complexity to the code. Finally, you probably have to reformat
anyway for the next stage of processing. At the end you have a complex
and bug-ridden program that runs only slightly faster, if at all.

One thing that is not a waste of time is to try padding the rows,
either to an exact power of two, so the multiply can become a shift,
or to a size that is not an exact power of two, to help the cache
along. Then profile to see which is faster.
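
For example (names illustrative only):

/* Illustrative only: if each row is padded out to 1024 pixels, the row
 * offset becomes a shift instead of a multiply.  Whether the
 * power-of-two stride helps or hurts the cache is exactly what you
 * should profile. */
enum { ROW_SHIFT = 10, ROW_STRIDE = 1 << ROW_SHIFT };

#define PIXEL_POW2(img, row, col) ((img) + ((row) << ROW_SHIFT) + (col))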

Another thing you can try is

int block[32][32];

getblock(img, block, x, y);
processblock(block);
writeblock(img, block, 32, 32);

i.e. storing the portion of the image you are working on in a local
array, processing it there, and writing the results back. This will
only pay off if processblock() does a reasonable amount of work per
block.
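
A getblock() along these lines would do (untested sketch; it assumes
int pixels as in the snippet above and that the 32x32 window lies
entirely inside the image):

/* Untested sketch: copy the 32x32 window whose top-left corner is
 * (x, y) out of a row-major image of the given width into the local
 * block.  writeblock() would do the same copy in the other direction. */
void getblock(const int *img, int width, int block[32][32], int x, int y)
{
    for (int r = 0; r < 32; r++)
        for (int c = 0; c < 32; c++)
            block[r][c] = img[(y + r) * width + (x + c)];
}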
 
CBFalconer

Eric said:
.... snip ...

This isn't really a C question -- most questions about
speed are not C questions -- but I'll risk the rath of the
wrighteous and try a partial answer anyhow.

rath? I am roth. :)

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>
<http://www.aaxnet.com/editor/edit043.html>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discoverer of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews
 
vectorizor

Eric Sosman said:

The biggest payoff is likely to come from combining the
two passes into one, perhaps by interleaving. For example,
you might make a horizontal pass on rows 0 and 1 and then
the first step of a vertical pass (just rows 0 and 1), then
horizontal on row 2 and a vertical step for rows 1 and 2,
and so on. This may increase the amount of work you are
able to do per cache fill.

That would require an extra buffer to store the result of the vertical
pass, and I can't do that.

Eric Sosman said:
You might also get a benefit from "padding" the image
so the rows are not separated by exact multiples of the cache
line size. A 640x480 image, for example, has a row length
that is likely to be a multiple of the cache line size, and
this could lead to cache thrashing. Embedding the 640x480
area in a 642x480 area (and ignoring the extra two columns)
might gain speed.

I do not understand; I thought having the row length be a multiple of
the cache line size was a good thing?
 
Eric Sosman

vectorizor said:
That would require an extra buffer to store the result of the vertical
pass, and I can't do that.

You need space for *one* image row. The idea is to do
the vertical "pass" as a series of row-by-row steps that
interleave between the horizontal passes.
I do not understand; I thought having the row length be a multiple of
the cache line size was a good thing?

Might be, might not. My thought was that to calculate the
end value for a specific pixel you need its original value and
those of the three pixels just above it, just to its left, and
diagonally up and left. If the row length is a multiple of the
cache line size, the two vertically-adjacent pixels might compete
for the same cache location and thus cause thrashing. If the
length is one greater than a line size multiple, the up-and-left
pixel might compete with the central pixel. Hence my suggestion
of two padding slots.

But I may be talking through my hat, making suppositions
about "caches" and "line sizes" and "memory latencies" and
"speed" and other such matters that are not part of the C
language. I repeat my earlier suggestion (which I now think
should probably have been the entirety of my response): see if you can
find a newsgroup with "graphics" or "image" in its name.
 
