Row-wise vs. column-wise image processing

Discussion in 'C Programming' started by Enrique Cruiz, Jan 25, 2007.

  1. Hello all,

    I am currently implementing a fairly simple algorithm. It scans a
    grayscale image, and computes a pixel's new value as a function of its
    original value. Two passes are made, first horizontally and second
    vertically. The problem I have is that the vertical pass is 3 to 4
    times slower than the horizontal, although the code is _exactly_ the
    same in both cases?!

    The code itself is very simple. There are two loops to scan through rows
    and columns (horizontal pass), or columns and rows (vertical pass),
    depending on the pass. The processing part is simple and is contained
    in a single line. 'pixel' is a pointer to the current pixel. The
    current pixel is first updated, and the pointer is then incremented to
    the next pixel, which is either in the next column or in the next row
    depending on the pass. I should add that the image is stored as a
    large 1-D vector, i.e. the values of each row are contiguous in
    memory. Here is a simplified version of the code:

    ####################
    // HORIZONTAL
    // loop for every row
    ....
    // then for every column
    for (col = firstCol + 1; col <= lastCol - 1; ++col)
    {
        *pixel = (*pixel) * 2 / 3;
        pixel++;
    }

    // VERTICAL
    // loop for every column
    ....
    // then for every row
    for (row = firstRow + 1; row <= lastRow - 1; ++row)
    {
        *pixel = (*pixel) * 2 / 3;
        pixel += imgWidth;
    }
    ##################

    For this small amount of code, the timings are as follows:
    - horizontal = 0.035 sec.
    - vertical =   0.135 sec.

    Now if we simply remove the line updating the pixel pointer in each
    case (i.e. "pixel++;" and "pixel += imgWidth;"), the timings become
    equal, at 0.081 sec. This single instruction is responsible for the
    massive loss of time, and I have no idea why.

    My only guess relates to memory management issues. Since the image is
    stored row-wise, the current and next values are physically next to
    each other in memory during the horizontal pass. For the vertical
    pass, on the other hand, the next value is stored in the next row, so
    the distance between them becomes 'imgWidth'. My guess is that the
    next pixel value in such a case is not close enough to be stored in
    the processor cache or a register, so the processor has to fetch it
    from memory, hence the massive loss in speed. This is, however, just
    a guess.

    I would really appreciate it if anybody could enlighten me on this topic.

    Thanks in advance,


    Enrique
    Enrique Cruiz, Jan 25, 2007
    #1

  2. Enrique Cruiz

    santosh Guest

    Enrique Cruiz wrote:
    > Hello all,
    >
    > I am currently implementing a fairly simple algorithm. It scans a
    > grayscale image, and computes a pixel's new value as a function of its
    > original value. Two passes are made, first horizontally and second
    > vertically. The problem I have is that the vertical pass is 3 to 4
    > times slower than the horizontal, although the code is _exactly_ the
    > same in both cases?!

    <snip>

    > My only guess relates to memory management issues. Since the image is
    > stored row-wise, the current and next values are physically next to
    > each other in memory during the horizontal pass. For the vertical
    > pass, on the other hand, the next value is stored in the next row, so
    > the distance between them becomes 'imgWidth'. My guess is that the
    > next pixel value in such a case is not close enough to be stored in
    > the processor cache or a register, so the processor has to fetch it
    > from memory, hence the massive loss in speed. This is, however, just
    > a guess.

    Without a complete, minimal, compilable example, our guess is as good
    as yours, and yours is probably a good guess. Use the 'time' program
    to observe the time spent in your code versus the time spent in system
    code. That will give you an indication of whether your guess is right.

    In any case, the C standard says nothing about the efficiency of the
    generated code. If the slowdown is too much for your application, then
    you'll have to profile your code to find out where it is stalling and
    look at what you can do about it.
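
    For what it's worth, here is a minimal sketch of timing one pass with
    the standard clock() function. The buffer and dimension names are
    placeholders, not your actual variables:

    #include <stdio.h>
    #include <time.h>

    /* Times one horizontal pass over a width*height grayscale buffer. */
    void time_horizontal_pass(unsigned char *img, int imgWidth, int imgHeight)
    {
        int row, col;
        clock_t start, end;

        start = clock();
        for (row = 0; row < imgHeight; ++row) {
            unsigned char *pixel = img + (size_t)row * imgWidth;
            for (col = 0; col < imgWidth; ++col) {
                *pixel = *pixel * 2 / 3;
                pixel++;
            }
        }
        end = clock();

        printf("horizontal pass: %.3f sec\n",
               (double)(end - start) / CLOCKS_PER_SEC);
    }

    Timing each pass this way separates the cost of your own loops from
    whatever else the program does around them.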
    santosh, Jan 25, 2007
    #2

  3. Enrique Cruiz

    Guest

    On 25 Jan, 11:00, Enrique Cruiz <> wrote:

    > I am currently implementing a fairly simple algorithm. It scans a
    > grayscale image, and computes a pixel's new value as a function of its
    > original value. Two passes are made, first horizontally and second
    > vertically. The problem I have is that the vertical pass is 3 to 4
    > times slower than the horizontal, although the code is _exactly_ the
    > same in both cases?!

    [Snip]

    This is not really a question about the C language. A general forum
    such as comp.programming may be more appropriate than a C language
    forum.

    > My only guess relates to memory management issues.


    <Off-topic>
    You are almost certainly right, and there's probably not a lot you can
    do to change it with this processing style.
    In your horizontal pass you can process a whole cacheful of data at a
    time before pulling in the next one. In the worst case, your vertical
    pass could take a cache miss for every pixel accessed.
    </Off-topic>
    , Jan 25, 2007
    #3
  4. Enrique Cruiz wrote:

    > Now if we simply remove in each case the line updating the pixel
    > pointer (i.e. "pixel++;" and "pixel+=imgWidth;"), the timings then
    > becomes equal at 0.081. This simple instruction is responsible for the
    > massive loss of time. And I have no idea why.


    I'm pretty sure it's because the processor for which you are compiling
    has an INC-command which increments a given number quite fast. Your
    "pixel++" is compiled into such an INC, whereas "pixel+=imgWidth" can
    only be translated into an ADD-command which takes much longer.

    I don't think you can do much C-specific here. Maybe, if imgWidth is a
    power of two, there is a chance of using some strange &s and |s, but I
    wonder...

    Regards
    Steffen
    Steffen Buehler, Jan 25, 2007
    #4
  5. Enrique Cruiz

    santosh Guest

    Steffen Buehler wrote:
    > Enrique Cruiz wrote:
    >
    > > Now if we simply remove in each case the line updating the pixel
    > > pointer (i.e. "pixel++;" and "pixel+=imgWidth;"), the timings then
    > > becomes equal at 0.081. This simple instruction is responsible for the
    > > massive loss of time. And I have no idea why.

    >
    > I'm pretty sure it's because the processor for which you are compiling
    > has an INC-command which increments a given number quite fast. Your
    > "pixel++" is compiled into such an INC, whereas "pixel+=imgWidth" can
    > only be translated into an ADD-command which takes much longer.

    <snip>

    <OT>
    With regard to modern x86s, as far as I can tell, the ADD instruction
    is faster than the INC instruction. The OP didn't specify the size of
    the image being processed or the cache sizes of the CPU, so one can't
    tell whether the slowdown is due to cache misses.
    </OT>
    santosh, Jan 25, 2007
    #5
  6. On 2007-01-25 11:55:10 +0000, "santosh" <> said:
    >
    > <OT>
    > With regard to modern x86s, as far as I can tell, the ADD instruction
    > is faster than the INC instruction. The OP didn't specify the size of
    > the image being processed, and the cache sizes of the CPU, so one can't
    > tell if the slowdown is due to cache misses.
    > </OT>


    It's a 10-megapixel image, 3650x2730.

    I have no idea how to find out the sizes of the CPU caches, though!

    Enrique
    Enrique Cruiz, Jan 25, 2007
    #6
  7. In article <> writes:
    > On 25 Jan, 11:00, Enrique Cruiz <> wrote:
    > > I am currently implementing a fairly simple algorithm. It scans a
    > > grayscale image, and computes a pixel's new value as a function of its
    > > original value. Two passes are made, first horizontally and second
    > > vertically. The problem I have is that the vertical pass is 3 to 4
    > > times slower than the horizontal, although the code is _exactly_ the
    > > same in both cases?!

    ....
    > You are almost certainly right, and there's probably not a lot you can
    > do to change it with this processing style.
    > In your horizontal pass you can process a cache's worth of data at a
    > time before pulling the next cachefull. In the worst case, your
    > vertical pass could have a cache miss for each pixel accessed.


    Indeed. Cache misses can give a huge loss of cycles.
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
    Dik T. Winter, Jan 25, 2007
    #7
  8. Enrique Cruiz

    CBFalconer Guest

    Enrique Cruiz wrote:
    >
    > I am currently implementing a fairly simple algorithm. It scans a
    > grayscale image, and computes a pixel's new value as a function of its
    > original value. Two passes are made, first horizontally and second
    > vertically. The problem I have is that the vertical pass is 3 to 4
    > times slower than the horizontal, although the code is _exactly_ the
    > same in both cases?!
    >
    <snip>
    >
    > ####################
    > // HORIZONTAL
    > // loop for every row
    > ...
    > // then for every column
    > for(col=firstCol+1 ; col<=lastCol-1 ; ++col)
    > {
    > *pixel = (*pixel) * 2 / 3;
    > pixel++;
    > }
    >
    > // VERTICAL
    > // loop for every column
    > ...
    > // then for every row
    > for(row=firstRow+1 ; row<=lastRow-1 ; ++row)
    > {
    > *pixel = (*pixel) * 2 / 3;
    > pixel+=imgWidth;
    > }
    > ##################
    >
    <snip>
    >
    > My only guess relates to memory management issues. Since the image is
    > stored row-wise, the current and next values are physically next in
    > memory during the horizontal pass. On the other hand, for the vertical
    > pass, the next value is stored in the next row, and the distance
    > between them becomes 'image_width'. My guess is that the next pixel
    > value in such a case is not close enough to be stored in the processor
    > cache or register. The processor has to fetch it from memory, hence the
    > massive loss in speed. This is however just a guess.


    Your guess is almost certainly correct. But, for the code you
    show, there is no reason to even have the second scan. The
    operation could simply be *pixel *= 4.0 / 9;

    Assuming the operations are really more complex, and depend on
    adjacent pixels, you could maintain cache locality by keeping
    things local. There is no need for the two sweeps.

    /* init pixel pointer */
    for (row = firstRow + 1; row < lastRow; ++row) {
        /* adjust pixel pointer, maybe pixel += 2 */
        for (col = firstCol + 1; col < lastCol; ++col, ++pixel) {
            /* look through adjacencies */
        }
    }

    --
    <http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>

    "A man who is right every time is not likely to do very much."
    -- Francis Crick, co-discoverer of DNA
    "There is nothing more amazing than stupidity in action."
    -- Thomas Matthews
    CBFalconer, Jan 25, 2007
    #8
  9. Enrique Cruiz

    santosh Guest

    Enrique Cruiz wrote:
    > On 2007-01-25 11:55:10 +0000, "santosh" <> said:
    > >
    > > <OT>
    > > With regard to modern x86s, as far as I can tell, the ADD instruction
    > > is faster than the INC instruction. The OP didn't specify the size of
    > > the image being processed, and the cache sizes of the CPU, so one can't
    > > tell if the slowdown is due to cache misses.
    > > </OT>

    >
    > 10 MB pixels image, 3650x2730
    >
    > I have no idea how to find out about the sizes of the CPU cache?!


    Under Linux 'dmesg | grep cpu' should do the trick. Under Windows,
    msinfo32.exe should show the sizes.

    Anyway, a 10 MB image will definitely cause plenty of cache misses
    when you read it non-sequentially. I don't think there's much you can
    do about it. Generally the delay is acceptable.
    santosh, Jan 25, 2007
    #9
  10. On Jan 25, 6:00 am, Enrique Cruiz <> wrote:
    > Hello all,
    >
    > I am currently implementing a fairly simple algorithm. It scans a
    > grayscale image, and computes a pixel's new value as a function of its
    > original value. Two passes are made, first horizontally and second
    > vertically. The problem I have is that the vertical pass is 3 to 4
    > times slower than the horizontal, although the code is _exactly_ the
    > same in both cases?!

    <snip>

    > My only guess relates to memory management issues. Since the image is
    > stored row-wise, the current and next values are physically next to
    > each other in memory during the horizontal pass. For the vertical
    > pass, on the other hand, the next value is stored in the next row, so
    > the distance between them becomes 'imgWidth'. My guess is that the
    > next pixel value in such a case is not close enough to be stored in
    > the processor cache or a register, so the processor has to fetch it
    > from memory, hence the massive loss in speed. This is, however, just
    > a guess.


    This is quite standard, and your guess is quite correct. The problem
    is that when you process by rows the elements are close together in
    memory, and when you process by columns the elements are far apart.
    This has little to do with C (if you use Fortran, which stores arrays
    column-major, you will probably find that processing by columns is
    faster than processing by rows).

    You have several options.

    Live with things as they are (a factor-of-three slowdown is not that
    severe; suppose your matrix were so large that a column operation
    caused a disk read for every element; can you say "several orders of
    magnitude"?).

    Change to an algorithm that can be done with
    only row processing (not always possible).

    Do the row processing, transpose the matrix,
    do the column processing, then transpose the matrix
    again. The two transposes may take less time
    than the processing time difference (a sketch of
    a simple transpose follows at the end of this post).

    Store the image in "tiles", so that nearby
    pixels in any direction are usually close
    in memory. This is the most general solution
    and also the hardest. Note that the
    overhead of the tiling scheme will probably
    make row processing slower. Hopefully, you
    will more than make up for this in column processing.

    Process in a different order. Process a few rows
    and then do the needed column processing. Process
    a pixel at a time, doing the needed row and column operations.
    The change in memory access pattern may help.
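
    For the transpose option, a rough sketch of a simple transpose
    (illustrative names; 'src' and 'dst' are assumed to be separate
    width*height buffers):

    #include <stddef.h>

    /* Copy 'src' into 'dst' with rows and columns swapped, so that the
       former columns become contiguous rows in 'dst'. */
    void transpose(const unsigned char *src, unsigned char *dst,
                   size_t width, size_t height)
    {
        size_t row, col;

        for (row = 0; row < height; ++row)
            for (col = 0; col < width; ++col)
                dst[col * height + row] = src[row * width + col];
    }

    After the transpose the vertical pass becomes a sequential sweep;
    whether the two extra copies pay for themselves is something only
    measurement on your actual image sizes can tell.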

    -William Hughes
    William Hughes, Jan 25, 2007
    #10
  11. On Jan 25, 11:00 am, Enrique Cruiz <> wrote:

    > I am currently implementing a fairly simple algorithm. It scans a
    > grayscale image, and computes a pixel's new value as a function of its
    > original value. Two passes are made, first horizontally and second
    > vertically. The problem I have is that the vertical pass is 3 to 4
    > times slower than the horizontal, although the code is _exactly_ the
    > same in both cases?!


    RAM is very slow; that is why your computer has cache memory. And
    accessing RAM to read a single byte is very, very slow compared to
    transferring a whole bunch of data; that is why your computer will read
    a whole bunch of consecutive bytes from RAM into cache memory when it
    has to, typically 32, 64 or 128 bytes at a time.

    When you process data line by line, all of those 128 bytes get used,
    so this is very efficient. When you process data in columns, you
    access one pixel, but the computer reads 128 bytes from RAM. Then you
    access the next pixel, and again the computer reads 128 bytes from
    RAM. I guess you can see how this is inefficient.
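
    One way around this, if the per-pixel operation doesn't depend on
    neighbouring pixels, is to walk a strip of adjacent columns down the
    image so that every byte of each fetched cache line gets used. A
    sketch, with illustrative names and an assumed 64-byte line:

    #include <stddef.h>

    #define STRIP 64    /* roughly one cache line of 1-byte pixels */

    void vertical_pass_blocked(unsigned char *img, int imgWidth, int imgHeight)
    {
        int col0, row, col;

        for (col0 = 0; col0 < imgWidth; col0 += STRIP) {
            int colEnd = (col0 + STRIP < imgWidth) ? col0 + STRIP : imgWidth;
            for (row = 0; row < imgHeight; ++row) {
                unsigned char *pixel = img + (size_t)row * imgWidth + col0;
                for (col = col0; col < colEnd; ++col) {
                    *pixel = *pixel * 2 / 3;
                    pixel++;
                }
            }
        }
    }

    If the real vertical filter carries state down each column, the same
    idea still works: keep an array of STRIP running values, one per
    column in the strip.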

    There is also a strange effect where the speed depends on the distance
    from one line to the next. Run your program with different values of
    imgWidth and measure the time, and you will likely see that your
    computer doesn't like certain values of imgWidth (this has to do with
    cache associativity; Google or Wikipedia should help). Figure out
    which values it doesn't like and avoid them. Nice powers of two are
    usually a very bad idea for imgWidth.

    If you process the same data repeatedly, it is best to do that in
    chunks that fit into the cache. Usually your computer has two kinds of
    cache: L1 cache (very small, maybe 32 KB or less, and very, very
    fast), and L2 cache (maybe 1 MB or 2 MB if you're lucky, a bit
    slowish), and then there is RAM (tons of it, slow as hell). To find
    out how much cache your computer has, just measure how long the same
    operation takes depending on the data size. You should see a
    considerable drop in speed at two points, so don't exceed those sizes.
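
    A rough sketch of that measurement (the sizes, the repeat count and
    the assumed 64-byte line are arbitrary choices, not known properties
    of your machine):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t size, i;
        int rep;

        /* Sweep buffers from 16 KB to 64 MB, touching one byte per
           assumed 64-byte cache line; the time per access should jump
           roughly where the buffer stops fitting in L1, and again where
           it outgrows L2. */
        for (size = 16 * 1024; size <= 64 * 1024 * 1024; size *= 2) {
            unsigned char *buf = calloc(size, 1);
            volatile unsigned char sink = 0;
            clock_t start, end;
            double secs;

            if (buf == NULL)
                break;
            start = clock();
            for (rep = 0; rep < 16; ++rep)
                for (i = 0; i < size; i += 64)
                    sink += buf[i];
            end = clock();
            secs = (double)(end - start) / CLOCKS_PER_SEC;
            printf("%8lu KB: %.2f ns/access\n", (unsigned long)(size / 1024),
                   1e9 * secs / (16.0 * (size / 64)));
            free(buf);
        }
        return 0;
    }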
    christian.bau, Jan 26, 2007
    #11
