Multibyte add & subtract

valinor

Hi guys,

(rather lengthy...)

I'm trying to speed up the time spent on a postfilter for video.
YUV 4:2:0 data, each pixel is 1 byte (0-255)

The basic idea is to filter one pixel on each side of an 8-pixel border.
The filter used is a variant of (1,1,-4,1,1).

In the example below I do a vertical filtering of line n and the
diff for pixel c1 is calculated as
diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
c2 as
diff(c2) = a2+b2+(c2<<2)+d2+e2
etc.

Pixel 1.2.3.4.
--------------
n-2 a1a2a3a4
n-1 b1b2b3b4
n c1c2c3c4
----- pixel border----
n+1 d1d2d3d4
n+2 e1e2e3e4

The current implementation reads the values of a1,b1,c1,d1,e1 one byte
at a time, does the calculation and writes back the filtered value for c1.
I.e. something close to the code below:
imdifftmp = *(ImageSrc_p-w2);
imdiff2 = *(ImageSrc_p-w2+1);
...
imdiff8 = *(ImageSrc_p-w2+7);
imdifftmp += *(ImageSrc_p-width);
imdiff2 += *(ImageSrc_p-width+1);
...
imdiff8 += *(ImageSrc_p-width+7);
imdifftmp -= (*(ImageSrc_p)) << 2;
imdiff2 -= (*(ImageSrc_p+1)) << 2;
...
imdiff8 -= (*(ImageSrc_p+7)) << 2;
imdifftmp += *(ImageSrc_p+width);
imdiff2 += *(ImageSrc_p+width+1);
...
imdiff8 += *(ImageSrc_p+width+7);
imdifftmp += *(ImageSrc_p+w2);
imdiff2 += *(ImageSrc_p+w2+1);
...
imdiff8 += *(ImageSrc_p+w2+7);
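
For reference, the same per-pixel computation can be sketched as a small loop, which is handy as a baseline when checking any packed-word version. This is only a sketch under assumptions: `filter_diff`, `filter_diff_row` and `stride` are hypothetical names, `stride` plays the role of `width` (with `w2` assumed to be `2*width`), and the sign of the centre term follows the code above (subtract), not formula (1).

```c
#include <stddef.h>
#include <stdint.h>

/* src points at pixel c on line n; stride is the number of bytes per image line
   (assumed equal to `width`, with w2 == 2*width). */
static int filter_diff(const uint8_t *src, ptrdiff_t stride)
{
    /* a + b - 4*c + d + e, computed in int so nothing wraps */
    return src[-2 * stride] + src[-stride] - (src[0] << 2)
         + src[stride] + src[2 * stride];
}

/* Compute the diff for `count` horizontally adjacent pixels on line n. */
static void filter_diff_row(const uint8_t *src, ptrdiff_t stride,
                            int *out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = filter_diff(src + i, stride);
}
```

Because the arithmetic is done in `int`, the result is the true signed value in the range -1020 to 1020, so it can serve as the expected answer for each byte lane of a packed computation.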

Not very efficient on a 32-bit machine! What I'm trying to achieve is
to read a 32-bit word containing 4 pixel values, do the calculation
on a whole word and write back a word. After some googling I found
the book "Hacker's Delight" by Henry S. Warren, Jr. He presents such
a method implemented by the two macros below:

//Multibyte Add of 4 1-byte integers packed into a word
#define MBA(x, y, s)\
do{\
s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
s = (((x)^(y))&0x80808080)^s; \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)

//Multibyte Subtract of 4 1-byte integers packed into a word
#define MBS(x, y, d)\
do{\
d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
d = ~((((x)^(y))|0x7f7f7f7f)^d); \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)

He also states that the operation below gives the carry into each
position
(where ¤ in this case denotes bitwise exclusive or (^)):
(x+y)¤x¤y
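
A small sketch of that carry expression in C, matching the `((x)+(y))^(x)^(y)` used in the debug printf of the macros. The function names `carry_in` and `lane_crossings` are hypothetical, chosen here for illustration only.

```c
#include <stdint.h>

/* Bit i of the result is set when the addition x + y carries into bit i. */
static uint32_t carry_in(uint32_t x, uint32_t y)
{
    return (x + y) ^ x ^ y;
}

/* Carries that crossed from byte lane 0, 1 or 2 into the next lane,
   i.e. the carries into bits 8, 16 and 24. */
static uint32_t lane_crossings(uint32_t x, uint32_t y)
{
    return carry_in(x, y) & 0x01010100u;
}
```

For example, `carry_in(0xFF, 0x01)` is `0x1FE`: the addition carries into bits 1 through 8. Note that a carry out of the top byte is lost in a 32-bit result, so this expression alone cannot recover it.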

These macros work great for small values! The problem is how to handle
the carry so that the correct values after the calculations in (1)
can be extracted. My question (finally!) is:
How can I (if it is possible) handle the carry to recreate the correct
signed integer value after the calculations above?

Some sample code below:

void main(void){
long a1 = 0xc7c8c9ca;
long b1 = 0xc8c9cacb;
long c1 = 0xddc8c9ca;
long d1 = 0xcacbcccd;
long e1 = 0xcbcccdce;

MBA(a1,b1,s1); //a+b
MBA(s1,d1,s2); //+d
MBA(s2,e1,s1); //+e
MBS(s1,c1,s2); //-c (is it possible to do the -(c<<2) part smarter?
MBS(s2,c1,s1); //-c
MBS(s1,c1,s2); //-c
MBS(s2,c1,s1); //-c

//Extract MSB Byte (B0) and add carry stuff...

printf("\nvalue after macros %08lX, value after calc %08lX\n", s1,
0xc7+0xc8-(0xdd<<2)+0xca+0xcb);
}

Gives:
carry 9F939794
carry 1F073F3A
carry B7B9BF9C
carry F8101000
carry BF81879C
carry FF313730
carry 3B818384
value after macros B0080808, value after calc FFFFFFB0
-- --------
^ ^
|----------------------------|
|
Same value for different methods

Cheers
//Fredrik
 
Thad Smith

I'm trying to speed up the time spent on a postfilter for video.
YUV 4:2:0 data, each pixel is 1 byte (0-255)

In the example below I do a vertical filtering of line n and the
diff for pixel c1 is calculated as
diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
...
Not very efficient on a 32-bit machine! What I'm trying to achieve is
to read a 32-bit word containing 4 pixel values, do the calculation
on a whole word and write back a word. After some googling I found
the book "Hacker's Delight" by Henry S. Warren, Jr. He presents such
a method implemented by the two macros below:

//Multibyte Add of 4 1-byte integers packed into a word
#define MBA(x, y, s)\
do{\
s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
s = (((x)^(y))&0x80808080)^s; \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)

//Multibyte Subtract of 4 1-byte integers packed into a word
#define MBS(x, y, d)\
do{\
d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
d = ~((((x)^(y))|0x7f7f7f7f)^d); \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\
}while(0)

Each of the 8-bit fields is added mod 2^8.
These macros work great for small values! The problem is how to handle
the carry so that the correct values after the calculations in (1)
can be extracted. My question (finally!) is:
How can I (if it is possible) handle the carry to recreate the correct
signed integer value after the calculations above?

Some sample code below:

void main(void){
long a1 = 0xc7c8c9ca;
long b1 = 0xc8c9cacb;
long c1 = 0xddc8c9ca;
long d1 = 0xcacbcccd;
long e1 = 0xcbcccdce;

MBA(a1,b1,s1); //a+b
MBA(s1,d1,s2); //+d
MBA(s2,e1,s1); //+e
MBS(s1,c1,s2); //-c (is it possible to do the -(c<<2) part smarter?
MBS(s2,c1,s1); //-c
MBS(s1,c1,s2); //-c
MBS(s2,c1,s1); //-c
...
value after macros B0080808, value after calc FFFFFFB0
-- --------
^ ^
|----------------------------|
|
Same value for different methods

As you note, the 8 lsbs are correct. If you can guarantee that the
difference in pixel value over points a - e is less than 32, you can
simply use the msb as the sign bit: the result is at most 4 times that
difference in magnitude, so it then fits in a signed byte. In your
example the msb of B0 = 1, so sign extend the bit.
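
The sign extension could be sketched as below. This assumes the true result of each lane fits in -128..127; `lane_as_signed` is a hypothetical helper name, and lane 0 is taken to be the least significant byte.

```c
#include <assert.h>
#include <stdint.h>

/* Return byte lane 0..3 (0 = least significant) of a packed word,
   reading the msb of the byte as a sign bit. */
static int lane_as_signed(uint32_t packed, int lane)
{
    assert(lane >= 0 && lane < 4);
    unsigned byte = (packed >> (8 * lane)) & 0xFFu;
    /* 128..255 stand for -128..-1 in two's complement */
    return byte >= 128u ? (int)byte - 256 : (int)byte;
}
```

With the value from the example, `lane_as_signed(0xB0080808, 3)` gives -80, which is the same number as the FFFFFFB0 printed by the direct calculation.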

If you make no assumptions about value range in the group, then the
range of computed value is -4*255 to 4*255. That requires 11 bits to
uniquely represent each value. You could represent each pixel as 11
bits, with the initial 3 msbs = 0. You could thus pack 2 pixels in a
32-bit word or 5 pixels in a 64-bit word. If you can guarantee a pixel
value difference of less than 128 in each 5 point group, you could get by
with 10 bits/pixel, packing 3 pixels per 32-bit word.

If you choose to use two 11 bit pixels in a 32-bit word, you might as
well pack 2 16-bit values per 32-bit word, which gives easier packing
and unpacking.
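
One way the two-values-per-word idea might look in C is sketched below. Everything named here (`even_lanes`, `odd_lanes`, `filter_lanes`, `lane_value`, the bias of 1024) is an assumption made for illustration, not something from the original code. The bias keeps every 16-bit lane non-negative so that no borrow crosses from one lane into the other; it is removed again on extraction.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_MASK 0x00FF00FFu /* low byte of each 16-bit lane */
#define LANE_BIAS 0x04000400u /* +1024 per lane, above the maximum 4*255 */

/* Split four packed bytes into two words holding two 16-bit lanes each. */
static uint32_t even_lanes(uint32_t w) { return w & LANE_MASK; }        /* bytes 0, 2 */
static uint32_t odd_lanes(uint32_t w)  { return (w >> 8) & LANE_MASK; } /* bytes 1, 3 */

/* bias + a + b + d + e - 4*c in each 16-bit lane, using plain word arithmetic.
   Each lane stays within 4..2044, so lanes never interfere. */
static uint32_t filter_lanes(uint32_t a, uint32_t b, uint32_t c,
                             uint32_t d, uint32_t e)
{
    return LANE_BIAS + a + b + d + e - (c << 2);
}

/* Signed result of lane 0 (low half) or lane 1 (high half). */
static int lane_value(uint32_t lanes, int lane)
{
    assert(lane == 0 || lane == 1);
    return (int)((lanes >> (16 * lane)) & 0xFFFFu) - 1024;
}
```

Each source word is processed twice, once through `even_lanes` and once through `odd_lanes`, so four pixels cost two passes of ordinary adds and one subtract, with the full -1020 to 1020 range preserved.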
 
Rod Pemberton

Valinor,

I've made some corrections. Don't let those get to you. There are some
useful non-correction related comments below.
I'm trying to speed up the time spent on a postfilter for video.
YUV 4:2:0 data, each pixel is 1 byte (0-255)

The basic idea is to filter one pixel on each side of an 8-pixel border.
The filter used is a variant of (1,1,-4,1,1).

In the example below I do a vertical filtering of line n and the
diff for pixel c1 is calculated as
diff(c1) = a1+b1+(c1<<2)+d1+e1 (1)
c2 as
diff(c2) = a2+b2+(c2<<2)+d2+e2
etc.

Pixel 1.2.3.4.
--------------
n-2 a1a2a3a4
n-1 b1b2b3b4
n c1c2c3c4
----- pixel border----
n+1 d1d2d3d4
n+2 e1e2e3e4
//Multibyte Add of 4 1-byte integers packed into a word
#define MBA(x, y, s)\
do{\
s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
s = (((x)^(y))&0x80808080)^s; \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\

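The backslash at the end of each // comment line splices the following line
into the comment, so the closing }while(0) is commented out. GCC warns about
this as a multi-line comment. Rewrite like so: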

/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */ \
}while(0)

//Multibyte Subtract of 4 1-byte integers packed into a word
#define MBS(x, y, d)\
do{\
d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
d = ~((((x)^(y))|0x7f7f7f7f)^d); \
// printf("\ncarry %08lX", ((x)+(y))^(x)^(y));\

The C++ comments create a multi-line comment according to GCC. Rewrite like
so:

/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */ \
}while(0)
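
As an alternative sketch, the two macros can be written as small functions, which sidesteps the comment and line-continuation problem entirely. The names `mb_add` and `mb_sub` are hypothetical, and the use of `uint32_t` instead of `long` is an assumption made so the word size is fixed at 32 bits.

```c
#include <stdint.h>

/* Add four bytes packed in a word, each lane mod 256. */
static uint32_t mb_add(uint32_t x, uint32_t y)
{
    /* add the low 7 bits of each lane; carries stop at bit 7 of the lane */
    uint32_t s = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    /* fold in the top bit of each lane without letting it carry out */
    return ((x ^ y) & 0x80808080u) ^ s;
}

/* Subtract four bytes packed in a word, each lane mod 256. */
static uint32_t mb_sub(uint32_t x, uint32_t y)
{
    /* the forced top bit of each lane absorbs any borrow from that lane */
    uint32_t d = (x | 0x80808080u) - (y & 0x7f7f7f7fu);
    /* restore the correct top bit of each lane */
    return ~(((x ^ y) | 0x7f7f7f7fu) ^ d);
}
```

For instance, `mb_sub(0x10203040, 0x01020304)` gives `0x0f1e2d3c`, and `mb_sub(0, 0x01010101)` gives `0xffffffff`, i.e. -1 in every lane when read as two's complement.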

He also states that the operation below gives the carry into each
position
(where ¤ in this case denotes bitwise exclusive or (^)):
(x+y)¤x¤y

These macros work great for small values! The problem is how to handle
the carry so that the correct values after the calculations in (1)
can be extracted. My question (finally!) is:
How can I (if it is possible) handle the carry to recreate the correct
signed integer value after the calculations above?

The MBS macro _appears_ (you'll need to confirm) to be calculating two's
complement correctly. This means that the values _should_ be correctly
signed when you extract each byte and cast them from an unsigned variable to
a signed one. This is because most compilers use two's complement for
negative integers.
Some sample code below:

#include <stdio.h>
#include <stdlib.h> /* for EXIT_SUCCESS */

int main(void){ /* corrected */
long a1 = 0xc7c8c9ca;
long b1 = 0xc8c9cacb;
long c1 = 0xddc8c9ca;
long d1 = 0xcacbcccd;
long e1 = 0xcbcccdce;

long s1,s2; /* missing */
MBA(a1,b1,s1); //a+b
MBA(s1,d1,s2); //+d
MBA(s2,e1,s1); //+e
MBS(s1,c1,s2); //-c (is it possible to do the -(c<<2) part smarter?
MBS(s2,c1,s1); //-c
MBS(s1,c1,s2); //-c
MBS(s2,c1,s1); //-c

//Extract MSB Byte (B0) and add carry stuff...

printf("\nvalue after macros %08lX, value after calc %08lX\n", s1,
0xc7+0xc8-(0xdd<<2)+0xca+0xcb);

return(EXIT_SUCCESS); /* corrected */
}

In (1) above, you _add_ (c1<<2), but here you _subtract_ (c1<<2). Did you
want MBS() or MBA()?
MBS(s1,c1,s2); //-c (is it possible to do the -(c<<2) part smarter?

Yes, replace the four lines that compute (c1<<2), with (if you wanted MBS,
otherwise change to MBA):

MBS(s1,((c1&0x3f3f3f3f)<<2),s2); /* -(c<<2) in one step; note the result is now in s2, not s1 */
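
To see why the mask is needed before the shift, here is a small sketch comparing the one-step packed shift with a byte-by-byte reference. `mb_shl2` and `mb_shl2_ref` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Multiply each packed byte by 4, mod 256. Clearing the top two bits of
   every byte first keeps them from spilling into the next byte up. */
static uint32_t mb_shl2(uint32_t c)
{
    return (c & 0x3f3f3f3fu) << 2;
}

/* Byte-by-byte reference: the same result computed one lane at a time. */
static uint32_t mb_shl2_ref(uint32_t c)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t byte = (c >> (8 * lane)) & 0xFFu;
        r |= ((byte << 2) & 0xFFu) << (8 * lane);
    }
    return r;
}
```

For the sample value, `mb_shl2(0xddc8c9ca)` is `0x74202428`. Subtracting that once per lane gives the same result mod 256 as subtracting c four times.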
Gives:
carry 9F939794
carry 1F073F3A
carry B7B9BF9C
carry F8101000
carry BF81879C
carry FF313730
carry 3B818384
value after macros B0080808, value after calc FFFFFFB0

Sorry, I didn't check these.


Rod Pemberton
 
