valinor
Hi guys,
(rather lengthy...)
I'm trying to reduce the time spent in a postfilter for video.
The data is YUV 4:2:0, and each pixel is 1 byte (0-255).
The basic idea is to filter one pixel on each side of an 8-pixel border.
The filter used is a variant of (1,1,-4,1,1).
In the example below I do a vertical filtering of line n, and the
diff for pixel c1 is calculated as
diff(c1) = a1+b1-(c1<<2)+d1+e1 (1)
c2 as
diff(c2) = a2+b2-(c2<<2)+d2+e2
etc.
Pixel  1  2  3  4
-----------------
n-2    a1 a2 a3 a4
n-1    b1 b2 b3 b4
n      c1 c2 c3 c4
----- pixel border -----
n+1    d1 d2 d3 d4
n+2    e1 e2 e3 e4
The current implementation reads the values of a1,b1,c1,d1,e1 one byte
at a time, does the calculation and writes back the filtered value for c1,
i.e. something close to the code below:
imdifftmp = *(ImageSrc_p-w2);
imdiff2 = *(ImageSrc_p-w2+1);
...
imdiff8 = *(ImageSrc_p-w2+7);
imdifftmp += *(ImageSrc_p-width);
imdiff2 += *(ImageSrc_p-width+1);
...
imdiff8 += *(ImageSrc_p-width+7);
imdifftmp -= (*(ImageSrc_p)) << 2;
imdiff2 -= (*(ImageSrc_p+1)) << 2;
...
imdiff8 -= (*(ImageSrc_p+7)) << 2;
imdifftmp += *(ImageSrc_p+width);
imdiff2 += *(ImageSrc_p+width+1);
...
imdiff8 += *(ImageSrc_p+width+7);
imdifftmp += *(ImageSrc_p+w2);
imdiff2 += *(ImageSrc_p+w2+1);
...
imdiff8 += *(ImageSrc_p+w2+7);
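For reference, the unrolled byte-at-a-time code above can be sketched as a loop. This is only an illustration: `filter_diff8` is a made-up name, and it assumes that `width` is the line stride in bytes and that `w2` is `2*width`, which the post does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code): computes
   a + b - 4*c + d + e for eight neighbouring pixels on line n.
   src points at the first of the eight pixels on line n and width is the
   line stride in bytes; w2 in the post is assumed to be 2*width. */
static void filter_diff8(const uint8_t *src, ptrdiff_t width, int diff[8])
{
    assert(src != NULL && width > 0);
    for (int i = 0; i < 8; i++) {
        diff[i] = src[i - 2 * width]    /* line n-2 */
                + src[i - width]        /* line n-1 */
                - (src[i] << 2)         /* line n, weight -4 */
                + src[i + width]        /* line n+1 */
                + src[i + 2 * width];   /* line n+2 */
    }
}
```

The caller has to make sure that two lines above and two lines below `src` are valid memory.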
Not very efficient on a 32-bit machine! What I'm trying to achieve is
to read a 32-bit word containing 4 pixel values, do the calculation
on a whole word and write back a word. After some googling I found
the book "Hacker's Delight" by Henry S. Warren, Jr. He presents such
a method, implemented by the two macros below:
//Multibyte Add of 4 1-byte integers packed into a word
#define MBA(x, y, s)\
do{\
s = ((x)&0x7f7f7f7f)+((y)&0x7f7f7f7f); \
s = (((x)^(y))&0x80808080)^s; \
/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */\
}while(0)
//Multibyte Subtract of 4 1-byte integers packed into a word
#define MBS(x, y, d)\
do{\
d = ((x)|0x80808080)-((y)&0x7f7f7f7f); \
d = ~((((x)^(y))|0x7f7f7f7f)^d); \
/* printf("\ncarry %08lX", ((x)+(y))^(x)^(y)); */\
}while(0)
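The same two operations can also be written as functions on `uint32_t`, which avoids depending on the width of `long`. This is a sketch only; `mb_add`, `mb_sub` and `mb_selftest` are names made up here, not taken from the book.

```c
#include <assert.h>
#include <stdint.h>

/* Adds four packed 8-bit lanes, each modulo 256, with no carry
   travelling from one lane into the next. */
static uint32_t mb_add(uint32_t x, uint32_t y)
{
    uint32_t s = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu); /* low 7 bits of each lane */
    return ((x ^ y) & 0x80808080u) ^ s;                 /* fold in the top bits */
}

/* Subtracts four packed 8-bit lanes, each modulo 256, with no borrow
   travelling from one lane into the next. */
static uint32_t mb_sub(uint32_t x, uint32_t y)
{
    uint32_t d = (x | 0x80808080u) - (y & 0x7f7f7f7fu);
    return ~(((x ^ y) | 0x7f7f7f7fu) ^ d);
}

/* Small usage check with hand-computed lane values. */
static void mb_selftest(void)
{
    /* c7+c8 = 18f, c8+c9 = 191, c9+ca = 193, ca+cb = 195: keep the low byte */
    assert(mb_add(0xc7c8c9cau, 0xc8c9cacbu) == 0x8f919395u);
    /* 01-02 = ff, 02-02 = 00, 03-02 = 01, 04-02 = 02 */
    assert(mb_sub(0x01020304u, 0x02020202u) == 0xff000102u);
}
```

Each lane wraps around on its own, so the result in a lane is the true sum or difference modulo 256.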
He also states that the expression below gives the carry into each bit
position (where ¤ in this case denotes bitwise exclusive or, ^):
(x+y)¤x¤y
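As a small illustration of that expression (it is the same one the commented-out printf in the macros evaluates), here is a sketch on 32-bit words. `carry_into` and `carry_into_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of the result is set when the plain 32-bit addition x + y
   carries INTO bit i. The carry out of bit 31 is not visible. */
static uint32_t carry_into(uint32_t x, uint32_t y)
{
    return (x + y) ^ x ^ y;
}

static void carry_into_demo(void)
{
    assert(carry_into(0x01u, 0x01u) == 0x02u);  /* 1 + 1 carries into bit 1 */
    assert(carry_into(0x01u, 0x02u) == 0x00u);  /* no carries at all */
    /* a1 and b1 from the sample code: matches the first "carry" line printed */
    assert(carry_into(0xc7c8c9cau, 0xc8c9cacbu) == 0x9f939794u);
}
```

Bits 8, 16 and 24 of the result are the carries that a plain addition would push across a byte boundary into the next pixel.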
These macros work great for small values! The problem is how to handle
the carry so that the correct values of the calculation in (1)
can be extracted. My question (finally!) is:
How can I (if it is possible) handle the carry to recreate the correct
signed integer value after the calculations above?
Some sample code below:
int main(void){
unsigned long s1, s2;
unsigned long a1 = 0xc7c8c9ca;
unsigned long b1 = 0xc8c9cacb;
unsigned long c1 = 0xddc8c9ca;
unsigned long d1 = 0xcacbcccd;
unsigned long e1 = 0xcbcccdce;
MBA(a1,b1,s1); //a+b
MBA(s1,d1,s2); //+d
MBA(s2,e1,s1); //+e
MBS(s1,c1,s2); //-c (is it possible to do the -(c<<2) part smarter?)
MBS(s2,c1,s1); //-c
MBS(s1,c1,s2); //-c
MBS(s2,c1,s1); //-c
//Extract MSB Byte (B0) and add carry stuff...
printf("\nvalue after macros %08lX, value after calc %08lX\n", s1,
(unsigned long)(0xc7+0xc8-(0xdd<<2)+0xca+0xcb));
return 0;
}
Gives:
carry 9F939794
carry 1F073F3A
carry B7B9BF9C
carry F8101000
carry BF81879C
carry FF313730
carry 3B818384
value after macros B0080808, value after calc FFFFFFB0
                   --                         --------
                    ^                            ^
                    |----------------------------|
                                  |
                  Same value for different methods
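The matching B0 is what one would expect if each 8-bit lane holds the result modulo 256: the scalar diff is -80, which is FFFFFFB0 as a 32-bit value, and its low byte is B0. The sketch below illustrates this, and also shows one possible way to fold the four subtractions of c into a single one. All names are made up for the illustration; `mb_add` and `mb_sub` are function versions of the MBA and MBS macros.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t mb_add(uint32_t x, uint32_t y)
{
    uint32_t s = (x & 0x7f7f7f7fu) + (y & 0x7f7f7f7fu);
    return ((x ^ y) & 0x80808080u) ^ s;
}

static uint32_t mb_sub(uint32_t x, uint32_t y)
{
    uint32_t d = (x | 0x80808080u) - (y & 0x7f7f7f7fu);
    return ~(((x ^ y) | 0x7f7f7f7fu) ^ d);
}

/* 4*c per lane, modulo 256: clear the top two bits of every byte first so
   the shift cannot push bits into the neighbouring lane. */
static uint32_t mb_shl2(uint32_t c)
{
    return (c & 0x3f3f3f3fu) << 2;
}

/* a + b - 4*c + d + e on four packed pixels, each lane modulo 256. */
static uint32_t packed_diff(uint32_t a, uint32_t b, uint32_t c,
                            uint32_t d, uint32_t e)
{
    uint32_t sum = mb_add(mb_add(a, b), mb_add(d, e));
    return mb_sub(sum, mb_shl2(c));
}

/* The same filter on one pixel with full integer range. */
static int scalar_diff(int a, int b, int c, int d, int e)
{
    return a + b - (c << 2) + d + e;
}

static void lane_demo(void)
{
    /* Values from the sample code: same word as the macros printed. */
    assert(packed_diff(0xc7c8c9cau, 0xc8c9cacbu, 0xddc8c9cau,
                       0xcacbcccdu, 0xcbcccdceu) == 0xb0080808u);
    assert(scalar_diff(0xc7, 0xc8, 0xdd, 0xca, 0xcb) == -80);
    assert((scalar_diff(0xc7, 0xc8, 0xdd, 0xca, 0xcb) & 0xff) == 0xb0);
    /* A different true diff, +176, lands on the same lane value. */
    assert(scalar_diff(0x2c, 0x2c, 0x00, 0x2c, 0x2c) == 176);
    assert((packed_diff(0x2c, 0x2c, 0x00, 0x2c, 0x2c) & 0xffu) == 0xb0u);
}
```

The diff in (1) can range from -1020 to 1020, so as far as this sketch shows, an 8-bit lane on its own cannot tell -80 from 176; recovering the signed value would need extra bits per pixel kept somewhere.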
Cheers
//Fredrik