Alessio Sangalli
Hi, I am facing some performance issues in an algorithm that "translates"
some YUV data into another format. I'll keep everything very simple:
I have 4 blocks of Y data, one of U data and one of V data. Repeat that
structure a few hundred times to get a whole image.
Every block is an 8x8 grid.
Let's consider Y data only:
I have to take chunks of 8 bytes and put them in
different memory locations to build up a "planar" or "raster"
representation of the data. The following code processes one block:
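    unsigned char* destination;   /* raster output, xsize bytes per row */
    unsigned char* source;        /* packed 8x8 blocks, 8 bytes per block row */
    int xsize, ysize, x, y, i;
    [...]
    for(y=0; y<ysize; y+=16)
        for(x=0; x<xsize; x+=16)
            for(i=0; i<8; i++)
            {
                /* copy one 8-byte block row to raster row y+i, column x */
                memcpy(destination+(y+i)*xsize+x, source, sizeof(unsigned char)*8);
                source+=8;
            }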
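The block-to-raster copy described above can be sketched as a small self-contained helper. This is only an illustration under stated assumptions: the names `place_block` and `stride` are hypothetical and not taken from the code above, and the block is assumed to be stored as 64 consecutive bytes, 8 per block row.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy one packed 8x8 block (64 consecutive bytes,
 * 8 bytes per block row) into a raster image at pixel position (x, y).
 * stride is the raster width in bytes (xsize in the loop above).
 */
static void place_block(unsigned char *raster, size_t stride,
                        size_t x, size_t y, const unsigned char *block)
{
    size_t i;

    assert(raster != NULL && block != NULL);
    assert(x + 8 <= stride);   /* a block row must fit inside a raster row */

    for (i = 0; i < 8; i++) {
        /* block row i goes to raster row y+i, starting at column x */
        memcpy(raster + (y + i) * stride + x, block + i * 8, 8);
    }
}
```

For a 16-pixel-wide image, `place_block(raster, 16, 8, 0, block)` would write the block into columns 8 to 15 of rows 0 to 7 and leave everything else untouched.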
I noticed that the code above is *much* slower (ARM9, gcc 4.0.0, uClibc)
than the following trick:
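    unsigned char* destination;   /* raster output, addressed in bytes */
    unsigned int* dest;           /* word pointer into the output */
    unsigned int* source;         /* packed input, read one word at a time */
    int xsize, ysize, x, y, i;
    [...]
    for(y=0; y<ysize; y+=16)
        for(x=0; x<xsize; x+=16)
            for(i=0; i<8; i++)
            {
                dest = (unsigned int*)(destination+(y+i)*xsize+x);
                *(dest++) = *(source++);   /* bytes 0-3 */
                *(dest++) = *(source++);   /* bytes 4-7 */
            }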
Basically I know I have to copy 8 bytes, i.e. two words, and I do that
directly instead of calling memcpy.
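The word-wise copy relies on both pointers being 4-byte aligned, so a sketch of that precondition may be useful. It assumes `unsigned int` is 32 bits (as on ARM9) and models it with `uint32_t`; the names `copy8_words` and `offset_is_word_aligned` are hypothetical and do not appear in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy 8 bytes as two 32-bit words.
 * Both parameters are word pointers, so the caller guarantees alignment. */
static void copy8_words(uint32_t *dst, const uint32_t *src)
{
    assert(dst != NULL && src != NULL);
    dst[0] = src[0];   /* bytes 0-3 */
    dst[1] = src[1];   /* bytes 4-7 */
}

/* Hypothetical check: is the byte offset (y+i)*xsize+x a multiple of 4?
 * If the output buffer itself starts on a word boundary, this decides
 * whether casting that address to a word pointer is safe. */
static int offset_is_word_aligned(size_t xsize, size_t x, size_t y, size_t i)
{
    return (((y + i) * xsize + x) % 4u) == 0;
}
```

With x stepping in multiples of 16, the offset stays word-aligned whenever xsize is a multiple of 4; an odd width such as 30 would break the assumption on every other row.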
The first solution gives me roughly 70 fps, while the second one gives 230.
Any comments on this? Am I missing something? Is the memcpy implementation
being misled by something?
bye
Alessio