Moving bytes in memory

Alessio Sangalli

Hi, I am facing some performance issues in an algorithm that "translates"
some YUV data into another format. I'll keep everything very simple:

I have 4 blocks of Y data, one of U data and one of V data. Repeat the
previous structure a few hundred times to get a whole image.
Every block consists of an 8x8 grid.

Let's consider Y data only:
I have to convert it in such a way that I take chunks of 8 bytes and put
them in different memory locations to build up a "planar" or "raster"
representation of the data. The following code processes one block:

unsigned char* destination;
unsigned char* source;
int xsize, ysize, x, y, i;

[...]

for(y=0; y<ysize; y+=16)
    for(x=0; x<xsize; x+=16)
        for(i=0; i<8; i++)
        {
            memcpy(destination+(y+i)*xsize+x, yuvcurr, sizeof(unsigned char)*8);
            yuvcurr+=8;
        }


I noticed that the code above is *much* slower (ARM9, gcc 4.0.0, uClibc)
than the following trick:



unsigned int* destination;
unsigned int* source;
int xsize, ysize, x, y, i;

[...]

for(y=0; y<ysize; y+=16)
    for(x=0; x<xsize; x+=16)
        for(i=0; i<8; i++)
        {
            dest = (unsigned int*)(final+(y+i)*xsize+x);
            *(dest++) = *(source++);
            *(dest++) = *(source++);
        }


Basically I know I have to copy 8 bytes, i.e. two words, so I do that
directly instead of calling memcpy.
The first solution gives me roughly 70 fps, while the second one gives 230.

Any comment on this? Am I missing something, or is the memcpy
implementation being misled by something?

bye
Alessio
 
Alessio Sangalli

Eric said:
It might be simpler if you showed the actual code instead
of a paraphrase teeming with undeclared variables ...

every IRC channel/newsgroup/forum has its own rules. I am now using a
pastebin service. What is commented out is the "alternate" way to do it.
The "memcpy" version is at least 3 times slower.

http://pastebin.ca/1075068

bye
as
 
Kalle Olavi Niemitalo

Alessio Sangalli said:
dest = (unsigned int*)(final+(y+i)*xsize+x);
*(dest++) = *(source++);
*(dest++) = *(source++);

This assumes the data matches the alignment requirements of
unsigned int. In general, memcpy cannot assume that and must
either check the alignment every time or copy byte by byte,
slowing it down.
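
To illustrate that tradeoff, a memcpy-style routine built along those lines might look like the following sketch (purely illustrative, not any real libc's implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative memcpy-style routine: take the word-sized fast path
 * only when both pointers happen to be word-aligned, otherwise
 * (and for the tail) fall back to byte-by-byte copying. */
void *copy_checked(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path: both pointers aligned for unsigned int. */
    if ((((uintptr_t)d | (uintptr_t)s) % sizeof(unsigned int)) == 0) {
        while (n >= sizeof(unsigned int)) {
            *(unsigned int *)(void *)d =
                *(const unsigned int *)(const void *)s;
            d += sizeof(unsigned int);
            s += sizeof(unsigned int);
            n -= sizeof(unsigned int);
        }
    }

    /* Remaining bytes (or the whole copy, if unaligned). */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

The alignment check itself costs a few instructions per call, which is exactly the overhead the hand-written two-word copy avoids.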

http://www.arm.com/support/faqdev/4972.html mentions that ARM's
compiler redirects memcpy calls to a different function
(__rt_memcpy_w or __aeabi_memcpy4) if the pointer types in
argument expressions imply word-alignment, or even inlines the
copy if the size is also an aligned small constant. I don't know
if GCC has such a feature.

GCC probably cannot merge your two *(dest++) = *(source++);
statements into one LDM/STM pair because doing so would change
the result if dest + 1 == source. Replacing unsigned int with
e.g. uint64_t, so that only one assignment is needed, might
enable the use of LDM/STM; in principle this could require even
stricter alignment, but I guess ARM has __alignof(uint64_t) ==
__alignof(unsigned int). Alternatively, __restrict might help.
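
For illustration, the two word assignments collapsed into a single 64-bit, restrict-qualified assignment might look like this (a sketch; `copy8_u64` is a made-up name, and it assumes both pointers really are adequately aligned):

```c
#include <stdint.h>

/* Sketch of the suggestion above: one uint64_t assignment instead
 * of two unsigned int assignments, with restrict telling the
 * compiler the source and destination cannot overlap.  Assumes
 * suitable alignment; the function name is made up. */
static void copy8_u64(unsigned char *restrict dst,
                      const unsigned char *restrict src)
{
    *(uint64_t *)(void *)dst = *(const uint64_t *)(const void *)src;
}
```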
 
Willem

Alessio Sangalli wrote:
) Basically I know I have to copy 8 bytes or two words and I do that
) instead of calling memcpy.
) The first solution roughly gives me 70fps, while the second one 230.
)
) Any comment on this? Am I missing something, the memcpy implementation
) is mislead by something?

The only way memcpy would be able to beat copying 8 bytes by hand
is if it were inlined by the compiler and then optimized heavily, or if
the compiler 'knows' about memcpy and can optimize it accordingly.
(And even then, it could only beat your code by a small margin, e.g. by
using 128-bit instructions.)


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Peter Nilsson

Alessio said:
every IRC channel/newsgroup/forum has its own rules. I am now using a
pastebin service. What is commented is the "alternate" way to do it. The
"memcpy" version is at least 3 times slower.

http://pastebin.ca/1075068

So unroll everything! [Completely untested...]

void foo( unsigned char *d,
          unsigned char *s,
          size_t xsize,
          size_t ysize )
{
    unsigned char *s2; /* secondary source (s + 64) */
    unsigned char *y, *yend;
    unsigned char *x, *xend;

#define copy8(d,s) \
    do { \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
        *d++ = *s++; \
    } \
    while (0)

    for (y = d, yend = y + ysize * xsize; y < yend; y += 16 * xsize)
        for (x = y, xend = x + xsize; x < xend; )
        {
            s2 = s + 64;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            s += 64;

            s2 = s + 64;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16;
            copy8(x,s); copy8(x,s2); x += xsize - 16 * xsize;
            s += 64;
        }
}
 
James Dow Allen

Hi, I am facing some performance issues ...
I have 4 blocks of Y data, one of U data and one of V data
... I take chunks of 8bytes and put in
different memory locations
...
      memcpy(...)
...
the code above is *much* slower ... than
...
      *(dest++) = *(source++);
      *(dest++) = *(source++);
...

If you're very concerned with performance, I'm surprised
you bothered to try a memcpy() solution.
Regulars in this ng will babble over and over and over
and over again, albeit correctly, that Premature Optimization
is an evil, but replacing a function call with a
(more readable!) two-line fix in the inner loop known to
dominate performance hardly constitutes Premature
Optimization!

Please! To listen to some c.l.c'ers you'd think
any optimized code should be replaced forthwith
with an unoptimized version! (One regular touts
his hashlib every week or two, even though it
contains beaucoup code with no discernible purpose
except to slow it down!)

What I used to do, sometimes, in loops like yours,
was to examine the generated object code, see if it
could be improved, then figure how to modify the C
code to produce the optimal object code. (For
example, changing an index from signed to unsigned
might do good, IIRC.)

Do you need to convert YUV to/from RGB? If so, that
conversion is likely to be quite time-consuming.
There are some very interesting ways to speed
*that* up.

The World's Fastest Jpeg Compressor(tm) has DCT,
Huffman, Zigzag, color conversion, etc. that are all
quite fast, and ends up spending, believe it or not,
about 40% of (decompression) time in the very inner
loop similar to that you are describing!

Eric said:
It might be simpler if you showed the actual code instead
of a paraphrase teeming with undeclared variables ...

Eric's comments usually have great value and technical
accuracy. Not counting a spelling deviation on dest(ination)
I counted one (1) undeclared variable in the sample
code, hardly justifying the term "teeming", and the code
and question were self-explanatory anyway. Eric: are
you applying for premier membership in the c.l.c
pedant's club?

James Dow Allen
 
Bart

Alessio Sangalli said:
      memcpy(destination+(y+i)*xsize+x, yuvcurr, sizeof(unsigned char)*8);

I noticed that the code above is *much* slower (ARM9, gcc 4.0.0, uClibc)
than the following trick:

      dest = (unsigned int*)(final+(y+i)*xsize+x);
      *(dest++) = *(source++);
      *(dest++) = *(source++);

Basically I know I have to copy 8 bytes or two words and I do that
instead of calling memcpy.
The first solution roughly gives me 70fps, while the second one 230.

Any comment on this? Am I missing something, or is the memcpy
implementation being misled by something?

Copying 8 bytes can be done in typically 4 instructions on a 32-bit
machine.

How many instructions do you think it takes to push the parameters,
call memcpy, and for it to sort out all its logic, do the work, and
then return? In fact I'm surprised it's only 3 times slower. Maybe
your compiler is inlining it, although it's not doing a very good job
if so.
 
Alessio Sangalli

James said:
If you're very concerned with performance, I'm surprised
you bothered to try a memcpy() solution.

Well it's textbook knowledge, everybody knows that memcpy is supposed to
be the fastest way to copy memory ;)

Do you need to convert YUV to/from RGB? If so, that

No, I simply need to unpack the output of the JPEG decoder (hardware)
and put it in a different (raster) format for the scaler (hardware).
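
That per-block move can be written as a small helper along these lines (a sketch of the idea only; `block_to_raster` is a made-up name, and it handles a single 8x8 block rather than the full 4-Y/U/V macroblock layout):

```c
#include <string.h>

/* Sketch: copy one 8x8 block (64 contiguous source bytes, as a
 * block-ordered JPEG decoder emits them) into a raster plane of
 * width xsize, with the block's top-left corner at (x, y).
 * Name and interface are made up for illustration. */
static void block_to_raster(unsigned char *plane, int xsize,
                            int x, int y, const unsigned char *block)
{
    int row;

    for (row = 0; row < 8; row++)
        memcpy(plane + (size_t)(y + row) * xsize + x, block + row * 8, 8);
}
```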


Thank you, and thanks to everybody who helped me open my eyes about memcpy.

bye
as
 
Flash Gordon

Alessio Sangalli wrote, On 18/07/08 21:07:
Well it's textbook knowledge, everybody knows that memcpy is supposed to
be the fastest way to copy memory ;)

<snip>

You have to make sure you have included the correct headers *and*
enabled the optimiser, or memcpy might not be inlined and optimised
(that includes telling the compiler the oldest processor you want to
support; if using gcc on x86, for example, you might not need to
support an 80386!). However, it is still not guaranteed to be faster;
it depends on the quality of the implementation.

I would try to make sure that the data will match the strictest
alignment the processor supports. This *might* help memcpy and *will*
allow you to make use of fancier tricks if you need to.
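
On a POSIX system, one way to get storage with a known alignment is posix_memalign (a sketch; the wrapper name is made up, and the 16-byte alignment is an arbitrary example, not something this thread prescribes):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Sketch: allocate n bytes whose address is a multiple of align
 * (align must be a power of two and a multiple of sizeof(void *)).
 * The wrapper name is made up for this example; free() releases
 * the buffer as usual. */
static void *alloc_aligned(size_t align, size_t n)
{
    void *buf = NULL;

    if (posix_memalign(&buf, align, n) != 0)
        return NULL;            /* allocation failed */
    return buf;
}
```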

You have measured and found a problem, so trying to improve the
situation is not premature optimisation. However changing the code might
not be the *only* way to improve it.
 
Chris Torek

I noticed that [using memcpy] is *much* slower (ARM9, gcc 4.0.0, uClibc)
than [an inline expansion that takes advantage of some knowledge about
the data being copied.]

GCC-on-ARM is, shall we say, "not a very mature product".[%] But
if you are willing to do some compiler-source work, it is likely
that you can make your GCC emit a reasonably optimal code sequence
for this particular code. (You will want to add special handling
for memcpy of known sizes at known-or-implied alignments.) If you
then submit this back to the GCC maintainers, it may eventually
wind up in future distributions. Until then, you will have to
maintain this change yourself.

It is up to you to decide whether "maintaining changes to GCC until
they are picked up by the mainline" is worthwhile. The alternative
is that for this particular case, where you have already profiled
or otherwise timed your code and found a bottleneck that can be
worked around some other way, you write a little bit of "special"
code just for this ARM system, and maintain this special code in
your own system. Which one is more cost-effective for you? (The
answer depends on *you*, i.e., your own situation, so this is not
a rhetorical question.)

[% In the WR Linux group, we have a bit of kernel code that gcc
compiles incorrectly on ARM, but not on other machines. We had to
decide what to do about this until we could get the compiler bug
fixed. I was not involved in the decision, I just saw the issue
fly by in email.]
 
