asm code for ARM (very simple)

Gernot Frisch

Hi,

Can someone please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;
}


Thank you.
 
Michael DOUBEZ

Gernot Frisch wrote:
Hi,

can someone, please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;
}

Depending on the architecture you can:
- use DMA to copy the data: slow to set up, but may be more efficient
if you have a lot of data
- use burst transfers: these can usually copy four longs in one burst (check your hardware)
- configure your cache policy

This has nothing to do with C++.

Otherwise:
- use memmove instead of memcpy when its overlap-safe semantics are what you intend
- try to keep your data aligned on 4-, 8- or 16-byte boundaries

And the most important: benchmark to locate your bottleneck.
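As a minimal sketch of the memcpy/memmove suggestion above: the hand-written loop can be replaced by a library call, which many toolchains already tune for the target. The names copy_words and copy_words_overlapping are hypothetical, and whether this actually beats the loop on a given ARM core is something only a measurement can tell.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical replacement for _QCopy4. Copies nwords elements of type
// unsigned long. Assumes the two ranges do NOT overlap (memcpy's requirement).
inline void copy_words(unsigned long* dst, const unsigned long* src,
                       std::size_t nwords)
{
    std::memcpy(dst, src, nwords * sizeof(unsigned long));
}

// Same copy, but well-defined when the source and destination overlap.
inline void copy_words_overlapping(unsigned long* dst, const unsigned long* src,
                                   std::size_t nwords)
{
    std::memmove(dst, src, nwords * sizeof(unsigned long));
}
```

The byte count uses sizeof(unsigned long) rather than a literal 4, since unsigned long is not 4 bytes on every platform.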
 
peter koch

Hi,

can someone, please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
                    register unsigned long* b,
                    register unsigned int ndwords)
{
 // copy 4 bytes at once
 for(; ndwords>0; ndwords--) *a++=*b++;

}

What benchmarks did you make, that made you decide that this function
is a bottleneck and that the code generated by the compiler is
inadequate? I would expect the compiler to be able to generate quite
good if not optimal code here.

/Peter
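A rough sketch of the kind of measurement being asked about, assuming a hosted toolchain with <chrono> available on the target. The harness name time_qcopy4 and the buffer sizes are hypothetical; a serious benchmark would also control for cache state and compiler flags, and would compare against the generated assembly.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// The loop from the question, without the 'register' keyword.
inline void qcopy4(unsigned long* a, unsigned long* b, unsigned int ndwords)
{
    for (; ndwords > 0; ndwords--) *a++ = *b++;
}

// Hypothetical harness: average nanoseconds per call over `repeats` calls.
// Assumes nwords > 0 and repeats > 0.
inline double time_qcopy4(std::size_t nwords, int repeats)
{
    std::vector<unsigned long> src(nwords, 0xA5A5A5A5UL);
    std::vector<unsigned long> dst(nwords, 0);
    volatile unsigned long sink = 0;  // discourages the optimizer from dropping the copies

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        qcopy4(dst.data(), src.data(), static_cast<unsigned int>(nwords));
        sink = dst[nwords - 1];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

Calling time_qcopy4 with a few representative sizes, and doing the same for a memcpy-based version, gives numbers to argue from instead of a guess.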
 
Jorgen Grahn

Why is b not a pointer to const? It's an open invitation to copy in
the wrong direction.
What benchmarks did you make, that made you decide that this function
is a bottleneck and that the code generated by the compiler is
inadequate? I would expect the compiler to be able to generate quite
good if not optimal code here.

Unless the data is unaligned, and (which I seem to recall is the case
with ARM) unaligned reads & writes work, but are really, really slow.

(By the way, I'd skip the 'register' keyword, unless it really affects
the generated code. And if it *does*, I'd consider looking for a new
compiler.)

/Jorgen
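The two signature suggestions above could look like the sketch below. The name QCopy4 and the parameter names dst and src are hypothetical; the leading underscore is dropped because identifiers starting with an underscore followed by an uppercase letter are reserved for the implementation.

```cpp
// Sketch: same loop as in the question, with a pointer-to-const source
// and no 'register' keyword.
inline void QCopy4(unsigned long* dst, const unsigned long* src,
                   unsigned int ndwords)
{
    for (; ndwords > 0; --ndwords) *dst++ = *src++;
}
```

With this signature, passing a const buffer as the destination no longer compiles, so copying into read-only data by swapping the arguments is caught at build time.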
 
Michael DOUBEZ

Jorgen Grahn wrote:
Why is b not a pointer to const? It's an open invitation to copy in
the wrong direction.


Unless the data is unaligned, and (which I seem to recall is the case
with ARM) unaligned reads & writes work, but are really, really slow.

AFAIK an MMU can be integrated as an option with most ARM today.

But there is no problem here since the parameters are pointers to long; they
should be on the right boundaries, unless the caller coerced them from some
other pointer type, and that is UB anyway.
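If coerced pointers are a worry, a debug-time check along these lines can flag them. This is only a sketch: the helper name is hypothetical, and it relies on the implementation-defined (but near-universal) behaviour that converting a pointer to std::uintptr_t yields its numeric address.

```cpp
#include <cstdint>

// Hypothetical helper: true if p is suitably aligned to be accessed
// as an unsigned long on this platform.
inline bool is_aligned_for_ulong(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(unsigned long) == 0;
}
```

A caller could assert on both arguments before invoking the copy routine, so that a misaligned buffer shows up in testing rather than as a slow or corrupted copy in the field.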
 
Jorgen Grahn

Jorgen Grahn wrote:

AFAIK an MMU can be integrated as an option with most ARM today.

I'm not sure an MMU has anything to do with it. I have seen two or
three different systems without an MMU which worked "correctly" with
unaligned accesses, but at a huge speed penalty.
But there is no problem here since the parameters are pointers to long; they
should be on the right boundaries, unless the caller coerced them from some
other pointer type, and that is UB anyway.

It's UB, but it's unfortunately very common out there, especially in
embedded systems. It might be part of this particular problem.

/Jorgen
 
Michael DOUBEZ

Jorgen Grahn wrote:
I'm not sure an MMU has anything to do with it. I have seen two or
three different systems without an MMU which worked "correctly" with
unaligned accesses, but at a huge speed penalty.

Without an MMU, you get corrupted data unless the software you use can
add some magic.

I know it from experience: we had this problem on a network device where
data was written by an Ethernet device into 4-byte-aligned memory. But
the Ethernet header is 14 bytes long, which means that all remaining data
(IP addresses, TCP information, application data ...) was unaligned. I
won't elaborate on the fact that the development done before we received
the chip relied on cleanly aligned data.

We got through it but an MMU seemed priceless :)
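One common software workaround for this situation, shown here as a sketch and not necessarily what was done on that device, is to read multi-byte fields through memcpy so that no misaligned word load is ever issued. The helper name load_u32 is hypothetical; the 14-byte header size comes from the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kEthernetHeaderSize = 14;  // untagged Ethernet header

// Hypothetical helper: reads 4 bytes starting at `offset` in the frame without
// requiring that address to be 4-byte aligned. The value is returned in the
// byte order it has in memory; no network-to-host conversion is done here.
inline std::uint32_t load_u32(const unsigned char* frame, std::size_t offset)
{
    std::uint32_t value;
    std::memcpy(&value, frame + offset, sizeof value);
    return value;
}
```

For example, the IPv4 source address sits 12 bytes into the IP header, so in a frame that starts on a 4-byte boundary it lands at offset kEthernetHeaderSize + 12 = 26, which is not a multiple of 4.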
It's UB, but it's unfortunately very common out there, especially in
embedded systems. It might be part of this particular problem.

It might be if you write in C, but reinterpret_cast<> tends to stand out
in C++ and is caught at the first code review.
 
Jorgen Grahn

Jorgen Grahn wrote:

It might be if you write in C, but reinterpret_cast<> tends to stand out
in C++ and is caught at the first code review.

Yes, and I like gcc's -Wold-style-cast flag.

But you are assuming real-world projects use reinterpret_cast<>,
perform code reviews, and care about type safety. I agree that they
*should* (I think it would pay off quickly), but many don't.

/Jorgen
 
