asm code for ARM (very simple)

Gernot Frisch · Sep 4, 2008

Hi,

can someone please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;
}

Thank you.

Michael DOUBEZ · Sep 4, 2008

Gernot Frisch wrote:

Hi,

can someone please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;
}

Depending on the architecture you can:
- use a DMA to copy the data: slow to set up but may be more efficient
if you have a lot of data
- use bursts: you can usually copy 4 longs in one burst (check your hardware)
- configure your cache policy

This has nothing to do with C++.

Otherwise:
- use memmove instead of memcpy when its semantics are what you intend (the ranges may overlap)
- try to keep your data aligned on 4-, 8- or 16-byte boundaries
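The memcpy/memmove distinction above can be sketched as follows. This is only an illustration: the helper names copy_words and move_words are made up here and do not come from the thread, and whether the library routines beat the hand-written loop on a given ARM target is something to measure, not assume.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies nwords words from src to dst.
// Precondition (from memcpy): the two ranges must not overlap.
inline void copy_words(unsigned long* dst, const unsigned long* src,
                       std::size_t nwords)
{
    std::memcpy(dst, src, nwords * sizeof *dst);
}

// Hypothetical helper: same copy, but the ranges are allowed to overlap.
inline void move_words(unsigned long* dst, const unsigned long* src,
                       std::size_t nwords)
{
    std::memmove(dst, src, nwords * sizeof *dst);
}
```

For example, shifting the contents of a buffer up by one word within the same array is an overlapping copy, so move_words is the one to call there.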

And most important: benchmark to locate your bottleneck.

peter koch · Sep 4, 2008

Hi,

can someone please optimize this routine for an ARM processor?

inline void _QCopy4(register unsigned long* a,
register unsigned long* b,
register unsigned int ndwords)
{
// copy 4 bytes at once
for(; ndwords>0; ndwords--) *a++=*b++;

}

What benchmarks did you make, that made you decide that this function
is a bottleneck and that the code generated by the compiler is
inadequate? I would expect the compiler to be able to generate quite
good if not optimal code here.

/Peter
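A rough way to answer the benchmark question is to time the loop against a library copy on the actual target. The sketch below is an assumption-laden starting point, not a rigorous benchmark: time_copy_us and loop_copy are names invented for this example, the buffer size and repetition count are arbitrary, and an optimizing compiler may still hoist or merge the repeated copies, so the generated code should be inspected as well.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// The loop from the original post, without the register keyword.
inline void loop_copy(unsigned long* a, unsigned long* b, unsigned int ndwords)
{
    for (; ndwords > 0; ndwords--) *a++ = *b++;
}

// Hypothetical timing helper: calls copy(dst, src, nwords) reps times
// and returns the elapsed wall-clock time in microseconds.
template <typename Copy>
long long time_copy_us(Copy copy, std::size_t nwords, int reps)
{
    std::vector<unsigned long> src(nwords, 0xA5A5A5A5UL);
    std::vector<unsigned long> dst(nwords, 0UL);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        copy(dst.data(), src.data(), static_cast<unsigned int>(nwords));
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one copied word through a volatile so the result stays observable.
    volatile unsigned long sink = dst.empty() ? 0UL : dst.back();
    (void)sink;

    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

Calling time_copy_us(loop_copy, 1024, 10000) and then the same with a lambda that wraps std::memcpy gives two numbers to compare for that buffer size.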

Jorgen Grahn · Sep 8, 2008

Why is b not a pointer to const? It's an open invitation to copy in
the wrong direction.

What benchmarks did you make, that made you decide that this function
is a bottleneck and that the code generated by the compiler is
inadequate? I would expect the compiler to be able to generate quite
good if not optimal code here.

Unless the data is unaligned, and (which I seem to recall is the case
with ARM) unaligned reads & writes work, but are really, really slow.

(By the way, I'd skip the 'register' keyword, unless it really affects
the generated code. And if it *does*, I'd consider looking for a new
compiler.)

/Jorgen
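The two suggestions in this post (a pointer-to-const source, no register keyword) might look like the sketch below. The name QCopy4 without the leading underscore and the dst/src parameter names are choices made for this example; identifiers that begin with an underscore followed by an uppercase letter are reserved for the implementation in C++.

```cpp
#include <cstddef>

// Sketch of the original routine with a const source and no register keyword.
// Copies one unsigned long per iteration; that is 4 bytes only on targets
// where unsigned long is 4 bytes wide.
inline void QCopy4(unsigned long* dst, const unsigned long* src,
                   std::size_t ndwords)
{
    for (; ndwords > 0; --ndwords) *dst++ = *src++;
}
```

With this signature, passing a const buffer as the first argument fails to compile, so swapping the arguments on read-only data is caught at compile time.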

Michael DOUBEZ · Sep 9, 2008

Jorgen Grahn wrote:

Why is b not a pointer to const? It's an open invitation to copy in
the wrong direction.

Unless the data is unaligned, and (which I seem to recall is the case
with ARM) unaligned reads & writes work, but are really, really slow.

AFAIK an MMU can be integrated as an option with most ARM cores today.

But there is no problem here since the parameters are pointers to long; they
should be on the right boundaries. Unless the caller coerced them, but
that is UB anyway.
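If a caller might have coerced a byte pointer into an unsigned long pointer, a debug-time check of the address can expose it. The sketch below assumes a platform where converting a pointer to std::uintptr_t yields the numeric address (the mapping is implementation-defined); is_aligned_for_ulong is a name invented for this example.

```cpp
#include <cstdint>

// Hypothetical helper: true if p is suitably aligned for an unsigned long.
// Assumes the pointer-to-integer conversion yields the plain address.
inline bool is_aligned_for_ulong(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(unsigned long) == 0;
}
```

A copy routine could assert this on both pointers in debug builds before entering its loop.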

Jorgen Grahn · Sep 9, 2008

Jorgen Grahn wrote:

AFAIK an MMU can be integrated as an option with most ARM cores today.

I'm not sure an MMU has anything to do with it. I have seen two or
three different systems without an MMU which worked "correctly" with
unaligned accesses, but at a huge speed penalty.

But there is no problem here since the parameters are pointers to long; they
should be on the right boundaries. Unless the caller coerced them, but
that is UB anyway.

It's UB, but it's unfortunately very common out there, especially in
embedded systems. It might be part of this particular problem.

/Jorgen

Michael DOUBEZ · Sep 9, 2008

Jorgen Grahn wrote:

I'm not sure an MMU has anything to do with it. I have seen two or
three different systems without an MMU which worked "correctly" with
unaligned accesses, but at a huge speed penalty.

Without an MMU, you get corrupted data unless the software you use can
add some magic.

I know it from experience: we had this problem on a network device where
data was written by an Ethernet device into 4-byte-aligned memory. But
the Ethernet header is 14 bytes long, which means that all remaining data
(IP addresses, TCP information, application data ...) was unaligned. I
won't elaborate on the fact that the development done before we received
the chip relied on cleanly aligned data.

We got through it, but an MMU seemed priceless.
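One software way around the situation described above is to never dereference a wider pointer into the frame and to assemble the fields from bytes instead. The sketch below is not what the poster's project did (the thread does not say); the function names are invented, and the offset of 12 for the source address inside the IPv4 header is taken from the IPv4 header layout, giving byte 26 of the frame, which is not a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kEthernetHeaderSize = 14;  // header size from the post
constexpr std::size_t kIpv4SrcAddrOffset = 12;   // source address in IPv4 header

// Reads a big-endian (network order) 32-bit value from any byte address.
inline std::uint32_t load_be32(const unsigned char* p)
{
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Alternative: copy the raw bytes into an aligned object (host byte order).
inline std::uint32_t load_raw32(const unsigned char* p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical accessor: IPv4 source address of a frame that starts with a
// 14-byte Ethernet header, returned as a host-order integer.
inline std::uint32_t ipv4_source_address(const unsigned char* frame)
{
    return load_be32(frame + kEthernetHeaderSize + kIpv4SrcAddrOffset);
}
```

Both loaders only ever access single bytes of the frame, so they do not depend on how the hardware treats unaligned word accesses.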

It's UB, but it's unfortunately very common out there, especially in
embedded systems. It might be part of this particular problem.

It might be if you write in C, but reinterpret_cast<> tends to stand out
in C++ and is caught at the first code review.

Jorgen Grahn · Sep 16, 2008

Jorgen Grahn wrote:

It might be if you write in C, but reinterpret_cast<> tends to stand out
in C++ and is caught at the first code review.

Yes, and I like gcc's -Wold-style-cast flag.

But you are assuming real-world projects use reinterpret_cast<>,
perform code reviews, and care about type safety. I agree that they
*should* (I think it would pay off quickly), but many don't.

/Jorgen
