asm code for ARM (very simple)

Discussion in 'C++' started by Gernot Frisch, Sep 4, 2008.

  1. Hi,

    Can someone please optimize this routine for an ARM processor?

    inline void _QCopy4(register unsigned long* a,
                        register unsigned long* b,
                        register unsigned int ndwords)
    {
        // copy 4 bytes at once
        for(; ndwords>0; ndwords--) *a++=*b++;
    }


    Thank you.


    --
    ------------------------------------
    Gernot Frisch
    http://www.glbasic.com
    Gernot Frisch, Sep 4, 2008
    #1

  2. Gernot Frisch wrote:
    > Hi,
    >
    > can someone, please optimize this routine for an ARM processor?
    >
    > inline void _QCopy4(register unsigned long* a,
    > register unsigned long* b,
    > register unsigned int ndwords)
    > {
    > // copy 4 bytes at once
    > for(; ndwords>0; ndwords--) *a++=*b++;
    > }


    Depending on the architecture you can:
    - use DMA to copy the data: slow to set up, but may be more efficient
    if you have a lot of data
    - use bursts: usually you can copy 4 longs in one burst (check your hardware)
    - configure your cache policy

    This has nothing to do with C++.

    Otherwise:
    - use memmove instead of memcpy when those are the intended semantics
    - try to keep your data aligned on 4-, 8- or 16-byte boundaries

    And most important: benchmark to locate your bottleneck.
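
The memcpy/memmove distinction above can be sketched as follows. This is only an illustration under stated assumptions, not code from the thread: `copy_words` and `move_words` are hypothetical names, and whether the library routines beat a hand-written loop on a given ARM toolchain is something only a measurement can tell.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: use when source and destination never overlap.
inline void copy_words(unsigned long* dst, const unsigned long* src, std::size_t nwords)
{
    std::memcpy(dst, src, nwords * sizeof *src);
}

// Hypothetical helper: use when the two ranges may overlap.
// std::memmove behaves as if the source were first copied to a temporary buffer.
inline void move_words(unsigned long* dst, const unsigned long* src, std::size_t nwords)
{
    std::memmove(dst, src, nwords * sizeof *src);
}
```

Shifting a buffer by one element inside itself, for example, would call for `move_words(buf + 1, buf, n)`, since the two ranges overlap.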

    --
    Michael
    Michael DOUBEZ, Sep 4, 2008
    #2

  3. Gernot Frisch

    peter koch Guest

    On 4 Sep., 09:52, "Gernot Frisch" <> wrote:
    > Hi,
    >
    > can someone, please optimize this routine for an ARM processor?
    >
    > inline void _QCopy4(register unsigned long* a,
    >                     register unsigned long* b,
    >                     register unsigned int ndwords)
    > {
    >  // copy 4 bytes at once
    >  for(; ndwords>0; ndwords--) *a++=*b++;
    >
    > }


    What benchmarks did you run that led you to decide that this function
    is a bottleneck and that the code generated by the compiler is
    inadequate? I would expect the compiler to be able to generate quite
    good, if not optimal, code here.
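
A rough way to get such numbers might look like the sketch below. It is a minimal harness under assumptions, not a rigorous benchmark: `loop_copy`, `memcpy_copy` and `time_copy_ms` are hypothetical names, the timing uses `std::chrono::steady_clock` on the host, and an optimizing compiler may still transform or partly remove the copies, so the generated assembly is worth checking too.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// The loop from the original post, with a const source (hypothetical name).
inline void loop_copy(unsigned long* dst, const unsigned long* src, std::size_t nwords)
{
    for (; nwords > 0; --nwords) *dst++ = *src++;
}

// The same copy delegated to the C library (hypothetical name).
inline void memcpy_copy(unsigned long* dst, const unsigned long* src, std::size_t nwords)
{
    std::memcpy(dst, src, nwords * sizeof *src);
}

// Returns the elapsed time in milliseconds for `repeats` copies of `nwords` words.
template <typename CopyFn>
double time_copy_ms(CopyFn copy, std::size_t nwords, int repeats)
{
    if (nwords == 0) return 0.0;  // nothing to measure

    std::vector<unsigned long> src(nwords, 0xA5A5A5A5ul);
    std::vector<unsigned long> dst(nwords, 0ul);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        copy(dst.data(), src.data(), nwords);
    }
    const auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the copies are less likely
    // to be discarded as dead code.
    volatile unsigned long sink = dst[nwords - 1];
    (void)sink;

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing `time_copy_ms(loop_copy, n, r)` with `time_copy_ms(memcpy_copy, n, r)` for realistic sizes on the actual target gives a first indication of whether the loop is worth hand-tuning at all.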

    /Peter
    peter koch, Sep 4, 2008
    #3
  4. Gernot Frisch

    Jorgen Grahn Guest

    On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <> wrote:
    > On 4 Sep., 09:52, "Gernot Frisch" <> wrote:
    >> Hi,
    >>
    >> can someone, please optimize this routine for an ARM processor?
    >>
    >> inline void _QCopy4(register unsigned long* a,
    >>                     register unsigned long* b,
    >>                     register unsigned int ndwords)


    Why is b not a pointer to const? It's an open invitation to copy in
    the wrong direction.

    >> {
    >>  // copy 4 bytes at once
    >>  for(; ndwords>0; ndwords--) *a++=*b++;
    >>
    >> }

    >
    > What benchmarks did you make, that made you decide that this function
    > is a bottleneck and that the code generated by the compiler is
    > inadequate? I would expect the compiler to be able to generate quite
    > good if not optimal code here.


    Unless the data is unaligned, and (which I seem to recall is the case
    with ARM) unaligned reads & writes work, but are really, really slow.

    (By the way, I'd skip the 'register' keyword, unless it really affects
    the generated code. And if it *does*, I'd consider looking for a new
    compiler.)
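
Both remarks can be folded into the signature. The sketch below is an assumption about what a tidier declaration might look like, not the original poster's code: the name `qcopy4` is hypothetical (an identifier such as `_QCopy4`, starting with an underscore and a capital letter, is reserved for the implementation), and `std::uint32_t` replaces `unsigned long` on the assumption that exactly 4 bytes per element is what is wanted.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical const-correct variant: the source is read-only and
// 'register' is dropped, leaving register allocation to the compiler.
inline void qcopy4(std::uint32_t* dst, const std::uint32_t* src, std::size_t nwords)
{
    for (; nwords > 0; --nwords) {
        *dst++ = *src++;  // copy one 32-bit word per iteration
    }
}
```

With a pointer-to-const source, swapping the two arguments by mistake no longer compiles whenever the source data really is const.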

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
    Jorgen Grahn, Sep 8, 2008
    #4
  5. Jorgen Grahn wrote:
    > On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <> wrote:
    >> On 4 Sep., 09:52, "Gernot Frisch" <> wrote:
    >>> Hi,
    >>>
    >>> can someone, please optimize this routine for an ARM processor?
    >>>
    >>> inline void _QCopy4(register unsigned long* a,
    >>> register unsigned long* b,
    >>> register unsigned int ndwords)

    >
    > Why is b not a pointer to const? It's an open invitation to copy in
    > the wrong direction.
    >
    >>> {
    >>> // copy 4 bytes at once
    >>> for(; ndwords>0; ndwords--) *a++=*b++;
    >>>
    >>> }

    >> What benchmarks did you make, that made you decide that this function
    >> is a bottleneck and that the code generated by the compiler is
    >> inadequate? I would expect the compiler to be able to generate quite
    >> good if not optimal code here.

    >
    > Unless the data is unaligned, and (which I seem to recall is the case
    > with ARM) unaligned reads & writes work, but are really, really slow.


    AFAIK an MMU can be integrated as an option with most ARM today.

    But there is no problem here, since the parameters are pointers to long;
    they should already be on the right boundaries, unless the caller coerced
    them, which is UB anyway.

    --
    Michael
    Michael DOUBEZ, Sep 9, 2008
    #5
  6. Gernot Frisch

    Jorgen Grahn Guest

    On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <> wrote:
    > Jorgen Grahn wrote:
    >> On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <> wrote:
    >>> On 4 Sep., 09:52, "Gernot Frisch" <> wrote:
    >>>> Hi,
    >>>>
    >>>> can someone, please optimize this routine for an ARM processor?
    >>>>
    >>>> inline void _QCopy4(register unsigned long* a,
    >>>> register unsigned long* b,
    >>>> register unsigned int ndwords)
    >>>> {
    >>>> // copy 4 bytes at once
    >>>> for(; ndwords>0; ndwords--) *a++=*b++;
    >>>>
    >>>> }
    >>> What benchmarks did you make, that made you decide that this function
    >>> is a bottleneck and that the code generated by the compiler is
    >>> inadequate? I would expect the compiler to be able to generate quite
    >>> good if not optimal code here.

    >>
    >> Unless the data is unaligned, and (which I seem to recall is the case
    >> with ARM) unaligned reads & writes work, but are really, really slow.

    >
    > AFAIK an MMU can be integrated as an option with most ARM today.


    I'm not sure an MMU has anything to do with it. I have seen two or
    three different systems without an MMU which worked "correctly" with
    unaligned accesses, but at a huge speed penalty.

    > But there is no problem here since the parameters are long pointer; they
    > should be on the right boundaries. Unless the caller coerced them but
    > this is however UB.


    It's UB, but it's unfortunately very common out there, especially in
    embedded systems. It might be part of this particular problem.
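
If coerced pointers are suspected, a debug-time check along these lines could help confirm it. This is a sketch with a hypothetical helper name, `is_aligned`; it assumes `std::uintptr_t` is available and that converting a pointer to it yields the numeric address, which is implementation-defined but typical on flat-address ARM targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment`,
// which must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

An `assert(is_aligned(a, 4) && is_aligned(b, 4));` at the top of the copy routine would then flag misaligned callers in debug builds.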

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
    Jorgen Grahn, Sep 9, 2008
    #6
  7. Jorgen Grahn wrote:
    > On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <> wrote:
    >> Jorgen Grahn wrote:
    >>> On Thu, 4 Sep 2008 14:54:20 -0700 (PDT), peter koch <> wrote:
    >>>> On 4 Sep., 09:52, "Gernot Frisch" <> wrote:
    >>>>> Hi,
    >>>>>
    >>>>> can someone, please optimize this routine for an ARM processor?
    >>>>>
    >>>>> inline void _QCopy4(register unsigned long* a,
    >>>>> register unsigned long* b,
    >>>>> register unsigned int ndwords)
    >>>>> {
    >>>>> // copy 4 bytes at once
    >>>>> for(; ndwords>0; ndwords--) *a++=*b++;
    >>>>>
    >>>>> }
    >>>> What benchmarks did you make, that made you decide that this function
    >>>> is a bottleneck and that the code generated by the compiler is
    >>>> inadequate? I would expect the compiler to be able to generate quite
    >>>> good if not optimal code here.
    >>> Unless the data is unaligned, and (which I seem to recall is the case
    >>> with ARM) unaligned reads & writes work, but are really, really slow.

    >> AFAIK an MMU can be integrated as an option with most ARM today.

    >
    > I'm not sure an MMU has anything to do with it. I have seen two or
    > three different systems without an MMU which worked "correctly" with
    > unaligned accesses, but at a huge speed penalty.


    Without an MMU, you get corrupted data unless the software you use can
    add some magic.

    I know it from experience: we had this problem on a network device where
    data was written by an Ethernet device into 4-byte-aligned memory. But
    the Ethernet header is 14 bytes long, which means that all the remaining
    data (IP addresses, TCP information, application data ...) was unaligned.
    I won't elaborate on the fact that the development done before we
    received the chip relied on cleanly aligned data.

    We got through it, but an MMU seemed priceless :)
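
One software-only way to read fields that sit behind a 14-byte Ethernet header is to assemble them byte by byte, which never performs a misaligned word access. The sketch below illustrates that general technique and is not the fix used in the project described above; `load_be32` is a hypothetical name, and it assumes the field is stored in network (big-endian) byte order.

```cpp
#include <cstdint>

// Hypothetical helper: read a 32-bit big-endian value from any address,
// one byte at a time, so the alignment of p does not matter.
inline std::uint32_t load_be32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}
```

For a frame whose IPv4 header starts right after the Ethernet header, the source address would be read as `load_be32(frame + 14 + 12)`, an offset that is not a multiple of 4 when the frame itself is 4-byte aligned.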

    >> But there is no problem here since the parameters are long pointer; they
    >> should be on the right boundaries. Unless the caller coerced them but
    >> this is however UB.

    >
    > It's UB, but it's unfortunately very common out there, especially in
    > embedded systems. It might be part of this particular problem.


    It might be if you write in C, but reinterpret_cast<> tends to stand out
    in C++ and is caught at the first code review.

    --
    Michael
    Michael DOUBEZ, Sep 9, 2008
    #7
  8. Gernot Frisch

    Jorgen Grahn Guest

    On Tue, 09 Sep 2008 14:18:05 +0200, Michael DOUBEZ <> wrote:
    > Jorgen Grahn wrote:
    >> On Tue, 09 Sep 2008 09:22:14 +0200, Michael DOUBEZ <> wrote:
    >>> Jorgen Grahn wrote:


    >>> But there is no problem here since the parameters are long pointer; they
    >>> should be on the right boundaries. Unless the caller coerced them but
    >>> this is however UB.

    >>
    >> It's UB, but it's unfortunately very common out there, especially in
    >> embedded systems. It might be part of this particular problem.

    >
    > It might be if you write in C, but reinterpret_cast<> tend to stand out
    > in C++ and is caught at the first code review.


    Yes, and I like gcc's -Wold-style-cast flag.

    But you are assuming real-world projects use reinterpret_cast<>,
    perform code reviews, and care about type safety. I agree that they
    *should* (I think it would pay off quickly), but many don't.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
    Jorgen Grahn, Sep 16, 2008
    #8