AVX in Visual Studio

Discussion in 'C++' started by Thomas, Apr 19, 2013.

  1. Thomas

    Thomas Guest

    Hello

    I've just started converting some C++ code that is optimised for SSE4 to
    AVX.

    After a few false starts with Visual Studio, I've finally started to
    generate AVX code. Unfortunately the result is that my application is
    running slower rather than faster, although it does at least produce the
    correct results.

    I suspect that the main reason for the decrease in speed is that the
    compiler is mixing SSE and AVX code, which I believe is a real performance
    killer?

    For example the following line:

    const embree::avxf eps(1.0e-20f);

    Generates the following assembly code

    0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
    0000000000884F63 vmovss dword ptr [rbp],xmm0
    0000000000884F68 lea rax,[rbp]
    0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
    0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
    0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
    0000000000884F7B vmovaps ymmword ptr [eps],ymm0

    As you can see, I'm using the Intel Embree intrinsics wrapper library.

    Any idea how to avoid this? Do I need to hand-code the lower-level
    intrinsics?

    Thanks for any help.

    Thomas
    Thomas, Apr 19, 2013
    #1

  2. Melzzzzz

    Melzzzzz Guest

    On Fri, 19 Apr 2013 18:55:16 +0100
    "Thomas" <> wrote:

    > Hello
    >
    > I've just started to convert some c++ code which is optimised for
    > sse4 to avx.
    >
    > After a few false starts with Visual Studio, I've finally started to
    > generate avx code. Unfortunately the result is that my application is
    > running slower rather than faster - it does at least produce the
    > correct results.
    >
    > I suspect that the main reason for the decrease in speed is that the
    > compiler is mixing up sse and avx code which I believe is a real
    > performance killer?
    >
    > For example the following line:
    >
    > const embree::avxf eps(1.0e-20f);
    >
    > Generates the following assembly code
    >
    > 0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
    > 0000000000884F63 vmovss dword ptr [rbp],xmm0
    > 0000000000884F68 lea rax,[rbp]
    > 0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
    > 0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
    > 0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
    > 0000000000884F7B vmovaps ymmword ptr [eps],ymm0
    >
    > As you can see, I'm using the intel embree intrinsic library.


    This is not mixing SSE with AVX. All instructions are prefixed with v.

    >
    > Any idea how to avoid this - do I need to hand code the lower level
    > intrinsics?


    I think the problem lies somewhere else.
    Melzzzzz, Apr 19, 2013
    #2

  3. Thomas

    Thomas Guest

    "Melzzzzz" <> wrote in message
    news:kks2f9$nff$...
    > On Fri, 19 Apr 2013 18:55:16 +0100
    > "Thomas" <> wrote:



    >
    > This is not mixing sse with avx. All instructions are prefixed with v.
    >
    >>
    >> Any idea how to avoid this - do I need to hand code the lower level
    >> intrinsics?

    >
    > I think that problem lies somewhere else.
    >


    Thanks, that's really useful - I also had a look at an intel doc about
    mixing sse and avx which made the same point.

    In my case I have a function for intersecting a ray with a triangle. The
    function was written so that all the floats corresponding to the triangle
    could be converted to embree::ssef's so that 4 triangles at a time could be
    intersected. That gave almost a 4x speed-up. The next step was to convert
    the floats to embree::avxf's - but instead of something approaching an 8x
    speed up I got a slow-down. I still suspect that this is caused by mixing
    sse and avx since the profiler still points to the intersect routine as
    being 90+% of the run-time.

    So, what about the following two C++-to-assembly conversions, which were
    created by the Visual Studio C++ compiler with speed optimisation enabled?
    They each seem to contain a mix of AVX and non-AVX instructions (??). They
    also seem much more verbose than I would have expected, but maybe I'm
    missing something?

    I can see that mixing AVX and non-AVX might be unavoidable in the second
    example, where an AVX variable is being reduced to a scalar. But the first
    example looks like a classic case for AVX, so why the non-AVX instructions?

    Again, Many Thanks
    Thomas


    embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

    0000000001315939 mov rax,qword ptr [this]
    0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
    0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
    0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
    0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]
    0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
    0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]
    0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
    0000000001315979 mov rax,qword ptr [this]
    0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
    0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
    0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
    0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
    00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
    00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
    00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
    00000000013159B9 mov rax,qword ptr [this]
    00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
    00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
    00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
    00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
    00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
    00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
    00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
    00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
    0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
    0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
    0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
    0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
    0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
    0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
    0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
    0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
    0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
    0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
    0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
    0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
    0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
    0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
    0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
    0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
    0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
    0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
    0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
    0000000001315A99 vmovaps ymmword ptr [t],ymm0


    if(embree::reduce_or(valid)==false)

    00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
    00000000013161CF vtestps ymm0,ymmword ptr [valid]
    00000000013161D8 mov eax,1
    00000000013161DD mov ecx,0
    00000000013161E2 cmove ecx,eax
    00000000013161E5 test ecx,ecx
    00000000013161E7 jne 00000000013161F5
    00000000013161E9 mov dword ptr [rbp+1D80h],1
    00000000013161F3 jmp 00000000013161FF
    00000000013161F5 mov dword ptr [rbp+1D80h],0
    00000000013161FF movzx eax,byte ptr [rbp+1D80h]
    0000000001316206 test eax,eax
    0000000001316208 jne 0000000001316214

    return -1;

    000000000131620A mov eax,0FFFFFFFFh
    000000000131620F jmp 000000000131650A
    Thomas, Apr 19, 2013
    #3
  4. Thomas

    Thomas Guest

    "Andy Champ" <> wrote in message
    news:...
    > On 19/04/2013 20:52, Thomas wrote:
    >> I can see that mixing avx and non-avx might be unavoidable in the second
    >> example where an avx variable is being reduced to a scalar. But the first
    >> example looks like a classic case for avx, so why the non avx
    >> instructions?

    >
    > I'm not familiar with AVX. But the instructions I do recognise imply to me
    > that vx_, vy_ and vz_ are member variables of the current object, so it is
    > loading the address of this into rax for address calculation.
    >
    > Are you sure you have all the optimisation turned on? It seems odd that
    > it is doing it three times. And the rest of the instructions look as
    > though they are copying data in and out of memory a lot.
    >
    > Andy
    >


    Thanks

    Yes, vx_, vy_ and vz_ are member variables, and yes, I have "optimize
    speed" turned on for this module in VS-10.

    The question is: does the "mov rax,qword ptr [this]" count as a non-AVX
    instruction that will significantly hurt the performance of the
    subsequent AVX instructions (the v-prefixed instructions)?

    Thanks
    Thomas
    Thomas, Apr 20, 2013
    #4
  5. Melzzzzz

    Melzzzzz Guest

    On Sat, 20 Apr 2013 09:15:02 +0100
    "Thomas" <> wrote:

    >
    > The question is; does the "mov rax,qword ptr [this]" amount to a non
    > avx-instruction which will significantly hit the performance of the
    > subsequent avx instructions (the v-prefixed instructions)?
    >

    No. Only mixing legacy SSE instructions with the 'v'-prefixed (VEX-encoded)
    ones slows things down. Since the compiler is generating the code, I
    wouldn't suspect SSE/AVX mixing, but rather poorly optimised code.
    Melzzzzz, Apr 20, 2013
    #5
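
    The distinction drawn in this reply can be checked mechanically against a
    disassembly listing. What follows is only a rough C++ sketch, not a real
    disassembler. The names InsnKind, looks_like_address and classify are made
    up for illustration, and it assumes lowercase, Visual Studio style listing
    lines like the ones quoted in this thread. Any instruction with an xmm or
    ymm operand is treated as SIMD; it is called VEX-encoded if the mnemonic
    starts with 'v' and legacy SSE otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical categories, named for this sketch only.
enum class InsnKind { GeneralPurpose, LegacySse, VexEncoded };

// Assumption: a leading token of 8+ hex digits is the address column.
// (Short all-hex mnemonics such as "add" are therefore not mistaken for one.)
inline bool looks_like_address(const std::string& tok) {
    return tok.size() >= 8 &&
           std::all_of(tok.begin(), tok.end(),
                       [](unsigned char c) { return std::isxdigit(c) != 0; });
}

// Heuristic classification of one disassembly line.
inline InsnKind classify(const std::string& line) {
    std::istringstream in(line);
    std::string mnemonic;
    in >> mnemonic;
    if (looks_like_address(mnemonic)) {
        in >> mnemonic;  // skip the address column, read the real mnemonic
    }
    if (mnemonic.empty()) {
        return InsnKind::GeneralPurpose;  // nothing to classify
    }
    // "ymmword" also contains "ymm", so memory operands are caught too.
    const bool simd_operand = line.find("xmm") != std::string::npos ||
                              line.find("ymm") != std::string::npos;
    if (!simd_operand) {
        return InsnKind::GeneralPurpose;  // mov, lea, test, jne, ...
    }
    return mnemonic[0] == 'v' ? InsnKind::VexEncoded : InsnKind::LegacySse;
}
```

    Fed the listings posted in this thread, every SIMD line is classified as
    VexEncoded and the mov/lea/test/jmp lines as GeneralPurpose, so there is
    nothing in them that would count as legacy SSE.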
  6. Melzzzzz

    Melzzzzz Guest

    On Fri, 19 Apr 2013 20:52:57 +0100
    "Thomas" <> wrote:

    >
    > "Melzzzzz" <> wrote in message
    > news:kks2f9$nff$...
    > > On Fri, 19 Apr 2013 18:55:16 +0100
    > > "Thomas" <> wrote:

    >
    >
    > >
    > > This is not mixing sse with avx. All instructions are prefixed with
    > > v.
    > >
    > >>
    > >> Any idea how to avoid this - do I need to hand code the lower level
    > >> intrinsics?

    > >
    > > I think that problem lies somewhere else.
    > >

    >
    > Thanks, that's really useful - I also had a look at an intel doc
    > about mixing sse and avx which made the same point.
    >
    > In my case I have a function for intersecting a ray with a triangle.
    > The function was written so that all the floats corresponding to the
    > triangle could be converted to embree::ssef's so that 4 triangles at
    > a time could be intersected. That gave almost a 4x speed-up. The next
    > step was to convert the floats to embree::avxf's - but instead of
    > something approaching an 8x speed up I got a slow-down. I still
    > suspect that this is caused by mixing sse and avx since the profiler
    > still points to the intersect routine as being 90+% of the run-time.


    Could you post sse version?

    >
    > So, what about the following two c++ to assembly conversions which
    > were created by the Visual Studio c++ compiler with speed
    > optimization enabled. They each seem to contain a mix of avx and
    > non-avx instructions (??) - they also seem much more verbose than I
    > would have expected, but maybe I'm missing something?


    No. They do not mix AVX with *SSE* instructions.

    >
    > I can see that mixing avx and non-avx might be unavoidable in the
    > second example where an avx variable is being reduced to a scalar.
    > But the first example looks like a classic case for avx, so why the
    > non avx instructions?


    There is no problem with that. Non-AVX instructions such as mov are
    routinely mixed with AVX; it is only mixing in *SSE* that causes a slowdown.

    >
    > Again, Many Thanks
    > Thomas
    >
    >
    > embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);
    >
    > 0000000001315939 mov rax,qword ptr [this]
    >
    > 0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
    >
    > 0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
    >
    > 0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
    >
    > 0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]


    What is this ;)
    Are you sure this is with optimisations on?

    >
    > 0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
    >
    > 0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]


    again

    >
    > 0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
    >
    > 0000000001315979 mov rax,qword ptr [this]


    and again

    >
    > 0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
    >
    > 0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
    >
    > 0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
    >
    > 0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
    >
    > 00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
    >
    > 00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
    >
    > 00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
    >
    > 00000000013159B9 mov rax,qword ptr [this]
    >
    > 00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
    >
    > 00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
    >
    > 00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
    >
    > 00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
    >
    > 00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
    >
    > 00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
    >
    > 00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
    >
    > 00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
    >
    > 0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
    >
    > 0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
    >
    > 0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
    >
    > 0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
    >
    > 0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
    >
    > 0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
    >
    > 0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
    >
    > 0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
    >
    > 0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
    >
    > 0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
    >
    > 0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
    >
    > 0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
    >
    > 0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
    >
    > 0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
    >
    > 0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
    >
    > 0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
    >
    > 0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
    >
    > 0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
    >
    > 0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
    >
    > 0000000001315A99 vmovaps ymmword ptr [t],ymm0
    >


    This code is really, really unoptimized...

    >
    >
    >
    >
    >
    >
    > if(embree::reduce_or(valid)==false)
    >
    > 00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
    >
    > 00000000013161CF vtestps ymm0,ymmword ptr [valid]
    >
    > 00000000013161D8 mov eax,1
    >
    > 00000000013161DD mov ecx,0
    >
    > 00000000013161E2 cmove ecx,eax
    >
    > 00000000013161E5 test ecx,ecx
    >
    > 00000000013161E7 jne 00000000013161F5


    What is this ;)

    >
    > 00000000013161E9 mov dword ptr [rbp+1D80h],1
    >
    > 00000000013161F3 jmp 00000000013161FF
    >
    > 00000000013161F5 mov dword ptr [rbp+1D80h],0
    >
    > 00000000013161FF movzx eax,byte ptr [rbp+1D80h]
    >
    > 0000000001316206 test eax,eax
    >
    > 0000000001316208 jne 0000000001316214
    >
    > return -1;
    >
    > 000000000131620A mov eax,0FFFFFFFFh
    >
    > 000000000131620F jmp 000000000131650A


    I guess that this routine does something nonsensical....

    Your compiler does not produce optimized code at all...
    Melzzzzz, Apr 20, 2013
    #6
  7. Melzzzzz

    Melzzzzz Guest

    On Fri, 19 Apr 2013 20:52:57 +0100
    "Thomas" <> wrote:

    >
    >
    > embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);
    >

    This is how it should be written:
    mov rax,qword ptr [this]
    vmovaps ymm0,ymmword ptr [Qz]
    vmulps ymm1,ymm0,ymmword ptr [rax+100h]
    vmovaps ymm2,ymmword ptr [Qy]
    vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
    vmovaps ymm4,ymmword ptr [Qx]
    vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
    vaddps ymm6,ymm1,ymm3
    vaddps ymm6,ymm6,ymm5
    vmulps ymm6,ymm6,ymmword ptr [PuInv]
    vmovaps ymmword ptr [t],ymm6

    > if(embree::reduce_or(valid)==false)

    and this one:
    vmovaps ymm0,ymmword ptr [valid]
    vtestps ymm0,ymm0
    jne address
    Melzzzzz, Apr 20, 2013
    #7
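
    On the earlier question of hand-coding the lower-level intrinsics, here is
    a minimal sketch of the same two statements written directly with AVX
    intrinsics. It is not Embree's implementation: Lanes8, splat, compute_t and
    any_lane_set are hypothetical names, the mask is assumed to live in the
    sign bit of each float lane, and unaligned loads are used so that a plain
    std::array works. A scalar fallback is compiled when __AVX__ is not defined
    (MSVC defines it under /arch:AVX, GCC and Clang under -mavx).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

using Lanes8 = std::array<float, 8>;  // eight floats, one per AVX lane

// Helper for this sketch: fill all eight lanes with the same value.
inline Lanes8 splat(float v) {
    Lanes8 a;
    a.fill(v);
    return a;
}

// t = PuInv * (Qx*vx + Qy*vy + Qz*vz), evaluated for eight lanes at once.
inline Lanes8 compute_t(const Lanes8& PuInv,
                        const Lanes8& Qx, const Lanes8& Qy, const Lanes8& Qz,
                        const Lanes8& vx, const Lanes8& vy, const Lanes8& vz) {
    Lanes8 t{};
#if defined(__AVX__)
    const __m256 x = _mm256_mul_ps(_mm256_loadu_ps(Qx.data()),
                                   _mm256_loadu_ps(vx.data()));
    const __m256 y = _mm256_mul_ps(_mm256_loadu_ps(Qy.data()),
                                   _mm256_loadu_ps(vy.data()));
    const __m256 z = _mm256_mul_ps(_mm256_loadu_ps(Qz.data()),
                                   _mm256_loadu_ps(vz.data()));
    const __m256 sum = _mm256_add_ps(_mm256_add_ps(x, y), z);
    _mm256_storeu_ps(t.data(),
                     _mm256_mul_ps(_mm256_loadu_ps(PuInv.data()), sum));
#else
    // Scalar fallback so the sketch still builds without AVX enabled.
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i] = PuInv[i] * (Qx[i] * vx[i] + Qy[i] * vy[i] + Qz[i] * vz[i]);
    }
#endif
    return t;
}

// True if any lane has its sign bit set (assumed mask representation).
inline bool any_lane_set(const Lanes8& mask) {
#if defined(__AVX__)
    const __m256 m = _mm256_loadu_ps(mask.data());
    // vtestps: the result is 1 (ZF set) when no sign bit of (m AND m) is set.
    return _mm256_testz_ps(m, m) == 0;
#else
    for (float f : mask) {
        if (std::signbit(f)) return true;
    }
    return false;
#endif
}
```

    With optimisation genuinely in effect, a compiler would be expected to keep
    the intermediate products in registers, much like the hand-written listing
    in this post, rather than storing and reloading each one through the stack.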
  8. Thomas

    Thomas Guest

    "Melzzzzz" <> wrote in message
    news:kku5a7$69q$...
    > On Fri, 19 Apr 2013 20:52:57 +0100
    > "Thomas" <> wrote:
    >
    >>
    >>
    >> embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);
    >>

    > This is how it should be written:
    > mov rax,qword ptr [this]
    > vmovaps ymm0,ymmword ptr [Qz]
    > vmulps ymm1,ymm0,ymmword ptr [rax+100h]
    > vmovaps ymm2,ymmword ptr [Qy]
    > vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
    > vmovaps ymm4,ymmword ptr [Qx]
    > vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
    > vaddps ymm6,ymm1,ymm3
    > vaddps ymm6,ymm6,ymm5
    > vmulps ymm6,ymm6,ymmword ptr [PuInv]
    > vmovaps ymmword ptr [t],ymm6
    >
    >> if(embree::reduce_or(valid)==false)

    > and this one:
    > vmovaps ymm0,ymmword ptr [valid]
    > vtestps ymm0,ymm0
    > jne address
    >


    Thanks a lot for your help.

    I ran some stripped-down tests and established that on my machine an avxf
    multiply consistently costs about 1.4x an ssef multiply. Since an avxf
    multiply processes 8 floats where an ssef multiply processes 4, the cost of
    an avxf flop is about 1.4/2 = 0.7x that of an ssef flop.

    Of course, you only see the benefit of avxf multiplies if you are
    multiplying more than 4 floats at a time. If you are multiplying 4 or
    fewer, you'll see a slowdown.

    It turns out that the problem does indeed lie elsewhere.

    Thanks
    Thomas
    Thomas, Apr 23, 2013
    #8
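
    The arithmetic in this last post can be written down as a tiny cost model.
    This is only an illustrative sketch: ops_needed, Cost and relative_cost are
    made-up names, the 1.4 figure is the one measured on the poster's machine,
    and the model counts multiplies only, ignoring loads, stores and everything
    else a real kernel does.

```cpp
#include <cassert>
#include <cmath>

// Number of vector operations needed to cover n floats with `width` lanes.
inline int ops_needed(int n, int width) {
    assert(n >= 0 && width > 0);
    return (n + width - 1) / width;  // round up: a partial vector costs a full op
}

// Relative cost of multiplying n floats, in units of one ssef multiply.
struct Cost {
    double sse;  // using 4-wide operations
    double avx;  // using 8-wide operations
};

// avx_op_cost defaults to the 1.4x ratio reported in the thread.
inline Cost relative_cost(int n, double avx_op_cost = 1.4) {
    return Cost{ops_needed(n, 4) * 1.0, ops_needed(n, 8) * avx_op_cost};
}
```

    Under these assumptions relative_cost(4) gives 1.0 for SSE against 1.4 for
    AVX, which is the slowdown described above, while relative_cost(8) gives
    2.0 against 1.4, which is the 0.7x cost per flop.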