AVX in Visual Studio

Thomas · Apr 19, 2013

Hello

I've just started to convert some c++ code which is optimised for sse4 to
avx.

After a few false starts with Visual Studio, I've finally started to
generate avx code. Unfortunately the result is that my application is
running slower rather than faster - it does at least produce the correct
results.

I suspect that the main reason for the decrease in speed is that the
compiler is mixing up sse and avx code which I believe is a real performance
killer?

For example the following line:

const embree::avxf eps(1.0e-20f);

Generates the following assembly code

0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
0000000000884F63 vmovss dword ptr [rbp],xmm0
0000000000884F68 lea rax,[rbp]
0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
0000000000884F7B vmovaps ymmword ptr [eps],ymm0

As you can see, I'm using the intel embree intrinsic library.

Any idea how to avoid this - do I need to hand code the lower level
intrinsics?

Thanks for any help.

Thomas

Melzzzzz · Apr 19, 2013

Hello

I've just started to convert some c++ code which is optimised for
sse4 to avx.

After a few false starts with Visual Studio, I've finally started to
generate avx code. Unfortunately the result is that my application is
running slower rather than faster - it does at least produce the
correct results.

I suspect that the main reason for the decrease in speed is that the
compiler is mixing up sse and avx code which I believe is a real
performance killer?

For example the following line:

const embree::avxf eps(1.0e-20f);

Generates the following assembly code

0000000000884F5B vmovss xmm0,dword ptr [__real@1e3ce508 (98E970h)]
0000000000884F63 vmovss dword ptr [rbp],xmm0
0000000000884F68 lea rax,[rbp]
0000000000884F6C vbroadcastss ymm0,dword ptr [rax]
0000000000884F71 vmovaps ymmword ptr [rbp+20h],ymm0
0000000000884F76 vmovaps ymm0,ymmword ptr [rbp+20h]
0000000000884F7B vmovaps ymmword ptr [eps],ymm0

As you can see, I'm using the intel embree intrinsic library.

This is not mixing sse with avx. All instructions are prefixed with v.

Any idea how to avoid this - do I need to hand code the lower level
intrinsics?

I think that problem lies somewhere else.

Thomas · Apr 19, 2013

Melzzzzz said:
On Fri, 19 Apr 2013 18:55:16 +0100

This is not mixing sse with avx. All instructions are prefixed with v.

I think that problem lies somewhere else.

Thanks, that's really useful - I also had a look at an intel doc about
mixing sse and avx which made the same point.

In my case I have a function for intersecting a ray with a triangle. The
function was written so that all the floats corresponding to the triangle
could be converted to embree::ssef's so that 4 triangles at a time could be
intersected. That gave almost a 4x speed-up. The next step was to convert
the floats to embree::avxf's - but instead of something approaching an 8x
speed up I got a slow-down. I still suspect that this is caused by mixing
sse and avx since the profiler still points to the intersect routine as
being 90+% of the run-time.

So, what about the following two C++ to assembly conversions, which were
created by the Visual Studio C++ compiler with speed optimization enabled?
They each seem to contain a mix of avx and non-avx instructions (??) - they
also seem much more verbose than I would have expected, but maybe I'm
missing something?

I can see that mixing avx and non-avx might be unavoidable in the second
example, where an avx variable is being reduced to a scalar. But the first
example looks like a classic case for avx, so why the non-avx instructions?

Again, Many Thanks
Thomas

embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

0000000001315939 mov rax,qword ptr [this]
0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]
0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]
0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
0000000001315979 mov rax,qword ptr [this]
0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
00000000013159B9 mov rax,qword ptr [this]
00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
0000000001315A99 vmovaps ymmword ptr [t],ymm0

if(embree::reduce_or(valid)==false)

00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
00000000013161CF vtestps ymm0,ymmword ptr [valid]
00000000013161D8 mov eax,1
00000000013161DD mov ecx,0
00000000013161E2 cmove ecx,eax
00000000013161E5 test ecx,ecx
00000000013161E7 jne 00000000013161F5
00000000013161E9 mov dword ptr [rbp+1D80h],1
00000000013161F3 jmp 00000000013161FF
00000000013161F5 mov dword ptr [rbp+1D80h],0
00000000013161FF movzx eax,byte ptr [rbp+1D80h]
0000000001316206 test eax,eax
0000000001316208 jne 0000000001316214

return -1;

000000000131620A mov eax,0FFFFFFFFh
000000000131620F jmp 000000000131650A

Thomas · Apr 20, 2013

Andy Champ said:
I'm not familiar with AVX. But the instructions I do recognise imply to me
that vx_, vy_ and vz_ are member variables of the current object, so it is
loading the address of this into rax for address calculation.

Are you sure you have all the optimisation turned on? It seems odd that
it is doing it three times. And the rest of the instructions look as
though they are copying data in and out of memory a lot.

Andy

Thanks

Yes, vx_, vy_ and vz_ are member variables, and yes I have optimize speed
turned on for this module in VS-10.

The question is; does the "mov rax,qword ptr [this]" amount to a non
avx-instruction which will significantly hit the performance of the
subsequent avx instructions (the v-prefixed instructions)?

Thanks
Thomas

Melzzzzz · Apr 20, 2013

The question is; does the "mov rax,qword ptr [this]" amount to a non
avx-instruction which will significantly hit the performance of the
subsequent avx instructions (the v-prefixed instructions)?

No. Only mixing sse instructions with ones prefixed with 'v'
slows down performance. Since you are using a compiler to
generate the code, I wouldn't suspect mixing of sse with avx,
but rather unoptimized code.
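
The rule of thumb above can be turned into a quick check over a disassembly listing. The following is only a rough sketch: the helper names are hypothetical, the list of legacy SSE mnemonics is a small illustrative subset, and "starts with v" is a heuristic rather than a formal definition of a VEX-encoded instruction. It assumes each line has the form "address mnemonic operands", as in the listings posted in this thread.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

enum class InsnKind { Vex, LegacySse, Other };

// Returns the second whitespace-separated token of a listing line such as
// "0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]".
std::string mnemonicOf(const std::string& line) {
    std::istringstream in(line);
    std::string address, mnemonic;
    in >> address >> mnemonic;
    return mnemonic;
}

InsnKind classify(const std::string& mnemonic) {
    // Assumption: a deliberately incomplete list of non-VEX SSE mnemonics.
    static const std::set<std::string> legacySse = {
        "movaps", "movups", "movss", "addps", "subps", "mulps", "divps",
        "addss",  "mulss",  "shufps", "andps", "orps",  "xorps", "cmpps"};
    if (legacySse.count(mnemonic) != 0) return InsnKind::LegacySse;
    // Heuristic only: VEX-encoded AVX mnemonics start with 'v'.
    if (!mnemonic.empty() && mnemonic[0] == 'v') return InsnKind::Vex;
    return InsnKind::Other;  // mov, lea, test, cmove, jne, ...
}

// True if the listing contains both legacy SSE and v-prefixed instructions.
bool mixesLegacySseWithVex(const std::vector<std::string>& listing) {
    bool sawVex = false;
    bool sawSse = false;
    for (const std::string& line : listing) {
        switch (classify(mnemonicOf(line))) {
            case InsnKind::Vex:       sawVex = true; break;
            case InsnKind::LegacySse: sawSse = true; break;
            case InsnKind::Other:     break;
        }
    }
    return sawVex && sawSse;
}
```

Run over the listings in this thread, such a check would report no mixing: every SIMD instruction is v-prefixed, and mov, lea, test and the jumps fall into the "Other" bucket.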

Melzzzzz · Apr 20, 2013

Thanks, that's really useful - I also had a look at an intel doc
about mixing sse and avx which made the same point.

In my case I have a function for intersecting a ray with a triangle.
The function was written so that all the floats corresponding to the
triangle could be converted to embree::ssef's so that 4 triangles at
a time could be intersected. That gave almost a 4x speed-up. The next
step was to convert the floats to embree::avxf's - but instead of
something approaching an 8x speed up I got a slow-down. I still
suspect that this is caused by mixing sse and avx since the profiler
still points to the intersect routine as being 90+% of the run-time.

Could you post sse version?

So, what about the following two c++ to assembly conversions which
were created by the Visual Studio c++ compiler with speed
optimization enabled. They each seem to contain a mix of avx and
non-avx instructions (??) - they also seem much more verbose than I
would have expected, but maybe I'm missing something?

No. They do not contain mixing of avx with *sse* instructions.

I can see that mixing avx and non-avx might be unavoidable in the
second example where an avx variable is being reduced to a scalar.
But the first example looks like a classic case for avx, so why the
non avx instructions?

There is no problem with that. General-purpose (non-avx) instructions
are normally mixed with avx; only *sse* instructions cause a slowdown.

Again, Many Thanks
Thomas

embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

0000000001315939 mov rax,qword ptr [this]
0000000001315941 vmovaps ymm0,ymmword ptr [Qz]
0000000001315949 vmulps ymm0,ymm0,ymmword ptr [rax+100h]
0000000001315951 vmovaps ymmword ptr [rbp+1000h],ymm0
0000000001315959 vmovaps ymm0,ymmword ptr [rbp+1000h]

What is this

Are you sure this is with optimisations on?

0000000001315961 vmovaps ymmword ptr [rbp+1020h],ymm0
0000000001315969 vmovaps ymm0,ymmword ptr [rbp+1020h]

again

0000000001315971 vmovaps ymmword ptr [rbp+1040h],ymm0
0000000001315979 mov rax,qword ptr [this]

and again

0000000001315981 vmovaps ymm0,ymmword ptr [Qy]
0000000001315989 vmulps ymm0,ymm0,ymmword ptr [rax+0E0h]
0000000001315991 vmovaps ymmword ptr [rbp+1060h],ymm0
0000000001315999 vmovaps ymm0,ymmword ptr [rbp+1060h]
00000000013159A1 vmovaps ymmword ptr [rbp+1080h],ymm0
00000000013159A9 vmovaps ymm0,ymmword ptr [rbp+1080h]
00000000013159B1 vmovaps ymmword ptr [rbp+10A0h],ymm0
00000000013159B9 mov rax,qword ptr [this]
00000000013159C1 vmovaps ymm0,ymmword ptr [Qx]
00000000013159C9 vmulps ymm0,ymm0,ymmword ptr [rax+0C0h]
00000000013159D1 vmovaps ymmword ptr [rbp+10C0h],ymm0
00000000013159D9 vmovaps ymm0,ymmword ptr [rbp+10C0h]
00000000013159E1 vmovaps ymmword ptr [rbp+10E0h],ymm0
00000000013159E9 vmovaps ymm0,ymmword ptr [rbp+10E0h]
00000000013159F1 vmovaps ymmword ptr [rbp+1100h],ymm0
00000000013159F9 vmovaps ymm0,ymmword ptr [rbp+1100h]
0000000001315A01 vaddps ymm0,ymm0,ymmword ptr [rbp+10A0h]
0000000001315A09 vmovaps ymmword ptr [rbp+1120h],ymm0
0000000001315A11 vmovaps ymm0,ymmword ptr [rbp+1120h]
0000000001315A19 vmovaps ymmword ptr [rbp+1140h],ymm0
0000000001315A21 vmovaps ymm0,ymmword ptr [rbp+1140h]
0000000001315A29 vmovaps ymmword ptr [rbp+1160h],ymm0
0000000001315A31 vmovaps ymm0,ymmword ptr [rbp+1160h]
0000000001315A39 vaddps ymm0,ymm0,ymmword ptr [rbp+1040h]
0000000001315A41 vmovaps ymmword ptr [rbp+1180h],ymm0
0000000001315A49 vmovaps ymm0,ymmword ptr [rbp+1180h]
0000000001315A51 vmovaps ymmword ptr [rbp+11A0h],ymm0
0000000001315A59 vmovaps ymm0,ymmword ptr [rbp+11A0h]
0000000001315A61 vmovaps ymmword ptr [rbp+11C0h],ymm0
0000000001315A69 vmovaps ymm0,ymmword ptr [PuInv]
0000000001315A71 vmulps ymm0,ymm0,ymmword ptr [rbp+11C0h]
0000000001315A79 vmovaps ymmword ptr [rbp+11E0h],ymm0
0000000001315A81 vmovaps ymm0,ymmword ptr [rbp+11E0h]
0000000001315A89 vmovaps ymmword ptr [rbp+1200h],ymm0
0000000001315A91 vmovaps ymm0,ymmword ptr [rbp+1200h]
0000000001315A99 vmovaps ymmword ptr [t],ymm0

This code is really, really unoptimized...

if(embree::reduce_or(valid)==false)

00000000013161C7 vmovaps ymm0,ymmword ptr [valid]
00000000013161CF vtestps ymm0,ymmword ptr [valid]
00000000013161D8 mov eax,1
00000000013161DD mov ecx,0
00000000013161E2 cmove ecx,eax
00000000013161E5 test ecx,ecx
00000000013161E7 jne 00000000013161F5

What is this

00000000013161E9 mov dword ptr [rbp+1D80h],1
00000000013161F3 jmp 00000000013161FF
00000000013161F5 mov dword ptr [rbp+1D80h],0
00000000013161FF movzx eax,byte ptr [rbp+1D80h]
0000000001316206 test eax,eax
0000000001316208 jne 0000000001316214

return -1;

000000000131620A mov eax,0FFFFFFFFh
000000000131620F jmp 000000000131650A

I guess that this routine does something nonsensical....

Your compiler does not produce optimized code at all...

Melzzzzz · Apr 20, 2013

embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);

This is how it should be written:
mov rax,qword ptr [this]
vmovaps ymm0,ymmword ptr [Qz]
vmulps ymm1,ymm0,ymmword ptr [rax+100h]
vmovaps ymm2,ymmword ptr [Qy]
vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
vmovaps ymm4,ymmword ptr [Qx]
vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
vaddps ymm6,ymm1,ymm3
vaddps ymm6,ymm6,ymm5
vmulps ymm6,ymm6,ymmword ptr [PuInv]
vmovaps ymmword ptr [t],ymm6

if(embree::reduce_or(valid)==false)

and this one:
vmovaps ymm0,ymmword ptr [valid]
vtestps ymm0,ymm0
jne address
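
For comparison, here is roughly what the two snippets look like when written directly with the compiler intrinsics from <immintrin.h> instead of the embree wrappers. This is a sketch, not what embree does internally. The function names, the SKETCH_* macros and the scalar fallback are made up for illustration, and it assumes that reduce_or means "any lane of the mask is set", which is how the vtestps sequence above reads. Unaligned loads are used so that the example does not depend on 32-byte aligned storage.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Only x86 has these intrinsics; elsewhere just the scalar fallback is built.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

// GCC/Clang need a per-function target attribute unless -mavx is passed;
// MSVC accepts AVX intrinsics without one.
#if SKETCH_X86 && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_AVX __attribute__((target("avx")))
#else
#define SKETCH_AVX
#endif

const std::size_t kLanes = 8;  // floats per 256-bit register

#if SKETCH_X86
// t = puInv * (qx*vx + qy*vy + qz*vz), eight lanes at once.
SKETCH_AVX void scaledDot8Avx(const float* puInv, const float* qx, const float* qy,
                              const float* qz, const float* vx, const float* vy,
                              const float* vz, float* t) {
    const __m256 x = _mm256_mul_ps(_mm256_loadu_ps(qx), _mm256_loadu_ps(vx));
    const __m256 y = _mm256_mul_ps(_mm256_loadu_ps(qy), _mm256_loadu_ps(vy));
    const __m256 z = _mm256_mul_ps(_mm256_loadu_ps(qz), _mm256_loadu_ps(vz));
    const __m256 sum = _mm256_add_ps(_mm256_add_ps(x, y), z);
    _mm256_storeu_ps(t, _mm256_mul_ps(_mm256_loadu_ps(puInv), sum));
}

// vtestps: true if the sign bit of any lane is set.
SKETCH_AVX bool anyLaneSetAvx(const float* mask) {
    const __m256 m = _mm256_loadu_ps(mask);
    return _mm256_testz_ps(m, m) == 0;
}
#endif

bool cpuHasAvx() {
#if SKETCH_X86 && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // assumption: no runtime check is sketched for other compilers
#endif
}

void scaledDot8(const float* puInv, const float* qx, const float* qy, const float* qz,
                const float* vx, const float* vy, const float* vz, float* t) {
    assert(puInv && qx && qy && qz && vx && vy && vz && t);
#if SKETCH_X86
    if (cpuHasAvx()) { scaledDot8Avx(puInv, qx, qy, qz, vx, vy, vz, t); return; }
#endif
    for (std::size_t i = 0; i < kLanes; ++i)  // scalar fallback, same evaluation order
        t[i] = puInv[i] * (qx[i] * vx[i] + qy[i] * vy[i] + qz[i] * vz[i]);
}

bool anyLaneSet(const float* mask) {
    assert(mask);
#if SKETCH_X86
    if (cpuHasAvx()) return anyLaneSetAvx(mask);
#endif
    for (std::size_t i = 0; i < kLanes; ++i)
        if (std::signbit(mask[i])) return true;
    return false;
}
```

With optimization actually enabled, a compiler would be expected to keep x, y, z and sum in registers, much like the hand-written sequence above, instead of spilling each intermediate to the stack.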

Thomas · Apr 23, 2013

Melzzzzz said:
embree::avxf t = PuInv*(Qx*vx_+Qy*vy_+Qz*vz_);


This is how it should be written:
mov rax,qword ptr [this]
vmovaps ymm0,ymmword ptr [Qz]
vmulps ymm1,ymm0,ymmword ptr [rax+100h]
vmovaps ymm2,ymmword ptr [Qy]
vmulps ymm3,ymm2,ymmword ptr [rax+0E0h]
vmovaps ymm4,ymmword ptr [Qx]
vmulps ymm5,ymm4,ymmword ptr [rax+0C0h]
vaddps ymm6,ymm1,ymm3
vaddps ymm6,ymm6,ymm5
vmulps ymm6,ymm6,ymmword ptr [PuInv]
vmovaps ymmword ptr [t],ymm6

if(embree::reduce_or(valid)==false)


and this one:
vmovaps ymm0,ymmword ptr [valid]
vtestps ymm0,ymm0
jne address

Thanks a lot for your help.

I ran some stripped down tests and established that on my machine an avxf
multiply consistently costs about 1.4x an ssef multiply - meaning the cost
of an avxf flop is about 0.7x that of an ssef flop.

Of course, you only see the benefit of avxf multiplies if you are
multiplying more than 4 floats - if you are multiplying 4 or fewer then
you'll see a slowdown.
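
The arithmetic behind those two statements can be written out as a tiny cost model. This is only a sketch built on the 1.4x figure measured above; it counts multiplies and ignores loads, stores and everything else, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

const double kSseMulCost = 1.0;  // one 4-wide ssef multiply (unit cost)
const double kAvxMulCost = 1.4;  // one 8-wide avxf multiply, as measured above

// Number of SIMD multiplies needed to cover `floats` values with `lanes` lanes each.
std::size_t opsNeeded(std::size_t floats, std::size_t lanes) {
    assert(lanes > 0);
    return (floats + lanes - 1) / lanes;  // round up
}

double sseCost(std::size_t floats) { return opsNeeded(floats, 4) * kSseMulCost; }
double avxCost(std::size_t floats) { return opsNeeded(floats, 8) * kAvxMulCost; }

// Cost per float of avxf relative to ssef: (1.4 / 8) / (1.0 / 4).
double perFloatRatio() { return (kAvxMulCost / 8.0) / (kSseMulCost / 4.0); }
```

Under this model perFloatRatio() comes out at about 0.7. For 4 floats, avxCost is 1.4 against an sseCost of 1.0, and for 8 floats it is 1.4 against 2.0, which matches the slowdown and speed-up described above.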

It turns out that the problem does indeed lie elsewhere.

Thanks
Thomas
