Array optimizing problem in C++?


courpron

Personally I've never managed to code up a scenario (using GCC with
various optimisations) where __restrict__ appears to have made any
difference whatsoever.

Try the program I gave earlier on this topic.
Compile it with:
g++ -O3

Then uncomment the line:
//#define NO_ALIASING_OPTIMIZATION

and compile it again with g++ -O3.

There should be a difference.
Let me know if you don't find any.

Here is the program:

#include <iostream>
#include <ctime>

//#define NO_ALIASING_OPTIMIZATION

const int len = 50000;

__attribute__((noinline))
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int* src)
#else
void smooth (int* __restrict dest, int* __restrict src)
#endif
{
    for (int i = 0; i < 17; ++i)
        dest[i] = src[i] + src[i + 1] + src[i + 2];
}

void fill (int* src)
{
    for (int i = 0; i < len; ++i)
        src[i] = i;
}


int main()
{
    int src_array[len] = {0};
    int dest_array[len] = {0};

    fill(src_array);

    smooth(dest_array, dest_array); // dummy warm-up call

    clock_t start = clock();

    for (int i = 0; i < 100000000; i++)
        smooth(dest_array, src_array);

    clock_t endt = clock();

    std::cout << "Time smooth(): "
              << double(endt - start) / CLOCKS_PER_SEC * 1000 << " ms\n";

    // doesn't work without the following cout on vc++
    std::cout << dest_array[0];

    return 0;
}

Alexandre Courpron.
 

Lionel B

Try the program I gave earlier on this topic. Compile it with:
g++ -O3

Then uncomment the line:
//#define NO_ALIASING_OPTIMIZATION

and compile it again with g++ -O3.

There should be a difference.
Let me know if you don't find any.

With g++ 4.1.2 (same results with 4.2.0)

With aliasing optimisation:

Time smooth(): 2120 ms

Without aliasing optimisation:

Time smooth(): 2120 ms

With ICC (Intel compiler) using -restrict -O3

With aliasing optimisation:

Time smooth(): 3410 ms

Without aliasing optimisation:

Time smooth(): 2990 ms

So there appears to be a small improvement with aliasing optimisation for
ICC (although there is a fair amount of variance in the results, so not
very significant) and no discernible difference for GCC. In fact I've
checked that there is absolutely no difference in the assembler generated
by g++. This is on Linux x86_64 - I tried compiling 32-bit binaries
(using the -m32 flag) and again, no difference.
 

Lionel B

With ICC (Intel compiler) using -restrict -O3

With aliasing optimisation:

Time smooth(): 3410 ms

Without aliasing optimisation:

Time smooth(): 2990 ms

Sorry, those results were swapped round
 

courpron

With g++ 4.1.2 (same results with 4.2.0)

With aliasing optimisation:

Time smooth(): 2120 ms

Without aliasing optimisation:

Time smooth(): 2120 ms

I used exactly the same compiler (g++ 4.1.2) and there was a
difference in the generated assembly listing (my architecture is
32-bit, Intel Pentium 4).

Could you post the generated assembly when NO_ALIASING_OPTIMIZATION is
defined?


Alexandre Courpron.
 

Lionel B

I used exactly the same compiler (g++ 4.1.2) and there was a difference
in the generated assembly listing (my architecture is 32-bit, Intel
Pentium 4).

Could you post the generated assembly when NO_ALIASING_OPTIMIZATION is
defined?

Sure (here's the 32-bit version, so you can compare it more easily):

$ g++ -m32 -O3 -S scratch.cpp
$ cat scratch.s

.file "scratch.cpp"
.section .ctors,"aw",@progbits
.align 4
.long _GLOBAL__I__Z6smoothPiS_
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB1435:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %edx
pushl %esi
.LCFI2:
movl 8(%ebp), %esi
pushl %ebx
.LCFI3:
movl 8(%edx), %eax
leal 8(%edx), %ebx
addl 4(%edx), %eax
addl (%edx), %eax
leal 4(%edx), %ecx
movl %eax, (%esi)
movl 4(%ebx), %eax
addl 4(%ecx), %eax
addl 4(%edx), %eax
movl %eax, 4(%esi)
movl 8(%edx), %eax
addl 8(%ecx), %eax
addl 8(%ebx), %eax
movl %eax, 8(%esi)
movl 12(%edx), %eax
addl 12(%ecx), %eax
addl 12(%ebx), %eax
movl %eax, 12(%esi)
movl 16(%edx), %eax
addl 16(%ecx), %eax
addl 16(%ebx), %eax
movl %eax, 16(%esi)
movl 20(%edx), %eax
addl 20(%ecx), %eax
addl 20(%ebx), %eax
movl %eax, 20(%esi)
movl 24(%edx), %eax
addl 24(%ecx), %eax
addl 24(%ebx), %eax
movl %eax, 24(%esi)
movl 28(%edx), %eax
addl 28(%ecx), %eax
addl 28(%ebx), %eax
movl %eax, 28(%esi)
movl 32(%edx), %eax
addl 32(%ecx), %eax
addl 32(%ebx), %eax
movl %eax, 32(%esi)
movl 36(%edx), %eax
addl 36(%ecx), %eax
addl 36(%ebx), %eax
movl %eax, 36(%esi)
movl 40(%edx), %eax
addl 40(%ecx), %eax
addl 40(%ebx), %eax
movl %eax, 40(%esi)
movl 44(%edx), %eax
addl 44(%ecx), %eax
addl 44(%ebx), %eax
movl %eax, 44(%esi)
movl 48(%edx), %eax
addl 48(%ecx), %eax
addl 48(%ebx), %eax
movl %eax, 48(%esi)
movl 52(%edx), %eax
addl 52(%ecx), %eax
addl 52(%ebx), %eax
movl %eax, 52(%esi)
movl 56(%edx), %eax
addl 56(%ecx), %eax
addl 56(%ebx), %eax
movl %eax, 56(%esi)
movl 60(%edx), %eax
addl 60(%ecx), %eax
addl 60(%ebx), %eax
movl %eax, 60(%esi)
movl 64(%edx), %eax
addl 64(%ecx), %eax
addl 64(%ebx), %eax
movl %eax, 64(%esi)
popl %ebx
popl %esi
popl %ebp
ret
.LFE1435:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
.globl _Z4fillPi
.type _Z4fillPi, @function
_Z4fillPi:
.LFB1436:
pushl %ebp
.LCFI4:
xorl %eax, %eax
movl %esp, %ebp
.LCFI5:
movl 8(%ebp), %edx
.p2align 4,,7
.L4:
movl %eax, (%edx,%eax,4)
addl $1, %eax
cmpl $50000, %eax
jne .L4
popl %ebp
ret
.LFE1436:
.size _Z4fillPi, .-_Z4fillPi
.align 2
.p2align 4,,15
.type _Z41__static_initialization_and_destruction_0ii, @function
_Z41__static_initialization_and_destruction_0ii:
.LFB1591:
pushl %ebp
.LCFI6:
movl %esp, %ebp
.LCFI7:
subl $24, %esp
.LCFI8:
subl $1, %eax
je .L15
.L14:
leave
ret
.p2align 4,,7
.L15:
cmpl $65535, %edx
jne .L14
movl $_ZSt8__ioinit, (%esp)
call _ZNSt8ios_base4InitC1Ev
movl $__dso_handle, 8(%esp)
movl $0, 4(%esp)
movl $__tcf_0, (%esp)
call __cxa_atexit
leave
ret
.LFE1591:
.size _Z41__static_initialization_and_destruction_0ii, .-_Z41__static_initialization_and_destruction_0ii
.align 2
.p2align 4,,15
.type _GLOBAL__I__Z6smoothPiS_, @function
_GLOBAL__I__Z6smoothPiS_:
.LFB1593:
pushl %ebp
.LCFI9:
movl $65535, %edx
movl %esp, %ebp
.LCFI10:
movl $1, %eax
popl %ebp
jmp _Z41__static_initialization_and_destruction_0ii
.LFE1593:
.size _GLOBAL__I__Z6smoothPiS_, .-_GLOBAL__I__Z6smoothPiS_
.align 2
.p2align 4,,15
.type __tcf_0, @function
__tcf_0:
.LFB1592:
pushl %ebp
.LCFI11:
movl %esp, %ebp
.LCFI12:
movl $_ZSt8__ioinit, 8(%ebp)
popl %ebp
jmp _ZNSt8ios_base4InitD1Ev
.LFE1592:
.size __tcf_0, .-__tcf_0
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "Time smooth(): "
.LC3:
.string " ms\n"
.section .rodata.cst4,"aM",@progbits,4
.align 4
.LC1:
.long 1232348160
.align 4
.LC2:
.long 1148846080
.text
.align 2
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB1437:
leal 4(%esp), %ecx
.LCFI13:
andl $-16, %esp
pushl -4(%ecx)
.LCFI14:
pushl %ebp
.LCFI15:
movl %esp, %ebp
.LCFI16:
pushl %edi
.LCFI17:
pushl %esi
.LCFI18:
pushl %ebx
.LCFI19:
pushl %ecx
.LCFI20:
subl $400024, %esp
.LCFI21:
leal -200016(%ebp), %esi
movl $200000, 8(%esp)
leal -400016(%ebp), %edi
movl $0, 4(%esp)
movl %esi, (%esp)
call memset
movl $200000, 8(%esp)
movl $0, 4(%esp)
movl %edi, (%esp)
call memset
xorl %eax, %eax
.p2align 4,,7
.L21:
movl %eax, (%esi,%eax,4)
addl $1, %eax
cmpl $50000, %eax
jne .L21
movl %edi, 4(%esp)
xorl %ebx, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
call clock
movl %eax, -400020(%ebp)
.p2align 4,,7
.L23:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L23
call clock
movl $.LC0, 4(%esp)
movl $_ZSt4cout, (%esp)
movl %eax, %ebx
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
subl -400020(%ebp), %ebx
pushl %ebx
fildl (%esp)
addl $4, %esp
fdivs .LC1
movl %eax, (%esp)
fmuls .LC2
fstpl 4(%esp)
call _ZNSolsEd
movl $.LC3, 4(%esp)
movl %eax, (%esp)
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl -400016(%ebp), %eax
movl $_ZSt4cout, (%esp)
movl %eax, 4(%esp)
call _ZNSolsEi
addl $400024, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
.LFE1437:
.size main, .-main
.local _ZSt8__ioinit
.comm _ZSt8__ioinit,1,1
.weakref _Z20__gthrw_pthread_oncePiPFvvE,pthread_once
.weakref _Z27__gthrw_pthread_getspecificj,pthread_getspecific
.weakref _Z27__gthrw_pthread_setspecificjPKv,pthread_setspecific
.weakref _Z22__gthrw_pthread_createPmPK14pthread_attr_tPFPvS3_ES3_,pthread_create
.weakref _Z22__gthrw_pthread_cancelm,pthread_cancel
.weakref _Z26__gthrw_pthread_mutex_lockP15pthread_mutex_t,pthread_mutex_lock
.weakref _Z29__gthrw_pthread_mutex_trylockP15pthread_mutex_t,pthread_mutex_trylock
.weakref _Z28__gthrw_pthread_mutex_unlockP15pthread_mutex_t,pthread_mutex_unlock
.weakref _Z26__gthrw_pthread_mutex_initP15pthread_mutex_tPK19pthread_mutexattr_t,pthread_mutex_init
.weakref _Z26__gthrw_pthread_key_createPjPFvPvE,pthread_key_create
.weakref _Z26__gthrw_pthread_key_deletej,pthread_key_delete
.weakref _Z30__gthrw_pthread_mutexattr_initP19pthread_mutexattr_t,pthread_mutexattr_init
.weakref _Z33__gthrw_pthread_mutexattr_settypeP19pthread_mutexattr_ti,pthread_mutexattr_settype
.weakref _Z33__gthrw_pthread_mutexattr_destroyP19pthread_mutexattr_t,pthread_mutexattr_destroy
.section .eh_frame,"a",@progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
.LECIE1:
.LSFDE5:
.long .LEFDE5-.LASFDE5
.LASFDE5:
.long .LASFDE5-.Lframe1
.long .LFB1591
.long .LFE1591-.LFB1591
.uleb128 0x0
.byte 0x4
.long .LCFI6-.LFB1591
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI7-.LCFI6
.byte 0xd
.uleb128 0x5
.align 4
.LEFDE5:
.LSFDE11:
.long .LEFDE11-.LASFDE11
.LASFDE11:
.long .LASFDE11-.Lframe1
.long .LFB1437
.long .LFE1437-.LFB1437
.uleb128 0x0
.byte 0x4
.long .LCFI13-.LFB1437
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI14-.LCFI13
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI15-.LCFI14
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI16-.LCFI15
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI20-.LCFI16
.byte 0x84
.uleb128 0x6
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.align 4
.LEFDE11:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits
 

courpron

[snip asm]

It doesn't generate the same thing on my machine. Your assembly
listing doesn't show any aliasing optimization: there are still 3
memory accesses for each iteration (2 reads and 1 write). I also see
that there are "weakref"s in your assembly listing, which is not the
case on my machine. This is probably a libstdc++ issue. So, if you
don't mind, could you try the following program (a simpler smooth
function and no #include; we can always resort to the command
"time ./a.out" to measure the performance of each version):

//#define NO_ALIASING_OPTIMIZATION

const int len = 16;

__attribute__((noinline))
#ifndef NO_ALIASING_OPTIMIZATION
void smooth (int* dest, int* src)
#else
void smooth (int* __restrict dest, int* __restrict src)
#endif
{
    for (int i = 0; i < len; ++i)
        dest[i] = *src;
}

int main()
{
    int dest_array[len];
    int* src = new int();

    for (int i = 0; i < 100000000; i++)
        smooth(dest_array, src);

    return 0;
}


Alexandre Courpron.
 

Lionel B

[snip asm]

It doesn't generate the same thing on my machine. Your assembly listing
doesn't show any aliasing optimization: there are still 3 memory
accesses for each iteration (2 reads and 1 write). I also see that there
are "weakref"s in your assembly listing, which is not the case on my
machine. This is probably a libstdc++ issue. So, if you don't mind, could
you try the following program (a simpler smooth function and no #include;
we can always resort to the command "time ./a.out" to measure the
performance of each version):

That seems to have made a difference - to the assembly at least - I
still don't see any significant difference in run time, however.

Compiled with:

g++ -m32 -S -O3 scratch.cpp

**************************** with no-alias optimisation:

.file "scratch.cpp"
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %edx
movl 8(%ebp), %eax
movl (%edx), %edx
movl %edx, (%eax)
movl %edx, 4(%eax)
movl %edx, 8(%eax)
movl %edx, 12(%eax)
movl %edx, 16(%eax)
movl %edx, 20(%eax)
movl %edx, 24(%eax)
movl %edx, 28(%eax)
movl %edx, 32(%eax)
movl %edx, 36(%eax)
movl %edx, 40(%eax)
movl %edx, 44(%eax)
movl %edx, 48(%eax)
movl %edx, 52(%eax)
movl %edx, 56(%eax)
movl %edx, 60(%eax)
popl %ebp
ret
.LFE2:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB3:
leal 4(%esp), %ecx
.LCFI2:
andl $-16, %esp
pushl -4(%ecx)
.LCFI3:
pushl %ebp
.LCFI4:
movl %esp, %ebp
.LCFI5:
pushl %edi
.LCFI6:
pushl %esi
.LCFI7:
pushl %ebx
.LCFI8:
xorl %ebx, %ebx
pushl %ecx
.LCFI9:
subl $72, %esp
.LCFI10:
movl $4, (%esp)
leal -80(%ebp), %edi
call _Znwj
movl %eax, %esi
movl $0, (%eax)
.p2align 4,,7
.L4:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L4
addl $72, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
.LFE3:
.size main, .-main
.section .eh_frame,"a",@progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
.LECIE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI4-.LCFI3
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI5-.LCFI4
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI8-.LCFI5
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.byte 0x4
.long .LCFI9-.LCFI8
.byte 0x84
.uleb128 0x6
.align 4
.LEFDE3:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits

**************************** without no-alias optimisation:

.file "scratch.cpp"
.text
.align 2
.p2align 4,,15
.globl _Z6smoothPiS_
.type _Z6smoothPiS_, @function
_Z6smoothPiS_:
.LFB2:
pushl %ebp
.LCFI0:
movl %esp, %ebp
.LCFI1:
movl 12(%ebp), %eax
movl 8(%ebp), %edx
movl (%eax), %ecx
movl %ecx, (%edx)
movl (%eax), %ecx
movl %ecx, 4(%edx)
movl (%eax), %ecx
movl %ecx, 8(%edx)
movl (%eax), %ecx
movl %ecx, 12(%edx)
movl (%eax), %ecx
movl %ecx, 16(%edx)
movl (%eax), %ecx
movl %ecx, 20(%edx)
movl (%eax), %ecx
movl %ecx, 24(%edx)
movl (%eax), %ecx
movl %ecx, 28(%edx)
movl (%eax), %ecx
movl %ecx, 32(%edx)
movl (%eax), %ecx
movl %ecx, 36(%edx)
movl (%eax), %ecx
movl %ecx, 40(%edx)
movl (%eax), %ecx
movl %ecx, 44(%edx)
movl (%eax), %ecx
movl %ecx, 48(%edx)
movl (%eax), %ecx
movl %ecx, 52(%edx)
movl (%eax), %ecx
movl %ecx, 56(%edx)
movl (%eax), %eax
movl %eax, 60(%edx)
popl %ebp
ret
.LFE2:
.size _Z6smoothPiS_, .-_Z6smoothPiS_
.globl __gxx_personality_v0
.align 2
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB3:
leal 4(%esp), %ecx
.LCFI2:
andl $-16, %esp
pushl -4(%ecx)
.LCFI3:
pushl %ebp
.LCFI4:
movl %esp, %ebp
.LCFI5:
pushl %edi
.LCFI6:
pushl %esi
.LCFI7:
pushl %ebx
.LCFI8:
xorl %ebx, %ebx
pushl %ecx
.LCFI9:
subl $72, %esp
.LCFI10:
movl $4, (%esp)
leal -80(%ebp), %edi
call _Znwj
movl %eax, %esi
movl $0, (%eax)
.p2align 4,,7
.L4:
movl %esi, 4(%esp)
addl $1, %ebx
movl %edi, (%esp)
call _Z6smoothPiS_
cmpl $100000000, %ebx
jne .L4
addl $72, %esp
xorl %eax, %eax
popl %ecx
popl %ebx
popl %esi
popl %edi
popl %ebp
leal -4(%ecx), %esp
ret
.LFE3:
.size main, .-main
.section .eh_frame,"a",@progbits
.Lframe1:
.long .LECIE1-.LSCIE1
.LSCIE1:
.long 0x0
.byte 0x1
.string "zP"
.uleb128 0x1
.sleb128 -4
.byte 0x8
.uleb128 0x5
.byte 0x0
.long __gxx_personality_v0
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x88
.uleb128 0x1
.align 4
.LECIE1:
.LSFDE3:
.long .LEFDE3-.LASFDE3
.LASFDE3:
.long .LASFDE3-.Lframe1
.long .LFB3
.long .LFE3-.LFB3
.uleb128 0x0
.byte 0x4
.long .LCFI2-.LFB3
.byte 0xc
.uleb128 0x1
.uleb128 0x0
.byte 0x9
.uleb128 0x4
.uleb128 0x1
.byte 0x4
.long .LCFI3-.LCFI2
.byte 0xc
.uleb128 0x4
.uleb128 0x4
.byte 0x4
.long .LCFI4-.LCFI3
.byte 0xe
.uleb128 0x8
.byte 0x85
.uleb128 0x2
.byte 0x4
.long .LCFI5-.LCFI4
.byte 0xd
.uleb128 0x5
.byte 0x4
.long .LCFI8-.LCFI5
.byte 0x83
.uleb128 0x5
.byte 0x86
.uleb128 0x4
.byte 0x87
.uleb128 0x3
.byte 0x4
.long .LCFI9-.LCFI8
.byte 0x84
.uleb128 0x6
.align 4
.LEFDE3:
.ident "GCC: (GNU) 4.1.2 20070626 (Red Hat 4.1.2-14)"
.section .note.GNU-stack,"",@progbits
 

courpron


That seems to have made a difference - to the assembly at least - I
still don't see any significant difference in run time, however.

Yes, the performance increase on my machine was around 4%.
In this example, a single int is read to fill dest_array, so it stays
easily in the cache and the performance degradation is minimal.
Operations on large arrays can, however, lead to cache thrashing, and
the performance degradation would then be much more visible.


The assembly listing now shows the no-aliasing optimization.

From this code:

    for (int i = 0; i < len; ++i)
        dest[i] = *src;

the generated asm with __restrict is:

movl (%edx), %edx
movl %edx, (%eax)
movl %edx, 4(%eax)
movl %edx, 8(%eax)
...

With __restrict, the value of *src is loaded into a register (edx)
once, and that register is reused for the rest of the iterations.

The generated asm without __restrict is:

movl (%eax), %ecx
movl %ecx, (%edx)
movl (%eax), %ecx
movl %ecx, 4(%edx)
movl (%eax), %ecx
movl %ecx, 8(%edx)
...

Without __restrict, the value is loaded into the register (ecx) on
every iteration, because the compiler doesn't know whether dest and
src point to the same object. It must take into account the situation
where writing through dest changes the value of *src, and it deals
with this by reloading *src each time.


Alexandre Courpron.
 

Lionel B

[snip asm]

That seems to have made a difference - to the assembly at least - I
still don't see any significant difference in run time, however.

Yes, the performance increase on my machine was around 4%.

I've just checked your original code under g++ 4.3.0 (installed with its
own libstdc++) and now I do indeed get the optimisation benefit:

without no-alias optimisation: 3270 ms
with no-alias optimisation: 1310 ms

I suspect you were right about it being an issue with libstdc++.

Cheers,
 

Bo Persson

Lionel said:
[ ... ]
I don't really see much sense in comparing Java performance with
C++ performance. What might be more interesting is the effect of
the 'restrict' specifier supported by C99 compilers (and many
C89/90 and C++ compilers as an extension), which is intended to
assist the compiler in performing exactly this kind of optimization.

Or, if you're really interested in C++, you could throw in a
comparison using valarray, which was designed to give the same
sort of assurance. Unfortunately, I don't know of anybody who
seems to have gone to much (if any) trouble to optimize valarray
at all -- rather the contrary, it was an idea that even its own
creator admits was in the wrong place at the wrong time, so it's
ignored almost to death, so to speak.

FWIW on my (gcc 4.1.1 supplied) implementation valarrays appear to
be simply pointers to new'ed memory with a (GCC-specific)
__restrict__ qualifier.

Personally I've never managed to code up a scenario (using GCC with
various optimisations) where __restrict__ appears to have made any
difference whatsoever. I seem to recall that this has come up now
and again - inconclusively - on the g++ lists.

You have to work with functions taking multiple pointers to arrays of
the same type. This is not idiomatic for C++ code.
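
As an illustration (a sketch, not code from this thread), the shape
where __restrict can pay off is a function taking several pointers of
the same element type, where the compiler must otherwise assume the
arrays overlap:

    // Without the __restrict qualifiers the compiler has to assume that
    // out, a and b may overlap, so it must order the loads and stores
    // conservatively.
    void add_arrays (int* __restrict out,
                     const int* __restrict a,
                     const int* __restrict b,
                     int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = a[i] + b[i];   // the no-overlap promise lets the
                                    // compiler reorder or vectorize these
    }

Raw multi-pointer interfaces like this are rare in idiomatic C++, which
is one reason the qualifier so seldom shows a measurable benefit.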


Bo Persson
 

Ioannis Vranos

Razii said:
Probably due to the array bounds check in Java; if there is indeed an
issue with C++ arrays, overall there is no difference.


Java? What does Java have to do with this? Also I see you are
crossposting between clc++ and cljp, probably wanting to start a flame war.
 

Ioannis Vranos

"Java" is actually two things: Java the language (syntax) and JVM (Java
Virtual Machine). You can't compare C++ with Java/JVM.

Java/JVM is to be compared with similar things, like C#/.NET, C++/.NET, etc.

CLI is an ISO standard for a Virtual Machine. .NET is a CLI-compliant
VM. Another CLI-compliant VM is Mono.


C++/CLI is an ECMA-standardised language binding (syntax) for CLI.
C#/CLI is the same.


So when comparing "Java" with some other "language", you are essentially
comparing Java/JVM with another language binding to some virtual machine.

If SUN wanted, they could open their JVM to other languages, as MS does
for .NET.


Now, more specifically you can compare Java/JVM with C#/.NET, C++/.NET,
C#/Mono etc.


The JVM is written in C++, so there is no way for Java/JVM to be faster
than C++, by definition.
 

Arne Vajhøj

Ioannis said:
If SUN wanted, they could open their JVM to other languages, as MS does
for .NET.

They have. Among the more common are Ada, Ruby and Python.
Now, more specifically you can compare Java/JVM with C#/.NET, C++/.NET,
C#/Mono etc.

It is rather common to compare Java/JVM with X/native as well.
The JVM is written in C++, so there is no way for Java/JVM to be faster
than C++, by definition.

Not true.

The speed of execution depends on the code generated. It does not
depend on the language the compiler is written in. AOT or JIT does
not matter.

Arne
 

Razii

The JVM is written in C++, so there is no way for Java/JVM to be faster
than C++, by definition.

What kind of logic is that? If someone writes a C++ compiler in, say,
C#, would that mean that C++ by definition would be slower than C#?
 

Ioannis Vranos

Razii said:
What kind of logic is that? If someone writes a C++ compiler in, say,
C#, would that mean that C++ by definition would be slower than C#?


OK, forget the last phrase.


"Java" is actually two things: Java the language (syntax) and JVM (Java
Virtual Machine). You can't compare C++ with Java/JVM.

Java/JVM is to be compared with similar things, like C#/.NET, C++/.NET, etc.

CLI is an ISO standard for a Virtual Machine. .NET is a CLI-compliant
VM. Another CLI-compliant VM is Mono.


C++/CLI is an ECMA-standardised language binding (syntax) for CLI.
C#/CLI is the same.


So when comparing "Java" with some other "language", you are essentially
comparing Java/JVM with another language binding to some virtual machine.

If SUN wanted, they could open their JVM to other languages, as MS does
for .NET.


Now, more specifically you can compare Java/JVM with C#/.NET, C++/.NET,
C#/Mono etc.



Comparing Java/JVM to native C++ code in terms of speed and even space
is unfair to Java, due to the overhead of the garbage collection
mechanism of the JVM.
 

Ioannis Vranos

To make it clearer:


"Java" is actually two things: Java the language (syntax) and JVM (Java
Virtual Machine). You can't compare C++ with Java/JVM.

Java/JVM is to be compared with similar things, like C#/.NET, C++/.NET, etc.

CLI is an ISO standard for a Virtual Machine. .NET is a CLI-compliant
VM. Another CLI-compliant VM is Mono.


C++/CLI is an ECMA-standardised language binding (syntax) for CLI.
C#/CLI is the same.


So when comparing "Java" with some other "language", you are essentially
comparing Java/JVM with another language binding to some virtual machine.

If SUN wanted, they could open their JVM to more languages like C++, as
MS does for .NET.


Now, more specifically you can compare Java/JVM with C#/.NET, C++/.NET,
C#/Mono etc.



Comparing Java/JVM to native C++ code in terms of speed and even space
is unfair to Java, due to the overhead of the garbage collection
mechanism of the JVM.
 

Lew

Ioannis said:
If SUN [sic] wanted, they could open their JVM to more languages like C++, as
MS does for .NET.

Which they have done, as have other JVM vendors. In fact, there's absolutely
nothing at all about any of these JVM implementations to prevent anyone from
targeting them with any language's compilers at any time, nor has there ever
been. What more is needed to "open" these JVMs?
 

Ioannis Vranos

Lew said:
Ioannis said:
If SUN [sic] wanted, they could open their JVM to more languages like
C++, as
MS does for .NET.

Which they have done, as have other JVM vendors. In fact, there's
absolutely nothing at all about any of these JVM implementations to
prevent anyone from targeting them with any language's compilers at any
time, nor has there ever been. What more is needed to "open" these JVMs?


OK, but SUN could provide C++ support for the JVM as MS does for .NET.


The summary is that one can't compare Java/JVM with C++ alone; it is
like comparing apples and potatoes.

Java/JVM should be compared with another language/VM combination. A
valid comparison would be Java/JVM vs C++/.NET.


But essentially the comparison comes down to which virtual machine is
better.
 

Erik Wikström

Lew said:
Ioannis said:
If SUN [sic] wanted, they could open their JVM to more languages like
C++, as
MS does for .NET.

Which they have done, as have other JVM vendors. In fact, there's
absolutely nothing at all about any of these JVM implementations to
prevent anyone from targeting them with any language's compilers at any
time, nor has there ever been. What more is needed to "open" these JVMs?


OK, but SUN could provide C++ support for the JVM as MS does for .NET.

Just because C++/CLI has C++ in the name does not make it C++.
 

RedGrittyBrick

Ioannis said:
Lew said:
Ioannis said:
If SUN [sic] wanted, they could open their JVM to more languages like
C++, as
MS does for .NET.
Which they have done, as have other JVM vendors. In fact, there's
absolutely nothing at all about any of these JVM implementations to
prevent anyone from targeting them with any language's compilers at any
time, nor has there ever been. What more is needed to "open" these JVMs?


OK, but SUN could provide C++ support for the JVM as MS does for .NET.

OK, but MS could provide Java support for .NET as Sun does for the JVM.

This mode of discourse is unproductive.

The summary is that one can't compare Java/JVM with C++ alone; it is
like comparing apples and potatoes.

Java/JVM should be compared with another language/VM combination. A
valid comparison would be Java/JVM vs C++/.NET.


But essentially the comparison comes down to which virtual machine is
better.

A while back, someone (one of the regulars, I think) pointed out that
Java is four things, not just the two you identified.

- The language (as documented in the JLS).
- A huge library of standard classes.
- A compiler (source to byte-code).
- A run-time environment (including the JVM).

I suspect the quality of all of these, not just the JVM, has some
effect on the performance of an application.
 
