Multi-CPU and Ruby threading

Regis d'Aubarede

Hello,

We received a new PC based on an Intel Core i7 running Windows 7.
So I tried to compare how each Ruby interpreter (JRuby, IronRuby,
Ruby 1.9.1) uses the processor resources.
I run the same (stupid) computation split across 1 to 8 threads, and
measure the total duration.

(The test program is attached.)
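
For readers without the attachment, a benchmark of this shape might look like the sketch below. It is not the attached thread_bench.rb: the `work` and `bench` helpers and the arithmetic inside them are assumptions made for illustration, and only the output format is taken from the results shown in this thread.

```ruby
# Hypothetical stand-in for the attached thread_bench.rb (NOT the original script).
TOTAL_ITERATIONS = 1000

# CPU-bound busy work: integer and float arithmetic inside nested blocks.
def work(iterations, inner = 1000)
  sum = 0.0
  iterations.times do |i|
    inner.times { |j| sum += (i + j) * 0.5 }
  end
  sum
end

# Split `total` iterations evenly across nb_threads threads.
# Returns [iterations per thread, elapsed wall-clock milliseconds].
def bench(nb_threads, total = TOTAL_ITERATIONS)
  per_thread = total / nb_threads
  start = Time.now
  threads = (1..nb_threads).map { Thread.new { work(per_thread) } }
  threads.each(&:join)
  [per_thread, ((Time.now - start) * 1000).round]
end

if __FILE__ == $PROGRAM_NAME
  puts "#{RUBY_VERSION}, #{RUBY_PLATFORM}, #{RUBY_RELEASE_DATE}"
  (1..8).each do |n|
    per_thread, ms = bench(n)
    puts "#{per_thread} iterations by #{n} threads , Duration = #{ms} ms"
  end
end
```

Because the total amount of work is fixed, the duration should fall as threads are added only if the interpreter can actually run Ruby threads on several cores at once.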

Here are the results.

c:\usr\ruby\local>jruby thread_bench.rb
1.8.7, java, 2010-05-12
1000 iterations by 1 threads , Duration = 2772 ms
500 iterations by 2 threads , Duration = 2076 ms
333 iterations by 3 threads , Duration = 1884 ms
250 iterations by 4 threads , Duration = 1848 ms
200 iterations by 5 threads , Duration = 1814 ms
166 iterations by 6 threads , Duration = 1755 ms
142 iterations by 7 threads , Duration = 1866 ms
125 iterations by 8 threads , Duration = 1538 ms

c:\usr\ruby\local>ir thread_bench.rb
1.8.6, i386-mswin32, 2009-03-31
1000 iterations by 1 threads , Duration = 2257 ms
500 iterations by 2 threads , Duration = 1305 ms
333 iterations by 3 threads , Duration = 1055 ms
250 iterations by 4 threads , Duration = 880 ms
200 iterations by 5 threads , Duration = 1026 ms
166 iterations by 6 threads , Duration = 940 ms
142 iterations by 7 threads , Duration = 989 ms
125 iterations by 8 threads , Duration = 1098 ms

c:\usr\ruby\local>ruby19 thread_bench.rb
1.9.1, i386-mswin32, 2010-01-10
1000 iterations by 1 threads , Duration = 7318 ms
500 iterations by 2 threads , Duration = 7393 ms
333 iterations by 3 threads , Duration = 7335 ms
250 iterations by 4 threads , Duration = 7367 ms
200 iterations by 5 threads , Duration = 7450 ms
166 iterations by 6 threads , Duration = 7343 ms
142 iterations by 7 threads , Duration = 7349 ms
125 iterations by 8 threads , Duration = 7454 ms

So it seems that IronRuby makes better use of the CPUs than JRuby?

Attachments:
http://www.ruby-forum.com/attachment/4825/thread_bench.rb
 
Regis d'Aubarede

Roger said:
I've seen a bit of slowdown on jruby when using multiple threads, as
well.

Results seem different on Linux. Here is the same test, on the same
machine, on Ubuntu 10.04 in VirtualBox with affinity to 8 processors:

regis@regis-desktop:~/Ruby/local$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1)
OpenJDK Server VM (build 14.0-b16, mixed mode)

regis@regis-desktop:~/Ruby/local$ jruby -v
jruby 1.5.1 (ruby 1.8.7 patchlevel 249) (2010-06-06 f3a3480) (OpenJDK
Client VM 1.6.0_18) [i386-java]

regis@regis-desktop:~/Ruby/local$ jruby thread_bench.rb
1.8.7, java, 2010-06-06
1000 iterations by 1 threads , Duration = 3930 ms
500 iterations by 2 threads , Duration = 3723 ms
333 iterations by 3 threads , Duration = 3490 ms
250 iterations by 4 threads , Duration = 3470 ms
200 iterations by 5 threads , Duration = 3353 ms
166 iterations by 6 threads , Duration = 3378 ms
142 iterations by 7 threads , Duration = 3455 ms
125 iterations by 8 threads , Duration = 4032 ms

regis@regis-desktop:~/Ruby/local$ ir -v
IronRuby 0.9.0.0 on .NET 2.0.0.0

regis@regis-desktop:~/Ruby/local$ ir thread_bench.rb
1.8.6, i386-mswin32, 2008-05-28
1000 iterations by 1 threads , Duration = 11091 ms
500 iterations by 2 threads , Duration = 7676 ms
333 iterations by 3 threads , Duration = 12243 ms
250 iterations by 4 threads , Duration = 7728 ms
200 iterations by 5 threads , Duration = 7767 ms
166 iterations by 6 threads , Duration = 7749 ms
142 iterations by 7 threads , Duration = 8184 ms
125 iterations by 8 threads , Duration = 8069 ms
regis@regis-desktop:~/Ruby/local$
 
Charles Oliver Nutter

Regis said:
We received a new PC based on an Intel Core i7 running Windows 7.
So I tried to compare how each Ruby interpreter (JRuby, IronRuby,
Ruby 1.9.1) uses the processor resources.
I run the same (stupid) computation split across 1 to 8 threads, and
measure the total duration.

(The test program is attached.)

Here are the results.

c:\usr\ruby\local>jruby thread_bench.rb
1.8.7, java, 2010-05-12
1000 iterations by 1 threads , Duration = 2772 ms
500 iterations by 2 threads , Duration = 2076 ms
333 iterations by 3 threads , Duration = 1884 ms
250 iterations by 4 threads , Duration = 1848 ms
200 iterations by 5 threads , Duration = 1814 ms
166 iterations by 6 threads , Duration = 1755 ms
142 iterations by 7 threads , Duration = 1866 ms
125 iterations by 8 threads , Duration = 1538 ms

You are probably not running the server VM, so pass --server. Overall
times should be better, but depending on the algorithm the remaining
bottleneck for JRuby may or may not be CPU-bound.

The initial iteration's time should probably be largely discounted,
and the whole thing should probably be run a couple times to see the
actual perf of a longer-running app.
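
One way to follow that advice is to repeat the measurement in the same process and drop the first pass or two. The helper below is only a sketch: `timed_ms` and `steady_state_timings` are hypothetical names, not part of thread_bench.rb or of JRuby, and the workload in the usage line is a placeholder.

```ruby
# Run the block once and return the elapsed wall-clock time in milliseconds.
def timed_ms
  start = Time.now
  yield
  (Time.now - start) * 1000.0
end

# Run the block (warmup + passes) times in the same process, then discard
# the warm-up timings so JIT compilation and class loading are not counted.
def steady_state_timings(passes = 3, warmup = 1, &block)
  timings = Array.new(warmup + passes) { timed_ms(&block) }
  timings.drop(warmup)
end

# Usage with a placeholder workload (an assumption, not the real benchmark).
timings = steady_state_timings(3, 1) { 200_000.times { |i| i * 1.5 } }
puts "steady-state passes: #{timings.map(&:round).join(', ')} ms"
```

Comparing the later passes with each other also gives a rough idea of how noisy the measurement is.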

I don't have IronRuby here, but here's numbers for me on Java 6,
server, OS X, Core 2 Duo 2.6GHz:

(2nd time through in the same script, only the 1 and 2 processor runs):

1000 iterations by 1 threads , Duration = 2633 ms
500 iterations by 2 threads , Duration = 1628 ms

If with --server on your system JRuby's still slower than IronRuby,
there may be a bug or bottleneck we can repair. I have been meaning to
make blocks faster in JRuby, but they still come with a higher cost
than some other impls.

- Charlie
 
Charles Oliver Nutter

If with --server on your system JRuby's still slower than IronRuby,
there may be a bug or bottleneck we can repair. I have been meaning to
make blocks faster in JRuby, but they still come with a higher cost
than some other impls.

Maybe also worth showing an experimental dynopt flag for JRuby that
seems to improve performance dramatically, but at a small cost to some
Ruby semantics (backtraces get a little funky, for example):

~/projects/jruby ➔ jruby --server -J-Djruby.compile.dynopt=true thread_bench.rb
1.8.7, java, 2010-06-17
1000 iterations by 1 threads , Duration = 400 ms
500 iterations by 2 threads , Duration = 188 ms
333 iterations by 3 threads , Duration = 192 ms
250 iterations by 4 threads , Duration = 149 ms
200 iterations by 5 threads , Duration = 167 ms
166 iterations by 6 threads , Duration = 214 ms
142 iterations by 7 threads , Duration = 177 ms
125 iterations by 8 threads , Duration = 163 ms
1000 iterations by 1 threads , Duration = 265 ms
500 iterations by 2 threads , Duration = 160 ms
333 iterations by 3 threads , Duration = 186 ms
250 iterations by 4 threads , Duration = 148 ms
200 iterations by 5 threads , Duration = 159 ms
166 iterations by 6 threads , Duration = 151 ms
142 iterations by 7 threads , Duration = 150 ms
125 iterations by 8 threads , Duration = 171 ms
...

Hopefully I can land this in JRuby 1.6, but it's on master now.

- Charlie
 
Regis d'Aubarede

Charles said:
Maybe also worth showing an experimental dynopt flag for JRuby that seems to
improve performance ....


Sorry for my bad English!

My test aims to verify that symmetric multiprocessing (SMP) is put to
good use by the VM. In this respect, raw performance is not important.
What I care about is how the computation time decreases as the number
of threads increases.
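
That metric can be made explicit as a speedup: the single-thread duration divided by the duration with n threads. The sketch below applies it to the Windows figures from the first post; the `speedups` and `best_speedup` helpers are names invented here for illustration.

```ruby
# Durations in ms for 1..8 threads, copied from the Windows runs in the first post.
DURATIONS_MS = {
  "JRuby"      => [2772, 2076, 1884, 1848, 1814, 1755, 1866, 1538],
  "IronRuby"   => [2257, 1305, 1055, 880, 1026, 940, 989, 1098],
  "Ruby 1.9.1" => [7318, 7393, 7335, 7367, 7450, 7343, 7349, 7454]
}

# Speedup for each thread count: single-thread time / n-thread time.
def speedups(durations)
  base = durations.first.to_f
  durations.map { |d| base / d }
end

# Thread count with the highest speedup, and that speedup rounded to 2 places.
def best_speedup(durations)
  value, index = speedups(durations).each_with_index.max_by { |s, _| s }
  { threads: index + 1, speedup: value.round(2) }
end

DURATIONS_MS.each do |name, durations|
  best = best_speedup(durations)
  puts "#{name}: best speedup #{best[:speedup]}x at #{best[:threads]} threads"
end
```

A speedup close to the number of physical cores means the VM spreads the threads well; a speedup that stays near 1 means the threads are effectively serialized.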

(http://programmingzen.com/2010/06/28/the-great-ruby-shootout-windows-edition/
shows that JRuby is superior to IronRuby...)

To determine whether the issue is on the JRuby side or the JVM side, I
ran the same JRuby code, but invoked a pure Java computation:
(1..nb_threads).map { Thread.new() { Calc.calc(p1,n1) } }
with

class Calc {
  public static long calc(int a, int b) {
    long res = 0;
    for (int i = 0; i < a; i++)
      for (int j = 0; j < b; j++)
        for (int k = 0; k < 1000; k++)
          res += i + j + k;
    return res;
  }
}

c:\usr\ruby\local>jruby thread_bench2.rb
1.8.7, java, 2010-05-12
1000 iterations by 1 threads , Duration = 15404 ms
500 iterations by 2 threads , Duration = 8147 ms
333 iterations by 3 threads , Duration = 5812 ms
250 iterations by 4 threads , Duration = 4690 ms
200 iterations by 5 threads , Duration = 4648 ms
166 iterations by 6 threads , Duration = 4749 ms
142 iterations by 7 threads , Duration = 4371 ms
125 iterations by 8 threads , Duration = 4222 ms

So the JVM scales correctly :)
And my Intel Core i7 really has 4 cores...

Attachments:
http://www.ruby-forum.com/attachment/4829/thread_bench2.rb
 
Charles Oliver Nutter

Regis said:
To determine whether the issue is on the JRuby side or the JVM side, I
ran the same JRuby code, but invoked a pure Java computation:
(1..nb_threads).map { Thread.new() { Calc.calc(p1,n1) } }
with

class Calc {
  public static long calc(int a, int b) {
    long res = 0;
    for (int i = 0; i < a; i++)
      for (int j = 0; j < b; j++)
        for (int k = 0; k < 1000; k++)
          res += i + j + k;
    return res;
  }
}

Yes, this result is not surprising to me. In the original case, the
benchmark suffers mostly from all the objects being created. For
example:

* All the numeric loops (in JRuby) create at least one new Fixnum
object for every iteration
* All the math operations create Fixnum or Float objects as well

Running an allocation profile of your benchmark (which actually runs
pretty slow because there's *so much* allocation happening) shows the
amount of data that's being chewed up...it's very likely that the
bottleneck is in allocating all those closures and all those Fixnums
for this particular case:

~/projects/jruby ➔ jruby -J-Xrunhprof thread_bench.rb
1.8.7, java, 2010-06-17
1000 iterations by 1 threads , Duration = 399267 ms
^CDumping Java heap ... allocation sites ... done.

~/projects/jruby ➔ egrep "%|objs" java.hprof.txt | head -n 11
rank   self  accum     bytes   objs      bytes     objs  trace name
   1 65.18% 65.18%  13545024 423282 1133938432 35435576 302318 org.jruby.RubyFixnum
   2 22.61% 87.79%   4697920 146810  381348672 11917146 302867 org.jruby.RubyFloat
   3  1.32% 89.12%    274992   5350     274992     5350 300000 char[]
   4  0.62% 89.74%    128488   5341     128488     5341 300000 java.lang.String
   5  0.18% 89.92%     38184      1      38184        1 306423 short[]
   6  0.18% 90.10%     38184      1      38184        1 306428 short[]
   7  0.14% 90.24%     28720    718      29400      735 300521 java.util.WeakHashMap$Entry
   8  0.13% 90.37%     27792     70      27792       70 300000 byte[]
   9  0.13% 90.50%     26832   1118      35040     1460 300704 java.util.concurrent.ConcurrentHashMap$HashEntry
  10  0.12% 90.63%     25232    166      25232      166 300557 org.jruby.MetaClass

Note that this is only after the 1000-iteration run, and during
execution over 1GB of memory was allocated and released, mostly in
Fixnum objects with a smaller amount (380MB+) in Float objects.
Running with verbose GC:


~/projects/jruby ➔ jruby -J-verbose:gc thread_bench.rb
1.8.7, java, 2010-06-17
[GC 13184K->1128K(63936K), 0.0108696 secs]
[GC 14312K->2124K(63936K), 0.0077762 secs]
[GC 15308K->1445K(63936K), 0.0010409 secs]
[GC 14629K->1246K(63936K), 0.0031958 secs]
...

And adding up all the size changes (number of GC runs * difference in
live object size) produces roughly the same estimate; for the period
the 1000-iteration part of the bench runs, it allocates a *lot* of
objects.
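
The "adding up" step can be sketched as a small script that sums the heap reclaimed by each collection in a -verbose:gc log. The regular expression assumes the exact line format shown above, the sample holds only the four lines quoted in this post, and `reclaimed_kb` is a hypothetical helper name.

```ruby
# Matches -verbose:gc lines of the form shown in this post, e.g.
#   [GC 13184K->1128K(63936K), 0.0108696 secs]
GC_LINE = /\[GC (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]/

# Sum (heap before - heap after) in KB over every GC line in the log text.
# Memory reclaimed by young collections approximates what was allocated
# (and quickly discarded) between them.
def reclaimed_kb(log)
  log.scan(GC_LINE).sum { |before, after, _capacity, _secs| before.to_i - after.to_i }
end

# Only the four lines quoted above; a real run produces many more.
SAMPLE_LOG = <<~LOG
  [GC 13184K->1128K(63936K), 0.0108696 secs]
  [GC 14312K->2124K(63936K), 0.0077762 secs]
  [GC 15308K->1445K(63936K), 0.0010409 secs]
  [GC 14629K->1246K(63936K), 0.0031958 secs]
LOG

collections = SAMPLE_LOG.scan(GC_LINE).size
puts "#{reclaimed_kb(SAMPLE_LOG)} KB reclaimed across #{collections} collections"
```

On the four sample lines this comes to 51490 KB, roughly 12 to 13 MB per collection, so a full run needs on the order of a hundred such collections to account for the 1GB+ seen in the allocation profile.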

IronRuby may do better here if they're able to treat Fixnum objects as
value types, which the CLR handles more efficiently than the JVM's
"every object is on the heap". Ultimately this is largely an
allocation-rate benchmark, at least on JRuby, since our Fixnum objects
are "real" objects (or to put it in MRI's favor...our Fixnum objects
are forced to be "real" objects with heap lifecycles).

The dynopt work is part of efforts in JRuby to bring math performance
closer to Java, largely by eliminating the excessive object churn and
layers of noise for math operations.

- Charlie
 