Using multicore CPUs in parallel tasks

Marc Hoeppner

Hi,

I've been reading around a bit but couldn't find a solution that worked,
so here goes:

I am running ruby 1.8 and want to make full use of a quad core CPU
(64bit, Ubuntu) in a task that lends itself to multithreading/multicore
use.

It's basically an array of objects that are each used in a fairly CPU
intensive job, so I figured I could have 4 of them run at the same time,
one on each CPU.

BUT...

The only reasonably understandable suggestion looked something like:

----
threads = 4
my_array = [something_here]

threads.times do
Process.fork(a_method(my_array.shift))
end

my_array.each do |object|
Process.wait(0)
Process.fork(a_method(object))
end
---

But this still only used one CPU (and looks a bit ugly..). Is that some
limitation of ruby (v 1.8 specifically) or am I doing something wrong?

Cheers,

Marc
 
Rajinder Yadav

You might want to check out Pure and Tiamat and talk to James Lawrence
(see links). He seems to have something like what you are asking for. I don't
know much about these two projects; they came onto my radar a few days ago,
but I think it's cool what James is working on!

== Links

* Pure: http://purefunctional.rubyforge.org

* Documentation: http://tiamat.rubyforge.org
* Download: http://rubyforge.org/frs/?group_id=9145
* Rubyforge home: http://rubyforge.org/projects/tiamat
* Repository: http://github.com/quix/tiamat

--
Kind Regards,
Rajinder Yadav

http://DevMentor.org

Do Good! - Share Freely, Enrich and Empower people to Transform their lives
 
Glen Holcomb

You are going to want Ruby 1.9 for this. In 1.8 threads are "green":
they exist only as threads inside the VM, so you still only hit
one core, and any blocking system I/O will block all of your threads.

--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can’t hear a word you’re saying."

-Greg Graffin (Bad Religion)
 
Peter Booth

Marc,

How long lived is each of these tasks? Are we talking seconds or weeks?
Is there a user-facing aspect to this or is throughput the variable
that you're wanting to optimize?

When you say "fairly CPU intensive", does this mean that when one of
these tasks runs you see (from sar/mpstat) that one of your CPUs is
pinned?

Peter
 
Tony Arcieri

Glen said:
You are going to want Ruby 1.9 for this. In 1.8 threads are "green":
they exist only as threads inside the VM, so you still only hit
one core, and any blocking system I/O will block all of your threads.

Ruby 1.9 isn't going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that simultaneous
computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9 I would
recommend using multiple processes.
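
As a rough illustration of the difference on MRI, here is a minimal sketch that runs the same CPU-bound loop in two threads and then in two forked processes. The busy_sum helper and the loop size are made up for the example. Timings depend on the machine and the Ruby version, so none are shown; on an interpreter with a global VM lock one would expect the threaded run to take about as long as doing the work serially.

```ruby
require 'benchmark'

# Hypothetical CPU-bound job: sums the integers 1..n.
def busy_sum(n)
  (1..n).inject(0) { |acc, i| acc + i }
end

# Runs `jobs` copies of the work in threads of one interpreter
# and returns each thread's result.
def with_threads(jobs, n)
  jobs.times.map { Thread.new { busy_sum(n) } }.map { |t| t.value }
end

# Runs `jobs` copies of the work in forked child processes, waits for
# all of them, and returns how many children were started.
# The children's results are discarded in this sketch.
def with_processes(jobs, n)
  pids = jobs.times.map { fork { busy_sum(n) } }
  pids.each { |pid| Process.wait(pid) }
  pids.size
end

n = 2_000_000
puts "2 threads:   %.2fs" % Benchmark.realtime { with_threads(2, n) }
puts "2 processes: %.2fs" % Benchmark.realtime { with_processes(2, n) }
```

Watching top or mpstat while this runs shows how many cores are actually busy in each case.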
 
Glen Holcomb

Tony said:
Ruby 1.9 isn't going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that simultaneous
computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9 I would
recommend using multiple processes.

Ah, I did not know that.

 
Robert Klemme

Tony said:
Ruby 1.9 isn't going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that simultaneous
computation is still limited to one core.

Are you saying that the global VM lock even extends to several
processes? Because Marc did not want to use threads for distribution
but rather processes.

Kind regards

robert
 
Robert Klemme

Marc said:
----
threads = 4
my_array = [something_here]

threads.times do
  Process.fork(a_method(my_array.shift))
end

my_array.each do |object|
  Process.wait(0)
  Process.fork(a_method(object))
end
---

I believe you are not using Process.fork properly. In fact, I am
surprised that you do not get an exception:

irb(main):001:0> Process.fork("foo")
ArgumentError: wrong number of arguments (1 for 0)
from (irb):1:in `fork'
from (irb):1
from :0

Basically what you do is you do a calculation (a_method(object)) and
_then_ you create a process. No surprise that only one CPU is busy.
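
To make the evaluation order concrete, here is a small sketch. The a_method below is a stand-in that only reports the pid it ran in, and run_in_child is a helper invented for this example; neither comes from Marc's code. A call written as an argument runs in the parent before anything is forked, while a call written inside the block given to fork runs in the child.

```ruby
# Hypothetical stand-in for a_method: reports which process executed it.
def a_method(task)
  "task #{task} ran in pid #{Process.pid}"
end

# Runs the given block in a forked child and returns [child_pid, output].
# A pipe carries the block's return value (a string) back to the parent.
def run_in_child
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield)   # the block body executes here, in the child
    writer.close
  end
  writer.close            # parent keeps only the read end
  output = reader.read    # returns once the child has closed its end
  reader.close
  Process.wait(pid)
  [pid, output]
end

# Eager: the call is evaluated right here, in the parent process.
puts a_method(1)

# Deferred: the call sits inside the block, so it runs in the child.
child_pid, output = run_in_child { a_method(2) }
puts output
puts "parent pid is #{Process.pid}, child pid was #{child_pid}"
```

The first line printed carries the parent's pid and the second carries the child's, which is the whole difference between passing a result to fork and passing it a block.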

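Here's something else that you could do:

processes = 4

# Round the slice size up so that at most `processes` children are forked,
# and keep it at least 1 so each_slice is never called with 0.
slice_size = [(my_array.size.to_f / processes).ceil, 1].max

my_array.each_slice(slice_size) do |tasks|
  fork do
    tasks.each do |task|
      a_method(task)
    end
  end
end

Process.waitall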

The drawback is that one of those processes might happen to get all the
easy tasks, so you do not utilize the CPUs optimally. Here's another
solution that does not have that issue:

processes = 4
count = 0   # number of children currently running

my_array.each do |task|
  # All slots are busy: wait for any one child to exit before forking again.
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

You can see that it works with this example:

processes = 4
count = 0

10.times do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    printf "%-20s start %4d %4d\n", Time.now, $$, task
    sleep rand(5) + 2
    printf "%-20s end %4d %4d\n", Time.now, $$, task
  end
  count += 1
end

Process.waitall
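
The loops above throw away whatever a_method returns, because each child has its own memory. If the results are needed in the parent, one possible approach (a sketch, not something posted in this thread) is to give every child a pipe and send the result back with Marshal. The name parallel_map and the summing a_method are assumptions made for the example, and results must be marshalable for this to work.

```ruby
# Hypothetical CPU-bound job standing in for a_method.
def a_method(n)
  (1..n).inject(0) { |acc, i| acc + i }
end

# Runs a_method on every task with at most `processes` children alive at
# once and returns the results in task order.
def parallel_map(tasks, processes = 4)
  results = Array.new(tasks.size)
  running = {}   # read end of a child's pipe => [child pid, task index]

  # Waits until some child has produced output, stores its result, reaps it.
  collect_one = lambda do
    ready, = IO.select(running.keys)
    reader = ready.first
    pid, index = running.delete(reader)
    # Read to EOF before waiting, so a large result cannot block the child.
    # Marshal.load raises if the child died without writing anything.
    results[index] = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
  end

  tasks.each_with_index do |task, index|
    collect_one.call if running.size == processes
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    pid = fork do
      reader.close
      writer.write(Marshal.dump(a_method(task)))
      writer.close
    end
    writer.close   # parent keeps only the read end
    running[reader] = [pid, index]
  end

  collect_one.call until running.empty?
  results
end

p parallel_map([10, 100, 1000], 2)
```

As in the counting loop above, a new child is forked only after a slot frees up, so uneven task sizes do not leave cores idle.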


Kind regards

robert
 
Tony Arcieri

Robert said:
Are you saying that the global VM lock even extends to several processes?
Because Marc did not want to use threads for distribution but rather
processes.

No, if you look over my post again it specifically mentions the GVL applies
to threads and suggests using processes.
 
Robert Klemme

2009/10/29 Tony Arcieri said:
No, if you look over my post again it specifically mentions the GVL applies
to threads and suggests using processes.

I figured as much. The thread discussion does not help Marc, because
he explicitly wanted to use processes for core utilization. Basically
Glen sent us in the wrong direction though. :)

Cheers

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Marc Hoeppner

Robert said:
I believe you are not using Process.fork properly. In fact, I am
surprised that you do not get an exception:

irb(main):001:0> Process.fork("foo")
ArgumentError: wrong number of arguments (1 for 0)
from (irb):1:in `fork'
from (irb):1
from :0

Yes, quite possible - I didn't really look up the exact code, just wrote
it down from memory, sorry about that.

Robert said:
processes = 4
count = 0

my_array.each do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

That works like a charm, thanks a lot!
 
James M. Lawrence

Robert said:
processes = 4
count = 0

my_array.each do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

Another option,

Tiamat.open_local(4) {
  pure do
    fun_map :result => my_array do |elem|
      a_method(elem)
    end
  end.compute.result
}

This lets you distribute across N physical machines without a change to
the code.
 
James M. Lawrence

Tony said:
Ruby 1.9 isn't going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that
simultaneous computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9
I would recommend using multiple processes.

I'm not so sure jruby does this effectively.

require 'tiamat/autoconfig'
require 'pure/dsl'
require 'benchmark'

mod = pure do
  def total(left, right)
    left + right
  end

  def left
    (1..5_000_000).inject(0) { |acc, n| acc + n }
  end

  def right
    (1..5_000_000).inject(0) { |acc, n| acc + n }
  end
end

Benchmark.bmbm { |bm|
  bm.report("1 thread, 1 interpreter") {
    mod.compute(1).total
  }
  bm.report("2 threads, 1 interpreter") {
    mod.compute(2).total
  }
  # this part removed for jruby bench
  bm.report("2 threads, 2 interpreters") {
    Tiamat.open_local(2) {
      mod.compute.total
    }
  }
}

== ruby 1.9.2dev (2009-10-18 trunk 25393) [i386-darwin9.8.0]
Rehearsal -------------------------------------------------------------
1 thread, 1 interpreter     4.370000   0.020000   4.390000 (  4.389990)
2 threads, 1 interpreter    4.360000   0.030000   4.390000 (  4.385111)
2 threads, 2 interpreters   0.010000   0.010000   4.700000 (  2.460661)
--------------------------------------------------- total: 13.480000sec

                                user     system      total        real
1 thread, 1 interpreter     4.360000   0.020000   4.380000 (  4.376050)
2 threads, 1 interpreter    4.360000   0.030000   4.390000 (  4.380982)
2 threads, 2 interpreters   0.010000   0.010000   4.710000 (  2.465925)


== jruby 1.4.0RC3 (ruby 1.8.7 patchlevel 174) (2009-10-30 1d7de2d) (Java
HotSpot(TM) Client VM 1.5.0_20) [i386-java]
Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter    6.060000   0.000000   6.060000 (  6.060000)
2 threads, 1 interpreter   7.629000   0.000000   7.629000 (  7.629000)
-------------------------------------------------- total: 13.689000sec

                               user     system      total        real
1 thread, 1 interpreter    6.080000   0.000000   6.080000 (  6.080000)
2 threads, 1 interpreter   7.288000   0.000000   7.288000 (  7.288000)
 
Rajinder Yadav

James said:
Another option,

Tiamat.open_local(4) {
  pure do
    fun_map :result => my_array do |elem|
      a_method(elem)
    end
  end.compute.result
}

This lets you distribute across N physical machines without a change to
the code.

This is just elegant =) ... it's funny how I observe something and then
more of what I observe comes into the fold! Was hoping you would
reply to the thread ;)
 
Glen Holcomb

Robert said:
I figured as much. The thread discussion does not help Marc, because
he explicitly wanted to use processes for core utilization. Basically
Glen sent us in the wrong direction though. :)

I've always worked best as a diversion.

 
Charles Oliver Nutter

JRuby benchmarking:

* Use Java 6+

Java 6 is much faster than Java 5. Java 7 is faster still in many cases.

* Pass --server if -v output says "client" VM

The Hotspot JVM has two modes: "server" and "client". The "server" VM
does runtime-profiled optimizations and can be 2x or more faster than
the "client" VM.
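
A quick way to record which VM a benchmark actually ran on is to print it from the script itself. This is only a sketch: the runtime_description helper is made up here, and it assumes JRuby's standard Java integration (require 'java' and java.lang.System.getProperty).

```ruby
# Describes the runtime the script is executing on. Under JRuby it reports
# the JVM name, which is expected to say whether the client or server VM
# is in use.
def runtime_description
  if RUBY_PLATFORM == 'java'
    require 'java'
    vm = java.lang.System.getProperty('java.vm.name')
    version = java.lang.System.getProperty('java.version')
    "JRuby #{JRUBY_VERSION} on #{vm} (Java #{version})"
  else
    engine = defined?(RUBY_ENGINE) ? RUBY_ENGINE : 'ruby'
    "#{engine} #{RUBY_VERSION} (not a JVM, so --server does not apply)"
  end
end

puts runtime_description
```

Putting this line at the top of the benchmark output makes it easier to compare results posted from different machines.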

Results on my system (core 2 duo 2.66GHz):

ruby 1.9.2dev (2009-07-23 trunk 24248) [i386-darwin9.7.1]
Rehearsal -------------------------------------------------------------
1 thread, 1 interpreter     3.370000   0.020000   3.390000 (  3.516261)
2 threads, 1 interpreter    3.330000   0.020000   3.350000 (  3.412460)
2 threads, 2 interpreters   0.010000   0.000000   3.590000 (  2.133313)
--------------------------------------------------- total: 10.330000sec

                                user     system      total        real
1 thread, 1 interpreter     3.350000   0.010000   3.360000 (  3.415410)
2 threads, 1 interpreter    3.350000   0.020000   3.370000 (  3.423560)
2 threads, 2 interpreters   0.000000   0.010000   3.630000 (  2.302965)

jruby 1.5.0.dev (ruby 1.8.7 patchlevel 174) (2009-10-30 eaa9e7f) (Java
HotSpot(TM) 64-Bit Server VM 1.6.0_15) [x86_64-java]
Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter    2.373000   0.000000   2.373000 (  2.373000)
2 threads, 1 interpreter   1.733000   0.000000   1.733000 (  1.733000)
--------------------------------------------------- total: 4.106000sec

                               user     system      total        real
1 thread, 1 interpreter    2.145000   0.000000   2.145000 (  2.145000)
2 threads, 1 interpreter   1.840000   0.000000   1.840000 (  1.840000)

It would probably improve more with a longer run, but this is pretty good.

- Charlie
 
James M. Lawrence

Charles said:
JRuby benchmarking:

* Use Java 6+

Java 6 is much faster than Java 5. Java 7 is faster still in many cases.

* Pass --server if -v output says "client" VM

I didn't consider it because the behavior I showed looks wrong for
either Java 5 or Java 6 in either client or server mode. Indeed I
obtained the same results with Java 6 Server VM.

A computation split into two parallel threads takes more time than the
same computation with one thread. 'top' reports 185% CPU and 100% CPU
respectively.

I was not concerned with comparing MRI and jruby. MRI was a baseline
to demonstrate that Pure's parallelism was working in the first place.

I was unable to find your eaa9e7f commit so I grabbed the latest
master branch.

jruby 1.5.0.dev (ruby 1.8.7 patchlevel 174) (2009-11-02 55366a1) (Java
HotSpot(TM) 64-Bit Server VM 1.6.0_15) [x86_64-java]

Core 2 Duo 1.83GHz; all apps closed except Terminal; benchmarks made
without 'top' running.

Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter    3.422000   0.000000   3.422000 (  3.422000)
2 threads, 1 interpreter   4.008000   0.000000   4.008000 (  4.008000)
--------------------------------------------------- total: 7.430000sec

                               user     system      total        real
1 thread, 1 interpreter    2.942000   0.000000   2.942000 (  2.942000)
2 threads, 1 interpreter   3.595000   0.000000   3.595000 (  3.595000)

Results are the same with Pure removed:

require 'benchmark'

def left
  (1..10_000_000).inject(0) { |acc, n| acc + n }
end

def right
  (1..10_000_000).inject(0) { |acc, n| acc + n }
end

Benchmark.bmbm { |bm|
  bm.report("1 thread") {
    Thread.new {
      [left, right]
    }.value
  }
  bm.report("2 threads") {
    [
      Thread.new { left },
      Thread.new { right },
    ].map { |t| t.value }
  }
}

Rehearsal ---------------------------------------------
1 thread    6.726000   0.000000   6.726000 (  6.726000)
2 threads   7.478000   0.000000   7.478000 (  7.478000)
----------------------------------- total: 14.204000sec

                user     system      total        real
1 thread    6.636000   0.000000   6.636000 (  6.636000)
2 threads   8.196000   0.000000   8.196000 (  8.196000)
 
Charles Oliver Nutter

James said:
Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter    3.422000   0.000000   3.422000 (  3.422000)
2 threads, 1 interpreter   4.008000   0.000000   4.008000 (  4.008000)
--------------------------------------------------- total: 7.430000sec

                               user     system      total        real
1 thread, 1 interpreter    2.942000   0.000000   2.942000 (  2.942000)
2 threads, 1 interpreter   3.595000   0.000000   3.595000 (  3.595000)

This does not match my results. Are you sure both cores are being used?

James said:
Rehearsal ---------------------------------------------
1 thread    6.726000   0.000000   6.726000 (  6.726000)
2 threads   7.478000   0.000000   7.478000 (  7.478000)
----------------------------------- total: 14.204000sec

                user     system      total        real
1 thread    6.636000   0.000000   6.636000 (  6.636000)
2 threads   8.196000   0.000000   8.196000 (  8.196000)

Also does not match my results:

Rehearsal ---------------------------------------------
1 thread    4.795000   0.000000   4.795000 (  4.739000)
2 threads   3.072000   0.000000   3.072000 (  3.072000)
------------------------------------ total: 7.867000sec

                user     system      total        real
1 thread    4.081000   0.000000   4.081000 (  4.081000)
2 threads   2.966000   0.000000   2.966000 (  2.966000)

I'd love to hear from others trying this benchmark, since the results
you've given don't match my results on any of the systems I'm testing.

- Charlie
 
James M. Lawrence

Charles Oliver Nutter:
This does not match my results. Are you sure both cores are being used?

I am certain. I tried to head off this question when I said: all
applications are closed save Terminal; top reports 0% CPU usage
beforehand; top reports java at 100% CPU during the 1-thread test;
185% CPU during the 2-thread test; top was not running during the
posted benchmarks.

I should also mention this is my mp3 player co-opted into a Mac dev
machine--a Mac Mini. Maybe Java balks at the specs. System Profiler:

Model Name: Mac mini
Model Identifier: Macmini2,1
Processor Name: Intel Core 2 Duo
Processor Speed: 1.83 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 2 MB
Memory: 1 GB
Bus Speed: 667 MHz

Darwin jl.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01
PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

It would be nice to match jruby versions. Can you try master 55366a1
or push eaa9e7f to a remote branch?

 
