Threading in ruby

V

Vishnu I.

Hi
I was recently trying to solve a pet problem of mine where threading
seems to be the most natural way to solve a problem. I needed a thread
that would scan a fixed list of tuples each of which contained a single
filenames to sha1 them and add the sha1 to each tuple and a second
thread that would run through this list and contact this service on the
internet to get metadata on these files. (the service exposes a udp api
that allows a given ip to use only a fixed local port and has strict
rate limits, so I don't want to request metadata for files
concurrently). The thing is I have almost never worked worked with
threading since I've mostly built web apps.

So I decided to maintain a list of tuples and a last updated index that
would be visible to both threads.

So something like this
https://gist.github.com/764495#file_simple_multithreaded_example .
However, I understand that this may not work. Because there is no
guarantee of the order in which the two operations are performed and
these might be reordered. But this understanding comes from reading
about this in other languages. Is this also true in Ruby?

I understand that if I were doing this in .net or java I could just make
index volatile which is a directive to the compiler and runtime that
operations surrounding it should not be reordered. Am i correct in
understanding that there is no such concept in ruby? So I cannot do this
locklessly.

So I decided to lock on a mutex and I came up with this.
https://gist.github.com/764495#file_simple_multithreaded_example_with_loc=
ks
now in the scanner thread I have udpated the list inside the sychronized
block and then updated index, but is it safe to move the list updation
outside? (I assume no since regular statements and sych blocks can get
reordered). Also I have sychronized in the updater thread because I dont
know if statements inside sychronized blocks can be re-ordered. Is that
right?

p.s. After I considered this solution, I came to know about the Queue
class=C2=A0in thread, so I know that that is a more elegant solution. But=
I
would still like to know the solution.

-- =

Posted via http://www.ruby-forum.com/.=
 
R

Robert Klemme

=A0 =A0I was recently trying to solve a pet problem of mine where threadi= ng
seems to be the most natural way to solve a problem. I needed a thread
that would scan a fixed list of tuples each of which contained a single
filenames to sha1 them and add the sha1 to each tuple and a second
thread that would run through this list and contact this service on the
internet to get metadata on these files. (the service exposes a udp api
that allows a given ip to use only a fixed local port and has strict
rate limits, so I don't want to request metadata for files
concurrently). The thing is I have almost never worked worked with
threading since I've mostly built web apps.

This is the point where I would immediately switch to using a queue.
=A0So I decided to maintain a list of tuples and a last updated index tha= t
would be visible to both threads.

So something like this
https://gist.github.com/764495#file_simple_multithreaded_example .
However, I understand that this may not work. Because there is no
guarantee of the order in which the two operations are performed and
these might be reordered. But this understanding comes from reading
about this in other languages. Is this also true in Ruby?

Yes. Without synchronization of some sort all bets are off and there
are no guarantees about the order in which code in concurrent threads
is executed.
I understand that if I were doing this in .net or java I could just make
index volatile which is a directive to the compiler and runtime that
operations surrounding it should not be reordered. Am i correct in
understanding that there is no such concept in ruby? So I cannot do this
locklessly.
Exactly.

So I decided to lock on a mutex and I came up with this.
https://gist.github.com/764495#file_simple_multithreaded_example_with_loc= ks
now in the scanner thread I have udpated the list inside the sychronized
block and then updated index,

As far as I can see you forgot to increment local_index after #update_metad=
ata.
but is it safe to move the list updation
outside? (I assume no since regular statements and sych blocks can get
reordered).

There is no list update. You only update Hash instances contained in
the Array. With your logic it should be safe to only synchronize on
the index access because you modify only those entries in the Array
with index > current index.
Also I have sychronized in the updater thread because I dont
know if statements inside sychronized blocks can be re-ordered. Is that
right?

Not sure what you mean here. You properly synchronize access to index
which is exactly what you need to do to make the code safe.
p.s. After I considered this solution, I came to know about the Queue
class=A0in thread, so I know that that is a more elegant solution. But I
would still like to know the solution.

https://gist.github.com/764540

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
V

Vishnu I.

Robert Klemme wrote in post #972187:
This is the point where I would immediately switch to using a queue.

oh absolutely.Once I learnt that Queue's exist, I got the thing working
with queues. I just want to understand how to reason with threads that
share data.
Yes. Without synchronization of some sort all bets are off and there
are no guarantees about the order in which code in concurrent threads
is executed.


As far as I can see you forgot to increment local_index after
#update_metadata.

ah yes, that was an accident when I was typing it out right now.
There is no list update. You only update Hash instances contained in
the Array. With your logic it should be safe to only synchronize on
the index access because you modify only those entries in the Array
with index > current index.

This is the part that I'm not sure about. If I synchronize only on
updation of the index, there's nothing preventing the runtime from first
updating the index and THEN updating the hash in the array. unless
sychronizing guarantees that all statements prior to the synchronized
block are executed.

Not sure what you mean here. You properly synchronize access to index
which is exactly what you need to do to make the code safe.


sorry, what I meant was inside the top synchronized block I update the
hash and the index. But again there are no guarantees about the order in
which these actions happen. So I have to synchronize on the mutex while
retrieving index too.


oops I wasnt clear at all here I see :eek:. I meant I understood and got
the solution working with queues. I was wondering how to solve this
problem with locks :)

thanks
Vishnu
 
R

Robert Klemme

Robert Klemme wrote in post #972187:

This is the part that I'm not sure about. If I synchronize only on
updation of the index, there's nothing preventing the runtime from first
updating the index and THEN updating the hash in the array. unless
sychronizing guarantees that all statements prior to the synchronized
block are executed.

In Java - which has far more sophisticated threading and memory model
than MRI - it would also be sufficient to only synchronize the index
modification. The reason is that in Java each synchronized block has
two memory barriers - one on entry and one on exit. The JVM is only
allowed to reorder execution between two memory barriers but it is
never allowed that reordering of statements moves execution of a
statement across a memory barrier.
sorry, what I meant was inside the top synchronized block I update the
hash and the index. But again there are no guarantees about the order in
which these actions happen. So I have to synchronize on the mutex while
retrieving index too.

Yes, of course. If you want to make access to a resource thread safe
you must synchronize *all* accesses. That's the simple rule to make
code thread safe. Only the change of the Hash in the Array does not
need to be synchronized. In fact, you should strive to make
synchronized sections as short as possible to avoid unnecessary
contention.
oops I wasnt clear at all here I see :eek:. I meant I understood and got
the solution working with queues. I was wondering how to solve this
problem with locks :)

Well, I chose to ignore what you wanted and instead present what I
would consider a better solution. :)

The solution without queue would involve a ConditionVariable. This is
more appropriate than using sleep:

https://gist.github.com/764540#file_sc2.rb

Kind regards

robert


--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
V

Vishnu I.

Hi Robert
thanks for such a clear explanation. I have a couple of questions
though.
In Java - which has far more sophisticated threading and memory model
than MRI - it would also be sufficient to only synchronize the index
modification. The reason is that in Java each synchronized block has
two memory barriers - one on entry and one on exit. The JVM is only
allowed to reorder execution between two memory barriers but it is
never allowed that reordering of statements moves execution of a
statement across a memory barrier.

ah I see. So anything that happens before the synchronized block should
have happened before whats inside the sychronized block. Is this also
true in ruby?

Yes, of course. If you want to make access to a resource thread safe
you must synchronize *all* accesses. That's the simple rule to make
code thread safe. Only the change of the Hash in the Array does not
need to be synchronized. In fact, you should strive to make
synchronized sections as short as possible to avoid unnecessary
contention.

I dont follow this. If the synchrovised section has introduced a memory
barrier. Then accessing the shared counter either will or will not see
the updated index. But if it see's an updated index, then because of the
memory barrier mentioned previously, it should see an updated item too
no?

The solution without queue would involve a ConditionVariable. This is
more appropriate than using sleep:

https://gist.github.com/764540#file_sc2.rb

thanks :)

Vishnu
 
R

Robert Klemme

ah I see. So anything that happens before the synchronized block should
have happened before whats inside the sychronized =A0block. Is this also
true in ruby?
Yes.


I dont follow this. If the synchrovised section has introduced a memory
barrier. Then accessing the shared counter either will or will not see
the updated index. But if it see's an updated index, then because of the
memory barrier mentioned previously, it should see an updated item too
no?

Yes. But there is no additional synchronization needed for access to
the Array or Hash instances in the array. It is sufficient to only
update the index in a synchronized block. And in fact that is the
only operation that you should do there (apart from the condition
variable signal).

The term "memory barrier" is a concept from the JVM. AFAIK in Ruby's
MRI there is no concept as thread local and global memory (which the
memory barriers update). Second, a memory barrier effects *all the
memory*. Whatever changes were done before it are then propagated to
between memories (in JVM of course).

Does that help?

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
R

Robert Klemme

ah I see. So does each wait get the corresponding signal? so if I have
10 signals queued up, the next 10 waits can operate instantly?

signals are not queued. All threads waiting on the condition are
signalled (and woken up) - if there is no thread waiting, none is
woken up. But that is not needed because of the condition and
looping. The client will continue to work until the condition is
false. The CV only serves the purpose to wake up threads should they
go out of work to do.

Btw, I recommend Doug Lea's book - it's about Java but concepts
described there like thread confinement have broader validity.

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
V

Vishnu I.

Yes. But there is no additional synchronization needed for access to
the Array or Hash instances in the array. It is sufficient to only
update the index in a synchronized block. And in fact that is the
only operation that you should do there (apart from the condition
variable signal).

The term "memory barrier" is a concept from the JVM. AFAIK in Ruby's
MRI there is no concept as thread local and global memory (which the
memory barriers update). Second, a memory barrier effects *all the
memory*. Whatever changes were done before it are then propagated to
between memories (in JVM of course).

Does that help?

Kind regards

robert

oops Im sorry I wasnt clear. My question was why would I need to
synchronize on the read of the index variable if the synchronized write
has created a memory barrier between the updation of the list item and
the index update? Thats the only piece i couldnt follow
 
V

Vishnu I.

Robert Klemme wrote in post #972756:
signals are not queued. All threads waiting on the condition are
signalled (and woken up) - if there is no thread waiting, none is
woken up. But that is not needed because of the condition and
looping. The client will continue to work until the condition is
false. The CV only serves the purpose to wake up threads should they
go out of work to do.

Btw, I recommend Doug Lea's book - it's about Java but concepts
described there like thread confinement have broader validity.

Kind regards

robert



hmm ok Im confused now. Say the main thread completes really quickly and
fires available.signal 10 times in 2 minutes. Meanwhile say the updater
has only updated twice. Wont the third update wait forever since there
is no one left to signal?
 
R

Robert Klemme

oops Im sorry I wasnt clear. My question was why would I need to
synchronize on the read of the index variable if the synchronized write
has created a memory barrier between the updation of the list item and
the index update? Thats the only piece i couldnt follow

Ah, I see. The reason is simple: if the thread reading the resource
does not synchronize there is no memory barrier and hence no guarantee
that it will read a current value. The writing thread may have
published the current state to global memory but your reading thread
does not necessarily have imported that change.

If you want to read more I suggest you google for "double checked
locking broken". You'll find some articles which dissect this
seemingly good idea.

From your other mail:
hmm ok Im confused now. Say the main thread completes really quickly and
fires available.signal 10 times in 2 minutes. Meanwhile say the updater
has only updated twice. Wont the third update wait forever since there
is no one left to signal?

No, because the updater won't block if it sees that there is still
work to do (i <=3D index). It will simply continue. You can easily
test this out by inserting a sleep in the updater.

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
V

Vishnu I.

Robert Klemme wrote in post #972776:
Ah, I see. The reason is simple: if the thread reading the resource
does not synchronize there is no memory barrier and hence no guarantee
that it will read a current value. The writing thread may have
published the current state to global memory but your reading thread
does not necessarily have imported that change.

aha, got it. A synchronized block guarantees that stuff in one thread
that happens before the block is visible to any other thread that is
synchronized on the same mutex. Thanks, I get it now!

If you want to read more I suggest you google for "double checked
locking broken". You'll find some articles which dissect this
seemingly good idea.

From your other mail:


No, because the updater won't block if it sees that there is still
work to do (i <= index). It will simply continue. You can easily
test this out by inserting a sleep in the updater.


oh right. That was silly of me. :)
 
R

Robert Klemme

Robert Klemme wrote in post #972776:

aha, got it. A synchronized block guarantees that stuff in one thread
that happens before the block is visible to any other thread that is
synchronized on the same mutex. =A0Thanks, I get it now!

I don't think this is not scoped to mutexes: a memory barrier is
always a global thing. If you think about it anything else would be
very inefficient - the smallest unit of memory that an operating
system deals with is a page. Again, this is relevant for Java. In
Ruby there is no distinction between thread local and global memory -
unless you are using JRuby of course.

The reason for using the same mutex for the writer and the reader
thread is simply that you need *one* critical section in order to
ensure only one thread accesses whatever is guarded by the mutex at a
time.
oh right. That was silly of me. :)

Well, multithreading is a difficult topic that few people really grok.
Apparently the human brain does work concurrently internally, yet it
is not well suited to reason about concurrency. :) For example, for
me initially it was difficult to get the distinction right between a
Thread object and the thread of execution that is represented by it.
These are really two different things and the thread of execution is
only temporarily associated with the object through which it can be
manipulated and queried.

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top