CAS operations and scalability...

Discussion in 'C++' started by aminer, Aug 26, 2012.

  1. aminer

    aminer Guest

    Hello,


    When a CAS operation "goes on the bus", its use can impair scalability;
    but when the CAS can be accomplished locally -- that is, with no bus
    transactions -- it can scale.

    If you replace a CAS that goes on the bus with a normal store, you will
    see a similar slow-down in terms of coherency bus traffic: a CAS isn't
    appreciably different from a normal store. Also, on older x86 processors
    the LOCK prefix caused the LOCK# signal to be asserted, acquiring
    exclusive access to the bus, which of course doesn't scale.

    As you may have noticed, I have written parallelhashlist (a parallel
    hashtable); you can find it here:

    http://pages.videotron.com/aminer/

    It's a parallel hashtable with O(1) best-case and O(log(n)) worst-case
    access that uses lock striping and lightweight MREWs (multiple-readers
    -exclusive-writer locks), which allows multiple threads to read and
    write concurrently. parallelhashlist also maintains an independent
    counter of the number of entries for each segment of the hashtable,
    with a lock for each counter; this too is for better scalability.
    parallelhashlist is scaling very well, and since it is a parallel
    hashtable the possibility of contention is low, so why do I need
    Dmitry Vyukov's distributed reader-writer lock inside my parallel
    hashlist?

    Dmitry Vyukov's distributed reader-writer lock is here:

    http://www.1024cores.net/home/lock-...riter-problem/distributed-reader-writer-mutex


    Other than that, I have done some tests with the lightweight MREW that
    I am using inside parallelhashlist, and also with my lockfree mpmc fifo
    queue, and I think the CAS is generating a lot of contention; this is
    why the lightweight MREW and my lockfree_mpmc are not scaling, while
    parallelhashlist scales very well because I am using lock striping,
    which lowers contention.

    What Dmitry Vyukov is doing in his distributed rwlock is lowering
    contention by the same method as the lock striping I am using inside
    parallelhashlist; that is why it scales. But there is still a
    possibility of contention in his distributed rwlock that can hurt
    scalability if there are too many threads and not a sufficient number
    of rwlocks in the distributed rwlock to lower the contention.

    I have tested parallelhashlist (the parallel hashtable that I have
    implemented) with four threads on a quad core, and it scales very well
    on both reads and writes.

    I have also done some scalability tests on my parallelsort library,
    and I have come to the conclusion that parallel heapsort scales better
    than parallel quicksort because the P part (of the Amdahl equation) is
    bigger in parallel heapsort than in parallel quicksort: parallel
    heapsort does more work in the parallel part, which is why it scales
    better. But parallel quicksort is still faster than parallel heapsort
    and parallel mergesort in my tests on a quad core processor.

    And about lockfree_mpmc (a lockfree fifo queue): I have done some
    tests and it is not scaling, because with a single thread some
    variables are updated locally in the L1 cache, but with multiple
    threads those variables must be loaded from the L2 cache, which is
    more expensive and generates much more contention.

    But even though lockfree_mpmc is not scalable, you can increase the
    P (parallel) part by doing more of the same: increase the volume of
    data processed by the P part (and therefore the percentage of time
    spent computing). This is Gustafson's Law, and it will give you more
    scalability.

    For example, I have used the IntToStr() function in each of the four
    threads (on a quad core) in my lockfree_mpmc test programs to convert
    an integer to a string; this increased the P (parallel) part and I got
    more scalability. This is Gustafson's Law, and it is very important to
    remember it.


    You can download my parallel libraries from

    http://pages.videotron.com/aminer/




    Sincerely,
    Amine Moulay Ramdane.
    aminer, Aug 26, 2012
    #1

  2. aminer

    aminer Guest

    I wrote:

    "When the CAS operation "goes on the bus", use of CAS can impair
    scalability.
    but CAS can be accomplished locally -- that is, with no bus transactions --
    and then it can scale.

    If we then change the CAS operation that goes on the bus to a normal store
    you'll also see a similar slow-down in terms of coherency bus traffic, CAS
    isn't appreciably different than a normal store. Also the lock: prefix
    caused
    the LOCK# signal to be asserted, acquiring exclusive access to the bus.
    This doesn't scale of course"


    If you take a look at my threadpool engine

    http://pages.videotron.com/aminer/

    in the TThreadPool.execute() method I am using

    local_balance := LockedIncLong(balance1) mod FThreadCount;

    and the LockedIncLong() function goes on the bus, so it can generate a
    lot of contention; it does not scale if there is a high arrival rate
    and a lot of contention on it. But even though it does not scale, in
    the majority of real-world applications the processing rate of the
    worker threads is lower than the arrival rate to TThreadPool.execute(),
    so the workers, not this counter, are the bottleneck, and I think this
    is not a problem.


    Thank you,

    Amine Moulay Ramdane


    "aminer" <> wrote in message
    news:k1dt8c$s1a$...
    > Hello,
    >
    >
    > When the CAS operation "goes on the bus", use of CAS can impair
    > scalability.
    > but CAS can be accomplished locally -- that is, with no bus
    > transactions --
    > and then it can scale.
    >
    > If we then change the CAS operation that goes on the bus to a normal store
    > you'll also see a similar slow-down in terms of coherency bus traffic, CAS
    > isn't appreciably different than a normal store. Also the lock: prefix
    > caused
    > the LOCK# signal to be asserted, acquiring exclusive access to the bus.
    > This doesn't scale of course
    >
    > As you have noticed i have wrote parallelhashlist (a parallel hashtable),
    > you can find parallelhashlist here:
    >
    > http://pages.videotron.com/aminer/
    >
    > It's a parallel Hashtable with O(1) best case and O(log(n)) worst case
    > access that uses lock striping and lightweight MREWs(multiple-readers
    > -exclusive-writer) , this allows multiple threads to write and read
    > concurently. also parallelhashlist maintains an independant counter , that
    > counts the number of entries , for each segment of the hashtable and uses
    > a lock for each counter, this is also for better scalability. and
    > parallelhashlist
    > is scaling very well, but since it is a parallel hashtable so the
    > possibility of
    > contention is low so why doi need the distributed reader-writer lock of
    > Dmitry Vyukov inside my parallel hashlist ?
    >
    > Dmitry Vyukov's distributed reader-writer lock is here:
    >
    > http://www.1024cores.net/home/lock-...riter-problem/distributed-reader-writer-mutex
    >
    >
    > Other than that I have done some tests with the lightweight MREW that i am
    > using inside my parallelhashlist and i have done also some tests with my
    > lockfree mpmc fifo queue and what i think is that the CAS is generating
    > a lot of contention this is is why the lightweight MREW and my
    > lockfree_mpmc
    > are not scaling , but parallelhashlist is scaling very well cause i am
    > using
    > lock-striping that is lowering contention.
    >
    > What are doing Dmitry Vyukov in his distributed rwlock is lowering
    > the contention using the same method as lock striping that i am using
    > inside
    > parallelhashlist it is why it is scaling, but there is still a possibility
    > of contention in his distributed rwlock that can cause a problem to the
    > scalability if there is too many threads and not a sufficient number of
    > rwlocks in the Dmitry distributed rwlock to be able to lower the
    > contention.
    >
    > I have tested parallelhashlist(a parallel hashtable that i have
    > implemented)
    > with four threads on a quad core and it's giving a very well scaling on
    > both
    > reads and writes.
    >
    > Also i have done some scalability tests on my parallelsort library and i
    > have
    > come
    > to the conclusion that parallel heapsort is better on scalability than
    > parallel quicksort
    > cause the P part (of the Amdahl equation) is bigger in parallel heapsort
    > than in parallel
    > quicksort, the parallel heapsort is doing more on the parallel part, it's
    > why it scales better than parallel quicksort, but parallel quicksort is
    > still
    > faster than parallel heapsort and parallel merge sort on my tests on a
    > quad core processor.
    >
    > And about lockfree_mpmc( a lockfree fifo queue), i have done some tests
    > and it's not scaling cause when you are using a single thread some
    > variables
    > are updated locally on the L1 cache but using multiple threads those
    > variables are
    > loaded from the L2 cache and it's more expensive to load them from the L2
    > cache.and this does generate much more contention
    >
    > But even though lockfree_mpmc is not scalable, you can increase
    > the P (parallel) part by doing more of the same: Increase the volume of
    > data processed by the P part (and therefore the percentage p of time spent
    > in computing). This is Gustafson's Law and you will get more scalability.
    >
    > For example i have used the IntToStr() function on each of the four
    > threads
    > (on
    > a quad core) on my lockfree_mpmc test programs to convert from and integer
    > to a string, so i have increased the P (parallel) part and i have got more
    > scalability,
    > this is Gustafson's Law, and you have to remember Gustafson's Law ,
    > this is very important.
    >
    >
    > You can download my parallel libraries from
    >
    > http://pages.videotron.com/aminer/
    >
    >
    >
    >
    > Sincerely,
    > Amine Moulay Ramdane.
    >
    >
    >
    aminer, Aug 26, 2012
    #2
