STL map and pthread performance problem on Linux/GCC

Discussion in 'C++' started by nan.li.g@gmail.com, Aug 16, 2005.

  1. Guest

    Hello, all,
    I have an interesting problem about stl map and pthread on Linux
    and g++. The source code is as follows.


    //mt_map_test.cpp
    #include <string>
    #include <map>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>


    using namespace std;

    static void* thrd_work( void* data )
    {
    long count = reinterpret_cast<long>( data );

    const int SIZE =32;
    char buf[SIZE];
    memset( buf, 'h', SIZE );
    map<string, string> strMap;

    for ( long i=0; i< count ; i++ ) {
    char key[8];
    sprintf( key, "%d", i );
    strMap[ key ] = buf;
    }
    }

    int main()
    {
    int JOB_NUM = 320000;
    pthread_t tid[THRD_NUM];

    for ( int i=0; i< THRD_NUM; i++ ){
    pthread_create( tid+i, NULL, &thrd_work,
    reinterpret_cast<void*> (JOB_NUM) );
    }

    for ( int i=0; i< THRD_NUM; i++ ){
    pthread_join( tid, NULL );
    }

    }


    And here is what I got from my workstation ( Dual AMD Opteron machine,
    RHEL 3)

    [nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t
    mt_map_test.cpp -lpthread ; time ./a.out; done

    real 0m1.390s
    user 0m1.280s
    sys 0m0.120s

    real 0m3.450s
    user 0m5.320s
    sys 0m1.170s


    I expected that the 2 times should be roughly equal. But clearly I
    experienced significant slowdown with 2 threads.The same also happened
    to a dual Intel Xeon machine. I suspect the internal stl map
    implementation is improper.

    I've spent hours googling without any answer. I really need advice from
    a C++ expert. Thanks a lot.

    Nan
     
    , Aug 16, 2005
    #1
    1. Advertising

  2. wrote:

    []

    > I expected that the 2 times should be roughly equal. But clearly I
    > experienced significant slowdown with 2 threads.The same also happened
    > to a dual Intel Xeon machine. I suspect the internal stl map
    > implementation is improper.
    >
    > I've spent hours googling without any answer. I really need advice from
    > a C++ expert. Thanks a lot.


    The problem may well be in malloc(), since it uses a mutex to protect
    its data structures, thus serializing your map<> inserts. Try
    http://www.hoard.org/
     
    Maxim Yegorushkin, Aug 16, 2005
    #2
    1. Advertising

  3. wrote:
    > Hello, all,
    > I have an interesting problem about stl map and pthread on Linux
    > and g++. The source code is as follows.
    >
    >
    > //mt_map_test.cpp
    > #include <string>
    > #include <map>
    > #include <unistd.h>
    > #include <sys/types.h>
    > #include <stdio.h>
    > #include <string.h>
    > #include <pthread.h>
    >
    >
    > using namespace std;
    >
    > static void* thrd_work( void* data )
    > {
    > long count = reinterpret_cast<long>( data );
    >
    > const int SIZE =32;
    > //char buf[SIZE];


    // allow room for a nul-terminator so the temp std::string
    // constructed from buf[] by 'strMap[ key ] = buf' will
    // always be SIZE bytes in length.
    char buf[SIZE + 1];

    > memset( buf, 'h', SIZE );


    // nul-terminate buf[]. if we don't, then when a
    // std::string is made from buf[] the nbr of bytes
    // put into the std::string will be all bytes up to the
    // next zero byte (an unknown length) - undefined behavior.
    buf[SIZE] = '\0';

    > map<string, string> strMap;
    >
    > for ( long i=0; i< count ; i++ ) {
    > char key[8];
    > sprintf( key, "%d", i );
    > strMap[ key ] = buf;
    > }
    > }
    >
    > int main()
    > {
    > int JOB_NUM = 320000;
    > pthread_t tid[THRD_NUM];
    >
    > for ( int i=0; i< THRD_NUM; i++ ){
    > pthread_create( tid+i, NULL, &thrd_work,
    > reinterpret_cast<void*> (JOB_NUM) );
    > }
    >
    > for ( int i=0; i< THRD_NUM; i++ ){
    > pthread_join( tid, NULL );
    > }
    >
    > }
    >
    >
    > And here is what I got from my workstation ( Dual AMD Opteron machine,
    > RHEL 3)
    >
    > [nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t
    > mt_map_test.cpp -lpthread ; time ./a.out; done
    >
    > real 0m1.390s
    > user 0m1.280s
    > sys 0m0.120s
    >
    > real 0m3.450s
    > user 0m5.320s
    > sys 0m1.170s
    >
    >
    > I expected that the 2 times should be roughly equal. But clearly I
    > experienced significant slowdown with 2 threads.The same also happened
    > to a dual Intel Xeon machine. I suspect the internal stl map
    > implementation is improper.
    >
    > I've spent hours googling without any answer. I really need advice from
    > a C++ expert. Thanks a lot.
    >
    > Nan
    >


    Just because the you have 2 cpu's doesn't mean that each thread
    will run on its own cpu. I've seen discussions about this in
    the gcc and g++ newsgroups. I don't remember the details, but
    I seem to recall that special compile/link options were involved
    to force usage of the multiple cpu's.

    FYI, here's what I get on my old single-cpu PII-450 with
    384MB of RAM:

    real 0m9.407s
    user 0m9.006s
    sys 0m0.215s

    real 0m18.671s
    user 0m17.959s
    sys 0m0.489s

    Regards,
    Larry
     
    Larry I Smith, Aug 17, 2005
    #3
  4. Guest

    Thanks a lot for the replies. I can use sched_setaffinity for
    processor binding on Linux. But that does not look like what the
    problem is. Also, I made a multi-process program using fork(), and
    the result there is satisfactory.

    Below is the new code with processor binding, multi-process and
    corrections on the std::string.

    -----------------------------------------------------------
    #include <string>
    #include <map>
    #include <unistd.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>
    #include <pthread.h>
    #include <sys/wait.h>
    #include <sched.h>

    using namespace std;

    struct WorkData {

    int cpu;
    long count;

    };

    static void* work( void* data )
    {
    WorkData* workData = reinterpret_cast<WorkData*>( data );

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET( workData->cpu, &mask);
    sched_setaffinity(0, &mask);

    const int SIZE =32;
    char buf[SIZE];
    memset( buf, 'h', SIZE );
    buf[SIZE-1] = '\0';
    map<string, string> strMap;

    for ( long i=0; i< workData->count ; i++ ) {
    char key[8];
    sprintf( key, "%d", i );
    strMap[ key ] = buf;
    }
    }


    int main()
    {
    const int JOB_NUM = 320000;
    WorkData workData[ CONCURRENCY ];
    #ifdef MT
    pthread_t tid[ CONCURRENCY ];
    for ( int i=0; i< CONCURRENCY; i++ ){
    workData.count = JOB_NUM;
    workData.cpu = i;
    pthread_create( tid+i, NULL, &work, reinterpret_cast<void*>
    (&(workData)) );
    }

    for ( int i=0; i< CONCURRENCY; i++ ){
    pthread_join( tid, NULL );
    }

    #endif

    #ifdef MP

    pid_t pid[ CONCURRENCY ];
    for ( int i=0; i< CONCURRENCY; i++ ){
    workData.count = JOB_NUM;
    workData.cpu = i;

    if ( (pid = fork()) == 0 ) {

    work( reinterpret_cast<void*> ( &(workData)));
    exit( 0 );
    }
    }

    for ( int i=0; i< CONCURRENCY; i++ ){
    waitpid( pid, NULL, 0 );
    }

    #endif
    }

    ----------------------------------------------------------
    Performace:

    [nan@eudyptula test]$ for m in MT MP; do for t in 1 2; do printf "\n%s
    %d" $m $t; g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread ;
    time ./a.out; done; done

    MT 1
    real 0m1.410s
    user 0m1.380s
    sys 0m0.030s

    MT 2
    real 0m3.650s
    user 0m5.230s
    sys 0m1.640s

    MP 1
    real 0m1.380s
    user 0m1.340s
    sys 0m0.040s

    MP 2
    real 0m1.400s
    user 0m2.650s
    sys 0m0.130s

    I also suspect there is some locking problem. I have not got my
    program to work with hoard yet. But I did not find any slowdown when I
    did a simple test on malloc in a multi-threaded program. For now, I
    am going to dig into the map implementation on my machine.
     
    , Aug 17, 2005
    #4
  5. Guest

    I tried the same program again tonight on a Redhat FC3 dual processor
    machine. The problem just went away. Notice the gcc on FC3 is
    3.4.2 . Before I used 3.2.3. on RHEL3.

    gcc -v
    Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.2/specs
    Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
    --infodir=/usr/share/info --enable-shared --enable-threads=posix
    --disable-checking --with-system-zlib --enable-__cxa_atexit
    --disable-libunwind-exceptions --enable-java-awt=gtk
    --host=i386-redhat-linux
    Thread model: posix
    gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)

    $ for m in MT ; do for t in 1 2; do printf "\n%s %d" $m $t; g++
    -D$m -DCONCURRENCY=$t mt_map_test.cpp -lpthread; time ./a.out; done;
    done

    MT 1
    real 0m1.707s
    user 0m1.632s
    sys 0m0.075s

    MT 2
    real 0m1.805s
    user 0m3.303s
    sys 0m0.240s

    BTW, the signature of sched_setaffinity changed on FC3,
    need to use 'sched_setaffinity(0, sizeof(mask), &mask);'

    I also got hoard working on FC3. But no further improvement. On RHEL3,
    I got the so file with the following warning:
    Compiling for Linux
    In file included from libhoard.cpp:164:
    usesimtls.cpp: In function `int pthread_create(pthread_t*, const
    pthread_attr_t*, void*(*)(void*), void*)':
    usesimtls.cpp:323: warning: `pthread_attr_setstackaddr' is deprecated
    (declared
    at /usr/include/nptl/pthread.h:299)
    In file included from libhoard.cpp:164:
    usesimtls.cpp: In function `void pthread_exit(void*)':
    usesimtls.cpp:354: warning: `noreturn' function does ret

    But I got segfault in running my program with MT2. MT1, MP1 and MP2
    are all OK.

    Clearly, something has changed between gcc 3.2 and gcc 3.4.
     
    , Aug 17, 2005
    #5
  6. wrote:

    []

    > And here is what I got from my workstation ( Dual AMD Opteron machine,
    > RHEL 3)
    >
    > [nan@eudyptula test]$ for t in 1 2; do g++ -DTHRD_NUM=$t
    > mt_map_test.cpp -lpthread ; time ./a.out; done
    >
    > real 0m1.390s
    > user 0m1.280s
    > sys 0m0.120s
    >
    > real 0m3.450s
    > user 0m5.320s
    > sys 0m1.170s
    >
    >
    > I expected that the 2 times should be roughly equal. But clearly I
    > experienced significant slowdown with 2 threads.The same also happened
    > to a dual Intel Xeon machine. I suspect the internal stl map
    > implementation is improper.
    >
    > I've spent hours googling without any answer. I really need advice from
    > a C++ expert. Thanks a lot.


    The problem may be in libstdc++ caching allocator. See
    http://gcc.gnu.org/onlinedocs/libstdc /20_util/allocator.html

    I ran your code on a Dual Xeon 2.8 box with caching disabled and
    enabled. Here are my results:

    my@devel:~/src/exp> cat /etc/issue
    Welcome to SuSE Linux 9.2 (i586) - Kernel \r (\l).
    my@devel:~/src/exp> uname -a
    Linux devel 2.6.11.4-20a-smp #1 SMP Wed Mar 23 21:52:37 UTC 2005 i686
    i686 i386 GNU/Linux
    my@devel:~/src/exp> g++ --version
    g++ (GCC) 3.3.4 (pre 3.3.5 20040809)
    Copyright (C) 2003 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is
    NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
    PURPOSE.

    my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
    -pthread ; GLIBCPP_FORCE_NEW=1 GLIBCXX_FORCE_NEW=1 time -p ./a.out;
    done
    real 1.16
    user 1.10
    sys 0.05
    real 1.37
    user 2.55
    sys 0.14

    my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
    -pthread ; time -p ./a.out; done
    real 1.12
    user 1.06
    sys 0.05
    real 3.48
    user 5.28
    sys 1.40

    In the former case with caching disabled it's clear from the real time
    numbers that the task is scaled well on the two processors.

    A little boost is gained using hoard allocator:

    my@devel:~/src/exp> for t in 1 2; do g++ -O3 -DTHRD_NUM=$t exp.cpp
    -pthread ; LD_PRELOAD=/usr/local/lib/libhoard.so:/usr/lib/libdl.so
    GLIBCPP_FORCE_NEW=1 GL\IBCXX_FORCE_NEW=1 time -p ./a.out; done
    real 1.15
    user 1.09
    sys 0.05
    real 1.31
    user 2.46
    sys 0.13
     
    Maxim Yegorushkin, Aug 17, 2005
    #6
  7. vic Guest

    Just to fix up the report output, I ran the following command,

    cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ; \
    uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14
    }' ; \
    gcc -v 2>&1 | grep gcc | awk '{print "gcc version : " $3 $4 }'; \
    for method in MT MP; do \
    for concurrency in 1 2 3 4 5 ; do \
    printf "%s / %s = " $method $concurrency ; \
    g++ -D$method=1 -DCONCURRENCY=$concurrency mt_map_test.cpp
    -lpthread ; \
    for samples in 1 2 3 4 5 6 7 8 9 ; do \
    /usr/bin/time ./a.out 2>&1 | \
    grep -v swaps | \
    sed -e "s/^\(.*\)user.*$/\1/g" ; \
    done | sort | head -5 | tail -1 ; \
    done ; \
    done

    The basic idea is to run multiple times and pull the median . . .

    For convenience I also added some conditional definitions,
    #ifndef CONCURRENCY
    #define CONCURRENCY 1
    #endif

    #if !defined( MP ) && !defined( MT )
    #define MT 1
    #endif

    My machine has only a single processor, so the results do not seem
    surprising,

    model name : AMD Athlon(tm) 64 Processor 2800+
    cpu MHz : 1808.843
    kernel name : Linux 2.6.12-1.1398_FC4 x86_64 GNU/Linux
    gcc version : 4.0.120050727
    MT / 1 = 1.30
    MT / 2 = 2.65
    MT / 3 = 4.01
    MT / 4 = 5.35
    MT / 5 = 6.47
    MP / 1 = 1.34
    MP / 2 = 2.69
    MP / 3 = 4.01
    MP / 4 = 5.35
    MP / 5 = 6.69
     
    vic, Aug 17, 2005
    #7
  8. Guest

    Thanks a lot. Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
    is. I dropped the processor binding code and ran the following script
    ( adapted from Vic's report script ) on 2 machines.

    $ cat /home/nan/measure.sh
    #!/bin/bash

    cat /proc/cpuinfo | egrep "(model name|cpu MHz)" ;
    uname -a | awk '{print "kernel name : " $1 " " $3 " " $11 " " $14
    }';
    gcc -v 2>& 1 | grep '^gcc' | awk '{print "gcc version : " $3 $4 }';
    printf "\n CD CE";

    for m in MT MP; do
    for t in $(seq 1 4); do
    printf "\n%s / %d = " $m $t;
    g++ -D$m -DCONCURRENCY=$t mt_map_test.cpp
    -lpthread;
    for cache in 'export' 'export -n' ; do
    for samples in $(seq 1 5); do
    $cache GLIBCPP_FORCE_NEW=1; $cache
    GLIBCXX_FORCE_NEW=1; /usr/bin/time -f %e ./a.out 2>&1;
    done | sort | head -3 | tail -1 | tr '\n' ' '
    done
    done
    done

    Machine 1
    $ sh measure.sh
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2791.459
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2791.459
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2791.459
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2791.459
    kernel name : Linux 2.6.9-1.667smp 2004 i386
    gcc version : 3.4.220041017

    CD CE
    MT / 1 = 1.69 1.68
    MT / 2 = 1.79 1.86
    MT / 3 = 2.85 2.93
    MT / 4 = 3.12 3.11
    MP / 1 = 1.54 1.54
    MP / 2 = 1.56 1.57
    MP / 3 = 3.49 2.45
    MP / 4 = 3.51 3.14

    Machine 2
    $ sh measure.sh
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    cpu MHz : 2387.948
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    cpu MHz : 2387.948
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    cpu MHz : 2387.948
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    cpu MHz : 2387.948
    kernel name : Linux 2.4.21-4.ELsmp 2003 i386
    gcc version : 3.2.320030502

    CD CE
    MT / 1 = 2.00 1.97
    MT / 2 = 2.19 4.93
    MT / 3 = 3.16 7.22
    MT / 4 = 3.75 9.94
    MP / 1 = 1.82 1.99
    MP / 2 = 1.84 2.00
    MP / 3 = 3.11 3.24
    MP / 4 = 3.73 4.25

    Note the above machines have 2 Hyperthreaded CPUs, so it only scales
    up to 2.
    For this particular program, it seems that MP is consistently faster
    than MT.
    And surprisingly, GLIBCPP_FORCE_NEW or GLIBCXX_FORCE_NEW has no effect
    on gcc 3.4.
     
    , Aug 17, 2005
    #8
  9. wrote:
    > Thanks a lot. Maxim. GLIBCPP_FORCE_NEW is exactly where the problem
    > is. I dropped the processor binding code and ran the following script
    > ( adapted from Vic's report script ) on 2 machines.


    []

    > Note the above machines have 2 Hyperthreaded CPUs, so it only scales
    > up to 2.
    > For this particular program, it seems that MP is consistently faster
    > than MT.
    > And surprisingly, GLIBCPP_FORCE_NEW or GLIBCXX_FORCE_NEW has no effect
    > on gcc 3.4.


    I've just checked out gcc 3.4.3 sources and GLIBCXX_FORCE_NEW check is
    still there.

    I ran a slightly modified script version to include results with Hoard
    allocator. Here CE/D stand for c++ caching allocator enabled/disabled,
    H stands for Hoard.

    my@devel:~/src/exp> ./measure
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2792.047
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2792.047
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2792.047
    model name : Intel(R) Xeon(TM) CPU 2.80GHz
    cpu MHz : 2792.047
    kernel name : Linux 2.6.11.4-20a-smp 2005 i386
    gcc version : 3.3.4

    CE CD CEH CDH
    MT / 1 = 1.63 1.66 1.63 1.65
    MT / 2 = 4.29 1.79 4.24 1.77
    MT / 3 = 6.16 2.53 6.23 2.44
    MT / 4 = 9.33 3.05 9.03 2.24
    MP / 1 = 1.62 1.49 1.63 1.51
    MP / 2 = 1.63 1.51 1.64 1.53
    MP / 3 = 2.29 2.14 2.30 2.17
    MP / 4 = 2.70 2.59 2.74 2.66
     
    Maxim Yegorushkin, Aug 18, 2005
    #9
  10. Guest

    It would be nice if you can try this again with gcc 3.4.3.
     
    , Aug 18, 2005
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Avin
    Replies:
    2
    Views:
    15,638
    Jerald Fijerald
    May 8, 2004
  2. Replies:
    6
    Views:
    611
    Pete Becker
    Apr 4, 2006
  3. Alexander Kotelnikov
    Replies:
    7
    Views:
    2,050
    Mathias Gaunard
    Nov 23, 2006
  4. kl
    Replies:
    7
    Views:
    1,307
    James Kanze
    Jan 1, 2008
  5. tutul
    Replies:
    1
    Views:
    500
    Victor Bazarov
    Mar 26, 2010
Loading...

Share This Page