Welcome to Intel® Software Network Quick Login | Join | Help |
Search in Intel® Software Network Forums
in Go

Atomic Increment Benchmark on Intel Hardware

Last post 06-20-2008, 10:32 AM by Raf_Schietekat. 10 replies.
Sort Posts: Previous Next
 06-16-2008, 9:17 PM 30257056  

Atomic Increment Benchmark on Intel Hardware

Hey all,

I've been reading "The Art of Multiprocessor Programming" which seems to have nothing but good things to say about atomic operations.  They sound nice, but as a pessimist I did a benchmark today (the code is at the bottom of this post).

I did a benchmark with a global atomic<long long> variable, which is incremented by N-1 worker threads infinitely, and the Nth thread performs an increment a fixed number of times (100,000,000) and times how long this takes.  This should demonstrate the worst performance for a highly-contested atomic variable.  The result is very poor, and I'm trying to understand why.... and what to do to improve performance.  The test was done on a Linux system, source was compiled with icc, and it's a dual-quad-core xeon with 64-bit extensions, and TBB 2.1.  Here are the results:


#cores
Time(s)
Final Counter Value
#increments/s
Speedup over 1 Core
1
1.43
10000000070141475.361
2
5.98
847055237141611090.922.02
3
11.8
90181216976405982.341.09
4
12.06
75033861162196502.90.89
5
15.18
83676505955139935.220.79
6
18.73
94541518050477870.50.72
7
18.01
94862742252678696.010.75
8
54.63
237987161543564811.920.62


I should really say that past two cores the atomic operations do not scale well.  I conducted another test, using distributed atomic operations, where private atomic<long long> variables were allocated with each object to be threaded, and each object was allocated with the cache_aligned_allocator to ensure that the atomic<long long> objects were on different cache lines.  Timing did not seem to improve much, I did not perform a detailed analysis as above.  I remember reading something about atomic operations locking the bus, and this would explain the numbers above.

It seems that highly contested atomic operations will have poor performance, worst yet, distributed atomic operations (as in many atomic variables on separate cache lines) could suffer from the same problem.

Comments?  It seems that wait-free data structures which use atomic operations will really require future hardware to have better (faster) support for them (if possible... hardware wise probably pretty hard, unless non-colliding atomic operations were ok).

Also, I love having tbb::tbb_thread for these kinds of benchmarks!

Source code here:

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), and each thread incrementing the same memory address


#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>

#include <iostream>
#include <cstdlib>
#include <vector>

const long long  numIncrementsForTrial = 100000000;

tbb::atomic<long long> atomicRegister;

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), and each thread incrementing the same memory address


#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>

#include <iostream>
#include <cstdlib>
#include <vector>

const long long  numIncrementsForTrial = 100000000;

tbb::atomic<long long> atomicRegister;



// Increment atomically for numIncrementsForTrial and time it, displaying the time taken
void timedIncrementer()
{
}


// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)// This tests atomic increment throughput by spawning N threads (supplied
// by the user), and each thread incrementing the same memory address


#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>

#include <iostream>
#include <cstdlib>
#include <vector>
// I forgot #include<utility> but icc compiled anyways
const long long  numIncrementsForTrial = 100000000;

tbb::atomic<long long> atomicRegister;

// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer

        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;

    }
};


int main(int argc, char** argv)
{

    int numThreads;

    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;

    }
    numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    std::cout << "Final Value: " << atomicRegister << std::endl;

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.

    return 0;



}

        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer

        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;

    }
};


int main(int argc, char** argv)
{

    int numThreads;

    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;

    }
    numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    std::cout << "Final Value: " << atomicRegister << std::endl;

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.

    return 0;



}


// Increment atomically for numIncrementsForTrial and time it, displaying the time taken
void timedIncrementer()
{
}


// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            atomicRegister.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer

        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            atomicRegister.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;

    }
};


int main(int argc, char** argv)
{

    int numThreads;

    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;

    }
    numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    std::cout << "Final Value: " << atomicRegister << std::endl;

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.

    return 0;



}

 
 06-18-2008, 12:47 AM 30257149 in reply to 30257056  

Re: Atomic Increment Benchmark on Intel Hardware

Maybe this is the wrong place to say this, but have you tested this on another processor that's not in the future? It seems only x86(_64) uses bus locking.

Of course this might colour the results of any benchmarks that compare locking and lock-free algorithms... so it would be interesting to see further test results (and/or informed opinions).

 
 06-18-2008, 9:04 AM 30257187 in reply to 30257149  

Re: Atomic Increment Benchmark on Intel Hardware

I'm not sure what you mean by "...on another processor that's not in the future".  When I said future hardware I meant that perhaps better hardware could be designed to support atomic operations.

What I'm specifically looking at is the throughput of atomic operations, since most (if not all) wait-free algorithms that I know of rely upon them.  Realistically atomic operations would be distributed across processors, and used quite frequently if wait-free algorithms depending upon them were used.

Quoting from the Intel manual "Volume 3A: System Programming Guide, Part 1"...


Section 7.1.4
"For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation
as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to insure that the operation is carried out atomically."

This seems to imply that the bus might lock even if the memory being operated upon is in the processor's cache.  The bus locking / unlocking could describe the performance that has been observed.
 
 06-18-2008, 1:07 PM 30257209 in reply to 30257187  

Re: Atomic Increment Benchmark on Intel Hardware

Bus locking went out in the 90's.

Try the example with a separate atomic variable, on separate cache lines, for each thread and I think you'll see it scale just fine.

Besides cache line transfers, there are other costs for atomic operation when they involve the x86 LOCK prefix (e.g. as for atomic increment), because the LOCK prefix takes away efficiencies of out-of-order pipelined execution:

  1. LOCKed instructions are serializing instructions.  As Section 7.4 of previously mentiond Volume 3A says, these instructions cannot execute until all prior instructions execute and pending stores are written.
  2. LOCKed instructions imply a full memory fence. Hence subsequent loads cannot be done ahead of time.
 
 06-18-2008, 10:41 PM 30257239 in reply to 30257209  

Re: Atomic Increment Benchmark on Intel Hardware

With "on another processor that's not in the future" I meant that you need not wait for "future hardware to have better (faster) support for [atomic operations]": several instruction sets that do not imply locking/serialisation can be tested right now, including from Intel (Itanium makes you choose between acquire and release semantics). I have not examined the test program like Arch, who however made a suggestion to test with separate cache lines, which you wrote did not seem to improve timing much but without providing code, so you appear to be out of synch with each other on that point.

As for efficiencies of LOCKed instructions: are the results of these instructions supposed to be visible to any observer, not just those participating in the coherent cache? That might make matters worse than otherwise, it seems. I think I have seen examples of architectures providing both coherent-cache-only fences and ones that go all the way to memory, and/or that distinguish between data-only traffic and the intricacies of self-modifying code.

 
 06-19-2008, 10:25 AM 30257286 in reply to 30257239  

Re: Atomic Increment Benchmark on Intel Hardware

I have performed the tests again.  The table in my first post is in error, so this time I will better explain my measures and experiments.  The charts below show the number of cores involved the execution.  For a number of threads N, N-1 threads will be spawned which increment an atomic <long long> counter to infinitity.  Another thread will be spawned which will delay for 5.0 seconds, before incrementing the same atomic counter for 100,000,000 iterations.  Each thread is presumed to be mapped to its own core, and there is no load on the system during execution (I'm the only user).

The times slowdown is calculated as the execution time for n threads / execution time for 1 thread.

Here are the results for a single atomic<long long> value:

#cores Time(s) Times Slower than 1 Core
1 1.42 1
2 5.84 4.12
3 11.75 8.29
4 10.26 7.23
5 18.29 12.89
6 16.57 11.68
7 36.29 25.58
8 14.88 10.49

Note that usually the time for 8 cores is much longer (around 45 - 50s), in this particular execution it was substantially lower for some reason.

I rewrote the benchmark with the same idea in mind, however this time I gave each thread a copy of a class which contained its own atomic<long long> value to increment.  The classes were allocated into a concurrent_vector, which by default uses the cache_aligned_allocator.  Unless I'm mistaken, this should result in the atomic values being placed onto separate cache lines.  Here are the results, when multiple atomic values are on different cache lines:

#cores Time(s) Times Slower than 1 Core
1 1.42 1
2 5.91 4.16
3 15.08 10.62
4 22.62 15.94
5 1.43 1
6 5.14 3.62
7 8.24 5.81
8 26.27 18.51


These results seem to imply that since the last thread is always the one measured, that when there are 5 threads the measuring thread has its own core.  There is an interesting trend here, since there are two quad-core processors.  This is still not the performance I was expecting, I was hoping to have a scaled performance where each core took the same amount of time to perform a fixed number of atomic increments on atomic values on its own cache line.

Here is the code for the new test:
#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>

#include <tbb/concurrent_vector.h>

#include <iostream>
#include <cstdlib>

const long long  numIncrementsForTrial = 100000000;

typedef tbb::atomic<long long> atomic_t;


// Class that infinitely increments a local atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments a local atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer

        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;

    }
};


int main(int argc, char** argv)
{

    int numThreads;

    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;

    }
    numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    tbb::concurrent_vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.

    return 0;



}


 
 06-19-2008, 1:10 PM 30257307 in reply to 30257286  

Re: Atomic Increment Benchmark on Intel Hardware

Doesn't cache_aligned_allocator mean that segments are aligned, not elements (unless each is an integral number of cache lines long)? You would need to pad the elements to get what you need, I suppose. Plus, in your code, the concurrent_vector contains pairs of pointers, while the structures containing the atomics are allocated by plain-vanilla "new".

 
 06-19-2008, 2:58 PM 30257312 in reply to 30257307  

Re: Atomic Increment Benchmark on Intel Hardware

Raf.. you're right... it scales perfectly now!  I'm not sure about cache_aligned_allocator actually, what I think it does changes from time to time....

Here's the latest code... it shows that atomic operations on different cache lines scale beautifully... which means there is a lot of hope for wait-free data structures and algorithms.

// This tests atomic increment throughput by spawning N threads (supplied
// by the user), and each thread incrementing the same memory address


#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>

#include <vector>

#include <iostream>
#include <cstdlib>

const long long  numIncrementsForTrial = 100000000;

typedef tbb::atomic<long long> atomic_t;


// Class that infinitely increments the global atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
    int padding[8000];
public:
    InfinitelyIncrement(int id) : _id(id)
    {

    }

    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};

// Class that increments the global atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
    int padding[8000];
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }

    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer

        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );

        std::cout << "Starting timer..." << std::endl;

        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;

    }
};


int main(int argc, char** argv)
{

    int numThreads;

    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 0;

    }
    numThreads = atoi(argv[1]);

    std::cout << "Running with " << numThreads << " threads." << std::endl;

    std::vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;

    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }

    AtomicIncrementTimer timerObject(numIncrementsForTrial);

    tbb::tbb_thread timerThread( timerObject );

    timerThread.join();

    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.

    return 0;



}



 
 06-19-2008, 11:07 PM 30257340 in reply to 30257312  

Re: Atomic Increment Benchmark on Intel Hardware

That's a lot of padding! You only need to pad up to a cache line. If you don't want to waste any memory, keep things neatly aligned (C++ will require you to jump through a few hoops to do that in a portable way); if you want quick-and-easy benchmark results, a fixed 64 bytes of padding should be enough (assuming ownership is based on lines and lines are 64 bytes), without any need for cache-related alignment.

I'm wondering how this could possibly be "perfectly" scalable on an architecture that fully orders atomic operations, because there will have to be some form of communication between the cores to make that happen, even without bus locking.

 
 06-20-2008, 8:31 AM 30257377 in reply to 30257340  

Re: Atomic Increment Benchmark on Intel Hardware

If two trees fall in a sequentially consistent forest, and no one sees them fall, did they fall in order?

The example scales perfectly because no communication is required.  Each thread holds its cacheline in the 'Modified' state, meaning no other processor can see it.  Lock-free algorithms do require communication.

A more representative, yet still simple, test would be "Counting Philosophers".  Arrange the N threads around a table and have pairs of adjacent threads communicate using a lock-free algorithm.  The simplest such algorithm is fetch-and-add on a shared counter.  So let each thread continuously increment the counter on its left and decrement the counter on its right.  See if that scales. 

I recommend the increment/decrement combination because it has a simple invariant.  Assuming the counters are zero to start with, after all threads are done the total sum must be zero.

 
 06-20-2008, 10:32 AM 30257388 in reply to 30257377  

Re: Atomic Increment Benchmark on Intel Hardware

By Schrödinger, you're right!
 
View as RSS news feed in XML

Shortcuts


Tags For This Post

...

Community Tags

...