I have rerun the tests. The table in my first post is wrong, so this time I will explain my measurements and experiments more carefully. The tables below show the number of cores involved in the execution. For a thread count of N, N-1 threads are spawned that increment an atomic<long long> counter forever. One more thread is spawned that sleeps for 5.0 seconds and then increments the same atomic counter for 100,000,000 iterations while being timed. Each thread is presumed to be mapped to its own core, and there is no other load on the system during execution (I'm the only user).
The slowdown factor is calculated as the execution time for N threads divided by the execution time for 1 thread.
Here are the results for a single atomic<long long> value:
| #cores | Time (s) | Times Slower than 1 Core |
|---|---|---|
| 1 | 1.42 | 1 |
| 2 | 5.84 | 4.12 |
| 3 | 11.75 | 8.29 |
| 4 | 10.26 | 7.23 |
| 5 | 18.29 | 12.89 |
| 6 | 16.57 | 11.68 |
| 7 | 36.29 | 25.58 |
| 8 | 14.88 | 10.49 |
Note that the time for 8 cores is usually much longer (around 45-50 s); in this particular run it was substantially lower for some reason.
I rewrote the benchmark with the same idea in mind, but this time I gave each thread a copy of a class containing its own atomic<long long> value to increment. The classes were allocated into a concurrent_vector, which by default uses the cache_aligned_allocator. Unless I'm mistaken, this should place the atomic values on separate cache lines. Here are the results when the atomic values are on different cache lines:
| #cores | Time (s) | Times Slower than 1 Core |
|---|---|---|
| 1 | 1.42 | 1 |
| 2 | 5.91 | 4.16 |
| 3 | 15.08 | 10.62 |
| 4 | 22.62 | 15.94 |
| 5 | 1.43 | 1 |
| 6 | 5.14 | 3.62 |
| 7 | 8.24 | 5.81 |
| 8 | 26.27 | 18.51 |
Since the last thread is always the one measured, these results seem to imply that with 5 threads the measuring thread has its own core. The trend is interesting given that the machine has two quad-core processors: the times climb from 1 to 4 threads, reset at 5, and climb again up to 8. This is still not the performance I was expecting. I was hoping for scaling where each core takes the same amount of time to perform a fixed number of atomic increments on an atomic value on its own cache line.
Here is the code for the new test:
#include <tbb/atomic.h>
#include <tbb/tbb_thread.h>
#include <tbb/tick_count.h>
#include <tbb/concurrent_vector.h>
#include <iostream>
#include <cstdlib>
const long long numIncrementsForTrial = 100000000;
typedef tbb::atomic<long long> atomic_t;
// Class that infinitely increments a local atomic value
class InfinitelyIncrement
{
private:
    int _id;
    atomic_t _register;
public:
    InfinitelyIncrement(int id) : _id(id)
    {
    }
    void operator()()
    {
        std::cout << "InfinitelyIncrement ID: " << _id << " is running." << std::endl;
        while(1)
        {
            _register.fetch_and_increment();
        }
    }
};
// Class that increments a local atomic value for a fixed number of
// iterations, and times the duration of the operation.
class AtomicIncrementTimer
{
private:
    long long _increments;
    atomic_t _register;
public:
    AtomicIncrementTimer(long long increments) : _increments(increments) { }
    void operator()()
    {
        // Give the threads a bit of time to warm up, then start the atomic
        // increment timer
        tbb::this_tbb_thread::sleep( tbb::tick_count::interval_t(5.0) );
        std::cout << "Starting timer..." << std::endl;
        tbb::tick_count start, end;
        start = tbb::tick_count::now();
        for(long long i = 0; i < _increments; ++i)
        {
            _register.fetch_and_increment();
        }
        end = tbb::tick_count::now();
        std::cout << "Time: " << (end - start).seconds() << std::endl;
    }
};
int main(int argc, char** argv)
{
    int numThreads;
    if(argc != 2)
    {
        std::cout << "Please provide a number of threads." << std::endl;
        return 1; // non-zero exit code signals the usage error
    }
    numThreads = atoi(argv[1]);
    std::cout << "Running with " << numThreads << " threads." << std::endl;
    tbb::concurrent_vector< std::pair<tbb::tbb_thread*, InfinitelyIncrement*> > threadCollection;
    // Spawn the N-1 background threads that increment forever.
    for(int i = 0; i < numThreads - 1; ++i)
    {
        InfinitelyIncrement* tmpInfiniteIncrement = new InfinitelyIncrement(i);
        threadCollection.push_back( std::make_pair(new tbb::tbb_thread( *tmpInfiniteIncrement ), tmpInfiniteIncrement ) );
    }
    // Spawn the measured thread and wait for it to finish.
    AtomicIncrementTimer timerObject(numIncrementsForTrial);
    tbb::tbb_thread timerThread( timerObject );
    timerThread.join();
    // Don't bother reclaiming memory, because there is no way to terminate
    // the threads.
    return 0;
}