may_max11:I developed a simple program to access the performance of parallel_reduce and parallel_for against serial execution. the program simpy sums up the content of a large array (order of 10M) and display the seq in which objects are copied and joined. To my astonishment the parallized program is taking a lot more time compared to the serial program to sum up the contents. Please give me some explanation and solution...
It's almost always the case with these simple and naive parallel benchmark programs that their authors are astonished by bad parallel performance :)
Just imagine you are the compiler, and compare the two summing loops from this point of view. First, the serial loop:
for(size_t i=0; i<len; i++) {
sum += a[i];
}
It's very simple to optimize, having in mind that len is a constant to the compiler. For example, VC++ 2005 made the following code out of that:
fldz
lea eax, DWORD PTR [edi+010h]
mov ecx, 0x2625A0h
main+0x105: fadd QWORD PTR [eax-16]
add eax, 0x20h
sub ecx, 0x1h
fadd QWORD PTR [eax-40]
fadd QWORD PTR [eax-32]
fadd QWORD PTR [eax-24]
jnz main+0x105
This magic 0x2625A0 constant is 2500000, one fourth of the array length. There are four floating point additions, loop index (ecx) decrement, and bumping array pointer (eax) 4 elements forward. That's it, very small code which has exactly one memory load operation per iteration followed by one store at the end (not shown here). It's very efficient.
Now consider the TBB variant of it:
for(size_t i=r.begin(); i!=r.end(); i++) {
sum += a[i];
}
The compiler is much more conservative in interpreting this loop because its boundaries are variable, and it's length is unknown. Also sum is the object data field which possibly adds other restrictions. In the above serial code the compiler put sum onto a register, might be due to the fact that the code is in constructor and no data field can be visible until it finishes? Anyway, in the TBB code it did not do that:
execute+0x52: mov edi, DWORD PTR [edx]
lea edi, DWORD PTR [edi+ecx*8]
execute+0x57: fld QWORD PTR [edi]
add ecx, 0x1h
fadd QWORD PTR [edx+08h]
add edi, 0x8h
fstp QWORD PTR [edx+08h]
cmp ecx, DWORD PTR [esi+08h]
jnz execute+0x57
The array pointer is in edi, and you see in the first two lines it is loaded in two steps: first base, then incremented by offset. On each iteration, it is incremented by 0x8h, i.e. just one element. The loop variable (ecx) is also incremented by one, and compared to a memory location (range.end()). Finally, the result is saved back to sum each time, and consequently sum is loaded each time. So we have 3 memory loads per loop iteration, comparing to 1 in serial case.
Do you now understand better why your benchmark "favors" serial case?
Alexey Kukanov
TBB developer