Challenge
Implement a horizontal add/subtract using SSE3 instructions. Most SIMD instructions operate vertically: the data element in position k of the result is a function of the data elements in position k of the instruction's operands. Horizontal instructions operate differently: contiguous data elements from the same operand are combined to produce each element of the result.
Packed horizontal add instructions are useful for evaluating dot products and matrix multiplications, and they simplify some SIMD computations on vectors that are arranged as arrays of structures.
Solution
Use the HADDPS instruction, as shown in the code below. HADDPS performs single-precision additions on contiguous data elements, and its first operand is also the destination. The first element of the result is the sum of the first and second elements of the first operand. The second element is the sum of the third and fourth elements of the first operand. The third element is the sum of the first and second elements of the second operand. The fourth element is the sum of the third and fourth elements of the second operand. For the subtract half of the challenge, SSE3 provides HSUBPS, which pairs the elements in the same way but subtracts instead of adding.
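The element pairing described above can be checked with a small scalar model. The C sketch below is not the instruction itself, only an illustration of its arithmetic; the type `vec4f` and the function `haddps_model` are hypothetical names introduced here, and lane 0 is assumed to be the lowest-order element of the register.

```c
/* Hypothetical 4-lane type standing in for one XMM register.
   lane[0] is assumed to be the lowest-order element. */
typedef struct { float lane[4]; } vec4f;

/* Scalar model of "haddps dst, src": returns the value that dst
   would hold after the instruction executes. */
static vec4f haddps_model(vec4f dst, vec4f src)
{
    vec4f r;
    r.lane[0] = dst.lane[0] + dst.lane[1];  /* 1st + 2nd of first operand  */
    r.lane[1] = dst.lane[2] + dst.lane[3];  /* 3rd + 4th of first operand  */
    r.lane[2] = src.lane[0] + src.lane[1];  /* 1st + 2nd of second operand */
    r.lane[3] = src.lane[2] + src.lane[3];  /* 3rd + 4th of second operand */
    return r;
}
```

Because the operands are passed by value, the model also covers the case where both operands are the same register, as in `haddps xmm0, xmm0`.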
The following example demonstrates computing the dot product of two four-component vectors; it can be adapted and extended to compute the product of 4x4 matrices:
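// An example that computes a four-component dot product and
// broadcasts the result to all four elements of xmm0.
// Vector1 and Vector2 must be 16-byte aligned for movaps.
movaps xmm0, Vector1    // xmm0 = [a0, a1, a2, a3]
movaps xmm1, Vector2    // xmm1 = [b0, b1, b2, b3]
mulps xmm0, xmm1        // xmm0 = [a0*b0, a1*b1, a2*b2, a3*b3] = [p0, p1, p2, p3]
haddps xmm0, xmm0       // xmm0 = [p0+p1, p2+p3, p0+p1, p2+p3]
haddps xmm0, xmm0       // xmm0 = p0+p1+p2+p3 in every element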
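// An example that computes two four-component
// dot products from four vectors:
// Vector1 . Vector2 and Vector3 . Vector4.
movaps xmm0, Vector1
movaps xmm1, Vector2
movaps xmm2, Vector3
movaps xmm3, Vector4
mulps xmm0, xmm1        // xmm0 = [p0, p1, p2, p3], products of Vector1 and Vector2
mulps xmm2, xmm3        // xmm2 = [q0, q1, q2, q3], products of Vector3 and Vector4
haddps xmm0, xmm2       // xmm0 = [p0+p1, p2+p3, q0+q1, q2+q3]
haddps xmm0, xmm0       // xmm0 = [dot12, dot34, dot12, dot34]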
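To make the register contents after each step easier to follow, both assembly sequences above can be mirrored in plain C. This is a minimal sketch under stated assumptions, not a replacement for the SSE3 code: every `float[4]` array stands in for one XMM register with index 0 as the lowest-order element, and the names `mul4`, `hadd4`, `dot4_broadcast`, and `dot4_pair` are hypothetical helpers introduced only for illustration.

```c
/* Scalar stand-in for "mulps dst, src": element-wise multiply. */
static void mul4(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] *= src[i];
}

/* Scalar stand-in for "haddps dst, src". The sums are computed into
   a temporary first so that dst and src may be the same array. */
static void hadd4(float dst[4], const float src[4])
{
    float r[4] = { dst[0] + dst[1], dst[2] + dst[3],
                   src[0] + src[1], src[2] + src[3] };
    for (int i = 0; i < 4; ++i)
        dst[i] = r[i];
}

/* Mirrors the first listing: out receives dot(v1, v2) in all four
   elements. */
static void dot4_broadcast(const float v1[4], const float v2[4],
                           float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = v1[i];          /* movaps xmm0, Vector1 */
    mul4(out, v2);               /* mulps  xmm0, xmm1    */
    hadd4(out, out);             /* haddps xmm0, xmm0    */
    hadd4(out, out);             /* haddps xmm0, xmm0    */
}

/* Mirrors the second listing: out receives
   [dot(v1, v2), dot(v3, v4), dot(v1, v2), dot(v3, v4)]. */
static void dot4_pair(const float v1[4], const float v2[4],
                      const float v3[4], const float v4[4],
                      float out[4])
{
    float x2[4];                 /* plays the role of xmm2 */
    for (int i = 0; i < 4; ++i) {
        out[i] = v1[i];          /* movaps xmm0, Vector1 */
        x2[i]  = v3[i];          /* movaps xmm2, Vector3 */
    }
    mul4(out, v2);               /* mulps  xmm0, xmm1    */
    mul4(x2, v4);                /* mulps  xmm2, xmm3    */
    hadd4(out, x2);              /* haddps xmm0, xmm2    */
    hadd4(out, out);             /* haddps xmm0, xmm0    */
}
```

In the two-product version, the first horizontal add combines partial sums from both products into one register, so the second horizontal add finishes both dot products at once.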
Source
Next Generation Intel® Processor: Software Developers Guide.