Challenge
Optimize Absolute-Difference Motion Compensation on the Pentium® 4 Processor. One important technique used in video compression is to predict movement between consecutive frames. In many cases, the visual representation of a moving object stays the same from frame to frame, merely changing position within the viewing field. A substantially improved compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object separately in each frame. Calculating these vectors is called motion estimation, and it requires the calculation of an absolute difference between blocks of frames.
The motion-estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks and finds the best match. In MPEG1, for example, the calculation can be made in four ways. Either the absolute difference between the pixel values (L1 norm) or the square of the differences (L2 Norm) is summed. Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted, either by some number of whole pixel positions or by some number of half-pixel positions.
The sample below is a C code example for a 16 x 16-pixel full pel motion estimation using absolute differences. The code has a fast-out, so that if the difference so far accumulated across a subset of rows is more than the best match so far, then it aborts the rest of the absolute-difference calculation:
char *btpr; /* pointer to start row of 16x16 pixel block being compressed */
char *cptr; /* pointer to start row of 16x16 pixel reference block */
val = 0;
for (i=0; i<16; i++) {
for (j=0; j<16; j++) {
data = (*(bptr++)- *(cptr++));
if (data<0){val -= data;}
else {val += data;}
}
/* Fast out after this row if best match has been exceeded */
if (val > best_value) break;
/* Update pointer to next row */
bptr += (rm->width - 16);
/* Update pointer to next row */
cptr += (cm->width - 16);
}
Solution
Use the psubusb (packed byte subtract unsigned with saturation) Streaming SIMD Extensions 2 (SSE2) instruction to generate the absolute differences without requiring 16-bit precision. The flow of coding the motion-estimation inner loop using SSE2 instructions is shown in the figure below:

(click on the image above or click here for a larger view)
Usually, subtraction of two eight-bit unsigned numbers produces a nine-bit signed result. Thus, in order to retain precision, it may seem that a conversion to 16 bits is needed before the subtract operation, but such conversion is unnecessary when using the psubusb instruction. By subtracting source 2 from source 1 and then performing the opposite operation (on a copy of one of the sources), each result register has the absolute difference value for the case when a respective element in source 1 was larger than a respective element in source 2, and zero otherwise (because a negative result saturates to zero). The same result occurs for the opposite subtraction, which generates the absolute differences in those cases where the elements in source 2 were larger than the elements in source 1.
Performing an OR operation on the results of the two subtractions generates the eight desired absolute-difference results. This technique allows the absolute difference to be performed in byte precision, which is substantially faster than if it were done in 16-bit precision.
The SSE2 code for the inner-most computation of absolute difference is presented in the sample below. This code does not include the fast-out option that was incorporated into the C code given in the Challenge section. In general, this fast-out is not relevant after each 16 x 1 estimation, as the overhead cost of adding the fast-out for every iteration of the SSE2 instruction loop is prohibitive.
__asm {
movdqu xmm0, [m1]
movdqu xmm1, [m2]
movdqa xmm2, xmm0
psubusb xmm0, xmm1
psubusb xmm1, xmm2
por xmm0, xmm1
movdqa xmm1, xmm0
punpcklbw xmm0, xmm6
punpcklbw xmm1, xmm6
movdqa xmm3, xmm1
pshufd xmm1, xmm0, 238
pshufd xmm3, xmm0, 68
paddw xmm1, xmm3
movdqa xmm4, xmm1
pshufd xmm4, xmm4, 78
paddw xmm1, xmm4
}
Source
Absolute-Difference Motion Estimation for Intel® Pentium® 4 Processors