Intel® Software Network Knowledge Base Wiki


Constructing Nav Tree
One Moment...

(refresh menu)



 
Welcome, Guest | Quick Login | Register

Develop for Core processor


Optimize Motion Compensation on the Pentium® 4 Processor

Version 4, Changed by LINDA SWINK on 3/21/2008
Created by: KYLEX.S.LEWIS@INTEL.COM

Challenge

Optimize Absolute-Difference Motion Compensation on the Pentium® 4 Processor. One important technique used in video compression is to predict movement between consecutive frames. In many cases, the visual representation of a moving object stays the same from frame to frame, merely changing position within the viewing field. A substantially improved compression ratio can be achieved by producing displacement vectors of the object from frame to frame, instead of compressing the object separately in each frame. Calculating these vectors is called motion estimation, and it requires the calculation of an absolute difference between blocks of frames.

The motion-estimation function sums the absolute differences (or squared differences) between the pixel values of two different 16x16 blocks and finds the best match. In MPEG1, for example, the calculation can be made in four ways. Either the absolute difference between the pixel values (L1 norm) or the square of the differences (L2 Norm) is summed. Orthogonal to the difference equation, this sum can be accumulated with respect to a reference block that has been shifted, either by some number of whole pixel positions or by some number of half-pixel positions.

The sample below is a C code example for a 16 x 16-pixel full pel motion estimation using absolute differences. The code has a fast-out, so that if the difference so far accumulated across a subset of rows is more than the best match so far, then it aborts the rest of the absolute-difference calculation:

	char *btpr; /* pointer to start row of 16x16 pixel block being compressed */ 
	char *cptr; /* pointer to start row of 16x16 pixel reference block */ 

	val = 0; 
	for (i=0; i<16; i++) { 
	for (j=0; j<16; j++) { 
	data = (*(bptr++)- *(cptr++)); 
	if (data<0){val -= data;}
	else {val += data;} 
	} 
	/* Fast out after this row if best match has been exceeded */ 
	if (val > best_value) break; 
	
	/* Update pointer to next row */ 
	bptr += (rm->width - 16); 
	
	/* Update pointer to next row */ 
	cptr += (cm->width - 16); 
	} 
	 

Solution


Use the psubusb (packed byte subtract unsigned with saturation) Streaming SIMD Extensions 2 (SSE2) instruction to generate the absolute differences without requiring 16-bit precision. The flow of coding the motion-estimation inner loop using SSE2 instructions is shown in the figure below:

(click on the image above or click here for a larger view)

Usually, subtraction of two eight-bit unsigned numbers produces a nine-bit signed result. Thus, in order to retain precision, it may seem that a conversion to 16 bits is needed before the subtract operation, but such conversion is unnecessary when using the psubusb instruction. By subtracting source 2 from source 1 and then performing the opposite operation (on a copy of one of the sources), each result register has the absolute difference value for the case when a respective element in source 1 was larger than a respective element in source 2, and zero otherwise (because a negative result saturates to zero). The same result occurs for the opposite subtraction, which generates the absolute differences in those cases where the elements in source 2 were larger than the elements in source 1.

Performing an OR operation on the results of the two subtractions generates the eight desired absolute-difference results. This technique allows the absolute difference to be performed in byte precision, which is substantially faster than if it were done in 16-bit precision.

The SSE2 code for the inner-most computation of absolute difference is presented in the sample below. This code does not include the fast-out option that was incorporated into the C code given in the Challenge section. In general, this fast-out is not relevant after each 16 x 1 estimation, as the overhead cost of adding the fast-out for every iteration of the SSE2 instruction loop is prohibitive.

	__asm { 
	movdqu xmm0, [m1]
	movdqu xmm1, [m2] 
	movdqa xmm2, xmm0 
	psubusb xmm0, xmm1 
	psubusb xmm1, xmm2
	por xmm0, xmm1
	movdqa xmm1, xmm0 
	punpcklbw xmm0, xmm6 
	punpcklbw xmm1, xmm6 
	movdqa xmm3, xmm1 
	pshufd xmm1, xmm0, 238  
	pshufd xmm3, xmm0, 68 
	paddw xmm1, xmm3
	movdqa xmm4, xmm1 
	pshufd xmm4, xmm4, 78 
	paddw xmm1, xmm4 
	} 
	 

Source


Absolute-Difference Motion Estimation for Intel® Pentium® 4 Processors

 



Served
23 Knowledge Bases
604 Pages
Search
Powering Up Search...


Vote on this Page

Tags For This Page
Loading Tags..

Tag This



Additional legal information