Intel® Software Network Knowledge Base Wiki




Get Faster Video Rendering on the Intel® Pentium® 4 Processor

Version 9, Changed by LINDA SWINK on 7/3/2008
Created by: KYLEX.S.LEWIS@INTEL.COM

By Eric L. Palmer

Abstract

This document describes how to make sure that your video rendering code is as fast as it can be on Intel® Pentium® 4 processor-based systems. This method applies to any application that handles data in a YUV 4:2:0 color-space and renders the video pictures to a display device of a different color-space. The key point is that the data needs to be written to the output buffer strictly in a linear order to optimize the performance of the memory sub-system. If this is not done, it is possible that your code may run slower on a faster Pentium® 4 processor-based system. The coding pitfall to be avoided here is called a write-combining order violation. Also presented is a related data-organization optimization for video codecs that do Motion Compensation or Motion Estimation that can lead to a significant application-level speedup.

Introduction

If you measure the time a part of your application takes to execute on a 1.7GHz Pentium® 4 processor, and on a 2.4GHz Pentium® 4 processor, you would expect the time on the 2.4GHz system to be less, or at least the same, right? If your application processes planar YUV data, such as YUV 4:2:0 or YV12 data, the part of your application that is responsible for sending the images to the display may have a problem that is so severe that it could actually take longer on a 2.4GHz system than on a 1.7GHz system. The speed of video rendering is limited by the speed of your system's main memory and by the AGP bus speed, so if everything else is the same except the frequency of the CPUs in two systems, the video rendering speed should be the same. This document describes how to identify whether your code has a write-combining order violation that causes a slowdown, and if so, how to eliminate it and get the best possible performance on any Pentium® 4 processor-based system.


YUV 4:2:0 Memory Layout

Applications that use planar YUV data include MPEG-1/2/4, M-JPEG, and DV codecs, as well as applications that call on these codecs such as DVD-playback software or video-editing software. This data is called planar because it is stored in three different blocks of memory, as shown to the right. The YUV 4:2:0 data format has one U sample and one V sample for every four Y samples, thus the U and V planes are each ¼ the size of the Y plane. The U and V samples are logically in the center of four Y samples as shown below.
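The plane sizes described above can be worked out with a few lines of C. This is only a sketch: the `Yuv420Layout` type and both helper functions are hypothetical names introduced here, and the code assumes even dimensions and planes with no row padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not from the article): plane sizes for one
   YUV 4:2:0 frame, assuming no padding at the end of each row. */
typedef struct {
    size_t y_size;  /* bytes in the Y plane */
    size_t c_size;  /* bytes in each chroma plane (U or V) */
    size_t total;   /* bytes in the whole frame */
} Yuv420Layout;

static Yuv420Layout yuv420_layout(int width, int height)
{
    Yuv420Layout l;
    assert(width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);
    l.y_size = (size_t)width * (size_t)height;
    /* One U and one V sample per 2x2 block of Y samples. */
    l.c_size = (size_t)(width / 2) * (size_t)(height / 2);
    l.total  = l.y_size + 2 * l.c_size;
    return l;
}

/* Index, within the U or V plane, of the chroma sample shared by the
   luma sample at column x, row y. */
static size_t chroma_index(int width, int x, int y)
{
    return (size_t)(y / 2) * (size_t)(width / 2) + (size_t)(x / 2);
}
```

For a 720x480 frame (an example size chosen here), each chroma plane comes out at one quarter of the Y plane, and the four luma samples of a 2x2 block map to the same chroma index.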

Figure: YUV 4:2:0 Sample Locations

Not all PC-based video display hardware directly supports the display of YUV 4:2:0 video, so it is common to convert it into the YUV 4:2:2 format or into a regular RGB format. Video codecs that employ motion compensation may want to convert to another output format for performance reasons, as described in the Further Optimization section below. Another reason for converting formats is to make software-based post-processing, such as deinterlacing, easier and more efficient.

Understanding Write Combining

Before trying to identify a write-combining order violation, it helps to understand what write-combining memory is and how it differs from write-back memory. There are three main types of memory in a PC: Write-back (WB), Uncacheable (UC), and Write-combining (WC or USWC). Write-back memory is the type normally used by most programs. Data written to WB memory is first written to the processor's cache, and is written to memory when it is evicted from the cache. On the Pentium® 4 processor, 64 bytes of data are written from the cache at a time. Data written to UC memory is written directly to memory and is never allowed to reside in the processor's cache. This data is written to memory 4 bytes at a time, and these memory writes are much less efficient than the 64-byte writes that occur with WB memory. The WC memory type was developed in conjunction with AGP to allow applications to write data to the UC memory on AGP devices (video cards), but with performance more like that of WB memory.

To do this, the processor has a small number of 64-byte write-combining buffers that are the initial recipients of all WC writes (see figure). When each buffer is filled, it writes the entire 64 bytes of data to memory. To achieve maximum WC performance, the data must be written to contiguous linear addresses. If, for example, only 4 bytes out of every 64-byte contiguous memory region are written to each WC buffer, then the contents of the buffers will be evicted with 4-byte memory writes. Any time a WC buffer is evicted when it is not full (called a WC Partial Write), its contents are written to memory with very inefficient 4-byte writes. If this occurs in your code, you will see a 4x or more slowdown, so it is very much worth some effort to avoid WC Partial Writes!
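To put rough numbers on the difference, the sketch below counts memory writes for a buffer of a given size in the two extreme cases, using the 64-byte buffer size and 4-byte partial-write size stated above. The function names are hypothetical, and the result is a count of write transactions under those assumptions, not a timing prediction.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the text: 64-byte WC buffers, 4-byte partial writes. */
enum { WC_BUFFER_BYTES = 64, PARTIAL_WRITE_BYTES = 4 };

/* Best case: every WC buffer is completely filled before it is evicted,
   so each 64 bytes of output costs one memory write. */
static size_t full_line_writes(size_t bytes)
{
    assert(bytes % WC_BUFFER_BYTES == 0);
    return bytes / WC_BUFFER_BYTES;
}

/* Worst case assumed here: every buffer is evicted early and all of the
   data reaches memory as 4-byte writes. */
static size_t partial_writes(size_t bytes)
{
    assert(bytes % PARTIAL_WRITE_BYTES == 0);
    return bytes / PARTIAL_WRITE_BYTES;
}
```

For one 720x480 YUY2 frame (an example size chosen here, 2 bytes per pixel), that is 10,800 full writes against 172,800 partial writes, a 16-fold difference in the number of transactions.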

Finding Write-Combining Order Violations in Video-Rendering Code

Write-combining order violations occur when you have a piece of code that writes data to WC memory in an order that causes a large number of those writes to be WC Partial Writes. As mentioned above, WC Partial Writes are very inefficient — they cripple the performance of the AGP memory interface, and can cause your program to crawl where it should be flying. First, find the part of your code that writes to WC memory. In a DirectShow filter that is connected to the input of the overlay mixer, this will be where data is written to the memory associated with the filter's output pin. In a DirectX application, find the function that writes to the memory referenced by a pSurface->Lock() call. Examine such code as follows:

  1. Verify that the WC memory is never read from. Reads from WC memory are even slower than WC Partial Writes, and your application should be designed such that they never occur. If you need to read the image data to modify it or create a new image, keep a copy of the data that will need to be read in regular WB memory, and then copy it to the WC memory.
  2. Verify that the function(s) writing to the WC memory write to addresses that are contiguous and increasing. This means that only one row of the image should be written at a time. If your algorithm needs to process multiple output rows at a time, make sure that it uses temporary buffers in WB memory, and then copies each row to the WC destination.

    Note: You can use the WC Partial Write counter in VTune™ to verify that you have reduced the number of WC Partial Writes by removing WC order violations.
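Both checks above come down to the same pattern: build the image in ordinary WB memory, then copy it to the WC destination one row at a time in increasing address order. A minimal sketch follows. The function name and parameters are hypothetical, and `memcpy` is used only for brevity; the Performance Summary notes that `memcpy` was slow in the C measurements, so a tuned copy routine may be preferable in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a finished image from a WB staging buffer to a destination surface,
   one row at a time, in increasing address order. The destination is only
   ever written, never read. Pitches are in bytes and may exceed row_bytes. */
static void copy_rows_linear(unsigned char *dst, size_t dst_pitch,
                             const unsigned char *src, size_t src_pitch,
                             size_t row_bytes, size_t rows)
{
    size_t r;
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (r = 0; r < rows; r++) {
        memcpy(dst, src, row_bytes);  /* contiguous, ascending writes */
        dst += dst_pitch;             /* skip destination row padding */
        src += src_pitch;
    }
}
```

Any processing that needs to read pixels back, or to touch several output rows at once, would operate on the staging buffer before this copy runs.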

Case Study — YUV 4:2:0 to YUV 4:2:2 (YUY2) Output Format Conversion

Now we will look specifically at the case of WC order violations in the rendering part of an MPEG-2 video decoder. The decoder stores video internally in the planar YUV 4:2:0 format. The data is rendered to the video overlay in the YUV 4:2:2 format corresponding to the YUY2 fourCC code. The decoder could render the data in the YV12 format corresponding to YUV 4:2:0, but YV12 is not supported on all systems, and since each row of the YUY2 format is independent, it is easier to perform deinterlacing while converting to YUY2 (deinterlacing implementation not shown here). The figure below depicts the conversion. Look at the original C code for the conversion function, YV12toYUY2.


    void YV12toYUY2(BYTE *curY, BYTE *curU, BYTE *curV,
                    BYTE *pDst, int XSize, int YSize,
                    int srcpitch, int dstpitch /* bytes wide for YUY2 surface */) {
        int row, col;
        int dstpadbytes = dstpitch - 2*XSize;
        int srcpadbytes = srcpitch - XSize;

        for (row=0; row < (int)YSize; row += 2) {
            // Original, partial writes
            for (col=0; col < (int)XSize; col += 2) {
                // first row, YUYV
                *pDst = *curY;
                *(pDst+1) = *curU;
                *(pDst+2) = *(curY+1);
                *(pDst+3) = *curV;

                // second row, YUYV
                // WC ORDER VIOLATION! These four writes land one row ahead
                // of the four writes above.
                *(pDst+dstpitch) = *(curY+srcpitch);
                *(pDst+dstpitch + 1) = *curU;
                *(pDst+dstpitch + 2) = *(curY+srcpitch+1);
                *(pDst+dstpitch + 3) = *curV;

                pDst += 4;
                curY += 2;
                curU++;
                curV++;
            }
            // output at end of first row,
            // jump to start of third row
            pDst += dstpadbytes + dstpitch;
            curY += srcpadbytes + srcpitch;
            curU += srcpadbytes >> 1;
            curV += srcpadbytes >> 1;
        }
    }

 

Notice the WC order violation in the code above: four bytes are written to the first row and then four bytes to the second row. Because the AGP memory cannot keep up with the rate at which faster Pentium® 4 processor-based systems can write the data, this causes early write-combining buffer evictions and the dreaded WC Partial Writes. The code below shows how the order violations are removed by using a temporary buffer in WB memory.

    // Requires <string.h> for memcpy. pTmp must point to at least 2*XSize
    // bytes of ordinary (WB) memory.
    void YV12toYUY2_tmp(BYTE *curY, BYTE *curU, BYTE *curV,
                        BYTE *pDst, BYTE *pTmp, int XSize, int YSize,
                        int srcpitch, int dstpitch /* bytes wide for YUY2 surface */) {
        int row, col;
        int dstpadbytes = dstpitch - 2*XSize;
        int srcpadbytes = srcpitch - XSize;
        BYTE *plbuf;

        for (row=0; row < (int)YSize; row += 2) {
            plbuf = pTmp;
            for (col=0; col < (int)XSize; col += 2) {
                // first row, YUYV: written directly to WC memory, in order
                *pDst = *curY;
                *(pDst+1) = *curU;
                *(pDst+2) = *(curY+1);
                *(pDst+3) = *curV;

                // second row, YUYV: staged in the WB temporary buffer
                // NO WC order violation
                *plbuf = *(curY+srcpitch);
                *(plbuf+1) = *curU;
                *(plbuf+2) = *(curY+srcpitch+1);
                *(plbuf+3) = *curV;

                pDst += 4;
                plbuf += 4;
                curY += 2;
                curU++;
                curV++;
            }
            // first row is complete; skip its padding to reach the second row
            pDst += dstpadbytes;
            // write the staged second row in one linear pass
            memcpy(pDst, pTmp, XSize*2);
            // jump to start of third row
            pDst += dstpitch;
            // use the source pitch (not XSize) so padded source rows work too
            curY += srcpadbytes + srcpitch;
            curU += srcpadbytes >> 1;
            curV += srcpadbytes >> 1;
        }
    }

See Appendix 1 for an SSE-2 optimized version of the above function.
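The SSE-2 version builds each YUY2 pixel pair with shifts and ORs, as its register-layout comments show (U shifted left by 8 bits, V by 24, within each 32-bit lane). The scalar sketch below shows the same packing for a single pixel pair. The function names are hypothetical, and `store_yuyv` assumes a little-endian host such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack two luma samples and their shared U and V samples into one 32-bit
   word: Y0 in bits 0-7, U in bits 8-15, Y1 in bits 16-23, V in bits 24-31. */
static uint32_t pack_yuyv(uint8_t y0, uint8_t u, uint8_t y1, uint8_t v)
{
    return (uint32_t)y0
         | ((uint32_t)u  << 8)
         | ((uint32_t)y1 << 16)
         | ((uint32_t)v  << 24);
}

/* Store the packed word with a single 4-byte write. On a little-endian
   host (an assumption) the bytes land in memory as Y0 U Y1 V, which is
   the YUY2 byte order. */
static void store_yuyv(uint8_t *dst, uint8_t y0, uint8_t u, uint8_t y1, uint8_t v)
{
    uint32_t w = pack_yuyv(y0, u, y1, v);
    memcpy(dst, &w, sizeof w);
}
```

The SSE-2 code performs this same combination on four 32-bit lanes at once, so each 16-byte store covers eight pixels.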



 

Further Optimization

Though the internal format of the MPEG decoder's data is YUV 4:2:0, it is not the best format to use when optimizing for performance on the Pentium® 4 processor with SSE-2. Motion Compensation (MC) takes a significant portion of the time of an MPEG-2 decoder, and can be optimized using the 16-byte integer SIMD instructions in SSE-2. The Y data is processed in 16x16 blocks, matching the 16-byte instructions perfectly. The U and V data, however, only contain 8 bytes in each row of the 8x8 U and V blocks. In order to process the U and V data just as efficiently as the Y data, convert the internal data format to the "Combined-UV" format. This means that instead of a plane of U data and a plane of V data, there is one plane of UV data. To do this, store the data with the U and V bytes interleaved (U V U V U V…). Then in MC, process 16 bytes of UV data per row of 16x8 UV blocks. Since the internal data is now in a format that no video hardware recognizes, it needs to be converted. The conversion is very similar to YV12toYUY2, however, and an SSE-2 optimized YCUVtoYUY2 function is included in Appendix 2. In the table in the Performance Summary section, note that the Combined-UV conversion is actually faster than the regular YV12toYUY2.
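A plain C sketch of the Combined-UV idea follows. It is not the Appendix 2 code: both function names are hypothetical, and the planes are assumed to have no row padding. The first function interleaves separate U and V planes; the second converts one luma row plus its Combined-UV row to YUY2, writing the destination strictly in increasing address order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleave separate U and V planes into one U V U V ... plane.
   Assumes unpadded planes of chroma_width * chroma_height bytes each. */
static void interleave_uv(const uint8_t *u, const uint8_t *v, uint8_t *uv,
                          size_t chroma_width, size_t chroma_height)
{
    size_t n = chroma_width * chroma_height;
    size_t i;
    for (i = 0; i < n; i++) {
        uv[2 * i]     = u[i];
        uv[2 * i + 1] = v[i];
    }
}

/* Convert one luma row and its Combined-UV row to YUY2 (Y0 U Y1 V).
   A Combined-UV row is as many bytes wide as the luma row, so the U and V
   for the pixel pair starting at x sit at uv[x] and uv[x + 1]. */
static void ycuv_row_to_yuy2(const uint8_t *y, const uint8_t *uv,
                             uint8_t *dst, size_t width)
{
    size_t x;
    assert(width % 2 == 0);
    for (x = 0; x < width; x += 2) {
        dst[0] = y[x];
        dst[1] = uv[x];      /* U for this pixel pair */
        dst[2] = y[x + 1];
        dst[3] = uv[x + 1];  /* V for this pixel pair */
        dst += 4;
    }
}
```

Because the Combined-UV row has the same byte width as the Y row, the Y and UV data can be walked with the same index, which is part of what makes 16-byte SIMD processing of the chroma data straightforward.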

This Combined-UV optimization is a good way to get a small speedup on a codec that has already been optimized using SSE-2 instructions. In the test case below, the speedup is 4%, which may seem small, but it is significant because the original code is already well optimized such that it is very difficult to get any additional speedup. The Combined-UV method helps by allowing for the use of 16-byte SIMD integer instructions, and by providing a more efficient pattern of memory accesses.

MPEG-2 Decoder mode                  Overall FPS   Speedup
YUV 4:2:0, 3 planes                  121.4         -
YUV 4:2:0, Combined UV (2 planes)    125.7         1.04



Performance Summary

The table below shows the performance of the versions of the YV12toYUY2 function discussed above, measured on a 2.4GHz Pentium® 4 processor. The SSE-2 optimized version runs over 40% faster with the WC order violation removed. Note that the C version with the order violation removed is slower because it calls the memcpy function, which is very inefficient.

YV12toYUY2                                          Frames per second
Original C version                                  625
C version, removed WC order violation               389
SSE-2 version                                       564
SSE-2 version, removed WC order violation           808
SSE-2 version, Combined-UV, no WC order violation   841



The figure below shows how the SSE-2 version of YV12toYUY2 performs with and without the WC order violation. Notice that with the WC order violation (top line), it is actually slower at 2.0GHz than at 1.7GHz. The bottom line shows that the fully optimized version is limited by memory, and gets little to no speedup as frequency increases.


Appendix 1: SSE-2 Optimized YV12toYUY2

    // Requires <emmintrin.h> for the SSE-2 intrinsics.
    void YV12toYUY2_SSE2_tmp(BYTE *curY, BYTE *curU, BYTE *curV,
                             BYTE *pDst, BYTE *pTmp, int XSize, int YSize,
                             int srcPitch, int dstPitch /* bytes wide for YUY2 surface */) {
        int row, col;
        int XSize_2 = XSize >> 1;
        int srcPitch_2 = srcPitch >> 1;

        __m128i vzero;
        __m128i vtmp0, vtmp1, vtmp2, vtmp3, vtmp4, vtmp5, vtmp6;

        vzero = _mm_setzero_si128();

        for (row=0; row < YSize; row += 2) {
            // watch for buffer size issues
            for (col=0; col < XSize_2; col += 16) {
                vtmp0 = _mm_load_si128((__m128i*)(curY+2*col));           // Load 16 Y's, row 0
                vtmp1 = _mm_loadl_epi64((__m128i*)(curU+col));            // Load 8 U's
                vtmp2 = _mm_loadl_epi64((__m128i*)(curV+col));            // Load 8 V's
                vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch));  // Load 16 Y's, row 1

                vtmp3 = vtmp0;
                vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);   // __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0
                vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);   // __U7__U6__U5__U4__U3__U2__U1__U0
                vtmp2 = _mm_unpacklo_epi8(vtmp2, vzero);   // __V7__V6__V5__V4__V3__V2__V1__V0
                vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);   // __YF__YE__YD__YC__YB__YA__Y9__Y8

                vtmp4 = vtmp1;
                vtmp1 = _mm_unpacklo_epi16(vtmp1, vzero);  // ______U3______U2______U1______U0
                vtmp5 = vtmp2;
                vtmp2 = _mm_unpacklo_epi16(vtmp2, vzero);  // ______V3______V2______V1______V0

                vtmp4 = _mm_unpackhi_epi16(vtmp4, vzero);  // ______U7______U6______U5______U4
                vtmp5 = _mm_unpackhi_epi16(vtmp5, vzero);  // ______V7______V6______V5______V4

                vtmp1 = _mm_slli_epi32(vtmp1, 8);          // ____U3______U2______U1______U0__
                vtmp2 = _mm_slli_epi32(vtmp2, 24);         // V3______V2______V1______V0______
                vtmp4 = _mm_slli_epi32(vtmp4, 8);          // ____U7______U6______U5______U4__
                vtmp5 = _mm_slli_epi32(vtmp5, 24);         // V7______V6______V5______V4______

                // All 8 xmm regs used
                vtmp0 = _mm_or_si128(vtmp0, vtmp1);        // __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0
                vtmp3 = _mm_or_si128(vtmp3, vtmp4);        // __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8
                vtmp0 = _mm_or_si128(vtmp0, vtmp2);        // V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0
                vtmp3 = _mm_or_si128(vtmp3, vtmp5);        // V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8

                _mm_stream_si128((__m128i*)(pDst+4*col), vtmp0);     // store first 8 pixels of row 0
                vtmp0 = vtmp6;
                _mm_stream_si128((__m128i*)(pDst+4*col+16), vtmp3);  // store second 8 pixels of row 0

                vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);   // __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1
                vtmp0 = _mm_unpackhi_epi8(vtmp0, vzero);   // __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1

                vtmp6 = _mm_or_si128(vtmp6, vtmp1);        // __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0
                vtmp0 = _mm_or_si128(vtmp0, vtmp4);        // __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8
                vtmp6 = _mm_or_si128(vtmp6, vtmp2);        // V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0
                vtmp0 = _mm_or_si128(vtmp0, vtmp5);        // V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8

                _mm_store_si128((__m128i*)(pTmp+4*col), vtmp6);     // store first 8 pixels of row 1
                _mm_store_si128((__m128i*)(pTmp+4*col+16), vtmp0);  // store second 8 pixels of row 1

                // ------------ Second set ---------------
                vtmp0 = _mm_load_si128((__m128i*)(curY+2*col+16));           // Load 16 Y's, row 0
                vtmp1 = _mm_loadl_epi64((__m128i*)(curU+col+8));             // Load 8 U's
                vtmp2 = _mm_loadl_epi64((__m128i*)(curV+col+8));             // Load 8 V's
                vtmp6 = _mm_load_si128((__m128i*)(curY+2*col+srcPitch+16));  // Load 16 Y's, row 1

                vtmp3 = vtmp0;
                vtmp0 = _mm_unpacklo_epi8(vtmp0, vzero);   // __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0
                vtmp1 = _mm_unpacklo_epi8(vtmp1, vzero);   // __U7__U6__U5__U4__U3__U2__U1__U0
                vtmp2 = _mm_unpacklo_epi8(vtmp2, vzero);   // __V7__V6__V5__V4__V3__V2__V1__V0
                vtmp3 = _mm_unpackhi_epi8(vtmp3, vzero);   // __YF__YE__YD__YC__YB__YA__Y9__Y8

                vtmp4 = vtmp1;
                vtmp1 = _mm_unpacklo_epi16(vtmp1, vzero);  // ______U3______U2______U1______U0
                vtmp5 = vtmp2;
                vtmp2 = _mm_unpacklo_epi16(vtmp2, vzero);  // ______V3______V2______V1______V0

                vtmp4 = _mm_unpackhi_epi16(vtmp4, vzero);  // ______U7______U6______U5______U4
                vtmp5 = _mm_unpackhi_epi16(vtmp5, vzero);  // ______V7______V6______V5______V4

                vtmp1 = _mm_slli_epi32(vtmp1, 8);          // ____U3______U2______U1______U0__
                vtmp2 = _mm_slli_epi32(vtmp2, 24);         // V3______V2______V1______V0______
                vtmp4 = _mm_slli_epi32(vtmp4, 8);          // ____U7______U6______U5______U4__
                vtmp5 = _mm_slli_epi32(vtmp5, 24);         // V7______V6______V5______V4______

                // All 8 xmm regs used
                vtmp0 = _mm_or_si128(vtmp0, vtmp1);        // __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0
                vtmp3 = _mm_or_si128(vtmp3, vtmp4);        // __YFU7YE__YDU6YC__YBU5YA__Y9U4Y8
                vtmp0 = _mm_or_si128(vtmp0, vtmp2);        // V3Y7U3Y6V2Y5U2Y4V1Y3U1Y2V0Y1U0Y0
                vtmp3 = _mm_or_si128(vtmp3, vtmp5);        // V7YFU7YEV6YDU6YCV5YBU5YAV4Y9U4Y8

                _mm_stream_si128((__m128i*)(pDst+4*col+32), vtmp0);  // store first 8 pixels of row 0
                vtmp0 = vtmp6;
                _mm_stream_si128((__m128i*)(pDst+4*col+48), vtmp3);  // store second 8 pixels of row 0

                vtmp6 = _mm_unpacklo_epi8(vtmp6, vzero);   // __Y7__Y6__Y5__Y4__Y3__Y2__Y1__Y0, row 1
                vtmp0 = _mm_unpackhi_epi8(vtmp0, vzero);   // __YF__YE__YD__YC__YB__YA__Y9__Y8, row 1

                vtmp6 = _mm_or_si128(vtmp6, vtmp1);        // __Y7U3Y6__Y5U2Y4__Y3U1Y2__Y1U0Y0
                vtmp0 = _mm_or_si128(vtmp0, vtmp4);