Intel® Software Network Knowledge Base Wiki




Avoid Partial Memory Accesses on 32-Bit Intel® Architecture

Version 3, Changed by LINDA SWINK on 3/21/2008
Created by: KYLEX.S.LEWIS@INTEL.COM

Challenge

Avoid partial memory accesses. Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load stalls in this case, as shown here:

mov mem, eax ; store dword to address "mem"
mov mem + 4, ebx ; store dword to address "mem + 4"
:
:
movq mm0, mem ; load qword at address "mem", stalls
The movq must wait for both stores to write to memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory).

Solution
Build data into a qword before storing it into memory. When you change the code sequence as shown here, the processor can access the data without delay:
movd mm1, ebx ; build data into a qword first 
; before storing it to memory 
movd mm2, eax 
psllq mm1, 32 
por mm1, mm2 
movq mem, mm1 ; store SIMD variable to "mem" as
; a qword
: 
: 
movq mm0, mem ; load qword at address "mem", no stall
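The same idea can be sketched in C, purely as an illustration: combine the two 32-bit halves in a register first, then write them with one 8-byte store. The names pack_qword, store_qword, and load_qword are hypothetical helpers, not part of the original article, and a compiler is free to choose different instructions than the MMX sequence above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the qword in a register, mirroring
   movd / psllq / por. "lo" plays the role of eax, "hi" of ebx. */
static uint64_t pack_qword(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Hypothetical helper: one 8-byte store instead of two 4-byte stores. */
static void store_qword(unsigned char *mem, uint32_t lo, uint32_t hi)
{
    assert(mem != NULL);
    uint64_t q = pack_qword(lo, hi);
    memcpy(mem, &q, sizeof q);
}

/* Hypothetical helper: the later 8-byte load matches the earlier
   store in size and address. */
static uint64_t load_qword(const unsigned char *mem)
{
    assert(mem != NULL);
    uint64_t q;
    memcpy(&q, mem, sizeof q);
    return q;
}
```

Calling store_qword(mem, a, b) and later load_qword(mem) keeps the store and the load the same width, which is the property the assembly sequence relies on.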
Also consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem) as shown here. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 of the IA-32 Intel® Architecture Optimization Reference Manual for more details.
movq mem, mm0 ; store qword to address "mem"
:
:
mov bx, mem + 2 ; load word at "mem + 2", stalls
mov cx, mem + 4 ; load word at "mem + 4", stalls
The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). When you change the code sequence as shown here, the processor can access the data without delay:
movq mem, mm0 ; store qword to address "mem"
:
:
movq mm1, mem ; load qword at address "mem"
movd eax, mm1 ; copy low dword to eax from
; MMX register, not memory
psrlq mm1, 32 ; move high dword into low half
shr eax, 16 ; ax now holds word from "mem + 2"
movd ebx, mm1 ; copy high dword to ebx from
; MMX register, not memory
and ebx, 0ffffh ; bx now holds word from "mem + 4"
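As a rough C illustration of the extraction step, the sketch below pulls a 16-bit word out of a qword that is already in a register, using a shift and a truncation instead of a second, narrower memory load. The helper name word_from_qword is hypothetical, and the mapping from byte offset to bit position assumes the little-endian layout of IA-32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the 16-bit word located byte_offset
   bytes into a qword held in a register (little-endian layout).
   Offsets above 6 would run past the end of the qword. */
static uint16_t word_from_qword(uint64_t q, unsigned byte_offset)
{
    assert(byte_offset <= 6);
    /* The shift plays the role of psrlq / shr; the cast to uint16_t
       plays the role of the "and ebx, 0ffffh" mask. */
    return (uint16_t)(q >> (8u * byte_offset));
}
```

With q holding the value loaded from mem, word_from_qword(q, 2) and word_from_qword(q, 4) correspond to the words the original code read from "mem + 2" and "mem + 4".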

These transformations generally increase the number of instructions required to perform the desired operation. For Pentium® II, Pentium® III, and Pentium® 4 processors, the benefit of avoiding store-forwarding problems outweighs the cost of the additional instructions, so the transformations are worthwhile.

Source

IA-32 Intel® Architecture Optimization Reference Manual


