Challenge
Avoid partial memory accesses. Consider a large load that follows a series of small stores to the same area of memory (beginning at memory address mem). The large load stalls in this case, as shown here:
mov mem, eax ; store dword to address "mem"
mov mem + 4, ebx ; store dword to address "mem + 4"
:
:
movq mm0, mem ; load qword at address "mem", stalls
The movq must wait for both stores to write to memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory).
Solution
Build data into a qword before storing it into memory. When you change the code sequence as shown here, the processor can access the data without delay:
movd mm1, ebx ; build data into a qword first,
; before storing it to memory
movd mm2, eax
psllq mm1, 32 ; move ebx data into the high dword
por mm1, mm2 ; combine with eax data in the low dword
movq mem, mm1 ; store SIMD variable to "mem" as
; a qword
:
:
movq mm0, mem ; load qword at address "mem", no stall
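The packing performed by the movd, psllq, and por instructions above can be modeled in portable C. This is only an illustrative sketch of the data flow, not the MMX code itself; the function name pack_qword and its parameters eax_val and ebx_val are hypothetical stand-ins for the two registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the movd / psllq / por sequence.
   eax_val is the dword destined for "mem",
   ebx_val is the dword destined for "mem + 4". */
uint64_t pack_qword(uint32_t eax_val, uint32_t ebx_val)
{
    uint64_t hi = (uint64_t)ebx_val << 32; /* movd mm1, ebx ; psllq mm1, 32 */
    uint64_t lo = (uint64_t)eax_val;       /* movd mm2, eax */
    return hi | lo;                        /* por mm1, mm2 */
}

/* Small self-check of the packing order (values chosen arbitrarily). */
void check_pack_qword(void)
{
    assert(pack_qword(0x11223344u, 0xAABBCCDDu) == 0xAABBCCDD11223344ull);
}
```

On a little-endian IA-32 target, storing the packed value once places eax_val at "mem" and ebx_val at "mem + 4", which is the same layout the two dword stores produced, so a later qword load reads data written by a single store of matching size.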
Also consider a series of small loads that follows a large store to the same area of memory (beginning at memory address mem). Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 of the IA-32 Intel® Architecture Optimization Reference Manual for more details. The case is shown here:
movq mem, mm0 ; store qword to address "mem"
:
:
mov bx, mem + 2 ; load word at "mem + 2", stalls
mov cx, mem + 4 ; load word at "mem + 4", stalls
The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). When you change the code sequence as shown here, the processor can access the data without delay:
movq mem, mm0 ; store qword to address "mem"
:
:
movq mm1, mem ; load qword at address "mem"
movd eax, mm1 ; copy low dword to eax from
; MMX register, not memory
psrlq mm1, 32 ; shift high dword down
shr eax, 16 ; ax = word from "mem + 2"
movd ebx, mm1 ; copy high dword to ebx from
; MMX register, not memory
and ebx, 0ffffh ; bx = word from "mem + 4"
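The extraction steps above can likewise be modeled in portable C. This is a sketch of the arithmetic only, under the assumption of IA-32 little-endian byte order; the helper names word_at_offset_2 and word_at_offset_4 are hypothetical, and q stands for the qword held in the MMX register after the single movq load.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: word that a load from "mem + 2" would return,
   taken from the already loaded qword q instead of from memory. */
uint16_t word_at_offset_2(uint64_t q)
{
    uint32_t eax = (uint32_t)q; /* movd eax, mm1 : low dword */
    eax >>= 16;                 /* shr eax, 16   : keep bytes 2..3 */
    return (uint16_t)eax;
}

/* Hypothetical helper: word that a load from "mem + 4" would return. */
uint16_t word_at_offset_4(uint64_t q)
{
    uint32_t ebx = (uint32_t)(q >> 32); /* psrlq mm1, 32 ; movd ebx, mm1 */
    ebx &= 0xffffu;                     /* and ebx, 0ffffh : keep bytes 4..5 */
    return (uint16_t)ebx;
}
```

For example, if the bytes at "mem" through "mem + 7" are 11h, 22h, ..., 88h, then q is 8877665544332211h and the two helpers return 4433h and 6655h, matching what the word loads from "mem + 2" and "mem + 4" would have read.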
These transformations generally increase the number of instructions required to perform the desired operation. For Pentium® II, Pentium® III, and Pentium® 4 processors, the benefit of avoiding forwarding problems outweighs the cost of the additional instructions, so the transformations are worthwhile.
Source
IA-32 Intel® Architecture Optimization Reference Manual