Hi, thank you for your question. It touches on many aspects of performance, from individual instructions to algorithmic issues, and any one of them may be worth a long discussion, so my reply may ramble a bit. I'll start with the instruction-level issues, which depend on the microarchitecture.
Latency and throughput of MOVDQU. Basically, different microarchitectures handle unaligned loads with different performance characteristics. Here's what I've found on several recent microarchitectures. Please note that results for this instruction alone can be complicated by factors such as address alignment and locality. To keep it simple, I'll compare just two address alignments (aligned means an offset of 0 bytes; unaligned means an offset of 57 bytes, so the microarchitecture must handle a cacheline split), with the data resident in the L1 data cache. These are the best estimates I have, based on software measurements on five machines, in this order: Northwood (2001), Prescott (2004), Merom (2006), Penryn (2007), and a 2005 Opteron (the only Opteron chip I have access to).
MOVDQA Throughput (no cacheline split): Nwd: 2, Psc: 1, Merom: 1, Penryn: 1, Opteron: 2
MOVDQU Throughput (no cacheline split): Nwd: 4, Psc: 2, Merom: 2, Penryn: 2, Opteron: 2.5~3
MOVDQU Latency (no cacheline split): Nwd/Psc: ~mid-20s cycles, Merom/Penryn: ~10 cycles, Opteron: appears to be low teens
MOVDQU Throughput (w/ cacheline split): Nwd: ~30, Psc: ~mid-20s cycles, Merom: ~20, Penryn: ~17, Opteron: ~3.3
MOVDQU Latency (w/ cacheline split): Nwd: 50+, Psc: ~40 cycles, Merom: ~21, Penryn: ~18, Opteron: ~20+
Handling cacheline splits has always been a challenge for microarchitectures. Deeper-pipelined machines with aggressive speculative execution tend to face more challenges.
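To make the cacheline-split condition concrete, here is a small sketch of my own (not taken from any measurement harness) that tests whether a load of a given size touches two cache lines. The 64-byte line size and the helper name load_splits_cache_line are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size. 64 bytes is an assumption of this sketch;
   it is not queried from the processor at run time. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if a load of `len` bytes starting at `p` touches two
   cache lines, 0 if it stays within one. `len` must be at least 1. */
static int load_splits_cache_line(const void *p, size_t len)
{
    uintptr_t first = (uintptr_t)p / CACHE_LINE_BYTES;
    uintptr_t last  = ((uintptr_t)p + len - 1) / CACHE_LINE_BYTES;
    return first != last;
}
```

With a 64-byte line, a 16-byte load at offset 57 covers bytes 57 through 72 and therefore splits, while any offset from 0 through 48 does not.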
The LDDQU instruction has always been defined so that its architectural behavior, in terms of input and output state, is the same as that of the load form of MOVDQU. Different microarchitectures may implement hardware to optimize the performance of cacheline-split loads. Prescott did add extra hardware so that LDDQU handles cacheline-split loads with a throughput of ~2 cycles, to mitigate the long-pipeline issues that show up when handling cacheline splits. Merom and Penryn have shorter pipelines, and the tradeoff was made so that LDDQU does the same thing as MOVDQU.
These are just the relatively simple cases of getting data from L1 and interacting with the out-of-order engine. L2 accesses or cache misses get into other interesting interactions with other subsystems in the microarchitecture. Those would be interesting to some folks at some level, but the overarching question is how software would benefit from finer details that can change from one stepping to another. The fundamental heuristic remains: the fewer cache line splits, the more likely the application's performance will improve.
This naturally leads to the central question: how feasible is that heuristic in real-world problems, such as the one that prompted your question?
I don't know enough details of the PSADBW implementation of your block search, but I'll use similar code that one of our colleagues published on motion estimation as a proxy; it faces the same unaligned memory access challenges (see http://softwarecommunity.intel.com/articles/eng/1246.htm).
At the center of such a block-search problem we have two 16-byte loads, a PSADBW and a PADDUSW, where one of the loads is from a stationary block and the other is from a roving block that must cover the entire target frame.
One of the easiest and most intuitive approaches is simply to issue MOVDQU for both the stationary block (which can start anywhere in the parent frame) and the roving block. But this is very sub-optimal.
Since the stationary block can be read outside the loops that rove around the target frame, a good compiler could easily make a copy of the unaligned stationary block at a 16-byte aligned address. This immediately turns half of the 16-byte memory references into aligned loads.
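Done by hand, the copy might look something like the sketch below. This is my own illustration, not code from the linked article. The function name copy_block_to_aligned, the 8-bit pixel format, and the stride parameter are all assumptions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copies a 16x16 block of 8-bit pixels that starts at an arbitrary
   address `src` inside a frame with row pitch `stride` (in bytes) into
   `dst`, a 256-byte buffer that must be 16-byte aligned. Rows are
   packed back to back in `dst`. */
static void copy_block_to_aligned(uint8_t *dst, const uint8_t *src,
                                  size_t stride)
{
    assert(((uintptr_t)dst & 15u) == 0);
    for (int row = 0; row < 16; ++row) {
        /* One unaligned load (MOVDQU) per row, paid once up front. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + row * stride));
        /* Aligned store (MOVDQA) into the packed copy. */
        _mm_store_si128((__m128i *)(dst + row * 16), v);
    }
}
```

The 16 unaligned loads are paid once per stationary block; every later pass over the target frame can then read the copy with aligned loads.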
Your interest in using LDDQU is obviously in the right direction: it attempts to reduce the cache line splits that a significant portion of the other half of the 16-byte memory references would experience.
IMHO, the other approach you alluded to, using PALIGNR, is actually better than LDDQU (even if LDDQU hadn't become plain MOVDQU in our recent processors).
Using Kiefer's SSE2 intrinsic code (see link above) as an example, the inner loop for one pair of 16x16 blocks involves 16 PSADBW, 16 PADDUSW, 16 loads done as 16-byte aligned loads, and 16 loads done with MOVDQU.
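For reference, a loop with that instruction mix can be sketched as follows. This is my own minimal version, not the code from the linked article. It assumes the stationary block has already been copied to a packed, 16-byte aligned buffer; the name sad_16x16 and its parameters are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between a packed, 16-byte aligned 16x16
   reference block and a candidate block at an arbitrary address in the
   target frame (row pitch `cand_stride` bytes). */
static unsigned sad_16x16(const uint8_t *ref_aligned, const uint8_t *cand,
                          size_t cand_stride)
{
    assert(((uintptr_t)ref_aligned & 15u) == 0);
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        /* Aligned load of the stationary block row. */
        __m128i r = _mm_load_si128((const __m128i *)(ref_aligned + row * 16));
        /* Unaligned load (MOVDQU) of the roving block row. */
        __m128i c = _mm_loadu_si128((const __m128i *)(cand + row * cand_stride));
        /* PSADBW leaves one partial sum in the low 16 bits of each
           64-bit half; PADDUSW accumulates them. The total cannot
           exceed 16 * 8 * 255 = 32640 per half, so no saturation. */
        acc = _mm_adds_epu16(acc, _mm_sad_epu8(r, c));
    }
    /* Add the two partial sums (16-bit words 0 and 4). */
    return (unsigned)_mm_extract_epi16(acc, 0) +
           (unsigned)_mm_extract_epi16(acc, 4);
}
```

Every call issues 16 MOVDQU loads on the roving block, and depending on the horizontal position some of them will split a cache line.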
Instead of shifting the target block one pixel at a time to the right (which is why we have to use MOVDQU), the 16-byte loads in the target frame could be done at 16-byte intervals, and PALIGNR would be used to splice together the target block at each position between the 16-byte mileposts. With the PALIGNR approach, the target 16x16 blocks in between the mileposts can be synthesized from registers in flight. The savings come from fewer memory accesses in the first place, and the number of cacheline splits is naturally reduced as well.
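The splice for one scanline could be sketched as below. Again this is an illustration under my own assumptions, not code from the linked article. It assumes the row pointer is 16-byte aligned and that 32 bytes are readable from it. The function name splice_row_16 is hypothetical, and the target attribute is GCC/Clang-specific (other compilers would need their own way of enabling SSSE3).

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* PALIGNR takes its byte count as an immediate, so each of the
   positions between two mileposts needs its own instruction. */
#define SPLICE_AT(n) out[n] = _mm_alignr_epi8(hi, lo, n)

/* From two aligned loads at consecutive 16-byte mileposts, builds the
   16 row vectors that start at byte offsets 0..15 from `row_aligned`.
   out[n] holds bytes row_aligned[n] .. row_aligned[n + 15]. */
__attribute__((target("ssse3")))
static void splice_row_16(const uint8_t *row_aligned, __m128i out[16])
{
    assert(((uintptr_t)row_aligned & 15u) == 0);
    __m128i lo = _mm_load_si128((const __m128i *)row_aligned);
    __m128i hi = _mm_load_si128((const __m128i *)(row_aligned + 16));
    out[0] = lo;  /* offset 0 needs no splice */
    SPLICE_AT(1);  SPLICE_AT(2);  SPLICE_AT(3);  SPLICE_AT(4);
    SPLICE_AT(5);  SPLICE_AT(6);  SPLICE_AT(7);  SPLICE_AT(8);
    SPLICE_AT(9);  SPLICE_AT(10); SPLICE_AT(11); SPLICE_AT(12);
    SPLICE_AT(13); SPLICE_AT(14); SPLICE_AT(15);
}
```

Two aligned loads feed 16 horizontal positions, where the MOVDQU approach would issue 16 loads for the same positions, some of them split across cache lines. In a real search loop each spliced vector would go straight into PSADBW rather than being stored to an array.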
Additionally, it seems plausible that encoder software could have some control over the placement of each frame. If the beginning of each frame can be placed on a 64-byte boundary, there may also be an opportunity to control the address alignment of each scanline in some cases (if padding is allowed). These may be additional opportunities to reduce the number of unaligned memory loads and improve performance.
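If the encoder does own the frame buffers, one possible way to get that placement is sketched below. It uses C11 aligned_alloc and pads each scanline up to a multiple of 64 bytes. The names padded_stride and alloc_frame are hypothetical, and the 64-byte figure is the boundary mentioned above, taken here as a compile-time assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>  /* aligned_alloc (C11), free */

#define FRAME_ALIGN 64u  /* assumed alignment boundary, in bytes */

/* Rounds a scanline width up to the next multiple of FRAME_ALIGN, so
   that every scanline starts on a 64-byte boundary when the frame
   itself does. */
static size_t padded_stride(size_t width_bytes)
{
    return (width_bytes + FRAME_ALIGN - 1) / FRAME_ALIGN * FRAME_ALIGN;
}

/* Allocates a frame of `height` scanlines whose base address and row
   pitch are both multiples of FRAME_ALIGN. Stores the pitch in
   `*stride_out`. Returns NULL on failure; release with free(). */
static uint8_t *alloc_frame(size_t width_bytes, size_t height,
                            size_t *stride_out)
{
    size_t stride = padded_stride(width_bytes);
    *stride_out = stride;
    /* stride * height is a multiple of FRAME_ALIGN, as aligned_alloc
       requires of its size argument. */
    return aligned_alloc(FRAME_ALIGN, stride * height);
}
```

For example, a 720-byte scanline would be padded to a 768-byte pitch. The cost is some extra memory per row, which is why this only applies where padding is allowed.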
sjkuo