Challenge
Use software data prefetch to hide the latency of data access in performance-critical sections of application code. The prefetch instruction allows data to be fetched in advance of its actual usage. The prefetch instructions do not change the user-visible semantics of a program, although they may affect the program’s performance. The prefetch instructions merely provide a hint to the hardware and generally will not generate exceptions or faults.
The prefetch instructions load either non-temporal or temporal data into the specified cache level. Both the data-access type and the cache level are specified as hints. Depending on the implementation, the instruction fetches 32 or more aligned bytes, including the byte at the specified address, into the cache levels that the instruction specifies.
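The "aligned bytes, including the specified address byte" rule can be illustrated with a little address arithmetic. The following is a minimal sketch, not taken from the reference manual: `prefetch_block_start` is a hypothetical helper name, and it assumes the fetch size is a power of two (as 32 and 128 are).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an Intel API): start address of the aligned
 * block of `fetch_bytes` bytes that contains `addr`.
 * Assumption: `fetch_bytes` is a power of two, e.g. 32 or 128. */
uintptr_t prefetch_block_start(uintptr_t addr, uintptr_t fetch_bytes)
{
    assert(fetch_bytes != 0 && (fetch_bytes & (fetch_bytes - 1)) == 0);
    /* Clearing the low bits rounds the address down to the block boundary. */
    return addr & ~(fetch_bytes - 1);
}
```

For example, a prefetch of address 0x1234 would cover the block starting at 0x1220 with a 32-byte fetch and the block starting at 0x1200 with a 128-byte fetch.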
Excessive use of prefetch instructions may waste memory bandwidth and result in a performance penalty due to resource constraints. Nevertheless, the prefetch instructions can lessen the overhead of memory transactions by preventing cache pollution and by using the caches and memory efficiently. This is particularly important for applications that share critical system resources, such as the memory bus.
Solution
Use the prefetch instructions in predictable memory-access patterns, time-consuming innermost loops, and locations where the execution pipeline may stall if data is not available. Using the prefetch instructions is recommended only if data does not fit in cache. The prefetch instructions are mainly designed to improve application performance by hiding memory latency in the background. If segments of an application access data in a predictable manner, for example, using arrays with known strides, then they are good candidates for using prefetch to improve performance.
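As a rough illustration of prefetching in a loop with a known stride, the sketch below uses the `_mm_prefetch` intrinsic from `<xmmintrin.h>`. The function name `strided_sum` and the look-ahead value `PREFETCH_AHEAD` are assumptions made for this example; a suitable look-ahead depends on the processor and must be found by measurement. On 32-bit x86 targets, SSE may need to be enabled explicitly in the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumed tuning value: how many loop iterations ahead to prefetch. */
#define PREFETCH_AHEAD 8

/* Hypothetical example: sum every `stride`-th element of `data`,
 * prefetching the element that will be needed PREFETCH_AHEAD
 * iterations later. */
float strided_sum(const float *data, size_t n, size_t stride)
{
    assert(stride > 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n) {
            /* Only a hint: it does not change the value computed below. */
            _mm_prefetch((const char *)&data[ahead], _MM_HINT_T0);
        }
        total += data[i];
    }
    return total;
}
```

With a small stride, several iterations fall in the same fetched block, so this loop would issue redundant prefetches; in that case it is usually better to prefetch once per block.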
Streaming SIMD Extensions include four prefetch instructions: one non-temporal and three temporal. Their behavior is implementation-specific, so applications need to be tuned to each implementation to maximize performance.
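In C, the four instructions are commonly reached through the `_mm_prefetch` intrinsic and its hint constants. The sketch below is only an illustration: `prefetch_then_load` is a hypothetical function, and issuing all four hints on one address is done here just to show the constants side by side, not as a recommended pattern.

```c
#include <assert.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Hypothetical example: issue each prefetch hint on one address,
 * then load from it. The prefetches are hints only, so the value
 * returned is the same as for a plain load. */
int prefetch_then_load(const int *p)
{
    assert(p != NULL);  /* the final load dereferences p */
    const char *addr = (const char *)p;
    _mm_prefetch(addr, _MM_HINT_NTA);  /* prefetchnta: non-temporal */
    _mm_prefetch(addr, _MM_HINT_T0);   /* prefetcht0: temporal */
    _mm_prefetch(addr, _MM_HINT_T1);   /* prefetcht1: temporal */
    _mm_prefetch(addr, _MM_HINT_T2);   /* prefetcht2: temporal */
    return *p;
}
```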
Note: At the time of prefetch, if the data is already found in a cache level that is closer to the processor than the cache level specified by the instruction, no data movement occurs.
The non-temporal instruction is prefetchnta. On the Pentium 4 processor, it fetches the data into one way of the second-level cache, which minimizes cache pollution.
The temporal instructions are prefetcht0, prefetcht1, and prefetcht2.
The following table lists the prefetch implementation differences between the Pentium® III processor and Pentium 4 processor:
| Prefetch Type | Pentium III Processor | Pentium 4 Processor |
|---|---|---|
| PrefetchNTA | Fetch 32 bytes; fetch into 1st-level cache; do not fetch into 2nd-level cache | Fetch 128 bytes; do not fetch into 1st-level cache; fetch into 1 way of 2nd-level cache |
| PrefetchT0 | Fetch 32 bytes; fetch into 1st-level cache; fetch into 2nd-level cache | Fetch 128 bytes; do not fetch into 1st-level cache; fetch into 2nd-level cache |
| PrefetchT1, PrefetchT2 | Fetch 32 bytes; fetch into 2nd-level cache only; do not fetch into 1st-level cache | Fetch 128 bytes; do not fetch into 1st-level cache; fetch into 2nd-level cache only |
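The different fetch sizes in the table change how many prefetches are needed to cover a buffer, which is one reason code has to be tuned per implementation. The following is a small worked sketch; `prefetches_needed` is a hypothetical helper, and it assumes the buffer starts on a fetch-size boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Fetch sizes taken from the table above. */
enum { P3_FETCH_BYTES = 32, P4_FETCH_BYTES = 128 };

/* Hypothetical helper: number of prefetches needed to cover `bytes`
 * bytes when each prefetch brings in `fetch_bytes` bytes.
 * Assumption: the buffer is aligned to `fetch_bytes`. */
size_t prefetches_needed(size_t bytes, size_t fetch_bytes)
{
    assert(fetch_bytes > 0);
    return (bytes + fetch_bytes - 1) / fetch_bytes;  /* round up */
}
```

For a 4096-byte buffer this gives 128 prefetches with a 32-byte fetch and 32 prefetches with a 128-byte fetch, so a loop written for the Pentium III processor would issue four times as many prefetches as the Pentium 4 processor needs.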
For more information, including a comparison of prefetch and load instructions, see the IA-32 Intel® Architecture Optimization Reference Manual.
Source
IA-32 Intel® Architecture Optimization Reference Manual