One of our engineers responds as follows:
RDTSC is not a serializing instruction. On out-of-order machines, there is no guarantee that back-to-back RDTSC instructions will return monotonically increasing values. A well-known technique for ensuring monotonic behavior is to place a serializing instruction immediately before RDTSC. A common choice of serializing instruction is CPUID.
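As a rough illustration of the CPUID-before-RDTSC technique, here is a minimal sketch. It assumes GCC or Clang inline assembly on an x86 target; the helper name serialized_rdtsc is made up for this example and does not come from any Intel library.

```c
#include <stdint.h>

/* Hypothetical helper (name chosen for this sketch): execute CPUID with
   EAX=0 so earlier instructions complete, then read the time-stamp
   counter. Assumes GCC/Clang inline assembly on x86. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID is the serializing instruction; the "memory" clobber also
       keeps the compiler from moving memory accesses across it. */
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");

    /* RDTSC returns the counter in EDX:EAX. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");

    return ((uint64_t)hi << 32) | lo;
}
```

Two consecutive calls on the same core should then return non-decreasing values, at the cost of one CPUID per reading.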
Naturally, adding a serializing instruction before RDTSC adds to the overhead of the timing measurement. Depending on your timing measurement philosophy, you have to decide whether to (a) measure frequently (which requires the extra overhead to ensure monotonic RDTSC) or (b) minimize measurement overhead by amortizing each RDTSC over many instructions (if RDTSC is not executed too frequently, the finite length of the out-of-order (OOO) window effectively guarantees monotonic behavior). If you choose (a), you may have to invest in other techniques to calibrate how much overhead your in-situ CPUID+RDTSC measurement costs you.
Taking approach (a) also means you may have to characterize the statistical variance of the measurement overhead. Both CPUID and RDTSC are complex instructions consisting of relatively long sequences of micro-ops, so they are likely to execute and complete in a different number of cycles each time. In particular, executing CPUID with different input values is likely to take a varying number of cycles.
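One possible way to calibrate that overhead and its variance is to time many back-to-back CPUID+RDTSC pairs with nothing in between. The sketch below is only an illustration under assumptions: GCC or Clang inline assembly on x86, and made-up names (serialized_rdtsc, overhead_stats, measure_overhead). The running mean and variance use Welford's method, which is my choice here and not something the engineer prescribed.

```c
#include <stdint.h>

/* Hypothetical helper: CPUID (EAX=0) followed by RDTSC. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Summary of the measured overhead, in TSC ticks. */
struct overhead_stats {
    uint64_t min;      /* smallest observed delta */
    double   mean;     /* average delta */
    double   variance; /* sample variance of the deltas */
};

/* Time `samples` empty intervals between two serialized readings. */
static struct overhead_stats measure_overhead(int samples)
{
    struct overhead_stats s = { UINT64_MAX, 0.0, 0.0 };
    double m2 = 0.0; /* running sum of squared deviations */

    for (int i = 1; i <= samples; i++) {
        uint64_t t0 = serialized_rdtsc();
        uint64_t t1 = serialized_rdtsc();
        uint64_t d  = t1 - t0;

        if (d < s.min)
            s.min = d;

        /* Welford's online update of mean and squared deviations. */
        double delta = (double)d - s.mean;
        s.mean += delta / i;
        m2 += delta * ((double)d - s.mean);
    }
    if (samples > 1)
        s.variance = m2 / (samples - 1);
    return s;
}
```

The minimum is often used as the overhead to subtract from later measurements, while the variance gives a feel for how noisy each individual reading is.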
To summarize,
1. Back-to-back RDTSC returning non-monotonic values is not unexpected.
2. If your application requires frequent and monotonic RDTSC, you must add CPUID (with EAX=0) immediately before each RDTSC. You may need to decide what to do about the increased measurement overhead.
3. If your app's sampling period can be sufficiently large, you may be able to use RDTSC without a serializing instruction and still get monotonic behavior. The key is to ensure you unroll enough to have a large number of instructions between two RDTSCs.
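Option (b), amortizing plain RDTSC over a long stretch of work, might look roughly like the sketch below. It assumes GCC or Clang inline assembly on x86. The names plain_rdtsc and time_amortized and the arithmetic loop standing in for real work are all invented for illustration. How many instructions are "enough" depends on the processor's out-of-order window and is not established by this code.

```c
#include <stdint.h>

/* Hypothetical helper: RDTSC with no serializing instruction. The
   "memory" clobber is only a compiler barrier, not CPU serialization. */
static inline uint64_t plain_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Time a long stretch of placeholder work between two unserialized
   readings; returns elapsed TSC ticks and stores the work's result. */
static uint64_t time_amortized(uint32_t iterations, uint32_t *result)
{
    /* volatile keeps the compiler from deleting the placeholder work */
    volatile uint32_t acc = 1;

    uint64_t start = plain_rdtsc();
    for (uint32_t i = 0; i < iterations; i++)
        acc = acc * 1664525u + 1013904223u; /* stand-in workload */
    uint64_t stop = plain_rdtsc();

    *result = acc;
    return stop - start;
}
```

Dividing the returned tick count by the iteration count gives an average cost per iteration, with the two RDTSC reads contributing only a small share of the total.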
==
Lexi S.
Intel(R) Software Network Support
http://www.intel.com/software