The following is a question received by Intel(R) Software Network Support, followed by the responses provided by our Application Engineering team:
Q. I am developing on a Pentium 4 with Windows XP and the DevCpp/GCC compiler. I need to measure the performance of a small number of floating point calculations (for example, fadd, fsub and so on). The Enhanced Timer is not suitable because of the overhead. Now I am looking for examples of how to measure it by using processor clocks (the RDTSC instruction). The most frequently cited information I found was in the document "Using the RDTSC Instruction for Performance Monitoring", which is an Intel Corporation document (1997). The source code part of this document (3A) is exactly what I need to do. To use it, it is necessary to calculate the overhead first (variable base) and "warm up" the cache, due to the effects of cache misses and other processes which are using the same processor. Unfortunately the variable base is changing all the time, so it is not possible to produce repeatable measurements. I would be grateful to get some advice.
A. We forwarded this question to several engineers, and received the following responses:
#1. My first question in response to "I need to measure a small number of floating point operations" would be WHY? If it's only a small number, it can't be performance critical!
If it's a large number (even if only occasionally in some inner loop), then "overhead" doesn't matter and you can use your timing routines (ETimer, RDTSC, or QueryPerfCounter, etc.). You may have to re-harness your code to do this, of course.
#2. Measurements with RDTSC are most credible if the number of clocks between the pair of RDTSC instructions is at least a few thousand, preferably tens of thousands.
Typically, one wraps the interesting code sequence in a loop. Also (for certain OS reasons), you should repeat this multiple times -- the first measurement is usually wrong.
i.e.:
repeat 5 times
    start = RDTSC
    loop 50,000 times
        the small number of FP instructions I want to test
    endloop
    end = RDTSC
end repeat
When you do it this way, loop overhead is pretty incidental, and you can just compute (end-start)/50000 for each of the 5 trials to get the clocks per pass through the loop. I would print each of the 5 trials.
I would expect the result of the first trial to be quite different from the following 4 results.
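The pseudocode in response #2 could be sketched in C roughly as follows. This is only an illustration under stated assumptions: read_tsc(), fp_kernel() and run_trials() are hypothetical names that do not come from the engineers, the four-operation kernel is a stand-in for the code under test, and the inline assembly uses GCC syntax (C99). On a non-x86 target the sketch falls back to clock() so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the time-stamp counter (x86, GCC-style inline assembly).
   Assumption: on other targets we fall back to clock() so the
   sketch still builds; the units are then not processor clocks. */
#if defined(__i386__) || defined(__x86_64__)
static uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>
static uint64_t read_tsc(void) { return (uint64_t)clock(); }
#endif

#define TRIALS     5
#define ITERATIONS 50000

/* Hypothetical code under test: a dependent chain of four FP adds/subs.
   The operands are volatile so the compiler cannot fold them away. */
static double fp_kernel(double x)
{
    volatile double a = 1.5, b = 0.25;
    x = x + a;
    x = x - b;
    x = x + b;
    x = x - a;
    return x;
}

/* Run TRIALS timed trials and store the average ticks per loop
   iteration of each trial in out[]. The accumulated value is
   returned so that the timed work has an observable result. */
static double run_trials(double out[TRIALS])
{
    double x = 0.0;
    int t, i;

    assert(out != NULL);
    for (t = 0; t < TRIALS; ++t) {
        uint64_t start = read_tsc();
        for (i = 0; i < ITERATIONS; ++i)
            x = fp_kernel(x);
        uint64_t end = read_tsc();
        out[t] = (double)(end - start) / ITERATIONS;
    }
    return x;
}
```

A caller would declare double out[TRIALS], call run_trials(out), and print all five entries, expecting out[0] to differ from the rest. The per-iteration figure includes the loop and call overhead, which is small relative to 50,000 iterations but not zero.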
#3. You should also be aware of the caveats around measuring something on a Pentium(R) 4 processor. It will be significantly different on our new cores. I recommend you get a copy of the Intel(R) VTune(TM) Performance Analyzer.
#4. Our guess is that you might be reverse engineering performance for key FP sequences and working out cache latencies and stride/timing semantics using these annotations.
If this is the case, our guess is that you are likely doing this while playing off requisite algorithm/blocking strategies, perhaps even while comparing our u-arch with a competitive one.
Less likely, but if it turns out that you are tuning small code sequences on an out-of-order (OOO) machine, we would advise you against that approach.
RDTSC on the Pentium(R) 4 processor is noisy, synchronizes the pipeline, and at last check had a latency of ~90-120 clocks in the Pentium(R) 4 (former codename Northwood) processor implementation.
This would certainly introduce "Heisenberg" uncertainty aspects into your measurements.
Which version of GCC are you using? Hopefully something later than 3.3.*; 4.1.* would be even better.
In the end, if you choose to use a "counter", you will be challenged by signal-to-noise ratio (SNR) issues unless you account for them in the set-up/design of your performance experiments.
Q. In response to #1:
For my scientific project, I need to measure the performance of a small algorithm on different architectures (processors, operating systems and so on). The algorithm contains just the addition and subtraction of floating point numbers (4-5 operations). I have already measured it with counters like QueryPerfCounter and got some results. To achieve them, I needed to deal with effects such as the overhead of loops, calling QueryPerfCounter from the Windows API, cache refresh and others, which produce a big overhead compared with the operations I need to measure. So after taking into account all of these effects, the results are unfortunately not precise enough. For this reason, I have made the decision to measure primarily with RDTSC.
In response to #2:
I have found two methods of measurement in the document "Using the RDTSC Instruction for Performance Monitoring". One of them deals with short pieces of code, such as in my case. To overcome the effects of instruction and data cache misses, the technique of “cache warming” is applied. Here is the assembler code (it should be repeated 3 times):
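CPUID              ; serialize: let earlier instructions complete
RDTSC              ; read the time-stamp counter into EDX:EAX
mov cyc, eax       ; save the low 32 bits as the start count
CPUID              ; serialize again
RDTSC              ; read the time-stamp counter a second time
sub eax, cyc       ; clocks elapsed across the empty measurement
mov base, eax      ; store the result as the overhead (base)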
Since the variable base is changing for each measurement, it is impossible to get repeatable results. This is my main problem at the moment.
In response to #3:
Is the Intel(R) VTune(TM) Performance Analyzer also suitable for a small number of operations (like in my case)?
In response to #4:
I am using DevC++ 4.9.9.2 with GCC. I would be glad if you could describe these issues in more detail.
A. Our engineers responded:
#1. Here's some additional data covering RDTSC operation. I took the time to dust off previous work and the corresponding diagnostic programs and re-examined them for validity.
First, as to whether or not executing RDTSC distorts the measurement: it will, for shorter instruction sequences that execute within the "shadow" of an instance of RDTSC execution. Presuming that no power/thermal events that affect core clock frequency take place, RDTSC is ~80 clocks on Pentium(R) 4 microarchitecture and ~65 clocks on Intel(R) Core(TM) microarchitecture. This was the basis of my Heisenberg allusion: upon inserting a pair of RDTSC instructions, one essentially cannot measure time spans less than about twice the pipeline "shadow". Even then, one must be cognizant that the recovered precision is in direct proportion to the duration of the time span relative to twice the pipeline "shadow". So this is the lower bound on the time span that can be measured using present instruction-based technology.
Second, as to whether there is jitter among pairs of RDTSC used to measure the time span of an instruction sequence: if purely executing within the core, e.g. recurrence relations, Pentium 4 is very faithful here when executing from the trace cache, and no jitter is ever seen (at least by me). On Core u-arch, one will experience jitter, perhaps up to ~25% but typically ~5% of the time span being measured. I attribute this to variances in instruction fetch/decode operations when code is not ideally placed relative to the measurement and control-flow groups of instructions.
There is almost always jitter among pairs of RDTSC used to measure the time span of an instruction sequence if there are outstanding memory operations in the pipeline. The standard deviation of the measured values ranges up to ~30% for short, less iterative sequences and is around 5-10% for long, more iterative sequences. This is true on both Pentium 4 and Core u-arch, and use of simple binning is advised, especially when looking for best-case performance.
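One possible reading of "simple binning" is a fixed-width histogram over the raw tick counts, from which the smallest sample and the fullest bin are reported. The sketch below assumes that reading; bin_summary, bin_samples() and MAX_BINS are hypothetical names, and the bin width is a parameter the experimenter would have to choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BINS 64

/* Summary of a binned set of timing samples. */
typedef struct {
    uint64_t min;        /* smallest sample (best case)          */
    uint64_t mode_lo;    /* lower edge of the most populated bin */
    size_t   mode_count; /* number of samples in that bin        */
} bin_summary;

/* Sort n samples into bins of bin_width ticks, starting at the
   smallest sample. Anything past the last bin (for example a
   sample hit by an interrupt) is counted in the last bin. */
static bin_summary bin_samples(const uint64_t *samples, size_t n,
                               uint64_t bin_width)
{
    size_t counts[MAX_BINS] = {0};
    bin_summary s;
    size_t i, best = 0;

    assert(samples != NULL && n > 0 && bin_width > 0);

    /* Find the minimum, which anchors the first bin. */
    s.min = samples[0];
    for (i = 1; i < n; ++i)
        if (samples[i] < s.min)
            s.min = samples[i];

    /* Count samples per bin, clamping outliers into the last bin. */
    for (i = 0; i < n; ++i) {
        uint64_t idx = (samples[i] - s.min) / bin_width;
        if (idx >= MAX_BINS)
            idx = MAX_BINS - 1;
        counts[idx]++;
    }

    /* Pick the fullest bin; ties go to the lower bin. */
    for (i = 1; i < MAX_BINS; ++i)
        if (counts[i] > counts[best])
            best = i;

    s.mode_lo = s.min + (uint64_t)best * bin_width;
    s.mode_count = counts[best];
    return s;
}
```

For example, with the made-up samples {84, 80, 81, 96, 80, 83, 2500, 82} and a bin width of 4, the minimum is 80 and the fullest bin starts at 80 with 5 samples, while the 2500-tick outlier lands in the overflow bin and does not affect either figure.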
#2. Regardless of what the PRM says, the sequence:
rdtsc
{a small number of instructions with a cumulative latency less than hundreds or thousands of clocks}
rdtsc
is very unlikely to yield a reliable result. The CPUID step is probably not needed either, although there are other opinions on that point.
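The rdtsc instruction is serializing only with respect to itself, that is, with respect to other rdtsc instructions. It is not a serializing instruction in the general sense, so neighbouring instructions can still execute out of order around it.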
Fundamentally, what all the respondents are saying is that this is an out-of-order machine, and the very notion of determining the latency of a 3-instruction sequence is quite slippery. You can get very reliable measurements of larger blocks of code (with a few caveats, as noted elsewhere in these responses). But don't try to measure something small. AND, check that your result is repeatable, and your measurement stable.
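A repeatability check of the kind recommended here might look like the sketch below. It assumes the per-trial averages have already been collected, skips the first trial (which the earlier response expects to be off), and treats the run as stable when the remaining trials lie within a chosen fraction of the smallest one. The name is_stable() and any particular tolerance, such as 5%, are assumptions for illustration and not values given by the engineers.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if, ignoring trials[0], every remaining trial lies within
   `tolerance` (e.g. 0.05 for 5%) of the smallest remaining trial,
   and 0 otherwise. Needs at least two trials after the first. */
static int is_stable(const double *trials, size_t n, double tolerance)
{
    double lo, hi;
    size_t i;

    assert(trials != NULL && n >= 3 && tolerance >= 0.0);

    /* Track the smallest and largest trial, starting from the second. */
    lo = hi = trials[1];
    for (i = 2; i < n; ++i) {
        if (trials[i] < lo) lo = trials[i];
        if (trials[i] > hi) hi = trials[i];
    }

    if (lo <= 0.0)
        return 0;   /* a zero or negative timing is not trustworthy */

    return (hi - lo) <= tolerance * lo;
}
```

If the check fails, the measurement should be repeated or the timed block made larger before any conclusion is drawn from the numbers.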
==
Lexi S.
Intel(R) Software Network Support
http://www.intel.com/software