Challenge
Avoid bottlenecks associated with simple math functions in a wide variety of floating-point applications, ranging from financial analytics to 2D image manipulation to 3D physics engines. This issue is particularly evident on the Pentium® 4 Processor, where the latency of trigonometric, logarithmic, and exponential floating-point instructions can approach 200 clock cycles.
Solution
Use one of four optimized math libraries from Intel, each of which provides a speedy alternative to slow hardware-based operations: the Approximate Math, Vector Math, Short Vector Math, and LibM libraries. These libraries trade off different levels of performance, parallelism, and accuracy, so you can choose the one most appropriate for your application.
To effectively choose one of the four libraries discussed here, you will need to know the precision requirements of your algorithm, as well as the amount of parallelism available around each math function call. The latter determines whether SIMD instructions can be used to compute multiple results in parallel. For Windows* developers, your choice of compiler should not be an issue, as all of the libraries are fully compatible with both the Intel® C++ Compiler and the Microsoft* Visual C++ Compiler.
- Approximate Math Library: The Approximate Math (AM) Library is a set of open source math functions that operate on both scalar and packed data using SSE (Streaming SIMD Extensions) and SSE2 instructions. Of all the options, the AM library provides the best performance, but at the cost of low precision. Specifically, its precision is comparable to that of the SSE reciprocal approximation instructions, whose relative error has a magnitude of at most 1.5 * 2^-12 (about 3.7 * 10^-4, or roughly 11 significant bits). For this reason, the library operates only on single-precision data; approximating results at double precision is not especially practical. In the scalar form, the argument must be passed in the lowest field of an XMM register, while in the packed form the argument consists of four numbers arranged adjacently in one XMM register. The level of parallelism available in your algorithm determines whether the packed version can be used.
As the "Open Source" tag implies, the AM Library is completely free and does not require a license to use. The library package is available at: http://www.intel.com/design/pentiumiii/devtools/AMaths.zip
- Vector Math Library: The Vector Math Library (VML) is a subset of the Intel® Math Kernel Library, intended for parallel math computations on large vectors of data. Unlike the AM library, VML does not compromise precision, and each math function is available in both single-precision and double-precision form. The limiting requirement for VML is that you must organize your input data into a single array to be passed as an argument to a VML call.
For example, VML would be appropriate for a loop that computes the sine of each element of an array (src), but only if the loop bound (iArrayLength) is a large value. If the src array contained double-precision data, the whole loop could be replaced by a single call to the vdsin() function. Likewise, vssin() is the counterpart for single-precision data.
The Intel® Math Kernel Library (including VML) is available for free download. If you decide to use it in a retail product, you must obtain a product license.
- LibM Math Library: The LibM library provides highly optimized scalar math functions that serve as direct replacements for the standard C calls. The LibM versions are fully accurate and do not attempt to extract algorithm-level parallelism, so they can be applied in even the most rigid coding situations.
LibM is packaged with the Intel® compilers and thus requires the purchase of an Intel Compiler license if you choose to build it into a retail product (even if you do not use the compiler itself). For details on implementing LibM, see the separate item, How to Implement the LibM Math Library.
- Short Vector Math Library: The Short Vector Math Library (SVML) is perhaps the trickiest of the four libraries to apply, but it is worth the extra effort for the large performance gains. SVML uses SSE and SSE2 instructions to process either four packed single-precision numbers or two packed double-precision numbers in one call. Results are fully accurate. As with VML, the programmer is responsible for lining up the data in parallel fashion, which is not always possible, depending on the nature of the algorithm.
Like LibM, SVML comes with the Intel® compilers. In fact, SVML was developed solely as an enabler for the Intel Compiler's automatic SIMD vectorization capability. Using SVML, the compiler can vectorize simple loops containing calls to math functions, but in many more complex cases programmer intervention (i.e., a human brain) is still needed to fully extract the parallelism. For details on implementing SVML, see the separate item, How to Implement the Short Vector Math Library.
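The scalar and packed argument layouts described for the Approximate Math Library, and the error bound it is compared against, can be illustrated with the SSE reciprocal approximation intrinsics themselves. The sketch below is not AM library code. The function names rcp_packed4, rcp_scalar, and rel_err and the macros are hypothetical, and the non-SSE fallback using exact division is an assumption added only so the sketch builds on other targets.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_rcp_ps, _mm_rcp_ss */
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

/* Relative error bound quoted in the text: 1.5 * 2^-12. */
#define RCP_REL_ERR_BOUND (1.5f / 4096.0f)

/* Packed form: four adjacent single-precision values share one XMM
 * register, and one instruction approximates all four reciprocals. */
void rcp_packed4(const float in[4], float out[4])
{
    assert(in != NULL && out != NULL);
#if SKETCH_HAVE_SSE
    __m128 x = _mm_loadu_ps(in);   /* load 4 floats, no alignment required */
    __m128 r = _mm_rcp_ps(x);      /* approximate 1/x in all four fields */
    _mm_storeu_ps(out, r);
#else
    for (int i = 0; i < 4; ++i) {  /* fallback: exact division */
        out[i] = 1.0f / in[i];
    }
#endif
}

/* Scalar form: the argument sits in the lowest field of the register. */
float rcp_scalar(float v)
{
#if SKETCH_HAVE_SSE
    __m128 x = _mm_set_ss(v);      /* v in lowest field, upper fields zero */
    __m128 r = _mm_rcp_ss(x);      /* approximate the lowest field only */
    return _mm_cvtss_f32(r);       /* extract the lowest field */
#else
    return 1.0f / v;
#endif
}

/* Magnitude of the relative error of approx against a reference value. */
double rel_err(float approx, double exact)
{
    return fabs(((double)approx - exact) / exact);
}
```

For an input x, calling rel_err(rcp_scalar(x), 1.0 / x) and comparing the result with RCP_REL_ERR_BOUND shows how much accuracy is given up. The packed version can only be used when four independent inputs are available at once, which is the parallelism requirement the library descriptions keep returning to.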
Source
Integrating Fast Math Libraries for the Intel Pentium® 4 Processor