Intel® Performance Tuning Utility 3.1 Product Overview

Author: Intel® Software Network
Published On: Wednesday, September 26, 2007 | Last Modified On: Wednesday, July 30, 2008

Overview

The Intel® Performance Tuning Utility (PTU) is a cross-platform performance analysis tool set. Alongside traditional features such as identifying the hottest modules and functions of an application, tracking call sequences, and identifying performance-critical source code, Intel PTU offers new, more powerful capabilities for data collection, analysis, and visualization. For experienced tuners, Intel PTU exposes the processor hardware event counters for a detailed look into memory system performance, architectural tuning, and more, and it can relate the issues it finds back to the source code. If you are analyzing an application for which you don’t have the source, Intel PTU lets you view data with basic block granularity and provides a function control flow graph for navigating the disassembly. The Intel Performance Tuning Utility is available for both Windows and Linux.

The Intel® Performance Tuning Utility offers:

Statistical Call Graph
Profiles with low overhead to detect where time is spent in your application

Event Based Sampling
Uses the processor’s onboard performance monitoring hardware to get a detailed look into performance issues

Basic Block Analysis
Displays hotspots with basic block granularity and generates a control flow graph for advanced analysis of the application, even without the source code

Events over IP graph
Generates a histogram of performance events distributed over application code

Loop Analysis
Identifies loops and recursion in your application to aid optimization

Result difference
Compares the results of multiple runs to measure changes in performance

Data Access Profiling
Identifies memory hotspots and relates them to code hotspots

Heap Profiler
Identifies dynamic memory usage by the application and can help identify memory leaks

Instrumentation-based Call Graph, Call Count
Provides exact call graph and call count information for your application

Version 3.1 Update 3 of the Intel Performance Tuning Utility introduces the following new features and enhancements:

  • New CPU support: ability to recognize new CPUs and do basic sampling on them
  • Data Access Profiling: filtering memory data by instructions in Source View, Latency Histogram, Utility charts with Access Stride and Array-of-structures distribution, Working Set chart, global data objects granularity for Memory Hotspots, and other GUI improvements
  • Instrumentation-based Call Graph, Call Count, and Heap Profiling: now available on Windows* Intel 64 architecture
  • Profile descriptions: an option to edit embedded descriptions for user-defined profiles
  • Multiple bug fixes and performance improvements


Notes:

  • Intel® Performance Tuning Utility 3.1 Update 3 is not backward compatible with previous versions with regard to opening and viewing data collected by those versions. In some cases, database re-conversion can help.
  • A regular user can install the Intel® PTU features needed for Statistical Call Graph, Exact Call Graph, and Heap Profiling collections and for viewing any collection results. Enabling Sampling and Data Access Profiling collections requires system administrator privileges. On a Windows host, enabling Statistical Call Graph collection also requires system administrator privileges because it is driver-based. Once Intel PTU is properly installed (see INSTALL.txt for details), regular users can use all the features.
  • Enabling Sampling and Data Access Profiling collections requires driver installation and may affect the Sampling collector in Intel® VTune™ Performance Analyzer. It is not guaranteed that Intel® PTU 3.1 and VTune analyzer can share the same sampling driver, or that VTune analyzer can properly read sampling data files (*.tb5) generated by Intel PTU 3.1. If you experience problems with sampling collection or with viewing its results in VTune analyzer or Intel PTU, make sure each product uses the driver it is shipped with. See INSTALL.txt to learn how to run the proper driver.
  • Many capabilities of the Intel® Performance Tuning Utility are at a "prototype" level of maturity and are expected to develop continuously in further updates and releases. Any feedback is appreciated.


System Requirements

This section details the processor, memory, disk space, and operating system requirements for installing and using various components of the Intel® Performance Tuning Utility. The product was validated on platforms with the following parameters.

Processor Requirements

                                                Architecture
Processor                                       IA-32   Intel® 64   IA-64
-------------------------------------------------------------------------
Intel® Celeron® processor                       +
Intel® Celeron® D processor                     +
Intel® Pentium® 4 processor                     +       +
Intel® Pentium® D processor                     +       +
Intel® Pentium® 4 processor Extreme Edition     +
Intel® Xeon® processor                          +
Intel® Xeon® DP processor                       +       +
Intel® Xeon® MP processor                       +       +
Intel® Pentium® M processor                     +
Mobile Intel® Pentium® 4 processor              +
Mobile Intel® Celeron® processor                +
Intel® Celeron® M processor                     +
Intel® Core™ Duo processor                      +
Intel® Core™ 2 Duo processor                    +       +
Intel® Xeon® processor 50xx, 51xx, 7xxx series  +       +
Intel® Itanium® 2 processor                                         +
Intel® Itanium® 2 processor series 9000                             +

To view the full list of currently supported processors, enter:

>vtsarun -cl

Memory Requirements

The application you are tuning may itself consume significant memory and disk space. If this is the case, make sure you have sufficient memory and disk space for running both your application and the Intel Performance Tuning Utility.

Interface                           RAM         Swap space
----------------------------------------------------------
Command line collector and viewer   > 256 MB    > 256 MB
Loop profiling enabled              > 700 MB    > 700 MB
Graphical User Interface            > 512 MB    > 512 MB

Disk Space Requirements

Total disk space (archive file, its extracted files, and all installed components): 300-400 MB

Operating System Requirements

The Intel Performance Tuning Utility was tested on the following Windows* and Linux* distributions:

                                                                              Architecture
Operating System                                                              IA-32   Intel® 64   IA-64
-------------------------------------------------------------------------------------------------------
Microsoft* Windows XP Professional Service Pack 2                             +
Microsoft* Windows XP Professional x64 Edition Service Pack 1                         +
Microsoft* Windows Server 2003 Enterprise Edition Service Pack 1              +
Microsoft* Windows Server 2003 Enterprise x64 Edition Service Pack 1, 2               +
Microsoft* Windows Server 2008                                                        +
Microsoft* Windows Vista* (Ultimate, Enterprise)                              +       +
Microsoft* Windows Vista* Service Pack 1                                              +
Red Hat* Fedora* Core 5 (kernel 2.6.15)                                               +
Red Hat* Fedora* 7 (kernel 2.6.21-1.3194.fc7)                                         +
Red Flag Linux* 5.0 DC Server (kernel 2.6.9-11)                                                   +
Red Hat* Enterprise Linux* Advanced Server 3.0 Update 6 (kernel 2.4.21-37)    +       +           +
Red Hat* Enterprise Linux* Advanced Server 4.0 Update 3, 4, 5 (kernel 2.6.9)  +       +           +
Red Hat* Enterprise Linux* Advanced Server 5.0 (kernel 2.6.18-8)              +       +           +
Red Hat* Enterprise Linux* Advanced Server 5.1 (kernel 2.6.18-53)             +       +
SuSE* Linux* Enterprise Server 9 Service Pack 3 (kernel 2.6.5)                +       +           +
SuSE* Linux* Enterprise Server 10 (kernel 2.6.16.21-0.8)                      +       +           +
SuSE* Linux* Enterprise Server 10 Service Pack 1 (kernel 2.6.16.46-0.12)      +       +
Turbolinux* 10 (kernel 2.6.9-5.15)                                                                +

The Intel Performance Tuning Utility works with ALL compilers that follow industry standard object code formats. It was tested on applications built with the following compilers:

  • GCC* 3.2, 3.3, 3.4, 4.0
  • Intel® C++ Compiler 9.0
  • Intel® C++ Compiler 9.1
  • Intel® C++ Compiler 10.0
  • Intel® C++ Compiler 10.1
  • Microsoft* Visual C++* 6.0
  • Microsoft* Visual C++* 2002
  • Microsoft* Visual C++* 2003
  • Microsoft* Visual C++* 2005

Java Environment Requirements

The graphical user interface (GUI) of the Intel Performance Tuning Utility requires Eclipse* 3.2, EMF* 2.2, and GEF* 3.2. Eclipse, in turn, requires a Java* Virtual Machine (JVM). Refer to <Eclipse_home>/readme/readme_eclipse.html (Running Eclipse chapter) for the list of JVMs supported by Eclipse. The Intel PTU package includes all the components listed above.

Installation

For Intel® Performance Tuning Utility installation details, refer to the Installation Guide (INSTALL.txt).

Documentation

The documentation for the Intel Performance Tuning Utility is presented in the following formats:

  • Readme file provides product overview information, lists package content and technical support sites.
  • Installation Guide describes the steps required to install Intel Performance Tuning Utility.
  • Release Notes lists the systems tested for the compatibility with Intel Performance Tuning Utility, describes known issues and product limitations.
  • Command-line help provides short command reference and usage modes. To view the command-line help for the Intel Performance Tuning Utility commands, enter: <command_name> -h.
  • User Guide provides a full-scale product description, including GUI and command-line reference and usage models.
  • Reference Guide provides reference information about instructions, events, and penalties for the supported processors. To access the Reference Guide, go to the Eclipse* Help menu > Help Contents and select the Intel(R) Performance Tuning Utility book from the table of contents.

Related Products and Services

Information on Intel® software development products is available at http://www.intel.com/cd/software/products/asmo-na/eng/index.htm. Visit the following product-related sites for additional information:

  • The Intel® Software College provides training for developers on leading-edge software development technologies. Training consists of online and instructor-led courses covering all Intel architectures, platforms, tools, and technologies.
  • The Intel® compilers enable software to run at top speeds and fully support the latest Intel® processors. Compatible with other tools you use, the Intel compilers integrate into popular development environments and feature source and binary compatibility with other widely-used compilers.
  • The Intel® Performance Library Suite provides a set of routines optimized for various Intel processors.
  • The Intel® Math Kernel Library provides developers of scientific and engineering software with a set of linear algebra, fast Fourier transforms and vector math functions optimized for the latest Intel processors.
  • The Intel® Integrated Performance Primitives consist of cross-platform tools to build high performance software for several Intel architectures and several operating systems.


Known Problems and Limitations

  • To see the correct functions (their names and boundaries) in the Hotspot and Source views, you need a symbol file on Windows or full symbol information in the executable on Linux. Do not pass the -s option to the ld command and do not run the strip command on the executable. For best results, use the -g compiler option on Linux and /Zi on Windows so that debug information is available.
  • The Loop profiler analyzes modules compiled for the native OS architecture only. For example, loops cannot be detected on Linux Intel 64 in modules compiled for Linux IA-32.
  • Intel Performance Tuning Utility does not work correctly for executables that have non-English characters in their names and sources (e.g. non-English comments). Do not use non-English characters in the path to the Intel PTU.
  • For Microsoft* Windows Vista*, you need to be an administrator to work with the Intel Performance Tuning Utility.
  • An application may hang when collection is started under Midnight Commander (MC) on Linux and the collector starts the application using the -- option. This happens because the application conflicts with MC over console (tty) usage. Run the collection outside MC.
  • You cannot invoke the Source View on a Linux system for an experiment collected on a Windows system.
  • This version of the Intel Performance Tuning Utility applies search directories for symbol files (both predefined and user-defined) only when drilling down to Source View. To use symbol files for the sampling, stack sampling, and data profiling views, place the symbol and binary files in the same directory and use the Re-convert command in the GUI or the --re-convert option on the command line.

GUI Problems and Limitations
  • Intel® PTU works normally with collected data of up to 500 MB (database size). Larger data sets can slow the GUI response dramatically. Use the -i option to regulate the size (and level of detail) of Statistical Call Graph collection results.
  • Currently, Statistical Call Graph results do not record the sampled event and Sample After Value (SAV) used as collection configuration parameters. Note this information when setting up the Profiler Configuration, or refer to the Profile Configuration that was used. This will be fixed in a future update.
  • Occasionally a message about insufficient JVM memory is displayed. This happens because the Java* Virtual Machine allocates a fixed amount of memory when Eclipse* starts. You can increase the value with the -vmargs -Xmx512M option when starting Eclipse. In this example, 512 MB are allocated.
  • For Itanium®-based systems, the Control Flow Graph (part of the Source View) does not show branches and cycles.
  • The Linux Intel 64 version of the Intel PTU GUI may fail to start on some operating systems (e.g. on SLES* 10). As a workaround, replace the Eclipse directory under the Intel 64 version of Intel PTU with the Eclipse directory from the IA-32 version of Intel PTU.
  • The Statistical Call Graph GUI view may work incorrectly for stacks with more than 1000 items. Such stacks are usually generated by recursive calls.
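
The JVM memory setting mentioned above can be scripted so that every launch passes the same value. The following Python sketch only builds the command line; the function name eclipse_command and the launcher path "eclipse" are hypothetical, and the sketch assumes the standard Eclipse launcher convention that -vmargs and the JVM options following it come last on the command line.

```python
def eclipse_command(eclipse_path, max_heap_mb=512, extra_args=()):
    """Build an Eclipse launch command that raises the JVM heap limit.

    eclipse_path is a hypothetical location of the Eclipse launcher;
    adjust it to your installation.
    """
    if max_heap_mb <= 0:
        raise ValueError("max_heap_mb must be positive")
    # Launcher options go first. Everything after -vmargs is handed to the
    # Java VM, so -vmargs and -Xmx are placed at the end.
    return [eclipse_path, *extra_args, "-vmargs", f"-Xmx{max_heap_mb}M"]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(eclipse_command("eclipse", 512)))
```

The returned list can be passed to subprocess.run once the launcher path is known.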

Statistical Call Graph Collection Problems and Limitations

  • Intel Performance Tuning Utility supports only time-based Statistical Call Graph profiling on Linux* and measures processor usage only. For example, if your application does a lot of I/O operations, this is not visible in the results. The timer interval is limited by the OS timer granularity and cannot be lower than the value configured in the OS.
  • Intel Performance Tuning Utility Statistical Call Graph cannot profile statically linked executables.
  • On Windows, you can collect Statistical Call Graph data for only one hardware event.
  • On Linux, Intel Performance Tuning Utility Statistical Call Graph (SCG) depends on unwinding information (the unwind table) encoded within the executable, or on the use of frame pointers. By default, GCC and the Intel C compiler do not generate an unwind table for C programs. However, they use EBP as a frame pointer in each function, which is enough for stack unwinding. When optimization options are used, the compilers prefer to generate ESP-based frames together with the unwinding information needed to walk the stack correctly. Unwinding information is located in the .debug_frame and .eh_frame sections; the SCG stack walking algorithm uses the unwind information from the .eh_frame section. The .eh_frame_hdr section is a segment section available through the program header table; it helps locate the .eh_frame section in the address space of the loaded program. Without the .eh_frame_hdr section, the SCG stack walking algorithm cannot locate .eh_frame because it does not parse the raw binary on disk.

    Stack unwinding has known issues on Linux platforms with IA-32 and Intel® 64 architectures, all related to incomplete unwind information:
    • The program is compiled with optimization options, for example -fomit-frame-pointer on GCC or -O2 on the Intel C Compiler. As a result, the compiler can generate ESP-based frames, and stack unwinding does not work if the unwind information is incomplete or absent.
    • Some versions of GCC do not generate the .eh_frame_hdr section on IA-32 architecture even if the -fexceptions option is used.
    • The compiler may generate invalid FDEs (Frame Description Entries) in the unwind table for some address ranges. As a result, Intel Performance Tuning Utility reports an incorrect return address when unwinding from such an address range.
    • A function on the stack may be skipped. A sample may fall in the prolog of the hot function, right before the frame pointer is initialized. If unwind information for the given sample is absent and the previous frame is an EBP-based frame, the caller of the hotspot is skipped during stack unwinding.
    • The caller of a hotspot in glibc (memset, memcpy) or the math library (sin, pow) may be skipped on Linux IA-32. glibc and the math library have incomplete unwinding information for many optimized functions, and those functions do not use frame pointers. As a result, the caller is skipped when unwinding the stack because the frame pointer points to the function above the caller of the hotspot.

    You can resolve some of the stack walking issues described above by generating full unwind information. Use the -fasynchronous-unwind-tables option for GCC and the -fexceptions option for the Intel C compiler. To make sure that your executable (and shared libs) have this information, use the objdump -h <binary> command. You should see the .eh_frame_hdr section there. For C++ programs, exception handling tables are generated by default; however, if you switched off exception handling with the -fno-exceptions option, you need to force generation of exception handling tables or frame pointers. To do this in GCC, use the -fasynchronous-unwind-tables or -fno-omit-frame-pointer option; in ICC you may use only the -fp option.
    If this does not help, reduce the optimization level, if possible.
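
The "skipped caller" issue described above can be reproduced with a toy model of frame-pointer unwinding. The Python sketch below is a simplified illustration and not the SCG algorithm itself; the function name walk_frame_pointers and all addresses are made up. It assumes the usual IA-32 layout in which [ebp] holds the caller's saved EBP and [ebp + 4] holds the return address.

```python
def walk_frame_pointers(memory, ebp, max_depth=64):
    """Return the return addresses found by following the saved-EBP chain.

    memory maps a stack address to the 32-bit value stored there.
    [ebp] is the caller's saved EBP, [ebp + 4] is the return address.
    """
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_depth:
        return_addresses.append(memory[ebp + 4])
        ebp = memory[ebp]
    return return_addresses


# Toy stack for the call chain main -> caller -> hot (illustrative addresses).
RET_INTO_START = 0x0F00   # pushed when the startup code called main
RET_INTO_MAIN = 0x1000    # pushed when main called caller
RET_INTO_CALLER = 0x2000  # pushed when caller called hot

memory = {
    0xFFF0: 0,               # main's frame: saved EBP (0 ends the chain)
    0xFFF4: RET_INTO_START,
    0xFFE0: 0xFFF0,          # caller's frame: saved EBP points to main's frame
    0xFFE4: RET_INTO_MAIN,
    0xFFD0: 0xFFE0,          # hot's frame: saved EBP points to caller's frame
    0xFFD4: RET_INTO_CALLER,
}

# Sample in the body of hot: EBP already points to hot's frame.
in_body = walk_frame_pointers(memory, ebp=0xFFD0)

# Sample in the prolog of hot, before "push ebp; mov ebp, esp" has run:
# EBP still points to caller's frame, so the walk starts one frame too high.
in_prolog = walk_frame_pointers(memory, ebp=0xFFE0)

if __name__ == "__main__":
    print([hex(a) for a in in_body])
    print([hex(a) for a in in_prolog])
```

In the prolog case the return address into caller never appears in the result, so the reconstructed stack goes from hot directly to main. Unwind tables avoid this because they describe how to find the return address at every instruction, including the prolog.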

  • On Windows, Intel Performance Tuning Utility Statistical Call Graph depends on FPO (Frame Pointer Omission) data located within the PDB file, or on the use of frame pointers. Stack unwinding can be improved if a PDB file exists. Use the symchk utility (part of the Debugging Tools for Windows package) to load the PDB file for a system binary, and add the location of the PDB file to the _NT_SYMBOL_PATH environment variable.
    Example of loading the PDB file for kernel32.dll:
    symchk /s srv*c:\symbols*http://msdl.microsoft.com/download/symbols C:\winnt\system32\kernel32.dll
    If this does not help, reduce the optimization level, if possible.
  • The displayed number of samples for functions in Statistical Call Graph results may be incorrect in some cases. A known problem is that self and total samples for recursive functions can be incorrect in the Caller/Callee view.
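
The earlier note that time-based Statistical Call Graph profiling on Linux measures processor usage only, so that I/O waits are not visible, comes down to the difference between processor time and elapsed time. The Python sketch below illustrates that difference with standard library timers; it is not related to the PTU collector, and the helper names measure, wait_like_io, and compute are invented for this example.

```python
import time


def measure(func):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds


def wait_like_io():
    # Stand-in for blocking I/O: the process waits without using the CPU.
    time.sleep(0.2)


def compute():
    # CPU-bound work: processor time grows together with elapsed time.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    for func in (wait_like_io, compute):
        wall, cpu = measure(func)
        print(f"{func.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

A profiler driven by processor time attributes almost nothing to wait_like_io even though it dominates the elapsed time, which is why an I/O-heavy application can look idle in such results.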


Sampling Collection Problems and Limitations

  • Opening the event configuration dialog for sampling collection may take a long time, especially for the Intel® Itanium® 2 processor. This is due to the large number of events and the duplication of event modifiers for each event in the XML file passed from the command line to the GUI.
  • Sampling may not work on the Intel® Itanium® 2 processor with old Linux kernels (for example, Turbolinux 10). You may request updates from the Linux distribution maker.


Heap Profiler, Exact Call Graph, Call Count Problems and Limitations

  • Java-based applications cannot be profiled.
  • Call Graph/Call Count and Heap Profiler cannot profile self-modifying code.
  • Call Graph, Call Count, and Heap Profiler may not work on applications that contain SSE4 instructions.
  • On Windows systems, you cannot profile multi-process applications using the Heap Profiler, exact Call Graph, or Call Count tools.
  • Heap Profiler does not stop collection when you click the stop button under Eclipse*. To finish the data collection, close the application being profiled.
  • Heap Profiler does not produce data if the application being profiled is terminated with CTRL+C.
  • On Linux* systems running on IA-32 architecture, Heap Profiler may show an incomplete stack for application memory allocations and deallocations if it is launched with the --exact=no option on an application compiled with the -fomit-frame-pointer option.
  • On Windows*, you cannot run Call Graph/Call Count and Heap Profiler analysis on systems protected by the McAfee Host Intrusion Prevention* antivirus software. Disable this software first.
  • The trace mode of Heap Profiler can generate gigabytes of results. Make sure you have enough disk space.
  • The 'Trace Children' mode of the Heap Profiler on Linux* handles only the case where exec is called after fork. Profiling an application that calls fork without exec, or exec without fork, does not produce results.
  • The non-exact (or fast) mode of Heap Profiler is available on Linux* operating systems on IA-32 architecture only.
  • The non-exact mode of the Heap Profiler depends on exception handling information encoded within the executable. To ensure your executable (and shared libs) have this information, use the objdump -h <binary> command. You should see .eh_frame_hdr there. Several GCC compilation options affect the presence and content of this section. Possible solutions are:
    • The -fno-exceptions GCC option turns off generation of exception-related code and exception handling tables. If you use this option, enable generation of unwind tables with the -fasynchronous-unwind-tables option.
    • GCC has a bug that shows up when the -fomit-frame-pointer switch is used: GCC removes .eh_frame_hdr. To work around this bug, use the -fasynchronous-unwind-tables option.
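
The objdump -h check described above is easy to automate when many binaries and shared libraries need to be verified. The Python sketch below parses the section listing printed by objdump -h and looks for .eh_frame_hdr. The function names are hypothetical, and the parser assumes the usual GNU objdump layout in which each section row starts with an index number followed by the section name.

```python
import subprocess


def section_names(objdump_output):
    """Extract section names from the text printed by `objdump -h <binary>`.

    Section rows start with an index followed by the name, for example
    "  1 .eh_frame_hdr 0000001c  08048500 ...". Flag rows such as
    "CONTENTS, ALLOC, LOAD" do not start with a number and are skipped.
    """
    names = []
    for line in objdump_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            names.append(fields[1])
    return names


def has_unwind_header(objdump_output):
    """Return True if the .eh_frame_hdr section is listed."""
    return ".eh_frame_hdr" in section_names(objdump_output)


def check_binary(path):
    """Run objdump -h on path (objdump must be installed) and check it."""
    result = subprocess.run(
        ["objdump", "-h", path], capture_output=True, text=True, check=True
    )
    return has_unwind_header(result.stdout)
```

Calling check_binary on the executable and on each shared library it loads gives a quick indication of whether rebuilding with -fasynchronous-unwind-tables is needed.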


Data Access Profiler Problems and Limitations

Data profiling is possible on the following systems:

                                          Windows                        Linux
Processor                                 IA-32   Intel® 64   IA-64      IA-32   Intel® 64   IA-64
--------------------------------------------------------------------------------------------------
Intel® Pentium® 4 processor               +       +                      -       -
Intel® Core™ 2 Duo processor              +       +                      -       +
Intel® Itanium® 2 processor series 9000                       +                              +

On systems with the Intel® Core™ 2 Duo processor, some memory load instructions may use the same register as source and destination, for example mov [rax], ax. Samples that fall on such instructions are ignored by the data profiling view because the data address of the load cannot be calculated in this case.

Feedback and Technical Support

Your feedback is very important to us. To report an issue and receive a technical answer about the tools provided in this product, visit the web site where you got the package. That web page describes the available discussion forums. We do not provide technical support for the tools inside this product.

Diagnostic and Logging

While running, the Intel Performance Tuning Utility logs the experiment workflow. Log files are created in the current user's directory for temporary data. To reach the log location, type:

Linux: cd /tmp/ptu-log-${USER}
Windows: cd %TEMP%/ptu-log-%USERNAME% (or type %TEMP%/ptu-log-%USERNAME% in the Explorer address bar and press Enter)

The ptu-log-<username> folder contains the history of all executed commands in the file history.txt, and folders with command processing details. To give the response team information about a problem, archive the experiment and ptu-log-<username> directories and send them along with the problem report for further investigation.
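
Collecting the log directory for a problem report can be scripted. The Python sketch below derives the log location from the same environment variables used in the commands above and packs a directory into a zip archive. The function names and the archive name are hypothetical, and the zip format is only an assumption, since no archive format is prescribed.

```python
import os
import shutil


def ptu_log_dir(environ=os.environ, is_windows=(os.name == "nt")):
    """Return the expected ptu-log-<username> directory for the current user.

    Uses /tmp/ptu-log-${USER} on Linux and %TEMP%/ptu-log-%USERNAME% on
    Windows, as described above.
    """
    if is_windows:
        return os.path.join(environ["TEMP"], "ptu-log-" + environ["USERNAME"])
    return "/tmp/ptu-log-" + environ["USER"]


def archive_for_report(directory, archive_base):
    """Pack directory into <archive_base>.zip and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=directory)


if __name__ == "__main__":
    log_dir = ptu_log_dir()
    if os.path.isdir(log_dir):
        # "ptu-log-report" is a placeholder archive name.
        print(archive_for_report(log_dir, "ptu-log-report"))
    else:
        print("No log directory found at", log_dir)
```

The same archive_for_report call can be applied to the experiment directory so that both archives accompany the report.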
