----
= Overview =

The '''perf''' command line tool is the interface through which users utilize the performance counter subsystem of recent Linux kernels (versions newer than 2.6.30). This subsystem provides rich abstractions over the PMU hardware capabilities of modern CPUs. It provides per-task, per-CPU and per-workload counters as well as counter groups, and it provides sampling capabilities on top of those. It also provides abstractions for 'software events' such as minor/major page faults, task migrations, task context switches and tracepoints. The '''perf''' tool can be used to optimize, validate and measure applications, workloads or the full system.

{i} '''perf''' is at this time only available on the westmere server. Installation and deployment on other platforms/machines are still to be discussed.

= perf commands =

'''perf''' is invoked with different commands, and with particular options to these commands, to perform fine-grained monitoring of performance-relevant information. The full list of commands supported by '''perf''' can be seen by issuing the following:

{{{
[root@westmere 2006-1.1]# perf help

 usage: perf [--version] [--help] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   diff            Read two perf.data files and display the differential profile
   kmem            Tool to trace/measure kernel memory(slab) properties
   list            List all symbolic event types
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   stat            Run a command and gather performance counter statistics
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           Read perf.data (created by perf record) and display trace output

 See 'perf help COMMAND' for more information on a specific command.
}}}

== perf list ==

This command gives a rather lengthy list of all supported events that can be observed with the '''perf''' utility. A short snippet of the list is shown below:

{{{
[user@westmere]# perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                     [Hardware event]
  instructions                             [Hardware event]
  cache-references                         [Hardware event]
  cache-misses                             [Hardware event]
  branch-instructions OR branches          [Hardware event]
  branch-misses                            [Hardware event]
  bus-cycles                               [Hardware event]

  cpu-clock                                [Software event]
  task-clock                               [Software event]
  page-faults OR faults                    [Software event]
  minor-faults                             [Software event]
  major-faults                             [Software event]
  context-switches OR cs                   [Software event]
  cpu-migrations OR migrations             [Software event]
  alignment-faults                         [Software event]
  emulation-faults                         [Software event]

  rNNN                                     [Raw hardware event descriptor]

  L1-dcache-loads                          [Hardware cache event]
  L1-dcache-load-misses                    [Hardware cache event]
  L1-dcache-stores                         [Hardware cache event]
  ...
}}}

The '''rNNN''' descriptor can be used to prescribe the monitoring of a performance counter/event by its raw hardware code. This is useful because not all Performance Monitoring Unit (PMU) features are covered by the software and hardware events predefined by the '''perf''' developers.
For example, if one wants to read out the number of SIMD 64-bit integer operations, there is no predefined event for that in the '''perf''' command line interface. In such cases one can obtain the hexadecimal code of the particular performance event and use it with the '''rNNN''' event descriptor, where 'NNN' is substituted with that hardware code. An example of this follows in the next section.

== perf stat ==

This command is the most common and easiest way to gather performance statistics of an application. After becoming acquainted with the observable performance counters through '''perf list''', the events listed there can be used with '''perf stat'''. Here is an example of how to get the number of instructions a command executed during operation and how many clock cycles that took:

{{{
[root@westmere]# perf stat -e instructions -e cycles ls /usr
bin  etc  games  include  lib  lib64  libexec  local  sbin  share  src  sue  tmp  vice

 Performance counter stats for 'ls /usr':

          992698  instructions             #      0.697 IPC
         1425153  cycles

    0.001307494  seconds time elapsed
}}}

Besides the '-e' option one can also pass additional parameters to the 'perf stat' command:

{{{
 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
    -i, --inherit         child tasks inherit counters
    -p, --pid <pid>       stat events on existing pid
    -a, --all-cpus        system-wide collection from all CPUs
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters
}}}

To obtain the codes of the hardware events not defined in the event list of '''perf list''', one must download and install the ''libpfm4'' library from [[http://perfmon2.sourceforge.net/|here]]. In the compilation folder there is an '''examples''' directory, in which a tool called '''showeventinfo''' is located.
It gives a detailed description of all supported hardware events on the particular machine, as well as their codes. The process of obtaining these codes is as follows: first download (via git) and build the latest ''libpfm4'' library, then run the '''showeventinfo''' command:

{{{
[root@westmere]# git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
[root@westmere]# cd libpfm4
[root@westmere]# make
[root@westmere]# ./examples/showeventinfo
Detected PMU models:
        [17, ix86arch, "Intel X86 architectural PMU", 6 events]
        [50, perf, "perf_events generic PMU", 35 events]
        [52, wsm_dp, "Intel Westmere (dual-socket)", 91 events]
        [53, wsm_unc, "Intel Westmere uncore", 52 events]
Total events: 2090 available, 184 supported
#-----------------------------
IDX      : 872415232
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : UNHALTED_CORE_CYCLES
Equiv    : CPU_CLK_UNHALTED:THREAD_P
Desc     : count core clock cycles whenever the clock signal on the specific core is running (not halted)
Code     : 0x3c
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
#-----------------------------
IDX      : 872415233
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : INSTRUCTION_RETIRED
Equiv    : INST_RETIRED:ANY_P
Desc     : count the number of instructions at retirement
Code     : 0xc0
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
...
}}}

A full list obtained from the westmere machine can be viewed here - [[attachment:eventinfo_westmere.txt]].
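Putting the pieces together: the '''Code''' value reported by '''showeventinfo''' is what goes after the '''r''' in the '''rNNN''' descriptor. The following sketch only assembles and prints the corresponding '''perf stat''' command line rather than running perf itself, since the raw codes (here 3C and C0, taken from the output above) are only valid on a machine with the matching PMU:

{{{#!/bin/bash
# Assemble a 'perf stat' invocation from raw hardware event codes.
# 3C = UNHALTED_CORE_CYCLES, C0 = INSTRUCTION_RETIRED
# (codes taken from the showeventinfo output above).
CODES="3C C0"
EVENTS=""
for code in $CODES; do
    EVENTS="$EVENTS -e r$code"
done
echo "perf stat$EVENTS ls /usr"
# prints: perf stat -e r3C -e rC0 ls /usr
}}}

On the westmere machine the printed command can then be executed directly.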
== perf record ==

The '''perf record''' command runs a workload, gathers a performance counter profile from it, and writes the data into '''perf.data''' - without displaying anything. The '''perf.data''' file can then be inspected later using '''perf report''' (other perf commands can also read the event data and create output according to their own functionality). For example, the following command runs '''sleep 5''' and gathers '''cache-misses''' events from all CPUs:

 * '''-a''': system-wide collection from all CPUs.
 * '''-f''': overwrite the default output file '''perf.data'''.
 * '''-g''': do call-graph recording.
 * '''-e''': select the target event.

{{{
[user@westmere]# perf record -f -a -e cache-misses sleep 5
}}}

== perf report ==

Use '''perf report''' to output an analysis of '''perf.data'''; the report includes the command, shared object, and function for each sample. For example, the following command produces a report of the '''sleep 5''' run, sorted by the symbols that consumed the most time (in this example a large share of the samples is attributed to the '''perf''' command itself). If the '''-g''' option was used with the previous '''perf record''' command, the '''cache-misses''' event information is displayed as a call graph. The output of '''perf report''' for our '''sleep''' test is:

{{{
[user@westmere]# perf report
# Samples: 149952
#
# Overhead          Command      Shared Object  Symbol
# ........  ...............  .................  ......
#
    39.23%  perf.2.6.32-44.  [kernel.kallsyms]  [k] list_del
    20.72%  perf.2.6.32-44.  [kernel.kallsyms]  [k] selinux_dentry_open
    15.81%  perf.2.6.32-44.  [kernel.kallsyms]  [k] clear_page_c
     6.80%  sleep            [kernel.kallsyms]  [k] kmem_cache_free
     3.37%  init             [mpt2sas]          [k] _base_interrupt
     1.72%  init             [kernel.kallsyms]  [k] find_busiest_group
     1.57%  init             [kernel.kallsyms]  [k] run_timer_softirq
     1.09%  perf.2.6.32-44.  [kernel.kallsyms]  [k] _cond_resched
     1.08%  init             [kernel.kallsyms]  [k] __rcu_pending
     0.89%  init             [kernel.kallsyms]  [k] profile_tick
     0.88%  init             [kernel.kallsyms]  [k] __switch_to
     0.88%  init             [kernel.kallsyms]  [k] __rcu_process_callbacks
     0.88%  events/4         [kernel.kallsyms]  [k] mutex_trylock
     0.87%  kondemand/6      [kernel.kallsyms]  [k] cpufreq_cpu_get
     0.77%  init             [kernel.kallsyms]  [k] insert_work
     0.67%  kondemand/0      [kernel.kallsyms]  [k] _spin_lock_irqsave
     0.63%  init             [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.63%  events/11        [kernel.kallsyms]  [k] cache_reap
...
}}}

== perf annotate ==

'''perf annotate''' reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly code; if there is no debug info in the object, only annotated assembly is displayed.

== perf bench ==

'''perf bench''' is a family of tools that provides a general framework for benchmark suites. Currently '''perf bench''' has the following sub-commands:

{{{
perf bench sched   # scheduler and IPC mechanisms
   - messaging: Benchmark for scheduler and IPC mechanisms
   - pipe: Flood of communication over pipe() between two processes
   - all: test all suite (pseudo suite)

perf bench mem     # memory access performance
   - memcpy: Simple memory copy in various ways
   - all: test all suite (pseudo suite)

perf bench all     # test all subsystems (pseudo subsystem)
}}}

Here is a quick way to run all available benchmark suites:

{{{
[user@westmere]# perf bench all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.073 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 10.892 [sec]

       10.892100 usecs/op
           91809 ops/sec

# Running mem/memcpy benchmark...
# Copying 1MB Bytes from 0x7f16f0118010 to 0x7f16f0219010 ...
       984.251969 MB/Sec
}}}

== perf diff ==

The '''perf diff''' command reads two perf.data files and displays the differential profile. It is useful for showing the delta (in percent) between the event counts of the same symbol in two different runs, e.g. by recording the workload '''sleep 5''' twice and running '''perf diff''' on the two resulting data files.

== perf kmem ==

The '''perf kmem''' command can be used to trace and measure kernel memory (slab) properties. First execute '''perf kmem record''' to record the kmem events of an arbitrary workload into a data file - without displaying anything. For example, the following command uses the sort key '''bytes''' and records the workload '''sleep 5''' into the default '''perf.data''' file. One can then use '''perf kmem stat''' to get the kernel memory statistics:

{{{
[user@westmere]# perf kmem -s bytes record sleep 5
[user@westmere]# perf kmem stat

SUMMARY
=======
Total bytes requested: 66502256
Total bytes allocated: 66568880
Total bytes wasted on internal fragmentation: 66624
Internal fragmentation: 0.100083%
Cross CPU allocations: 86/2245548
}}}

== perf timechart ==

The '''perf timechart''' command creates an '''SVG''' output file which shows how CPU cycles and I/O wait times are distributed across the processes in the system over time. First execute '''perf timechart record''' to gather a performance counter profile and write it into a data file - without displaying anything. For example, the following command records system-wide '''cache-misses''' events ('''-a -e''') from all CPUs for the workload '''sleep 5''' and overwrites ('''-f''') the default '''perf.data''' file:

{{{
[user@westmere]# perf timechart record -f -a -e cache-misses sleep 5
}}}

Then '''perf timechart''' can be used to create an '''SVG''' output file - '''result.svg''':
{{{
[user@westmere]# perf timechart -o result.svg
}}}

= Running applications under perf surveillance =

The best way to run performance measurements with '''perf''' is to create a small script in which you define the environment in which your application should run, as well as the various options and commands passed to '''perf'''. In the next section a sample script for running the High Energy Physics SPEC benchmark is presented.

== HEPSPEC ==

After installing the HEPSPEC benchmark suite as described [[HEPSPEC|here]], one can use the following script to run performance monitoring tasks on each of the benchmarks which are part of the HEPSPEC suite.

{i} Right now only 64-bit versions of the benchmarks compiled with GCC are supported on SL6.

{i} The following script should be adjusted to the specific environment (HEPSPEC installation directory, etc.).

{{{#!/bin/bash
cd $HEPSPEC_INSTALL_DIR   # specific, to be defined by user
source shrc; wait;
export LD_LIBRARY_PATH=/usr/local/lib
echo "Using 64 gcc settings"
rm -fr ./config/cern.cfg;
cp -pr /afs/ifh.de/group/rz/HEPSPEC/spec2k6/linux64-gcc_cern.cfg ./config/cern.cfg;
cp -pr /afs/ifh.de/group/rz/HEPSPEC/benchspec_standalone/CPU2006/*.bset $HEPSPEC_INSTALL_DIR/benchspec/CPU2006

for i in astar dealII namd omnetpp povray soplex xalancbmk
do
    OUTFILE=$i-`date +%d-%h-%Y_%H:%M:%S`.out
    runspec --config=cern --action=scrub $i
    runspec --config=cern --action=build $i
    # perf stat prints its counters to stderr, so redirect stderr into the
    # output file as well
    perf stat -e r12 -e rFD -e r10 -e rC4 -e rF7 -e r18 -e r14 -e r3C -e r1EC -e r1DB -e rC0 \
        -e cpu-cycles -e instructions -e bus-cycles -e cpu-clock -e task-clock -e page-faults \
        -e cs -e migrations -e alignment-faults -e emulation-faults \
        runspec --config=cern --nobuild --noreportable $i >> $OUTFILE 2>&1;
    # Show the names of the specific codes (taken from eventinfo_westmere.txt)
    echo "SIMD_INT_128 0x12"
    echo "SIMD_INT_64 0xFD"
    echo "SSE/MMX micro-ops 0x10"
    echo "BR_INSTR_RETIRED 0xC4"
    echo "FP_ASSIST 0xF7"
    echo "INST_DECODED 0x18"
    echo "ARITH 0x14"
    echo "CPU_CLK_UNHALTED 0x3C"
    echo "THREAD_ACTIVE 0x1EC"
    echo "UOP_UNFUSION 0x1DB"
    echo "INSTRUCTION_RETIRED 0xC0"
done;
}}}

= Results =

The perf monitoring tool was used to profile the Performance Monitoring Unit registers for all events on four different scientific applications:

 * ASTRA - ''A Space Charge Tracking Algorithm'' is a particle tracking code developed by Klaus Flöttmann and others
 * Corsika and Simtel - part of the astroparticle physics simulation research for the future CTA experiment
 * ICEsim - part of the simulation code used by the AMANDA research group at DESY Zeuthen
 * HEPSPEC - benchmark suite specific to High-Energy Physics simulations

The monitoring of the events was done via scripts for every single application. The scripts for all of the above applications can be found [[http://www.ifh.de/~boyanov/perf/|here]]. The results of the monitoring can be found in this [[attachment:PERF.ods|spreadsheet]].

Unlike other software tools for reading out and monitoring performance events (such as perfmon and libpfm), the perf tool cannot provide accurate information about the INT and FP instruction counts of the particular applications.
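Since the HEPSPEC script above appends the raw '''perf stat''' output for each benchmark to a per-benchmark file, the counter values can be pulled out afterwards, e.g. for the spreadsheet. A minimal sketch, assuming the file follows the '''perf stat''' output format shown earlier on this page (the sample contents below are copied from that example; the temporary file stands in for a real $OUTFILE):

{{{#!/bin/bash
# Extract "<value> <event>" pairs from a perf stat output file.
# The sample input mimics the perf stat format shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 Performance counter stats for 'ls /usr':

          992698  instructions             #      0.697 IPC
         1425153  cycles

    0.001307494  seconds time elapsed
EOF
# Keep only lines whose first field is a plain integer (the counters);
# the elapsed-time line has a fractional first field and is skipped.
awk '$1 ~ /^[0-9]+$/ { print $1, $2 }' "$sample"
rm -f "$sample"
}}}

For the sample input this prints `992698 instructions` and `1425153 cycles`, one pair per line, which is easy to paste into a spreadsheet.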