----
= Overview =

The '''perf''' command line tool is the interface through which users utilize the performance counter subsystem of recent Linux kernels (versions newer than 2.6.30). This subsystem provides rich abstractions over the PMU hardware capabilities of modern CPUs. It provides per-task, per-CPU and per-workload counters as well as counter groups, and it provides sampling capabilities on top of those. It also provides abstractions for 'software events' such as minor/major page faults, task migrations, task context switches and tracepoints. The '''perf''' tool can be used to optimize, validate and measure applications, workloads or the full system.

{i} '''perf''' is at this time only available on the westmere server. Installation and deployment on other platforms/machines are still to be discussed.

= perf commands =

'''perf''' is invoked with different commands, and with particular options to these commands, to perform fine-grained monitoring of performance-relevant information. The full list of commands supported by '''perf''' can be seen by issuing the following:

{{{
[root@westmere 2006-1.1]# perf help

 usage: perf [--version] [--help] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   diff            Read two perf.data files and display the differential profile
   kmem            Tool to trace/measure kernel memory(slab) properties
   list            List all symbolic event types
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   stat            Run a command and gather performance counter statistics
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           Read perf.data (created by perf record) and display trace output

 See 'perf help COMMAND' for more information on a specific command.
}}}

== perf list ==

This command gives a rather lengthy list of all supported events that can be observed with the '''perf''' utility. A short snippet of the list is shown below:

{{{
[user@westmere]# perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                     [Hardware event]
  instructions                             [Hardware event]
  cache-references                         [Hardware event]
  cache-misses                             [Hardware event]
  branch-instructions OR branches          [Hardware event]
  branch-misses                            [Hardware event]
  bus-cycles                               [Hardware event]

  cpu-clock                                [Software event]
  task-clock                               [Software event]
  page-faults OR faults                    [Software event]
  minor-faults                             [Software event]
  major-faults                             [Software event]
  context-switches OR cs                   [Software event]
  cpu-migrations OR migrations             [Software event]
  alignment-faults                         [Software event]
  emulation-faults                         [Software event]

  rNNN                                     [Raw hardware event descriptor]

  L1-dcache-loads                          [Hardware cache event]
  L1-dcache-load-misses                    [Hardware cache event]
  L1-dcache-stores                         [Hardware cache event]
  ...
}}}

The '''rNNN''' descriptor can be used to prescribe the monitoring of a performance counter/event by its raw hardware code. This is useful because not all Performance Monitoring Unit (PMU) features are covered by the software and hardware events predefined by the '''perf''' developers.
For example, if one wants to read out the number of SIMD 64-bit integer operations, there is no predefined event for that in the '''perf''' command line interface. In such cases one can obtain the hexadecimal code of the particular performance event and use it with the '''rNNN''' event descriptor, where 'NNN' is substituted with that hardware code. An example of this follows in the next section.

== perf stat ==

This command is the most common and easiest way to gather performance statistics of an application. After becoming acquainted with the observable performance counters through '''perf list''', the events listed there can be used with '''perf stat'''. Here is an example of how to get the number of instructions a command executed during operation and how many clock cycles that took:

{{{
[root@westmere]# perf stat -e instructions -e cycles ls /usr
bin  etc  games  include  lib  lib64  libexec  local  sbin  share  src  sue  tmp  vice

 Performance counter stats for 'ls /usr':

          992698  instructions             #      0.697 IPC
         1425153  cycles

    0.001307494  seconds time elapsed
}}}

Besides the '-e' option one can also pass additional parameters to the 'perf stat' command:

{{{
 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
    -i, --inherit         child tasks inherit counters
    -p, --pid <pid>       stat events on existing pid
    -a, --all-cpus        system-wide collection from all CPUs
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters
}}}

To obtain the codes of the hardware events not defined in the event list of '''perf list''', one must download and install the ''libpfm4'' library from [[http://perfmon2.sourceforge.net/|here]]. In the compilation folder there is an '''examples''' directory, in which a tool called '''showeventinfo''' is located.
It gives a detailed description of all supported hardware events on the particular machine, as well as their codes. The process of obtaining these codes is as follows: first download (via git) and build the latest ''libpfm4'' library, then run the '''showeventinfo''' command:

{{{
[root@westmere]# git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
[root@westmere]# cd libpfm4
[root@westmere]# make
[root@westmere]# ./examples/showeventinfo
Detected PMU models:
        [17, ix86arch, "Intel X86 architectural PMU", 6 events]
        [50, perf, "perf_events generic PMU", 35 events]
        [52, wsm_dp, "Intel Westmere (dual-socket)", 91 events]
        [53, wsm_unc, "Intel Westmere uncore", 52 events]
Total events: 2090 available, 184 supported
#-----------------------------
IDX      : 872415232
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : UNHALTED_CORE_CYCLES
Equiv    : CPU_CLK_UNHALTED:THREAD_P
Desc     : count core clock cycles whenever the clock signal on the specific core is running (not halted)
Code     : 0x3c
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
#-----------------------------
IDX      : 872415233
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : INSTRUCTION_RETIRED
Equiv    : INST_RETIRED:ANY_P
Desc     : count the number of instructions at retirement
Code     : 0xc0
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
...
}}}

A full list obtained from the westmere machine can be viewed here - [[attachment:eventinfo_westmere.txt]].
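Putting the pieces together: the '''Code''' value reported by '''showeventinfo''' is what goes after the '''r''' in the '''rNNN''' descriptor. The following sketch only assembles and prints the corresponding '''perf stat''' command line rather than running perf itself, since the raw codes (here 3C and C0, taken from the output above) are only valid on a machine with the matching PMU:

{{{#!/bin/bash
# Assemble a 'perf stat' invocation from raw hardware event codes.
# 3C = UNHALTED_CORE_CYCLES, C0 = INSTRUCTION_RETIRED
# (codes taken from the showeventinfo output above).
CODES="3C C0"
EVENTS=""
for code in $CODES; do
    EVENTS="$EVENTS -e r$code"
done
echo "perf stat$EVENTS ls /usr"
# prints: perf stat -e r3C -e rC0 ls /usr
}}}

On the westmere machine the printed command can then be executed directly.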
== perf record ==

The '''perf record''' command runs a workload, gathers a performance counter profile from it, and writes the data into '''perf.data''' - without displaying anything. The '''perf.data''' file can then be inspected later using '''perf report''' (other perf commands can also read the event data and create output according to their own functionality). For example, the following command runs '''sleep 5''' and gathers '''cache-misses''' events from all CPUs:

 * '''-a''': system-wide collection from all CPUs.
 * '''-f''': overwrite the default output file '''perf.data'''.
 * '''-g''': do call-graph recording.
 * '''-e''': select the target event.

{{{
[user@westmere]# perf record -f -a -e cache-misses sleep 5
}}}

== perf report ==

Use '''perf report''' to output an analysis of '''perf.data'''; the report includes the command, shared object, and function for each sample. For example, the following command produces a report of the '''sleep 5''' run, sorted by the symbols that consumed the most time (in this example a large share of the samples is attributed to the '''perf''' command itself). If the '''-g''' option was used with the previous '''perf record''' command, the '''cache-misses''' event information is displayed as a call graph. The output of '''perf report''' for our '''sleep''' test is:

{{{
[user@westmere]# perf report
# Samples: 149952
#
# Overhead          Command      Shared Object  Symbol
# ........  ...............  .................  ......
#
    39.23%  perf.2.6.32-44.  [kernel.kallsyms]  [k] list_del
    20.72%  perf.2.6.32-44.  [kernel.kallsyms]  [k] selinux_dentry_open
    15.81%  perf.2.6.32-44.  [kernel.kallsyms]  [k] clear_page_c
     6.80%  sleep            [kernel.kallsyms]  [k] kmem_cache_free
     3.37%  init             [mpt2sas]          [k] _base_interrupt
     1.72%  init             [kernel.kallsyms]  [k] find_busiest_group
     1.57%  init             [kernel.kallsyms]  [k] run_timer_softirq
     1.09%  perf.2.6.32-44.  [kernel.kallsyms]  [k] _cond_resched
     1.08%  init             [kernel.kallsyms]  [k] __rcu_pending
     0.89%  init             [kernel.kallsyms]  [k] profile_tick
     0.88%  init             [kernel.kallsyms]  [k] __switch_to
     0.88%  init             [kernel.kallsyms]  [k] __rcu_process_callbacks
     0.88%  events/4         [kernel.kallsyms]  [k] mutex_trylock
     0.87%  kondemand/6      [kernel.kallsyms]  [k] cpufreq_cpu_get
     0.77%  init             [kernel.kallsyms]  [k] insert_work
     0.67%  kondemand/0      [kernel.kallsyms]  [k] _spin_lock_irqsave
     0.63%  init             [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.63%  events/11        [kernel.kallsyms]  [k] cache_reap
...
}}}

== perf annotate ==

'''perf annotate''' reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly code; if there is no debug info in the object, only annotated assembly is displayed.

== perf bench ==

'''perf bench''' is a family of tools that provides a general framework for benchmark suites. Currently '''perf bench''' has the following sub-commands:

{{{
perf bench sched   # scheduler and IPC mechanisms
   - messaging: Benchmark for scheduler and IPC mechanisms
   - pipe: Flood of communication over pipe() between two processes
   - all: test all suite (pseudo suite)

perf bench mem     # memory access performance
   - memcpy: Simple memory copy in various ways
   - all: test all suite (pseudo suite)

perf bench all     # test all subsystems (pseudo subsystem)
}}}

Here is a quick way to run all available benchmark suites:

{{{
[user@westmere]# perf bench all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.073 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two tasks

     Total time: 10.892 [sec]

       10.892100 usecs/op
           91809 ops/sec

# Running mem/memcpy benchmark...
# Copying 1MB Bytes from 0x7f16f0118010 to 0x7f16f0219010 ...
       984.251969 MB/Sec
}}}

== perf diff ==

The '''perf diff''' command reads two perf.data files and displays the differential profile. It is useful for showing the delta (in percent) between the event counts of the same symbol in two different runs, e.g. by recording the workload '''sleep 5''' twice and running '''perf diff''' on the two resulting data files.

== perf kmem ==

The '''perf kmem''' command can be used to trace and measure kernel memory (slab) properties. First execute '''perf kmem record''' to record the kmem events of an arbitrary workload into a data file - without displaying anything. For example, the following command uses the sort key '''bytes''' and records the workload '''sleep 5''' into the default '''perf.data''' file. One can then use '''perf kmem stat''' to get the kernel memory statistics:

{{{
[user@westmere]# perf kmem -s bytes record sleep 5
[user@westmere]# perf kmem stat

SUMMARY
=======
Total bytes requested: 66502256
Total bytes allocated: 66568880
Total bytes wasted on internal fragmentation: 66624
Internal fragmentation: 0.100083%
Cross CPU allocations: 86/2245548
}}}

== perf timechart ==

The '''perf timechart''' command creates an '''SVG''' output file which shows how CPU cycles and I/O wait times are distributed across the processes in the system over time. First execute '''perf timechart record''' to gather a performance counter profile and write it into a data file - without displaying anything. For example, the following command records system-wide '''cache-misses''' events ('''-a -e''') from all CPUs for the workload '''sleep 5''' and overwrites ('''-f''') the default '''perf.data''' file:

{{{
[user@westmere]# perf timechart record -f -a -e cache-misses sleep 5
}}}

Then '''perf timechart''' can be used to create an '''SVG''' output file - '''result.svg''':
{{{
[user@westmere]# perf timechart -o result.svg
}}}

= Running applications under perf surveillance =

The best way to run performance measurements with '''perf''' is to create a small script in which you define the environment in which your application should run, as well as the various options and commands passed to '''perf'''. In the next section a sample script for running the High Energy Physics SPEC benchmark is presented.

== HEPSPEC ==

After installing the HEPSPEC benchmark suite as described [[HEPSPEC|here]], one can use the following script to run performance monitoring tasks on each of the benchmarks which are part of the HEPSPEC suite.

{i} Right now only 64-bit versions of the benchmarks compiled with GCC are supported on SL6.

{i} The following script should be adjusted to the specific environment (HEPSPEC installation directory, etc.).

{{{#!/bin/bash
cd $HEPSPEC_INSTALL_DIR   # specific, to be defined by user
source shrc; wait;
export LD_LIBRARY_PATH=/usr/local/lib
echo "Using 64 gcc settings"
rm -fr ./config/cern.cfg;
cp -pr /afs/ifh.de/group/rz/HEPSPEC/spec2k6/linux64-gcc_cern.cfg ./config/cern.cfg;
cp -pr /afs/ifh.de/group/rz/HEPSPEC/benchspec_standalone/CPU2006/*.bset $HEPSPEC_INSTALL_DIR/benchspec/CPU2006

for i in astar dealII namd omnetpp povray soplex xalancbmk
do
    OUTFILE=$i-`date +%d-%h-%Y_%H:%M:%S`.out
    runspec --config=cern --action=scrub $i
    runspec --config=cern --action=build $i
    # perf stat prints its counters to stderr, so redirect stderr into the
    # output file as well
    perf stat -e r12 -e rFD -e r10 -e rC4 -e rF7 -e r18 -e r14 -e r3C -e r1EC -e r1DB -e rC0 \
        -e cpu-cycles -e instructions -e bus-cycles -e cpu-clock -e task-clock -e page-faults \
        -e cs -e migrations -e alignment-faults -e emulation-faults \
        runspec --config=cern --nobuild --noreportable $i >> $OUTFILE 2>&1;
    # Show the names of the specific codes (taken from eventinfo_westmere.txt)
    echo "SIMD_INT_128 0x12"
    echo "SIMD_INT_64 0xFD"
    echo "SSE/MMX micro-ops 0x10"
    echo "BR_INSTR_RETIRED 0xC4"
    echo "FP_ASSIST 0xF7"
    echo "INST_DECODED 0x18"
    echo "ARITH 0x14"
    echo "CPU_CLK_UNHALTED 0x3C"
    echo "THREAD_ACTIVE 0x1EC"
    echo "UOP_UNFUSION 0x1DB"
    echo "INSTRUCTION_RETIRED 0xC0"
done;
}}}

= Results =

The perf monitoring tool was used to profile the Performance Monitoring Unit registers for all events on four different scientific applications:

 * ASTRA - ''A Space Charge Tracking Algorithm'' is a particle tracking code developed by Klaus Flöttmann and others
 * Corsika and Simtel - part of the astroparticle physics simulation research for the future CTA experiment
 * ICEsim - part of the simulation code used by the AMANDA research group at DESY Zeuthen
 * HEPSPEC - benchmark suite specific to High-Energy Physics simulations

The monitoring of the events was done via scripts for every single application. The scripts for all of the above applications can be found [[http://www.ifh.de/~boyanov/perf/|here]]. The results of the monitoring can be found in this [[attachment:PERF.ods|spreadsheet]].

Unlike other software tools for reading out and monitoring performance events (such as perfmon and libpfm), the perf tool cannot provide accurate information about the INT and FP instruction counts of the particular applications.
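Since the HEPSPEC script above appends the raw '''perf stat''' output for each benchmark to a per-benchmark file, the counter values can be pulled out afterwards, e.g. for the spreadsheet. A minimal sketch, assuming the file follows the '''perf stat''' output format shown earlier on this page (the sample contents below are copied from that example; the temporary file stands in for a real $OUTFILE):

{{{#!/bin/bash
# Extract "<value> <event>" pairs from a perf stat output file.
# The sample input mimics the perf stat format shown earlier.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 Performance counter stats for 'ls /usr':

          992698  instructions             #      0.697 IPC
         1425153  cycles

    0.001307494  seconds time elapsed
EOF
# Keep only lines whose first field is a plain integer (the counters);
# the elapsed-time line has a fractional first field and is skipped.
awk '$1 ~ /^[0-9]+$/ { print $1, $2 }' "$sample"
rm -f "$sample"
}}}

For the sample input this prints `992698 instructions` and `1425153 cycles`, one pair per line, which is easy to paste into a spreadsheet.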