Contents

Overview
perf commands
Running applications under perf surveilance
1. HEPSPEC
Results

Overview

The perf command line tool is the interface which can be used by users to utilize the new performance counter subsystem in recent Linux kernels (version > 2.6.30). The new performance counter subsystem of Linux allows for using provides rich abstractions over the PMU hardware capabilities of modern CPUs. It provides per task, per CPU and per-workload counters, counter groups, and it provides sampling capabilities on top of those. It also provides abstraction for 'software events' - such as minor/major page faults, task migrations, task context-switches and tracepoints. The perf tool can be used to optimize, validate and measure applications, workloads or the full system.

perf is at this time only available on the westmere server. Installations and deployment on other platforms/machines is to be discussed.

perf commands

perf invocation uses different commands and particular options to these commands to perfom fine monitoring of performance-relevant information. The full list of commands supported by perf can be seen by issuing the following:

[root@westmere 2006-1.1]# perf help

 usage: perf [--version] [--help] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   diff            Read two perf.data files and display the differential profile
   kmem            Tool to trace/measure kernel memory(slab) properties
   list            List all symbolic event types
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   stat            Run a command and gather performance counter statistics
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   trace           Read perf.data (created by perf record) and display trace output

 See 'perf help COMMAND' for more information on a specific command.

perf list

This command will give a rather legthy list of all supported events that can be observed by the perf utility. A short snippet of the list is shown below:

[user@westmere]# perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                       [Hardware event]
  instructions                               [Hardware event]
  cache-references                           [Hardware event]
  cache-misses                               [Hardware event]
  branch-instructions OR branches            [Hardware event]
  branch-misses                              [Hardware event]
  bus-cycles                                 [Hardware event]

  cpu-clock                                  [Software event]
  task-clock                                 [Software event]
  page-faults OR faults                      [Software event]
  minor-faults                               [Software event]
  major-faults                               [Software event]
  context-switches OR cs                     [Software event]
  cpu-migrations OR migrations               [Software event]
  alignment-faults                           [Software event]
  emulation-faults                           [Software event]

  rNNN                                       [Raw hardware event descriptor]

  L1-dcache-loads                            [Hardware cache event]
  L1-dcache-load-misses                      [Hardware cache event]
  L1-dcache-stores                           [Hardware cache event]
  ...

The rNNN option can be used to persribe the monitoring of a performance counter/event by its hardware code. This is useful because not all Performance Monitoring Unit (PMU) features are covered by the Software and hardware events defined by the perf developers. For example if one wants to read ou the number of SIMD 64-bit integer operations, there is no event for that defined in the perf command line interface. In such cases one can obtain the hexadecimal code for that particular performance event and use it with the rNNN event description, where 'NNN' is substituted with the particular hardware code. An example of this follows in the next section.

perf stat

This command is the most common and easy way to gather performance statistics of an application. After being acquainted with what one can observe as performance counters through invocation of perf list, the events listet there can be used by perf stat. Here is an example how to get the number of instructions a command needed during operation and how many clock cycles that took:

[root@westmere]# perf stat -e instructions -e cycles ls /usr
bin  etc  games  include  lib  lib64  libexec  local  sbin  share  src  sue  tmp  vice

 Performance counter stats for 'ls /usr':

         992698  instructions             #      0.697 IPC
        1425153  cycles

    0.001307494  seconds time elapsed

Besides the '-e' option one can use also additional parameters to the 'perf stat' command:

 usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
    -i, --inherit         child tasks inherit counters
    -p, --pid <n>         stat events on existing pid
    -a, --all-cpus        system-wide collection from all CPUs
    -c, --scale           scale/normalize counters
    -v, --verbose         be more verbose (show counter open errors, etc)
    -r, --repeat <n>      repeat command and print average + stddev (max: 100)
    -n, --null            null run - dont start any counters

For obtaining the list of codes for the hardware events not defined in the event list of perf list one must download and install the libpfm4.so library from here. In the compilation folder there is an examples directory, in which a tool called showeventinfo is located. It will give detailed description of all supported hardware events on the aprticular machine as well as their codes. The process of obtaining these codes is as follows - first download (via git) and build the latest libpfm4 library. Then run the showeventinfo command:

[root@westmere]# git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
[root@westmere]# cd libpfm4
[root@westmere]# make
[root@westmere]# ./examples/showeventinfo

Detected PMU models:
        [17, ix86arch, "Intel X86 architectural PMU", 6 events]
        [50, perf, "perf_events generic PMU", 35 events]
        [52, wsm_dp, "Intel Westmere (dual-socket)", 91 events]
        [53, wsm_unc, "Intel Westmere uncore", 52 events]
Total events: 2090 available, 184 supported
#-----------------------------
IDX      : 872415232
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : UNHALTED_CORE_CYCLES
Equiv    : CPU_CLK_UNHALTED:THREAD_P
Desc     : count core clock cycles whenever the clock signal on the specific core is running (not halted).
Code     : 0x3c
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
#-----------------------------
IDX      : 872415233
PMU name : wsm_dp (Intel Westmere (dual-socket))
Name     : INSTRUCTION_RETIRED
Equiv    : INST_RETIRED:ANY_P
Desc     : count the number of instructions at retirement.
Code     : 0xc0
Modif-00 : 0x00 : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-01 : 0x01 : [k] : monitor at priv level 0 (boolean)
Modif-02 : 0x02 : [i] : invert (boolean)
Modif-03 : 0x03 : [e] : edge level (boolean)
Modif-04 : 0x04 : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : [t] : measure any thread (boolean)
...

A full list obtained from the westmere machine can be viwed here - eventinfo_westmere.txt.

perf record

The command perf record gathers a performance counter profile from it, and write into perf.data - without displaying anything. The perf.data file can then be inspected later on, using perf report. (Other perf commands are designed to read the event data and create report output based on the other command's functionality.)

For example, the following command runs sleep 5, gathers events from all CPUs

-a: system-wide collection from all CPUs.
-f: overwrites data into the default output file perf.data.
-g: does call-graph recording.
-e: selects the target event.

[user@westmere]# perf record -f -a -e cache-misses sleep 5

perf report

Use perf report to output an analysis of perf.data, the report output includes the command, object, and function for the target executable. For example, the following command produces a sorted report of the executable sleep 5 that consume the most time: (This example shows 46% of CPU time spent on command perf). If you use (-g) option in previous perf record command, then you can see cache-misses event info on the call-graph. The output of perf report from our test sleep is:

[user@westmere]# perf report
# Samples: 149952
#
# Overhead          Command                                                               Shared Object  Symbol
# ........  ...............  ..........................................................................  ......
#
    39.23%  perf.2.6.32-44.  [kernel.kallsyms]                                                           [k] list_del
    20.72%  perf.2.6.32-44.  [kernel.kallsyms]                                                           [k] selinux_dentry_open
    15.81%  perf.2.6.32-44.  [kernel.kallsyms]                                                           [k] clear_page_c
     6.80%            sleep  [kernel.kallsyms]                                                           [k] kmem_cache_free
     3.37%             init  [mpt2sas]                                                                   [k] _base_interrupt
     1.72%             init  [kernel.kallsyms]                                                           [k] find_busiest_group
     1.57%             init  [kernel.kallsyms]                                                           [k] run_timer_softirq
     1.09%  perf.2.6.32-44.  [kernel.kallsyms]                                                           [k] _cond_resched
     1.08%             init  [kernel.kallsyms]                                                           [k] __rcu_pending
     0.89%             init  [kernel.kallsyms]                                                           [k] profile_tick
     0.88%             init  [kernel.kallsyms]                                                           [k] __switch_to
     0.88%             init  [kernel.kallsyms]                                                           [k] __rcu_process_callbacks
     0.88%         events/4  [kernel.kallsyms]                                                           [k] mutex_trylock
     0.87%      kondemand/6  [kernel.kallsyms]                                                           [k] cpufreq_cpu_get
     0.77%             init  [kernel.kallsyms]                                                           [k] insert_work
     0.67%      kondemand/0  [kernel.kallsyms]                                                           [k] _spin_lock_irqsave
     0.63%             init  [kernel.kallsyms]                                                           [k] run_posix_cpu_timers
     0.63%        events/11  [kernel.kallsyms]                                                           [k] cache_reap
...

perf annotate

perf annonate reads the input file and displays an annotated version of the code. If the object file has debug symbols then the source code will be displayed alongside assembly code. If there is no debug info in the object, then annotated assembly is displayed.

perf bench

The perf bench is a family of tools that provides general framework for benchmark suites. Currently the perf bench has following several sub-commands:

perf bench sched            # scheduler and IPC mechanism
     - messaging:             Benchmark for scheduler and IPC mechanisms
     - pipe:                  Flood of communication over pipe() between two processes
     - all:                   test all suite (pseudo suite)
perf bench mem              # memory access performance
     - memcpy:                Simple memory copy in various ways
     - all:                   test all suite (pseudo suite)
perf bench all              # test all subsystem (pseudo subsystem)

Here is a quick way to run all available benchmark suite.

[user@westmere]# perf bench all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.073 [sec]

# Running sched/pipe benchmark...
# Extecuted 1000000 pipe operations between two tasks

     Total time: 10.892 [sec]

      10.892100 usecs/op
          91809 ops/sec

# Running mem/memcpy benchmark...
# Copying 1MB Bytes from 0x7f16f0118010 to 0x7f16f0219010 ...

     984.251969 MB/Sec

perf diff

The perf diff command reads two perf.data files and displays the differential profiles. It will be useful to show the delta percentage between two different events info of a same symbol. Following example shows how to do perf diff for the workload sleep 5.

perf kmem

The perf kmem command can be used to trace and measure kernel memory(slab) properties. First execute perf kmem record to record the kmem events of an arbitrary workload into a data file - without displaying anything. For example, the following command uses a key bytes and records workload sleep 5 into the default perf.data file. Then one can use perf kmem stat to get the kernel memory statistics.

[user@westmere]# perf kmem -s bytes record sleep 5
[user@westmere]# perf kmem stat
SUMMARY
=======
Total bytes requested: 66502256
Total bytes allocated: 66568880
Total bytes wasted on internal fragmentation: 66624
Internal fragmentation: 0.100083%
Cross CPU allocations: 86/2245548

perf timechart

The perf timechart command can be used to create a SVG output file which shows how CPU cycles and I/O wait times are distributed across processes in the system over time.

First execute perf timechart record to gather a performance counter profile from it, and write into a data file - without displaying anything. For example, the following command records system-wide cache-misses events (-a -e) from all CPUs for the workload sleep 5, and overwrite (-f) the data into the default perf.data file.

[user@westmere]# perf timechard record -f -a -e cache-misses sleep 5

Then you can use perf timechar to create a SVG output file - result.svg.

[user@westmere]# perf timechart -o result.svg

Running applications under perf surveilance

The best way to run performance measurements with perf is to create a small script in which you can define the environment in which your application should run, as well as the various options and commands to perf. In the next section a sample script for running the High Energy Physics SPEC benchmark is presented.

HEPSPEC

After installing the HEPSPEC benchmark suite as described here, one can use the following script to run performance monitoring tasks on each of the benchmarks, which are part of the HEPSPEC suite.

Right now only 64-bit versions of the benchmark compiled with GCC are supported on SL6.

{i}The following script should be adjusted to the specific environment (HEPSPEC installation, etc.)

{{{#!/bin/bash

cd $HEPSPEC_INSTALL_DIR # specific, to be defined by user source shrc; wait; export LD_LIBRARY_PATH=/usr/local/lib

echo "Using 64 gcc settings" rm -fr ./config/cern.cfg; cp -pr /afs/ifh.de/group/rz/HEPSPEC/spec2k6/linux64-gcc_cern.cfg ./config/cern.cfg; cp -pr /afs/ifh.de/group/rz/HEPSPEC/benchspec_standalone/CPU2006/*.bset $HEPSPEC_INSTALL_DIR/benchspec/CPU2006

for i in astar dealII namd omnetpp povray soplex xalancbmk do

OUTFILE=$i-date +%d-%h-%Y_%H:%M:%S.out runspec --config=cern --action=scrub $i runspec --config=cern --action=build $i perf stat -e r12 -e rFD -e r10 -e rC4 -e rF7 -e r18 -e r14 -e r3C -e r1EC -e r1DB -e rC0 \
- -e cpu-cycles -e instructions -e bus-cycles -e cpu-clock -e task-clock -e page-faults \ -e cs -e migrations -e alignment-faults -e emulation-faults \
  runspec --config=cern --nobuild --noreportable $i 2>&1 >> $OUTFILE;
# Show the names of the specific codes (got them from eventinfo_westmere.txt) echo "SIMD_INT_128 0x12" echo "SIMD_INT_64 0xFD" echo "SSE/MMX micro-ops 0x10" echo "BR_INSTR_RETIRED 0xC4" echo "FP_ASSIST 0xF7" echo "INST_DECODED 0x18" echo "ARITH 0x14" echo "CPU_CLK_UNHALTED 0x3C" echo "THREAD_ACTIVE 0x1EC" echo "UOP_UNFUSION 0x1DB" echo "INSTRUCTION_RETIRED 0xC0"

done; }}}

Results

The perf monitoring tool was used to profile the Performance Monitoring Unit's Registers for all events on four different scientific applications. These are:

ASTRA - A space charge tracking algorithm is a particle tracking code developed by Klaus Flöttmann and others
Corsika and Simtel - part of the astropartcle physics simulation research at the future CTA experiment
ICEsim - part of the simulation code used by the AMANDA research group at DESY Zeuthen
HEPSPEC - benchmark suite specific to High-Energy Physics Simulations

The monitoring of the events was done via skripts for every single application. The scripts for all of the above applications can be found here.

The results of the monitoring can be found in this spreadsheet.

Unlite other software tools for reading out and monitoring performance events (like perfmon and libpfm) the perf tool cannot provide acurate information about the INT and FP instruction counts of the particular applications.

Wiki

Page