----
= Overview =

'''''pfmon''''' is a tool for monitoring and gathering performance data during application runtime. It takes advantage of the Performance Monitoring Unit (PMU) hardware features available in modern processor microarchitectures (such as Intel's Nehalem). It can deliver very detailed information based on more than 120 events related to conditions occurring at the microarchitecture (core) level as well as at the uncore hardware level.

The main purpose of deploying the '''''pfmon''''' utility at DESY, Zeuthen is to measure the ratio of integer to floating point operations in various scientific applications. The idea behind such a measurement is to estimate how well a common benchmarking suite (for example HEPSPEC), for which this ratio is more or less stable, can be used to assess computing infrastructure for running a wide spectrum of applications. '''''pfmon''''' has very little system overhead and allows both counting and sampling of events. Perfmon documentation is available [[http://perfmon2.sourceforge.net/pfmon_usersguide.html|here]].

{i} Under test now is the 'perf' command line tool, which is available with the next release of Scientific Linux 6. More about it here - [[perf]].

= Using the Perfmon utility from command line =

Perfmon is installed only on the Westmere machine and can be used after setting up the working environment. This is done in the following manner:

{{{
[photon] # export LD_LIBRARY_PATH=/usr/local/lib
[photon] # which pfmon
}}}

The output of the last command should show you the location of the pfmon executable. The pfmon application provides an interface to the Performance Monitoring Unit (PMU) for observing, counting and sampling different hardware events. The list of all events supported on the given processor architecture can be obtained with the following command:

{{{
[photon] # pfmon -l
UNHALTED_CORE_CYCLES
INSTRUCTIONS_RETIRED
UNHALTED_REFERENCE_CYCLES
LAST_LEVEL_CACHE_REFERENCES
LAST_LEVEL_CACHE_MISSES
BRANCH_INSTRUCTIONS_RETIRED
...
}}}

A description of each event and the possible masks which can be applied to it (that is, the different circumstances under which the event will be reported) can be obtained by invoking the "pfmon -i <event>" command. For example:

{{{
[photon] # pfmon -i UNHALTED_CORE_CYCLES
Name    : UNHALTED_CORE_CYCLES
Code    : 0x3c
Counters: [ 0 1 2 3 17 ]
Desc    : count core clock cycles whenever the clock signal on the specific core is running (not halted). Alias to event CPU_CLK_UNHALTED:THREAD
PEBS    : No
Uncore  : No
}}}

There are numerous options which one can supply as parameters to '''''pfmon'''''. These are described in detail [[http://perfmon2.sourceforge.net/pfmon_usersguide.html#options|here]].
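As a quick first check, a single command can be run under pfmon directly, before moving on to the full script below. This is a minimal sketch, using the ''-e'' syntax from the script that follows; the exact output layout depends on the pfmon version:

{{{
# Count core cycles and retired instructions for one command run
[photon] # pfmon -e UNHALTED_CORE_CYCLES,INSTRUCTIONS_RETIRED /bin/ls
}}}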
A simple script for running an application under perfmon surveillance can look like this:

{{{
#!/bin/bash

# Enter the folder where the application resides
cd /lustre/fs1/user/boyanov/Astra

# Initialize the environment
export LD_LIBRARY_PATH=/usr/local/lib
ini openmpi

# Set up a unique filename for the output to be written to
export OUTFILE="astra-`date +%d-%m-%Y`.out"
echo $OUTFILE

# Run perfmon
pfmon --eu-counter-format --with-header --outfile=$OUTFILE --verbose --system-wide --aggregate-results --follow-all -e FP_COMP_OPS_EXE:X87,UNHALTED_CORE_CYCLES,INSTRUCTIONS_RETIRED --priv-level=u,k mpirun -n 16 ./astra_openmpi_lx64_r370_solverfix test1.in
}}}

The most interesting options here are ''--outfile'', which specifies the file where the results of the run are saved; ''--system-wide'', which specifies that not only the application executable but all processes on the system should be monitored; and ''--follow-all'', which specifies that all threads and processes spawned by the application via ''fork()'', ''pthread_create()'' and the like should also be monitored and taken into consideration.

Next comes the ''-e'' option, after which the names of all the events to be monitored, with their corresponding masks, are listed. By specifying the ''-e'' option multiple times one can define several sets of events to be monitored. This is necessary because the PMU supports monitoring of only up to four, or on rare occasions five, events at a time. When working with more than one event set, one also has to specify ''--switch-timeout=N'', where N is the number of milliseconds before switching to monitoring the next event set.

The ''--priv-level'' option specifies the privilege level at which the events are to be monitored. By default, pfmon monitors only what is going on at the user level (application level). This is true for both per-thread and system-wide mode. The number of privilege levels depends on the processor architecture. Most processors support at least two levels (user and kernel). It is also possible to monitor at several levels at the same time by specifying more than one level. On processors with only two privilege levels, the options -2 and -1 are ignored. The levels can be specified for all events or on a per-event basis. To affect all events, any combination of -k (-0), -1, -2 and -u (or -3) can be used. To set the level for each event individually, the ''--priv-levels'' option must be used.

One can specify different output formats for the counted values with the ''--eu-counter-format'' or ''--us-counter-format'' parameters to pfmon. Another useful command line parameter is ''--check-events-only'', which performs a consistency check of the event set specified for monitoring. The test answers whether or not the specified events can be monitored simultaneously (i.e. can reside in the same event set).

= Using the Perfmon utility GUI =

There is a Python-based graphical user interface for the pfmon utility. It is installed on photon.ifh.de under /opt/gpfmon-0.8. In order to start the GUI one has to invoke the following commands:

{{{
[photon] # export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib
[photon] # cd /opt/gpfmon-0.8
[photon] # ./gpfmon.py &
}}}

= Results =

In the following subsections the results for some scientific applications as well as common benchmarks are shown.
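The quantity of interest throughout is the ratio between the floating point and total retired instruction counts. A minimal sketch for extracting it from a pfmon output file follows; it assumes the two-column "count event" layout shown in the listings below and counts printed without thousands separators (with ''--eu-counter-format'', as in the December listings, the dot separators would have to be stripped first):

{{{
#!/bin/bash
# Sketch only: derive the X87 share of retired instructions from a pfmon
# counter file ($OUTFILE as produced by the script above).
awk '/FP_COMP_OPS_EXE:X87/  { fp  = $1 }
     /INSTRUCTIONS_RETIRED/ { ins = $1 }
     END { if (ins > 0) printf "X87 / instructions = %.4f\n", fp / ins }' "$OUTFILE"
}}}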
== HEP-SPEC benchmark ==

The HEPSPEC benchmark was run with the default configuration file provided by CERN for the all_cpp benchmark suite, which consists of seven different benchmarking programs, three integer and four floating point related. General information about the SPEC benchmarks can be found [[http://www.spec.org/cpu2006/|here]]. Information about the High-Energy-Physics (HEP) related SPEC benchmarks can be found [[https://twiki.cern.ch/twiki/bin/view/FIOgroup/TsiBenchHEPSPEC|here]].

=== Results May 2010 ===

 * astar benchmark
{{{
350169     FP_COMP_OPS_EXE:X87
194652389  UNHALTED_CORE_CYCLES
2353841271 INSTRUCTIONS_RETIRED
}}}
 * dealII benchmark
{{{
395564     FP_COMP_OPS_EXE:X87
413629882  UNHALTED_CORE_CYCLES
3075063716 INSTRUCTIONS_RETIRED
}}}
 * namd benchmark
{{{
350807     FP_COMP_OPS_EXE:X87
202617854  UNHALTED_CORE_CYCLES
2442684653 INSTRUCTIONS_RETIRED
}}}
 * omnetpp benchmark
{{{
362955     FP_COMP_OPS_EXE:X87
201110938  UNHALTED_CORE_CYCLES
2303420699 INSTRUCTIONS_RETIRED
}}}
 * povray benchmark
{{{
350807     FP_COMP_OPS_EXE:X87
213551069  UNHALTED_CORE_CYCLES
2550040092 INSTRUCTIONS_RETIRED
}}}
 * soplex benchmark
{{{
417840     FP_COMP_OPS_EXE:X87
615025314  UNHALTED_CORE_CYCLES
9058917954 INSTRUCTIONS_RETIRED
}}}
 * xalancbmk benchmark
{{{
353757     FP_COMP_OPS_EXE:X87
495160829  UNHALTED_CORE_CYCLES
4162076477 INSTRUCTIONS_RETIRED
}}}

=== Results December 2010 ===

 * astar benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
74.445    SIMD_INT_128:SHUFFLE_MOVE
26.011    SIMD_INT_128:UNPACK
1.427.258 ARITH:CYCLES_DIV_BUSY
78.197    ARITH:DIV
728.205   ARITH:MUL
32.416    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.075    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
35.470    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
4.220.908 FP_COMP_OPS_EXE:SSE2_INTEGER
230.099   FP_COMP_OPS_EXE:X87
}}}
 * dealII benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
72.706    SIMD_INT_128:SHUFFLE_MOVE
19.146    SIMD_INT_128:UNPACK
136.887   ARITH:CYCLES_DIV_BUSY
14.439    ARITH:DIV
656.647   ARITH:MUL
32.984    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
57.337    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
11.597    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
855.118   FP_COMP_OPS_EXE:SSE2_INTEGER
148.652   FP_COMP_OPS_EXE:X87
}}}
 * namd benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
104.143   SIMD_INT_128:SHUFFLE_MOVE
14.324    SIMD_INT_128:UNPACK
467.375   ARITH:CYCLES_DIV_BUSY
37.445    ARITH:DIV
736.157   ARITH:MUL
16.616    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
16.108    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
45.426    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
5.315.319 FP_COMP_OPS_EXE:SSE2_INTEGER
154.938   FP_COMP_OPS_EXE:X87
}}}
 * omnetpp benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
72.025    SIMD_INT_128:SHUFFLE_MOVE
9.324     SIMD_INT_128:UNPACK
86.774    ARITH:CYCLES_DIV_BUSY
19.225    ARITH:DIV
814.914   ARITH:MUL
85.397    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.332    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
16.422    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
101.918   FP_COMP_OPS_EXE:SSE2_INTEGER
146.113   FP_COMP_OPS_EXE:X87
}}}
 * povray benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
61.728    SIMD_INT_128:SHUFFLE_MOVE
13.574    SIMD_INT_128:UNPACK
78.637    ARITH:CYCLES_DIV_BUSY
22.518    ARITH:DIV
506.300   ARITH:MUL
23.701    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
33.989    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
35.484    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
100.566   FP_COMP_OPS_EXE:SSE2_INTEGER
218.570   FP_COMP_OPS_EXE:X87
}}}
 * soplex benchmark
{{{
0          SIMD_INT_64:PACK
0          SIMD_INT_64:PACKED_ARITH
0          SIMD_INT_64:PACKED_LOGICAL
0          SIMD_INT_64:PACKED_MPY
0          SIMD_INT_64:PACKED_SHIFT
0          SIMD_INT_64:SHUFFLE_MOVE
0          SIMD_INT_64:UNPACK
0          SIMD_INT_128:PACK
0          SIMD_INT_128:PACKED_ARITH
0          SIMD_INT_128:PACKED_LOGICAL
0          SIMD_INT_128:PACKED_MPY
0          SIMD_INT_128:PACKED_SHIFT
362.236    SIMD_INT_128:SHUFFLE_MOVE
4.260      SIMD_INT_128:UNPACK
277.214    ARITH:CYCLES_DIV_BUSY
21.337     ARITH:DIV
518.253    ARITH:MUL
7.507      FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.512     FP_COMP_OPS_EXE:SSE_FP
0          FP_COMP_OPS_EXE:SSE_FP_PACKED
34.237     FP_COMP_OPS_EXE:SSE_FP_SCALAR
0          FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0          FP_COMP_OPS_EXE:MMX
34.844.570 FP_COMP_OPS_EXE:SSE2_INTEGER
293.240    FP_COMP_OPS_EXE:X87
}}}
 * xalancbmk benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
224.621   SIMD_INT_128:SHUFFLE_MOVE
1.291     SIMD_INT_128:UNPACK
553.674   ARITH:CYCLES_DIV_BUSY
14.997    ARITH:DIV
484.840   ARITH:MUL
9.489     FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
6.920     FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
9.024     FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
7.555.265 FP_COMP_OPS_EXE:SSE2_INTEGER
739.540   FP_COMP_OPS_EXE:X87
}}}

== Astra space charge tracking simulation ==

Astra (A Space charge TRacking Algorithm) is a particle tracking code developed by Klaus Flöttmann at DESY. More information can be found [[http://tesla.desy.de/~lfroehli/astra/|here]].

{{{
2496490700640   FP_COMP_OPS_EXE:X87
522740864768    UNHALTED_CORE_CYCLES
174061079388628 INSTRUCTIONS_RETIRED
}}}
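As an illustration of the ratio described in the overview (a back-of-the-envelope check, not part of the original measurement), the x87 floating point share of this run can be computed directly from the counters above:

{{{
# Illustrative only: X87 operations as a fraction of retired instructions
[photon] # echo "scale=4; 2496490700640 / 174061079388628" | bc
.0143
}}}

That is, x87 floating point operations amount to roughly 1.4% of the retired instruction count in this run.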