----
= Overview =

'''''pfmon''''' is a tool for monitoring and gathering performance data during application runtime. It takes advantage of the Performance Monitoring Unit (PMU) hardware features available in modern processor microarchitectures (such as Intel's Nehalem). It can deliver very detailed information based on more than 120 events related to conditions occurring at the microarchitecture (core) level as well as at the uncore hardware level.

The main purpose of deploying the '''''pfmon''''' utility at DESY, Zeuthen is to measure the ratio of integer to floating point operations in various scientific applications. The idea behind such a measurement is to estimate how well a common benchmarking suite (for example HEPSPEC), for which this ratio is more or less stable, can be used to assess computing infrastructure for running a wide spectrum of applications. '''''pfmon''''' has very little system overhead and allows both counting and sampling of events. Perfmon documentation is available [[http://perfmon2.sourceforge.net/pfmon_usersguide.html|here]].

{i} Under test now is the 'perf' command line tool, which is available with the next release of Scientific Linux 6. More about it here - [[perf]].

= Using the Perfmon utility from command line =

Perfmon is installed only on the Westmere machine and can be used after setting up the working environment. This is done in the following manner:

{{{
[photon] # export LD_LIBRARY_PATH=/usr/local/lib
[photon] # which pfmon
}}}

The output of the last command should show you the location of the pfmon executable. The pfmon application provides an interface to the Performance Monitoring Unit (PMU) for observing, counting and sampling different hardware events. The list of all events supported on the given processor architecture can be obtained with the following command:

{{{
[photon] # pfmon -l
UNHALTED_CORE_CYCLES
INSTRUCTIONS_RETIRED
UNHALTED_REFERENCE_CYCLES
LAST_LEVEL_CACHE_REFERENCES
LAST_LEVEL_CACHE_MISSES
BRANCH_INSTRUCTIONS_RETIRED
...
}}}

A description of each event and the possible masks which can be applied to it (that is, the different circumstances under which the event will be reported) can be obtained by invoking the "pfmon -i <event>" command. For example:

{{{
[photon] # pfmon -i UNHALTED_CORE_CYCLES
Name    : UNHALTED_CORE_CYCLES
Code    : 0x3c
Counters: [ 0 1 2 3 17 ]
Desc    : count core clock cycles whenever the clock signal on the specific core is running (not halted). Alias to event CPU_CLK_UNHALTED:THREAD
PEBS    : No
Uncore  : No
}}}

There are numerous options which one can supply as parameters to '''''pfmon'''''. These are described in detail [[http://perfmon2.sourceforge.net/pfmon_usersguide.html#options|here]].
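As a quick first check, a single command can be run under pfmon directly, before moving on to the full script below. This is a minimal sketch, using the ''-e'' syntax from the script that follows; the exact output layout depends on the pfmon version:

{{{
# Count core cycles and retired instructions for one command run
[photon] # pfmon -e UNHALTED_CORE_CYCLES,INSTRUCTIONS_RETIRED /bin/ls
}}}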
A simple script for running an application under perfmon surveillance can look like this:

{{{
#!/bin/bash

# Enter the folder where the application resides
cd /lustre/fs1/user/boyanov/Astra

# Initialize the environment
export LD_LIBRARY_PATH=/usr/local/lib
ini openmpi

# Set up a unique filename for the output to be written to
export OUTFILE="astra-`date +%d-%m-%Y`.out"
echo $OUTFILE

# Run perfmon
pfmon --eu-counter-format --with-header --outfile=$OUTFILE --verbose --system-wide --aggregate-results --follow-all -e FP_COMP_OPS_EXE:X87,UNHALTED_CORE_CYCLES,INSTRUCTIONS_RETIRED --priv-level=u,k mpirun -n 16 ./astra_openmpi_lx64_r370_solverfix test1.in
}}}

The most interesting options here are ''--outfile'', which specifies the file where the results of the run are saved; ''--system-wide'', which specifies that not only the application executable but all processes on the system should be monitored; and ''--follow-all'', which specifies that all threads and processes spawned by the application via ''fork()'', ''pthread_create()'' and the like should also be monitored and taken into consideration.

Next comes the ''-e'' option, after which the names of all the events to be monitored, with their corresponding masks, are listed. By specifying the ''-e'' option multiple times one can define several sets of events to be monitored. This is necessary because the PMU supports monitoring of only up to four, or on rare occasions five, events at a time. When working with more than one event set, one also has to specify ''--switch-timeout=N'', where N is the number of milliseconds before switching to monitoring the next event set.

The ''--priv-level'' option specifies the privilege level at which the events are to be monitored. By default, pfmon monitors only what is going on at the user level (application level). This is true for both per-thread and system-wide mode. The number of privilege levels depends on the processor architecture. Most processors support at least two levels (user and kernel). It is also possible to monitor at several levels at the same time by specifying more than one level. On processors with only two privilege levels, the options -2 and -1 are ignored. The levels can be specified for all events or on a per-event basis. To affect all events, any combination of -k (-0), -1, -2 and -u (or -3) can be used. To set the level for each event individually, the ''--priv-levels'' option must be used.

One can specify different output formats for the counted values with the ''--eu-counter-format'' or ''--us-counter-format'' parameters to pfmon. Another useful command line parameter is ''--check-events-only'', which performs a consistency check of the event set specified for monitoring. The test answers whether or not the specified events can be monitored simultaneously (i.e. can reside in the same event set).

= Using the Perfmon utility GUI =

There is a Python-based graphical user interface for the pfmon utility. It is installed on photon.ifh.de under /opt/gpfmon-0.8. In order to start the GUI one has to invoke the following commands:

{{{
[photon] # export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib
[photon] # cd /opt/gpfmon-0.8
[photon] # ./gpfmon.py &
}}}

= Results =

In the following subsections the results for some scientific applications as well as common benchmarks are shown.
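The quantity of interest throughout is the ratio between the floating point and total retired instruction counts. A minimal sketch for extracting it from a pfmon output file follows; it assumes the two-column "count event" layout shown in the listings below and counts printed without thousands separators (with ''--eu-counter-format'', as in the December listings, the dot separators would have to be stripped first):

{{{
#!/bin/bash
# Sketch only: derive the X87 share of retired instructions from a pfmon
# counter file ($OUTFILE as produced by the script above).
awk '/FP_COMP_OPS_EXE:X87/  { fp  = $1 }
     /INSTRUCTIONS_RETIRED/ { ins = $1 }
     END { if (ins > 0) printf "X87 / instructions = %.4f\n", fp / ins }' "$OUTFILE"
}}}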
== HEP-SPEC benchmark ==

The HEPSPEC benchmark was run with the default configuration file provided by CERN for the all_cpp benchmark suite, which consists of seven different benchmarking programs, three integer and four floating point related. General information about the SPEC benchmarks can be found [[http://www.spec.org/cpu2006/|here]]. Information about the High-Energy-Physics (HEP) related SPEC benchmarks can be found [[https://twiki.cern.ch/twiki/bin/view/FIOgroup/TsiBenchHEPSPEC|here]].

=== Results May 2010 ===

 * astar benchmark
{{{
350169     FP_COMP_OPS_EXE:X87
194652389  UNHALTED_CORE_CYCLES
2353841271 INSTRUCTIONS_RETIRED
}}}
 * dealII benchmark
{{{
395564     FP_COMP_OPS_EXE:X87
413629882  UNHALTED_CORE_CYCLES
3075063716 INSTRUCTIONS_RETIRED
}}}
 * namd benchmark
{{{
350807     FP_COMP_OPS_EXE:X87
202617854  UNHALTED_CORE_CYCLES
2442684653 INSTRUCTIONS_RETIRED
}}}
 * omnetpp benchmark
{{{
362955     FP_COMP_OPS_EXE:X87
201110938  UNHALTED_CORE_CYCLES
2303420699 INSTRUCTIONS_RETIRED
}}}
 * povray benchmark
{{{
350807     FP_COMP_OPS_EXE:X87
213551069  UNHALTED_CORE_CYCLES
2550040092 INSTRUCTIONS_RETIRED
}}}
 * soplex benchmark
{{{
417840     FP_COMP_OPS_EXE:X87
615025314  UNHALTED_CORE_CYCLES
9058917954 INSTRUCTIONS_RETIRED
}}}
 * xalancbmk benchmark
{{{
353757     FP_COMP_OPS_EXE:X87
495160829  UNHALTED_CORE_CYCLES
4162076477 INSTRUCTIONS_RETIRED
}}}

=== Results December 2010 ===

 * astar benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
74.445    SIMD_INT_128:SHUFFLE_MOVE
26.011    SIMD_INT_128:UNPACK
1.427.258 ARITH:CYCLES_DIV_BUSY
78.197    ARITH:DIV
728.205   ARITH:MUL
32.416    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.075    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
35.470    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
4.220.908 FP_COMP_OPS_EXE:SSE2_INTEGER
230.099   FP_COMP_OPS_EXE:X87
}}}
 * dealII benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
72.706    SIMD_INT_128:SHUFFLE_MOVE
19.146    SIMD_INT_128:UNPACK
136.887   ARITH:CYCLES_DIV_BUSY
14.439    ARITH:DIV
656.647   ARITH:MUL
32.984    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
57.337    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
11.597    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
855.118   FP_COMP_OPS_EXE:SSE2_INTEGER
148.652   FP_COMP_OPS_EXE:X87
}}}
 * namd benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
104.143   SIMD_INT_128:SHUFFLE_MOVE
14.324    SIMD_INT_128:UNPACK
467.375   ARITH:CYCLES_DIV_BUSY
37.445    ARITH:DIV
736.157   ARITH:MUL
16.616    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
16.108    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
45.426    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
5.315.319 FP_COMP_OPS_EXE:SSE2_INTEGER
154.938   FP_COMP_OPS_EXE:X87
}}}
 * omnetpp benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
72.025    SIMD_INT_128:SHUFFLE_MOVE
9.324     SIMD_INT_128:UNPACK
86.774    ARITH:CYCLES_DIV_BUSY
19.225    ARITH:DIV
814.914   ARITH:MUL
85.397    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.332    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
16.422    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
101.918   FP_COMP_OPS_EXE:SSE2_INTEGER
146.113   FP_COMP_OPS_EXE:X87
}}}
 * povray benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
61.728    SIMD_INT_128:SHUFFLE_MOVE
13.574    SIMD_INT_128:UNPACK
78.637    ARITH:CYCLES_DIV_BUSY
22.518    ARITH:DIV
506.300   ARITH:MUL
23.701    FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
33.989    FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
35.484    FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
100.566   FP_COMP_OPS_EXE:SSE2_INTEGER
218.570   FP_COMP_OPS_EXE:X87
}}}
 * soplex benchmark
{{{
0          SIMD_INT_64:PACK
0          SIMD_INT_64:PACKED_ARITH
0          SIMD_INT_64:PACKED_LOGICAL
0          SIMD_INT_64:PACKED_MPY
0          SIMD_INT_64:PACKED_SHIFT
0          SIMD_INT_64:SHUFFLE_MOVE
0          SIMD_INT_64:UNPACK
0          SIMD_INT_128:PACK
0          SIMD_INT_128:PACKED_ARITH
0          SIMD_INT_128:PACKED_LOGICAL
0          SIMD_INT_128:PACKED_MPY
0          SIMD_INT_128:PACKED_SHIFT
362.236    SIMD_INT_128:SHUFFLE_MOVE
4.260      SIMD_INT_128:UNPACK
277.214    ARITH:CYCLES_DIV_BUSY
21.337     ARITH:DIV
518.253    ARITH:MUL
7.507      FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
18.512     FP_COMP_OPS_EXE:SSE_FP
0          FP_COMP_OPS_EXE:SSE_FP_PACKED
34.237     FP_COMP_OPS_EXE:SSE_FP_SCALAR
0          FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0          FP_COMP_OPS_EXE:MMX
34.844.570 FP_COMP_OPS_EXE:SSE2_INTEGER
293.240    FP_COMP_OPS_EXE:X87
}}}
 * xalancbmk benchmark
{{{
0         SIMD_INT_64:PACK
0         SIMD_INT_64:PACKED_ARITH
0         SIMD_INT_64:PACKED_LOGICAL
0         SIMD_INT_64:PACKED_MPY
0         SIMD_INT_64:PACKED_SHIFT
0         SIMD_INT_64:SHUFFLE_MOVE
0         SIMD_INT_64:UNPACK
0         SIMD_INT_128:PACK
0         SIMD_INT_128:PACKED_ARITH
0         SIMD_INT_128:PACKED_LOGICAL
0         SIMD_INT_128:PACKED_MPY
0         SIMD_INT_128:PACKED_SHIFT
224.621   SIMD_INT_128:SHUFFLE_MOVE
1.291     SIMD_INT_128:UNPACK
553.674   ARITH:CYCLES_DIV_BUSY
14.997    ARITH:DIV
484.840   ARITH:MUL
9.489     FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
6.920     FP_COMP_OPS_EXE:SSE_FP
0         FP_COMP_OPS_EXE:SSE_FP_PACKED
9.024     FP_COMP_OPS_EXE:SSE_FP_SCALAR
0         FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION
0         FP_COMP_OPS_EXE:MMX
7.555.265 FP_COMP_OPS_EXE:SSE2_INTEGER
739.540   FP_COMP_OPS_EXE:X87
}}}

== Astra space charge tracking simulation ==

Astra (A Space charge TRacking Algorithm) is a particle tracking code developed by Klaus Flöttmann at DESY. More information can be found [[http://tesla.desy.de/~lfroehli/astra/|here]].

{{{
2496490700640   FP_COMP_OPS_EXE:X87
522740864768    UNHALTED_CORE_CYCLES
174061079388628 INSTRUCTIONS_RETIRED
}}}
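As an illustration of the ratio described in the overview (a back-of-the-envelope check, not part of the original measurement), the x87 floating point share of this run can be computed directly from the counters above:

{{{
# Illustrative only: X87 operations as a fraction of retired instructions
[photon] # echo "scale=4; 2496490700640 / 174061079388628" | bc
.0143
}}}

That is, x87 floating point operations amount to roughly 1.4% of the retired instruction count in this run.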