#pragma section-numbers on
----
<<TableOfContents>>
----

= Overview of the Intel MPI Benchmarks =

The main goal of running the Intel MPI Benchmarks (IMB) is not only to assess the performance of the cluster network in general, but also to measure the influence of the MPI library implementation and of the compiler on communication performance. A typical case is the comparison of MPI library / compiler pairs, for example OpenMPI/GCC versus IntelMPI/ICC, to see which one delivers better performance (lower latency and higher aggregate bandwidth).

With one executable, all of the supported benchmarks, or a subset specified on the command line, can be run. The rules, such as time measurement (including repetitive calls of the kernels for better clock synchronization), message lengths, and the selection of communicators to run a particular benchmark (inside the group of all started processes), are program parameters.

For a clear structuring of the set of benchmarks, IMB introduces classes of benchmarks: Single Transfer, Parallel Transfer, and Collective. This classification refers to different ways of interpreting results and to the structuring of the code itself; it does not influence the way IMB is used.

= Types of benchmarks =

Each IMB benchmark kernel is based on a particular MPI function or set of functions (such as [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Sendrecv.html|MPI_Sendrecv]], [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Reduce.html|MPI_Reduce]], etc.). Because of this, the benchmarks fall naturally into three categories: single transfer, parallel transfer, and collective benchmarks.

== Single transfer benchmarks ==

The benchmarks in this class focus on a single message transferred between two processes. The class includes only the Ping-Pong and Ping-Ping tests, which essentially pass a single message from the initiating host to its peer and then wait for an answer. In the IMB interpretation, Ping-Ping measures the same as Ping-Pong, under the particular circumstance that the message is obstructed by an oncoming one (sent simultaneously by the same process that receives one's own message). Single transfer benchmarks only run with 2 active processes.

For Ping-Ping, pure timings are reported, and the throughput is related to a single message. The expected numbers are, very likely, between half and full Ping-Pong throughput. With this, Ping-Ping determines the throughput of messages under non-optimal conditions (namely, oncoming traffic).

Throughput values are given on the MBytes/sec = ''2^20^ bytes / sec'' scale, i.e. for a message of X bytes, throughput = ''X / 2^20^ * 10^6^ / time'' = ''X / (1.048576 * time)'', when time is given in μsec.
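
The Ping-Pong timing pattern and the throughput formula above can be illustrated with a small stand-alone MPI program. The following is only a sketch, not the IMB source: the repetition count {{{NREPS}}}, the message length {{{msg_bytes}}} and the plain {{{MPI_Send}}}/{{{MPI_Recv}}} calls are our own simplifications of what IMB does internally.

{{{
/* Minimal Ping-Pong sketch (not the IMB code): rank 0 sends a message of
 * msg_bytes bytes to rank 1 and waits for the echo; the round-trip time is
 * halved and converted with the formula above.
 * Run with exactly 2 active ranks, e.g. "mpirun -np 2 ./pingpong". */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int NREPS = 1000;           /* repetitions: IMB also averages over many samples */
    const int msg_bytes = 1 << 20;    /* message length X in bytes (here 1 MiB) */
    char *buf = malloc(msg_bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t_usec = (MPI_Wtime() - t0) * 1e6 / NREPS / 2.0;  /* one-way time in usec */

    if (rank == 0)
        printf("X = %d bytes, t = %.2f usec, throughput = %.2f MBytes/sec\n",
               msg_bytes, t_usec, msg_bytes / (1.048576 * t_usec));

    free(buf);
    MPI_Finalize();
    return 0;
}
}}}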

== Parallel transfer benchmarks ==

Parallel transfer benchmarks focus on global modes or patterns: the activity at a certain process runs concurrently with that of the other processes, and the benchmark measures message passing efficiency under global load. This class includes the Sendrecv and Exchange benchmarks.

The first benchmark is based on MPI_Sendrecv(): the processes form a periodic communication chain, and each process sends to the right and receives from the left neighbour in the chain. The second benchmark, Exchange, is a communication pattern that often occurs in grid splitting algorithms (boundary exchanges): the group of processes is again seen as a periodic chain, and each process exchanges data with both its left and right neighbour.

Throughput values are given on the MBytes/sec = ''2^20^ bytes / sec'' scale, i.e. throughput [Mbyte/sec] = ''( (nmsg * X [bytes]) / 2^20^ ) * (10^6^ / time)'' = ''(nmsg * X) / (1.048576 * time)'', when time is given in μsec. For the interpretation of Sendrecv and Exchange, more than 1 message (per sample) counts: the throughput numbers take the total turnover (the number of sent plus the number of received bytes) at a certain process into account. E.g., in the case of 2 processes, Sendrecv becomes the bi-directional test: perfectly bi-directional systems are rewarded with double the Ping-Pong throughput here.

== Collective benchmarks ==

This class contains all benchmarks that are collective in the MPI sense, meaning that a group of MPI message passing routines performs one-to-many and/or many-to-one communication (for example [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Gather.html|MPI_Gather]] or [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Scatter.html|MPI_Scatter]]). Not only the message passing power of the system is relevant here, but also the quality of the implementation. Raw timings and no throughput are reported. Note that certain collective benchmarks (namely the reductions) play a particular role, as they are not pure message passing tests but also depend on an efficient implementation of certain numerical operations (the performance of such compiler- and architecture-dependent operations can be assessed by means of the [[HEPSPEC]] benchmark).

Each benchmark is run with varying message lengths of X bytes, and timings are averaged over multiple samples. Among others, the following MPI functions can be benchmarked:

 * [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Reduce.html|MPI_Reduce]] - reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT, the MPI operation is MPI_SUM.
 * [[http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Reduce_scatter.html|MPI_Reduce_scatter]] - reduces a vector of length L = X/sizeof(float) float items. The MPI data type is MPI_FLOAT, the MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible: with np = #processes and L = r*np + s (s = L mod np), the process with rank i gets r+1 items when i < s and r items otherwise (see the sketch after this list).
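
The split used in the scatter phase can be made concrete with another short sketch (again ours, not the IMB source; the message length {{{X}}} is an arbitrary example value). It computes the per-rank receive counts exactly as defined above and passes them to MPI_Reduce_scatter.

{{{
/* Sketch of the MPI_Reduce_scatter split described above (illustration only).
 * L = X/sizeof(float) float items are reduced with MPI_SUM; rank i receives
 * r+1 items if i < s, else r items, where L = r*np + s and s = L mod np. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int X = 4096;                    /* message length in bytes (example value) */
    const int L = X / (int)sizeof(float);  /* number of float items to reduce */
    int rank, np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int r = L / np, s = L % np;
    int *recvcounts = malloc(np * sizeof(int));
    for (int i = 0; i < np; i++)
        recvcounts[i] = (i < s) ? r + 1 : r;    /* the "as even as possible" split */

    float *sendbuf = calloc(L, sizeof(float));
    float *recvbuf = calloc(L, sizeof(float));  /* large enough for any rank's share */

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_FLOAT, MPI_SUM,
                       MPI_COMM_WORLD);
    printf("rank %d received %d of %d items\n", rank, recvcounts[rank], L);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
    MPI_Finalize();
    return 0;
}
}}}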


= Running the benchmarks =

The selection of benchmarks and the run parameters are controlled via command line arguments:

{{{
IMB-MPI1 [-h[elp]]
         [Benchmark1 [Benchmark2 [ ... ] ] ]
         [-npmin P_min]
         [-multi Outflag]
         [-input <Input_file>]
         [-map PxQ]
}}}

The {{{-npmin}}} argument has to be an integer {{{P_min}}}, specifying the minimum number of processes to run all selected benchmarks. {{{P_min}}} may be 1, and {{{P_min > P}}} is handled as {{{P_min = P}}}. With no {{{-npmin}}} selection, the default parameters are used. Given {{{P_min}}}, the selected process numbers are {{{P_min}}}, {{{2*P_min}}}, {{{4*P_min}}}, ..., the largest {{{2^x * P_min}}} smaller than {{{P}}}, and finally {{{P}}} itself (see the sketch at the end of this page).

= Results =

At large message lengths (> 2^16^ bytes), the minimum and maximum communication timings are far lower than in the above mentioned "special" cases. For all other benchmarks, both implementation pairs (OpenMPI/GCC and IntelMPI/ICC) show similar results.
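
As an illustration of the {{{-npmin}}} selection rule (our sketch, not part of IMB), the following small program prints the process counts chosen for a given total number of processes {{{P}}} and a given {{{P_min}}}; for example, {{{P = 11}}} and {{{P_min = 3}}} gives 3, 6 and 11.

{{{
/* Sketch of the -npmin process selection described above: run on P_min,
 * 2*P_min, 4*P_min, ..., the largest 2^x * P_min smaller than P, and
 * finally on all P processes. */
#include <stdio.h>

static void selected_process_counts(int P, int P_min)
{
    if (P_min > P)          /* P_min > P is handled as P_min = P */
        P_min = P;
    for (int n = P_min; n < P; n *= 2)
        printf("%d ", n);
    printf("%d\n", P);      /* the full number of processes P is always included */
}

int main(void)
{
    selected_process_counts(11, 3);   /* prints: 3 6 11 */
    selected_process_counts(16, 4);   /* prints: 4 8 16 */
    return 0;
}
}}}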