1. Overview of the Intel MPI Benchmarks

The main goal of running the Intel MPI Benchmarks (IMB) is not only to assess the performance of the cluster network in general, but also the influence of MPI library implementations and different compilers on communication performance. One typical question is which pair of MPI library and compiler, for example OpenMPI/GCC versus IntelMPI/ICC, gives better performance (lower latency and higher aggregate bandwidth).

With one executable, all of the supported benchmarks, or a subset specified on the command line, can be run. The rules of the game, such as time measurement (including repetitive calls of the kernels for better clock synchronization), message lengths, and the selection of communicators to run a particular benchmark (inside the group of all started processes), are program parameters.

For a clear structuring of the set of benchmarks, IMB introduces classes of benchmarks: Single Transfer, Parallel Transfer, and Collective. This classification refers to different ways of interpreting results, and to a structuring of the code itself. It does not actually influence the way of using IMB.

2. Types of benchmarks

Each IMB benchmark kernel is based on a particular MPI function or set of functions (such as MPI_Sendrecv or MPI_Reduce). Based on this, the benchmarks fall naturally into three categories: single transfer, parallel transfer, and collective benchmarks.

2.1. Single transfer benchmarks

The benchmarks in this class focus on a single message transferred between two processes. The class includes only the Ping-Pong and Ping-Ping tests, which basically pass a single message from the initiating host to its peer and then wait for an answer. In the IMB interpretation, Ping-Ping measures the same as Ping-Pong, under the particular circumstance that the message is obstructed by an oncoming one, sent simultaneously by the same process that receives it. Single transfer benchmarks only run with 2 active processes. For Ping-Ping, pure timings are reported, and the throughput is related to a single message. The expected numbers lie between half and full Ping-Pong throughput. Ping-Ping thus determines the throughput of messages under non-optimal conditions (namely, oncoming traffic).

Throughput values are defined on a MBytes/sec = 2^20 bytes/sec scale, i.e. throughput = (X / 2^20) * (10^6 / time) = X / (1.048576 * time), where X is the message size in bytes and time is in μsec.
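For example, if a 4194304-byte message takes 4000 μsec per transfer, the reported throughput is 4194304 / (1.048576 * 4000) ≈ 1000 MBytes/sec.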

2.2. Parallel transfer benchmarks

Parallel transfer benchmarks focus on global communication patterns: the activity of a certain process happens concurrently with that of the other processes, and the benchmark measures message passing efficiency under global load. This class includes the Sendrecv and Exchange benchmarks. The first is based on MPI_Sendrecv(): the processes form a periodic communication chain, and each process sends to its right and receives from its left neighbor in the chain. The second benchmark, Exchange, is a communication pattern that often occurs in grid splitting algorithms (boundary exchanges): the group of processes is again seen as a periodic chain, and each process exchanges data with both its left and right neighbor.

Throughput values are again defined on a MBytes/sec = 2^20 bytes/sec scale, i.e. throughput [MBytes/sec] = ((nmsg * X [bytes]) / 2^20) * (10^6 / time) = (nmsg * X) / (1.048576 * time), where time is in μsec.
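Here nmsg counts all messages sent and received by a process per sample: nmsg = 2 for Sendrecv (one message sent, one received) and nmsg = 4 for Exchange. For example, with X = 1048576 bytes and time = 1000 μsec, the Sendrecv throughput is (2 * 1048576) / (1.048576 * 1000) = 2000 MBytes/sec.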

For the interpretation of Sendrecv and Exchange, note that more than 1 message (per sample) counts: for the throughput numbers, the total turnover (the number of sent plus the number of received bytes) at a certain process is taken into account. E.g., in the case of 2 processes, Sendrecv becomes a bi-directional test: perfectly bi-directional systems are rewarded by double the Ping-Pong throughput here.

2.3. Collective benchmarks

This class contains all benchmarks that are collective in the MPI sense, i.e. MPI message passing routines that perform one-to-many and/or many-to-one communications (for example MPI_Gather or MPI_Scatter). Not only the message passing power of the system is relevant here, but also the quality of the implementation. Raw timings and no throughput are reported. Note that certain collective benchmarks (namely the reductions) play a particular role, as they are not pure message passing tests but also depend on an efficient implementation of certain numerical operations (the performance of such compiler- and architecture-dependent operations can be assessed by means of HEPSPEC).

Each benchmark is run with varying message lengths of X bytes, and timings are averaged over multiple samples. The benchmarks that can be run are: PingPong, PingPing, Sendrecv, Exchange, Bcast, Allgather, Allgatherv, Gather, Gatherv, Scatter, Scatterv, Alltoall, Alltoallv, Reduce, Reduce_scatter and Allreduce.

3. Running IMB

This section describes how to build and run the Intel MPI Benchmarks, as well as the various parameters that can be passed on the command line.

3.1. Building IMB with different compilers and MPI versions

There are several ways in which one can compile the IMB benchmarks. These include, for example, building against OpenMPI with GCC (after initializing the environment with the ini command):

# ini openmpi
# make -f make_gcc

or against OpenMPI with the Intel compilers:

# ini openmpi_intel
# make -f make_icc
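To verify that ini switched to the intended toolchain before building, one can check which MPI compiler wrapper is picked up (a quick sanity check, not part of the official build procedure):

# which mpicc && mpicc --version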

3.2. Invoking IMB

IMB is run like a normal MPI application by invoking mpirun (or the corresponding launcher), as in the following example:

mpirun -np 128 ./IMB-MPI1 {IMB parameters}

IMB has a standard and an optional configuration. In the standard case, all parameters have fixed default values and message lengths are varied from 0, 1, 2, 4, 8, 16, ... up to 4194304 bytes. Through a command line flag, an arbitrary set of message lengths can be read from a file (flag -msglen). The minimum number of processes P_min and the maximum number P can be selected via the command line; the benchmarks then run on P_min, 2*P_min, 4*P_min, ..., the largest 2^x*P_min < P, and finally P processes.
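A minimal sketch of such a message length file, assuming one length in bytes per line and the (arbitrary) file name msglens.txt:

# cat msglens.txt
0
1024
65536
1048576
# mpirun -np 128 ./IMB-MPI1 -msglen ./msglens.txt PingPong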

IMB can be run with varying numbers of processes, but for optimal insights we prefer P = number of cores. For example, on the PAX cluster farm one can run IMB with 2 to 128 processes. This number derives from the fact that each PAX cluster consists of 16 blades, each equipped with two 4-core Xeon processors (16 * 2 * 4 = 128 cores).

In addition to that, one can specify a machine file to the mpirun command, as described in Usage of the Linux Cluster at DESY Zeuthen.

# run PingPong on 64 groups of two processes on the whole paxYY machine:
mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot --mca btl "^udapl" ./IMB-MPI1 -multi 1 PingPong

An example machinefile for running the IMB benchmarks on all 128 cores (i.e. on all 16 blades with their respective 8 slots) of one PAX blade center looks like this:

# more machinefile_all_pax0_8slot
pax00 slots=8
pax01 slots=8
pax02 slots=8
pax03 slots=8
pax04 slots=8
pax05 slots=8
pax06 slots=8
pax07 slots=8
pax08 slots=8
pax09 slots=8
pax0a slots=8
pax0b slots=8
pax0c slots=8
pax0d slots=8
pax0e slots=8
pax0f slots=8

All benchmarks will by default run on Q = [1,] 2, 4, 8, ..., largest 2^x < P, and P processes; e.g. if P = 11, then Q = [1,] 2, 4, 8, 11 processes will be selected. The Q processes driving the benchmark are called the active processes.

Finally, all the results and output from the benchmark can be written to a log file, whose filename is constructed by the following convention:

$LOG=IMBrun_$PROCCNT_$HOST_$MPIVERSION_$BENCHDESCR_$MULTI_$MAP.log

where $PROCCNT is the number of processes, $HOST the hostname of the machine where the benchmarks are started, $MPIVERSION the MPI library implementation used for the compilation of the benchmarks, $BENCHDESCR one of "single", "parallel", "collective" or "all", and $MULTI and $MAP describe two IMB command line parameters (see the next section for details on these parameters).
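For example, a 128-process run of the parallel benchmarks compiled with GCC/OpenMPI, without -multi and with -map 2x64, yields a log file such as IMBrun-128-openmpi-GCC-parallel-nomulti-2x64.log (the file used in section 5).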

3.3. IMB parameters

The following is a list of valid parameters for the IMB benchmarks:

mpirun -np P IMB-<..>
[-h[elp]]
[Benchmark1 [Benchmark2 [ ... ] ] ]
[-npmin P_min]
[-multi Outflag]
[-input <Input_file>]
[-map <P>x<Q>]

The -npmin argument has to be an integer P_min, specifying the minimum number of processes with which to run the selected benchmarks. P_min may be 1, and P_min > P is handled as P_min = P. Without a -npmin selection, the defaults are used. Given P_min, the selected process numbers are:

 { P_min, 2*P_min, 4*P_min, ..., largest 2^x*P_min < P, P }
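For example, with P = 128 and -npmin 8, the selected benchmarks run on 8, 16, 32, 64 and 128 processes.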

The -multi flag determines whether the benchmark will be run in multi mode, meaning that all P processes are divided into evenly sized communication groups, each of which executes the particular benchmark (for example Sendrecv). This not only allows testing the performance of all nodes with single transfer benchmarks simultaneously, it also helps gather information about the behaviour of parallel and collective benchmarks under global load.
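The Outflag argument controls the amount of output: with -multi 0 only the results of the worst-performing group are displayed, while -multi 1 reports every group separately. For example, the following runs PingPong on 64 process pairs at once and prints each pair's result:

mpirun -np 128 ./IMB-MPI1 -multi 1 PingPong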

The -input argument (listed as -input <Input_file> above) is used to specify a file containing the names of the benchmarks to be run; lines starting with # are treated as comments. An example of such a file is shown below:

# IMB benchmark selection file
PingPong
PingPing
#Allgatherv
Allgather
Scatter
Sendrecv
Exchange
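
Assuming the selection above is stored in a file called imb_select (the name is arbitrary), it is passed to IMB like this:

mpirun -np 128 ./IMB-MPI1 -input ./imb_select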

The -map MxQ argument is used to split the running processes into M groups of Q processes each. For example, one can run the Exchange benchmark with 128 processes, defining 16 groups of 8 processes each, by invoking:

 mpirun -np 128 ./IMB-MPI1 -map 16x8 -npmin 8 Exchange

Another example is to run the PingPong benchmark in such a manner that each process belonging to one of two 64-process partitions communicates with one peer from the other partition:

mpirun -np 128 ./IMB-MPI1 -map 64x2 PingPong

When -map MxQ is specified, the MPI ranks are grouped as shown in the following table: each row forms one group of Q processes, i.e. group i consists of the ranks i, M+i, 2M+i, ..., (Q-1)M+i.

0      M      ...   (Q-2)M      (Q-1)M
1      M+1    ...   (Q-2)M+1    (Q-1)M+1
...    ...    ...   ...         ...
M-1    2M-1   ...   (Q-1)M-1    QM-1
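
For the -map 64x2 PingPong example above (M = 64, Q = 2), group i thus consists of ranks i and i+64: rank 0 is paired with rank 64, rank 1 with rank 65, and so on, so that every pair spans the two 64-process partitions.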

4. Interactive script for running IMB Jobs

The script is located in /afs/ifh.de/group/rz/IMB/src and takes the following parameters:

# ./runIMB.sh -h

Usage:
./runIMB.sh [COMPILER] [PROC NUM] [MACHINEFILE] [MAP] [MULTI] [BENCHES]

where:
[COMPILER]    is one of 'icc' or 'gcc'
[PROC NUM]    is the number of processes to be run in the benchmark
[MACHINEFILE] specifies the machine file name describing the slot allocation in a rack
[MAP]         describes the partitioning of the processes in groups, e.g. 64x2 defines 64 groups of 2 processes
[MULTI]       is one of 'multi' or 'nomulti'
[BENCHES]     is one of 'single', 'parallel', 'collective' or 'all'

The script will compile IMB with the selected pair of compiler/MPI libraries and run the selected group of benchmarks. The results are stored in a file in the working directory. An example here is the following invocation, which will start 128 processes running the parallel benchmarks (Sendrecv and Exchange) in 32 groups of 4 processes each.

# export PROCCNT=128
# ./runIMB.sh gcc $PROCCNT ./machinefile_all_pax0_8slot 32x4 nomulti parallel
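
Following the log file naming convention from section 3.2, this run stores its results in a file along the lines of IMBrun-128-openmpi-GCC-parallel-nomulti-32x4.log in the working directory.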


5. Results

The first runs of the IMB benchmarks on the PAX cluster had the goal of determining the performance differences between two sets of compiler/MPI implementations - one being the GCC compiler with OpenMPI, the other the Intel ICC compiler with the Intel MPI Library. For benchmarking the PAX cluster at Zeuthen with IMB we chose the following parameters:

# Single transfer benchmarks
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 -multi 1 -map 64x2 PingPong PingPing
# Parallel Benchmarks (12)
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 2x64 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 4x32 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 8x16 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 16x8 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 32x4 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Sendrecv Exchange -map 64x2 -npmin 128
# Collective Benchmarks - Reduce (18)
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 2x64 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 4x32 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 8x16 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 16x8 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 32x4 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allreduce Reduce Reduce_scatter -map 64x2 -npmin 128
# Collective Benchmarks - Gather (24)
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 2x64 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 4x32 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 8x16 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 16x8 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 32x4 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Allgather Allgatherv Gather Gatherv -map 64x2 -npmin 128
# Collective Benchmarks - Scatter, Alltoall and Bcast (30)
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 2x64 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 4x32 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 8x16 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 16x8 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 32x4 -npmin 128
# mpirun -np 128 -machinefile ./machinefile_all_pax0_8slot ./IMB-MPI1 Scatter Scatterv Alltoall Alltoallv Bcast -map 64x2 -npmin 128

We ran IMB with 128 processes in all benchmarks, as this allowed us to put the whole machine under stress and to benchmark as many nodes as possible at a time. The -multi flag was used only for the single transfer benchmarks, because this way all nodes participate in the test (although not all peer-to-peer communication links can be tested).

For the parallel benchmarks we used different -map flags, because in this manner different partitioning schemes of the machine are tested, as well as communication chains in groups with an increasing number of nodes. To avoid traversing many mappings in one benchmark, we set the -npmin flag to 128. In this way only one run of the selected benchmarks with the exact mapping is made, instead of many runs with smaller mappings, as happens by default. (For example, given a mapping of 16x8 and no -npmin selection, all intermediate mappings 1x2, 1x4, 1x8, 2x8, 4x8, 8x8 and finally 16x8 will be run.)

The first two groups of benchmarks, the single and parallel transfer benchmarks, mainly measure the MPI implementations, while the collective benchmarks, which are more CPU-bound and involve many computations, also give an insight into the performance of the different compilers.

One easy way to convert the output files from IMB into semicolon-separated values (e.g. for import into a spreadsheet) is to run the following command:

cat IMBrun-128-openmpi-GCC-parallel-nomulti-2x64.log |sed -e 's/\([0-9a-zA-Z\.]*\)  */\1;/g';
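
For illustration, a typical IMB result row (message size, repetitions, t_min, t_max, t_avg; the values below are made up) is transformed as follows:

# echo "1024 1000 12.34 12.56 12.45" | sed -e 's/\([0-9a-zA-Z\.]*\)  */\1;/g'
1024;1000;12.34;12.56;12.45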

The charts at the following link give an overview of the benchmark results for the two pairs of compiler/MPI library - the first being GCC/OpenMPI and the other ICC/Intel MPI: http://www.ifh.de/~boyanov/IMB_Auswertung.html

As one can see from the results, in Ping-Pong and Ping-Ping the GCC/OpenMPI pair shows somewhat higher latency (~50 microseconds for the bigger packets). After a packet size of 2^11 = 2048 bytes the curve gets jagged but keeps increasing, which could be due to some buffer in the network interconnect getting full, or to low level (L1) cache misses. Beyond packet sizes of 2^20 bytes, bandwidth degradation can be observed for both compiler/MPI library pairs. At this packet size the peak bandwidth for both cases is also observed - ICC/Intel MPI is far better than GCC/OpenMPI, with 7900 MB/sec versus 5100 MB/sec for Ping-Pong. For Ping-Ping, saturation comes at a packet size of 2^20 for GCC/OpenMPI and 2^16 for ICC/Intel MPI, and yet again the Intel compiler and library suite achieves a better peak bandwidth of 4900 MB/sec, compared to the peak of GCC/OpenMPI (2700 MB/sec). Latencies for the Ping-Ping benchmark are about the same, although the ICC/Intel MPI pair again shows a slight advantage.

For the minimal and maximal communication timings in the Sendrecv, Exchange and Allreduce benchmarks, the results are similar for both implementations.

In the case of the Reduce benchmark, both implementations show very high communication times for very large packets (about 200000 μsec).

The results for Reduce_scatter are interesting in that both implementations show peaks at packet sizes of 131072 and 262144 bytes, as well as at 2097152 and 4194304 bytes. At packet sizes of 524288 and 1048576 bytes the timings increase almost linearly, like those of the lower packet sizes (> 2^16), and show minimum and maximum communication timings far lower than the above mentioned "special" cases.

For all other benchmarks, both implementation pairs show similar results.
