1. Overview

A general introduction to the initial GPE project can be found here.

2. GPU Hardware

The current GPU system at DESY (Zeuthen) consists of a single server with dual nVidia Tesla C2050 GPU cards. It is hosted on gpu1 and is also used as a testbed for new developments in GPU-to-GPU networking with custom designed interconnects and InfiniBand.

3. Environment

Currently the newest version of the CUDA SDK (4.0) is installed on gpu1, along with the corresponding device drivers and libraries. The Software Development Kit provides the following:

  • CUDA driver 270.35
  • CUDA Toolkit 4.0.11
  • CUDA SDK 4.0.11
  • GPU Debugging & Profiling Tools
  • GPU-Accelerated Math Libraries
  • GPU-Accelerated Performance Primitives (Thrust library)

4. GPU Benchmarks

For evaluation and development, a set of common benchmarks as well as specially designed micro-benchmarks was run on the gpu1 system.

4.1. Low-level benchmarks

Custom designed benchmarks use OpenMPI and OpenMP for task parallelization and allocation on the host CPUs and evaluate the performance metrics discussed below. For allocating benchmark threads to particular processor sockets and cores, specific options to mpirun were used. These include the OpenMPI processor/memory affinity options described below:

mpirun --mca mpi_paffinity_alone 1 --mca rmaps_rank_file_path rank.cfg -np 4 executable

The option mpi_paffinity_alone=1 enables processor (and potentially memory) affinity. The options shown can also be defined via environment variables:

export OMPI_MCA_mpi_paffinity_alone=1
export OMPI_MCA_rmaps_rank_file_path=rank.cfg
  • Sample rank files for different cores/processes/thread configurations are shown below:

#1: 1x2 cores / 2x processes / 1x thread per process
rank 0=znpnb90 slot=0
rank 1=znpnb90 slot=1

#2: 1x4 cores / 2x processes / 2x threads per process
rank 0=gpu1 slot=0-1
rank 1=gpu1 slot=2-3

#3: 2x4 cores / 2x processes / 4x threads per process
rank 0=gpu1 slot=0:*
rank 1=gpu1 slot=1:*

Using this setup, a couple of benchmarks were implemented and run. These measure the bandwidth of host-to-device memory transactions, as well as the latencies of memory transactions for the case where two GPUs work simultaneously.

  1. Memory bandwidth for unpinned memory and synchronous / asynchronous transfers
    • Here the bandwidth of host-to-device memory copy operations is measured. The host memory areas used are allocated with common malloc() calls and are not pinned to physical page addresses - thus they are subject to page swapping. Memory is pinned, on the other hand, when it is allocated via the cudaHostAlloc() call. To differentiate between synchronous and asynchronous transfers we used cudaMemcpy and cudaMemcpyAsync. The effects of both transfer type and memory pinning are described here.
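This kind of measurement can be sketched in a few lines of CUDA C. The sketch below is illustrative only (the 64 MiB transfer size and the use of CUDA events for timing are our assumptions, and error checking is omitted); it contrasts a pageable/synchronous copy with a pinned/asynchronous one:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t size = 64 << 20;            /* 64 MiB transfer (hypothetical) */
    void *unpinned = malloc(size);           /* pageable host memory */
    void *pinned, *dev;
    cudaHostAlloc(&pinned, size, cudaHostAllocDefault);  /* page-locked */
    cudaMalloc(&dev, size);

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* pageable memory + synchronous copy */
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, unpinned, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable/sync: %.2f GB/s\n", size / ms / 1e6);

    /* pinned memory + asynchronous copy (completion enforced by the stop event) */
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(dev, pinned, size, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned/async:  %.2f GB/s\n", size / ms / 1e6);

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(unpinned);
    return 0;
}
```

Note that for the asynchronous copy the host must explicitly synchronize before reading the timer, since cudaMemcpyAsync returns before the transfer completes.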

  2. Latency of host-to-GPU memory copy operations for multiple GPUs
    • Here the latency of host-to-device memory copy operations is measured. This time, however, the host memory regions are pinned to physical addresses and only asynchronous memory transfer is used. The difference in the setup is that both GPUs run the benchmark simultaneously, and we differentiate between two configurations - parallel (the host process running on CPU socket 0 uses GPU 0, and the process running on CPU socket 1 uses GPU 1) and cross (the process on CPU socket 0 uses GPU 1 and vice versa).
    • Latency of host-to-GPU memory copy operations for the parallel configuration: Measurement with RDTSC for process with rank 0 and two GPUs working "parallel"
    • Latency of host-to-GPU memory copy operations for the cross configuration: Measurement with RDTSC for process with rank 0 and two GPUs working "crossed"
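The parallel vs. cross device mapping can be expressed compactly with cudaSetDevice. The following is only a sketch of the idea (the command-line switch for selecting the configuration is hypothetical); each MPI rank is assumed to be pinned to one CPU socket via the rank file:

```c
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    int cross = (argc > 1);          /* any argument selects the "cross" setup */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* parallel: socket 0 -> GPU 0, socket 1 -> GPU 1
       cross:    socket 0 -> GPU 1, socket 1 -> GPU 0 */
    cudaSetDevice(cross ? 1 - rank : rank);

    /* ... pinned allocation and cudaMemcpyAsync latency loop as above ... */

    MPI_Finalize();
    return 0;
}
```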

  3. Bandwidth and latency of GPU-to-GPU communication
    1. mpirun options and rankfiles
    2. MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives
  4. GPU-to-InfiniBand hardware datapath propagation delay
    1. perftest measurements
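For the GPU-to-GPU comparison above, the CUDA 4.0 alternative to staging data through host memory and MPI send/recv is the peer-to-peer copy API. A minimal sketch (the 16 MiB buffer size is a hypothetical choice; error checking is omitted) could look like:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t size = 16 << 20;    /* 16 MiB, hypothetical */
    int can01;
    void *buf0, *buf1;

    /* check whether a direct P2P path between GPU 0 and GPU 1 exists */
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    if (!can01) {
        printf("no P2P path between GPU 0 and GPU 1\n");
        return 1;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&buf0, size);
    cudaSetDevice(1);
    cudaMalloc(&buf1, size);

    /* direct device-to-device copy, no staging through host memory */
    cudaMemcpyPeer(buf1, 1, buf0, 0, size);
    cudaDeviceSynchronize();
    return 0;
}
```

Timing such a copy against the corresponding MPI send/recv between two host processes gives the comparison discussed in item 3.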

5. GPU Applications

Recent applications utilizing the gpu1 system at DESY (Zeuthen) are Chroma-based LQCD numerical simulations and applications from the field of Astro Particle Physics.

5.1. Application-level benchmarks

To ensure that performance is consistent with real-world applications, the Scalable HeterOgeneous Computing (SHOC) benchmark suite was also run on the gpu1 system. This suite provides benchmark results not only for CUDA implementations of key algorithms but also for corresponding implementations in the more general OpenCL parallel programming framework.

5.2. Debugger and Profiler Tools

  • Compute Visual profiler
  • CUDA Debugger

6. GPUDirect

The term GPUDirect refers to a mechanism which allows different drivers to share pinned memory pages. This is realised by a (minor) modification of the Linux kernel and a change in the network device drivers (i.e. the IB device driver). The pages are pinned via the CUDA driver, and the drivers that want to use these pages have to implement a mechanism which allows them to be notified about any change.

  • Kernel patch - The required patch has not made it into the current SL6 kernel, i.e. the kernel needs to be patched. Unfortunately the patch provided by NVIDIA needs to be changed in order to match the current SL6 version. This is doable, but of course means that the performance can only be verified on a (temporary) experimental setup and cannot make it into the standard deployment channel.
  • Using GPUDirect - programs just have to allocate memory using cudaMallocHost (instead of, e.g., malloc) and use the pointer for MPI send/recv functions. NVIDIA provides this example TODO.
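The allocation pattern described above can be sketched as follows. This is only an illustration of the idea, not NVIDIA's example (the buffer size and the two-rank send/recv pattern are our assumptions):

```c
#include <cuda_runtime.h>
#include <mpi.h>

#define N (1 << 20)   /* hypothetical element count */

int main(int argc, char **argv) {
    int rank;
    float *host;      /* CUDA-pinned host buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* cudaMallocHost instead of malloc: with GPUDirect v1 the IB driver
       can reuse these pinned pages, avoiding an extra host-side copy */
    cudaMallocHost((void **)&host, N * sizeof(float));

    if (rank == 0)
        MPI_Send(host, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(host, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFreeHost(host);
    MPI_Finalize();
    return 0;
}
```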

7. Monitoring

  • gpu1 in Nagios
  • GPU cores temperature
  • host and device free and/or used memory
  • GPU core frequency
  • GPU cores utilization/load
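Assuming the NVML library shipped with recent NVIDIA drivers is available on gpu1, the four metrics listed above can be read programmatically. A minimal sketch (device index 0, error checking omitted):

```c
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    unsigned int temp, clock;
    nvmlUtilization_t util;
    nvmlMemory_t mem;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp); /* core temp */
    nvmlDeviceGetMemoryInfo(dev, &mem);                         /* used/free */
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &clock);         /* core freq */
    nvmlDeviceGetUtilizationRates(dev, &util);                  /* load      */

    printf("temp: %u C, mem: %llu/%llu B, SM clock: %u MHz, load: %u%%\n",
           temp, mem.used, mem.total, clock, util.gpu);

    nvmlShutdown();
    return 0;
}
```

A small wrapper around such a program could feed these values into the Nagios checks mentioned above.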


NOTE: Sections marked with /!\ need further discussion

GPU (last edited 2017-05-15 09:10:53 by GötzWaschk)