1. Overview
A general introduction to the initial GPE project can be found here.
2. GPU Hardware
The current GPU system at DESY (Zeuthen) consists of a single server, gpu1, with two NVIDIA Tesla C2050 GPU cards. It is also used as a testbed for new developments in GPU-to-GPU networking with custom-designed interconnects and InfiniBand.
3. Environment
Currently the newest version of the CUDA SDK (4.0) is installed on gpu1, together with the corresponding device drivers and libraries. The Software Development Kit provides the following:
- CUDA C/C++ Compiler
- GPU Debugging & Profiling Tools
- GPU-Accelerated Math Libraries
- GPU-Accelerated Performance Primitives (NPP) and the Thrust template library
- GPUDirect (under test)
4. GPU Benchmarks
For evaluation and development, a set of common benchmarks as well as specially designed micro-benchmarks was run on the gpu1 system.
4.1. Low-level benchmarks
The custom-designed benchmarks use Open MPI and OpenMP for task parallelization and allocation on the host CPUs, and they evaluate the following performance metrics:
- Memory Bandwidth for unpinned memory and synchronous / asynchronous transfers
Here the bandwidth of host-to-device memory copy operations is measured. The host memory areas used are allocated with common malloc() calls and are not pinned to physical page addresses, so they are subject to page swapping. Memory is pinned, on the other hand, when it is allocated via the cudaHostAlloc() call. To differentiate between synchronous and asynchronous transfers we used cudaMemcpy() and cudaMemcpyAsync(). The effects of both transfer type and memory pinning are described here; a small code sketch of such a measurement is given after this list.
- Latency of host-to-GPU memory copy operations for multiple GPUs
Here the latency of host-to-device memory copy operations is measured. This time, however, the memory regions of the host memory are pinned to physical addresses and only asynchronous memory transfer is used. The difference in the setup is that both GPUs run the benchmark simultaneously, and we differentiate between two configurations: parallel (the host process running on CPU socket 0 uses GPU 0, and the process running on CPU socket 1 uses GPU 1) and cross (the process on CPU socket 0 uses GPU 1 and vice versa). A sketch of this rank-to-GPU mapping is also given after this list.
- Latency of host-to-GPU memory copy operations for the parallel configuration
- Latency of host-to-GPU memory copy operations for the cross configuration
- Bandwidth and latency of GPU-to-GPU communication
- mpirun options and rankfiles
- MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives (see the sketch after this list)
- GPU-to-InfiniBand hardware datapath propagation delay
- perftest measurements
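
The following is a minimal sketch of how the pinned/unpinned and synchronous/asynchronous bandwidth measurement can be set up; it is not the actual GPE benchmark code. The 64 MiB buffer size is an arbitrary choice for the example, and timing is done here with CUDA events for brevity.

/* Hypothetical micro-benchmark sketch: pageable vs. pinned host memory and
 * cudaMemcpy() vs. cudaMemcpyAsync() for host-to-device copies. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static float time_h2d(void *dst, const void *src, size_t bytes, int async)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    if (async)
        cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, 0);
    else
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* also waits for the async copy to finish */
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t bytes = 64UL << 20;          /* 64 MiB test buffer */
    void *pageable = malloc(bytes);           /* unpinned, subject to paging */
    void *pinned, *dev;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);  /* page-locked host memory */
    cudaMalloc(&dev, bytes);

    printf("pageable, sync : %.3f ms\n", time_h2d(dev, pageable, bytes, 0));
    printf("pageable, async: %.3f ms\n", time_h2d(dev, pageable, bytes, 1));
    printf("pinned,   sync : %.3f ms\n", time_h2d(dev, pinned,   bytes, 0));
    printf("pinned,   async: %.3f ms\n", time_h2d(dev, pinned,   bytes, 1));

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}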
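
The multi-GPU setup and the CUDA 4.0 peer-to-peer primitives can be illustrated with the following hypothetical sketch, again not the GPE benchmark itself. Each of two MPI ranks selects a GPU according to the parallel or cross mapping (switched here by a command-line argument introduced only for this example) and then performs a direct GPU-to-GPU copy via cudaDeviceEnablePeerAccess() and cudaMemcpyPeer(); the MPI send/recv alternative would instead stage the data through host buffers.

/* Hypothetical sketch: rank-to-GPU mapping (parallel vs. cross) plus a
 * CUDA 4.0 peer-to-peer copy. Assumes exactly two ranks and two GPUs. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, cross;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cross = (argc > 1);             /* any argument selects the "cross" mapping */

    /* parallel: rank 0 -> GPU 0, rank 1 -> GPU 1; cross: the opposite */
    int mydev   = cross ? 1 - rank : rank;
    int peerdev = 1 - mydev;
    cudaSetDevice(mydev);

    int can_access = 0;             /* direct P2P needs a common PCIe root complex */
    cudaDeviceCanAccessPeer(&can_access, mydev, peerdev);
    if (can_access) {
        cudaDeviceEnablePeerAccess(peerdev, 0);

        size_t bytes = 32UL << 20;
        void *src, *dst;
        cudaMalloc(&src, bytes);
        cudaSetDevice(peerdev);
        cudaMalloc(&dst, bytes);
        cudaSetDevice(mydev);

        /* direct device-to-device copy, no staging through host memory */
        cudaMemcpyPeer(dst, peerdev, src, mydev, bytes);
        cudaDeviceSynchronize();
        printf("rank %d: P2P copy GPU %d -> GPU %d done\n", rank, mydev, peerdev);
    } else {
        printf("rank %d: no direct P2P path between GPU %d and GPU %d\n",
               rank, mydev, peerdev);
    }

    MPI_Finalize();
    return 0;
}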
5. GPU Applications
Recent applications utilizing the gpu1 system at DESY (Zeuthen) are Chroma-based LQCD numerical simulations and applications from the field of astroparticle physics.
5.1. Application-level benchmarks
To ensure consistency of performance with real-world applications, the Scalable HeterOgeneous Computing (SHOC) benchmark suite was also run on the gpu1 system. This suite provides benchmark results not only for CUDA implementations of key algorithms but also for corresponding implementations in the more general OpenCL parallel programming framework.
5.2. Debugger and Profiler Tools
- Compute Visual Profiler
- CUDA Debugger
6. Monitoring
- gpu1 in Nagios
- GPU cores temperature
- host and device free and/or used memory
- GPU core frequency
- GPU cores utilization/load
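
The metrics above can be queried, for example, through the NVML library (libnvidia-ml) shipped with the NVIDIA driver. The following is a minimal sketch under that assumption, not the monitoring code actually deployed for Nagios; it prints core temperature, used/free device memory, SM clock and utilization for every GPU and is linked with -lnvidia-ml.

/* Hypothetical NVML-based monitoring sketch for the metrics listed above. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int i, count;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);

    for (i = 0; i < count; i++) {
        nvmlDevice_t dev;
        unsigned int temp = 0, sm_clock = 0;
        nvmlMemory_t mem;
        nvmlUtilization_t util;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);  /* core temperature */
        nvmlDeviceGetMemoryInfo(dev, &mem);                          /* used/free device memory */
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock);       /* SM (core) frequency */
        nvmlDeviceGetUtilizationRates(dev, &util);                   /* GPU load */

        printf("GPU %u: %u C, mem used %llu / free %llu bytes, SM clock %u MHz, util %u%%\n",
               i, temp,
               (unsigned long long)mem.used, (unsigned long long)mem.free,
               sm_clock, util.gpu);
    }

    nvmlShutdown();
    return 0;
}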
Sections marked accordingly need further discussion.