Contents
1. Overview
A general introduction to the initial GPE project can be found at http://hpc.desy.de/gpe/.
2. GPU Hardware
The current GPU system at DESY (Zeuthen) consists of a single server (http://www.supermicro.com/products/system/4U/7046/SYS-7046GT-TRF.cfm?GPU=FC4) with two nVidia Tesla C2050 GPU cards. It is hosted as gpu1 and is also used as a testbed for new developments in GPU-to-GPU networking with custom-designed interconnects and InfiniBand.
3. Environment
Currently the newest version of the CUDA SDK, 4.0 (http://developer.nvidia.com/cuda-toolkit-40), is installed on gpu1, along with the matching device drivers and libraries. The Software Development Kit provides the following:
- CUDA C/C++ Compiler
- GPU Debugging & Profiling Tools
- GPU-Accelerated Math Libraries
- GPU-Accelerated Performance Primitives (Thrust library)
- GPU-direct (under tests)
4. GPU Benchmarks
For evaluation and development, a set of common benchmarks, as well as specially designed micro benchmarks, was run on the gpu1 system.
4.1. Low-level benchmarks
Custom-designed benchmarks use OpenMPI and OpenMP for task parallelization and allocation on the host CPUs and evaluate the following performance metrics:
- Memory bandwidth for unpinned memory
- Bandwidth of host-to-GPU memory copy operations (for one and two devices, in parallel and cross configurations)
- Fit of the transfer times measured with gettimeofday() for all ranks, with two GPUs working in parallel (tcpu/tcuda plots for ranks 0 and 1 under http://www.ifh.de/~boyanov/GPEfigures/)
- Bandwidth of GPU-to-host memory copy operations (for one and two devices, in parallel and cross configurations)
- Bandwidth and latency of GPU-to-GPU communication
- mpirun options and rankfiles
- MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives
- GPU-to-InfiniBand hardware datapath propagation delay
- perftest measurements
5. GPU Applications
/!\ Recent applications utilizing the gpu1 system at DESY (Zeuthen) are Chroma-based (http://usqcd.jlab.org/usqcd-docs/chroma/) LQCD numerical simulations and applications from the field of astroparticle physics.
5.1. Application-level benchmarks
To ensure that the performance is consistent with real-world applications, the Scalable HeterOgeneous Computing (SHOC, http://ft.ornl.gov/doku/shoc/start) benchmark suite was also run on the gpu1 system. This suite provides not only CUDA implementations of key algorithms but also corresponding implementations in the more general OpenCL parallel programming framework.
5.2. Debugger and Profiler Tools
6. Monitoring
/!\
- gpu1 in Nagios
- GPU core temperatures
- host and device free and/or used memory
- GPU core frequency
- GPU core utilization/load
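Integrating gpu1 into Nagios could look roughly like the following object definitions. This is only a sketch: the plugin name check_nvidia_smi, its options, and the temperature thresholds are hypothetical placeholders for whatever wrapper around nvidia-smi (or NVML) would actually be deployed.

```
# Sketch of Nagios objects for gpu1 -- plugin name, options, and
# thresholds are hypothetical placeholders, not an existing deployment.
define command {
    command_name  check_gpu_temp
    command_line  $USER1$/check_nvidia_smi --metric temperature -w 85 -c 95
}

define service {
    use                  generic-service
    host_name            gpu1
    service_description  GPU temperature
    check_command        check_gpu_temp
}
```

Analogous command/service pairs would cover the remaining metrics (memory, frequency, utilization/load).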
Sections marked with /!\ need further discussion.