Differences between revisions 22 and 44 (spanning 22 versions)
Revision 22 as of 2011-06-06 10:52:51
Size: 5209
Comment:
Revision 44 as of 2017-05-15 09:10:53
Size: 10381
Editor: GötzWaschk
Comment: CUDA 8.0
Deletions are marked like this. Additions are marked like this.
Line 4: Line 4:
Line 5: Line 6:
Line 10: Line 10:

Line 13: Line 11:
The current GPU system at DESY (Zeuthen) consists of a single [[http://www.supermicro.com/products/system/4U/7046/SYS-7046GT-TRF.cfm?GPU=FC4|server]] with dual nVidia Tesla C2050 GPU cards. It is hosted on gpu1 and is also used as a testbed for new developments in GPU-to-GPU networking with custom designed interconnects and InfiniBand.

The current GPU system at DESY (Zeuthen) for the GPE project consists of two [[http://www.supermicro.com/products/system/4U/7046/SYS-7046GT-TRF.cfm?GPU=FC4|servers]] with dual nVidia Tesla C2050 GPU cards. They are named gpu1 and gpu2 and are also used as a testbed for new developments in GPU-to-GPU networking with custom designed interconnects and InfiniBand.
Line 18: Line 14:
Currently on the system the newest version of the [[http://developer.nvidia.com/cuda-toolkit-40|CUDA SDK 4.0]] is installed on gpu1, along with device drivers and libraries. The Software Development Kit provides the following:
Currently on the systems the newest version of the CUDA SDK 8.0 is installed, along with device drivers and libraries. The Software Development Kit provides the following:
Line 20: Line 16:
 * CUDA C/C++ Compiler
 * CUDA driver 375.66
 * CUDA SDK 8.0
Line 24: Line 21:
 * GPU-direct (under tests)
Line 28: Line 23:
For evaluation and development, a set of common benchmarks as well as specially designed micro benchmarks was run on the gpu1 system.
For evaluation and development, a set of common benchmarks as well as specially designed micro benchmarks was run on the gpu1/2 systems.
Line 30: Line 25:
== Custom low-level benchmarks ==
Custom designed benchmarks use OpenMPI and OpenMP for task parallelization and allocation on the host CPUs and evaluate the performance metrics discussed below. For allocating benchmark threads to particular processor sockets and cores, specific options to mpirun were used. These include the [[http://www.open-mpi.org/faq/?category=tuning#using-paffinity|OpenMPI processor/memory affinity options]] described below:
Line 31: Line 28:
{{{
mpirun --mca mpi_paffinity_alone 1 --mca rmaps_rank_file_path rank.cfg -np 4 executable
}}}
The option mpi_paffinity_alone=1 enables processor (and potentially memory) affinity. The options shown can also be defined via environment variables:
Line 32: Line 33:
== Low-level benchmarks ==
Custom designed benchmarks use OpenMPI and OpenMP for task parallelization and allocation on the host CPUs and evaluate the following performance metrics:
{{{
export OMPI_MCA_mpi_paffinity_alone=1
export OMPI_MCA_rmaps_rank_file_path=rank.cfg
}}}
 . Sample rank files for different cores/processes/thread configurations are shown below:
Line 35: Line 39:
 * Memory Bandwidth for unpinned memory and synchronous / asynchronous transfers
   Here the bandwidth of host to device memory copy operations is measured. The host memory areas used are allocated with common malloc() calls and are not pinned to physical page addresses - thus they are subject to page swapping. On the other hand, memory is pinned when allocating it via the [[http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_g217d441a73d9304c6f0ccc22ec307dba.html|cudaHostAlloc()]] call. For differentiating between synchronous and asynchronous transfers we used [[http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html#g48efa06b81cc031b2aa6fdc2e9930741|cudaMemcpy]] and [[http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_ge4366f68c6fa8c85141448f187d2aa13.html|cudaMemcpyAsync]]. The effects of both transfer type and memory pinning are described [[http://www.ifh.de/~boyanov/GPE/notes-gpu.pdf|here]].
{{{
#1: 1x2 cores / 2x processes / 1x thread per process
rank 0=znpnb90 slot=0
rank 1=znpnb90 slot=1
}}}
{{{
#2: 1x4 cores / 2x processes / 2x threads per process
rank 0=gpu1 slot=0-1
rank 1=gpu1 slot=2-3
}}}
{{{
#3: 2x4 cores / 2x processes / 4x threads per process
rank 0=gpu1 slot=0:*
rank 1=gpu1 slot=1:*
}}}
Using this setup, a number of benchmarks were implemented and run. These measure the bandwidth of host to device memory transactions, as well as the latencies of memory transactions for the case where two GPUs work simultaneously.
Line 38: Line 56:
 1. Memory Bandwidth for unpinned memory and synchronous / asynchronous transfers
  . Here the bandwidth of host to device memory copy operations is measured. The host memory areas used are allocated with common malloc() calls and are not pinned to physical page addresses - thus they are subject to page swapping. On the other hand, memory is pinned when allocating it via the [[http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_g217d441a73d9304c6f0ccc22ec307dba.html|cudaHostAlloc()]] call. For differentiating between synchronous and asynchronous transfers we used [[http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html#g48efa06b81cc031b2aa6fdc2e9930741|cudaMemcpy]] and [[http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_ge4366f68c6fa8c85141448f187d2aa13.html|cudaMemcpyAsync]]. The effects of both transfer type and memory pinning are described [[http://www.ifh.de/~boyanov/GPE/notes-gpu.pdf|here]].
Line 39: Line 59:
 1. Latency of host-to-GPU memory copy operations for multiple GPUs
  . Here the latency for host to device memory copy operations is measured. This time, however, the memory regions of the host memory are pinned to physical addresses and only asynchronous memory transfers are used. The difference in the setup here is that both GPUs run the benchmark simultaneously, and we differentiate between two configurations - parallel (host process running on CPU socket 0 uses GPU 0, and process running on CPU socket 1 uses GPU 1) and cross (process on CPU socket 0 uses GPU 1 and vice versa).
  * Latency of host-to-GPU memory copy operations for the parallel configuration {{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-parallel-stream-ALL-rank-0-tcuda.png|Measurement with RDTSC for process with rank 0 and two GPUs working "parallel"|width="580"}}
Line 40: Line 63:
 * Bandwidth of host-to-GPU memory copy operations (for one and two devices configured in parallel)
  * Latency of host-to-GPU memory copy operations for the cross configuration {{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-cross-stream-ALL-rank-0-tcuda.png|Measurement with RDTSC for process with rank 0 and two GPUs working "crossed"|width="580"}}
Line 42: Line 65:
   Here again the bandwidth for host to device memory copy operations is measured. This time however the memory regions of the host memory are pinned to physical addresses via using the
 1. Bandwidth and latency of GPU-to-GPU communication
  1. MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives
Line 44: Line 68:
||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-parallel-stream-ALL-rank-1-tcpu.png|Measurement with gettimeofday() for process with rank 0 and two GPUs working in "parallel"|width=380}}||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-parallel-stream-ALL-rank-1-tcuda.png|Measurement with RDTSC for process with rank 0 and two GPUs working in "parallel"|width=380}}||
||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-parallel-stream-ALL-rank-0-tcpu.png|Measurement with RDTSC for process with rank 0 and two GPUs working "crossed"|width=380}}||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-parallel-stream-ALL-rank-0-tcuda.png|Measurement with RDTSC for process with rank 0 and two GPUs working "crossed"|width=380}}||
 1. GPU-to-InfiniBand hardware datapath propagation delay
  . For measuring the hardware datapath latency between host CPU and InfiniBand network adapter, as well as between GPU and InfiniBand network adapter, a micro benchmark was developed which utilizes the loopback capability of the InfiniBand HCA. In this mode, packets sent from a process running on the host CPU are looped back "inwards".
  . {X} For the current one-server setup, however, such measurements are not possible because of a limitation imposed by the QSFP loopback connector used: if the HCA is not directly connected to another HCA or an InfiniBand switch, the subnet manager reports an error for duplicate addresses.
Line 47: Line 72:
== Application-level benchmark suites ==
To ensure that performance is consistent with real-world applications, the Scalable HeterOgeneous Computing ([[http://ft.ornl.gov/doku/shoc/start|SHOC]]) benchmark suite was also run on the gpu1 system. This benchmark suite provides benchmark results not only for CUDA implementations of key algorithms but also for corresponding implementations in the more general OpenCL parallel programming framework.
Line 48: Line 75:
 * Bandwidth of GPU-to-host memory copy operations (for one and two devices configured in cross mode)
The SHOC benchmark suite currently contains benchmark programs categorized by complexity. Some measure low-level "feeds and speeds" behavior (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).
Line 50: Line 77:
||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-cross-stream-ALL-rank-0-tcpu.png|ALD|width=380}}||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-cross-stream-ALL-rank-0-tcuda.png|ALD|width=380}}||
||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-cross-stream-ALL-rank-1-tcpu.png|ALD|width=380}}||<45%>{{http://www.ifh.de/~boyanov/GPEfigures/GPEbench-cross-stream-ALL-rank-1-tcuda.png|ALD|width=380}}||
 * Level 0
  * Bus-Speed-Download: measures bandwidth of transferring data across the PCIe bus to a device.
  * Bus-Speed-Readback: measures bandwidth of reading data back from a device.
  * Device-Memory: measures bandwidth of memory accesses to various types of device memory including global, local, and image memories.
  * Kernel-Compile: measures compile time for several OpenCL kernels, which range in complexity
  * Max-Flops: measures maximum achievable floating point performance using a combination of auto-generated and hand coded kernels.
  * Queue-Delay: measures the overhead of using the OpenCL command queue.
Line 53: Line 85:
 * Bandwidth and latency of GPU-to-GPU communication
  * mpirun options and rankfiles
  * MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives
 * Level 1
  * FFT: forward and reverse 1D FFT.
  * MD: computation of the Lennard-Jones potential from molecular dynamics
  * Reduction: reduction operation on an array of single or double precision floating point values.
  * SGEMM: matrix-matrix multiply.
  * Scan: scan (also known as parallel prefix sum) on an array of single or double precision floating point values.
  * Sort: sorts an array of key-value pairs using a radix sort algorithm
  * Spmv: sparse matrix-vector multiplication
  * Stencil2D: a 9-point stencil operation applied to a 2D data set. In the MPI version, data is distributed across MPI processes organized in a 2D Cartesian topology, with periodic halo exchanges.
  * Triad: a version of the STREAM Triad benchmark, implemented in OpenCL and CUDA. This version includes PCIe transfer time.
Line 57: Line 97:
 * GPU-to-InfiniBand hardware datapath propagation delay
  * perftest measurements
 * Level 2
  * S3D: A computationally-intensive kernel from the S3D turbulent combustion simulation program

== GPUDirect ==
<!> The term GPUDirect refers to a mechanism which allows different drivers to share pinned memory pages. This is realised by a (minor) modification of the Linux kernel and a change in the network device drivers (i.e. the IB device driver). The pages are pinned via the CUDA driver, and drivers that want to use these pages have to implement a mechanism which allows them to be notified about any change.

 * Kernel patch - The required patch has not made it into the current SL6 kernel, i.e. the kernel needs to be patched. Unfortunately the patch provided by NVIDIA needs to be changed in order to match the current SL6 version. This is doable, but of course means that the performance can only be verified for a (temporary) experimental setup and cannot make it into the standard deployment channel.
 * Using GPUDirect - programs just have to allocate memory using cudaMallocHost (instead of, e.g., malloc) and use the pointer for MPI send/recv functions. NVIDIA provides this example TODO.
Line 62: Line 108:
/!\ Recent applications utilizing the gpu1 system at DESY (Zeuthen) are [[http://usqcd.jlab.org/usqcd-docs/chroma/|Chroma]]-based LQCD numerical simulations and applications from the field of astroparticle physics.



== Application-level benchmarks ==
To ensure that performance is consistent with real-world applications, the Scalable HeterOgeneous Computing ([[http://ft.ornl.gov/doku/shoc/start|SHOC]]) benchmark suite was also run on the gpu1 system. This benchmark suite provides benchmark results not only for CUDA implementations of key algorithms but also for corresponding implementations in the more general OpenCL parallel programming framework.
<!> Recent applications utilizing the gpu1 system at DESY (Zeuthen) are [[http://usqcd.jlab.org/usqcd-docs/chroma/|Chroma]]-based LQCD numerical simulations and applications from the field of astroparticle physics.
Line 71: Line 111:

 * Compute Visual profiler
<!>
 * Compute Visual Profiler
Line 76: Line 116:
= Monitoring =
<!>
Line 77: Line 119:
= Monitoring =
/!\
Line 80: Line 120:
  * GPU cores temperature
  * host and device free and/or used memory
  * GPU core frequency
  * GPU cores utilization/load
 * GPU cores temperature
 * host and device free and/or used memory
 * GPU core frequency
 * GPU cores utilization/load

= New GPU server =
<!>

An example specification for a second GPU server can be as follows:

    * Supermicro Barebone 7046GT-TRF, 1400 Watt
    * Supermicro X8DTG-QF Mainboard
    * 2x Intel Xeon X5667
    * Dual Intel 5520 Rev. 22 (Tylersburg) chipset
    * 48 GB DDR3-1333 with ECC
    * 2x 147GB SAS HDD
    * 2x 450GB SAS HDD
    * LSI 8704ELP SAS/S-ATA RAID Controller incl. BBU
    * 2x NVIDIA Fermi C2050 GPUs
    * Supermicro 19" Rackschienen for Barebone 7046GT-TRF
    * Mellanox MHQH19B-XTR ConnectX 2 Adapter Card, single Port QSFP, IB 40GB/s, PCI-Express x8

Optional components:

    * 2x Intel Xeon X5675 (instead of Intel X5667)
      [6 instead of 4 cores per CPU]
    * 2x NVIDIA Fermi C2070 GPUs (instead of C2050)

= Success/failure stories =

[[http://forums.nvidia.com/index.php?showtopic=197840|Peer-to-Peer GPU memory access restrictions on dual Intel IOH machines]]
Line 87: Line 154:
Sections marked with /!\ need further discussion
NOTE: Sections marked with <!> need further discussion or are work in progress



1. Overview

A general introduction to the initial GPE project can be found here.

2. GPU Hardware

The current GPU system at DESY (Zeuthen) for the GPE project consists of two servers with dual nVidia Tesla C2050 GPU cards. They are named gpu1 and gpu2 and are also used as a testbed for new developments in GPU-to-GPU networking with custom designed interconnects and InfiniBand.

3. Environment

Currently, the newest version of the CUDA SDK (8.0) is installed on the systems, along with device drivers and libraries. The Software Development Kit provides the following:

  • CUDA driver 375.66
  • CUDA SDK 8.0
  • GPU Debugging & Profiling Tools

  • GPU-Accelerated Math Libraries
  • GPU-Accelerated Performance Primitives (Thrust library)
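
A quick way to check which driver and runtime versions a program actually sees is the CUDA runtime API. The following is a minimal sketch (not part of the SDK samples), compiled with nvcc:

{{{
/* Print the CUDA driver and runtime versions seen by this binary
 * (8000 corresponds to CUDA 8.0). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int driver = 0, runtime = 0;
    cudaDriverGetVersion(&driver);
    cudaRuntimeGetVersion(&runtime);
    printf("driver API version: %d, runtime version: %d\n", driver, runtime);
    return 0;
}
}}}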

4. GPU Benchmarks

For evaluation and development, a set of common benchmarks as well as specially designed micro benchmarks was run on the gpu1/2 systems.

4.1. Custom low-level benchmarks

Custom designed benchmarks use OpenMPI and OpenMP for task parallelization and allocation on the host CPUs and evaluate the performance metrics discussed below. For allocating benchmark threads to particular processor sockets and cores, specific options to mpirun were used. These include the OpenMPI processor/memory affinity options described below:

mpirun --mca mpi_paffinity_alone 1 --mca rmaps_rank_file_path rank.cfg -np 4 executable

The option mpi_paffinity_alone=1 enables processor (and potentially memory) affinity. The options shown can also be defined via environment variables:

export OMPI_MCA_mpi_paffinity_alone=1
export OMPI_MCA_rmaps_rank_file_path=rank.cfg
  • Sample rank files for different cores/processes/thread configurations are shown below:

#1: 1x2 cores / 2x processes / 1x thread per process
rank 0=znpnb90 slot=0
rank 1=znpnb90 slot=1

#2: 1x4 cores / 2x processes / 2x threads per process
rank 0=gpu1 slot=0-1
rank 1=gpu1 slot=2-3

#3: 2x4 cores / 2x processes / 4x threads per process
rank 0=gpu1 slot=0:*
rank 1=gpu1 slot=1:*

Using this setup, a number of benchmarks were implemented and run. These measure the bandwidth of host to device memory transactions, as well as the latencies of memory transactions for the case where two GPUs work simultaneously.

  1. Memory Bandwidth for unpinned memory and synchronous / asynchronous transfers
    • Here the bandwidth of host to device memory copy operations is measured. The host memory areas used are allocated with common malloc() calls and are not pinned to physical page addresses - thus they are subject to page swapping. On the other hand, memory is pinned when allocating it via the cudaHostAlloc() call. For differentiating between synchronous and asynchronous transfers we used cudaMemcpy and cudaMemcpyAsync. The effects of both transfer type and memory pinning are described here. A minimal sketch of these calls is shown after this list.

  2. Latency of host-to-GPU memory copy operations for multiple GPUs
    • Here the latency for host to device memory copy operations is measured. This time, however, the memory regions of the host memory are pinned to physical addresses and only asynchronous memory transfers are used. The difference in the setup here is that both GPUs run the benchmark simultaneously, and we differentiate between two configurations - parallel (host process running on CPU socket 0 uses GPU 0, and process running on CPU socket 1 uses GPU 1) and cross (process on CPU socket 0 uses GPU 1 and vice versa).
    • Latency of host-to-GPU memory copy operations for the parallel configuration (figure: Measurement with RDTSC for process with rank 0 and two GPUs working "parallel")

    • Latency of host-to-GPU memory copy operations for the cross configuration (figure: Measurement with RDTSC for process with rank 0 and two GPUs working "crossed")

  3. Bandwidth and latency of GPU-to-GPU communication
    1. MPI send/recv vs. CUDA 4.0 peer-to-peer communication primitives
  4. GPU-to-InfiniBand hardware datapath propagation delay

    • For measuring the hardware datapath latency between host CPU and InfiniBand network adapter, as well as between GPU and InfiniBand network adapter, a micro benchmark was developed which utilizes the loopback capability of the InfiniBand HCA. In this mode, packets sent from a process running on the host CPU are looped back "inwards".

    • {X} For the current one-server setup, however, such measurements are not possible because of a limitation imposed by the QSFP loopback connector used: if the HCA is not directly connected to another HCA or an InfiniBand switch, the subnet manager reports an error for duplicate addresses.
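
The listing below is a minimal, self-contained sketch of the CUDA calls involved in the first three measurements above; it is not the actual GPE benchmark code, and the MPI process placement and socket binding from the rank files are omitted. A host-to-device copy is timed for pageable (malloc) versus pinned (cudaHostAlloc) memory and for cudaMemcpy versus cudaMemcpyAsync, followed by a peer-to-peer copy between two devices if a second GPU is present. Buffer size and the use of CUDA events for timing are arbitrary choices for illustration.

{{{
/* Sketch only: times host-to-device copies for pageable vs. pinned memory and
 * synchronous vs. asynchronous transfers, then performs a device-to-device
 * peer copy if a second GPU is available. Compile with nvcc. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); \
    exit(1); } } while (0)

/* Time one host-to-device copy with CUDA events recorded on the stream the
 * copy is issued to. */
static float time_h2d(void *dst, const void *src, size_t bytes,
                      int async, cudaStream_t stream)
{
    cudaStream_t s = async ? stream : 0;
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, s));
    if (async)   /* only pinned memory gives truly asynchronous transfers */
        CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
    else
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, s));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main(void)
{
    const size_t bytes = 64UL << 20;           /* 64 MiB test buffer */
    void *pageable = malloc(bytes);            /* unpinned host memory */
    void *pinned, *dev0;
    cudaStream_t stream;
    int ndev = 0, can01 = 0;

    CHECK(cudaSetDevice(0));
    CHECK(cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault)); /* page-locked */
    CHECK(cudaMalloc(&dev0, bytes));
    CHECK(cudaStreamCreate(&stream));

    printf("pageable, sync : %.3f ms\n", time_h2d(dev0, pageable, bytes, 0, stream));
    printf("pinned,   sync : %.3f ms\n", time_h2d(dev0, pinned,   bytes, 0, stream));
    printf("pinned,   async: %.3f ms\n", time_h2d(dev0, pinned,   bytes, 1, stream));

    /* GPU-to-GPU copy; cf. the peer-to-peer access restrictions noted under
     * "Success/failure stories" below. */
    CHECK(cudaGetDeviceCount(&ndev));
    if (ndev > 1) {
        void *dev1;
        CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
        if (can01)
            CHECK(cudaDeviceEnablePeerAccess(1, 0)); /* device 0 may access device 1 */
        CHECK(cudaSetDevice(1));
        CHECK(cudaMalloc(&dev1, bytes));
        CHECK(cudaSetDevice(0));
        CHECK(cudaMemcpyPeer(dev1, 1, dev0, 0, bytes)); /* staged via host if no P2P */
        CHECK(cudaSetDevice(1));
        CHECK(cudaFree(dev1));
        CHECK(cudaSetDevice(0));
    }

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dev0));
    CHECK(cudaFreeHost(pinned));
    free(pageable);
    return 0;
}
}}}

With pageable memory, cudaMemcpyAsync silently falls back to a blocking transfer, which is why the asynchronous case is only shown with the pinned buffer.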

4.2. Application-level benchmark suites

To ensure that performance is consistent with real-world applications, the Scalable HeterOgeneous Computing (SHOC) benchmark suite was also run on the gpu1 system. This benchmark suite provides benchmark results not only for CUDA implementations of key algorithms but also for corresponding implementations in the more general OpenCL parallel programming framework.

The SHOC benchmark suite currently contains benchmark programs categorized by complexity. Some measure low-level "feeds and speeds" behavior (Level 0), some measure the performance of a higher-level operation such as a Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).

  • Level 0
    • Bus-Speed-Download: measures bandwidth of transferring data across the PCIe bus to a device.
    • Bus-Speed-Readback: measures bandwidth of reading data back from a device.
    • Device-Memory: measures bandwidth of memory accesses to various types of device memory including global, local, and image memories.
    • Kernel-Compile: measures compile time for several OpenCL kernels, which range in complexity
    • Max-Flops: measures maximum achievable floating point performance using a combination of auto-generated and hand coded kernels.
    • Queue-Delay: measures the overhead of using the OpenCL command queue.
  • Level 1
    • FFT: forward and reverse 1D FFT.
    • MD: computation of the Lennard-Jones potential from molecular dynamics
    • Reduction: reduction operation on an array of single or double precision floating point values.
    • SGEMM: matrix-matrix multiply.
    • Scan: scan (also known as parallel prefix sum) on an array of single or double precision floating point values.
    • Sort: sorts an array of key-value pairs using a radix sort algorithm
    • Spmv: sparse matrix-vector multiplication
    • Stencil2D: a 9-point stencil operation applied to a 2D data set. In the MPI version, data is distributed across MPI processes organized in a 2D Cartesian topology, with periodic halo exchanges.
    • Triad: a version of the STREAM Triad benchmark, implemented in OpenCL and CUDA. This version includes PCIe transfer time. A minimal CUDA illustration of the Triad operation is shown after this list.
  • Level 2
    • S3D: A computationally-intensive kernel from the S3D turbulent combustion simulation program
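
For illustration, the following sketch (not taken from the SHOC sources) shows the operation the Level 1 Triad benchmark measures, a[i] = b[i] + s * c[i], including the PCIe transfers to and from the device; array size and launch configuration are arbitrary.

{{{
/* Illustration of the STREAM Triad operation a[i] = b[i] + s * c[i],
 * including host<->device transfers as in the SHOC Triad benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void triad(float *a, const float *b, const float *c, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = b[i] + s * c[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    float *da, *db, *dc;

    for (int i = 0; i < n; i++) { hb[i] = 1.0f; hc[i] = 2.0f; }

    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);  /* PCIe transfer in */
    cudaMemcpy(dc, hc, bytes, cudaMemcpyHostToDevice);

    triad<<<(n + 255) / 256, 256>>>(da, db, dc, 3.0f, n);

    cudaMemcpy(ha, da, bytes, cudaMemcpyDeviceToHost);  /* PCIe transfer out */
    printf("a[0] = %.1f (expected 7.0)\n", ha[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
}}}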

4.3. GPUDirect

<!> The term GPUDirect refers to a mechanism which allows different drivers to share pinned memory pages. This is realised by a (minor) modification of the Linux kernel and a change in the network device drivers (i.e. the IB device driver). The pages are pinned via the CUDA driver, and drivers that want to use these pages have to implement a mechanism which allows them to be notified about any change.

  • Kernel patch - The required patch has not made it into the current SL6 kernel, i.e. the kernel needs to be patched. Unfortunately the patch provided by NVIDIA needs to be changed in order to match the current SL6 version. This is doable, but of course means that the performance can only be verified for a (temporary) experimental setup and cannot make it into the standard deployment channel.
  • Using GPUDirect - programs just have to allocate memory using cudaMallocHost (instead of, e.g., malloc) and use the pointer for MPI send/recv functions. NVIDIA provides this example TODO. A minimal sketch is shown below.
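
The following is a minimal sketch of that usage pattern; it is not the NVIDIA example referred to above. The send/receive buffer is allocated with cudaMallocHost() so that, on a GPUDirect-enabled stack, the CUDA and InfiniBand drivers can share the same pinned pages. Buffer size and message tag are arbitrary.

{{{
/* Sketch: pinned host buffer allocated with cudaMallocHost() and passed
 * directly to MPI send/recv. Run with two ranks, e.g. mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;            /* number of floats exchanged */
    int rank = 0;
    float *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Page-locked host memory instead of malloc() */
    cudaMallocHost((void **)&buf, count * sizeof(float));

    if (rank == 0) {
        /* ... e.g. cudaMemcpy() results from the local GPU into buf ... */
        MPI_Send(buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... e.g. cudaMemcpy() buf onwards to the local GPU ... */
    }

    cudaFreeHost(buf);
    MPI_Finalize();
    return 0;
}
}}}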

5. GPU Applications

<!> Recent applications utilizing the gpu1 system at DESY (Zeuthen) are Chroma-based LQCD numerical simulations and applications from the field of astroparticle physics.

5.1. Debugger and Profiler Tools

<!>

  • Compute Visual Profiler
  • CUDA Debugger

6. Monitoring

<!>

  • gpu1 in Nagios
  • GPU cores temperature
  • host and device free and/or used memory
  • GPU core frequency
  • GPU cores utilization/load
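
One possible way to collect the device-side metrics listed above is the NVML C library that ships with the driver; the helper below is a hypothetical sketch, not an existing Nagios plugin, and host memory usage would still come from the usual OS tools.

{{{
/* Hypothetical monitoring helper using NVML (link against the NVML library).
 * Prints temperature, device memory usage, SM clock and utilization per GPU. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count = 0, i, temp, sm_clock;
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    nvmlUtilization_t util;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetCount(&count);

    for (i = 0; i < count; i++) {
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetMemoryInfo(dev, &mem);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock);
        nvmlDeviceGetUtilizationRates(dev, &util);
        printf("GPU %u: %u C, %llu/%llu MiB used, %u MHz SM clock, %u%% load\n",
               i, temp, mem.used >> 20, mem.total >> 20, sm_clock, util.gpu);
    }

    nvmlShutdown();
    return 0;
}
}}}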

7. New GPU server

<!>

An example specification for a second GPU server can be as follows:

  • Supermicro Barebone 7046GT-TRF, 1400 Watt
  • Supermicro X8DTG-QF Mainboard
  • 2x Intel Xeon X5667
  • Dual Intel 5520 Rev. 22 (Tylersburg) chipset
  • 48 GB DDR3-1333 with ECC
  • 2x 147GB SAS HDD
  • 2x 450GB SAS HDD
  • LSI 8704ELP SAS/S-ATA RAID Controller incl. BBU
  • 2x NVIDIA Fermi C2050 GPUs
  • Supermicro 19" Rackschienen for Barebone 7046GT-TRF
  • Mellanox MHQH19B-XTR ConnectX 2 Adapter Card, single Port QSFP, IB 40GB/s, PCI-Express x8

Optional components:

  • 2x Intel Xeon X5675 (instead of Intel X5667)
    • [6 instead of 4 cores per CPU]
  • 2x NVIDIA Fermi C2070 GPUs (instead of C2050)

8. Success/failure stories

Peer-to-Peer GPU memory access restrictions on dual Intel IOH machines


NOTE: Sections marked with <!> need further discussion or are work in progress
