GPU Scheduling and Resource Accounting: The Key to an Efficient AI Data Center

By Andy Morris, IBM Cognitive Infrastructure

December 9, 2019


GPUs are the new CPUs

GPUs have become a staple technology in modern HPC and AI data centers. Of 102 new supercomputers joining the coveted Top 500 list at this year’s supercomputing conference, fully 42 use NVIDIA GPUs, including the reigning #1 and #2 IBM systems, Summit and Sierra. While announcements from AMD and Intel promise increased GPU competition down the road, as of today, NVIDIA is the clear leader. GPU-based clusters now account for 90% of the top 30 systems on the Green500 list[1], a testament to their performance and power efficiency.

Read also: Introducing the world’s smartest, most powerful supercomputers

If there’s a downside to GPUs, it’s the cost. Depending on the configuration, a GPU-capable host can be in the range of 8-10x more expensive than a similar CPU-only system[2]. This cost differential makes it critical that GPU resources be used efficiently and that organizations be able to account for GPU-related spending by user, department, and project. In this article, we’ll provide an overview of GPU scheduling and resource accounting and explain why both are so important in modern HPC environments.

Powered by software

Much of the reason for the rise of GPUs is the mature software ecosystem around them. Virtually all top HPC and AI applications now provide some form of GPU support, accelerating applications anywhere from 20% to 1,000-fold[3]. NVIDIA’s CUDA environment makes GPUs easier to program, minimizing the learning curve for developers by allowing them to code in familiar languages. CUDA makes it easy to implement parallel code as blocks of threads and provides conveniences such as “unified memory,” which creates the appearance of a common memory space between CPUs and GPUs, freeing developers from the burden of repeatedly copying data between host memory and GPU devices[4].
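
To make that model concrete, here is a minimal CUDA sketch (an illustrative example added here, not taken from NVIDIA’s materials): a buffer allocated with cudaMallocManaged is written by host code and by a simple kernel, with no explicit copies in either direction.

    // Minimal unified-memory sketch: one buffer, visible to both CPU and GPU,
    // with no explicit cudaMemcpy calls. Build with nvcc.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));     // unified memory
        for (int i = 0; i < n; ++i) data[i] = 1.0f;      // initialized on the host

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // 256 threads per block
        cudaDeviceSynchronize();                         // wait before host access

        printf("data[0] = %.1f\n", data[0]);             // prints 2.0
        cudaFree(data);
        return 0;
    }

The same pointer is valid on both sides of the kernel launch; the CUDA runtime takes care of moving the data between host and device as needed.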

GPU-ready libraries such as cuBLAS, cuFFT, and cuDNN help developers harness GPUs for HPC and AI applications while avoiding low-level GPU programming. For example, HPC applications that use Basic Linear Algebra Subprograms (BLAS) libraries can be adapted to call GPU-optimized cuBLAS functions, enabling dramatic performance gains while minimizing development effort.
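
As a rough sketch of what that adaptation looks like (assuming square, column-major matrices already resident in device or managed memory), a host-side sgemm call can be replaced with cublasSgemm:

    // Illustrative cuBLAS sketch: C = A * B in single precision.
    // A, B, and C are assumed to be n x n, column-major, and allocated in
    // device or managed memory (e.g., with cudaMallocManaged).
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void gemm_on_gpu(const float *A, const float *B, float *C, int n) {
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        // Mirrors the Fortran sgemm argument order, with a handle added and
        // alpha/beta passed by pointer; no transposes, leading dimension n.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, A, n, B, n, &beta, C, n);

        cudaDeviceSynchronize();   // wait for the GEMM before the caller uses C
        cublasDestroy(handle);
    }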

GPUs pose unique challenges

While GPUs are here to stay, they pose unique challenges for HPC data center managers. Unlike traditional applications that execute on server CPUs, GPU applications consist of code that runs on CPUs (“host code”) as well as code that runs on GPUs (“device code”). The parallel device code optimized for execution on a GPU is referred to as a “kernel” in CUDA terminology. GPUs run kernels composed of blocks of many parallel threads. If a GPU application needs to parallelize operations across 1,000,000 array elements, for example, a developer might choose an execution configuration with 4096 blocks, each with 256 threads. Parameters such as the number of threads per block and the number of blocks can dramatically affect performance depending on the GPU architecture.
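
A minimal sketch of that execution configuration, using the illustrative numbers above: 4096 blocks of 256 threads launch 1,048,576 threads, so a simple bounds check keeps the extra threads from stepping past the 1,000,000 elements.

    // Execution-configuration sketch: 4096 blocks x 256 threads = 1,048,576
    // threads, enough to cover 1,000,000 elements with a bounds check.
    __global__ void add_one(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;          // surplus threads simply do nothing
    }

    void launch(float *x) {               // x: device or managed pointer
        const int n = 1000000;
        const int threadsPerBlock = 256;
        const int numBlocks = 4096;       // or (n + 255) / 256 = 3907 exactly
        add_one<<<numBlocks, threadsPerBlock>>>(x, n);
    }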

Complicating things further, there have been multiple generations of data center GPUs since the original Tesla design. These include Maxwell, Pascal, Volta, and the latest Turing design. Different GPUs have different capabilities, different amounts of memory, and different numbers of Streaming Multiprocessors (SMs). For example, a Tesla P100 GPU has 56 SMs, each capable of supporting 2,048 execution threads. Developers and application administrators need to consider details such as device compute capabilities and CUDA library compatibility when deciding where to run GPU applications.
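
As an illustration, the CUDA runtime exposes these per-device details directly; the small utility sketched below (our own example) prints the compute capability, SM count, threads per SM, and memory of each visible GPU.

    // Sketch: enumerate visible GPUs and print the capabilities that matter
    // when deciding where to place a GPU application.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("GPU %d: %s, compute capability %d.%d, %d SMs, "
                   "%d threads/SM, %.1f GiB memory\n",
                   d, p.name, p.major, p.minor, p.multiProcessorCount,
                   p.maxThreadsPerMultiProcessor,
                   p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }

On a Tesla P100, this reports 56 SMs and 2,048 threads per SM, matching the figures above.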

Read also: AI Needs Intelligent HPC Infrastructure

Scheduling plays a critical role

Running GPU applications on a single system is relatively straightforward, but in large HPC and AI environments, things can get complicated fast.

  • Compute environments are often heterogeneous, with multiple servers running multiple generations of GPUs
  • Multiple users, departments, and projects, each with different business, performance, and technology requirements, frequently compete for the same resources
  • Application needs vary widely – in some cases multiple GPU kernels may share a single GPU, while in others a single application may be distributed across multiple hosts and GPUs

HPC centers have long relied on workload managers such as IBM Spectrum LSF to optimize application performance and keep resources fully utilized, but for GPU-enabled applications, scheduling plays an especially critical role. This is partly because the resources are so expensive, but also because GPU application performance is especially sensitive to workload placement. It is also far too easy to underutilize GPU resources because of complex application and resource dependencies.

As the workload scheduler plays “Tetris,” placing the host and device portions of GPU workloads optimally across hosts, sockets, and cores, it needs detailed information. This includes GPU model, mode, device status, memory, and more. For example, details such as GPU operating temperature and ECC error counts can help avoid placing workloads on devices that are overheating or unhealthy.
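
As a sketch of the kind of telemetry involved (an illustration, not how Spectrum LSF itself is implemented), the snippet below uses NVML, the monitoring library that underlies nvidia-smi and DCGM, to read one device’s temperature, memory use, and uncorrected ECC error count.

    // Sketch: read scheduler-relevant health metrics for GPU 0 via NVML.
    // Link with -lnvidia-ml; error handling omitted for brevity.
    #include <cstdio>
    #include <nvml.h>

    int main() {
        nvmlInit();

        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);

        unsigned int temp = 0;                    // GPU core temperature (C)
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);

        nvmlMemory_t mem;                         // total/free/used device memory
        nvmlDeviceGetMemoryInfo(dev, &mem);

        unsigned long long eccErrors = 0;         // uncorrected ECC errors since reset
        nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                    NVML_VOLATILE_ECC, &eccErrors);

        printf("temp=%u C, memory used=%llu MiB, uncorrected ECC errors=%llu\n",
               temp, mem.used / (1024ULL * 1024ULL), eccErrors);

        nvmlShutdown();
        return 0;
    }

A production scheduler would typically consume the same metrics through DCGM’s health checks and policies rather than polling NVML directly.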

GPU-specific information is also essential for resource accounting: reporting accurately, at week- or month-end, which projects or departments used which GPU resources. For example, a distributed training job may execute across multiple hosts and GPUs, so to get an accurate picture of resources consumed, the scheduler needs to aggregate consumption metrics across those hosts and GPUs, including details such as execution time, GPU memory consumed, and GPU energy consumed, in addition to other resource metrics. Visibility into GPU SM and memory utilization can help HPC administrators tune scheduling and resource-sharing policies to improve overall efficiency.

Fortunately, NVIDIA provides software solutions, including NVIDIA Data Center GPU Manager (DCGM)[5], that provide access to a wide variety of GPU-specific metrics and expose them in real time to enable better scheduling decisions. Also, NVIDIA’s Multi-Process Service (MPS)[6] provides an alternative, binary-compatible implementation of the CUDA API that enables the cooperative multi-process CUDA applications common in MPI and deep learning environments. MPS also allows multiple GPU kernels to execute on the same GPU, helping avoid waste in cases where GPU cores or memory would otherwise be underutilized.


Spectrum LSF GPU support

Battle-tested on the world’s largest AI supercomputers, IBM Spectrum LSF provides rich support for both GPU scheduling and resource accounting. Spectrum LSF automatically configures itself to recognize GPUs and provides a comprehensive integration with NVIDIA DCGM. GPU-related metrics are integrated with LSF’s job and resource accounting facilities and are available to downstream reporting and monitoring solutions, including IBM Spectrum LSF Explorer. Spectrum LSF can also marshal NVIDIA MPS services to simplify multi-GPU applications and run containerized GPU workloads transparently using Spectrum LSF Application Profiles. To reduce data center operating costs, Spectrum LSF can automatically power down GPUs when they are not in use, subject to policy.

An example of Spectrum LSF GPU-aware scheduling and resource accounting is shown in the figure below. With visibility into all facets of GPU operation, job requirements, and CPU and GPU bus topologies, Spectrum LSF can deploy the host and device portions of GPU applications optimally, considering capability requirements and socket, core, and NVLink affinity to maximize performance.

By harnessing these IBM Spectrum LSF capabilities with NVIDIA software, data center managers can improve application throughput, use expensive GPUs more efficiently, and easily account for GPU resource usage by project, application, and department.

For users of NVIDIA DGX systems, IBM provides a comprehensive guide to running IBM Spectrum LSF with NVIDIA DGX systems.

  1. https://blogs.nvidia.com/blog/2019/11/19/record-gpu-accelerated-supercomputers-top500/
  2. A p3.8xlarge instance on AWS with 32 vCPUs and 4 NVIDIA V100 GPUs costs $12.24/hour on-demand. A similar m5a.8xlarge instance with 32 vCPUs costs $1.38 per hour. https://aws.amazon.com/ec2/pricing/on-demand/
  3. Acceleration varies widely depending on how much application logic can benefit from parallel execution on a GPU – https://www.nvidia.com/content/intersect-360-HPC-application-support.pdf
  4. Unified memory for CUDA beginners – https://devblogs.nvidia.com/unified-memory-cuda-beginners/
  5. NVIDIA Data Center GPU Manager – https://developer.nvidia.com/dcgm
  6. NVIDIA Multi-Process Service (MPS) – https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
