NCCL, the state-of-the-art library for distributed DNN training on GPUs, has a reputation for not having much documentation on its design. There is no paper to be found, no extended press release, just a few talks[2, 4] and webinars, and the code, which is open source. This is unfortunate, as NCCL is clearly doing something different from previous libraries.
This article presents the rise of RDMA in distributed computing, how chunking is a crucial technique for achieving link saturation, and why the ring topology is so important. The talks on NCCL are actually not a bad starting point, but over time the GitHub issues on the official repository have proven to be a rich resource for the details and first principles of NCCL’s design. What follows is a summary of those resources, along with some interpretation and reverse engineering of my own. The post closes with what I think could come next in this landscape.
Behind the fancy name, NCCL is essentially an implementation of MPI’s collective primitives, specifically optimized for inter-GPU communication, notably over 100 Gb/s+ RDMA-enabled interconnects. MPI, having been born before CUDA, does not support GPU communication natively. Some CUDA-aware implementations have proliferated in recent years, but performance has been an issue, not only from a distributed-algorithm standpoint, but also in leveraging the highly heterogeneous set of network technologies relevant in today’s landscape.
Since the infancy of distributed deep learning, networking, namely the ability to send a tensor from one GPU to another, remote GPU for gradient averaging, has been an inescapable bottleneck. Ideally, the GPUs should be able to communicate autonomously through direct access to the NIC via PCIe, without any involvement from the CPU or the OS. The technology that allows various devices, such as RAM or GPUs, to communicate directly through the NIC has been coined RDMA, for Remote Direct Memory Access. NVIDIA has fully invested in that paradigm and has developed high-capacity links and interconnects with RDMA-like capabilities, such as NVLink[3]. The DGX platform, an all-in-one server that hosts 8 to 16 GPUs all connected via NVLink, is a perfect example.
NCCL’s main purpose is to provide an MPI-style implementation over RDMA interconnects, especially InfiniBand. This is the most important edge that allows NCCL to outperform other libraries. NCCL exposes a lean interface with the usual broadcast, all-reduce, and all-gather, but has been reluctant to expose other verbs.
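To make that interface concrete, here is a minimal sketch of those verbs as they are most commonly consumed from Python, through PyTorch’s torch.distributed bindings with the NCCL backend (the raw C API exposes the same operations). The tensor size and the launch method (torchrun, which sets the rank and world size) are illustrative assumptions, not anything NCCL mandates.

```python
# Minimal sketch of NCCL's collective verbs via torch.distributed (NCCL backend).
# Assumes the script is launched with torchrun so rank/world size are already set.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")            # NCCL is the GPU backend
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    tensor = torch.ones(1 << 20, device="cuda") * rank  # illustrative payload

    # The lean set of verbs mentioned above:
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)        # gradient-averaging workhorse
    dist.broadcast(tensor, src=0)                        # e.g. parameter broadcast
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```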
In practice, NCCL automatically detects the underlying network topology when a communicator is initialized, and constructs optimal logical topologies on top of it. Note that a logical topology may very well be heterogeneous, such as a mix of TCP and PCIe. NCCL supports TCP sockets, PCIe, InfiniBand verbs (the API exposed by Mellanox NICs), and GPUDirect.
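The detection and the chosen transports can be observed, and to some extent constrained, through NCCL’s documented environment variables. A minimal sketch, where the interface name is a placeholder for whatever your machines actually use:

```python
# Hedged sketch: a few documented NCCL environment variables that expose or
# constrain the transports above. They must be set before the communicator
# (e.g. the torch.distributed process group) is created.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # log the detected topology and chosen rings/trees
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # interface for the TCP/socket path (placeholder name)
os.environ["NCCL_IB_DISABLE"] = "0"        # keep the InfiniBand (verbs) transport enabled
os.environ["NCCL_P2P_DISABLE"] = "0"       # keep PCIe/NVLink peer-to-peer enabled

# ... then initialize the process group / NCCL communicator as usual.
```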
An optimal peer-to-peer tensor transfer is a crucial building block to get right when implementing a collective communication library. GPUs are streaming devices, which means you should always take advantage of any pipeline parallelism that is available. For P2P send/receive, that translates into sending the tensor in chunks. The different transport technologies present different challenges, and chunking also becomes crucial when talking about collective algorithms: the aggregate network bandwidth should be saturated by balancing chunk sizes, so that chunks are not so large that receiving nodes experience queuing, but still large enough to amortize per-message overheads. If the interface is TCP or PCIe and the data must go through the CPU, the advantage is even more pronounced, as the device-to-host memory transfers and the socket sends/receives can be overlapped. You can experimentally match NCCL’s bandwidth with a few lines of code: TCP sockets, a correct ring implementation, and the optimal number of chunks.
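As an illustration, here is a minimal sketch of a chunked, pipelined send over a plain TCP socket. The chunk size, queue depth, and helper name are my own; in a real GPU pipeline the producer thread would be an asynchronous device-to-host copy into pinned memory rather than a slice of a bytes buffer.

```python
# Minimal sketch of chunked, pipelined transfer over TCP (names/sizes illustrative).
# Two threads form a two-stage pipeline: one "produces" chunks (standing in for the
# device-to-host copy), the other pushes them on the socket, so preparing chunk i+1
# overlaps sending chunk i.
import socket
import threading
import queue

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk, a plausible order of magnitude

def send_in_chunks(sock: socket.socket, buf: bytes) -> None:
    chunks: queue.Queue = queue.Queue(maxsize=4)  # bounded: keeps the pipeline balanced

    def producer():
        # Stand-in for per-chunk cudaMemcpyAsync(device -> pinned host).
        for off in range(0, len(buf), CHUNK_SIZE):
            chunks.put(buf[off:off + CHUNK_SIZE])
        chunks.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (chunk := chunks.get()) is not None:
        sock.sendall(chunk)  # overlaps with the producer preparing the next chunk
    t.join()
```

The receiver would mirror this with a bounded queue between socket reads and host-to-device copies, so both stages of the pipeline stay busy.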
NCCL also builds on previous work by leveraging latency- and bandwidth-optimal algorithms and topologies. Ring topologies are well known in distributed systems for having optimal bandwidth (under usual assumptions). This is because each node sends to only a single peer and thus has the full link bandwidth available for that transfer. It also shows in the worst-case amount of data any node has to send, which stays roughly constant as the number of nodes grows. Abstractions built on top of the ring, such as all-reduce, display similar properties. This is explained in a blog post[1] about Baidu’s ring all-reduce, which revived an old HPC idea for the sake of distributed deep learning. On the flip side, if latency is the concern, then the best topology becomes the tree, which completes a broadcast or reduction in O(log n) steps.
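To make the ring all-reduce of [1] concrete, here is a single-process simulation of its data movement (not NCCL’s actual kernels): each node’s buffer is split into N chunks, and every step each node passes one chunk to its right neighbour, first accumulating (reduce-scatter), then copying (all-gather).

```python
# Single-process simulation of ring all-reduce as described in [1].
# Each of the N "nodes" holds a vector; it is split into N chunks, and each of the
# 2(N-1) steps moves one chunk per node to its right neighbour.
from typing import List

def ring_all_reduce(buffers: List[List[float]]) -> None:
    n = len(buffers)
    size = len(buffers[0])
    bounds = [size * i // n for i in range(n + 1)]

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Reduce-scatter: after n-1 steps, node r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            src = (r - step) % n          # index of the chunk node r sends this step
            dst = (r + 1) % n             # right neighbour in the ring
            for i in chunk(src):
                buffers[dst][i] += buffers[r][i]

    # All-gather: circulate the reduced chunks until every node has the full result.
    for step in range(n - 1):
        for r in range(n):
            src = (r + 1 - step) % n
            dst = (r + 1) % n
            for i in chunk(src):
                buffers[dst][i] = buffers[r][i]

if __name__ == "__main__":
    data = [[float(r)] * 8 for r in range(4)]       # 4 nodes, 8 elements each
    ring_all_reduce(data)
    assert all(buf == [6.0] * 8 for buf in data)    # 0 + 1 + 2 + 3 = 6 everywhere
```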
NCCL aims to provide the best of both worlds by dynamically switching between the two. Notably, small tensors are not bandwidth-bound and their transfer time is dominated by latency, so a tree transfer makes more sense, whereas large tensors are limited by throughput and so should be communicated through a ring.
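A standard way to make this trade-off explicit is the latency-bandwidth (alpha-beta) cost model. The formulas below describe a naive, non-pipelined tree and a chunked ring, not NCCL’s actual tuning tables, but they capture the crossover:

```latex
% \alpha: per-message latency, \beta: per-byte transfer time, S: tensor size, N: nodes
T_{\text{ring}}(S) \;\approx\; 2(N-1)\,\alpha \;+\; \frac{2(N-1)}{N}\,S\,\beta,
\qquad
T_{\text{tree}}(S) \;\approx\; 2\lceil\log_2 N\rceil\,(\alpha + S\,\beta).
```

For small S the alpha terms dominate and the tree’s logarithmic step count wins; for large S the beta terms dominate and the ring’s roughly 2Sβ cost is near optimal, which matches the switching behaviour just described. NCCL’s real tree algorithm is pipelined, so its bandwidth term is far better than this naive bound, but the latency argument is the same.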
Relying on the three main ideas presented in this article, you should be able to implement an MPI-like library with performance close to NCCL’s on GPUs. If you look at NCCL’s source code, much of the remaining complexity can be attributed to supporting multiple data types, multi-threading safety, and similar engineering concerns.
Future work
NCCL has been the state of the art for some time, with native support in PyTorch and TensorFlow, but it has its shortcomings.
The first, for which Blink[5] has attempted to provide a solution, is the inability to adapt optimally to different topologies. Ring-based primitives leave links under-utilized when links are heterogeneous, because the throughput of a ring primitive is limited by its slowest link. Blink instead models the optimal primitive as a graph problem, where nodes are accelerators and edges are links. A classical graph-theory result then tells us that the maximum flow from a root to all other nodes can be achieved by finding a maximal packing of spanning trees. This holds assuming that GPUs can perform computation on transferred tensors at line rate, and that they can accommodate multiple spanning trees.
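The slowest-link limitation can be stated directly in the cost model from above (a simplification of mine, not Blink’s exact formulation):

```latex
% Ring all-reduce of S bytes over ring links with bandwidths B_1, ..., B_N:
T_{\text{ring}}(S) \;\gtrsim\; \frac{2(N-1)}{N} \cdot \frac{S}{\min_i B_i}.
```

Whatever capacity the faster links have beyond $\min_i B_i$ sits idle, and that headroom is exactly what Blink’s spanning-tree packing tries to reclaim.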
The second main shortcoming is that data parallelism on its own is likely to come to an end, because it is not optimal at scale, even though it should remain a relevant paradigm. The two most promising emerging directions are hybrid-parallelism optimization, with works such as FlexFlow[6], and fully decentralized approaches based on gossiping gradient updates, such as PowerGossip[7].
References
- [1] Bringing HPC Techniques to Deep Learning, Andrew Gibiansky, 2017
- [2] NCCL 2.0, Sylvain Jeaugey
- [3] Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect, Li et al., arXiv 2019
- [4] Scaling Deep Learning Training with NCCL, Sylvain Jeaugey, developer.nvidia.com 2018
- [5] Blink: Fast and Generic Collectives for Distributed ML, Wang et al., MLSys 2020
- [6] Beyond Data and Model Parallelism for Deep Neural Networks, Jia et al., arXiv 2018
- [7] Practical Low-Rank Communication Compression in Decentralized Deep Learning, Vogels et al., NeurIPS 2020