Hello team!
I'm analyzing GPU utilization during inter-GPU communication with NCCL.
Here is the test setup:
- Model: ResNet50
- Framework: PyTorch with Horovod
- Machines: two AWS EC2 p3.2xlarge instances (one Tesla V100 GPU per node)
- NCCL_VERSION=2.8.4-1
In this 2-node distributed training, the two workers perform AllReduce operations on the calculated gradients during backward propagation.
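For reference, the training script follows the standard Horovod + PyTorch pattern; a rough sketch (the data pipeline and hyperparameters below are simplified stand-ins, not my exact configuration):

```python
import torch
import torch.nn.functional as F
import torchvision
import horovod.torch as hvd

hvd.init()                                      # one process per GPU (one per node here)
torch.cuda.set_device(hvd.local_rank())

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size(), momentum=0.9)

# DistributedOptimizer registers per-parameter gradient hooks; as each gradient
# finishes in backward(), the hook hands it to Horovod, which allreduces it
# (via NCCL in this setup) while the rest of the backward pass keeps running.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Dummy batch just to keep the sketch self-contained.
images = torch.randn(32, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()                             # gradient allreduce overlaps with this
    optimizer.step()
```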
Without NCCL, I would expect the workflow of an AllReduce operation to look like this (a rough sketch in code follows the list):
- GPU calculates the gradient of ResNet50 layers in the backward direction
- Once a layer's gradient has been calculated, Node0's CPU copies the gradient from the GPU's DRAM to the CPU's DRAM
- Node0's CPU sends its local gradient (now sitting in the CPU's DRAM) to the other node through the NIC
- Node1's CPU receives the gradient from Node0 and copies it to the GPU's DRAM
- The same happens in the opposite direction (Node1 => Node0) as well
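To make that CPU-staged path concrete, here is the rough sketch referred to above, written against torch.distributed's Gloo backend purely for illustration (the tensor size and the environment-variable rendezvous are assumptions, not part of my actual setup):

```python
import torch
import torch.distributed as dist

# Illustration only: a CPU-staged allreduce of a single gradient tensor,
# matching the step-by-step workflow above. Assumes the usual
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE environment variables are set.
dist.init_process_group(backend="gloo")

grad_gpu = torch.randn(1000, device="cuda")       # stands in for one layer's gradient

grad_cpu = grad_gpu.cpu()                         # GPU DRAM -> CPU DRAM copy
dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)   # CPUs exchange data over the NIC
grad_cpu /= dist.get_world_size()                 # average across the two workers
grad_gpu.copy_(grad_cpu)                          # CPU DRAM -> GPU DRAM copy
```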
Now we have NCCL. The screenshot below shows the Nsight timeline of the 2-node training with some nvtx labels (FP: forward propagation, BP: backward propagation, Calc Grads: calculating gradients, Update Parms: updating parameters).
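For context, those nvtx ranges are pushed around each phase of the training step; a minimal sketch of the annotation pattern (the exact nesting in my script may differ slightly):

```python
import torch
import torch.cuda.nvtx as nvtx
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    nvtx.range_push("FP")                 # forward propagation
    loss = F.cross_entropy(model(images), labels)
    nvtx.range_pop()

    nvtx.range_push("BP")                 # backward propagation
    nvtx.range_push("Calc Grads")         # gradient computation (allreduce overlaps here)
    loss.backward()
    nvtx.range_pop()
    nvtx.range_push("Update Parms")       # parameter update
    optimizer.step()
    nvtx.range_pop()
    nvtx.range_pop()
```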
There are several points that I really want to understand clearly.
As far as I know, NCCL launches separate CUDA kernels to perform communication operations on the GPU.
So I'm curious what exactly the CUDA kernels launched by NCCL AllReduce operations are doing during the communication. Are the kernels performing memory copies from the GPU's DRAM to the CPU's DRAM via DMA?
Why is the CPU utilization so high during backward propagation? Is it because of synchronization checks across workers?
I also see that the GPU is fully utilized during backward propagation. I believe it's handling both the computation and the NCCL communication. How are GPU resources divided between computation and communication here?
I would really appreciate any insights on this. Please also point out anything my understanding gets wrong. Thanks!