Hello team!
I'm analyzing GPU utilization during inter-GPU communication with NCCL.
Here is the test setup:
- Model: ResNet50
- Framework: PyTorch with Horovod
- Machines: two AWS EC2 p3.2xlarge instances (one Tesla V100 GPU per node)
- NCCL_VERSION=2.8.4-1
In this 2-node distributed training, the two workers perform AllReduce operations on the calculated gradients during backward propagation.
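For reference, the training script follows the standard Horovod + PyTorch pattern; a rough sketch (the data pipeline and hyperparameters below are simplified stand-ins, not my exact configuration):

```python
import torch
import torch.nn.functional as F
import torchvision
import horovod.torch as hvd

hvd.init()                                      # one process per GPU (one per node here)
torch.cuda.set_device(hvd.local_rank())

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size(), momentum=0.9)

# DistributedOptimizer registers per-parameter gradient hooks; as each gradient
# finishes in backward(), the hook hands it to Horovod, which allreduces it
# (via NCCL in this setup) while the rest of the backward pass keeps running.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Dummy batch just to keep the sketch self-contained.
images = torch.randn(32, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()                             # gradient allreduce overlaps with this
    optimizer.step()
```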
Without NCCL, I would expect the workflow of an AllReduce operation to look like this (a rough sketch in code follows the list):
- GPU calculates the gradient of ResNet50 layers in the backward direction
- Once a layer's gradient has been calculated, Node0's CPU copies the gradient from the GPU's DRAM to the CPU's DRAM
- Node0's CPU sends its local gradient (now sitting in the CPU's DRAM) to the other node through the NIC
- Node1's CPU receives the gradient from Node0 and copies it to the GPU's DRAM
- The same happens in the opposite direction (Node1 => Node0) as well
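To make that CPU-staged path concrete, here is the rough sketch referred to above, written against torch.distributed's Gloo backend purely for illustration (the tensor size and the environment-variable rendezvous are assumptions, not part of my actual setup):

```python
import torch
import torch.distributed as dist

# Illustration only: a CPU-staged allreduce of a single gradient tensor,
# matching the step-by-step workflow above. Assumes the usual
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE environment variables are set.
dist.init_process_group(backend="gloo")

grad_gpu = torch.randn(1000, device="cuda")       # stands in for one layer's gradient

grad_cpu = grad_gpu.cpu()                         # GPU DRAM -> CPU DRAM copy
dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)   # CPUs exchange data over the NIC
grad_cpu /= dist.get_world_size()                 # average across the two workers
grad_gpu.copy_(grad_cpu)                          # CPU DRAM -> GPU DRAM copy
```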
Now we have NCCL. The screenshot below shows the Nsight timeline of the 2-node training with some nvtx labels (FP: forward propagation, BP: backward propagation, Calc Grads: calculating gradients, Update Parms: updating parameters).
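For context, those nvtx ranges are pushed around each phase of the training step; a minimal sketch of the annotation pattern (the exact nesting in my script may differ slightly):

```python
import torch
import torch.cuda.nvtx as nvtx
import torch.nn.functional as F

def train_step(model, optimizer, images, labels):
    nvtx.range_push("FP")                 # forward propagation
    loss = F.cross_entropy(model(images), labels)
    nvtx.range_pop()

    nvtx.range_push("BP")                 # backward propagation
    nvtx.range_push("Calc Grads")         # gradient computation (allreduce overlaps here)
    loss.backward()
    nvtx.range_pop()
    nvtx.range_push("Update Parms")       # parameter update
    optimizer.step()
    nvtx.range_pop()
    nvtx.range_pop()
```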
There are several points that I really want to understand clearly.
As far as I know, NCCL launches separate CUDA kernels to perform communication operations on the GPU.
So I'm curious what exactly the CUDA kernels launched by NCCL AllReduce operations are doing during the communication. Are the kernels performing memory copies from the GPU's DRAM to the CPU's DRAM via DMA?
Why is the CPU utilization so high during backward propagation? Is it because of synchronization checks across workers?
I also see that the GPU is fully utilized during backward propagation. I believe it's handling both the computation and the NCCL communication. How are GPU resources divided between computation and communication here?
I would really appreciate any insights on this. Please also point out anything my understanding gets wrong. Thanks!