Hi @ihchoi12, all those You could also enable Horovod timeline, then you will get more detailed annotations in Nsight Systems, too.
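For anyone wanting to try the timeline suggestion, here is a minimal sketch of a helper that assembles a `horovodrun` invocation with the timeline enabled. The helper name `horovodrun_command`, the host names `node1`/`node2`, the script name `train.py`, and the output path are all hypothetical; the flags are the ones I believe `horovodrun` exposes, so please check `horovodrun --help` for your installed version before relying on them.

```python
import shlex
from typing import List, Optional


def horovodrun_command(
    script: str,
    hosts: List[str],
    timeline_path: str,
    fusion_threshold_mb: Optional[int] = None,
    cycle_time_ms: Optional[float] = None,
    mark_cycles: bool = False,
) -> List[str]:
    """Build (but do not run) a horovodrun command line with the timeline on.

    `hosts` entries use the "hostname:slots" form, e.g. "node1:1".
    """
    # Total process count is the sum of the slots on every host.
    num_procs = sum(int(h.rsplit(":", 1)[1]) for h in hosts)
    cmd = [
        "horovodrun",
        "-np", str(num_procs),
        "-H", ",".join(hosts),
        "--timeline-filename", timeline_path,
    ]
    if mark_cycles:
        # Asks the timeline to mark fusion cycle boundaries.
        cmd.append("--timeline-mark-cycles")
    if fusion_threshold_mb is not None:
        cmd += ["--fusion-threshold-mb", str(fusion_threshold_mb)]
    if cycle_time_ms is not None:
        cmd += ["--cycle-time-ms", str(cycle_time_ms)]
    cmd += ["python", script]
    return cmd


# Hypothetical two-node setup with one GPU per node (as on p3.2xlarge).
cmd = horovodrun_command(
    "train.py", ["node1:1", "node2:1"], "/tmp/timeline.json", mark_cycles=True
)
print(shlex.join(cmd))
```

The resulting JSON file can be opened in a Chrome tracing viewer; marking cycles should make it easier to see which tensors were grouped into the same allreduce.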
According to the Horovod documentation, the default fusion buffer size is 128 MB. I'm training ResNet50 on two AWS p3.2xlarge nodes, and the ResNet50 model is 98 MB. Does that mean all of the parameters should be batched into a single allreduce operation? In my experiment, however, I see many HorovodAllreduce operations invoked in each iteration, as shown in the Nsight timeline below:

Could you point out whether I have misunderstood something?
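One possible reading of the observation, sketched as a toy model rather than Horovod's actual implementation: Tensor Fusion can only batch gradients that are already available when a fusion cycle runs, and backpropagation produces gradients layer by layer. Under that assumption, a buffer larger than the model does not by itself guarantee a single allreduce. The tensor names, sizes, ready times, and the 5 ms cycle below are made-up illustration values, and `fuse` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024


@dataclass
class Grad:
    name: str
    size_bytes: int
    ready_ms: float  # hypothetical time at which backprop emits this gradient


def fuse(grads: List[Grad], threshold_bytes: int, cycle_ms: float) -> List[List[str]]:
    """Group gradients into allreduce batches under a simplified cycle model.

    At the end of every cycle, all gradients that are ready are packed
    greedily, in order, into batches of at most `threshold_bytes`.
    A single gradient larger than the threshold is sent on its own.
    """
    pending = sorted(grads, key=lambda g: g.ready_ms)
    batches: List[List[str]] = []
    queue: List[Grad] = []
    now = cycle_ms
    i = 0
    while i < len(pending) or queue:
        # Collect everything that became ready up to the current cycle end.
        while i < len(pending) and pending[i].ready_ms <= now:
            queue.append(pending[i])
            i += 1
        # Pack the ready gradients into as few batches as the threshold allows.
        while queue:
            batch: List[str] = []
            used = 0
            while queue and (not batch or used + queue[0].size_bytes <= threshold_bytes):
                g = queue.pop(0)
                batch.append(g.name)
                used += g.size_bytes
            batches.append(batch)
        now += cycle_ms
    return batches


# Four made-up gradient tensors totalling 98 MB, emitted at different times.
grads = [
    Grad("fc", 30 * MB, 2.0),
    Grad("layer4", 30 * MB, 4.0),
    Grad("layer3", 20 * MB, 11.0),
    Grad("layer2", 18 * MB, 12.0),
]
print(fuse(grads, threshold_bytes=128 * MB, cycle_ms=5.0))
```

In this toy run the 98 MB of gradients end up in two batches even though the threshold is 128 MB, because the later tensors are not ready when the first cycle fires. If that is what is happening in the real run, the Horovod timeline should show it, and a longer cycle time would be expected to produce fewer, larger allreduces.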