Communication Time in Horovod - When is data being transferred? #3521
For an algorithm I have written, I am trying to measure its communication time. I think communication is interleaved with computation, as stated in other threads, but what would be the fairest way to compare two algorithms? I have two tracing files: one using SGD, and one using my own method. My main question is: which part of the tracing file would scale (roughly) with the network speed? Is it
Then for
Or a combination of above? I have read the timeline
Is this the only operation where data is being sent and received? Thanks in advance, Mario
Replies: 1 comment
Yes, NCCL_ALLREDUCE and NCCL_ALLGATHER are when the actual tensor payload is transferred over your network. Apart from that there is some low-bandwidth negotiation.
You might also want to look into profiling with Nsight Systems to get a more detailed understanding; see #2723 for some hooks in Horovod.
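To compare the two tracing files on this basis, one option is to add up the time spent in the NCCL_ALLREDUCE and NCCL_ALLGATHER activities in each file. The following is a minimal sketch, not an official Horovod tool. It assumes the timeline is a list of Chrome-trace-style events in which begin events (`"ph": "B"`) carry a `name`, end events (`"ph": "E"`) close the most recent open event with the same `pid`, and `ts` is in microseconds. The function name `time_per_activity` and the sample events are hypothetical, so check the field layout against your own file.

```python
from collections import defaultdict


def time_per_activity(events):
    """Sum the duration (in microseconds) of each named activity.

    Assumes Chrome-trace-style dicts: a "B" event opens an activity and
    an "E" event closes the most recently opened one with the same pid.
    """
    open_events = defaultdict(list)  # pid -> stack of (name, start_ts)
    totals = defaultdict(int)        # name -> accumulated microseconds
    for ev in events:
        phase = ev.get("ph")
        if phase == "B":
            open_events[ev["pid"]].append((ev["name"], ev["ts"]))
        elif phase == "E" and open_events[ev["pid"]]:
            name, start = open_events[ev["pid"]].pop()
            totals[name] += ev["ts"] - start
    return dict(totals)


# Hypothetical hand-written events, not taken from a real trace.
sample = [
    {"ph": "B", "name": "ALLREDUCE", "ts": 0, "pid": 1},
    {"ph": "B", "name": "NCCL_ALLREDUCE", "ts": 10, "pid": 1},
    {"ph": "E", "ts": 60, "pid": 1},
    {"ph": "E", "ts": 70, "pid": 1},
    {"ph": "B", "name": "NCCL_ALLGATHER", "ts": 20, "pid": 2},
    {"ph": "E", "ts": 50, "pid": 2},
]

totals = time_per_activity(sample)
# Only the activities named in the reply are counted as payload transfer.
transfer_us = totals.get("NCCL_ALLREDUCE", 0) + totals.get("NCCL_ALLGATHER", 0)
print(transfer_us)
```

Because transfers for different tensors can overlap with each other and with computation, this sum is an upper bound on the wall-clock time spent communicating, not the time the training step actually waits on the network.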