Replies: 5 comments
-
Another issue I ran into is that I cannot get the same result from what I see on chrome://tracing and from the JSON file. I extract the time based on For example, I have this json file from
-
There can be some small extra delays from framework code. I don't know much about PyTorch, but at least with graph-based TensorFlow code these delays tend to shrink in subsequent training steps. How large are the discrepancies you've seen, btw? I've found Nvidia's Nsight profiling tools really helpful for understanding where my models spend time. To make this easier with Horovod, I've contributed PR #2723, which lets you see both (1) timing ranges approximately corresponding to your torch timings and (2) all the detailed Horovod timeline information in the same profiling display.
-
@maxhgerlach Hi Max, thanks for the reply. The horovod records
-
@dexhunter, on your plot the extra latency seems to grow with the message size. Might this just be latency incurred by copying the data around?
-
@maxhgerlach Hi, thanks for the reply. From my understanding, the time for executing the operation also grows with larger message sizes. The plot is based on the timeline JSON file, so the time spent moving the data around should already be included, as seen in WAIT_FOR_DATA. Or maybe I should change the y-label to "execution time"?
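To make it concrete what a duration taken from the timeline JSON covers, here is a minimal sketch of pairing begin/end events and summing the time per activity. It assumes the file follows the Chrome Trace Event Format (`"ph": "B"` / `"E"` events with `ts` in microseconds) and that the array may be left unclosed. The `SAMPLE` events, their nesting, and the helper names `load_trace` and `activity_durations` are hypothetical and only for illustration; a real timeline may be laid out differently.

```python
import json
from collections import defaultdict


def load_trace(text):
    """Parse a Chrome-trace style JSON array, tolerating a missing closing bracket."""
    text = text.strip()
    if not text.endswith("]"):
        # Assumption: the writer streams events and may never close the array.
        text = text.rstrip(",") + "]"
    return json.loads(text)


def activity_durations(events):
    """Pair "B"/"E" events per (pid, tid) and sum microseconds per activity name."""
    stacks = defaultdict(list)   # (pid, tid) -> stack of open (name, start_ts)
    totals = defaultdict(float)  # activity name -> total microseconds
    for ev in events:
        key = (ev.get("pid"), ev.get("tid"))
        if ev.get("ph") == "B":
            stacks[key].append((ev["name"], ev["ts"]))
        elif ev.get("ph") == "E" and stacks[key]:
            # An end event closes the most recently opened range on that track.
            name, start = stacks[key].pop()
            totals[name] += ev["ts"] - start
    return dict(totals)


# Hypothetical sample: an outer ALLREDUCE range with two nested activities.
SAMPLE = """[
{"name": "ALLREDUCE", "ph": "B", "ts": 100, "pid": 1, "tid": 1},
{"name": "WAIT_FOR_DATA", "ph": "B", "ts": 110, "pid": 1, "tid": 1},
{"ph": "E", "ts": 150, "pid": 1, "tid": 1},
{"name": "NCCL_ALLREDUCE", "ph": "B", "ts": 160, "pid": 1, "tid": 1},
{"ph": "E", "ts": 400, "pid": 1, "tid": 1},
{"ph": "E", "ts": 420, "pid": 1, "tid": 1},
"""

durations = activity_durations(load_trace(SAMPLE))
for name, us in durations.items():
    print(f"{name}: {us / 1000.0:.3f} ms")
```

In this sample the outer range already contains the nested wait and communication time, so summing the outer and the nested entries together would count that time twice. Which of the two numbers is plotted decides whether the y-axis is better labeled as latency or as execution time.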
-
Environment:
Bug report:
I use a simple script to record the elapsed time for an allreduce op, but I am getting different elapsed times from torch.cuda.Event(enable_timing=True) and the built-in timeline of Horovod.
My script
However, when I print out the result, there is always a gap between what torch records and the time on the timeline. I am just wondering which one will be more accurate? Thanks!
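For readers who want to reproduce this kind of measurement, here is a minimal sketch of an event-based timing loop. It is not the script from this post, which did not survive on the page. The helper name `time_op_ms`, the warm-up and iteration counts, and the CPU fallback are assumptions added so the sketch also runs without a GPU. Note that the events bracket everything issued between the two `record()` calls, which may not be the same interval the timeline reports.

```python
import time

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None
    HAVE_CUDA = False


def time_op_ms(op, warmup=3, iters=10):
    """Return per-iteration elapsed times in ms for op(), after warm-up runs."""
    for _ in range(warmup):
        op()  # warm-up iterations are executed but not timed
    if HAVE_CUDA:
        torch.cuda.synchronize()  # drain pending GPU work before timing starts
    times = []
    for _ in range(iters):
        if HAVE_CUDA:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            op()
            end.record()
            end.synchronize()  # wait until the end event has been reached
            times.append(start.elapsed_time(end))  # elapsed_time returns ms
        else:
            # Fallback (assumption): wall-clock timing when CUDA is unavailable.
            t0 = time.perf_counter()
            op()
            times.append((time.perf_counter() - t0) * 1000.0)
    return times


# Stand-in workload; with Horovod initialized this could instead be
# something like: lambda: hvd.allreduce(tensor)
times = time_op_ms(lambda: sum(range(1000)), warmup=1, iters=4)
print([f"{t:.3f} ms" for t in times])
```

Keeping the per-iteration values instead of a single average makes it easier to see whether the gap to the timeline is a constant offset or mostly comes from the first few steps.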