Different Time Recorded for Allreduce Operation #3067

dexhunter · 2021-06-30T09:12:25Z

dexhunter
Jun 30, 2021

Environment:

Framework: PyTorch
Framework version: 1.7.1
Horovod version: 0.22.0
MPI version: 4.0.1
CUDA version: 10.1
NCCL version: 2.6.4
Python version: 3.7.10
Spark / PySpark version:
Ray version:
OS and version: Ubuntu
GCC version: 5.4.0
CMake version:

Bug report:

I use a simple script to record the elapsed time for an allreduce op, but I get different elapsed times from torch.cuda.Event(enable_timing=True) and from Horovod's built-in timeline.

My script

import torch
import horovod.torch as hvd
import numpy as np

hvd.init()
torch.cuda.set_device(hvd.local_rank())

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
tensor_size = 2**28
x = torch.randn(tensor_size, dtype=torch.float).cuda()

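# allreduce op, bracketed by CUDA events recorded on the current stream
start.record()
reduced = hvd.allreduce(x, average=False)
end.record()

# wait for both events to complete before reading the timer
torch.cuda.synchronize()

elapsed_time = start.elapsed_time(end)  # milliseconds
print(f"rank {hvd.rank()}: allreduce took {elapsed_time:.3f} ms")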

However, when I print the result, there is always a gap between the time torch records and the time shown on the timeline. Which one is more accurate? Thanks!

dexhunter · 2021-07-01T07:43:42Z

dexhunter
Jul 1, 2021
Author

Another issue I ran into is that the times I see in chrome://tracing do not match the times I extract from the JSON file. I extract the time per pid, but the results are slightly off. Any help there? Thanks in advance!

For example, I have a JSON file produced with HOROVOD_TIMELINE and a Python script that extracts the time from the ts field for each pid, but the two results differ: for pid=28 I get 815.850 ms from the Python script and 818.350 ms from chrome://tracing.
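One way to cross-check numbers like these is to compute two different quantities per pid from the trace and see which one the viewer is reporting: the summed duration of each named state, and the overall span from first to last timestamp. The sketch below assumes the file follows the Chrome Trace Event Format (ts and dur in microseconds, ph values B, E and X) and that the JSON array may lack its closing bracket, which that format permits. The function names are hypothetical, and this is not the script from the post above.

```python
import json
from collections import defaultdict


def load_trace_events(text):
    """Parse a Chrome trace, tolerating an array with no closing ']'."""
    text = text.strip()
    if text.startswith("{"):
        # Object form: the events live under "traceEvents".
        return json.loads(text)["traceEvents"]
    if not text.endswith("]"):
        # Array form written incrementally: drop a trailing comma, close it.
        text = text.rstrip(",") + "]"
    return json.loads(text)


def durations_by_pid(events):
    """Sum B/E and X durations per (pid, name), in milliseconds."""
    open_stacks = defaultdict(list)  # (pid, tid) -> stack of (name, start ts)
    totals = defaultdict(float)      # (pid, name) -> total ms
    timed = [e for e in events if "ts" in e]
    # sorted() is stable, so events sharing a timestamp keep file order.
    for ev in sorted(timed, key=lambda e: e["ts"]):
        ph = ev.get("ph")
        key = (ev.get("pid"), ev.get("tid"))
        if ph == "B":
            open_stacks[key].append((ev.get("name"), ev["ts"]))
        elif ph == "E" and open_stacks[key]:
            # An E event closes the most recent B on the same pid/tid.
            name, start = open_stacks[key].pop()
            totals[(ev.get("pid"), name)] += (ev["ts"] - start) / 1e3
        elif ph == "X":
            totals[(ev.get("pid"), ev.get("name"))] += ev.get("dur", 0) / 1e3
    return dict(totals)


def span_by_pid(events):
    """Span per pid: latest end minus earliest start, in milliseconds."""
    first, last = {}, {}
    for ev in events:
        if ev.get("ph") not in ("B", "E", "X"):
            continue  # skip metadata and other event types
        pid, ts = ev["pid"], ev["ts"]
        end = ts + ev.get("dur", 0)
        first[pid] = min(first.get(pid, ts), ts)
        last[pid] = max(last.get(pid, end), end)
    return {pid: (last[pid] - first[pid]) / 1e3 for pid in first}


# Synthetic example events, not taken from the timeline in this thread.
SAMPLE = """[
{"name": "NEGOTIATE_ALLREDUCE", "ph": "B", "pid": 1, "tid": 1, "ts": 0},
{"ph": "E", "pid": 1, "tid": 1, "ts": 1500},
{"name": "ALLREDUCE", "ph": "B", "pid": 1, "tid": 1, "ts": 1500},
{"ph": "E", "pid": 1, "tid": 1, "ts": 4000},
"""

if __name__ == "__main__":
    events = load_trace_events(SAMPLE)
    print(durations_by_pid(events))
    print(span_by_pid(events))
```

If the span and the summed state durations differ for a pid, the difference between the script and the viewer may simply come from which events each one includes.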


maxhgerlach · 2021-07-02T09:07:35Z

maxhgerlach
Jul 2, 2021
Collaborator

> I use a simple script to record the elapsed time for allreduce op but I am getting different elapsed time from torch.cuda.Event(enable_timing=True) and the built-in timeline of horovod.

There can be some small extra delays from framework code. I don't know too much about PyTorch, but at least with graph-based TensorFlow code these things tend to get better in subsequent training steps. How large are the discrepancies you've seen, btw?

I've found NVIDIA's Nsight profiling tools really helpful for understanding where my models spend time. To make this easier with Horovod, I contributed PR #2723, which lets you see both (1) timing ranges approximately corresponding to your torch timings and (2) all the detailed Horovod timeline information in the same profiling display.
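Since first-step delays can dominate a single measurement, a common approach is to run a few untimed warm-up iterations and then report a median over several timed ones. The sketch below is a minimal, framework-agnostic version: `time_op` is a hypothetical helper, not part of Horovod or PyTorch, and it measures host wall-clock time with `time.perf_counter` rather than CUDA events.

```python
import statistics
import time


def time_op(op, sync=lambda: None, warmup=3, iters=10):
    """Return per-iteration wall-clock times of op() in milliseconds.

    sync is called to drain pending asynchronous work before the clock
    starts and again before it stops. The default does nothing.
    """
    for _ in range(warmup):
        op()  # untimed: absorbs one-off setup costs
    sync()
    times_ms = []
    for _ in range(iters):
        sync()  # make sure earlier work is not billed to this run
        t0 = time.perf_counter()
        op()
        sync()  # wait for the op itself to finish
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return times_ms


if __name__ == "__main__":
    # CPU-only stand-in so the sketch runs anywhere.
    samples = time_op(lambda: sum(range(10_000)))
    print(f"median {statistics.median(samples):.3f} ms over {len(samples)} runs")
```

With the script from the original post, this could be called as `time_op(lambda: hvd.allreduce(x, average=False), sync=torch.cuda.synchronize)`, assuming `hvd` and `x` are set up as shown there.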


dexhunter · 2021-07-02T10:12:00Z

dexhunter
Jul 2, 2021
Author

> How large are the discrepancies you've seen, btw?

@maxhgerlach Hi Max, thanks for the reply. Horovod records NEGOTIATE_ALLREDUCE for ~4949 ms, so the discrepancy is really large for the first pid. For the rest the difference is smaller, but there is still a difference of about 30% between the time recorded by torch and the Horovod timeline.


maxhgerlach · 2021-07-06T13:50:41Z

maxhgerlach
Jul 6, 2021
Collaborator

@dexhunter, on your plot the extra latency seems to grow with the message size. Might this just be latency incurred by copying around the data?
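One hedged way to probe that hypothesis: if the gap between the two timers is mostly copy cost, it should grow roughly linearly with message size, so a straight-line fit of gap against size would give a near-constant per-byte slope plus a fixed offset. The sketch below is a plain least-squares fit. `fit_line` is a hypothetical helper, and the numbers in the example are synthetic placeholders, not measurements from this thread.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Synthetic placeholders: message sizes in MB, timer gap in ms.
    sizes_mb = [1, 2, 4, 8]
    gap_ms = [0.6, 1.1, 2.1, 4.1]
    intercept, slope = fit_line(sizes_mb, gap_ms)
    print(f"fixed offset {intercept:.3f} ms, {slope:.3f} ms per MB")
```

A slope that stays stable across runs would be consistent with a size-proportional cost such as copying. A large intercept would point to fixed overhead instead.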


dexhunter · 2021-07-06T14:26:09Z

dexhunter
Jul 6, 2021
Author

@maxhgerlach Hi, thanks for the reply. From my understanding, the time to execute the operation also grows with larger message sizes. The plot is based on the timeline JSON file, so the time spent moving the data around should also be included, as seen in WAIT_FOR_DATA. Or maybe I should change the ylabel to execution time?
