I am trying to use the available documentation to learn about ray.tune.integration.horovod.DistributedTrainableCreator, but I am unable to run the example code ray/tune/examples/horovod_simple on a multi-GPU cluster. The code works well in single-GPU mode when I run

python3 horovod_simple.py --mode square --slots-per-host 1 --gpu 1 --hosts-per-trial 1

However, changing --slots-per-host to 2 to use two GPUs per host gives me the following traceback and error:
Failure # 1 (occurred at 2021-10-20_14-55-03)
Traceback (most recent call last):
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
return func(*args, **kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/worker.py", line 1621, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::WrappedHorovodTrainable.train_buffered() (pid=2054270, ip=10.12.18.52, repr=<ray.tune.integration.horovod.WrappedHorovodTrainable object at 0x15331ec6b438>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trainable.py", line 189, in train_buffered
result = self.train()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trainable.py", line 248, in train
result = self.step()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/integration/horovod.py", line 137, in step
result = self.executor.execute(lambda w: w.step())[0]
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 296, in execute
return self._maybe_call_ray(self.driver.execute, **kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 360, in _maybe_call_ray
return driver_func(**kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 489, in execute
return ray.get([worker.execute.remote(fn) for worker in self.workers])
ray.exceptions.RayTaskError(TuneError): ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/integration/horovod.py", line 137, in
result = self.executor.execute(lambda w: w.step())[0]
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 379, in step
self._report_thread_runner_error(block=True)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/mpi_ops.py", line 944, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
RuntimeError: ncclCommInitRank failed: invalid usage
During handling of the above exception, another exception occurred:
ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 260, in run
self._entrypoint()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 329, in entrypoint
self._status_reporter.get_checkpoint())
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func
output = fn()
File "rayHVDBasics.py", line 64, in train
hvd.broadcast_parameters(net.state_dict(), root_rank=0)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/functions.py", line 59, in broadcast_parameters
synchronize(handle)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/mpi_ops.py", line 949, in synchronize
raise HorovodInternalError(e)
horovod.common.exceptions.HorovodInternalError: ncclCommInitRank failed: invalid usage
Is there something I'm setting up incorrectly? My environment is:
horovod = 0.23.0 (Framework: PyTorch, Controllers: Gloo, Tensor Ops: NCCL & Gloo)
ray = 1.7.0
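For reference, here is roughly what my script (rayHVDBasics.py, adapted from horovod_simple) does. This is a minimal sketch, not the exact file: the model and training loop are simplified placeholders, and I am assuming the ray 1.7 DistributedTrainableCreator keyword arguments are num_hosts, num_slots, and use_gpu.

import torch
import horovod.torch as hvd
from ray import tune
from ray.tune.integration.horovod import DistributedTrainableCreator

def train(config):
    hvd.init()
    # Pin each worker to its own GPU before any collective op.
    torch.cuda.set_device(hvd.local_rank())
    net = torch.nn.Linear(1, 1).cuda()  # placeholder model
    # This is the call that raises ncclCommInitRank failed when num_slots > 1:
    hvd.broadcast_parameters(net.state_dict(), root_rank=0)
    for _ in range(config.get("steps", 10)):
        tune.report(loss=0.0)  # placeholder metric

trainable = DistributedTrainableCreator(
    train,
    num_hosts=1,   # --hosts-per-trial
    num_slots=2,   # --slots-per-host; works with 1, fails with 2
    use_gpu=True,  # --gpu
)
tune.run(trainable, num_samples=1)

Since the same script completes in single-GPU mode, I suspect the multi-slot NCCL initialization rather than the training code itself, but I may be missing a required setting.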