I am trying to use the available documentation to learn about ray.tune.integration.horovod.DistributedTrainableCreator, but I am unable to run the example code ray/tune/examples/horovod_simple on a multi-GPU cluster. The code works well in single-GPU mode when I run

python3 horovod_simple.py --mode square --slots-per-host 1 --gpu 1 --hosts-per-trial 1

However, changing --slots-per-host to 2 to use two GPUs per host gives me the following traceback and error:
Failure # 1 (occurred at 2021-10-20_14-55-03)
Traceback (most recent call last):
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
return func(*args, **kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/worker.py", line 1621, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::WrappedHorovodTrainable.train_buffered() (pid=2054270, ip=10.12.18.52, repr=<ray.tune.integration.horovod.WrappedHorovodTrainable object at 0x15331ec6b438>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trainable.py", line 189, in train_buffered
result = self.train()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/trainable.py", line 248, in train
result = self.step()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/integration/horovod.py", line 137, in step
result = self.executor.execute(lambda w: w.step())[0]
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 296, in execute
return self._maybe_call_ray(self.driver.execute, **kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 360, in _maybe_call_ray
return driver_func(**kwargs)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/ray/runner.py", line 489, in execute
return ray.get([worker.execute.remote(fn) for worker in self.workers])
ray.exceptions.RayTaskError(TuneError): ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/integration/horovod.py", line 137, in
result = self.executor.execute(lambda w: w.step())[0]
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 379, in step
self._report_thread_runner_error(block=True)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 527, in _report_thread_runner_error
("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/mpi_ops.py", line 944, in synchronize
mpi_lib.horovod_torch_wait_and_clear(handle)
RuntimeError: ncclCommInitRank failed: invalid usage
During handling of the above exception, another exception occurred:
ray::BaseHorovodWorker.execute() (pid=2054343, ip=10.12.18.52, repr=<horovod.ray.worker.BaseHorovodWorker object at 0x15554419b6a0>)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 260, in run
self._entrypoint()
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 329, in entrypoint
self._status_reporter.get_checkpoint())
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/ray/tune/function_runner.py", line 594, in _trainable_func
output = fn()
File "rayHVDBasics.py", line 64, in train
hvd.broadcast_parameters(net.state_dict(), root_rank=0)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/functions.py", line 59, in broadcast_parameters
synchronize(handle)
File "/mmfs1/home/EClemMarq/Pyth_Envs/env/lib64/python3.6/site-packages/horovod/torch/mpi_ops.py", line 949, in synchronize
raise HorovodInternalError(e)
horovod.common.exceptions.HorovodInternalError: ncclCommInitRank failed: invalid usage
Is there something I'm setting up incorrectly? My environment is:
horovod = 0.23.0 (Framework: PyTorch, Controllers: Gloo, Tensor Ops: NCCL & Gloo)
ray = 1.7.0
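For reference, here is roughly what my script (rayHVDBasics.py, adapted from horovod_simple) does. This is a minimal sketch, not the exact file: the model and training loop are simplified placeholders, and I am assuming the ray 1.7 DistributedTrainableCreator keyword arguments are num_hosts, num_slots, and use_gpu.

import torch
import horovod.torch as hvd
from ray import tune
from ray.tune.integration.horovod import DistributedTrainableCreator

def train(config):
    hvd.init()
    # Pin each worker to its own GPU before any collective op.
    torch.cuda.set_device(hvd.local_rank())
    net = torch.nn.Linear(1, 1).cuda()  # placeholder model
    # This is the call that raises ncclCommInitRank failed when num_slots > 1:
    hvd.broadcast_parameters(net.state_dict(), root_rank=0)
    for _ in range(config.get("steps", 10)):
        tune.report(loss=0.0)  # placeholder metric

trainable = DistributedTrainableCreator(
    train,
    num_hosts=1,   # --hosts-per-trial
    num_slots=2,   # --slots-per-host; works with 1, fails with 2
    use_gpu=True,  # --gpu
)
tune.run(trainable, num_samples=1)

Since the same script completes in single-GPU mode, I suspect the multi-slot NCCL initialization rather than the training code itself, but I may be missing a required setting.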