Replies: 1 comment
You can control which backend is used for CPU operations via an environment variable, but it is only parsed once at initialization. For GPU operations you pretty much need NCCL. There is an option to use GPU-aware MPI instead, but that has to be configured at compile time and seems to be fairly site-specific: https://horovod.readthedocs.io/en/stable/gpus.html#advanced-have-a-proprietary-mpi-implementation-with-gpu-support-optimized-for-your-network

If you want to avoid NCCL, you can place your allreduce etc. operations on CPU in your training code (see the sketch below), but this will lead to slowdowns. If both MPI and Gloo are available for the controller layer, MPI will be preferred IIRC.
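To make the CPU-placement suggestion concrete, here is a minimal sketch using `horovod.tensorflow` with a TF2-style training step. The model, optimizer, and data names are placeholders; the relevant part is the `device_dense`/`device_sparse` placement arguments, which pin the collectives to `/cpu:0` so they run on the CPU backend instead of NCCL.

```python
# Hedged sketch: forcing Horovod collectives onto the CPU so that the CPU
# backend (MPI/Gloo) handles them instead of NCCL. Model and optimizer are
# placeholders; device_dense/device_sparse are the relevant arguments.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Explicit allreduce pinned to CPU: Horovod runs the collective with its
# CPU tensor-operation backend rather than NCCL.
x = tf.constant([1.0, 2.0, 3.0])
avg = hvd.allreduce(x, device_dense='/cpu:0')

# The same placement option exists when wrapping a gradient tape, so the
# gradient averaging also stays off NCCL.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
opt = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(features, labels):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(features) - labels) ** 2)
    tape = hvd.DistributedGradientTape(tape,
                                       device_dense='/cpu:0',
                                       device_sparse='/cpu:0')
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

The trade-off is exactly the slowdown mentioned above: every collective now involves extra copies between GPU and host memory.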
I have built Horovod from source on a DGX A100 machine with the MPI, Gloo, and NCCL backends enabled.
Users of this installation use the MPI controller with NCCL for the majority of their work. However, one user has had deadlock issues and suspects it is due to NCCL.
While it is straightforward to switch to the Gloo controller at runtime via `horovodrun --gloo ...`, it is not clear from the documentation whether it is possible to selectively disable the NCCL or MPI Tensor Operation backends at runtime. I suspect not: a Horovod build like this will always use a blend of Tensor Operations on a single CPU and GPU, indicated by:

Are there any relevant environment variables that control the controller and/or Tensor Operation backends at runtime, similar to the compile-time flags `HOROVOD_GPU_OPERATIONS` and `HOROVOD_CPU_OPERATIONS`? I have seen some references to `HOROVOD_CONTROLLER`, but not in official documentation.

Is there a way to see the equivalent/full command for `horovodrun --gloo`, like the `mpirun` command shown in https://horovod.readthedocs.io/en/stable/mpi_include.html ?
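For reference, a hedged sketch of the environment-variable route hinted at in the reply above. Treating `HOROVOD_CONTROLLER` and a runtime-honored `HOROVOD_CPU_OPERATIONS` as assumptions (they appear in community references rather than the official documentation), any such variable would have to be set before `hvd.init()`, since the value is parsed only once at initialization.

```python
# Hedged sketch, not a confirmed API: the variable names below are the ones
# seen in community references (HOROVOD_CONTROLLER) or mirror the compile-time
# flag (HOROVOD_CPU_OPERATIONS); whether Horovod honors them at runtime is an
# assumption. They must be in the environment before hvd.init(), because the
# backend selection is only parsed once at initialization.
import os

os.environ["HOROVOD_CONTROLLER"] = "gloo"      # assumed: choose Gloo for the controller layer
os.environ["HOROVOD_CPU_OPERATIONS"] = "gloo"  # assumed: choose Gloo for CPU Tensor Operations

import horovod.tensorflow as hvd

hvd.init()
```

Independently of the above, `horovodrun --check-build` prints which frameworks, controllers, and Tensor Operation backends a given installation was compiled with, which at least confirms what the blend described above can draw from.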