Replies: 1 comment
You can control which backend is used for CPU operations via an environment variable, but it is only parsed once at initialization. For GPU operations you pretty much need NCCL. There is an option to use GPU-aware MPI instead, but that has to be configured at compile time and seems to be fairly site-specific: https://horovod.readthedocs.io/en/stable/gpus.html#advanced-have-a-proprietary-mpi-implementation-with-gpu-support-optimized-for-your-network

If you want to avoid NCCL, you can place your allreduce etc. operations on CPU in your training code (see the sketch below), but this will lead to slowdowns. If both MPI and Gloo are available for the controller layer, MPI will be preferred IIRC.
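To make the CPU-placement suggestion concrete, here is a minimal sketch using `horovod.tensorflow` with a TF2-style training step. The model, optimizer, and data names are placeholders; the relevant part is the `device_dense`/`device_sparse` placement arguments, which pin the collectives to `/cpu:0` so they run on the CPU backend instead of NCCL.

```python
# Hedged sketch: forcing Horovod collectives onto the CPU so that the CPU
# backend (MPI/Gloo) handles them instead of NCCL. Model and optimizer are
# placeholders; device_dense/device_sparse are the relevant arguments.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Explicit allreduce pinned to CPU: Horovod runs the collective with its
# CPU tensor-operation backend rather than NCCL.
x = tf.constant([1.0, 2.0, 3.0])
avg = hvd.allreduce(x, device_dense='/cpu:0')

# The same placement option exists when wrapping a gradient tape, so the
# gradient averaging also stays off NCCL.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
opt = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(features, labels):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(features) - labels) ** 2)
    tape = hvd.DistributedGradientTape(tape,
                                       device_dense='/cpu:0',
                                       device_sparse='/cpu:0')
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

The trade-off is exactly the slowdown mentioned above: every collective now involves extra copies between GPU and host memory.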
I have built Horovod from source on a DGX A100 machine with the MPI, Gloo, and NCCL backends enabled.
Users of this installation use the MPI controller with NCCL for the majority of their work. However, one user has had deadlock issues and suspects it is due to NCCL.
While it is straightforward to switch to the Gloo controller at runtime via `horovodrun --gloo ...`, it is not clear from the documentation whether it is possible to selectively disable the NCCL or MPI Tensor Operation backends at runtime. I suspect not: a Horovod build like this will always use a blend of Tensor Operations on a single CPU and GPU, indicated by:

Are there any relevant environment variables that control the controller and/or Tensor Operation backends at runtime, similar to the compile-time flags `HOROVOD_GPU_OPERATIONS` and `HOROVOD_CPU_OPERATIONS`? I have seen some references to `HOROVOD_CONTROLLER`, but not in official documentation.

Is there a way to see the equivalent/full command for `horovodrun --gloo`, like the `mpirun` command shown in https://horovod.readthedocs.io/en/stable/mpi_include.html ?
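For reference, a hedged sketch of the environment-variable route hinted at in the reply above. Treating `HOROVOD_CONTROLLER` and a runtime-honored `HOROVOD_CPU_OPERATIONS` as assumptions (they appear in community references rather than the official documentation), any such variable would have to be set before `hvd.init()`, since the value is parsed only once at initialization.

```python
# Hedged sketch, not a confirmed API: the variable names below are the ones
# seen in community references (HOROVOD_CONTROLLER) or mirror the compile-time
# flag (HOROVOD_CPU_OPERATIONS); whether Horovod honors them at runtime is an
# assumption. They must be in the environment before hvd.init(), because the
# backend selection is only parsed once at initialization.
import os

os.environ["HOROVOD_CONTROLLER"] = "gloo"      # assumed: choose Gloo for the controller layer
os.environ["HOROVOD_CPU_OPERATIONS"] = "gloo"  # assumed: choose Gloo for CPU Tensor Operations

import horovod.tensorflow as hvd

hvd.init()
```

Independently of the above, `horovodrun --check-build` prints which frameworks, controllers, and Tensor Operation backends a given installation was compiled with, which at least confirms what the blend described above can draw from.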