horovod allreduce hangs when using multi-nodes #3600

EtoDemerzel0427 · 2022-07-11T22:59:44Z


I am running deep-gradient-compression with Horovod. It works fine on a single machine, but when I scale out to multiple nodes, it hangs during the very first allreduce op:

$ horovodrun -np 8 -H c196-071:4,c196-072:4 --verbose  python -u train.py --configs configs/cifar/resnet20.py
Filtering local host names.
Remote host found: c196-071
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Interfaces on all the hosts were successfully checked.
Common interface found: eno1 ib0
mpirun -l -np 8 -ppn 4 -hosts c196-071,c196-072     -genv NCCL_SOCKET_IFNAME=eno1,ib0    python -u train.py --configs configs/cifar/resnet20.py

The connection tests between the nodes appear to pass. Before this allreduce there is a broadcast_parameters call, and that completes successfully, so I don't think this is a connection problem. The code also runs without any issue on a single node with 4 GPUs. Do you have any ideas about what's wrong here?
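To separate the training code from the Horovod setup, a standalone allreduce check might help. The following is only a minimal sketch, assuming the PyTorch frontend (`horovod.torch`); the file name `allreduce_smoke_test.py`, the tensor name `smoke_test`, and the helper `expected_sum` are hypothetical and not part of deep-gradient-compression. Each rank contributes its own rank number, so a sum allreduce should return the same known total on every rank.

```python
# allreduce_smoke_test.py (hypothetical file name, not part of the original project)

def expected_sum(world_size):
    """Sum of ranks 0..world_size-1: what every rank should see after a Sum allreduce."""
    return world_size * (world_size - 1) // 2


def main():
    # Imported lazily so the helper above stays usable without Horovod installed.
    try:
        import torch
        import horovod.torch as hvd
    except ImportError:
        print("torch or horovod is not installed; nothing to run")
        return

    hvd.init()

    # Pin each process to one GPU when GPUs are present (assumption: one GPU per rank).
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    # Every rank submits the same named tensor, holding its own rank number.
    x = torch.full((1,), float(hvd.rank()), device=device)
    print(f"rank {hvd.rank()}/{hvd.size()}: entering allreduce", flush=True)
    total = hvd.allreduce(x, op=hvd.Sum, name="smoke_test")
    print(
        f"rank {hvd.rank()}: got {total.item()}, expected {expected_sum(hvd.size())}",
        flush=True,
    )


if __name__ == "__main__":
    main()
```

Launching this with the same horovodrun arguments in place of train.py would show whether a plain allreduce completes across both nodes. If it does, the hang is more likely tied to how the training script issues its allreduce calls than to the cluster setup. If it also hangs, setting NCCL_DEBUG=INFO in the environment may give more detail, assuming the Horovod build uses NCCL for GPU allreduce.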

Replies: 0 comments
