horovod allreduce hangs when using multi-nodes #3600
Unanswered
EtoDemerzel0427 asked this question in Q&A
I was running deep-gradient-compression with Horovod. It works fine on a single machine, but when I scaled out to multiple nodes, it hangs during the very first allreduce op:
$ horovodrun -np 8 -H c196-071:4,c196-072:4 --verbose python -u train.py --configs configs/cifar/resnet20.py
Filtering local host names.
Remote host found: c196-071
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
Testing interfaces on all the hosts.
Interfaces on all the hosts were successfully checked.
Common interface found: eno1 ib0
mpirun -l -np 8 -ppn 4 -hosts c196-071,c196-072 -genv NCCL_SOCKET_IFNAME=eno1,ib0 python -u train.py --configs configs/cifar/resnet20.py
The connection tests between the nodes appear to pass. Before this allreduce there is a `broadcast_parameters` call, and that works fine, so I don't think this is a connection problem. The code also runs without any issue on a single node with 4 GPUs. Do you have any ideas about what's wrong here?
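To rule out a mismatch in the launch arguments themselves, here is a minimal sketch that checks the `-np` and `-H` values from the command above against the `-np`, `-ppn`, and `-hosts` values in the generated `mpirun` line. This is not Horovod's own parsing code; `parse_hosts` and `launch_summary` are hypothetical helper names, and the sketch assumes every entry has the `host:slots` form used in the command.

```python
def parse_hosts(spec):
    """Parse a 'host1:4,host2:4' string into [('host1', 4), ('host2', 4)].

    Assumes every entry is written as host:slots (hypothetical helper,
    not part of Horovod).
    """
    hosts = []
    for entry in spec.split(","):
        name, slots = entry.split(":")
        hosts.append((name, int(slots)))
    return hosts


def launch_summary(np_total, spec):
    """Summarise a launch request and check it fits the available slots."""
    hosts = parse_hosts(spec)
    slot_counts = [slots for _, slots in hosts]
    total_slots = sum(slot_counts)
    if np_total > total_slots:
        raise ValueError(
            f"requested {np_total} processes but only {total_slots} slots"
        )
    # A single processes-per-node value only makes sense if all hosts match.
    ppn = slot_counts[0] if len(set(slot_counts)) == 1 else None
    return {
        "np": np_total,
        "ppn": ppn,
        "hosts": ",".join(name for name, _ in hosts),
        "total_slots": total_slots,
    }


# Values taken from the horovodrun command in the post.
summary = launch_summary(8, "c196-071:4,c196-072:4")
print(summary)
```

For the command in the post this gives 8 processes, 4 per node, on hosts `c196-071,c196-072`, which matches the `mpirun` line, so the process layout looks consistent and the hang is unlikely to come from the host specification.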