Why Horovod takes much more memory than DataParallel in PyTorch #3001
Unanswered
Christian-lyc asked this question in Q&A
Replies: 0 comments
I just tried to train the network on the ImageNet dataset with PyTorch.
It was too slow with DataParallel, so I changed to Horovod.
The script is below:
#SBATCH -p hlab
#SBATCH -A hlab
#SBATCH -t 48:00:00
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
#SBATCH -o train_imagenet50.out
#SBATCH --mem-per-cpu=10240
module load cuda/11.1
module load python
module load gcc/7.3.0
module load openmpi/2.1.2
mpirun -np 12 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train_imagenet.py
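For context, a minimal sketch of the standard Horovod PyTorch setup (hvd.init, per-rank GPU pinning, DistributedOptimizer) is below; the model and optimizer here are placeholders and may differ from the actual train_imagenet.py:

import torch
import torchvision.models as models
import horovod.torch as hvd

# Initialize Horovod and pin each process to one GPU by local rank,
# so the ranks on a node do not all allocate memory on GPU 0.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Placeholder model; the real script builds its own network.
model = models.resnet50().cuda()

# Placeholder optimizer; the usual Horovod convention scales the LR by world size.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size(), momentum=0.9)

# Average gradients across all ranks with allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every rank from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)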
But the problem is that when I use Horovod, I need to halve the batch size. I checked the GPU usage: even with half the batch size, memory usage is much higher than before. It uses nearly 32 GB, while with the original batch size it was around 20 GB. Also, training is slower than before, and GPU utilization seems lower as well.
I don't know where the problem is. Could anyone help me? Thank you.
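If it helps, this is roughly how per-rank memory could be measured from inside the script to compare the two runs (a sketch; the helper name report_gpu_memory is just for illustration, and the numbers above come from watching overall GPU usage):

import torch
import horovod.torch as hvd

def report_gpu_memory(tag):
    # Peak memory allocated by PyTorch tensors on this rank's GPU, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    # Print only from rank 0 to keep the output readable.
    if hvd.rank() == 0:
        print(f"[{tag}] rank 0 peak allocated: {peak_gb:.2f} GB")

# Example: call after a few training iterations, e.g.
# report_gpu_memory("after warmup")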