Why Horovod takes much more memory than DataParallel in PyTorch #3001
Unanswered
Christian-lyc asked this question in Q&A
Replies: 0 comments
I just tried to train the network on the ImageNet dataset with PyTorch.
It was too slow with DataParallel, so I changed to Horovod.
The script is below:
#SBATCH -p hlab
#SBATCH -A hlab
#SBATCH -t 48:00:00
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
#SBATCH -o train_imagenet50.out
#SBATCH --mem-per-cpu=10240
module load cuda/11.1
module load python
module load gcc/7.3.0
module load openmpi/2.1.2
mpirun -np 12 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train_imagenet.py
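For context, a minimal sketch of the standard Horovod PyTorch setup (hvd.init, per-rank GPU pinning, DistributedOptimizer) is below; the model and optimizer here are placeholders and may differ from the actual train_imagenet.py:

import torch
import torchvision.models as models
import horovod.torch as hvd

# Initialize Horovod and pin each process to one GPU by local rank,
# so the ranks on a node do not all allocate memory on GPU 0.
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Placeholder model; the real script builds its own network.
model = models.resnet50().cuda()

# Placeholder optimizer; the usual Horovod convention scales the LR by world size.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size(), momentum=0.9)

# Average gradients across all ranks with allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every rank from the same weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)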
But the problem is that when I use Horovod, I need to halve the batch size. I checked the GPU usage: even with half the batch size, memory usage is much higher than before. It uses nearly 32 GB, while with the original batch size it was around 20 GB. Also, training is slower than before, and GPU utilization seems lower as well.
I don't know where the problem is. Could anyone help me? Thank you.
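If it helps, this is roughly how per-rank memory could be measured from inside the script to compare the two runs (a sketch; the helper name report_gpu_memory is just for illustration, and the numbers above come from watching overall GPU usage):

import torch
import horovod.torch as hvd

def report_gpu_memory(tag):
    # Peak memory allocated by PyTorch tensors on this rank's GPU, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    # Print only from rank 0 to keep the output readable.
    if hvd.rank() == 0:
        print(f"[{tag}] rank 0 peak allocated: {peak_gb:.2f} GB")

# Example: call after a few training iterations, e.g.
# report_gpu_memory("after warmup")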