How to conduct validation test during training with multi GPU? #3095
Unanswered
yjiangling asked this question in Q&A

Hi all,

I'm training a model with Horovod on multiple GPUs, and during training I try to run a validation test (on rank 0 only) that restores the latest checkpoint:
```python
import tensorflow as tf  # TF1-style graph API

with tf.Session(config=config) as sess:
    # Restore the latest checkpoint for the validation run
    ckpt = tf.train.latest_checkpoint(hp.checkpoint)
```
It gives an error like this:
INFO:root:Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[Node: DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_encoder_dense_Tensordot_transpose_1_grad_transpose_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
Can anyone give some suggestions? Thanks a lot!

Replies: 1 comment 1 reply

- All your training processes will need to interrupt training, not just rank 0. Otherwise the other processes will crash, as in your example. Personally, I run validation in a totally separate process from the training job.
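
  For the first suggestion, here is a minimal sketch of what "all ranks interrupt training together" can look like with TF1-style Horovod. The toy model, the validation interval, and the variable names are hypothetical stand-ins, not from this thread; the point is only that every rank executes the validation collective at the same step:

  ```python
  import numpy as np
  import tensorflow as tf
  import horovod.tensorflow as hvd

  hvd.init()

  # Pin each process to a single GPU, as in a standard Horovod setup.
  config = tf.ConfigProto()
  config.gpu_options.visible_device_list = str(hvd.local_rank())

  # Toy stand-in model: fit y = 2x with a single weight.
  x = tf.placeholder(tf.float32, shape=[None])
  y = tf.placeholder(tf.float32, shape=[None])
  w = tf.Variable(0.0)
  loss = tf.reduce_mean(tf.square(w * x - y))

  opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.01))
  train_op = opt.minimize(loss)

  # hvd.allreduce is a collective op: every rank must run it, or Horovod
  # shuts down exactly as in the error above. It averages the per-rank
  # validation loss across workers.
  avg_val_loss = hvd.allreduce(loss)

  with tf.Session(config=config) as sess:
      sess.run(tf.global_variables_initializer())
      sess.run(hvd.broadcast_global_variables(0))
      for step in range(1000):
          xs = np.random.rand(32).astype(np.float32)
          sess.run(train_op, feed_dict={x: xs, y: 2 * xs})
          if step % 100 == 0:
              # Every rank pauses training and enters validation together.
              xv = np.linspace(0.0, 1.0, 32).astype(np.float32)
              val = sess.run(avg_val_loss, feed_dict={x: xv, y: 2 * xv})
              if hvd.rank() == 0:
                  print("step %d: mean validation loss %.6f" % (step, val))
  ```

  Launched with e.g. `horovodrun -np 4 python train.py`, no rank ever waits on a collective op that another rank skipped. The reply's alternative sidesteps the issue entirely: let the training job only write checkpoints, and have an independent single-GPU process poll `tf.train.latest_checkpoint` (as in the question) to evaluate them, so validation never touches a Horovod collective.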