How to conduct validation test during training with multi GPU? #3095
Unanswered
yjiangling asked this question in Q&A

Hi all,

I'm training a model with Horovod on multiple GPUs, and during training I try to run a validation test (on rank 0 only) that restores the latest checkpoint:
```python
import tensorflow as tf  # TF1-style graph API

with tf.Session(config=config) as sess:
    # Restore the latest checkpoint for the validation run
    ckpt = tf.train.latest_checkpoint(hp.checkpoint)
```
It gives an error like this:
INFO:root:Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[Node: DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_encoder_dense_Tensordot_transpose_1_grad_transpose_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
Can anyone give some suggestions? Thanks a lot!

Replies: 1 comment 1 reply

- All your training processes will need to interrupt training, not just rank 0. Otherwise the other processes will crash, as in your example. Personally, I run validation in a totally separate process from the training job.
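
  For the first suggestion, here is a minimal sketch of what "all ranks interrupt training together" can look like with TF1-style Horovod. The toy model, the validation interval, and the variable names are hypothetical stand-ins, not from this thread; the point is only that every rank executes the validation collective at the same step:

  ```python
  import numpy as np
  import tensorflow as tf
  import horovod.tensorflow as hvd

  hvd.init()

  # Pin each process to a single GPU, as in a standard Horovod setup.
  config = tf.ConfigProto()
  config.gpu_options.visible_device_list = str(hvd.local_rank())

  # Toy stand-in model: fit y = 2x with a single weight.
  x = tf.placeholder(tf.float32, shape=[None])
  y = tf.placeholder(tf.float32, shape=[None])
  w = tf.Variable(0.0)
  loss = tf.reduce_mean(tf.square(w * x - y))

  opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.01))
  train_op = opt.minimize(loss)

  # hvd.allreduce is a collective op: every rank must run it, or Horovod
  # shuts down exactly as in the error above. It averages the per-rank
  # validation loss across workers.
  avg_val_loss = hvd.allreduce(loss)

  with tf.Session(config=config) as sess:
      sess.run(tf.global_variables_initializer())
      sess.run(hvd.broadcast_global_variables(0))
      for step in range(1000):
          xs = np.random.rand(32).astype(np.float32)
          sess.run(train_op, feed_dict={x: xs, y: 2 * xs})
          if step % 100 == 0:
              # Every rank pauses training and enters validation together.
              xv = np.linspace(0.0, 1.0, 32).astype(np.float32)
              val = sess.run(avg_val_loss, feed_dict={x: xv, y: 2 * xv})
              if hvd.rank() == 0:
                  print("step %d: mean validation loss %.6f" % (step, val))
  ```

  Launched with e.g. `horovodrun -np 4 python train.py`, no rank ever waits on a collective op that another rank skipped. The reply's alternative sidesteps the issue entirely: let the training job only write checkpoints, and have an independent single-GPU process poll `tf.train.latest_checkpoint` (as in the question) to evaluate them, so validation never touches a Horovod collective.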