Deadlock error message when resuming keras_imagenet_resnet50.py from a checkpoint #3498
Replies: 2 comments
-
It seems that every worker picks up from the checkpoint (Epoch 5) and gets stuck there. The rank 0 worker keeps trying to broadcast to the other workers. [1,2]:Epoch 5/90
-
It seems that when resuming from the checkpoint, the run hangs in the call to hvd.callbacks.BroadcastGlobalVariablesCallback(0) that restores state from the checkpoint. Specifically, it appears to get stuck in hvd.broadcast_variables(self.model.optimizer.variables(), root_rank=self.root_rank) inside the BroadcastGlobalVariablesCallbackImpl class.
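To narrow this down, here is a debugging sketch one could try (hypothetical; it assumes model is the compiled Keras model and imports horovod.tensorflow separately, since broadcast_variables lives there). If the ranks print different counts, they are submitting different tensor sets to the broadcast, which would match the stall-inspector warning shown below.
# Hypothetical debugging sketch: check whether every rank submits the same
# number of optimizer variables to the broadcast. A rank-0 optimizer restored
# from the checkpoint may already own slot variables, while freshly compiled
# optimizers on the other ranks have none until their first apply_gradients;
# in that case broadcast_variables registers different tensors on different
# ranks, which is the condition the stall inspector warns about.
import horovod.tensorflow as hvd_tf  # aliased to avoid shadowing the Keras hvd

opt_vars = model.optimizer.variables()
print(f'rank {hvd_tf.rank()}: {len(opt_vars)} optimizer variables before broadcast',
      flush=True)
hvd_tf.broadcast_variables(opt_vars, root_rank=0)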
-
The keras_imagenet_resnet50.py script seems to work fine when it starts from scratch, but it gets stuck when it resumes from the checkpoint. Since I am running on TensorFlow 2, I commented out the TensorFlow 1 part and replaced it with the TensorFlow 2 equivalent as follows:
# Horovod: pin GPU to be used to process local rank (one GPU per process)
#config = tf.ConfigProto()
#config.gpu_options.allow_growth = True
#config.gpu_options.visible_device_list = str(hvd.local_rank())
#K.set_session(tf.Session(config=config))
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
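As a sanity check (my own sketch, not part of the example script), each process should see exactly one logical GPU after the pinning above:
# Sanity-check sketch: after set_visible_devices, each process should report
# exactly one visible logical GPU, matching its local rank.
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(f'rank {hvd.rank()} (local rank {hvd.local_rank()}): '
      f'{len(logical_gpus)} visible GPU(s)', flush=True)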
When resuming from the checkpoint, the error messages that keep repeating are:
[1,0]:Epoch 2/90
[1,0]:2022-03-29 09:46:17.007028: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,0]:2022-03-29 09:46:22.927790: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,4]:2022-03-29 09:46:24.526593: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,6]:2022-03-29 09:46:24.907267: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,2]:2022-03-29 09:46:26.295599: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,4]:2022-03-29 09:46:26.394497: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,6]:2022-03-29 09:46:26.721821: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,3]:2022-03-29 09:46:26.760648: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,1]:2022-03-29 09:46:26.905272: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,5]:2022-03-29 09:46:26.948601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,7]:2022-03-29 09:46:27.587535: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
[1,2]:2022-03-29 09:46:28.147603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,3]:2022-03-29 09:46:28.676358: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,1]:2022-03-29 09:46:28.721767: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,5]:2022-03-29 09:46:29.042281: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,7]:2022-03-29 09:46:29.972991: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
[1,0]: 1/5004 [..............................] - ETA: 33:51:50 - loss: 6.0242 - accuracy: 0.0625 - top_k_categorical_accuracy: 0.2188
[1,0]:[2022-03-29 09:48:32.960240: W /tmp/pip-install-z0jdoboy/horovod_1a83ee8a48a847e5aaae7aa3d6dda863/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Missing ranks:
[1,0]:0: [PartitionedCall/DistributedSGD_Allreduce/cond/then/_643/DistributedSGD_Allreduce/cond/HorovodAllreduce_grads_0, PartitionedCall/DistributedSGD_Allreduce/cond_1/then/_651/DistributedSGD_Allreduce/cond_1/HorovodAllreduce_grads_1_0, PartitionedCall/DistributedSGD_Allreduce/cond_10/then/_723/DistributedSGD_Allreduce/cond_10/HorovodAllreduce_grads_10_0, PartitionedCall/DistributedSGD_Allreduce/cond_100/then/_1443/DistributedSGD_Allreduce/cond_100/HorovodAllreduce_grads_100_0, PartitionedCall/DistributedSGD_Allreduce/cond_101/then/_1451/DistributedSGD_Allreduce/cond_101/HorovodAllreduce_grads_101_0, PartitionedCall/DistributedSGD_Allreduce/cond_102/then/_1459/DistributedSGD_Allreduce/cond_102/HorovodAllreduce_grads_102_0 ...]
[1,0]:1: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:2: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:3: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:4: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:5: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:6: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:7: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:[2022-03-29 09:49:32.962926: W /tmp/pip-install-z0jdoboy/horovod_1a83ee8a48a847e5aaae7aa3d6dda863/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Missing ranks:
[1,0]:0: [PartitionedCall/DistributedSGD_Allreduce/cond/then/_643/DistributedSGD_Allreduce/cond/HorovodAllreduce_grads_0, PartitionedCall/DistributedSGD_Allreduce/cond_1/then/_651/DistributedSGD_Allreduce/cond_1/HorovodAllreduce_grads_1_0, PartitionedCall/DistributedSGD_Allreduce/cond_10/then/_723/DistributedSGD_Allreduce/cond_10/HorovodAllreduce_grads_10_0, PartitionedCall/DistributedSGD_Allreduce/cond_100/then/_1443/DistributedSGD_Allreduce/cond_100/HorovodAllreduce_grads_100_0, PartitionedCall/DistributedSGD_Allreduce/cond_101/then/_1451/DistributedSGD_Allreduce/cond_101/HorovodAllreduce_grads_101_0, PartitionedCall/DistributedSGD_Allreduce/cond_102/then/_1459/DistributedSGD_Allreduce/cond_102/HorovodAllreduce_grads_102_0 ...]
[1,0]:1: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:2: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:3: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:4: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:5: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:6: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:7: [HorovodBroadcast_conv1_bn_beta_0, HorovodBroadcast_conv1_bn_gamma_0, HorovodBroadcast_conv1_bn_moving_mean_0, HorovodBroadcast_conv1_bn_moving_variance_0, HorovodBroadcast_conv1_conv_bias_0, HorovodBroadcast_conv1_conv_kernel_0 ...]
[1,0]:[2022-03-29 09:50:32.965553: W /tmp/pip-install-z0jdoboy/horovod_1a83ee8a48a847e5aaae7aa3d6dda863/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
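Reading the stall report above: rank 0 is waiting on HorovodAllreduce ops (so it has already entered the first training step), while ranks 1-7 are still waiting on HorovodBroadcast ops, i.e. the ranks disagree about which collective comes first. One workaround I am considering (a sketch; it assumes every rank can read the checkpoint file, and checkpoint_path, resume_from_epoch, and build_model stand in for the corresponding pieces of the script) is to load the checkpoint on every rank rather than only on rank 0, so all ranks enter the broadcast with the same optimizer state:
# Hypothetical workaround sketch: load the checkpoint on every rank so that
# all ranks reach BroadcastGlobalVariablesCallback with identical variable
# sets. checkpoint_path and resume_from_epoch come from the surrounding
# script; build_model() stands in for the fresh-model construction path.
if resume_from_epoch > 0:
    model = hvd.load_model(checkpoint_path)  # every rank, not just rank 0
else:
    model = build_model()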
Any comment/suggestion would be appreciated.