Non-deterministic behaviour: tf.math.unsorted_segment_sum uses CUDA Atomic Operations #39751

@gavins13

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de and v2.2.0-rc4-8-g2b96f3662b
  • Python version: 3.7.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA 10.1.105 and cuDNN 7.6.5.32
  • GPU model and memory: RTX6000 24GB

Describe the current behavior
Currently, tf.math.unsorted_segment_sum uses non-deterministic GPU kernels, which is a significant gap in the TensorFlow determinism effort. Other TensorFlow functions, such as tf.gather (on backprop), make use of tf.math.unsorted_segment_sum and are therefore affected as well.

Affected functions that I've found so far:

  • tfa.image.dense_image_warp (on backprop)
  • tf.gather (on backprop)
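
To illustrate why a gather's backward pass ends up as a segment sum, here is a minimal NumPy sketch. It is a conceptual CPU reference only, not TensorFlow's actual gradient code; the helper name `gather_backward` is hypothetical. Rows that were gathered more than once must have their upstream gradients added together, which is exactly an unsorted segment sum keyed by the gather indices.

```python
import numpy as np

def gather_backward(upstream_grad, indices, num_rows):
    """Gradient of `params[indices]` w.r.t. `params` (hypothetical reference helper)."""
    grad_params = np.zeros((num_rows,) + upstream_grad.shape[1:],
                           dtype=upstream_grad.dtype)
    # np.add.at accumulates correctly even when an index appears more than once.
    np.add.at(grad_params, indices, upstream_grad)
    return grad_params

indices = np.array([0, 2, 2])                            # row 2 is gathered twice
upstream = np.array([1.0, 2.0, 3.0], dtype=np.float32)   # one gradient per gathered row
grad = gather_backward(upstream, indices, num_rows=4)
print(grad)  # [1. 0. 5. 0.]
```

On a GPU, the accumulation into row 2 is where the order of additions can vary between runs if atomics are used.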

Describe the expected behavior

When TF_DETERMINISTIC_OPS=1, tf.math.unsorted_segment_sum should use deterministic GPU kernels leading to reproducibility.
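
For reference, a minimal sketch of how the flag and seeds might be set from Python. The helper name `configure_for_determinism` and the seed value are assumptions made for illustration; setting the variable before TensorFlow is imported is a precaution. According to this report, doing so does not yet make tf.math.unsorted_segment_sum deterministic on GPU.

```python
import os
import random

SEED = 0  # arbitrary value chosen for this sketch

def configure_for_determinism(seed=SEED):
    """Set TF_DETERMINISTIC_OPS and seed the RNGs; returns False if TF/NumPy are missing."""
    # Set the flag before TensorFlow is imported, as a precaution.
    os.environ["TF_DETERMINISTIC_OPS"] = "1"
    random.seed(seed)
    try:
        import numpy as np
        import tensorflow as tf
    except ImportError:
        return False
    np.random.seed(seed)
    tf.random.set_seed(seed)
    return True

tf_available = configure_for_determinism()
```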

Who will benefit from this bug fix?
Determinism is an extremely important part of our venture into deep learning as a community. Without determinism, it is hard to reliably tune hyperparameters and conduct other types of investigation, such as ablation studies. While many TensorFlow operations have a deterministic alternative that is enabled by setting the environment variable TF_DETERMINISTIC_OPS=1, tf.math.unsorted_segment_sum seems to have slipped under the radar, perhaps because other operations took priority (such as tf.reduce_sum).

Introducing this level of determinism would make TensorFlow a better candidate for deep learning deployments in more sensitive settings, such as medicine. For example, it makes no sense for a radiologist to look at the result of one scan, then run the same scan again and get a different result. It also affects the public's trust in AI as a whole. As far as I'm aware, PyTorch offers full deterministic capabilities (perhaps with the benefit of hindsight from TensorFlow not having them).

Standalone code to reproduce the issue
Code to reproduce the issue:
(Edit: please see the code here instead: #39751 (comment). I've added seed settings, TF_DETERMINISTIC_OPS, etc., and the issue still reproduces.)

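import numpy as np
import tensorflow as tf

num_segments = 4
# Random float32 data; tf.random.normal already returns a tensor.
data = tf.random.normal([30, 256, 256])
# One segment id per element, so the result has shape (num_segments,).
segments = np.random.randint(low=0, high=num_segments, size=data.shape)

# The inputs are identical on every iteration, so the sums should be too.
for i in range(5):
    reduced_summed = tf.math.unsorted_segment_sum(data, segments, num_segments)
    print(reduced_summed)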

Output:

tf.Tensor([-273.92117 380.23163 1279.9718 -839.6437 ], shape=(4,), dtype=float32)
tf.Tensor([-273.92395 380.22168 1279.9834 -839.62573], shape=(4,), dtype=float32)
tf.Tensor([-273.91425 380.22177 1279.9773 -839.62976], shape=(4,), dtype=float32)
tf.Tensor([-273.9177 380.2243 1279.9733 -839.6427], shape=(4,), dtype=float32)
tf.Tensor([-273.91568 380.2217 1279.9747 -839.64166], shape=(4,), dtype=float32)

Note: all five printed results differ, whereas they should be identical because the inputs are the same on every iteration.
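
Small run-to-run differences of this kind are consistent with float32 addition not being associative: if atomic adds land in a different order on each run, the rounding differs. The sketch below demonstrates the effect on the CPU with NumPy only; the helper name `sequential_sum_f32` and the three values are chosen purely for illustration.

```python
import numpy as np

def sequential_sum_f32(values):
    """Add float32 values one at a time, in the order given."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + np.float32(v))
    return acc

# The same three numbers, accumulated in two different orders.
order_a = sequential_sum_f32([1e8, 1.0, -1e8])   # (1e8 + 1) - 1e8: the 1 is rounded away
order_b = sequential_sum_f32([1e8, -1e8, 1.0])   # (1e8 - 1e8) + 1: the 1 survives
print(order_a, order_b)  # 0.0 1.0
```

With roughly two million elements per call, as in the reproduction above, even tiny ordering changes can shift the last few digits of each segment's sum.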

A Colab notebook with this code can be found at: https://colab.research.google.com/drive/1HNHSfERQ_IDDDM7bgabii9TQPufpsutp?usp=sharing

Unit Tests
Essentially, a unit test should check that the code above produces the same result on every iteration of the for loop, rather than a different result each time.

Coming soon.
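
One possible shape for such a check, as a sketch only: the helper names `assert_deterministic` and `segment_sum_reference` are hypothetical, and a NumPy CPU reference stands in for the GPU op so that the example runs without TensorFlow. The comparison is exact (bit-identical) on purpose, since a tolerance-based comparison would hide the issue.

```python
import numpy as np

def assert_deterministic(fn, runs=5):
    """Call fn() several times and require bit-identical results."""
    reference = np.asarray(fn())
    for _ in range(runs - 1):
        result = np.asarray(fn())
        # Exact comparison; np.allclose would mask small run-to-run drift.
        assert np.array_equal(reference, result), "results differ between runs"
    return reference

def segment_sum_reference(data, segment_ids, num_segments):
    """CPU stand-in for an unsorted segment sum with one id per element."""
    out = np.zeros(num_segments, dtype=data.dtype)
    np.add.at(out, segment_ids.ravel(), data.ravel())
    return out

rng = np.random.default_rng(0)
data = rng.standard_normal((30, 16, 16)).astype(np.float32)
segment_ids = rng.integers(0, 4, size=data.shape)

result = assert_deterministic(lambda: segment_sum_reference(data, segment_ids, 4))
print(result.shape)  # (4,)
```

To exercise the GPU kernel instead, the lambda could call tf.math.unsorted_segment_sum(data, segment_ids, 4).numpy(); per this report, that version of the check would currently fail.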

Other info / logs

More information about the GPU operation can be found at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/segment_reduction_ops_gpu.cu.cc

More information coming soon

Labels: TF 2.2, comp:ops, type:bug