[determinism] Add sparse softmax/xent GPU-determinism #50070
Conversation
Note that @benbarsdell will soon open a PR to make ...
Perhaps this should be implemented within the C++ op instead of at the Python level, which would solve the issue of raising an exception on the CPU. The core logic of the op, which is the following two lines, can be done in the C++ op:

```python
log_probs = log_softmax_v2(logits)
cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))
```

I think the current partially-implemented approach of checking ... Implementing in C++ has a few more advantages. We don't have to deal with the awkwardness of a custom gradient. It's also probably faster, as we don't have to run a lot of ops (like ...). The main disadvantage that I can think of is that implementing the op in C++ will take more time. Unfortunately, most of the non-test code in this PR would not be used.

@duncanriach, what do you think? To implement, you could call the GatherFunctor and the SoftmaxFunctor from the SparseSoftmaxCrossEntropyWithLogits op.

Also /CC @sanjoy
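For reference, here is a minimal, self-contained sketch of that core logic using public TF APIs (not code from this PR; the function name is illustrative):

```python
import tensorflow as tf

def sparse_xent_via_gather(logits, labels):
  """Illustrative only: cross entropy as log-softmax followed by a gather."""
  log_probs = tf.nn.log_softmax(logits, axis=-1)
  # For each batch slot, pick out the log-probability of the labeled class.
  return -tf.gather(log_probs, labels, batch_dims=1)

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([0, 1])
print(sparse_xent_via_gather(logits, labels))
# Should closely match the built-in op:
print(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                     logits=logits))
```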
@reedwm, the NaN injection (in both the forward and backward directions) when running on GPU is currently provided by the Eigen-based implementation (in ...).

I think that two reasonable (relatively temporary) compromises would be:

1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

The second option would improve performance and is similar to what I did for ...

I suspect that for these ops, for which we provide interim, deterministic Python-level solutions, it may be possible to later implement higher-performance, CUDA-level solutions that also include complete, high-performance error checks.
> Calling `GatherFunctor` and `SoftmaxFunctor` in `SparseSoftmaxXentWithLogitsOp` (implementing the math as in the Python-level solution) would not properly implement the NaN injection (or even the backprop output from the op)

True. Currently, `GatherFunctor` injects zeros for out-of-bounds indices. We could get around this by adding a parameter to `GatherFunctor` to inject NaN instead of zeros.
I think we should eventually go with implementing this in C++. However, I'm fine keeping the Python implementation for now, and later switching to the C++ approach.
> I think that two reasonable (relatively temporary) compromises would be:
>
> 1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
> 2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

I prefer (2) since the implementation is simpler, there is precedent in `tf.nn.bias_add`, and we can eventually solve this by moving the determinism codepath of this op to C++. On the CPU, an exception will still be raised with (2) because `tf.gather` raises an exception on the CPU. On the GPU, zeros will be returned instead of NaN if the index is out of bounds, due to `tf.gather` returning zeros.
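A small sketch (not part of this PR) illustrating the `tf.gather` out-of-bounds behavior being relied on here; the exact error type on CPU is my recollection and worth double-checking:

```python
import tensorflow as tf

params = tf.constant([[0.1, 0.2, 0.3],
                      [0.4, 0.5, 0.6]])
indices = tf.constant([1, 7])  # 7 is out of bounds for an axis of size 3

with tf.device('/CPU:0'):
  try:
    tf.gather(params, indices, batch_dims=1)
  except tf.errors.OpError:  # typically InvalidArgumentError
    print('CPU: out-of-bounds index raises an exception')

if tf.config.list_physical_devices('GPU'):
  with tf.device('/GPU:0'):
    # On GPU, the out-of-bounds slot is expected to come back as zero
    # rather than raising.
    print(tf.gather(params, indices, batch_dims=1))
```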
```python
          labels=labels, logits=logits)
      return loss, tape.gradient(loss, logits)
  # ...
  def _npXent(self, labels, logits):
```
Renaming variables and methods is nice (like renaming `features` to `logits`), but it makes it more difficult to see what's changed when comparing renamed files. For this PR it's fine, but in the future, when renaming a file like this, I would keep such changes to a minimum (unless the changes are necessary for the PR), and make the changes in a different PR which does not rename files.
Noted. Thanks for the feedback.
```python
# The following instruction is to be deleted after
# tf.math.unsorted_segment_sum operates deterministically in top-of-tree (and
# therefore doesn't throw d9m-unimplemented exceptions), and before the
# current PR is merged.
```
Can you add the string "DO NOT SUBMIT" (without quotes) to this comment? It will prevent the PR from accidentally being merged, and can be removed once the PR should be submitted. E.g.:

```python
# DO NOT SUBMIT: Remove this once unsorted_segment_sum is deterministic
```
Yes. The next commit will add this.
Some responses to your overall review comment, @reedwm:
True. However, this does not address the backprop path that would need to be implemented under ...
Right. And the details of how that's implemented at the C++ level can be worked out later. I plan to include one or more TODOs.
This doesn't match with my memory of what I saw in the testing (no exception on CPU), and I cannot find this exception-raising functionality in the code. I do see that the documentation claims that there will be an exception raised on the CPU. I'll review this while revising the tests.
Yes, and I seem to remember that zeros are also returned by the backprop of the deterministic solution (rather than NaNs), which I think makes sense mathematically: when an invalid logit is the "correct" one, then changes in the actual logits will have no effect on the loss.
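To spell out that reasoning (my own restatement, not text from the PR):

```latex
% For a valid label y in {0, ..., C-1}, with p = softmax(z):
%   L = -log p_y,  and  dL/dz_j = p_j - [j = y].
% When y is out of range, the deterministic path gathers a constant instead
% of log p_y, so L no longer depends on z and dL/dz_j = 0 for every j.
L = -\log p_y, \qquad
\frac{\partial L}{\partial z_j} = p_j - \mathbf{1}[j = y]
```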
The following code will be removed from this pull request (from ...):

```python
@custom_gradient.custom_gradient
def _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels):
  logits_shape = array_ops.shape(logits)
  class_count = math_ops.cast(logits_shape[-1], labels.dtype)
  nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)

  def grad(upstream_gradient):
    logits_all_nans = array_ops.broadcast_to(nan_tensor, logits_shape)
    labels_on_logits = array_ops.transpose(
        labels * array_ops.transpose(
            [array_ops.ones(logits_shape[1], dtype=labels.dtype)]))
    logits_grad = array_ops.where(
        math_ops.logical_or(
            math_ops.less(labels_on_logits, 0),
            math_ops.greater_equal(labels_on_logits, class_count)),
        logits_all_nans, upstream_gradient)
    labels_grad = None
    return [logits_grad, labels_grad]

  return logits, grad


def _sparse_softmax_cross_entropy_with_rank_2_logits(logits, labels, name):
  if _tf_deterministic_ops():
    # Handle invalid labels before running the op
    _sparse_softmax_cross_entropy_check_for_valid_labels_on_cpu(logits, labels)
    logits = _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels)
    # The actual op functionality
    log_probs = log_softmax_v2(logits)
    cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))
    # Force the output to be NaN when the corresponding label is invalid
    nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)
    cost_all_nans = array_ops.broadcast_to(nan_tensor, array_ops.shape(cost))
    class_count = math_ops.cast(array_ops.shape(logits)[-1], labels.dtype)
    cost = array_ops.where(
        math_ops.logical_or(
            math_ops.less(labels, 0),
            math_ops.greater_equal(labels, class_count)),
        cost_all_nans, cost)
  else:
    # The second output tensor contains the gradients. We use it in
    # _CrossEntropyGrad() in nn_grad but not here.
    cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
        logits, labels, name=name)
  return cost
```
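As context for the NaN injection discussed above, here is a small check sketch (my own, not part of the PR) of the documented GPU behavior of the public op: an out-of-range label is expected to produce NaN in the corresponding loss element.

```python
import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0]])
labels = tf.constant([2, 5])  # 5 is out of range for 3 classes

if tf.config.list_physical_devices('GPU'):
  with tf.device('/GPU:0'):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # Expected, per the documented GPU behavior: [~0.41, nan]
    print(loss)
```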
Here is the difference in the gradient generated by ...

Expected gradient: (output omitted)

Actual gradient: (output omitted)
I'll mark this PR as approved once unsorted_segment_sum is deterministic. The PR looks good to me and I have no further suggested changes. I don't want to approve now since someone may try to merge the PR and get confused by the "DO NOT SUBMIT". The difference in gradients when the gating is disabled makes me suspect the ...
Thanks. Sounds good.
I suspect that this is revealing a bug in ... However, when the gradients associated with illegal labels are blocked by injecting either zero or NaN in the forward direction (as this PR currently does), the associated gradient becomes zero and the gradients associated with legal labels come through correctly. So, this is probably a bug associated with ...
I intend to look more deeply into this at some point (I added a TODO), but not right now. I planned to, at least, wait until after the new GPU implementation of ...
@duncanriach Can you please resolve conflicts? Thanks!
Force-pushed from 025d2c4 to 840f840.
@gbaned, force-pushed after rebase to resolve conflicts.
@duncanriach Can you please resolve conflicts? Thanks!
Force-pushed from 16146ac to 5bd43c6.
@gbaned: rebased, conflicts resolved, and force-pushed.
#51861 is merged, so this looks ready to be merged. @duncanriach, can you remove the `TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS` environment variable from `sparse_xent_op_deterministic_test.py`, as well as the "DO NOT SUBMIT" comment right above it? Then I'll merge this.

If #51861 is rolled back, we can re-add the `TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS` environment variable until it is rolled forward.
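For illustration, the lines being removed from the test presumably look something like the following (a hypothetical reconstruction based on the comments above; the actual file may differ):

```python
import os

# DO NOT SUBMIT: Remove this once unsorted_segment_sum is deterministic
os.environ['TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS'] = '1'
```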
Force-pushed from 99452c0 to 00e7050.
I'm going to rebase to get the other changes in my local tree and run a test before force pushing, which will probably happen tomorrow.
Force-pushed from 00e7050 to 2b53358.
@reedwm, I rebased and force-pushed, but you only need to review the most recent commit.
Merged commit: [determinism] Add sparse softmax/xent GPU-determinism (PiperOrigin-RevId: 395893293, Change-Id: I9158e73d027ca4d5525c4e5311d6aa3554d96fb7)
Seems auto-merge is not happening but the changes are merged into master now, so we can close this. Thank you for the PR.
Seems like this test is failing on Windows (we don't have a log because the output is too big).
I think this is because ...
This PR adds and tests deterministic forward and backward operation of `tf.nn.sparse_softmax_cross_entropy_with_logits` when running on a GPU. This PR is a follow-on from PR 49178 (Add non-sparse softmax/xent GPU-determinism). Note that there are significant changes and enhancements to the existing tests that may be obscured by the restructuring of the test files.

This implementation uses `tf.gather` in a way that does not exercise the nondeterminism that is currently possible through its backprop path on GPU: in each batch slot, it gathers only once from the correct logit, so it backprops (using `tf.math.unsorted_segment_sum`) from only one gradient value (associated with the loss) per batch slot. Therefore, this solution is deterministic. However, due to the d9m-unimplemented exception-throwing that was added to `tf.math.unsorted_segment_sum` via PR 47925 (Add softmax/cross-entropy op exceptions for GPU determinism), use of `tf.gather`'s backprop on a GPU while op-determinism is enabled will cause a `tf.errors.UnimplementedError` to be thrown. For this reason, the test currently disables the exception by setting `TF_DISABLE_SPARSE_SOFTMAX_XENT_WITH_LOGITS_OP_DETERMINISM_EXCEPTIONS` to `'1'`, and this current PR should not be merged until a deterministic GPU implementation for `tf.math.unsorted_segment_sum` has been merged into the top-of-tree.

The pre-existing, nondeterministic op enforces validity of the `labels` input by raising an exception when running on CPU, or injecting NaNs into the forward and backward paths associated with the illegal entries in `labels` when running on a GPU. This functionality is documented. The initial state of this PR emulates the original on-GPU functionality (injecting NaNs) when the new deterministic functionality is running on either a CPU or a GPU. There are also the beginnings of a Python-level emulation of the on-CPU functionality (throwing an exception) in case we decide to complete that implementation. Emulating the on-CPU functionality at the Python level seems very hacky (the way I started doing it) and I'm not sure that it's possible. The emulation of the on-GPU functionality at the Python level is solid but adds a reasonable amount of compute. Let's discuss.

Note that there is currently no `tf.float64` GPU implementation of this op.

I was not able to make the determinism test exercise the pre-existing nondeterminism in the forward direction for this op when `logits` is of type `tf.float16`, because of the way `logits` is up-cast to 32 bits in `nn_ops.py`. I'm not sure if it's ever possible to exercise that potential nondeterminism. In other words, the pre-existing GPU implementation may always operate deterministically in the forward direction when `logits` is of type `tf.float16`. This is mostly moot, of course (at least for training), since the op is a loss function and the gradients back to `logits` are still nondeterministic (for the original GPU implementation) when `logits` is of type `tf.float16`.

This PR is related to RFC: Enabling Determinism in TensorFlow. For status and history of GPU-determinism for this op, see here.
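As a rough illustration of the kind of determinism check described above (not the PR's actual test code; the environment-variable mechanism shown is an assumption based on how op-determinism was enabled around the time of this PR):

```python
import os
# Assumed mechanism for enabling op-determinism in this TF era; set before TF
# is imported so kernels pick it up.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import tensorflow as tf

def loss_and_grad(logits, labels):
  """Forward and backward pass through the op under test."""
  with tf.GradientTape() as tape:
    tape.watch(logits)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
  return loss, tape.gradient(loss, logits)

logits = tf.random.normal([1024, 1000], seed=123)
labels = tf.random.uniform([1024], maxval=1000, dtype=tf.int64, seed=456)

# With op-determinism enabled, repeated runs on identical inputs should be
# bit-exact in both the loss and the gradient w.r.t. logits.
loss_a, grad_a = loss_and_grad(logits, labels)
loss_b, grad_b = loss_and_grad(logits, labels)

assert tf.reduce_all(tf.equal(loss_a, loss_b))
assert tf.reduce_all(tf.equal(grad_a, grad_b))
```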
cc @reedwm @sanjoy @nluehr