
[determinism] Add sparse softmax/xent GPU-determinism #50070


Conversation

duncanriach
Contributor

@duncanriach duncanriach commented Jun 4, 2021

This PR adds and tests deterministic forward and backward operation of tf.nn.sparse_softmax_cross_entropy_with_logits when running on a GPU. This PR is a follow-on from PR 49178 (Add non-sparse softmax/xent GPU-determinism).

Note that there are significant changes and enhancements to the existing tests that may be obscured by the restructuring of the test files.

This implementation uses tf.gather in a way that does not exercise the nondeterminism that is currently possible through its backprop path on GPU: in each batch slot, it gathers only once, from the correct logit, so it backprops (using tf.math.unsorted_segment_sum) from only one gradient value (associated with the loss) per batch slot. Therefore, this solution is deterministic. However, because of the d9m-unimplemented exception-throwing that was added to tf.math.unsorted_segment_sum via PR 47925 (Add softmax/cross-entropy op exceptions for GPU determinism), using tf.gather's backprop on a GPU while op-determinism is enabled will cause a tf.errors.UnimplementedError to be thrown. For this reason, the test currently disables the exception by setting TF_DISABLE_SPARSE_SOFTMAX_XENT_WITH_LOGITS_OP_DETERMINISM_EXCEPTIONS to '1', and this PR should not be merged until a deterministic GPU implementation of tf.math.unsorted_segment_sum has been merged into the top-of-tree.
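
The core of that deterministic path can be sketched with public TF ops (a minimal sketch of the approach described above, not the PR's exact code; the function name is illustrative):

import tensorflow as tf

def _sparse_xent_deterministic_sketch(logits, labels):
  log_probs = tf.nn.log_softmax(logits)
  # batch_dims=1: batch slot i reads log_probs[i, labels[i]] exactly once, so
  # tf.gather's backprop (via tf.math.unsorted_segment_sum) accumulates only
  # one upstream gradient value per batch slot, which is order-independent.
  return -tf.gather(log_probs, labels, batch_dims=1)

logits = tf.random.normal([4, 10])
labels = tf.constant([1, 3, 0, 9])
with tf.GradientTape() as tape:
  tape.watch(logits)
  loss = _sparse_xent_deterministic_sketch(logits, labels)
grads = tape.gradient(loss, logits)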

The pre-existing, nondeterministic op enforces validity of the labels input by raising an exception when running on CPU, or by injecting NaNs into the forward and backward paths associated with the illegal entries in labels when running on a GPU. This functionality is documented. The initial state of this PR emulates the original on-GPU functionality (injecting NaNs) when the new deterministic functionality is running on either a CPU or a GPU. There are also the beginnings of a python-level emulation of the on-CPU functionality (throwing an exception) in case we decide to complete that implementation. Emulating the on-CPU functionality at the python level seems very hacky (at least the way I started doing it) and I'm not sure that it's possible. The emulation of the on-GPU functionality at the python level is solid but adds a reasonable amount of compute. Let's discuss.

Note that there is currently no tf.float64 GPU implementation of this op.

I was not able to make the determinism test exercise the pre-existing nondeterminism in the forward direction for this op when logits is of type tf.float16, because of the way logits is upcast to 32 bits in nn_ops.py. I'm not sure if it's ever possible to exercise that potential nondeterminism. In other words, the pre-existing GPU implementation may always operate deterministically in the forward direction when logits is of type tf.float16. This is mostly moot, of course (at least for training), since the op is a loss function and the gradients back to logits are still nondeterministic (for the original GPU implementation) when logits is of type tf.float16.
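
For reference, the upcast pattern in nn_ops.py that this refers to looks roughly like the following (a simplified sketch, not the actual source; the wrapper name is illustrative): float16 logits are cast to float32 before the fused op runs, and the cost is cast back afterwards.

import tensorflow as tf

def _xent_with_fp16_logits(labels, logits_fp16):
  # Compute in float32, then cast the cost back to float16, so the fp16
  # forward path never reaches the fused GPU kernel with fp16 inputs.
  cost = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=tf.cast(logits_fp16, tf.float32))
  return tf.cast(cost, tf.float16)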

This PR is related to RFC: Enabling Determinism in TensorFlow. For status and history of GPU-determinism for this op, see here.

cc @reedwm @sanjoy @nluehr

@google-ml-butler google-ml-butler bot added the size:L (CL Change Size: Large) label Jun 4, 2021
@google-cla google-cla bot added the cla: yes label Jun 4, 2021
@duncanriach duncanriach changed the title from "[determinism] Add sparse softmax/xent GPU-determinism." to "[determinism] Add sparse softmax/xent GPU-determinism" Jun 4, 2021
@rthadur rthadur self-assigned this Jun 7, 2021
@rthadur rthadur requested a review from sanjoy June 7, 2021 16:56
@reedwm reedwm requested review from reedwm and removed request for sanjoy June 7, 2021 19:06
@duncanriach
Contributor Author

duncanriach commented Jun 7, 2021

Note that @benbarsdell will soon open a PR to make tf.math.unsorted_segment_sum operate deterministically on a GPU. This will remove the d9m-unimplemented exception-throwing for this op when running on a GPU (and when op-determinism is expected). This current PR cannot be merged until that exception-throwing functionality has been removed from the top-of-tree.

@reedwm
Member

reedwm commented Jun 7, 2021

Perhaps this should be implemented within the C++ op instead of at the Python level, which would solve the issue of raising an exception on the CPU. The core logic of the op, which consists of the following two lines, can be done in the C++ SparseSoftmaxCrossEntropyWithLogits op when determinism is enabled:

    log_probs = log_softmax_v2(logits)
    cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))

I think the current partially-implemented approach of checking math_ops.add(1, 2).device cannot be made to work in all cases, such as in graph mode when a GraphDef is saved then run on a separate machine. Alternatively, we could just always return NaNs even on the CPU, which is what XLA does despite it violating the documentation, but I would prefer raising exceptions on the CPU.
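
For context, the device-probe idea being criticized here amounts to something like the following (a hedged sketch with a hypothetical helper name, not code from the PR). In eager mode the probe returns a concrete placement string, but inside a traced graph the device may be unassigned, and a serialized GraphDef may later run on different hardware, which is why the check is unreliable.

import tensorflow as tf

def _probably_running_on_cpu():
  # Probe the placement of a throwaway op; fragile for the reasons above.
  return 'CPU' in tf.math.add(1, 2).device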

Implementing in C++ has a few more advantages. We don't have to deal with the awkwardness of a custom gradient. It's also probably faster as we don't have to run a lot of ops (like tf.where, tf.logical_or, etc). I also think it's slightly cleaner for the same op to be run regardless of whether determinism is enabled (we already run different ops for the non-sparse version but in that case the logic is a lot simpler).

The main disadvantage that I can think of is implementing the op in C++ will take more time. Unfortunately most of the non-test code in this PR would not be used.

@duncanriach what do you think? To implement, you could call the GatherFunctor and the SoftmaxFunctor from the SparseSoftmaxCrossEntropyWithLogits op.

Also /CC @sanjoy

@rthadur rthadur assigned gbaned and unassigned rthadur Jun 11, 2021
@duncanriach
Contributor Author

@reedwm, the NaN injection (in both the forward and backward directions) when running on GPU is currently provided by the Eigen-based implementation (in SparseXentLossGenerator and SparseXentGradGenerator in sparse_xent_op.h). Calling GatherFunctor and SoftmaxFunctor in SparseSoftmaxXentWithLogitsOp (implementing the math as in the Python-level solution) would not properly implement the NaN injection (or even the backprop output from the op). It would, however, solve the problem of throwing the exception when running on CPU. Maybe I'm misunderstanding something about your proposal; let me know if you think I've missed something.

I think that two reasonable (relatively temporary) compromises would be:

  1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
  2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

The second option would improve performance and is similar to what I did for tf.nn.bias_add, the current deterministic implementation of which skips some error checks (see here).

I suspect that for these ops for which we provide interim, deterministic Python-level solutions, it may be possible to later implement higher-performance, CUDA-level solutions that also include complete, high-performance error checks.

Member

@reedwm reedwm left a comment


Calling GatherFunctor and SoftmaxFunctor in SparseSoftmaxXentWithLogitsOp (implementing the math as in the Python-level solution) would not properly implement the NaN injection (or even the backprop output from the op)

True. Currently, GatherFunctor injects zeros for out-of-bounds indices. We could get around this by adding a parameter to GatherFunctor to inject NaN instead of zeros.

I think we should eventually go with implementing this in C++. However, I'm fine keeping the Python implementation for now, and later switching to the C++ approach.

I think that two reasonable (relatively temporary) compromises would be:

  1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
  2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

I prefer (2) since the implementation is simpler, there is precedent in tf.nn.bias_add, and we can eventually solve this by moving the determinism codepath of this op to C++. On the CPU, an exception will still be raised with (2) because tf.gather raises an exception on the CPU. On the GPU, zeros will be returned instead of NaN if the index is out of bounds due to tf.gather returning zeros.
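
For reference, the tf.gather behavior mentioned above can be seen with a small example (per the tf.gather documentation; the thread below notes some uncertainty about the exact CPU behavior):

import tensorflow as tf

params = tf.constant([[0.1, 0.2], [0.3, 0.4]])
indices = tf.constant([0, 5])  # index 5 is out of range for 2 rows
try:
  out = tf.gather(params, indices)
  # On GPU: no error is raised; the row gathered for index 5 is all zeros.
except tf.errors.InvalidArgumentError:
  # On CPU: the out-of-range index raises an error.
  pass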

labels=labels, logits=logits)
return loss, tape.gradient(loss, logits)

def _npXent(self, labels, logits):
Member


Renaming variables and methods is nice (like renaming features to logits), but makes it more difficult to see what's changed when comparing renamed files. For this PR it's fine, but in the future, when renaming a file like this, I would keep such changes to a minimum (unless the changes are necessary for the PR), and make the changes in a different PR which does not rename files.

Contributor Author


Noted. Thanks for the feedback.

# The following instruction is to be deleted after
# tf.math.unsorted_segment_sum operates deterministically in top-of-tree (and
# therefore doesn't throw d9m-unimplemented exceptions), and before the
# current PR is merged.
Member


Can you add the string "DO NOT SUBMIT" (without quotes) to this comment? It will prevent the PR from accidentally being merged, and can be removed once the PR is ready to be submitted. E.g.:

# DO NOT SUBMIT: Remove this once unsorted_segment_sum is deterministic

Contributor Author


Yes. The next commit will add this.

@duncanriach
Contributor Author

Some responses to your overall review comment, @reedwm:

We could get around this by adding a parameter to GatherFunctor to inject NaN instead of zeros.

True. However, this does not address the backprop path that would need to be implemented under SparseSoftmaxXentWithLogitsOp as well, including the injection of NaN into the parts of the gradient associated with illegal label values.

I prefer (2) since the implementation is simpler, there is precedent in tf.nn.bias_add, and we can eventually solve this by moving the determinism codepath of this op to C++.

Right. And the details of how that's implemented at the C++ level can be worked out later. I plan to include one or more TODOs.

On the CPU, an exception will still be raised with (2) because tf.gather raises an exception on the CPU.

This doesn't match my memory of what I saw in the testing (no exception on CPU), and I cannot find this exception-raising functionality in the code. I do see that the documentation claims that an exception will be raised on the CPU. I'll review this while revising the tests.

On the GPU, zeros will be returned instead of NaN if the index is out of bounds due to tf.gather returning zeros.

Yes, and I seem to remember that zeros are also returned by the backprop of the deterministic solution (rather than NaNs), which I think makes sense mathematically: when an invalid logit is the "correct" one, then changes in the actual logits will have no effect on the loss.

@duncanriach
Contributor Author

duncanriach commented Jun 22, 2021

The following code will be removed from this pull request (from python/ops/nn_ops.py) by the next commit. It enables the python-level, GPU-deterministic implementation of tf.nn.sparse_softmax_cross_entropy_with_logits to inject NaNs into the batch entries for which the labels value is invalid (in both the forward and backward directions). I'm placing this code here for potential future reference.

@custom_gradient.custom_gradient
def _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels):
  logits_shape = array_ops.shape(logits)
  class_count = math_ops.cast(logits_shape[-1], labels.dtype)
  nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)
  def grad(upstream_gradient):
    # Replace the gradient with NaN wherever the corresponding label is out of
    # range (i.e. not in [0, class_count)); valid batch slots pass through.
    logits_all_nans = array_ops.broadcast_to(nan_tensor, logits_shape)
    # Broadcast each batch slot's label across the class dimension so it can
    # be compared element-wise against the logits-shaped upstream gradient.
    labels_on_logits = array_ops.transpose(
        labels * array_ops.transpose(
            [array_ops.ones(logits_shape[1], dtype=labels.dtype)]))
    logits_grad = array_ops.where(
        math_ops.logical_or(
            math_ops.less(labels_on_logits, 0),
            math_ops.greater_equal(labels_on_logits, class_count)),
        logits_all_nans, upstream_gradient)
    labels_grad = None
    return [logits_grad, labels_grad]
  return logits, grad


def _sparse_softmax_cross_entropy_with_rank_2_logits(logits, labels, name):
  if _tf_deterministic_ops():
    # Handle invalid labels before running the op
    _sparse_softmax_cross_entropy_check_for_valid_labels_on_cpu(logits, labels)
    logits = _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels)

    # The actual op functionality
    log_probs = log_softmax_v2(logits)
    cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))

    # Force the output to be NaN when the corresponding label is invalid
    nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)
    cost_all_nans = array_ops.broadcast_to(nan_tensor, array_ops.shape(cost))
    class_count = math_ops.cast(array_ops.shape(logits)[-1], labels.dtype)
    cost = array_ops.where(
        math_ops.logical_or(
          math_ops.less(labels, 0),
          math_ops.greater_equal(labels, class_count)),
        cost_all_nans, cost)
  else:
    # The second output tensor contains the gradients. We use it in
    # _CrossEntropyGrad() in nn_grad but not here.
    cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
        logits, labels, name=name)
  return cost

@duncanriach duncanriach requested a review from reedwm June 24, 2021 01:10
@duncanriach
Contributor Author

duncanriach commented Jun 24, 2021

Here is the difference in the gradient generated by testInvalidLabelGPU when the selective gradient gating after tf.gather is removed:

Expected gradient:

[[ 0.    ,  0.    ,  0.    ,  0.    ],
 [ 0.25  ,  0.25  ,  0.25  , -0.75  ],
 [-0.968 ,  0.087 ,  0.237 ,  0.6439],
 [ 0.    ,  0.    ,  0.    ,  0.    ]]

Actual gradient:

[[ 0.      ,  0.      ,  0.      ,  0.      ],
 [-0.5     ,  0.5     ,  0.5     , -0.5     ],
 [-0.935883,  0.174289,  0.473766,  0.287828],
 [ 0.      ,  0.      ,  0.      ,  0.      ]]
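
For reference, the expected rows above follow from the usual sparse-xent gradient: for a valid batch slot, d(loss)/d(logits) = softmax(logits) - one_hot(label), and slots with invalid labels are expected to be all zeros. A hedged sketch with illustrative inputs (not the test's actual logits), reproducing the second expected row when the logits in that slot are all equal and the label is 3:

import tensorflow as tf

logits_row = tf.constant([[1.0, 1.0, 1.0, 1.0]])  # illustrative only
label = tf.constant([3])
grad_row = tf.nn.softmax(logits_row) - tf.one_hot(label, depth=4)
# -> [[0.25, 0.25, 0.25, -0.75]]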

@reedwm
Member

reedwm commented Jun 24, 2021

I'll mark this PR as approved once the unsorted_segment_sum is deterministic. The PR looks good to me and I have no further suggested changes. I don't want to approve now since someone may try to merge the PR and get confused by the "DO NOT SUBMIT".

The difference in gradients when the gating is disabled makes me suspect the tf.gather gradient or some other op might have a bug, as the NaN injection shouldn't change the gradients (unless I did the math wrong). Are you planning on looking at this? If not, I'll take a look myself.

@duncanriach
Contributor Author

duncanriach commented Jun 24, 2021

I'll mark this PR as approved once the unsorted_segment_sum is deterministic.

Thanks. Sounds good.

The difference in gradients when the gating is disabled makes me suspect the tf.gather gradient or some other op might have a bug, as the NaN injection shouldn't change the gradients (unless I did the math wrong).

I suspect that this is revealing a bug in tf.math.unsorted_segment_sum (used by the backprop of tf.gather): the "illegal" segment ID is probably not getting stopped before it reaches tf.math.unsorted_segment_sum, and address aliasing is then causing gradients from the batch indices associated with the illegal labels to spill over into, and get added to, the batch indices associated with legal labels.

However, when the gradients associated with illegal labels are blocked by injecting either zero or NaN in the forward direction (as this PR currently does), the associated gradient becomes zero and the gradients associated with legal labels come through correctly.

So, this is probably a bug associated with tf.gather (on GPU) such that while illegal values in its indices parameter lead to zeros being output in the forward direction, the associated back-propped gradients (channeled via tf.math.unsorted_segment_sum) are not being dropped appropriately.
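
To illustrate the backprop path being discussed (a hedged sketch, not the kernel code): tf.gather's gradient is an IndexedSlices keyed by the gather indices, and densifying it is conceptually a tf.math.unsorted_segment_sum with those indices as segment ids, which is where an unfiltered illegal index could alias into another row's gradient.

import tensorflow as tf

params = tf.Variable([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
indices = tf.constant([0, 2])
with tf.GradientTape() as tape:
  loss = tf.reduce_sum(tf.gather(params, indices))
grad = tape.gradient(loss, params)  # IndexedSlices keyed by `indices`
# Conceptually equivalent dense gradient (one upstream slice summed per index):
dense = tf.math.unsorted_segment_sum(tf.ones([2, 2]), indices, num_segments=3)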

Are you planning on looking at this? If not, I'll take a look myself.

I intend to look more deeply into this at some point (I added a TODO), but not right now. I planned to, at least, wait until after the new GPU implementation of tf.math.unsorted_segment_sum was in the top-of-tree, just in case that happened to resolve it. However, you're welcome to investigate sooner.

@gbaned
Contributor

gbaned commented Jun 30, 2021

@duncanriach Can you please resolve conflicts? Thanks!

@gbaned gbaned added the stat:awaiting response (Status - Awaiting response from author) label Jun 30, 2021
@duncanriach duncanriach force-pushed the sparse-softmax-xent-gpu-determinism branch from 025d2c4 to 840f840 July 8, 2021 02:51
@duncanriach
Contributor Author

@gbaned, force-pushed after rebase to resolve conflicts.

@gbaned gbaned requested review from reedwm and removed request for reedwm July 8, 2021 11:53
@gbaned gbaned added the awaiting review (Pull request awaiting review) label and removed the stat:awaiting response (Status - Awaiting response from author) label Jul 8, 2021
@gbaned
Contributor

gbaned commented Jul 29, 2021

@duncanriach Can you please resolve conflicts? Thanks!

@gbaned gbaned removed the awaiting review (Pull request awaiting review) label Jul 29, 2021
@duncanriach duncanriach force-pushed the sparse-softmax-xent-gpu-determinism branch from 16146ac to 5bd43c6 July 30, 2021 00:21
@duncanriach
Contributor Author

@gbaned: rebased, conflicts resolved, and force-pushed.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label Aug 1, 2021
@gbaned gbaned requested review from reedwm and removed request for reedwm August 2, 2021 10:00
@google-ml-butler google-ml-butler bot added the awaiting review (Pull request awaiting review) label Aug 2, 2021
Member

@reedwm reedwm left a comment


#51861 is merged so this looks ready to be merged. @duncanriach can you remove the TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS environment variable from sparse_xent_op_deterministic_test.py, as well as the "DO NOT SUBMIT" comment right above it? Then I'll merge this.

If #51861 is rolled back, we can re-add the TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS environment variable until it is rolled forward.

@tensorflowbutler tensorflowbutler removed the awaiting review (Pull request awaiting review) label Sep 9, 2021
@duncanriach duncanriach force-pushed the sparse-softmax-xent-gpu-determinism branch from 99452c0 to 00e7050 September 9, 2021 03:17
@duncanriach
Contributor Author

I'm going to rebase to get the other changes in my local tree and run a test before force pushing, which will probably happen tomorrow.

@duncanriach duncanriach force-pushed the sparse-softmax-xent-gpu-determinism branch from 00e7050 to 2b53358 September 9, 2021 23:26
@duncanriach duncanriach requested a review from reedwm September 9, 2021 23:26
@google-ml-butler google-ml-butler bot added the awaiting review (Pull request awaiting review) label Sep 9, 2021
@duncanriach
Contributor Author

@reedwm, I rebased and force-pushed, but you only need to review the most recent commit.

@google-ml-butler google-ml-butler bot added the kokoro:force-run (Tests on submitted change) and ready to pull (PR ready for merge process) labels Sep 9, 2021
@kokoro-team kokoro-team removed the kokoro:force-run (Tests on submitted change) label Sep 9, 2021
@gbaned gbaned removed the awaiting review (Pull request awaiting review) label Sep 10, 2021
copybara-service bot pushed a commit that referenced this pull request Sep 10, 2021
…terminism

PiperOrigin-RevId: 395893293
Change-Id: I9158e73d027ca4d5525c4e5311d6aa3554d96fb7
@gbaned
Contributor

gbaned commented Sep 10, 2021

It seems auto-merge is not happening, but the changes are now merged into master, so we can close this. Thank you for the PR.

@gbaned gbaned closed this Sep 10, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull (PR ready for merge process) label Sep 10, 2021
@duncanriach duncanriach deleted the sparse-softmax-xent-gpu-determinism branch September 11, 2021 03:17
@joker-eph
Contributor

It seems like this test is failing on Windows (we don't have the log because the output is too big).

@reedwm
Member

reedwm commented Sep 11, 2021

I think this is because unsorted_segment_sum is not deterministic on Windows and so will raise an error when determinism is enabled. I'll disable the test on Windows for now.
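
One way such a skip is typically expressed (an illustrative sketch; the class and method names are hypothetical, not necessarily what was committed):

import platform
import tensorflow as tf

class SparseXentDeterministicTest(tf.test.TestCase):

  def testDeterministicGradients(self):
    if platform.system() == 'Windows':
      self.skipTest(
          'unsorted_segment_sum is not yet deterministic on Windows.')
    # ... determinism checks would go here ...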
