[determinism] Add sparse softmax/xent GPU-determinism #50070
Conversation
Note that @benbarsdell will soon open a PR to make ...
Perhaps this should be implemented within the C++ op instead of at the Python level, which would solve the issue of raising an exception on the CPU. The core logic of the op, which is the following two lines, can be done in the C++ op:

```python
log_probs = log_softmax_v2(logits)
cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))
```

I think the current partially-implemented approach of checking ... Implementing in C++ has a few more advantages. We don't have to deal with the awkwardness of a custom gradient. It's also probably faster, as we don't have to run a lot of ops (like ...). The main disadvantage that I can think of is that implementing the op in C++ will take more time. Unfortunately, most of the non-test code in this PR would not be used.

@duncanriach, what do you think? To implement, you could call the GatherFunctor and the SoftmaxFunctor from the SparseSoftmaxCrossEntropyWithLogits op.

Also /CC @sanjoy
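For reference, here is a minimal, self-contained sketch of that core logic using public TF APIs (not code from this PR; the function name is illustrative):

```python
import tensorflow as tf

def sparse_xent_via_gather(logits, labels):
  """Illustrative only: cross entropy as log-softmax followed by a gather."""
  log_probs = tf.nn.log_softmax(logits, axis=-1)
  # For each batch slot, pick out the log-probability of the labeled class.
  return -tf.gather(log_probs, labels, batch_dims=1)

logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([0, 1])
print(sparse_xent_via_gather(logits, labels))
# Should closely match the built-in op:
print(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                     logits=logits))
```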
@reedwm, the NaN injection (in both the forward and backward directions) when running on GPU is currently provided by the Eigen-based implementation (in ...).

I think that two reasonable (relatively temporary) compromises would be:

1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

The second option would improve performance and is similar to what I did for ...

I suspect that for these ops, for which we provide interim, deterministic Python-level solutions, it may be possible to later implement higher-performance, CUDA-level solutions that also include complete, high-performance error checks.
> Calling `GatherFunctor` and `SoftmaxFunctor` in `SparseSoftmaxXentWithLogitsOp` (implementing the math as in the Python-level solution) would not properly implement the NaN injection (or even the backprop output from the op)

True. Currently, `GatherFunctor` injects zeros for out-of-bounds indices. We could get around this by adding a parameter to `GatherFunctor` to inject NaN instead of zeros.
I think we should eventually go with implementing this in C++. However, I'm fine keeping the Python implementation for now, and later switching to the C++ approach.
> I think that two reasonable (relatively temporary) compromises would be:
>
> 1. When op-determinism is enabled, inject NaNs when running on CPU as well as GPU (already basically implemented).
> 2. When op-determinism is enabled, neither inject NaNs nor throw an exception on either CPU or GPU (already basically implemented).

I prefer (2) since the implementation is simpler, there is precedent in `tf.nn.bias_add`, and we can eventually solve this by moving the determinism codepath of this op to C++. On the CPU, an exception will still be raised with (2) because `tf.gather` raises an exception on the CPU. On the GPU, zeros will be returned instead of NaN if the index is out of bounds, due to `tf.gather` returning zeros.
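A small sketch (not part of this PR) illustrating the `tf.gather` out-of-bounds behavior being relied on here; the exact error type on CPU is my recollection and worth double-checking:

```python
import tensorflow as tf

params = tf.constant([[0.1, 0.2, 0.3],
                      [0.4, 0.5, 0.6]])
indices = tf.constant([1, 7])  # 7 is out of bounds for an axis of size 3

with tf.device('/CPU:0'):
  try:
    tf.gather(params, indices, batch_dims=1)
  except tf.errors.OpError:  # typically InvalidArgumentError
    print('CPU: out-of-bounds index raises an exception')

if tf.config.list_physical_devices('GPU'):
  with tf.device('/GPU:0'):
    # On GPU, the out-of-bounds slot is expected to come back as zero
    # rather than raising.
    print(tf.gather(params, indices, batch_dims=1))
```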
```python
          labels=labels, logits=logits)
      return loss, tape.gradient(loss, logits)
  # ...
  def _npXent(self, labels, logits):
```
Renaming variables and methods is nice (like renaming `features` to `logits`), but it makes it more difficult to see what's changed when comparing renamed files. For this PR it's fine, but in the future, when renaming a file like this, I would keep such changes to a minimum (unless the changes are necessary for the PR), and make the changes in a different PR which does not rename files.
Noted. Thanks for the feedback.
```python
# The following instruction is to be deleted after
# tf.math.unsorted_segment_sum operates deterministically in top-of-tree (and
# therefore doesn't throw d9m-unimplemented exceptions), and before the
# current PR is merged.
```
Can you add the string "DO NOT SUBMIT" (without quotes) to this comment? It will prevent the PR from accidentally being merged, and can be removed once the PR should be submitted. E.g.:

```python
# DO NOT SUBMIT: Remove this once unsorted_segment_sum is deterministic
```
Yes. The next commit will add this.
Some responses to your overall review comment, @reedwm:
True. However, this does not address the backprop path that would need to be implemented under ...
Right. And the details of how that's implemented at the C++ level can be worked out later. I plan to include one or more TODOs.
This doesn't match with my memory of what I saw in the testing (no exception on CPU), and I cannot find this exception-raising functionality in the code. I do see that the documentation claims that there will be an exception raised on the CPU. I'll review this while revising the tests.
Yes, and I seem to remember that zeros are also returned by the backprop of the deterministic solution (rather than NaNs), which I think makes sense mathematically: when an invalid logit is the "correct" one, then changes in the actual logits will have no effect on the loss.
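To spell out that reasoning (my own restatement, not text from the PR):

```latex
% For a valid label y in {0, ..., C-1}, with p = softmax(z):
%   L = -log p_y,  and  dL/dz_j = p_j - [j = y].
% When y is out of range, the deterministic path gathers a constant instead
% of log p_y, so L no longer depends on z and dL/dz_j = 0 for every j.
L = -\log p_y, \qquad
\frac{\partial L}{\partial z_j} = p_j - \mathbf{1}[j = y]
```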
The following code will be removed from this pull request (from ...):

```python
@custom_gradient.custom_gradient
def _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels):
  logits_shape = array_ops.shape(logits)
  class_count = math_ops.cast(logits_shape[-1], labels.dtype)
  nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)

  def grad(upstream_gradient):
    logits_all_nans = array_ops.broadcast_to(nan_tensor, logits_shape)
    labels_on_logits = array_ops.transpose(
        labels * array_ops.transpose(
            [array_ops.ones(logits_shape[1], dtype=labels.dtype)]))
    logits_grad = array_ops.where(
        math_ops.logical_or(
            math_ops.less(labels_on_logits, 0),
            math_ops.greater_equal(labels_on_logits, class_count)),
        logits_all_nans, upstream_gradient)
    labels_grad = None
    return [logits_grad, labels_grad]

  return logits, grad


def _sparse_softmax_cross_entropy_with_rank_2_logits(logits, labels, name):
  if _tf_deterministic_ops():
    # Handle invalid labels before running the op
    _sparse_softmax_cross_entropy_check_for_valid_labels_on_cpu(logits, labels)
    logits = _sparse_softmax_cross_entropy_gradient_nan_injector(logits, labels)
    # The actual op functionality
    log_probs = log_softmax_v2(logits)
    cost = math_ops.negative(array_ops.gather(log_probs, labels, batch_dims=1))
    # Force the output to be NaN when the corresponding label is invalid
    nan_tensor = constant_op.constant(float('Nan'), dtype=logits.dtype)
    cost_all_nans = array_ops.broadcast_to(nan_tensor, array_ops.shape(cost))
    class_count = math_ops.cast(array_ops.shape(logits)[-1], labels.dtype)
    cost = array_ops.where(
        math_ops.logical_or(
            math_ops.less(labels, 0),
            math_ops.greater_equal(labels, class_count)),
        cost_all_nans, cost)
  else:
    # The second output tensor contains the gradients. We use it in
    # _CrossEntropyGrad() in nn_grad but not here.
    cost, _ = gen_nn_ops.sparse_softmax_cross_entropy_with_logits(
        logits, labels, name=name)
  return cost
```
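As context for the NaN injection discussed above, here is a small check sketch (my own, not part of the PR) of the documented GPU behavior of the public op: an out-of-range label is expected to produce NaN in the corresponding loss element.

```python
import tensorflow as tf

logits = tf.constant([[1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0]])
labels = tf.constant([2, 5])  # 5 is out of range for 3 classes

if tf.config.list_physical_devices('GPU'):
  with tf.device('/GPU:0'):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    # Expected, per the documented GPU behavior: [~0.41, nan]
    print(loss)
```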
Here is the difference in the gradient generated by ...

Expected gradient: (output omitted)

Actual gradient: (output omitted)
I'll mark this PR as approved once unsorted_segment_sum is deterministic. The PR looks good to me and I have no further suggested changes. I don't want to approve now since someone may try to merge the PR and get confused by the "DO NOT SUBMIT". The difference in gradients when the gating is disabled makes me suspect the ...
Thanks. Sounds good.
I suspect that this is revealing a bug in ... However, when the gradients associated with illegal labels are blocked by injecting either zero or NaN in the forward direction (as this PR currently does), the associated gradient becomes zero and the gradients associated with legal labels come through correctly. So, this is probably a bug associated with ...
I intend to look more deeply into this at some point (I added a TODO), but not right now. I planned to, at least, wait until after the new GPU implementation of ...
@duncanriach Can you please resolve conflicts? Thanks!
Force-pushed from 025d2c4 to 840f840.
@gbaned, force-pushed after rebase to resolve conflicts.
@duncanriach Can you please resolve conflicts? Thanks!
Force-pushed from 16146ac to 5bd43c6.
@gbaned: rebased, conflicts resolved, and force-pushed.
#51861 is merged, so this looks ready to be merged. @duncanriach, can you remove the `TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS` environment variable from `sparse_xent_op_deterministic_test.py`, as well as the "DO NOT SUBMIT" comment right above it? Then I'll merge this.

If #51861 is rolled back, we can re-add the `TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS` environment variable until it is rolled forward.
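For illustration, the lines being removed from the test presumably look something like the following (a hypothetical reconstruction based on the comments above; the actual file may differ):

```python
import os

# DO NOT SUBMIT: Remove this once unsorted_segment_sum is deterministic
os.environ['TF_DISABLE_SEGMENT_REDUCTION_OP_DETERMINISM_EXCEPTIONS'] = '1'
```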
Force-pushed from 99452c0 to 00e7050.
I'm going to rebase to get the other changes in my local tree and run a test before force pushing, which will probably happen tomorrow.
Force-pushed from 00e7050 to 2b53358.
@reedwm, I rebased and force-pushed, but you only need to review the most recent commit.
Merged commit: [determinism] Add sparse softmax/xent GPU-determinism (PiperOrigin-RevId: 395893293, Change-Id: I9158e73d027ca4d5525c4e5311d6aa3554d96fb7)
Seems auto-merge is not happening but the changes are merged into master now, so we can close this. Thank you for the PR.
Seems like this test is failing on Windows (we don't have a log because the output is too big).
I think this is because ...
This PR adds and tests deterministic forward and backward operation of `tf.nn.sparse_softmax_cross_entropy_with_logits` when running on a GPU. This PR is a follow-on from PR 49178 (Add non-sparse softmax/xent GPU-determinism). Note that there are significant changes and enhancements to the existing tests that may be obscured by the restructuring of the test files.

This implementation uses `tf.gather` in a way that does not exercise the nondeterminism that is currently possible through its backprop path on GPU: in each batch slot, it gathers only once from the correct logit, so it backprops (using `tf.math.unsorted_segment_sum`) from only one gradient value (associated with the loss) per batch slot. Therefore, this solution is deterministic. However, due to the d9m-unimplemented exception-throwing that was added to `tf.math.unsorted_segment_sum` via PR 47925 (Add softmax/cross-entropy op exceptions for GPU determinism), use of `tf.gather`'s backprop on a GPU while op-determinism is enabled will cause a `tf.errors.UnimplementedError` to be thrown. For this reason, the test currently disables the exception by setting `TF_DISABLE_SPARSE_SOFTMAX_XENT_WITH_LOGITS_OP_DETERMINISM_EXCEPTIONS` to `'1'`, and this current PR should not be merged until a deterministic GPU implementation for `tf.math.unsorted_segment_sum` has been merged into the top-of-tree.

The pre-existing, nondeterministic op enforces validity of the `labels` input by raising an exception when running on CPU, or injecting NaNs into the forward and backward paths associated with the illegal entries in `labels` when running on a GPU. This functionality is documented. The initial state of this PR emulates the original on-GPU functionality (injecting NaNs) when the new deterministic functionality is running on either a CPU or a GPU. There are also the beginnings of a Python-level emulation of the on-CPU functionality (throwing an exception) in case we decide to complete that implementation. Emulating the on-CPU functionality at the Python level seems very hacky (the way I started doing it) and I'm not sure that it's possible. The emulation of the on-GPU functionality at the Python level is solid but adds a reasonable amount of compute. Let's discuss.

Note that there is currently no `tf.float64` GPU implementation of this op.

I was not able to make the determinism test exercise the pre-existing nondeterminism in the forward direction for this op when `logits` is of type `tf.float16`, because of the way `logits` is up-cast to 32 bits in `nn_ops.py`. I'm not sure if it's ever possible to exercise that potential nondeterminism. In other words, the pre-existing GPU implementation may always operate deterministically in the forward direction when `logits` is of type `tf.float16`. This is mostly moot, of course (at least for training), since the op is a loss function and the gradients back to `logits` are still nondeterministic (for the original GPU implementation) when `logits` is of type `tf.float16`.

This PR is related to RFC: Enabling Determinism in TensorFlow. For status and history of GPU-determinism for this op, see here.
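As a rough illustration of the kind of determinism check described above (not the PR's actual test code; the environment-variable mechanism shown is an assumption based on how op-determinism was enabled around the time of this PR):

```python
import os
# Assumed mechanism for enabling op-determinism in this TF era; set before TF
# is imported so kernels pick it up.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

import tensorflow as tf

def loss_and_grad(logits, labels):
  """Forward and backward pass through the op under test."""
  with tf.GradientTape() as tape:
    tape.watch(logits)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
  return loss, tape.gradient(loss, logits)

logits = tf.random.normal([1024, 1000], seed=123)
labels = tf.random.uniform([1024], maxval=1000, dtype=tf.int64, seed=456)

# With op-determinism enabled, repeated runs on identical inputs should be
# bit-exact in both the loss and the gradient w.r.t. logits.
loss_a, grad_a = loss_and_grad(logits, labels)
loss_b, grad_b = loss_and_grad(logits, labels)

assert tf.reduce_all(tf.equal(loss_a, loss_b))
assert tf.reduce_all(tf.equal(grad_a, grad_b))
```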
cc @reedwm @sanjoy @nluehr