Inconsistent NotEqual broadcasting behavior between CPU and GPU (CPU fails silently, GPU raises error) #97227

### Issue type

Bug

### Have you reproduced the bug with TensorFlow Nightly?

Yes

### Source

source

### TensorFlow version

2.19.0

### Custom code

Yes

### OS platform and distribution

_No response_

### Mobile device

_No response_

### Python version

Python 3.12

### Bazel version

_No response_

### GCC/compiler version

_No response_

### CUDA/cuDNN version

_No response_

### GPU model and memory

_No response_

### Current behavior?

When tf.raw_ops.NotEqual is called with two tensors whose shapes are not broadcastable, the behavior is inconsistent between the CPU and GPU implementations.

The GPU rejects the invalid input and raises an InvalidArgumentError ("required broadcastable shapes"), which is the expected behavior for a mathematically invalid operation.
The CPU, however, raises no error for the same call, which passes incompatible_shape_error=False, and returns a misleading scalar value (tf.Tensor(True, shape=(), dtype=bool)).
This violates device consistency: the same operation with the same inputs should yield the same result or the same error on every device.

Failing loudly on invalid inputs prevents silent bugs in user code and hard-to-debug numerical issues, so the GPU's strict error handling is preferable. The CPU implementation should be updated to match it and raise an error when given non-broadcastable shapes for this operation.
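To make concrete why the two shapes in this report cannot be broadcast, here is a minimal pure-Python sketch of the usual trailing-dimension broadcasting rule. The helper name `broadcast_shape` is hypothetical and is not a TensorFlow API; it only illustrates the rule, not how either kernel is implemented.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of a and b, or None if they are incompatible."""
    out = []
    # Align shapes from the trailing dimension; missing leading dims count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == 1:
            out.append(db)
        elif db == 1 or da == db:
            out.append(da)
        else:
            # Two different sizes, neither of which is 1: not broadcastable.
            return None
    return tuple(reversed(out))


# Shapes from this issue: the second-to-last dims are 4 and 3, so they clash.
print(broadcast_shape((4, 1), (1, 28, 2, 3, 2)))
# A compatible variant for comparison: 3 matches 3, and 1 stretches to 2.
print(broadcast_shape((3, 1), (1, 28, 2, 3, 2)))
```

The first call prints `None` and the second prints `(1, 28, 2, 3, 2)`, which shows that the mismatch comes from the 4 versus 3 pair alone.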

### Standalone code to reproduce the issue

```python
import numpy as np
import tensorflow as tf

# Set seed for reproducibility
np.random.seed(202)

# Generate input tensors with non-broadcastable shapes
# x.shape = (4, 1)
# y.shape = (1, 28, 2, 3, 2)
x = np.random.uniform(-32767., 127., size=(4, 1)).astype(np.float32)
y = np.random.uniform(0., 89., size=(1, 28, 2, 3, 2)).astype(np.float32)

# Convert to TensorFlow tensors
x_tensor = tf.constant(x, dtype=tf.float32)
y_tensor = tf.constant(y, dtype=tf.float32)

# --- CPU Execution ---
# This runs without error and produces a misleading result
try:
    with tf.device("/CPU:0"):
        result_cpu = tf.raw_ops.NotEqual(
            x=x_tensor,
            y=y_tensor,
            incompatible_shape_error=False,
            name="selu_cpu",
        )
    print("CPU Result:", result_cpu)
except Exception as e:
    print("CPU Error:", e)


# --- GPU Execution ---
# This correctly fails with an InvalidArgumentError
try:
    with tf.device("/GPU:0"):
        result_gpu = tf.raw_ops.NotEqual(
            x=x_tensor,
            y=y_tensor,
            incompatible_shape_error=False,
            name="selu_gpu",
        )
    print("GPU Result:", result_gpu)
except Exception as e:
    print("\nGPU Error:", e)
```
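The CPU/GPU comparison in the reproduction can be condensed into a small consistency check. The sketch below is illustrative only: `run_outcome`, `devices_agree`, and the two stub functions are hypothetical names, and the stubs merely stand in for the two device runs so the snippet runs without TensorFlow or a GPU.

```python
def run_outcome(fn):
    """Run fn() and summarize it as ("ok", value) or ("error", exception type name)."""
    try:
        return ("ok", fn())
    except Exception as exc:
        return ("error", type(exc).__name__)


def devices_agree(outcomes):
    """True when every device produced the same kind of outcome (all ok or all error)."""
    kinds = {kind for kind, _ in outcomes.values()}
    return len(kinds) <= 1


# Hypothetical stand-ins for the two device runs described in this issue.
def fake_cpu():
    return True  # mimics the scalar result reported for the CPU


def fake_gpu():
    raise ValueError("required broadcastable shapes")  # mimics the GPU failure


outcomes = {"CPU": run_outcome(fake_cpu), "GPU": run_outcome(fake_gpu)}
print(outcomes)
print("consistent:", devices_agree(outcomes))
```

In a real check, each stub would be replaced by a callable that wraps the `tf.raw_ops.NotEqual` call inside the matching `tf.device(...)` context, as in the reproduction script.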

### Relevant log output

```shell
CPU output:
CPU Result: tf.Tensor(True, shape=(), dtype=bool)

GPU output:
GPU Error: {{function_node __wrapped__NotEqual_device_/job:localhost/replica:0/task:0/device:GPU:0}} required broadcastable shapes [Op:NotEqual] name: selu_gpu
```
