Benchmarking Thrust Call

I am trying to benchmark a simple Thrust call with NVBench, but the run fails with a deadlock error.

Below is my Thrust benchmark code, `thrust_test.cu`:

<details>
<pre>
#include <nvbench/nvbench.cuh>
#include <cuda_fp16.h>
#include <cuda/std/chrono>

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Functor for squaring a number
struct square
{
    __host__ __device__
    __half operator()(const __half& x) const {
        return x * x;
    }
};

void thrust_benchmark(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f);  // All elements = 2.0
  thrust::device_vector<__half> d_output(size);


  state.exec([&](nvbench::launch &launch) {
    // Perform transform: output[i] = square(input[i])
    thrust::transform(d_input.begin(), d_input.end(), d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark)
    .add_int64_power_of_two_axis("Elements", {24});
</pre>
</details>

This is the output I get.

Command I ran: `./thrust_test --disable-blocking-kernel` (plain `./thrust_test` did not work)
<pre>
# Devices

## [0] `Tesla V100-PCIE-16GB`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1380 MHz
* Global Memory: 15796 MiB Free / 16144 MiB Total
* Global Memory Bus Peak: 898 GB/sec (4096-bit DDR @877MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

# Log

```
Run:  [1/1] thrust_benchmark [Device=0 Elements=2^24]
Pass: Cold: 0.099860ms GPU, 0.111693ms CPU, 0.50s total GPU, 0.68s total wall, 5008x

######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

    state.exec(<KernelLauncher>); // Deadlock
    state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe

This tells NVBench about the sync so it can run the benchmark safely.

If the KernelLauncher does not synchronize but has a very long
execution time, this may be a false positive. If so, disable this
check with:

    state.set_blocking_kernel_timeout(-1);

The current timeout is set to 30 seconds.

For more information, see the 'Benchmarks that sync' section of the
NVBench documentation.

If this happens while profiling with an external tool,
pass the `--disable-blocking-kernel` flag or the `--profile` flag
(to also only run the benchmark once) to the executable.

For more information, see the 'Benchmark Properties' section of the
NVBench documentation.

Fail: Unexpected error: nvbench/nvbench/blocking_kernel.cu:124: Deadlock detected -- missing nvbench::exec_tag::sync? See stdout for details.
```

# Benchmark Results

## thrust_benchmark

### [0] Tesla V100-PCIE-16GB

No data -- check log.
</pre>
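
Following the hint printed in the log, here is a minimal sketch of what the suggested change might look like. It assumes that `thrust::transform` synchronizes the device under the default execution policy, which is what the deadlock message points to. Passing `thrust::device.on(launch.get_stream())` is a further assumption on my part, so that the work is issued on the stream NVBench measures. The rest is unchanged from `thrust_test.cu`, and I have not verified this on the V100 above.

```cuda
#include <nvbench/nvbench.cuh>

#include <cuda_fp16.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

// Functor for squaring a number (same as in the original benchmark)
struct square
{
    __host__ __device__
    __half operator()(const __half& x) const {
        return x * x;
    }
};

void thrust_benchmark(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f);  // All elements = 2.0
  thrust::device_vector<__half> d_output(size);

  // The `sync` tag tells NVBench that the launcher may synchronize the
  // device, so the blocking kernel is not used for this measurement.
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
    // Assumption: run the algorithm on the stream NVBench provides so the
    // timed region covers this work.
    thrust::transform(thrust::device.on(launch.get_stream()),
                      d_input.begin(), d_input.end(),
                      d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark)
    .add_int64_power_of_two_axis("Elements", {24});
```

With the `sync` tag in place, the `--disable-blocking-kernel` flag should no longer be needed on the command line, again assuming the synchronization inside Thrust is the cause.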

Benchmarking Thrust Call #248
