Description
I am trying to benchmark a simple Thrust call with NVBench, but the run fails with a deadlock error.
Below is my Thrust benchmark code (thrust_test.cu):
#include <nvbench/nvbench.cuh>

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Functor for squaring a number
struct square
{
  __host__ __device__
  __half operator()(const __half& x) const {
    return x * x;
  }
};

void thrust_benchmark(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f); // All elements = 2.0
  thrust::device_vector<__half> d_output(size);

  state.exec([&](nvbench::launch &launch) {
    // Perform transform: output[i] = square(input[i])
    thrust::transform(d_input.begin(), d_input.end(), d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark)
  .add_int64_power_of_two_axis("Elements", {24});
I am getting the following error.
Command I ran: ./thrust_test --disable-blocking-kernel
(Running plain ./thrust_test did not work either.)
# Devices

## [0] `Tesla V100-PCIE-16GB`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1380 MHz
* Global Memory: 15796 MiB Free / 16144 MiB Total
* Global Memory Bus Peak: 898 GB/sec (4096-bit DDR @877MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

# Log

```
Run:  [1/1] thrust_benchmark [Device=0 Elements=2^24]
Pass: Cold: 0.099860ms GPU, 0.111693ms CPU, 0.50s total GPU, 0.68s total wall, 5008x

######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

    state.exec(<KernelLauncher>); // Deadlock
    state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe

This tells NVBench about the sync so it can run the benchmark safely.

If the KernelLauncher does not synchronize but has a very long
execution time, this may be a false positive. If so, disable this
check with:

    state.set_blocking_kernel_timeout(-1);

The current timeout is set to 30 seconds.

For more information, see the 'Benchmarks that sync' section of the
NVBench documentation.

If this happens while profiling with an external tool,
pass the `--disable-blocking-kernel` flag or the `--profile` flag
(to also only run the benchmark once) to the executable.

For more information, see the 'Benchmark Properties' section of the
NVBench documentation.

Fail: Unexpected error: nvbench/nvbench/blocking_kernel.cu:124: Deadlock detected -- missing nvbench::exec_tag::sync? See stdout for details.
```

# Benchmark Results

## thrust_benchmark

### [0] Tesla V100-PCIE-16GB

No data -- check log.
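
For reference, here is a minimal sketch of what the log message seems to be asking for, not a confirmed fix. It makes two changes to the benchmark above: the `nvbench::exec_tag::sync` tag is passed to `state.exec`, and the Thrust call is given an execution policy bound to `launch.get_stream()` so the work runs on the stream NVBench measures. The benchmark name `thrust_benchmark_sync` is made up for illustration, and the sketch assumes the executable gets its `main` from NVBench (for example by linking `nvbench::main`), as the original file appears to do.

```cuda
#include <nvbench/nvbench.cuh>

#include <cuda_fp16.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

// Same squaring functor as in the original benchmark.
struct square
{
  __host__ __device__
  __half operator()(const __half& x) const {
    return x * x;
  }
};

// Hypothetical name, chosen only to distinguish it from the original benchmark.
void thrust_benchmark_sync(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f); // All elements = 2.0
  thrust::device_vector<__half> d_output(size);

  // exec_tag::sync tells NVBench that the launcher may synchronize,
  // which Thrust algorithms generally do before returning.
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
    // Run the transform on NVBench's stream rather than the default stream.
    thrust::transform(thrust::device.on(launch.get_stream()),
                      d_input.begin(), d_input.end(),
                      d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark_sync)
  .add_int64_power_of_two_axis("Elements", {24});
```

If this is the right diagnosis, the benchmark should run without `--disable-blocking-kernel`; I have not been able to confirm that on this setup.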