Benchmarking Thrust Call #248

@maksud

I am trying to benchmark a simple Thrust call, but I am running into an issue.

Below is my Thrust benchmark code (thrust_test.cu):

#include <nvbench/nvbench.cuh> // nvbench::state, nvbench::launch, NVBENCH_BENCH
#include <cuda_fp16.h>         // __half

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Functor for squaring a number
struct square
{
  __host__ __device__
  __half operator()(const __half& x) const {
    return x * x;
  }
};

void thrust_benchmark(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f); // All elements = 2.0
  thrust::device_vector<__half> d_output(size);

  state.exec([&](nvbench::launch &launch) {
    // Perform transform: output[i] = square(input[i])
    thrust::transform(d_input.begin(), d_input.end(), d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark)
  .add_int64_power_of_two_axis("Elements", {24});

I am getting the following output.

Command I ran: `./thrust_test --disable-blocking-kernel` (plain `./thrust_test` did not work)

# Devices

## [0] `Tesla V100-PCIE-16GB`
* SM Version: 700 (PTX Version: 700)
* Number of SMs: 80
* SM Default Clock Rate: 1380 MHz
* Global Memory: 15796 MiB Free / 16144 MiB Total
* Global Memory Bus Peak: 898 GB/sec (4096-bit DDR @877MHz)
* Max Shared Memory: 96 KiB/SM, 48 KiB/Block
* L2 Cache Size: 6144 KiB
* Maximum Active Blocks: 32/SM
* Maximum Active Threads: 2048/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: Yes

# Log

```
Run:  [1/1] thrust_benchmark [Device=0 Elements=2^24]
Pass: Cold: 0.099860ms GPU, 0.111693ms CPU, 0.50s total GPU, 0.68s total wall, 5008x

######################################################################
##################### Possible Deadlock Detected #####################
######################################################################

Forcing unblock: The current measurement appears to have deadlocked
and the results cannot be trusted.

This happens when the KernelLauncher synchronizes the CUDA device.
If this is the case, pass the `sync` exec_tag to the `exec` call:

    state.exec(<KernelLauncher>); // Deadlock
    state.exec(nvbench::exec_tag::sync, <KernelLauncher>); // Safe

This tells NVBench about the sync so it can run the benchmark safely.

If the KernelLauncher does not synchronize but has a very long
execution time, this may be a false positive. If so, disable this
check with:

    state.set_blocking_kernel_timeout(-1);

The current timeout is set to 30 seconds.

For more information, see the 'Benchmarks that sync' section of the
NVBench documentation.

If this happens while profiling with an external tool,
pass the `--disable-blocking-kernel` flag or the `--profile` flag
(to also only run the benchmark once) to the executable.

For more information, see the 'Benchmark Properties' section of the
NVBench documentation.

Fail: Unexpected error: nvbench/nvbench/blocking_kernel.cu:124: Deadlock detected -- missing nvbench::exec_tag::sync? See stdout for details.
```

# Benchmark Results

## thrust_benchmark

### [0] Tesla V100-PCIE-16GB

No data -- check log.
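
The log message suggests the likely cause: `thrust::transform` appears to synchronize before returning, which NVBench's blocking kernel interprets as a deadlock. Below is a minimal sketch of the change the message recommends, not a confirmed fix. It assumes the installed NVBench provides `nvbench::exec_tag::sync` as quoted in the log, and that `launch.get_stream()` can be passed to `thrust::device.on(...)`. Running the transform on NVBench's stream is an addition that the log does not ask for.

```cuda
#include <nvbench/nvbench.cuh>
#include <cuda_fp16.h>

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/transform.h>

// Functor for squaring a number (unchanged from the report above)
struct square
{
  __host__ __device__
  __half operator()(const __half& x) const {
    return x * x;
  }
};

void thrust_benchmark_sync(nvbench::state &state) {
  const auto size = state.get_int64("Elements");

  thrust::device_vector<__half> d_input(size, 2.0f); // All elements = 2.0
  thrust::device_vector<__half> d_output(size);

  // exec_tag::sync tells NVBench that the launcher synchronizes,
  // so it should not hold the stream with its blocking kernel.
  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch &launch) {
    // Assumption: run on NVBench's stream so the work is timed on the
    // same stream NVBench measures.
    thrust::transform(thrust::device.on(launch.get_stream()),
                      d_input.begin(), d_input.end(),
                      d_output.begin(), square());
  });
}

NVBENCH_BENCH(thrust_benchmark_sync)
  .add_int64_power_of_two_axis("Elements", {24});
```

With the `sync` tag in place, the `--disable-blocking-kernel` flag should not be needed. The benchmark name `thrust_benchmark_sync` is only illustrative.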
