Description
The enclosed Python API example, based on numba.cuda and the throughput.cu example, deadlocks in d00834c.
It deadlocks because the warm-up run done by state.exec is performed using the blocking kernel. Numba JIT-compiles the kernel on first use (i.e. during the warm-up), and loading the jitted kernel into the context causes a synchronization, and hence a deadlock.
Performing the warm-up explicitly before calling state.exec (replace `if False:` with `if True:` in the branch just before the launcher function definition) resolves the issue.
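To illustrate the mechanism in isolation, here is a small standalone sketch (independent of nvbench; copy_kernel and the sizes are made up for illustration): the first launch of a numba.cuda kernel pays the JIT-compilation and module-load cost, which is why an explicit warm-up launch before state.exec keeps that synchronizing step out of the measured/blocked region:

```python
import numpy as np
from numba import cuda


@cuda.jit
def copy_kernel(src, dst):
    i = cuda.grid(1)
    if i < src.size:
        dst[i] = src[i]


src = cuda.to_device(np.arange(1024, dtype=np.int32))
dst = cuda.device_array_like(src)

# First launch: triggers Numba JIT compilation and loads the resulting
# module into the CUDA context (the synchronizing step described above).
copy_kernel[4, 256](src, dst)
cuda.synchronize()

# Subsequent launches reuse the already-loaded module and do not
# synchronize for module loading.
copy_kernel[4, 256](src, dst)
cuda.synchronize()
```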
Python script source
```python
import sys
from collections.abc import Callable

import cuda.nvbench as nvbench
import numpy as np
from numba import cuda


def as_cuda_Stream(cs: nvbench.CudaStream) -> cuda.cudadrv.driver.Stream:
    return cuda.external_stream(cs.addressof())


def make_kernel(items_per_thread: int) -> Callable:
    @cuda.jit
    def kernel(stride: np.uintp, elements: np.uintp, in_arr, out_arr):
        tid = cuda.grid(1)
        step = cuda.gridDim.x * cuda.blockDim.x
        for i in range(stride * tid, stride * elements, stride * step):
            for j in range(items_per_thread):
                read_id = (items_per_thread * i + j) % elements
                write_id = tid + j * elements
                out_arr[write_id] = in_arr[read_id]

    return kernel


def throughput_bench(state: nvbench.State) -> None:
    stride = state.getInt64("Stride")
    ipt = state.getInt64("ItemsPerThread")

    nbytes = 128 * 1024 * 1024
    elements = nbytes // np.dtype(np.int32).itemsize

    alloc_stream = as_cuda_Stream(state.getStream())
    inp_arr = cuda.device_array(elements, dtype=np.int32, stream=alloc_stream)
    out_arr = cuda.device_array(elements * ipt, dtype=np.int32, stream=alloc_stream)

    state.addElementCount(elements, column_name="Elements")
    state.addGlobalMemoryReads(inp_arr.nbytes, column_name="Datasize")
    state.addGlobalMemoryWrites(inp_arr.nbytes)

    threads_per_block = 256
    blocks_in_grid = (elements + threads_per_block - 1) // threads_per_block

    krn = make_kernel(ipt)
    if False:
        # warm-up call ensures that kernel is loaded into context
        # before blocking kernel is launched
        krn[blocks_in_grid, threads_per_block, alloc_stream, 0](
            stride, elements, inp_arr, out_arr
        )

    def launcher(launch: nvbench.Launch):
        exec_stream = as_cuda_Stream(launch.getStream())
        krn[blocks_in_grid, threads_per_block, exec_stream, 0](
            stride, elements, inp_arr, out_arr
        )

    state.exec(launcher)


if __name__ == "__main__":
    b = nvbench.register(throughput_bench)
    b.addInt64Axis("Stride", [1, 2, 4])
    b.addInt64Axis("ItemsPerThread", [1, 2, 3, 4])
    nvbench.run_all_benchmarks(sys.argv)
```
The warm-up kernel is launched here:
nvbench/nvbench/detail/measure_cold.cuh
Lines 202 to 213 in d00834c
```cpp
void run_warmup()
{
  if (m_run_once)
  { // Skip warmups
    return;
  }

  kernel_launch_timer timer(*this);
  this->launch_kernel(timer);
  this->check_skip_time(m_cuda_timer.get_duration());
}
```
and the timer would use the blocking kernel, since the state was configured to use the blocking kernel in my example:
nvbench/nvbench/detail/measure_cold.cuh
Lines 131 to 145 in d00834c
```cpp
struct measure_cold_base::kernel_launch_timer
{
  kernel_launch_timer(measure_cold_base &measure)
      : m_measure{measure}
      , m_disable_blocking_kernel{measure.m_disable_blocking_kernel}
  {}

  __forceinline__ void start()
  {
    m_measure.flush_device_l2();
    m_measure.sync_stream();
    if (!m_disable_blocking_kernel)
    {
      m_measure.block_stream();
    }
```
We time the warm-up launch to decide whether to skip the associated measurement data or to record it.
If we run the warm-up with the blocking kernel disabled, this one measurement might become a little less accurate, but it would eventually be drowned out by the additional data. The upside is that running the warm-up without the blocking kernel would ensure that all required kernels are loaded into the context, and setting CUDA_MODULE_LOAD to EAGER may no longer be necessary.
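For reference, the CUDA_MODULE_LOAD workaround mentioned above roughly amounts to the following on the Python side (a minimal sketch; the variable is read by the CUDA driver during initialization, so it has to be set before the first numba.cuda call creates a context):

```python
import os

# Force eager module loading: modules are loaded into the context when
# they are created rather than on first launch. The variable is only
# honored if it is set before CUDA is initialized.
os.environ["CUDA_MODULE_LOAD"] = "EAGER"

# Import (and later initialize) CUDA only after the variable is in place.
from numba import cuda  # noqa: E402
```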