
Warm-up run should not use blocking kernel #240

@oleksandr-pavlyk

Description

The enclosed Python API example, based on numba.cuda and the throughput.cu example, deadlocks at d00834c.

It deadlocks because the warm-up run performed by state.exec uses the blocking kernel. Numba JIT-compiles the kernel on first use (i.e., during the warm-up), and loading the JIT-compiled kernel into the context triggers a synchronization, which deadlocks against the blocking kernel.

Performing the warm-up explicitly before calling state.exec (replace if False: with if True: in the branch just before the launcher function definition) resolves the issue.
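The deadlock can be modeled schematically with two host threads (illustrative only; the real interaction is between nvbench's spinning blocking kernel and the context synchronization triggered by lazy module loading): the blocking kernel occupies the stream until the host releases it, while the lazy module load must wait for the stream to drain, so neither side can make progress.

```python
import threading

# Schematic model of the deadlock (names are illustrative, not CUDA APIs):
# - the "blocking kernel" spins until the host sets a release flag;
# - a lazy module load during the first kernel launch must wait for the
#   stream to drain, which cannot happen while the blocking kernel spins.
unblock = threading.Event()      # host-side flag the blocking kernel waits on
stream_idle = threading.Event()  # set once the stream has drained

def blocking_kernel():
    unblock.wait()               # spins until the host releases it
    stream_idle.set()            # stream drains only after release

def lazy_module_load(timeout):
    # Context synchronization: must wait for the stream to drain.
    return stream_idle.wait(timeout)

t = threading.Thread(target=blocking_kernel)
t.start()
# Host launches a not-yet-loaded kernel behind the blocking kernel:
loaded = lazy_module_load(timeout=0.1)  # times out: the stream never drains
print(loaded)                           # in the real program this is the hang
unblock.set()                           # (release so the model can exit)
t.join()
```

Launching the kernel once before the blocking kernel is resident (the explicit warm-up above) breaks the cycle, because the module load then completes on an idle stream.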

Python script source
import sys
from collections.abc import Callable

import cuda.nvbench as nvbench
import numpy as np
from numba import cuda


def as_cuda_Stream(cs: nvbench.CudaStream) -> cuda.cudadrv.driver.Stream:
    return cuda.external_stream(cs.addressof())


def make_kernel(items_per_thread: int) -> Callable:
    @cuda.jit
    def kernel(stride: np.uintp, elements: np.uintp, in_arr, out_arr):
        tid = cuda.grid(1)
        step = cuda.gridDim.x * cuda.blockDim.x
        for i in range(stride * tid, stride * elements, stride * step):
            for j in range(items_per_thread):
                read_id = (items_per_thread * i + j) % elements
                write_id = tid + j * elements
                out_arr[write_id] = in_arr[read_id]

    return kernel


def throughput_bench(state: nvbench.State) -> None:
    stride = state.getInt64("Stride")
    ipt = state.getInt64("ItemsPerThread")

    nbytes = 128 * 1024 * 1024
    elements = nbytes // np.dtype(np.int32).itemsize

    alloc_stream = as_cuda_Stream(state.getStream())
    inp_arr = cuda.device_array(elements, dtype=np.int32, stream=alloc_stream)
    out_arr = cuda.device_array(elements * ipt, dtype=np.int32, stream=alloc_stream)

    state.addElementCount(elements, column_name="Elements")
    state.addGlobalMemoryReads(inp_arr.nbytes, column_name="Datasize")
    state.addGlobalMemoryWrites(inp_arr.nbytes)

    threads_per_block = 256
    blocks_in_grid = (elements + threads_per_block - 1) // threads_per_block

    krn = make_kernel(ipt)

    if False:
        # warm-up call ensures that kernel is loaded into context
        # before blocking kernel is launched
        krn[blocks_in_grid, threads_per_block, alloc_stream, 0](
            stride, elements, inp_arr, out_arr
        )

    def launcher(launch: nvbench.Launch):
        exec_stream = as_cuda_Stream(launch.getStream())
        krn[blocks_in_grid, threads_per_block, exec_stream, 0](
            stride, elements, inp_arr, out_arr
        )

    state.exec(launcher)


if __name__ == "__main__":
    b = nvbench.register(throughput_bench)
    b.addInt64Axis("Stride", [1, 2, 4])
    b.addInt64Axis("ItemsPerThread", [1, 2, 3, 4])

    nvbench.run_all_benchmarks(sys.argv)
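As a sanity check on the launch configuration in the script, the derived sizes can be reproduced with plain integer arithmetic (standalone sketch; the itemsize of int32 is hard-coded as 4 here instead of querying np.dtype):

```python
# Standalone CPU check of the sizes the benchmark derives.
nbytes = 128 * 1024 * 1024                # 128 MiB input buffer
elements = nbytes // 4                    # np.dtype(np.int32).itemsize == 4
threads_per_block = 256
# Ceiling division: enough blocks to cover all elements.
blocks_in_grid = (elements + threads_per_block - 1) // threads_per_block
print(elements, blocks_in_grid)           # → 33554432 131072
```

Note that the output buffer additionally scales with ItemsPerThread (elements * ipt entries), so the largest axis value allocates four output buffers' worth of int32 data.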

The warm-up kernel is launched here:

void run_warmup()
{
  if (m_run_once)
  { // Skip warmups
    return;
  }
  kernel_launch_timer timer(*this);
  this->launch_kernel(timer);
  this->check_skip_time(m_cuda_timer.get_duration());
}

and the timer uses the blocking kernel, since the state was configured to use the blocking kernel in my example:

struct measure_cold_base::kernel_launch_timer
{
  kernel_launch_timer(measure_cold_base &measure)
      : m_measure{measure}
      , m_disable_blocking_kernel{measure.m_disable_blocking_kernel}
  {}

  __forceinline__ void start()
  {
    m_measure.flush_device_l2();
    m_measure.sync_stream();
    if (!m_disable_blocking_kernel)
    {
      m_measure.block_stream();
    }

We time the warm-up launch to decide whether to skip the associated measurement or record it.
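The skip decision can be sketched roughly as follows (a guess at the semantics behind the check_skip_time call above; the function name, argument names, and threshold direction are assumptions, not the actual nvbench implementation):

```python
# Hedged sketch of a skip-time check: a non-negative threshold is
# configured, and a run shorter than it causes the measurement to be
# skipped. Names and semantics here are assumed for illustration.
def should_skip(duration_s: float, skip_time_s: float) -> bool:
    return skip_time_s >= 0.0 and duration_s < skip_time_s

print(should_skip(1e-6, 1e-3), should_skip(5e-3, 1e-3))
```

The point of the proposal is only that this decision is fed by the warm-up timing, so the warm-up's timing accuracy matters less than its side effect of loading modules.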

If we run the warm-up with the blocking kernel disabled, this one measurement might become slightly less accurate, but the effect would eventually be drowned out by the additional data. The upside is that running the warm-up without the blocking kernel would ensure that all the required kernels are loaded into the context, and setting CUDA_MODULE_LOAD to EAGER may no longer be necessary.
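The proposed change can be modeled schematically (illustrative names only, not the real nvbench implementation): the warm-up launch behaves as if the blocking kernel were disabled, while timed runs keep the configured behavior.

```python
# Schematic model of the proposal: never use the blocking kernel for the
# warm-up run, so a first-use JIT/module load inside the launched kernel
# cannot deadlock against it. All names are illustrative.
class ColdMeasureModel:
    def __init__(self, disable_blocking_kernel: bool = False):
        self.disable_blocking_kernel = disable_blocking_kernel
        self.events = []

    def launch_kernel(self, is_warmup: bool = False):
        # Proposed rule: warm-up forces blocking off; timed runs honor config.
        use_blocking = not (self.disable_blocking_kernel or is_warmup)
        if use_blocking:
            self.events.append("block_stream")
        self.events.append("launch")

m = ColdMeasureModel(disable_blocking_kernel=False)
m.launch_kernel(is_warmup=True)   # warm-up: module load can sync safely
m.launch_kernel(is_warmup=False)  # timed run: blocking kernel as configured
print(m.events)
```

With this rule, the first (warm-up) launch carries no block_stream step, so the one-time module load during warm-up completes without waiting on a spinning kernel.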
