I've seen many kernels that call cutlass::arch::launch_dependent_grids() or cudaTriggerProgrammaticLaunchCompletion() at the very end of the kernel. Is this good practice?
I have a kernel with a heavy preamble that runs right after a CUTLASS GEMM kernel, so I want that preamble to overlap with the preceding GEMM.
In this case, wouldn't it be more efficient to trigger the dependent grid launch at the beginning of the GEMM kernel, so that the following kernel's preamble actually runs in parallel with the GEMM? What is the tradeoff between triggering PDL (programmatic dependent launch) early versus late in the primary kernel?
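For concreteness, here is a minimal sketch of the setup I have in mind. It is not my real code: primary_kernel stands in for the CUTLASS GEMM, secondary_kernel stands in for my kernel with the heavy preamble, and both names and the trivial bodies are hypothetical. I'm assuming sm_90 or newer and a CUDA toolkit that provides cudaLaunchKernelEx and the programmatic stream serialization attribute. As I understand it, cudaGridDependencySynchronize() in the secondary kernel waits until the primary grid has completed and its writes are visible, so the trigger position should only affect when the secondary grid may start being scheduled, not correctness.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the GEMM (primary) kernel.
__global__ void primary_kernel(float* out, int n) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // "Early" trigger: signal that the dependent grid may begin launching.
    // Moving this call to the end of the kernel gives the "late" variant.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * static_cast<float>(i);  // placeholder for the real work
}

// Hypothetical stand-in for the kernel with the heavy preamble (secondary).
__global__ void secondary_kernel(const float* in, float* out, int n) {
    // Preamble: must not read anything the primary kernel writes.
    __shared__ float scale;
    if (threadIdx.x == 0) scale = 0.5f;
    __syncthreads();

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Block here until the primary grid's results are available.
    cudaGridDependencySynchronize();
#endif
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale * in[i];
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Primary kernel: ordinary launch.
    primary_kernel<<<grid, block, 0, stream>>>(a, n);

    // Secondary kernel: opt in to programmatic dependent launch on the same stream.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = grid;
    cfg.blockDim = block;
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = attr;
    cfg.numAttrs = 1;

    cudaError_t err = cudaLaunchKernelEx(&cfg, secondary_kernel,
                                         static_cast<const float*>(a), b, n);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    std::printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

The only difference between the two variants I'm asking about is where the cudaTriggerProgrammaticLaunchCompletion() call sits inside primary_kernel; everything else stays the same.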