Is there any GEMM kernel that uses distributed shared memory?
I have profiled many Hopper GEMM kernels with Nsight Compute, but found no data transfer that uses distributed shared memory.
Examples I profiled: cutlass/examples/hopper_gemm
- 48_hopper_warp_specialized_gemm
- 49_hopper_gemm_with_collective_builder
- ...63_hopper_gemm_with_weight_prefetch
- ./python/CuTeDSL/hopper/*
- ./cute/tutorial/hopper/*
So I would like to know: does CUTLASS use distributed shared memory in any of its GEMM kernels?