What is your question?
Hi everyone, I modified the example to run in JIT mode, i.e., gemm(mA, mB, mC, stream), but I observe cache misses across different processes.
In fact, stream is only used at kernel launch and does not affect compilation. Using a placeholder for it in the mangled name might help.
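To illustrate the idea, here is a minimal, self-contained sketch. It does not reflect the actual logic in dsl.py: `FakeStream`, `naive_key`, and `stable_key` are hypothetical names, and the assumption is that the cache key is derived from the argument values, so that a per-process stream handle ends up in the key.

```python
import hashlib


class FakeStream:
    """Hypothetical stand-in for a CUDA stream handle (not a CuTeDSL type)."""

    def __init__(self, handle: int):
        self.handle = handle  # raw handle value, typically differs per process


def _digest(parts):
    # Hash the joined parts into a short, deterministic key.
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


def naive_key(func_name, args):
    # Assumed behavior: the stream's handle value leaks into the key.
    parts = [func_name]
    for a in args:
        if isinstance(a, FakeStream):
            parts.append(f"stream@{a.handle:#x}")
        else:
            parts.append(repr(a))
    return _digest(parts)


def stable_key(func_name, args):
    # Proposed behavior: the stream contributes only a fixed placeholder,
    # since it is a launch-time argument and does not change the generated code.
    parts = [func_name]
    for a in args:
        parts.append("<stream>" if isinstance(a, FakeStream) else repr(a))
    return _digest(parts)


# Same tensor descriptions, but a different stream handle in each "process".
tensors = [("f16", (128, 64)), ("f16", (64, 256)), ("f32", (128, 256))]
args_proc1 = tensors + [FakeStream(0x7F10A000)]
args_proc2 = tensors + [FakeStream(0x7F55C000)]

print(naive_key("gemm", args_proc1) == naive_key("gemm", args_proc2))
print(stable_key("gemm", args_proc1) == stable_key("gemm", args_proc2))
```

With the placeholder, both processes would derive the same key and could share one cached compilation, while arguments that do affect code generation (dtypes, shapes) still contribute to the key.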
Do you have any suggestions? Thanks!
http://github.com/NVIDIA/cutlass/blob/bd96096d58e4886e204cd1d71a385ca73e7719b8/examples/python/CuTeDSL/hopper/dense_gemm.py#L381
http://github.com/NVIDIA/cutlass/blob/bd96096d58e4886e204cd1d71a385ca73e7719b8/python/CuTeDSL/cutlass/base_dsl/dsl.py#L555