
[QST] Do sm120 GEMM kernels support fp16/tf32 inputs? #2766

@JohnnyGo-x

Description

I noticed that the kernel generator python/cutlass_library/generator.py can't emit kernels with fp16/fp32 inputs for the sm120 architecture; only f8 is supported.

While sm100 kernels do support fp16/fp32 inputs, they can't run on sm120 GPUs directly. Are there any restrictions?

using cutlass3x_sm100_tensorop_gemm_f16_f16_f16_f16_f16_128x256x64_0x0x1_0_tnt_align8_1sm_epilogue =
    typename cutlass::epilogue::collective::CollectiveBuilder<
      cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
      cute::Shape<cute::_128, cute::_256, cute::_64>,  // CTA tile shape (MNK)
      cute::Shape<int, int, cute::_1>,                 // cluster shape
      cutlass::epilogue::collective::EpilogueTileAuto,
      cutlass::half_t, cutlass::half_t,                // accumulator, epilogue compute
      cutlass::half_t, cutlass::layout::RowMajor, 8,   // C: fp16, row-major, align 8
      cutlass::half_t, cutlass::layout::RowMajor, 8,   // D: fp16, row-major, align 8
      cutlass::epilogue::TmaWarpSpecialized1Sm,
      cutlass::epilogue::fusion::LinearCombination<
        cutlass::half_t,   // ElementOutput
        cutlass::half_t,   // ElementCompute
        cutlass::half_t,   // ElementSource
        cutlass::half_t    // ElementScalar
      >
    >::CollectiveOp;

The above shows part of a CUTLASS kernel designed for sm100. If I replace cutlass::arch::Sm100 with cutlass::arch::Sm120, compilation fails with the error "SM120 TmaWarpSpecialized builder currently only supports F8F6F4 MMA."
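
The error message names the MMA (mainloop) builder, so for completeness here is a minimal sketch of the corresponding mainloop CollectiveBuilder with the same Sm100 -> Sm120 substitution applied. The A/B layouts, alignments, tile and cluster shapes, and the auto stage-count/schedule tags below are assumptions chosen to mirror the epilogue above; instantiating this builder with fp16 operands is what trips the quoted static_assert.

#include "cutlass/gemm/collective/collective_builder.hpp"

// Sketch only: shapes/layouts assumed to match the tnt_align8 kernel above.
using CollectiveMainloop =
    typename cutlass::gemm::collective::CollectiveBuilder<
      cutlass::arch::Sm120, cutlass::arch::OpClassTensorOp,
      cutlass::half_t, cutlass::layout::RowMajor, 8,     // A: fp16, row-major ("t")
      cutlass::half_t, cutlass::layout::ColumnMajor, 8,  // B: fp16, column-major ("n")
      cutlass::half_t,                                   // accumulator
      cute::Shape<cute::_128, cute::_256, cute::_64>,    // CTA tile shape (MNK)
      cute::Shape<cute::_1, cute::_1, cute::_1>,         // cluster shape
      cutlass::gemm::collective::StageCountAuto,
      cutlass::gemm::collective::KernelScheduleAuto
    >::CollectiveOp;
// Compilation stops inside the SM120 builder with:
//   static_assert: "SM120 TmaWarpSpecialized builder currently only supports F8F6F4 MMA."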

So how can I generate kernels with fp16/fp32 inputs/accumulator/D on the sm120 architecture?
