What's Changed
- [fix] fix BatchAttention CTA_TILE_KV mask issue by @happierpig in #1206
- feat: enable and update all-reduce fused quantization by @yyihuang in #1164
- Fix the issue with auxiliary kernel launch and grid dim calculation by @Anerudhan in #1208
- Fix test_groupwise_scaled_gemm_fp8.py by @jinyangyuan-nvidia in #1211
- [TVM] Remove `enable_pdl` from TVM binding interface by @MasterJH5574 in #1217
- misc: minor adds in readme by @yyihuang in #1218
- bugfix: fix blackwell fmha hanging issue for empty kv_len by @yzh119 in #1198
- update trtllm-gen decode attention kernel launcher by @wenscarl in #1189
- Handle allocation cutlass fused MoE output to caller by @wenscarl in #1225
- Fix missing hash in the cudnn cubin path by @Anerudhan in #1227
- bugfix: add logits processor to pyproject.toml by @yzh119 in #1224
- fix: add trtllm-allreduce-fusion api notes and fix memory error by @yyihuang in #1229
- feat: Add non-causal cudnn prefill kernels by @Anerudhan in #1230
- minor: update oneshot handling, add params notes by @yyihuang in #1232
- Enable cudnn decode and add tests for the cudnn decode kernel by @Anerudhan in #1221
- docker: add cuda-python to CI docker image by @yzh119 in #1233
- bugfix: Fix building without `get_requires*()` invocation by @mgorny in #1226
- bugfix: support uint8_t for vec_t class template by @chenyang78 in #1234
- feat: trtllm-gen fp8 moe kernels by @aleozlx in #1212
- Patch fp8 cubin availability by @aleozlx in #1240
- [comm] TRT-LLM's Multi-Node NVLink All-Reduce Kernel by @nvmbreughe in #1213
- feat: Support MXFP8 x MXFP4 CUTLASS grouped GEMM by @jinyangyuan-nvidia in #1241
- feat: add trtllm-gen mla cubin by @yyihuang in #1222
- Add DeepGEMM kernels by @cyx-6 in #1209
- Remove sm100+ requirement for trtllm allreduce kernels by @yzh119 in #1249
- Defer mpi import for comm module by @yzh119 in #1250
- feat: support environment variable overrides for NVSHMEM paths and linker flags by @EmilienM in #1253
- release: bump version to v0.2.8 by @yzh119 in #1257
- TRT-LLM's Multi-Node NVLink AR + fused RMSNorm kernel by @nvmbreughe in #1255
New Contributors
- @jinyangyuan-nvidia made their first contribution in #1211
- @mgorny made their first contribution in #1226
- @chenyang78 made their first contribution in #1234
- @aleozlx made their first contribution in #1212
- @nvmbreughe made their first contribution in #1213
- @EmilienM made their first contribution in #1253
Full Changelog: v0.2.7.post1...v0.2.8