Conversation

EikanWang (Owner)

No description provided.

goostavz and others added 30 commits August 7, 2023 09:53
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. We recommend giving them a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
cc @EikanWang. I'm disabling this for now since it broke with the H100
merge, but please feel free to fix the compilation errors and submit a
PR.
Also fixes a bug exposed in convertLayout lowering for float16. We
shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as this
instruction needs to take a 32-bit register. This also prevented
optimization at the LLVM IR level.
…on-lang#2040)

Make sure that other threads within the CTA do not operate on the mbarrier
until it is initialized by thread 0.
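
The general pattern here, sketched with Python threads as a CPU analogue (the mbarrier itself is a Hopper hardware object; this sketch only illustrates the init-before-use synchronization):

```
import threading

NUM_THREADS = 4                        # stand-in for threads in a CTA
ready = threading.Barrier(NUM_THREADS)
state = {}

def worker(tid: int):
    if tid == 0:
        state["mbarrier"] = "initialized"  # only thread 0 initializes
    ready.wait()   # nobody proceeds until initialization is complete
    assert state["mbarrier"] == "initialized"  # safe to operate now

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```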

Co-authored-by: Philippe Tillet <phil@openai.com>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`.
This change works with and without the changes from D156857. After
D156857 has landed, only camel case accessors work for ops that
implement the `OffsetSizeAndStrideOpInterface`.

https://reviews.llvm.org/D156857

Co-authored-by: Philippe Tillet <phil@openai.com>
We are interested in having python wheels for triton built for Linux
arm64 platforms, such as NVIDIA's Grace CPU.

This change is fairly simple, however:
- It requires a linux arm64 build of LLVM to be available (see MR here:
ptillet/triton-llvm-releases#15)
- For now my changes use the LLVM build hosted here:
https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17
- The Triton release process will need to be updated to include arm64
wheels. Is this something you have time to work on @ptillet? It would be
difficult for me to update this part without more access permissions.

With these changes, I managed to build a set of python wheels and have
hosted them here for us to use in the meantime:
https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
Co-authored-by: Philippe Tillet <phil@openai.com>
…r than Q's (triton-lang#2033)

Implemented this situation with and without a causal mask.
My implementation with a causal mask looks like:
111000
111100
111110
where only the upper-right triangular part is masked.
I added `P_SEQ` as the notation for the extra sequence length of KV.
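
For illustration, a minimal NumPy sketch that reproduces the mask diagram above (`M`, `N`, and the exact boundary convention are assumptions for this example, not from the PR):

```
import numpy as np

M, P_SEQ = 3, 3            # query length and extra KV sequence length
N = M + P_SEQ              # KV length
i = np.arange(M)[:, None]  # query positions
j = np.arange(N)[None, :]  # KV positions
mask = j < i + P_SEQ       # True = attend; upper-right triangle masked
print(mask.astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 0]]
```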

Co-authored-by: Philippe Tillet <phil@openai.com>
This allows the AOT client to tune the number of stages for the
generated kernel. Set the default number to 3 to match the Triton
compiler.
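
As a hedged sketch of what this tuning could look like through the Python compile entry point (the kernel, signature, and exact `triton.compile` kwargs here are assumptions and vary across Triton versions):

```
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

# num_stages controls software-pipelining depth; 3 matches the
# default used by the Triton compiler.
compiled = triton.compile(
    add_kernel,
    signature="*fp32,*fp32,*fp32,i32",
    constants={"BLOCK": 1024},
    num_stages=3,
)
```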
…in hopper tests (triton-lang#2041)

Co-authored-by: goostavz <gzhu@nvidia.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Allen Zhao <allzhao@nvidia.com>
Improve error messaging for block shape and value shape mismatch.
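
For context, a minimal made-up kernel of the kind that triggers this error (not code from the PR):

```
import triton
import triton.language as tl

@triton.jit
def bad_kernel(out_ptr):
    offs = tl.arange(0, 64)                  # block shape: [64]
    vals = tl.zeros([32], dtype=tl.float32)  # value shape: [32]
    tl.store(out_ptr + offs, vals)           # mismatch -> clearer error now
```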
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…m. (triton-lang#2068)

No functional changes intended, and it might slightly speed up the
build.

This allows a downstream Bazel build of Triton to avoid building a
number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size of all types in bits
is a multiple of 8. If it is not, it would return 0. This caused a bug
for the boolean (i1) type, where the reduction lowering would attempt to
use shared memory which was not assigned to the op.

Fix this issue by setting the number of bytes per element to
`ceil(bits / 8)`.
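
The arithmetic of the fix, illustrated in Python (the real change lives in Triton's C++ lowering; this only shows the rounding):

```
def bytes_per_element(bits: int) -> int:
    # Round up so sub-byte types like i1 still get at least one byte,
    # instead of truncating to 0 with integer division.
    return (bits + 7) // 8

assert bytes_per_element(1) == 1   # i1: was 0 before the fix
assert bytes_per_element(16) == 2  # f16
assert bytes_per_element(32) == 4  # f32
```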
libtriton.so is pretty large these days and hashing it is slow.
Switching the hash from md5 to sha1 shaves close to 300ms off the time
for me (as well as being a better hash, for whatever that's worth).

As far as I could tell, sha1 is the fastest stable hash in the Python
standard library, beating even things like zlib.crc32.
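
A minimal way to reproduce the comparison on your own libtriton.so (the path and this benchmark harness are assumptions, not code from the PR):

```
import hashlib
import time

def hash_file(path: str, algo: str) -> str:
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for algo in ("md5", "sha1"):
    t0 = time.perf_counter()
    hash_file("libtriton.so", algo)  # adjust path to your build
    print(algo, f"{time.perf_counter() - t0:.3f}s")
```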
Realised I could do this right after my first PR got merged. This saves
another 100ms.
…ng#2075)

Remove unnecessary skips. Decompose the UTs in
persistent-warp-specialized-gemm into vintage and stylish variants.
darkbuck and others added 29 commits August 16, 2023 01:18
…anches (triton-lang#2089)

- These minor fixes are not specific to interface changes from LLVM main
or the official llvm-17 branch, and can be applied on the triton main branch.
- https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch
has extra changes to build against the LLVM main branch, enabling me to
work on other backends on the main branch only. That's a hobby effort
and just FYR.
`offset + ptr` and `ptr + offset` both work now
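
A trivial made-up kernel showing both orderings compiling:

```
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, BLOCK: tl.constexpr):
    offset = tl.arange(0, BLOCK)
    x = tl.load(src_ptr + offset)  # ptr + offset: already worked
    y = tl.load(offset + src_ptr)  # offset + ptr: works after this fix
    tl.store(dst_ptr + offset, x + y)
```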
…g#2125)

Also stop promoting integer types, as it doesn't give better perf; this
will allow more vectorization opportunity in the future.
…#2019)

When using a Kaggle GPU (https://www.kaggle.com/), I found that
`ldconfig -p` does not show libcuda.so; `ldconfig` must be run (with
sudo) to refresh the cache before libcuda.so can be found.

Therefore, I added this informative message to help users find
libcuda.so.
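
A rough sketch of the kind of lookup and message involved (the function name and wording are illustrative, not Triton's actual driver code):

```
import subprocess

def find_libcuda() -> str:
    # `ldconfig -p` lists cached shared libraries; libcuda.so may be
    # missing until the cache is refreshed with `sudo ldconfig`.
    libs = subprocess.check_output(["ldconfig", "-p"]).decode()
    for line in libs.splitlines():
        if "libcuda.so" in line:
            return line.split("=>")[-1].strip()
    raise RuntimeError(
        "libcuda.so not found in the ldconfig cache. If the driver is "
        "installed, try running `sudo ldconfig` to refresh the cache.")
```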
Before this PR, whether `TritonGPUToLLVMIRPass` generates
NVVM-compatible or ROCDL-compatible LLVM was controlled by a boolean
`isROCM`. This method is hard to scale.
This PR changes it to use an enum instead, so that new targets can be
added easily when needed.
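
The idea, sketched in Python for brevity (the real pass option is C++; the names below are illustrative):

```
from enum import Enum, auto

class Target(Enum):
    NVVM = auto()   # was: isROCM = False
    ROCDL = auto()  # was: isROCM = True
    # A new backend becomes one more member instead of a second boolean.

def select_lowering(target: Target) -> str:
    if target is Target.NVVM:
        return "NVVM-compatible LLVM"
    if target is Target.ROCDL:
        return "ROCDL-compatible LLVM"
    raise ValueError(f"unsupported target: {target}")
```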

---------

Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
…on-lang#2136)

Add a new operation to make it possible to implement packed inline
assembly for elementwise operations. This way, inline assembly can be
used to control elementwise operations. It also allows packing elements
so that operations can be manually vectorized.
…rn (triton-lang#2137)

Use getEffect instead to tell passes whether the op has side effects or
not. This doesn't change functionality otherwise.
Co-authored-by: Philippe Tillet <phil@openai.com>
Replace the Turing version of the dot operation: instead of following
the Volta version, it now follows the Ampere version.

Update the code generator to produce two m16n8k8 MMAs for Turing instead
of the one m16n8k16 MMA we have for Ampere.
Disable tf32 if running on sm75 and below.
Fix the pattern match that compares the generated PTX when running on
sm75.
`if _unwrap_if_constexpr(cond)` then entering `node.body` is wrong when
cond is a tensor, since we cannot statically evaluate a dynamic tensor's
value (see the sketch after the list below).

The right way to solve the problem is probably:

1. Visit the AST of the IfExp (do not build IRs)
2. Get the type of the last statement
3. Initialize the return value and assign it to the livein
4. Call visit_If
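
A minimal made-up kernel showing the problematic case:

```
import triton
import triton.language as tl

@triton.jit
def select_kernel(x_ptr, out_ptr):
    x = tl.load(x_ptr)
    # `x > 0` is a tensor, not a constexpr: its value exists only at
    # runtime, so the frontend must lower both branches of the IfExp
    # rather than statically picking `node.body`.
    y = 1.0 if x > 0 else -1.0
    tl.store(out_ptr, y)
```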
…-lang#2143)

Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on a bitcode (bc) file.
)

For the warp-specialized persistent kernel, the instruction sequences
for the warp groups are
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K is smaller than the K tile shape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will always be 0 in this case, and it may form a W0 -> W1 ->
R0 -> R1 order, which contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
…n-lang#2135)

1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split the TMA store block into multiple slices of size 64x64
(sketched below).
3. Distribute the TMA store across all the warps.
4. Fix some naming issues.
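
A sketch of the 64x64 slicing from point 2, in Python for brevity (the function name and return shape are illustrative, not the PR's code):

```
def tma_store_slices(M: int, N: int, tile: int = 64):
    """Yield (row, col, rows, cols) for each 64x64 slice of an MxN store."""
    for m0 in range(0, M, tile):
        for n0 in range(0, N, tile):
            yield m0, n0, min(tile, M - m0), min(tile, N - n0)

# e.g. a 128x192 store becomes 2x3 = 6 slices that can be
# distributed across the warps
print(list(tma_store_slices(128, 192)))
```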
EikanWang marked this pull request as draft August 22, 2023 05:29