[wip][llvmgpu][codegen] Remove warp reduction pipeline #21433
Conversation
Signed-off-by: James Newling <james.newling@gmail.com>
Force-pushed from aa6508f to e908914.
@@ -207,7 +207,8 @@ func.func @main2(%arg0: tensor<2x130x130x4xf16>, %arg1: tensor<3x3x4x320xf16>, %

 // -----

-// We want to skip padding skinny matmul cases, since warpReduction is more performant for it.
+// We want to skip padding skinny matmul cases, since WarpReduction is more performant for it.
+// TODO(newling) reconsider this, as WarpReduction is removed.
We don't need to reconsider this; warp reduction is still a "strategy", just one that the LLVMGPUVectorDistribute pipeline can handle.
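(For reference, not part of this diff: sending one of these cases through the vector distribute pipeline amounts to selecting LLVMGPUVectorDistribute in the translation info. The sketch below is illustrative only; the exact attribute syntax is an assumption and varies across IREE versions.)

// Illustrative sketch only: attribute syntax is assumed and may not match IREE head.
#translation = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute workgroup_size = [64, 1, 1] subgroup_size = 64>
func.func @pinned_to_vector_distribute() attributes {translation_info = #translation} {
  return
}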
  #hal.pipeline.binding<storage_buffer>,
  #hal.pipeline.binding<storage_buffer>
]>
func.func @dynamic_softmax() {
We need this test to go through vector distribute.
I have a (not yet pushed) branch which can do this now. I tested the numerics, and they match the numerics at ToM (top of main) with WarpReduction... which are incorrect. Created an issue to track this: #21468
// -----

func.func @dynamic_parallel_dims(%dynsize : index, %input : tensor<4x?x4096xf16>) -> tensor<4x?xf32> {
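  // (Editorial sketch, not from the original diff: the deleted test body is not shown
  // above. A reduction with this signature would plausibly look like the following,
  // reducing the trailing 4096 dimension into f32 accumulators.)
  %cst = arith.constant 0.0 : f32
  %empty = tensor.empty(%dynsize) : tensor<4x?xf32>
  %fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<4x?xf32>) -> tensor<4x?xf32>
  %result = linalg.generic {
      indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                       affine_map<(d0, d1, d2) -> (d0, d1)>],
      iterator_types = ["parallel", "parallel", "reduction"]
    } ins(%input : tensor<4x?x4096xf16>) outs(%fill : tensor<4x?xf32>) {
  ^bb0(%in: f16, %out: f32):
    %ext = arith.extf %in : f16 to f32
    %sum = arith.addf %ext, %out : f32
    linalg.yield %sum : f32
  } -> tensor<4x?xf32>
  return %result : tensor<4x?xf32>
}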
None of these tests should be deleted; they need to go through vector distribute now.
%4 = hal.interface.constant.load layout(#pipeline_layout) ordinal(4) : i32
%5 = arith.index_castui %0 : i32 to index
%6 = arith.index_castui %1 : i32 to index
%7 = arith.index_castui %2 : i32 to index
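// (Editorial sketch, not from the original test: the casted indices above are
// presumably used as the dynamic dimensions of the dispatch tensors, roughly as
// follows; %binding stands in for an earlier hal.interface.binding.subspan.)
%8 = iree_tensor_ext.dispatch.tensor.load %binding, offsets = [0, 0], sizes = [%5, 4096], strides = [1, 1]
    : !iree_tensor_ext.dispatch.tensor<readonly:tensor<?x4096xf16>>{%5} -> tensor<?x4096xf16>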
We need to send this through vector distribute; this cannot be deleted.
%4 = iree_tensor_ext.dispatch.tensor.load %1, offsets = [0, 0], sizes = [32000, 4096], strides = [1, 1] : !iree_tensor_ext.dispatch.tensor<readonly:tensor<32000x4096xf16>> -> tensor<32000x4096xf16>
%5 = tensor.empty() : tensor<2x32000xf16>
%6 = linalg.fill ins(%cst : f16) outs(%5 : tensor<2x32000xf16>) -> tensor<2x32000xf16>
%7 = linalg.matmul_transpose_b ins(%3, %4 : tensor<2x4096xf16>, tensor<32000x4096xf16>) outs(%6 : tensor<2x32000xf16>) -> tensor<2x32000xf16>
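For background: with only 2 rows in the output, the matmul_transpose_b above is a skinny matmul, essentially one matvec per output row, which is why the warp reduction strategy favoured it. Written out as a generalized linalg.generic (a sketch for illustration, reusing the SSA values from the excerpt), the d2 = 4096 reduction dimension dominates:

// Generalized form of the matmul_transpose_b above (illustration only):
// C[d0, d1] += A[d0, d2] * B[d1, d2], with d2 = 4096 the reduction dimension.
%7 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                     affine_map<(d0, d1, d2) -> (d1, d2)>,
                     affine_map<(d0, d1, d2) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel", "reduction"]
  } ins(%3, %4 : tensor<2x4096xf16>, tensor<32000x4096xf16>) outs(%6 : tensor<2x32000xf16>) {
^bb0(%a: f16, %b: f16, %acc: f16):
  %mul = arith.mulf %a, %b : f16
  %add = arith.addf %mul, %acc : f16
  linalg.yield %add : f16
} -> tensor<2x32000xf16>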
These need to go through vector distribute. None of the tests in config_matvec can be deleted.
OK. I will push on PRs like #21430 to get this lit test (and others) working with vector distribute.
%5 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>],
    iterator_types = ["parallel", "reduction"]
  } ins(%2 : tensor<2x512xf32>) outs(%4 : tensor<2xf32>) {
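// (Editorial sketch, not part of the original test: the generic's body is cut off
// in this excerpt; for a sum reduction over the 512 dimension it would presumably
// continue along these lines.)
^bb0(%in: f32, %out: f32):
  %sum = arith.addf %in, %out : f32
  linalg.yield %sum : f32
} -> tensor<2xf32>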
Instead of deleting these tests, they need to be moved to the pipeline_vector_distribute file and checked to see whether we can handle them with vector distribute.
This PR is to see what the fallout is in the e2e CI when the warp reduction pipeline is removed.