Tags: cifkao/pytorch
Update on "Fuse row-wise sharded linear matmul to increase perf."

Instead of looping through the shards and performing a separate matmul for each, we can perform a single matmul so that only one CUDA kernel is launched for this operation.

Differential Revision: [D36743354](https://our.internmc.facebook.com/intern/diff/D36743354/)

[ghstack-poisoned]
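A minimal sketch of the fusion idea described above (not the actual ShardedTensor code path in pytorch/pytorch; the tensor names, shard sizes, and shapes here are assumptions for illustration). Because a row-wise sharded linear computes `sum_i x_i @ w_i.T` over shards split along the reduction dimension, the per-shard loop can be collapsed into one matmul by concatenating along that dimension:

```python
import torch

N, out_features = 8, 16
shard_sizes = [4, 4, 8]  # hypothetical per-shard slices of in_features

# Hypothetical local input slices and weight shards held by one rank.
inputs = [torch.randn(N, k) for k in shard_sizes]
weight_shards = [torch.randn(out_features, k) for k in shard_sizes]

# Looped version: one matmul (and one kernel launch) per shard.
looped = sum(x @ w.t() for x, w in zip(inputs, weight_shards))

# Fused version: concatenate along the reduction dim, launch a single matmul.
fused = torch.cat(inputs, dim=1) @ torch.cat(weight_shards, dim=1).t()

assert torch.allclose(looped, fused, atol=1e-5)
```

The two are mathematically equivalent since `[x_1 ... x_m] @ [w_1 ... w_m].T = sum_i x_i @ w_i.T`; the fused form simply trades m kernel launches for one larger GEMM.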
Merge branch 'master' of https://github.com/pytorch/pytorch into laguerre-polynomial-l
Move THPStorage definitions out of `torch/csrc/generic`
[ROCm] TestGradients: Enable grad and gradgrad

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>