
Conversation

@coreylowman (Owner) commented Oct 12, 2022

This addresses some slowness in the conv2d implementation by leveraging matrix multiplication algorithms.

Here are some numbers from my local benchmarks with Conv2D<128, 256, 4> on Tensor4D<64, 128, 28, 28>:

|          | pytorch | pytorch single thread* | matrixmultiply | mkl-dynamic-seq | mkl-dynamic-iomp | old naive approach |
|----------|---------|------------------------|----------------|-----------------|------------------|--------------------|
| forward  | 163ms   | 400ms                  | 1597ms         | 524ms           | 382ms            | 2000ms             |
| backward | 293ms   | 795ms                  | 4100ms         | 1245ms          | 876ms            | 7714ms             |

There is probably still more work that can be done as far as reducing allocations/copying when creating the patches buffers used below.
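
To make the approach concrete, here is a minimal, illustrative sketch of an im2col-style conv2d forward pass (not the PR's actual code; the stride-1/no-padding simplification, the function names, and the naive matmul standing in for matrixmultiply/MKL are all assumptions for the example): each receptive field is unfolded into a column of a "patches" buffer, and a single GEMM then produces every output pixel.

```rust
/// Naive row-major matmul standing in for matrixmultiply/MKL sgemm:
/// c[m x n] += a[m x k] * b[k x n].
fn matmul(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += av * b[p * n + j];
            }
        }
    }
}

/// conv2d forward via im2col: stride 1, no padding, square kernel.
/// input is [c_in, h, w], weight is [c_out, c_in, kk, kk], both row-major.
fn conv2d_im2col(
    input: &[f32],
    weight: &[f32],
    c_in: usize, h: usize, w: usize, c_out: usize, kk: usize,
) -> Vec<f32> {
    let (oh, ow) = (h - kk + 1, w - kk + 1); // assumes h, w >= kk

    // "patches" buffer: [c_in * kk * kk, oh * ow]. Each input element is
    // copied up to kk*kk times; this is the extra allocation/copying the
    // PR description mentions wanting to reduce.
    let mut patches = vec![0.0f32; c_in * kk * kk * oh * ow];
    for c in 0..c_in {
        for ky in 0..kk {
            for kx in 0..kk {
                let row = (c * kk + ky) * kk + kx;
                for oy in 0..oh {
                    for ox in 0..ow {
                        patches[row * (oh * ow) + oy * ow + ox] =
                            input[c * h * w + (oy + ky) * w + (ox + kx)];
                    }
                }
            }
        }
    }

    // One big GEMM: [c_out, c_in*kk*kk] x [c_in*kk*kk, oh*ow] -> [c_out, oh*ow].
    let mut out = vec![0.0f32; c_out * oh * ow];
    matmul(weight, &patches, &mut out, c_out, c_in * kk * kk, oh * ow);
    out
}
```

The patches buffer is exactly where the trade-off lives: it materializes redundant copies of the input so the whole convolution collapses into one large, cache-friendly matrix multiply.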

@seddonm1

Did you run the benchmarks for the previous 'naive' implementation?

@coreylowman (Owner, Author)

> Did you run the benchmarks for the previous 'naive' implementation?

Just added - they are really slow lol. Didn't scale up the number of channels enough in my original benchmarks of convs, but it was a really simple algorithm to debug and test. All the unit tests made it really easy to verify that this new implementation worked! 😁

Comment on lines +85 to +96
#[cfg(not(feature = "cblas"))]
unsafe {
    // Pure-Rust path: matrixmultiply's sgemm with row-major strides
    // (row stride = number of columns, column stride = 1); beta = 1.0
    // accumulates into c.
    matrixmultiply::sgemm(
        m, k, n, 1.0, a, k as isize, 1, b, n as isize, 1, 1.0, c, n as isize, 1,
    )
}

#[cfg(feature = "cblas")]
unsafe {
    // CBLAS path (e.g. MKL): the same row-major sgemm, with dimensions
    // cast to C ints as the FFI signature requires.
    let (m, n, k) = (m as libc::c_int, n as libc::c_int, k as libc::c_int);
    sgemm(RowMajor, NoTr, NoTr, m, n, k, 1.0, a, k, b, n, 1.0, c, n)
}
@coreylowman (Owner, Author):

Noting that this is copied from the matmul implementation - using Cpu::mm requires casting the inputs to 2d arrays, which in turn requires generic_const_expr bounds added to the trait, which I wanted to avoid.
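
For contrast, a slice-based wrapper over the same call keeps all dimensions as runtime values, which is why no generic_const_expr bounds leak into the trait (a hedged sketch; gemm_rowmajor is a hypothetical helper name, not from the PR):

```rust
/// Hypothetical helper: row-major f32 GEMM over plain slices, c += a * b.
/// Because m, k, n are runtime values, callers need no const-generic bounds.
fn gemm_rowmajor(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    // Safety: lengths are checked above and the strides describe dense
    // row-major layouts, so all reads and writes stay in bounds.
    unsafe {
        matrixmultiply::sgemm(
            m, k, n,
            1.0,
            a.as_ptr(), k as isize, 1, // a: row stride k, col stride 1
            b.as_ptr(), n as isize, 1, // b: row stride n, col stride 1
            1.0,                       // beta = 1.0 accumulates into c
            c.as_mut_ptr(), n as isize, 1,
        );
    }
}
```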

@seddonm1

Damn, I'd hoped for some LLVM intrinsics magic that kept the naive way somewhat competitive! Looks like Fortran will be with us for a lot longer 😝

@coreylowman (Owner, Author)

> Damn, I'd hoped for some LLVM intrinsics magic that kept the naive way somewhat competitive! Looks like Fortran will be with us for a lot longer 😝

Hah yeah I was hoping that too. I'm pretty sure it was auto vectorizing the forward at least, but I think all the matmul stuff is suuuuper cache optimized.

@coreylowman merged commit da32b77 into main on Oct 15, 2022
@coreylowman deleted the conv-matmuls branch on October 15, 2022 at 17:30