
Optimizing conv kernels a bit #605


Merged: 9 commits merged into main on Mar 24, 2023

Conversation

coreylowman (Owner):

Resolves #547
Related to #578

This PR does several things:

  1. Adds a workspace to the Cuda device so the conv kernels don't have to re-allocate memory for patches
  2. Removes a memset(0) on the output of the Conv operation
  3. Removes a memset(0) on the patches allocation
  4. No longer broadcasts filters in transpose_and_broadcast_filters
  5. Sets patches[i] = 0.0 inside both unfold_input & unfold_output
  6. Uses a parallel stream to parallelize conv operations a bit
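The workspace in change 1 can be sketched as a grow-only scratch allocation that is reused across conv calls. This is a minimal illustrative sketch in plain Rust, not dfdx's actual API; the `Workspace` type and `alloc` method are hypothetical names, and a `Vec<f32>` stands in for a device allocation.

```rust
// Hypothetical sketch of a grow-only workspace buffer. A Vec<f32> stands in
// for a CUDA device allocation; the point is that repeated conv calls reuse
// one allocation for `patches` instead of allocating every time.
struct Workspace {
    buf: Vec<f32>,
}

impl Workspace {
    fn new() -> Self {
        Workspace { buf: Vec::new() }
    }

    /// Return a scratch slice of at least `len` elements, growing the
    /// underlying allocation only when the request exceeds what we hold.
    fn alloc(&mut self, len: usize) -> &mut [f32] {
        if self.buf.len() < len {
            // One grow covers all later requests of this size or smaller.
            self.buf.resize(len, 0.0);
        }
        &mut self.buf[..len]
    }
}

fn main() {
    let mut ws = Workspace::new();
    let first = ws.alloc(1024).len();
    let second = ws.alloc(512).len(); // reuses the existing allocation
    println!("{} {}", first, second);
}
```

A real device workspace would also need synchronization so two kernels don't use the scratch space concurrently, which is presumably why it lives on the device handle.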

Timings of the conv2d bench on an A10:

branch    forward    backward
main      9ms        52ms
updates   5ms        21.5ms

That's roughly a 1.8x speedup on the forward pass and a 2.4x speedup on the backward pass.
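Changes 2, 3, and 5 above trade separate memset(0) passes for writing zeros inline during unfolding: every element of the patches buffer gets written exactly once, so no pre-zeroing is needed. A simplified 1-D illustration (not dfdx's kernel code; `unfold_1d` is a hypothetical name, and the real kernels are CUDA with strides, padding, and dilation):

```rust
// Illustrative sketch of skipping the memset: the unfold pass writes 0.0
// itself for out-of-bounds taps, so the patches buffer never needs to be
// zero-initialized up front. Layout is patches[i * kernel + k].
fn unfold_1d(input: &[f32], kernel: usize, patches: &mut [f32]) {
    let n = input.len();
    for i in 0..n {
        for k in 0..kernel {
            let src = i + k;
            // Every slot is written exactly once: either an input value
            // or an explicit zero for taps past the end of the input.
            patches[i * kernel + k] = if src < n { input[src] } else { 0.0 };
        }
    }
}

fn main() {
    let input = [1.0, 2.0, 3.0];
    // NaN stands in for uninitialized memory to show no slot is skipped.
    let mut patches = vec![f32::NAN; 3 * 2];
    unfold_1d(&input, 2, &mut patches);
    println!("{:?}", patches);
}
```

On the GPU this saves a full pass over the buffer, which matters when the unfold kernel already touches every element anyway.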

coreylowman (Owner, Author):

FYI @opfromthestart. I think there's still a lot more work to do with unfolding, which is why I'm going to leave the issue open.

@coreylowman coreylowman merged commit b7a6b5f into main Mar 24, 2023
@coreylowman coreylowman deleted the conv-optims-v2 branch March 24, 2023 16:29

Successfully merging this pull request may close this issue: Add workspace for Cuda & Cpu devices