coreylowman
Owner

Resolves #547
Related to #578

This PR does several things:

  1. Adds a workspace to the Cuda device so the conv kernel no longer has to re-allocate memory for patches
  2. Removes a memset(0) on the output of the conv operation
  3. Removes a memset(0) on the patches allocation
  4. No longer broadcasts filters in transpose_and_broadcast_filters
  5. Sets patches[i] = 0.0 for both unfold_input & unfold_output
  6. Uses the parallel stream to parallelize conv operations a bit
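
To illustrate why items 1, 3, and 5 fit together, here is a minimal CPU-side sketch in Rust. It is not the code from this PR: `Workspace`, its `get` and `capacity` methods, the single-channel layout, and this signature for `unfold_input` are all hypothetical names and shapes chosen for the example. It also assumes the explicit `0.0` writes cover padding positions. The idea is that a reused buffer holds stale data, so the unfold step has to write every patch element itself, and once it does, zeroing the buffer first is redundant.

```rust
/// Hypothetical scratch buffer that grows on demand and is reused across calls.
pub struct Workspace {
    buf: Vec<f32>,
}

impl Workspace {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Returns a slice of exactly `len` elements. Elements left over from an
    /// earlier call are NOT cleared, so the caller must write all of them.
    pub fn get(&mut self, len: usize) -> &mut [f32] {
        if self.buf.len() < len {
            // Only grows; never shrinks or re-allocates for smaller requests.
            self.buf.resize(len, 0.0);
        }
        &mut self.buf[..len]
    }

    /// Number of elements currently held by the workspace.
    pub fn capacity(&self) -> usize {
        self.buf.len()
    }
}

/// Unfolds a single-channel `h x w` image into patches for a `k x k` kernel.
/// Patch layout (an assumption of this sketch): [k1][k2][oh][ow].
/// Every element of `patches` is written, with 0.0 for padding positions,
/// so the buffer needs no memset beforehand.
pub fn unfold_input(
    img: &[f32],
    h: usize,
    w: usize,
    k: usize,
    stride: usize,
    pad: usize,
    patches: &mut [f32],
) -> (usize, usize) {
    assert!(h + 2 * pad >= k && w + 2 * pad >= k && stride > 0);
    let h_out = (h + 2 * pad - k) / stride + 1;
    let w_out = (w + 2 * pad - k) / stride + 1;
    assert_eq!(img.len(), h * w);
    assert_eq!(patches.len(), k * k * h_out * w_out);

    for k1 in 0..k {
        for k2 in 0..k {
            for oh in 0..h_out {
                for ow in 0..w_out {
                    let i = ((k1 * k + k2) * h_out + oh) * w_out + ow;
                    // Coordinates in the unpadded image; may be out of bounds.
                    let y = (oh * stride + k1) as isize - pad as isize;
                    let x = (ow * stride + k2) as isize - pad as isize;
                    let in_bounds =
                        y >= 0 && (y as usize) < h && x >= 0 && (x as usize) < w;
                    patches[i] = if in_bounds {
                        img[y as usize * w + x as usize]
                    } else {
                        0.0 // explicit zero replaces the up-front memset
                    };
                }
            }
        }
    }
    (h_out, w_out)
}
```

A caller would request `k * k * h_out * w_out` elements from the workspace with `get` and pass the returned slice straight to `unfold_input`. Repeated convolutions of the same size then reuse the same allocation.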

Timings of conv2d bench on A10:

| branch  | forward | backward |
|---------|---------|----------|
| main    | 9ms     | 52ms     |
| updates | 5ms     | 21.5ms   |

@coreylowman
Owner Author

FYI @opfromthestart. I think there's still a lot more work to do on unfolding, which is why I'm going to leave the issue open.

@coreylowman coreylowman merged commit b7a6b5f into main Mar 24, 2023
@coreylowman coreylowman deleted the conv-optims-v2 branch March 24, 2023 16:29
Development

Successfully merging this pull request may close these issues.

Add workspace for Cuda & Cpu devices
