Improve pipelining for AVX512 chunking #1821
Conversation
Walkthrough: The pull request introduces enhancements to memory copying functions in two files: arch/x86/chunkset_avx512.c and chunkset_tpl.h.
Force-pushed 273c1a6 to f0aed1e (Compare)
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
arch/x86/chunkset_avx512.c (1)
163-172: Consider applying the same branching strategy as CHUNKCOPY
While the current implementation is correct, it could potentially benefit from the same optimization strategy used in CHUNKCOPY by:
- Moving masked operations behind a branch for lengths < 16 bytes
- Using regular loads/stores for full 16-byte chunks
This would align with the PR's goal of improving instruction-level parallelism by reducing the frequency of masked operations in the common case.
Example approach:
```diff
 static inline uint8_t* HALFCHUNKCOPY(uint8_t *out, uint8_t const *from, unsigned len) {
     Assert(len > 0, "chunkcopy should never have a length 0");
     halfchunk_t chunk;
     uint32_t rem = len % sizeof(halfchunk_t);
-    if (rem == 0) {
-        rem = sizeof(halfchunk_t);
+
+    if (len < sizeof(halfchunk_t)) {
+        halfmask_t rem_mask = gen_half_mask(rem);
+        chunk = _mm_maskz_loadu_epi8(rem_mask, from);
+        _mm_mask_storeu_epi8(out, rem_mask, chunk);
+        return out + rem;
     }
-    halfmask_t rem_mask = gen_half_mask(rem);
-    chunk = _mm_maskz_loadu_epi8(rem_mask, from);
-    _mm_mask_storeu_epi8(out, rem_mask, chunk);
+    loadhalfchunk(from, &chunk);
+    rem = (rem == 0) ? sizeof(halfchunk_t) : rem;
+    storehalfchunk(out, &chunk);
     return out + rem;
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- arch/x86/chunkset_avx512.c (2 hunks)
- chunkset_tpl.h (0 hunks)
💤 Files with no reviewable changes (1)
- chunkset_tpl.h
🔇 Additional comments (1)
arch/x86/chunkset_avx512.c (1)
74-85: Excellent optimization for improving instruction-level parallelism!
The changes improve pipelining by:
- Moving masked operations behind a rarely taken branch (len < 32 bytes)
- Properly handling the remainder calculation with `rem = (rem == 0) ? sizeof(chunk_t) : rem`
- Using masked loads to prevent potential overreads
This aligns well with the PR objectives and should reduce pipeline stalls in the common case where len >= 32 bytes.
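As a concrete illustration of the pattern these bullets describe, here is a minimal, self-contained sketch. It assumes `chunk_t` is a 32-byte `__m256i` and uses `_bzhi_u32` as a stand-in for the repository's mask helper; the function name is hypothetical and this is not the exact code from arch/x86/chunkset_avx512.c.

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch of the branch-guarded copy: masked load/store only on the rarely
 * taken short path; plain full-width moves in the common case. */
static inline uint8_t *chunkcopy_sketch(uint8_t *out, const uint8_t *from, unsigned len) {
    uint32_t rem = len % sizeof(__m256i);

    if (len < sizeof(__m256i)) {
        /* Rare path: a byte mask covering exactly `len` bytes prevents any
         * read or write past the end of the buffers. */
        __mmask32 rem_mask = (__mmask32)_bzhi_u32(~0u, len);
        __m256i chunk = _mm256_maskz_loadu_epi8(rem_mask, from);
        _mm256_mask_storeu_epi8(out, rem_mask, chunk);
        return out + len;
    }

    /* Common path: unmasked 32-byte load/store, which overlaps with
     * neighboring moves better than the masked forms. */
    __m256i chunk = _mm256_loadu_si256((const __m256i *)from);
    rem = (rem == 0) ? sizeof(__m256i) : rem;
    _mm256_storeu_si256((__m256i *)out, chunk);
    return out + rem;
}
```

Building such a sketch requires AVX512BW, AVX512VL, and BMI2 support (e.g. `-mavx512bw -mavx512vl -mbmi2` with GCC/Clang).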
Force-pushed f0aed1e to 0c9c7be (Compare)
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
arch/x86/chunkset_avx512.c (1)
155-157: Consider adding documentation about alignment requirements
It would be helpful to add a comment explaining that this function operates on 16-byte chunks (half of AVX512's 32-byte registers) and any alignment requirements or assumptions.
```diff
 static inline uint8_t* HALFCHUNKCOPY(uint8_t *out, uint8_t const *from, unsigned len) {
     Assert(len > 0, "chunkcopy should never have a length 0");
     halfchunk_t chunk;
+    /* Handle copying using 16-byte (128-bit) chunks, which are half the size of
+     * full AVX512 32-byte registers. This is useful for smaller copies or when
+     * dealing with data that isn't 32-byte aligned. */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- arch/x86/chunkset_avx512.c (2 hunks)
- chunkset_tpl.h (0 hunks)
💤 Files with no reviewable changes (1)
- chunkset_tpl.h
🔇 Additional comments (2)
arch/x86/chunkset_avx512.c (2)
70-81: Excellent optimization for AVX512 pipelining!
The changes improve instruction-level parallelism by:
- Moving masked operations behind a rarely taken branch (len < 32 bytes)
- Using `_mm256_maskz_loadu_epi8` for safe boundary handling
- Properly handling remainder with `rem = (rem == 0) ? sizeof(chunk_t) : rem`
This implementation aligns perfectly with the PR's objective of improving pipelining for AVX512 chunking.
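For context, the byte masks these intrinsics consume are single-register values; one plausible shape for the mask helper (hypothetical here, the repository's gen_mask may differ) is:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical mask helper: set the low `len` bits of a 32-bit byte mask
 * (0 <= len <= 32). BZHI leaves the source unchanged when len >= 32, so a
 * full-chunk length yields an all-ones mask. */
static inline __mmask32 gen_mask_sketch(uint32_t len) {
    return (__mmask32)_bzhi_u32(~0u, len);
}
```

The 16-byte half-chunk variant would be the same idea with `__mmask16`.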
155-164: Clean implementation of half-chunk copying!
The changes follow the same optimization pattern as CHUNKCOPY, using masked operations efficiently for remainder handling. The simplified logic improves code maintainability while maintaining performance.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #1821      +/-   ##
===========================================
+ Coverage    32.18%   32.20%    +0.02%
===========================================
  Files           67       67
  Lines         5752     5745        -7
  Branches      1237     1239        +2
===========================================
- Hits          1851     1850        -1
- Misses        3644     3645        +1
+ Partials       257      250        -7
```

☔ View full report in Codecov by Sentry.
Before / After comparison (images)
And for added measure, here's with #1816, with the AVX512 inflate_fast family of functions commented out from the function table:
Dead2 left a comment:
LGTM
For reasons that aren't quite so clear, using the masked writes here did not pipeline very well. Either setting up the mask stalled things or masked moves have issues overlapping regular moves. Simply putting the masked moves behind a branch that is rarely taken seemed to do the trick in improving the ILP. While here, put masked loads behind the same branch in case there were ever a hazard for overreading.
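To make the before/after concrete, a schematic reconstruction of the always-masked shape being replaced (mirroring the removed lines in the HALFCHUNKCOPY diff above, not the exact prior code) looks like this:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical "before" shape: every call, even a full-width copy, pays for
 * mask generation plus masked load/store, which the description above says
 * did not pipeline well next to regular moves. */
static inline uint8_t *chunkcopy_always_masked(uint8_t *out, const uint8_t *from, unsigned len) {
    uint32_t rem = len % sizeof(__m256i);
    if (rem == 0)
        rem = sizeof(__m256i);
    __mmask32 rem_mask = (__mmask32)_bzhi_u32(~0u, rem);
    __m256i chunk = _mm256_maskz_loadu_epi8(rem_mask, from);
    _mm256_mask_storeu_epi8(out, rem_mask, chunk);
    return out + rem;
}
```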
Summary by CodeRabbit
New Features
- … `CHUNKCOPY_SAFE` and `GET_CHUNK_MAG`.

Bug Fixes
- … `CHUNKCOPY` and `HALFCHUNKCOPY` for correctly managing data copying when lengths are not multiples of chunk sizes.