Improve chunkset_avx2 performance #1778
Conversation
Force-push: a1e8227 to 242be08
Force-push: 6b5bf9a to ebc0ef6
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #1778    +/-   ##
===========================================
- Coverage    33.26%   33.03%   -0.23%
===========================================
  Files           66       66
  Lines         5481     5555      +74
  Branches      1222     1227       +5
===========================================
+ Hits          1823     1835      +12
- Misses        3399     3460      +61
- Partials       259      260       +1

View full report in Codecov by Sentry.
Force-push: ebc0ef6 to 7a1871d
I tested current develop (with split chunkcopy_safe) with and without this PR on an i9-9900K.
Develop:
This PR:
Pretty nice improvement to instructions/cycle.
Benchmark of this PR without #1776: (Correct build used)
Develop:
This PR:
For some reason the improvements were much smaller here, only 0.28% faster inflate on average. The difference in speeds seems to be data-dependent, not cacheline aliasing or anything like that.
That's more or less what I'd expect. Certain data streams might hit the chunk_memcpy path harder than others; it just depends on what the decompression stream presents. If you had a sequence of something like 17 or so bytes behind the destination pointer that repeated, I think you end up dispatching to plain old chunkcopy more in chunk_memcpy now that there are smaller unit sizes. The inlining alone in my testing bought us like 3%, even without smaller chunks to work with. Being able to share the last len byte copy loop shrank some of the code as well. I have plans to make the GET_CHUNK_MAG code better for VBMI platforms as well, but I'm still waiting on hardware for that.
Can this strategy also be used for AVX512?
It certainly could (with a quarter size chunk being represented with …). With Icelake and VBMI2, that chunk magazine logic gets quite a bit simpler since they finally allowed cross lane single byte permutes. I plan to go back and add that in for 256 bit vector lengths once I can actually test it. That might also be a good time to revisit 64 byte chunks to see if that juice is worth the squeeze.
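For reference, a minimal sketch of what that single cross-lane byte permute could look like on a VBMI-capable part. The function name, the index layout, and the `dist < 32` assumption are mine; this is not zlib-ng's actual implementation.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical sketch: with AVX512VBMI + AVX512VL, one vpermb builds the
 * repeated dist-byte pattern across all 32 bytes, avoiding the cross-lane
 * permute + blend synthesis that plain AVX2 needs. Assumes 1 <= dist < 32
 * and that buf has at least 32 readable bytes. */
static inline __m256i get_chunk_mag_vbmi_sketch(const uint8_t *buf, uint32_t dist) {
    uint8_t idx[32];
    for (int i = 0; i < 32; ++i)
        idx[i] = (uint8_t)(i % dist);            /* output byte i copies source byte i % dist */
    __m256i perm = _mm256_loadu_si256((const __m256i*)idx);
    __m256i src  = _mm256_loadu_si256((const __m256i*)buf);
    return _mm256_permutexvar_epi8(perm, src);   /* single cross-lane byte permute */
}
```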
```c
    /* Only AVX2 */
#ifdef HAVE_HALF_CHUNK
    if (len <= sizeof(halfchunk_t)) {
```
Why wouldn't we just call chunkmemset_ssse3? Is it because it is inlined?
Correct. And we're able to shrink the code size by merging common elements
The inlining bought us a lot here by itself and by allowing for a half chunk length we get the best of both worlds.
Force-push: 7a1871d to c8febc6
I did another benchmark with this PR applied to develop before #1776 and added it to the post above, so it is easier to compare them.
Which version of gcc are you using? I feel like chunk_copy_safe has been inlining for me since at least gcc 11, but I'll try a myriad of compilation options and compilers to be sure. The earliest I can go back is probably 11 or 12.
I don't think it has ever been inlining for me with the default CMake config. I remember disabling -Winline a couple of years ago because it was so noisy, then re-discovered it this summer when we were having a hard look at optional warnings. I get exactly the same behavior with GCC 12.2 from Debian on AArch64 as I do with GCC 13.2.1 from Fedora on x86-64. You don't compile using custom flags or something? -O3 for example? That always seems to slow things down for me though. PS: Updated the benchmarks above, so now they are the correct ones. Hardly any difference from the incorrect build though.
I do build with O3 and that might be the difference. Let me mess around with more stock flags and see if I see the same inlining behavior.
As far as I know and remember, …
I don't think that's an issue as the minimum subset of extensions on all x86-64 CPUs contains sse3.
Yep, so I confirmed it's -O3 that allows this to be inlined (with that commit reverted) and in my testing, it's a significant improvement.
Problem is, -O3 is not the default for most distros, so we cannot rely on that being the case.
Right, but apart from manually forcing the same decisions the compiler is doing with O3, I don't know a better way to get that performance back. Without reverting 6b8efe7, it's a net loss for me.
I am open to reverting that change again, but only if we can find a good way that also improves performance on an -O2 compile.
Here's the comparison of O2 and O3, with 6b8efe7 reverted:
Here's O2 with and without the revert compared:
Couldn't we just alter our CMake configuration to not override O3 with O2 for release? I don't know how many distros override these things to begin with, but I imagine most don't try to override the default build arguments. For me, the difference between O2 and O3 is ~7%, and the work to split chunk copy so that it partially inlines hurts the O3 performance by another 2ish%.
I would be in favor of forcing CMake/configure to use -O3 or /O3 for this project. |
Force-push: 4ddaef4 to d2a9451
Arch Linux actually tends to use None as the build type, so maybe the maintainer should be contacted there.
Walkthrough
The pull request includes changes across multiple files focused on memory chunk handling in ARM and x86 architectures.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant ChunkHandler as Chunk Handler
participant Memory as Memory Operations
User->>ChunkHandler: Request chunk operation
ChunkHandler->>Memory: Perform chunk memory set
Memory-->>ChunkHandler: Return success
ChunkHandler-->>User: Operation completed
Possibly related PRs
Suggested reviewers
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)
chunkset_tpl.h (1)
197-210: Consider loop unrolling for performance optimization
In the loop starting at line 197, unrolling could reduce the loop overhead for large values of `len`. This can be particularly beneficial in performance-critical code paths. Consider unrolling the loop:

```c
while (len >= (4 * sizeof(chunk_t))) {
    storechunk(out, &chunk_load);
    storechunk(out + adv_amount, &chunk_load);
    storechunk(out + 2 * adv_amount, &chunk_load);
    storechunk(out + 3 * adv_amount, &chunk_load);
    out += 4 * adv_amount;
    len -= 4 * adv_amount;
}
```

Adjust the unrolling factor based on benchmarking results to find the optimal balance between code size and performance gain.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- arch/arm/chunkset_neon.c (0 hunks)
- arch/x86/chunkset_avx2.c (5 hunks)
- arch/x86/chunkset_sse2.c (0 hunks)
- arch/x86/chunkset_ssse3.c (0 hunks)
- chunkset_tpl.h (3 hunks)
- inffast_tpl.h (1 hunks)
💤 Files with no reviewable changes (3)
- arch/arm/chunkset_neon.c
- arch/x86/chunkset_sse2.c
- arch/x86/chunkset_ssse3.c
🔇 Additional comments (10)
inffast_tpl.h (1)
263-263: Potential performance optimization - please clarify safety implications
The modification of the `CHUNKMEMSET_SAFE` function call by removing the `+ 1` from `(unsigned)((safe - out) + 1)` could potentially lead to performance improvements, especially for smaller chunk sizes. This aligns with the PR objectives. However, I have some concerns:
- Safety: Can you please clarify the safety implications of this change? We need to ensure that removing the `+ 1` doesn't introduce any risk of buffer overflows or other memory-related issues.
- Testing: Have you conducted thorough testing with various data sets and chunk sizes to verify that this change consistently improves performance without introducing any regressions or edge cases?
- Documentation: If this change is intentional and proven safe, please update the function's documentation to reflect this modification and explain the reasoning behind it.
To help verify the impact of this change, could you run the following benchmarks and provide the results?
This script will help us understand the performance impact across different input sizes, which is crucial given the PR's focus on improving performance for smaller chunk sizes.
arch/x86/chunkset_avx2.c (8)
12-12: Consistent Typedef for `halfchunk_t`
Defining `halfchunk_t` as `__m128i` aligns with the existing `chunk_t` typedef and maintains consistency in the codebase.
55-58: Initialization of `half_rem_vals` Array
The `half_rem_vals` array is correctly initialized to support half-chunk operations, ensuring proper indexing and functionality for distances less than the chunk size.
77-80: Correct Implementation of `chunkmemset_16` Function
The `chunkmemset_16` function efficiently broadcasts a 128-bit value into a 256-bit AVX2 register using `_mm256_broadcastsi128_si256`, enabling effective handling of 16-byte chunks.
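A minimal illustration of that broadcast pattern (the function name here is a stand-in for illustration, not the exact zlib-ng code):

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch: replicate a 16-byte unit into both 128-bit lanes of a 256-bit
 * chunk with a single vbroadcasti128. Assumes `from` has 16 readable bytes. */
static inline __m256i chunkmemset_16_sketch(const uint8_t *from) {
    __m128i half = _mm_loadu_si128((const __m128i*)from);
    return _mm256_broadcastsi128_si256(half);
}
```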
Line range hint 112-127: Optimized Logic in `GET_CHUNK_MAG` Function
The modifications in the `GET_CHUNK_MAG` function streamline the handling of distances less than 16. The use of intrinsics and permutation vectors is appropriate and enhances performance for smaller distances.
128-145: Implementation of `halfchunkmemset_*` Functions
The `halfchunkmemset_2`, `halfchunkmemset_4`, and `halfchunkmemset_8` functions correctly use `memcpy` and appropriate intrinsics (`_mm_set1_epi16`, `_mm_set1_epi32`, `_mm_set1_epi64x`) to safely broadcast values into 128-bit vectors. This ensures efficient handling of small data sizes with potentially unaligned input.
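For illustration, a hedged sketch of the 4-byte variant following the pattern described above; the function name and exact shape are assumptions, not the zlib-ng source:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Sketch: broadcast a 4-byte repeating unit into a 128-bit half chunk.
 * memcpy performs a safe unaligned read that the compiler folds into a
 * plain load; _mm_set1_epi32 then replicates it across the vector. */
static inline __m128i halfchunkmemset_4_sketch(const uint8_t *from) {
    int32_t unit;
    memcpy(&unit, from, sizeof(unit));
    return _mm_set1_epi32(unit);
}
```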
146-152: Accurate Loading and Storing with `loadhalfchunk` and `storehalfchunk`
The `loadhalfchunk` and `storehalfchunk` functions correctly utilize unaligned load and store intrinsics (`_mm_loadu_si128` and `_mm_storeu_si128`), ensuring safe operations regardless of data alignment.
160-171: Proper Implementation of `GET_HALFCHUNK_MAG` Function
The `GET_HALFCHUNK_MAG` function is correctly implemented to handle half-chunk magnitudes for distances less than the chunk size. The use of `_mm_shuffle_epi8` with appropriate permutation vectors efficiently rearranges bytes for the required output.
154-158: ⚠️ Potential issue: Unused Static Function `halfchunk2whole`
The function `halfchunk2whole` appears to be defined but not used within this file or elsewhere in the codebase. This might indicate that it's unnecessary or was intended for future use. Run the following script to check for usages of `halfchunk2whole` in the codebase:

chunkset_tpl.h (1)
90-107: HALFCHUNKCOPY function implementation looks correct
The `HALFCHUNKCOPY` function efficiently handles copying data in half-chunk sizes. The logic is sound, and the implementation aligns with the intended functionality for systems where `HAVE_HALF_CHUNK` is defined.
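A rough sketch of the copy pattern being discussed, written with raw intrinsics and assumed to mirror the existing CHUNKCOPY shape (align the first store so later stores land on the tail); this is not the exact zlib-ng code:

```c
#include <immintrin.h>
#include <stdint.h>

/* Sketch: copy len bytes in 16-byte stores. The first store may write up to
 * a full half chunk, so callers must guarantee slack past out + len, as with
 * the full-size CHUNKCOPY. Assumes len > 0. */
static inline uint8_t *halfchunkcopy_sketch(uint8_t *out, const uint8_t *from, unsigned len) {
    unsigned align = ((len - 1) % sizeof(__m128i)) + 1;   /* 1..16 */
    _mm_storeu_si128((__m128i*)out, _mm_loadu_si128((const __m128i*)from));
    out += align; from += align; len -= align;            /* remaining len is a multiple of 16 */
    while (len > 0) {
        _mm_storeu_si128((__m128i*)out, _mm_loadu_si128((const __m128i*)from));
        out += sizeof(__m128i); from += sizeof(__m128i); len -= sizeof(__m128i);
    }
    return out;
}
```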
nmoinvaz
left a comment
Code looks clean to me.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
arch/x86/chunkset_avx2.c (3)
55-58: LGTM: New static array for half chunk remainder values
The addition of `half_rem_vals` is a good optimization for the new half chunk operations. It should help reduce computation time in the `GET_HALFCHUNK_MAG` function. Consider adding a brief comment explaining the purpose and derivation of these values for better maintainability.
Line range hint 112-127: LGTM: Simplified logic in GET_CHUNK_MAG function
The modifications to `GET_CHUNK_MAG` align well with the PR objective of simplifying the "half chunk mag" process. The removal of the specific `dist == 16` case suggests a more streamlined approach, which should contribute to better performance. Consider adding a brief comment explaining why the `dist == 16` case is no longer needed, to help future maintainers understand the rationale behind this change.
146-158: LGTM: Essential half chunk operations added
The new functions `loadhalfchunk`, `storehalfchunk`, and `halfchunk2whole` provide crucial operations for working with half chunks. These additions are consistent with the PR's goal of improving performance through half chunk awareness. In the comment for `halfchunk2whole`, consider clarifying the specific memory sanitizer issues this addresses, if known. This could help future maintainers understand the rationale behind the zero extension.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- arch/arm/chunkset_neon.c (0 hunks)
- arch/x86/chunkset_avx2.c (5 hunks)
- arch/x86/chunkset_sse2.c (0 hunks)
- arch/x86/chunkset_ssse3.c (0 hunks)
- chunkset_tpl.h (3 hunks)
💤 Files with no reviewable changes (3)
- arch/arm/chunkset_neon.c
- arch/x86/chunkset_sse2.c
- arch/x86/chunkset_ssse3.c
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus PR: zlib-ng/zlib-ng#1778 File: chunkset_tpl.h:164-165 Timestamp: 2024-10-04T03:17:24.773Z Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
🔇 Additional comments (11)
arch/x86/chunkset_avx2.c (5)
9-19: LGTM: New typedefs and macros for half chunk operations
The addition of the `halfchunk_t` typedef and related macros aligns well with the PR objective of making the function half chunk aware. These changes provide a solid foundation for the performance improvements described in the PR summary.
77-80: LGTM: Efficient 16-byte chunk memset function
The new `chunkmemset_16` function is a valuable addition that aligns with the PR's performance improvement goals. It efficiently sets 16-byte chunks using AVX2 instructions, which should contribute to better performance for smaller chunk sizes as mentioned in the PR summary.
128-145: LGTM: New halfchunkmemset functions for various sizes
The addition of `halfchunkmemset_2`, `halfchunkmemset_4`, and `halfchunkmemset_8` functions provides efficient ways to set half chunks of different sizes. These functions align well with the PR's objective of improving performance for smaller chunk sizes and making the implementation half chunk aware.
160-171: LGTM: Efficient GET_HALFCHUNK_MAG function added
The new `GET_HALFCHUNK_MAG` function is a crucial addition for efficient half chunk handling. It leverages the `half_rem_vals` array and simplifies the process of obtaining the magnitude for half chunks. This aligns perfectly with the PR's goal of improving performance for smaller chunk sizes. To ensure the performance improvements, consider running the following benchmark:
This script compiles two versions of the code (with and without the new half chunk optimizations) and runs benchmarks for various small chunk sizes. The results should show improved performance for the new version, especially for smaller chunk sizes.
Line range hint 1-172: Comprehensive benchmarking recommended
The changes implemented in this PR appear to be well-designed and should improve performance for smaller chunk sizes, as intended. However, to ensure that these optimizations don't negatively impact performance for larger chunk sizes, I recommend running a comprehensive benchmark across a wide range of chunk sizes. Consider running the following benchmark script:
This script will provide a comprehensive view of the performance impact across various chunk sizes, helping to ensure that the optimizations provide benefits without unexpected regressions.
chunkset_tpl.h (6)
90-108: LGTM: New HALFCHUNKCOPY function improves performance for smaller chunk sizes
The introduction of the `HALFCHUNKCOPY` function is a good optimization for handling smaller data sizes. It follows the same pattern as `CHUNKCOPY`, which maintains consistency in the codebase. This function will likely improve performance for operations on data sizes that are half of the standard chunk size.
Line range hint 112-129: LGTM: Improved function signature and initial checks
The changes to the `CHUNKMEMSET` function enhance its performance and flexibility:
- Changing to `static inline` allows for better compiler optimization.
- New variables `chunk_load`, `chunk_mod`, and `adv_amount` prepare for the half chunk aware implementation.
- Using `CHUNKCOPY` for `dist > sizeof(chunk_t)` simplifies the code and potentially improves performance.
These modifications align well with the PR objectives of improving performance and making the function half chunk aware.
🧰 Tools
🪛 ast-grep
[warning] 122-122: Avoid 'memset()' function, it does not consider buffer boundaries and can lead to buffer overflows. Use 'memset_s()' instead.
Context: memset(out, *from, len);
Note: [CWE-14]: Compiler Removal of Code to Clear Buffers [OWASP A04:2021] Insecure Design [REFERENCES]
- https://cwe.mitre.org/data/definitions/14.html
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures/
136-165: LGTM: Efficient half chunk handling improves performance for smaller sizes
The new half chunk handling code significantly enhances the function's performance for smaller chunk sizes:
- Specific handling for common `dist` values (2, 4, 8, 16) optimizes frequent cases.
- The use of `GET_HALFCHUNK_MAG` for other `dist` values ensures flexibility.
- The loop for copying half chunks efficiently handles the main copy operation.
Note: The use of `goto` at line 164, while generally discouraged, is retained here to aid compiler optimization for inlining, as previously discussed. These changes align perfectly with the PR objective of making the function half chunk aware and improving performance for smaller chunk sizes.
183-200: LGTM: Optimized main chunk handling for improved performance
The changes to the main chunk handling code introduce several performance optimizations:
- Addition of a specific case for `dist == 16` aligns with the PR objective of optimizing common cases.
- Use of `adv_amount` allows for more efficient copying when `dist` doesn't divide evenly into `chunk_t`.
- The main loop now handles two chunk copies simultaneously, leveraging CPU store forwarding as mentioned in the PR objectives.
These optimizations should lead to significant performance improvements, especially for smaller chunk sizes and common distance values (a rough sketch of this loop follows below).
213-215: Acknowledged: Remaining bytes handling for half chunk support
The addition of the `rem_bytes:` label, conditionally compiled for half chunk support, allows the half chunk code to efficiently handle remaining bytes. This aligns with the overall optimization strategy of the PR. Note: As discussed earlier, the use of `goto` here aids compiler optimization for inlining.
Line range hint 90-215: LGTM: Comprehensive performance improvements for chunkset operations
The changes in `chunkset_tpl.h` represent a significant optimization effort that aligns well with the PR objectives:
- Introduction of half chunk awareness improves performance for smaller chunk sizes.
- Optimizations for common distance values (2, 4, 8, 16) enhance efficiency in frequent scenarios.
- The main loop now leverages CPU store forwarding by handling two chunks simultaneously.
- The use of `adv_amount` allows for more efficient copying when the distance doesn't divide evenly into chunk size.
These optimizations are expected to yield substantial performance improvements, particularly for smaller chunk sizes and common distance values. The implementation is thorough and considers various edge cases, ensuring robust performance across different scenarios.
Great job on these optimizations! The changes should significantly enhance the performance of the `zlib-ng` library.
Benchmark
Dataset: silesia.tar
Develop 8d10c30
PR efb588d
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
arch/x86/chunkset_avx2.c (1)
55-57: Add documentation for the half_rem_vals array
The `half_rem_vals` array has been added, which is used in the `GET_HALFCHUNK_MAG` function. While its purpose can be inferred from its usage, it would be beneficial to add a brief comment explaining the significance of these values and how they were derived. Could you please add a comment explaining the purpose and derivation of the values in the `half_rem_vals` array?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- arch/arm/chunkset_neon.c (0 hunks)
- arch/x86/chunkset_avx2.c (5 hunks)
- arch/x86/chunkset_sse2.c (0 hunks)
- arch/x86/chunkset_ssse3.c (0 hunks)
- chunkset_tpl.h (3 hunks)
💤 Files with no reviewable changes (3)
- arch/arm/chunkset_neon.c
- arch/x86/chunkset_sse2.c
- arch/x86/chunkset_ssse3.c
🧰 Additional context used
📓 Learnings (2)
arch/x86/chunkset_avx2.c (1)
Learnt from: KungFuJesus PR: zlib-ng/zlib-ng#1802 File: arch/x86/chunkset_avx2.c:56-59 Timestamp: 2024-10-08T03:09:54.717Z Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
chunkset_tpl.h (1)
Learnt from: KungFuJesus PR: zlib-ng/zlib-ng#1778 File: chunkset_tpl.h:164-165 Timestamp: 2024-10-04T03:17:24.773Z Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
🔇 Additional comments (11)
arch/x86/chunkset_avx2.c (5)
12-12: LGTM: Appropriate type definition for half-chunks
The addition of `halfchunk_t` as `__m128i` is a good choice for handling 128-bit (half-chunk) operations. This aligns well with the PR's objective of improving performance for smaller chunk sizes.
17-19: LGTM: New macros for extended functionality
The addition of `HAVE_CHUNKMEMSET_16` and `HAVE_HALF_CHUNK` macros is consistent with the existing code structure. These will allow for conditional compilation of the new 16-byte chunk operations and half-chunk functionality, which is in line with the PR's performance improvement goals.
77-79: LGTM: Efficient implementation of chunkmemset_16
The `chunkmemset_16` function is a well-implemented addition for handling 16-byte chunks. The use of `_mm256_broadcastsi128_si256` is an efficient way to set a 256-bit chunk from a 16-byte input, which aligns with the PR's performance improvement goals.
128-158: LGTM: Well-implemented half-chunk operations
The addition of half-chunk related functions (halfchunkmemset_2, halfchunkmemset_4, halfchunkmemset_8, loadhalfchunk, storehalfchunk, halfchunk2whole) is well-implemented. These functions provide efficient operations for 128-bit data, which is consistent with the PR's goal of improving performance for smaller chunk sizes. The implementations are analogous to their full-chunk counterparts, ensuring consistency in the codebase.
Line range hint 91-124: Verify correctness of simplified GET_CHUNK_MAG function
The simplification of the `GET_CHUNK_MAG` function, particularly for `dist < 16`, should lead to better performance. However, the removal of the specific branch for `dist == 16` warrants careful verification to ensure correctness is maintained for all input cases. Please run the following script to verify the function's behavior:
chunkset_tpl.h (6)
90-108: LGTM: New `HALFCHUNKCOPY` function for optimized half-chunk operations
The addition of the `HALFCHUNKCOPY` function is a good optimization for handling smaller data sizes. It follows the same pattern as `CHUNKCOPY`, which maintains consistency in the codebase. This function will likely improve performance for operations on data sizes that are half of the standard chunk size.
112-120: LGTM: Improved function signature and new variable declarations
The change of `CHUNKMEMSET` to `static inline` is a good optimization that allows the compiler to inline the function, potentially improving performance. The new variables `chunk_load` and `chunk_mod` are necessary for the updated logic in the function.
136-165: LGTM: Efficient half-chunk handling logic
The new half-chunk handling logic is a valuable addition that allows for more efficient processing of smaller data sizes. The specialized handling for common `dist` values (2, 4, 8, 16) is likely to improve performance in frequent use cases. Note: The use of `goto rem_bytes` at line 164 is intentional and aids in function inlining, as previously discussed.
194-200: LGTM: Optimized main loop with double chunk copy
The updated main loop in `CHUNKMEMSET` now handles two chunk copies at once, which can significantly improve performance by reducing the number of loop iterations. The use of `adv_amount` ensures correct handling for cases where the distance doesn't divide evenly into the chunk size, maintaining accuracy while optimizing for speed.
Line range hint 213-220: LGTM: Consistent remainder handling with half-chunk support
The addition of the `rem_bytes:` label, conditionally compiled for half-chunk support, maintains consistency with the earlier `goto` statement. The unchanged remainder handling logic continues to appropriately deal with any leftover bytes, ensuring complete and accurate data copying.
Line range hint 1-255: Overall assessment: Significant performance optimizations
The changes in this file represent a substantial effort to optimize chunk memory operations, particularly for smaller data sizes. The introduction of half-chunk handling, the optimized main loop in `CHUNKMEMSET`, and the new `HALFCHUNKCOPY` function are all likely to contribute to improved performance.
The code maintains good readability and follows consistent patterns, which is crucial for maintainability. The careful handling of different chunk sizes and the use of conditional compilation (`#ifdef HAVE_HALF_CHUNK`) allow for flexibility across different architectures.
These optimizations align well with the PR objectives of improving the performance of chunk operations, especially for smaller chunk sizes. The changes appear to be well-thought-out and implemented with attention to both performance and code quality.
```c
static inline halfchunk_t GET_HALFCHUNK_MAG(uint8_t *buf, uint32_t *chunk_rem, uint32_t dist) {
    lut_rem_pair lut_rem = perm_idx_lut[dist - 3];
    __m128i perm_vec, ret_vec;
    __msan_unpoison(buf + dist, 16 - dist);
    ret_vec = _mm_loadu_si128((__m128i*)buf);
    *chunk_rem = half_rem_vals[dist - 3];

    perm_vec = _mm_load_si128((__m128i*)(permute_table + lut_rem.idx));
    ret_vec = _mm_shuffle_epi8(ret_vec, perm_vec);

    return ret_vec;
}
```
🛠️ Refactor suggestion
LGTM: Efficient implementation of GET_HALFCHUNK_MAG with minor optimization potential
The GET_HALFCHUNK_MAG function is a well-implemented half-chunk version of GET_CHUNK_MAG. The use of half_rem_vals for setting chunk_rem simplifies the remainder calculation, and the SIMD instructions are used appropriately for efficient operation.
A minor optimization could be considered:
Consider using _mm_loadl_epi64 instead of _mm_loadu_si128 when dist <= 8. This might be slightly more efficient for smaller distances. Here's a potential implementation:
```diff
- ret_vec = _mm_loadu_si128((__m128i*)buf);
+ ret_vec = (dist <= 8) ? _mm_loadl_epi64((__m128i*)buf) : _mm_loadu_si128((__m128i*)buf);
```

This change could potentially improve performance for very small distances without affecting correctness.
Committable suggestion was skipped due to low confidence.
The branching cost far outweighs the savings from the load.
@KungFuJesus, understood. Thank you for the clarification.
(。•́︿•̀。)
✏️ Learnings added
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.150Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T03:09:54.717Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T19:37:14.998Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:168-168
Timestamp: 2024-10-07T21:21:10.570Z
Learning: In the function `GET_HALFCHUNK_MAG` in `arch/x86/chunkset_avx2.c`, `dist` is always less than 16, so `16 - dist` in `__msan_unpoison` is always positive.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:168-168
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the function `GET_HALFCHUNK_MAG` in `arch/x86/chunkset_avx2.c`, `dist` is always less than 16, so `16 - dist` in `__msan_unpoison` is always positive.
Put length 16 in the length checking ladder and take care of it there since it's also a simple case to handle. We kind of went out of our way to pretend 128 bit vectors didn't exist when using avx2 but this can be handled in a single instruction. Strangely the intrinsic uses vector register operands but the instruction itself assumes a memory operand for the source. This also means we don't have to handle this case in our "GET_CHUNK_MAG" function.
This gives us appreciable gains on a number of fronts. The first is that we're inlining a pretty hot function that was getting dispatched to regularly. Another is that we're able to do a safe lagged copy of a distance that is smaller, so CHUNKCOPY gets its teeth back here for smaller sizes, without having to do another dispatch to a function. We're also now doing two overlapping writes at once and letting the CPU do its store forwarding. This was an enhancement @dougallj had suggested a while back. Additionally, the "half chunk mag" here is fundamentally less complicated because it doesn't require synthesizing cross lane permutes with a blend operation, so we can optimistically do that first if the len is small enough that a full 32 byte chunk doesn't make any sense.
Dead2
left a comment
LGTM
Make chunkset_avx2 half chunk aware
This gives us appreciable gains on a number of fronts. The first being
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
distance that is smaller, so CHUNKCOPY gets its teeth back here for
smaller sizes, without having to do another dispatch to a function.
We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.
Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
Summary by CodeRabbit
New Features
- `halfchunk_t` type to optimize operations on 128-bit chunks.
- `HALFCHUNKCOPY` for efficient data copying of half-sized chunks.
Bug Fixes
- `inflate_fast` function for improved safety.
Refactor
- `CHUNK_SIZE` macro across multiple files to streamline chunk size references.
- `CHUNKMEMSET` function to a static inline function for better linkage control.