ARM Neon: Fold a copy into the adler32 function for UPDATEWINDOW #1870
Conversation
Walkthrough: The changes update the NEON-optimized Adler32 implementation to optionally perform data copying during the accumulation process. A new inline function, `adler32_fold_copy_impl`, handles both the checksum-only and checksum-plus-copy paths behind thin wrappers (`adler32_neon` and `adler32_fold_copy_neon`).
Sequence Diagram(s):
sequenceDiagram
participant Caller
participant Wrapper
participant Implementation
%% For non-copy flow (original behavior)
Caller->>Wrapper: Call adler32_neon (adler, buf, len)
Wrapper->>Implementation: Call adler32_fold_copy_impl with COPY=0
Implementation-->>Wrapper: Return checksum
Wrapper-->>Caller: Return checksum
%% For copy flow (new behavior)
Caller->>Wrapper: Call adler32_fold_copy_neon (adler, dst, src, len)
Wrapper->>Implementation: Call adler32_fold_copy_impl with COPY=1
Implementation-->>Wrapper: Return checksum
Wrapper-->>Caller: Return checksum
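To make the dispatch shape concrete, here is a minimal scalar sketch assuming the signatures discussed in this PR; the NEON body is replaced by a plain byte loop and the `_sketch` names are illustrative, so this is not the PR's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521  /* largest prime < 2^16, per the Adler-32 definition */

/* Shared implementation: COPY=1 also writes each byte to dst. */
static uint32_t adler32_fold_copy_impl(uint32_t adler, uint8_t *dst,
                                       const uint8_t *src, size_t len,
                                       const int COPY) {
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; i++) {
        if (COPY)
            dst[i] = src[i];
        a = (a + src[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/* Wrapper used by the non-copy flow. */
uint32_t adler32_neon_sketch(uint32_t adler, const uint8_t *buf, size_t len) {
    return adler32_fold_copy_impl(adler, NULL, buf, len, 0);
}

/* Wrapper used by the copy flow (e.g. UPDATEWINDOW). */
uint32_t adler32_fold_copy_neon_sketch(uint32_t adler, uint8_t *dst,
                                       const uint8_t *src, size_t len) {
    return adler32_fold_copy_impl(adler, dst, src, len, 1);
}
```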
c18d13c to 50c4542 (Compare)
50c4542 to 847e959 (Compare)
Nope
847e959 to 3bdbca9 (Compare)
Hmm, the slide hash function should be pointed at the buffer window, which should be properly aligned. Do you have more error context for that one? As for the adler, the only thing we're using the multi-register ld1 variant with is something whose alignment we control on the stack. Everything else is using plain old ld1 (really 2x ldp's, as that's usually what both clang and GCC compile it to). Of course this is aarch64, not necessarily armv7.
3bdbca9 to ebdb4b9 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (4)
arch/arm/adler32_neon.c (4)
13-13: Clarify the `COPY` parameter usage in the function signature.
Declaring `static inline void NEON_accum32(uint32_t *s, uint8_t *dst, const uint8_t *buf, size_t len, const int COPY)` introduces a conditional copy mechanism. It would be helpful to add a brief comment explaining that `len` here refers to the number of 16-byte blocks (not total bytes), and that `COPY` toggles storing to `dst`.
56-56: Remove or justify the commented-out `vst1q_u8_x4` call.
Line 56 has `//vst1q_u8_x4(dst, d0_d3);` commented out. If this code path is obsolete, removing it would clean up the function. Otherwise, add a comment explaining why it remains commented out for potential future use.
168-168: Remove trailing whitespace.
The pipeline warns about trailing whitespace on line 168. Removing it helps maintain a clean codebase. Apply the following fix:

```diff
-    *dst = *src; 
+    *dst = *src;
```

🧰 Tools
🪛 GitHub Actions: Lint
[warning] 168-168: Trailing whitespace found.
230-230: Clarify block-based length usage.
`NEON_accum32(pair, dst + done, src + done, n >> 4, COPY);` is called with `n >> 4`, where `n` is in bytes. The function then treats this as the number of 16-byte blocks. This is correct but can be confusing. Consider documenting that `len` inside `NEON_accum32` is interpreted as "count of 16-byte blocks" (see the sketch after this list).
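A short sketch of the bytes-versus-blocks convention flagged in the 230-230 comment above; the helper is an empty stand-in and only the call-site arithmetic is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for NEON_accum32: `blocks` counts 16-byte blocks, not bytes. */
static void NEON_accum32_sketch(uint32_t *s, uint8_t *dst, const uint8_t *buf,
                                size_t blocks, const int COPY) {
    (void)s; (void)dst; (void)buf; (void)blocks; (void)COPY;
}

void consume(uint32_t *pair, uint8_t *dst, const uint8_t *src,
             size_t n, size_t done, int COPY) {
    /* n is a byte count; n >> 4 converts it to whole 16-byte blocks. */
    NEON_accum32_sketch(pair, dst + done, src + done, n >> 4, COPY);
    /* The n & 15 leftover bytes are handled by a scalar tail loop. */
}
```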
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- arch/arm/adler32_neon.c (9 hunks)
- arch/arm/arm_functions.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint
arch/arm/adler32_neon.c
[warning] 168-168: Trailing whitespace found.
⏰ Context from checks skipped due to timeout of 90000ms (60)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: Ubuntu GCC AARCH64 ASAN
- GitHub Check: macOS GCC (ARM64)
- GitHub Check: macOS GCC
- GitHub Check: macOS GCC Symbol Prefix & Compat
🔇 Additional comments (6)
arch/arm/adler32_neon.c (3)
51-58: Confirm sufficient buffer space before writing to `dst`.
When `COPY` is enabled, the code writes 64 bytes (`d0`, `d1`, `d2`, `d3`) to `dst` without explicit bounds checks. Although the caller logic appears to ensure there's enough space, consider documenting or asserting that `dst` can indeed accommodate these writes, to prevent out-of-bounds writes in case of misuse.
79-86: Ensure no overflow risks in wide accumulations.
Although the internal comment states that the maximum values won't exceed `uint16_t` limits, ensure that any changes to the iteration bounds (e.g., if `NMAX` or the data size changes in the future) do not inadvertently cause overflow in these partial sums. Maintaining a test that specifically checks large inputs might help.
256-262: Wrapper functions for copy vs non-copy look appropriate.
The `adler32_neon` and `adler32_fold_copy_neon` wrappers calling `adler32_fold_copy_impl` with `COPY=0` and `COPY=1`, respectively, cleanly separate the two pathways. This is an elegant solution that avoids code duplication. No issues found.
arch/arm/arm_functions.h (1)
10-10: Validate new function declaration and macro definition.
Introducing `uint32_t adler32_fold_copy_neon(...)` and mapping `native_adler32_fold_copy` to it (line 42) is consistent with how other NEON-based definitions are set. Ensure all references expect the same parameter types and order. Also applies to: 42-42
test/benchmarks/benchmark_adler32_copy.cc (1)
95-96: Benchmarking the new folded-copy function.
Switching from `adler32_neon` to `adler32_fold_copy_neon` for benchmarking is consistent with the new copy-aware Adler32 logic. This should accurately reflect performance changes introduced by optional copying.
functable.c (1)
175-175: Verify the NEON implementation's interface compatibility.
The integration of `adler32_fold_copy_neon` into the function table looks correct. However, given the reported compilation issues with ARM architecture files, let's verify the function declaration and interface compatibility. Run the following script to check the function declaration:
If the function declaration matches the interface of other implementations and follows the expected signature `uint32_t (*adler32_fold_copy)(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len)`, then the integration is correct.
✅ Verification successful
🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify the function declaration matches the expected interface

# Check the function declaration in the header file
rg -A 1 "adler32_fold_copy_neon" arch/arm/

# Check if the function signature matches other implementations
rg -A 1 "adler32_fold_copy_\w+" | grep -v "neon"
```

Length of output: 13737
NEON Implementation Interface Verified
The function declaration and definition of `adler32_fold_copy_neon` in `arch/arm/arm_functions.h` and `arch/arm/adler32_neon.c` match the expected signature, consistent with the other implementations in the codebase. The assignment in `functable.c` is correct and well-integrated.
- Confirmed function signature: `uint32_t adler32_fold_copy_neon(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len)`
- Signature matches the interface used by generic and alternative implementations
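For illustration, a minimal sketch of the function-table wiring described above, using the signature quoted in the review; the struct and init function are hypothetical, not zlib-ng's actual functable layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Signature quoted in the review above. */
typedef uint32_t (*adler32_fold_copy_fn)(uint32_t adler, uint8_t *dst,
                                         const uint8_t *src, size_t len);

uint32_t adler32_fold_copy_neon(uint32_t adler, uint8_t *dst,
                                const uint8_t *src, size_t len);

/* Hypothetical table shape, for illustration only. */
struct dispatch_table {
    adler32_fold_copy_fn adler32_fold_copy;
};

void init_dispatch(struct dispatch_table *ft, int have_neon) {
    if (have_neon)
        ft->adler32_fold_copy = adler32_fold_copy_neon;  /* NEON path */
    /* else: fall back to the generic C implementation */
}
```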
@KungFuJesus nope. This is Debian 11 with clang-11. Interestingly enough, if I use an Alpine Linux container with distrobox, it compiles and it works fine after this PR. Before this PR, example.c would crash when run.
OK, so the failure to compile is probably a separate issue. There's probably something weird going on with your build toolchain.
Huh? It's stock Debian; clang was installed with `sudo apt install clang`.
ebdb4b9 to d5fe593 (Compare)
Let me try to reproduce with my potato Pi on a similar vintage of Debian. It's the armv6 ABI, but I'll give it a go.
Hmm, this might be too potato; it's just compiling functions with the fallback v6 asm. I don't really have any aarch32 installs to test this on; everything I have that's aarch64-capable is running aarch64. I might have a second-gen Pi lying around somewhere, but I'd have to track it down. Suffice to say, though, if the compiler is crashing and spitting out a stack trace, it's probably a compiler bug. Either that or your clang binaries are somehow corrupt.
Correct. That stack trace is indeed an internal LLVM bug.
d5fe593 to 4f9c6cc (Compare)
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

```diff
@@            Coverage Diff            @@
##           develop    #1870    +/-  ##
=========================================
+ Coverage    82.37%   82.57%   +0.19%
=========================================
  Files          141      141
  Lines        12558    12672     +114
  Branches      2894     2909      +15
=========================================
+ Hits         10345    10464     +119
- Misses        1239     1240       +1
+ Partials       974      968       -6
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)
56-56: Consider using `vst1q_u8_x4` for optimized copying.
The commented out `vst1q_u8_x4` instruction could potentially offer better performance than individual `vst1q_u8` calls. Let's verify if the compiler supports this instruction:
❌ Verification inconclusive
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check if vst1q_u8_x4 is supported by the compiler
echo '#include <arm_neon.h>' > test.c
echo 'void test() { uint8x16x4_t data; uint8_t* dst; vst1q_u8_x4(dst, data); }' >> test.c
cc -c test.c -march=armv8-a+simd 2>&1 || true
```

Length of output: 247

I'll run a new check using `clang` instead of `cc` to verify whether the `vst1q_u8_x4` intrinsic is supported by the available toolchain. Once this script executes, we can determine if the NEON intrinsic `vst1q_u8_x4` is supported. If the compilation succeeds (i.e. no intrinsic-related errors are emitted), then using `vst1q_u8_x4` for optimized copying should be considered. Otherwise, if the compiler errors out due to lack of support, then leaving the code as-is is appropriate. Please review the result of this script and verify that your build environment includes a toolchain (with `clang` or an equivalent compiler) that supports ARM NEON intrinsics.
🏁 Script executed:

```bash
#!/bin/bash
# Check if clang is available and compile test using clang for vst1q_u8_x4 support
which clang || echo "clang not installed"
echo '#include <arm_neon.h>' > test_clang.c
echo 'void test() { uint8x16x4_t data; uint8_t* dst; vst1q_u8_x4(dst, data); }' >> test_clang.c
clang -c test_clang.c -march=armv8-a+simd 2>&1 || true
```

Length of output: 540

Action Needed: Confirm Compiler Support for vst1q_u8_x4
The script couldn't verify support due to the absence of a suitable compiler (clang) in the testing environment. Please ensure that your build toolchain supports the `vst1q_u8_x4` intrinsic, either by testing compilation locally or by reviewing your toolchain documentation. If supported, consider enabling it for optimized copying; otherwise, the current implementation remains acceptable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- arch/arm/adler32_neon.c (9 hunks)
- arch/arm/arm_functions.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- test/benchmarks/benchmark_adler32_copy.cc
- functable.c
- arch/arm/arm_functions.h
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Ubuntu 20.04 Clang 6
- GitHub Check: EL9 GCC S390X DFLTCC ASAN
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: macOS GCC
🔇 Additional comments (4)
arch/arm/adler32_neon.c (4)
13-150: LGTM! The NEON accumulation function is well-implemented.
The function correctly handles both checksum computation and optional data copying using NEON intrinsics. The implementation maintains good performance characteristics by:
- Using aligned NEON loads and stores
- Efficiently handling data in 64-byte chunks
- Properly managing overflow conditions
205-206: Good use of restrict pointers for optimizing the copy loop.
The `__restrict` qualifier helps the compiler generate more efficient code by guaranteeing that the source and destination pointers don't alias.
160-254: LGTM! The implementation is robust and handles edge cases well.
The function:
- Correctly handles special cases (len == 1, len < 16)
- Maintains alignment requirements for optimal NEON operations
- Properly manages the checksum computation with optional copying
256-262: LGTM! Clean API design with good separation of concerns.
The public API functions provide a clear interface while delegating to the shared implementation, making the code maintainable and efficient.
4f9c6cc to 04988a1 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)
56-56: Remove commented out code.
The commented out `vst1q_u8_x4` instruction should be removed as it's not being used.

```diff
-    //vst1q_u8_x4(dst, d0_d3);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- arch/arm/adler32_neon.c (9 hunks)
- arch/arm/arm_functions.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- arch/arm/arm_functions.h
- test/benchmarks/benchmark_adler32_copy.cc
- functable.c
⏰ Context from checks skipped due to timeout of 90000ms (40)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC (ARM64)
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)
13-150: LGTM! Well-optimized NEON implementation.
The function efficiently combines checksum calculation with conditional copying, maintaining optimal performance through:
- Proper use of NEON intrinsics for vectorized operations
- Minimal branching in the main loop
- Efficient memory access patterns
160-254: LGTM! Well-structured implementation with proper optimizations.
The function effectively:
- Handles all edge cases
- Uses restrict pointers for better compiler optimization
- Maintains proper memory alignment for SIMD operations
256-262: LGTM! Clean wrapper implementations.
The wrapper functions provide a clear interface to the core implementation with appropriate parameter passing.
3f4e99f to da961da (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)
13-14: Consider using `vst1q_u8_x4` for more efficient copying.
The commented out `vst1q_u8_x4` instruction could potentially be more efficient than four separate `vst1q_u8` calls. Consider benchmarking both approaches (a standalone timing sketch follows below). Also applies to: 52-59
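A standalone timing sketch of the suggested comparison, assuming an AArch64 toolchain whose arm_neon.h provides the x4 intrinsics; the buffer size, iteration count, and function names are illustrative and unrelated to the repository's benchmark harness.

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE  (1 << 16)
#define ITERS     20000

static uint8_t src_buf[BUF_SIZE], dst_buf[BUF_SIZE];

/* Copy 64 bytes per step with four single-register stores. */
static void copy_store_x1(uint8_t *d, const uint8_t *s, size_t len) {
    for (size_t i = 0; i + 64 <= len; i += 64) {
        uint8x16x4_t v = vld1q_u8_x4(s + i);
        vst1q_u8(d + i,      v.val[0]);
        vst1q_u8(d + i + 16, v.val[1]);
        vst1q_u8(d + i + 32, v.val[2]);
        vst1q_u8(d + i + 48, v.val[3]);
    }
}

/* Copy 64 bytes per step with one multi-register store. */
static void copy_store_x4(uint8_t *d, const uint8_t *s, size_t len) {
    for (size_t i = 0; i + 64 <= len; i += 64) {
        uint8x16x4_t v = vld1q_u8_x4(s + i);
        vst1q_u8_x4(d + i, v);
    }
}

static double time_fn(void (*fn)(uint8_t *, const uint8_t *, size_t)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        fn(dst_buf, src_buf, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    memset(src_buf, 0xa5, sizeof(src_buf));
    printf("4x vst1q_u8 : %.3f s\n", time_fn(copy_store_x1));
    printf("vst1q_u8_x4 : %.3f s\n", time_fn(copy_store_x4));
    return 0;
}
```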
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- arch/arm/adler32_neon.c (9 hunks)
- arch/arm/arm_functions.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- test/benchmarks/benchmark_adler32_copy.cc
- functable.c
- arch/arm/arm_functions.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (44)
- GitHub Check: Windows ClangCl Win32
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 GCC S390X DFLTCC ASAN
- GitHub Check: macOS GCC
- GitHub Check: macOS GCC Symbol Prefix & Compat
- GitHub Check: macOS Clang Native Instructions (ARM64)
🔇 Additional comments (5)
arch/arm/adler32_neon.c (4)
161-184: Well-structured implementation with optimized edge cases.
The implementation efficiently handles different length cases while maintaining the original adler32 logic. The use of `static inline` with `Z_INLINE` and special handling for small lengths helps optimize performance.
206-214: Effective use of restrict pointers for copy optimization.
The use of `__restrict` qualifiers helps the compiler optimize the copy operation by guaranteeing non-overlapping memory regions.
258-260: Clean wrapper implementation maintaining backward compatibility.
The function correctly delegates to `adler32_fold_copy_impl` with appropriate parameters.
262-264: Efficient implementation with implicit validation.
Using `dst != NULL` as the COPY flag is an elegant way to handle validation implicitly.
zbuild.h (1)
73-78: Well-implemented cross-compiler forced inline support.
The `Z_INLINE` macro correctly handles both MSVC (`__forceinline`) and GCC (`__attribute__((always_inline))`) compilers, which is essential for the performance optimizations in the adler32 implementation.
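For reference, a minimal sketch of the kind of macro described above, covering only the MSVC and GCC/Clang cases mentioned; zlib-ng's actual Z_INLINE definition may differ in detail.

```c
/* Illustrative forced-inline macro, not the project's exact definition. */
#if defined(_MSC_VER)
#  define Z_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define Z_INLINE inline __attribute__((always_inline))
#else
#  define Z_INLINE
#endif

/* Usage: force the hot helper inline to avoid per-call stack churn. */
static Z_INLINE int add_one(int x) {
    return x + 1;
}
```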
```c
#else
#define Z_INLINE
#endif
```
suggestion (non-blocking):
Wrapping this in a test for WITHOUT_FORCEINLINE or NO_FORCEINLINE could be helpful for testing and debugging, and maybe also for the few ones that want to minimize library size for use in embedded and such.
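A sketch of the guard being suggested, assuming WITHOUT_FORCEINLINE/NO_FORCEINLINE would be user-defined build options rather than anything that exists today.

```c
/* Hypothetical opt-out for forced inlining, as suggested above. */
#if defined(WITHOUT_FORCEINLINE) || defined(NO_FORCEINLINE)
#  define Z_INLINE inline
#elif defined(_MSC_VER)
#  define Z_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define Z_INLINE inline __attribute__((always_inline))
#else
#  define Z_INLINE
#endif
```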
I can see how that might be useful if we come to use it in many places. In this circumstance, there's a lot of stack churn going into and out of NEON_accum32, and it's quite a bit slower without inlining. Maybe we can do that in a later PR.
Instead of allowing to disable forced inlining, we should consider using x4 versions only for non-copy version... This would mean we would need to duplicate the function instead of using a parameter to decide if we need to copy... On a lot of known platforms where x4 versions don't work, we already convert them to equivalent 4 x1 calls... For the remaining, we need to fix the issue anyways instead of masking it.
Easiest way to avoid inlining is to build debug version as it will disable all inlining.
Having two versions of the function also makes it easier to test for regressions as there is always "spare" version to compare against especially when the plain C version is not included for reducing binary size.
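A sketch of the x4-to-4x1 conversion mentioned above for toolchains whose arm_neon.h lacks the multi-register intrinsics; the guard macro and helper names are hypothetical.

```c
#include <arm_neon.h>

/* Hypothetical fallback: emulate the x4 load/store with four x1 calls
 * on toolchains that don't provide vld1q_u8_x4/vst1q_u8_x4. */
#if defined(NEED_X4_FALLBACK)  /* hypothetical guard */
static inline uint8x16x4_t fallback_vld1q_u8_x4(const uint8_t *p) {
    uint8x16x4_t r;
    r.val[0] = vld1q_u8(p);
    r.val[1] = vld1q_u8(p + 16);
    r.val[2] = vld1q_u8(p + 32);
    r.val[3] = vld1q_u8(p + 48);
    return r;
}

static inline void fallback_vst1q_u8_x4(uint8_t *p, uint8x16x4_t v) {
    vst1q_u8(p,      v.val[0]);
    vst1q_u8(p + 16, v.val[1]);
    vst1q_u8(p + 32, v.val[2]);
    vst1q_u8(p + 48, v.val[3]);
}
#endif
```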
We are avoiding that due to the 64 byte alignment requirement.
The 64-byte alignment requirement seems to be dictated by limitation of specific ABI, not generic restriction... On default ABI there is no such restriction and using x1 version should be somewhat slower and rely too much on compiler's ability to reorder code to avoid unnecessary pipeline stalls and stack spills.
As mentioned elsewhere, the other non-ABI related issues with x4 intrinsics were caused by using incompatible combination of compiler command-line options. As such, fix for them warrants own PR.
I mean, the x4 load improvement is quite minor to begin with, but if we're going to have 2 separate function paths anyway, I don't see much real harm. I do find it funny that coderabbit tried to suggest adding it back, claiming regurgitated benefits from whatever lore it may have read on Stack Overflow, but then failed to verify those claims because its copy of clang didn't support the intrinsic. I've added it back and injected what I hope is a good workaround for detecting iOS. That should be all that's needed.
We could instead enforce 32-byte alignment in all scenarios, but then we're going to have 1-31 byte checksums done a byte at a time, which is less than ideal. The alignment benefits are touted to be transparent and automatically realized by many ARM SIPs, yet the x2, x3, and x4 variants of this instruction still support encoding an alignment hint into the instruction, so clearly something wants that. In any case, clang and GCC will never put anything there when targeting Linux, even with an alignment suggestion (__builtin_assume_aligned).
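For reference, a small sketch of attaching an alignment hint to a NEON load with GCC/Clang on AArch64; as noted above, the hint only informs the optimizer and does not force an alignment-checked encoding on Linux targets. The summing function itself is illustrative.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sum 64 bytes from a buffer the caller promises is 32-byte aligned.
 * __builtin_assume_aligned informs the optimizer; it does not change
 * the instruction encoding on Linux targets. */
uint32_t sum64_aligned(const uint8_t *buf) {
    const uint8_t *p = __builtin_assume_aligned(buf, 32);
    uint8x16x4_t d = vld1q_u8_x4(p);
    uint16x8_t s = vaddl_u8(vget_low_u8(d.val[0]), vget_high_u8(d.val[0]));
    s = vaddw_u8(s, vget_low_u8(d.val[1]));
    s = vaddw_u8(s, vget_high_u8(d.val[1]));
    s = vaddw_u8(s, vget_low_u8(d.val[2]));
    s = vaddw_u8(s, vget_high_u8(d.val[2]));
    s = vaddw_u8(s, vget_low_u8(d.val[3]));
    s = vaddw_u8(s, vget_high_u8(d.val[3]));
    return vaddlvq_u16(s);  /* horizontal add (AArch64) */
}
```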
My knowledge of ARM architecture is pretty much limited to compilers for embedded programming, and that implies utilizing a limited set of instructions... As I've mentioned before, sometimes we have to find a middle ground between using as many registers as possible (in a single instruction or code block) and avoiding stalls. This pretty much means we should, in such cases, benchmark the 1x, 2x, and 4x variations, having deduced that the 3x variants cause more trouble than they're worth. Compilers will almost always try to reorder the code if there is a chance of a performance penalty and the optimisation level prefers speed over size.
What works for 32-bit ARM might not be the best option for 64-bit ARM, and that was one of the reasons I didn't want to unify the code back then, even though a lot of code was shared between the 32-bit and 64-bit versions.
da961da to 23ee5b6 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)
56-56: Consider using `vst1q_u8_x4` for better performance.
The commented out `vst1q_u8_x4` instruction suggests a potential optimization. This instruction could potentially improve performance by reducing the number of store operations.
✅ Verification successful
🌐 Web query: What are the performance benefits of using `vst1q_u8_x4` over individual `vst1q_u8` calls in ARM NEON?
💡 Result:
Using `vst1q_u8_x4` (which stores four 128-bit NEON registers consecutively) instead of four separate `vst1q_u8` calls provides several performance advantages on ARM NEON:
Reduced Instruction Overhead
A single `vst1q_u8_x4` instruction replaces four `vst1q_u8` calls, minimizing loop iterations and branching. This reduces pipeline stalls and improves instruction throughput[1][5].
Memory Access Efficiency
Combining stores into a single operation improves cache utilization and memory bandwidth efficiency by consolidating scattered writes into contiguous blocks[3][6]. This is critical for data-heavy workloads like image/video processing.
Latency Hiding
Bulk stores allow better instruction scheduling to hide memory latency. For example:

```c
// Individual stores (inefficient)
vst1q_u8(ptr,      data0);
vst1q_u8(ptr + 16, data1);
vst1q_u8(ptr + 32, data2);
vst1q_u8(ptr + 48, data3);

// Combined store (optimized)
uint8x16x4_t bulk = { { data0, data1, data2, data3 } };
vst1q_u8_x4(ptr, bulk);
```

The combined version reduces dependencies between store operations[1][5].
Alignment Benefits
Bulk stores are more likely to align with cache line boundaries (typically 64 bytes), avoiding partial-line writes that incur extra cycles[3][6].
| Approach | Instructions | Cache Utilization | Latency Impact |
|---|---|---|---|
| 4x `vst1q_u8` calls | 4 | Moderate | High |
| `vst1q_u8_x4` | 1 | Optimal | Low |

While benchmarks vary by hardware[10], tests on Cortex-A9 show similar optimizations improving memory-bound workloads by ~30%[1]. For ARMv8+ CPUs with wider pipelines (e.g., Grace CPU), bulk stores better utilize memory subsystems designed for large contiguous accesses[2][7].
Citations:
- 1: https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/arm-neon-optimization
- 2: https://github.com/NVIDIA/grace-cpu-benchmarking-guide/blob/main/src/developer/vectorization.md
- 3: https://docs.unity3d.com/Packages/com.unity.burst@1.7/api/Unity.Burst.Intrinsics.Arm.Neon.vst1q_u8.html
- 4: https://e2e.ti.com/support/tools/code-composer-studio-group/ccs/f/code-composer-studio-forum/117858/arm-cortex-a8-using-neon-and-vfp
- 5: https://www.dailyhostnews.com/practical-approach-to-arm-neon-optimization
- 6: https://blog.yiningkarlli.com/2021/09/neon-vs-sse.html
- 7: https://www.arm.com/technologies/neon
- 8: https://answers.opencv.org/question/42612/does-opencv-support-the-use-of-vector-reciprocal-on-arm-neon/
- 9: https://developer.arm.com/documentation/dui0472/latest/Using-NEON-Support/NEON-intrinsics-for-storing-a-single-vector-or-lane
- 10: https://forums.developer.apple.com/forums/thread/14440
Optimize NEON Data Stores with `vst1q_u8_x4`
The web query confirms that using `vst1q_u8_x4` instead of multiple `vst1q_u8` calls can reduce instruction overhead, improve memory access efficiency (thanks to better cache and alignment behavior), and lower latency through bulk store operations. The commented-out code on line 56 indicates that this optimization was considered. If performance profiling in your target workload confirms these benefits without adverse effects, you might consider un-commenting (or integrating) the use of `vst1q_u8_x4`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- arch/arm/adler32_neon.c (9 hunks)
- arch/arm/arm_functions.h (2 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- functable.c
- arch/arm/arm_functions.h
- test/benchmarks/benchmark_adler32_copy.cc
- zbuild.h
🧰 Additional context used
🧠 Learnings (2)
⏰ Context from checks skipped due to timeout of 90000ms (96)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: macOS Clang Symbol Prefix
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC (ARM64)
- GitHub Check: macOS GCC
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
- GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
- GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
- GitHub Check: Ubuntu GCC AARCH64 ASAN
- GitHub Check: Ubuntu Emscripten WASM32
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)
13-150: LGTM! Efficient NEON implementation with optional copy functionality.
The NEON implementation correctly handles both checksum calculation and data copying, with proper alignment and efficient use of SIMD instructions.
204-212: Excellent use of restrict pointers for copy optimization.
The use of `__restrict` qualifiers helps the compiler generate more efficient code by guaranteeing that the source and destination pointers don't alias.
260-262: LGTM! Smart implementation of the copy flag.
Using `dst != NULL` as the COPY flag is an elegant way to handle the copy functionality while maintaining backward compatibility.
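A small illustration of the `__restrict` point from the 204-212 comment, assuming GCC/Clang's `__restrict` extension; the function is a generic byte copy, not the PR's actual tail-copy loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Telling the compiler dst and src never alias lets it vectorize or unroll
 * the loop without emitting runtime overlap checks. */
void copy_tail(uint8_t *__restrict dst, const uint8_t *__restrict src,
               size_t len) {
    while (len--) {
        *dst++ = *src++;
    }
}
```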
23ee5b6 to a97591c (Compare)
a97591c to d706600 (Compare)
@neheb could you check whether my iOS check fixes your issue? I don't have a cross-compilation setup to test this sort of thing, so it'd be good to know. I had half a mind to force 32-byte alignment unconditionally for the non-copying variant of the checksum. Then we actually could avoid the fallback 4x loads, but that would probably make a lot of small checksums really slow.
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/arm/neon_intrins.h (1)
171-171: Consider removing the alignment assumption.
The `__builtin_assume_aligned` hint might be unnecessary since the alignment is already handled by the caller.

```diff
- uint8x16x4_t d0_d3 = vld1q_u8_x4(__builtin_assume_aligned(buf, 32));
+ uint8x16x4_t d0_d3 = vld1q_u8_x4(buf);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- arch/arm/adler32_neon.c (7 hunks)
- arch/arm/arm_functions.h (2 hunks)
- arch/arm/neon_intrins.h (1 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- test/benchmarks/benchmark_adler32_copy.cc
- arch/arm/arm_functions.h
- functable.c
- zbuild.h
🧰 Additional context used
🧠 Learnings (2)
⏰ Context from checks skipped due to timeout of 90000ms (166)
- GitHub Check: macOS Clang ASAN (ARM64)
- GitHub Check: macOS Clang ASAN
- GitHub Check: Windows GCC Compat No Opt
- GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
- GitHub Check: Windows ClangCl Win64
- GitHub Check: Windows ClangCl Win32
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: Ubuntu MinGW i686
- GitHub Check: EL9 GCC S390X DFLTCC ASAN
- GitHub Check: Ubuntu GCC SPARC64
- GitHub Check: Ubuntu GCC PPC64LE
- GitHub Check: Ubuntu GCC PPC
- GitHub Check: Ubuntu GCC MIPS64
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
- GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
- GitHub Check: Ubuntu GCC AARCH64 ASAN
🔇 Additional comments (11)
arch/arm/neon_intrins.h (6)
28-30: LGTM! iOS platform detection is correctly implemented.
The macro definition for iOS platform detection is well-implemented using standard platform-specific macros.
32-34: LGTM! Alignment requirement check is correctly expanded.
The condition now properly handles both Android and iOS platforms, with an accurate comment explaining the strict alignment requirements for x4 NEON intrinsics.
13-21: LGTM! Taps array is correctly aligned for NEON operations.
The array is properly aligned to 64 bytes using the `ALIGNED_` macro, which is essential for optimal NEON vector operations.
23-146: LGTM! NEON_accum32_copy implementation is well-optimized.
The function efficiently combines checksum computation with data copying:
- Uses NEON intrinsics for vectorized operations
- Properly handles alignment and buffer copying
- Maintains the same performance characteristics as the non-copying version
269-291: LGTM! Short length handling is properly implemented.
The function correctly handles edge cases:
- Single byte case with conditional copying
- Short lengths (< 16 bytes) with appropriate helper functions
379-385: LGTM! Public API functions are well-designed.
The functions provide a clean interface:
- `adler32_neon`: Maintains backward compatibility
- `adler32_fold_copy_neon`: New function with copy support
- Both use the shared implementation with appropriate flags
arch/arm/adler32_neon.c (5)
13-21: LGTM! Well-structured coefficient array.
The taps array is properly aligned for SIMD operations and contains the correct coefficient values.
23-146: Excellent optimization for combined copy and checksum calculation!
The implementation efficiently interleaves load, store, and accumulation operations for better instruction-level parallelism. The decision to avoid `vst1q_u8_x4` in favor of individual stores is well-justified for platforms where stores can block instruction-level parallelism.
269-377: Well-structured implementation with thorough edge case handling!
The implementation:
- Efficiently handles alignment requirements
- Uses restrict pointers for better optimization
- Includes detailed comments about platform-specific considerations
- Properly handles all edge cases (single byte, short lengths, tails)
379-385: Clean and efficient API design!
The API functions provide a simple interface while maintaining efficiency:
- `adler32_neon`: Maintains backward compatibility
- `adler32_fold_copy_neon`: Elegantly handles the copy case by checking if `dst` is NULL
171-171: ❓ Verification inconclusive
Consider platform-specific alignment requirements.
The use of `__builtin_assume_aligned(buf, 32)` assumes 32-byte alignment. While this works well for most platforms, some ARM platforms might benefit from 64-byte alignment for cache line optimization.
🏁 Script executed:
```bash
#!/bin/bash
# Check if there are any alignment-related performance notes in the codebase
rg -i "align|cacheline" --type c
```

Length of output: 17472
Action: Revisit Buffer Alignment for Optimal Cache Performance
The current use of `__builtin_assume_aligned(buf, 32)` hardcodes a 32-byte alignment assumption. While 32-byte alignment is generally sufficient for NEON loads, note that several parts of the codebase (e.g., the taps array defined with `ALIGNED_(64)`) demonstrate a preference for 64-byte alignment on some ARM targets. Please verify that the buffer's allocation meets the intended alignment requirements. If the target platform benefits from a 64-byte cache line (or stricter alignment, as seen in related modules), consider either changing the alignment parameter (or using an appropriate macro like `HINT_ALIGNED_64`) or ensuring that the buffer allocation consistently provides the necessary alignment, to avoid potential performance hits or misaligned memory accesses.
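A small sketch of declaring cache-line-aligned static data, written with the plain GCC/Clang attribute rather than the repository's `ALIGNED_` helper; the array name and contents are placeholders.

```c
#include <stdint.h>

/* 64-byte alignment keeps the whole table inside one cache line and
 * satisfies the strictest multi-register NEON load requirements. */
static const uint16_t taps_example[16] __attribute__((aligned(64))) = {
    16, 15, 14, 13, 12, 11, 10, 9,
     8,  7,  6,  5,  4,  3,  2, 1
};
```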
d706600 to cc91055 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/arm/adler32_neon.c (2)
51-55: Consider documenting the ILP optimization choice.
The code uses individual `vst1q_u8` operations instead of `vst1q_u8_x4` to avoid blocking instruction-level parallelism (ILP). Consider adding a comment explaining this optimization decision, similar to the alignment explanation in the code.
324-331: Consider using memcpy for small copies.
For small copies in the alignment adjustment phase, consider using `memcpy` instead of a manual loop. Modern compilers can optimize `memcpy` calls effectively.

```diff
-    while (cpy_len--) {
-        *dst_noalias++ = *src_noalias++;
-    }
+    memcpy(dst_noalias, src_noalias, cpy_len);
```
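If the pointers are still consumed after this head copy, an equivalent replacement also needs to advance them; a sketch of that variant, with the variable names assumed from the suggestion above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy the unaligned head with memcpy and advance both cursors,
 * mirroring what the original byte loop did. */
void copy_head(uint8_t **dst_noalias, const uint8_t **src_noalias,
               size_t cpy_len) {
    memcpy(*dst_noalias, *src_noalias, cpy_len);
    *dst_noalias += cpy_len;
    *src_noalias += cpy_len;
}
```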
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- arch/arm/adler32_neon.c (7 hunks)
- arch/arm/arm_functions.h (2 hunks)
- arch/arm/neon_intrins.h (1 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- arch/arm/neon_intrins.h
- functable.c
- arch/arm/arm_functions.h
- zbuild.h
- test/benchmarks/benchmark_adler32_copy.cc
🧰 Additional context used
🧠 Learnings (2)
⏰ Context from checks skipped due to timeout of 90000ms (57)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: macOS GCC
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: macOS GCC Symbol Prefix (ARM64)
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)
13-21: LGTM! Well-structured coefficient array.

The taps array is properly aligned for optimal NEON vector operations and contains the correct coefficients for Adler32 calculation. (See the scalar sketch after these comments.)

379-385: LGTM! Clean and efficient public interface.

The implementation provides a clean interface with efficient NULL handling for the copy operation. The reuse of the implementation function with a boolean flag is elegant.

306-316: Excellent documentation of platform-specific optimizations.

The detailed comments about platform-specific alignment requirements and optimization choices are valuable for maintainability. This level of documentation helps future developers understand the performance implications of the implementation.
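To make the role of those coefficients concrete (the taps array from the first comment above), here is a scalar model of one weighted block step. The 32-byte block width and all names are assumptions for illustration, not taken from the source, and the modulo-BASE reductions are omitted:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32  /* assumed block width for this sketch */

/* Scalar model: for a block b[0..BLOCK-1], s1 grows by the plain byte sum
 * and s2 grows by BLOCK*s1_old plus the tap-weighted sum
 * BLOCK*b[0] + (BLOCK-1)*b[1] + ... + 1*b[BLOCK-1]. A NEON version reaches
 * the same result with vector multiplies against a descending taps table. */
static void adler_block_model(uint32_t *s1, uint32_t *s2, const uint8_t *b) {
    *s2 += *s1 * BLOCK;
    for (size_t i = 0; i < BLOCK; i++) {
        *s1 += b[i];
        *s2 += (uint32_t)(BLOCK - i) * b[i];
    }
}
```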
As is, this PR fixes the runtime failure on Alpine Linux + clang.
So a lot of alterations had to be done to make this not worse, and so far it's not really better either. I had to force inlining for the adler routine, remove the x4 load instruction (otherwise pipelining stalled), and use restrict pointers with a copy idiom so that GCC would inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly with buffers the size of our window or larger. There's also the added benefit that this will fix zlib-ng#1824.
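A minimal sketch of the restrict-pointer copy idiom mentioned above, assuming the `dst_noalias`, `src_noalias`, and `cpy_len` names seen in the review snippets; the surrounding helper is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: with both pointers qualified restrict, the compiler knows the
 * ranges cannot overlap, so GCC can treat this loop as a simple copy and
 * inline it (or lower it to a memcpy-like sequence) rather than emitting
 * a library call for the tail bytes. */
static inline void copy_tail(uint8_t *restrict dst_noalias,
                             const uint8_t *restrict src_noalias,
                             size_t cpy_len) {
    while (cpy_len--) {
        *dst_noalias++ = *src_noalias++;
    }
}
```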
cc91055 to
b50216e
Compare
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)
268-290: Consider adding early-exit optimization for zero-length input.

While the implementation is solid, consider adding an early return for `len == 0` before the `len == 1` check to avoid unnecessary processing.

```diff
 static Z_FORCEINLINE uint32_t adler32_fold_copy_impl(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len, const int COPY) {
+    if (len == 0)
+        return adler;
+
     /* split Adler-32 into component sums */
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- arch/arm/adler32_neon.c (6 hunks)
- arch/arm/arm_functions.h (2 hunks)
- arch/arm/neon_intrins.h (1 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
- zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- test/benchmarks/benchmark_adler32_copy.cc
- arch/arm/arm_functions.h
- zbuild.h
- functable.c
- arch/arm/neon_intrins.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
- GitHub Check: EL9 Clang S390X DFLTCC MSAN
🔇 Additional comments (4)
arch/arm/adler32_neon.c (4)
13-21: LGTM! Well-structured coefficient array.

The taps array is properly aligned and organized for optimal SIMD access.

23-145: Excellent optimization for combined checksum and copy operations!

The implementation intelligently avoids x4 load instructions when copying is involved, which helps prevent pipeline stalls. The code maintains numerical accuracy while adding copy functionality.
260-266: LGTM! Clean implementation of tail handling.

The helper function correctly processes remaining bytes that can't be handled by SIMD operations.
378-384: LGTM! Clean public interface implementation.

The wrapper functions provide a clear interface while delegating the actual work to the optimized implementation. The `dst != NULL` check in `adler32_fold_copy_neon` is an elegant way to control the copy behavior.
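For reference, here is a sketch of that wrapper pattern, assuming the `adler32_fold_copy_impl(adler, dst, src, len, COPY)` shape discussed in this review; types and qualifiers are paraphrased rather than copied from the source:

```c
#include <stddef.h>
#include <stdint.h>

/* Declaration only: the NEON body lives elsewhere in adler32_neon.c. */
static uint32_t adler32_fold_copy_impl(uint32_t adler, uint8_t *dst,
                                       const uint8_t *src, size_t len,
                                       const int COPY);

uint32_t adler32_neon(uint32_t adler, const uint8_t *buf, size_t len) {
    /* Checksum only: COPY is 0, so the stores are compiled out. */
    return adler32_fold_copy_impl(adler, NULL, buf, len, 0);
}

uint32_t adler32_fold_copy_neon(uint32_t adler, uint8_t *dst,
                                const uint8_t *src, size_t len) {
    /* Copy only when a destination is actually provided. */
    return adler32_fold_copy_impl(adler, dst, src, len, dst != NULL);
}
```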
Dead2
left a comment
LGTM
Hey, this seems to not be needed with Clang 20. Can anyone else confirm?
Correct. I just tested versions 2.2.4 and 2.2.3. Both succeed.
So a lot of alterations had to be done to make this not worse, and so far it's not really better either. I had to force inlining for the adler routine, remove the x4 load instruction (otherwise pipelining stalled), and use restrict pointers with a copy idiom so that GCC would inline a copy routine for the tail.
Still, we see a small benefit in benchmarks, particularly with buffers the size of our window or larger. There's also the added benefit that this will fix #1824.
Summary by CodeRabbit
New Features
Tests