-
-
Notifications
You must be signed in to change notification settings - Fork 308
Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPY #1819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
For AVX2 specifically, our chunkmemset call chain allows for safely dealing with half sized chunks. Because this doesn't alter the code size in any meaningful way, we may as well use macros to define the constant we use for comparison in the "safe" variety of functions.
WalkthroughThe changes in this pull request include the introduction of a new macro definition, Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (6)
arch/generic/chunkset_c.c (1)
13-13: Consider adding a brief comment explaining the macro's purpose.While the macro's name is self-explanatory, a brief comment would help document its role in the safety checks for chunk operations.
+/* Minimum size for safe chunk operations */ #define MIN_CHUNKSIZE sizeof(chunk_t)arch/x86/chunkset_sse2.c (1)
15-15: Consider documenting the performance implications.The introduction of
MIN_CHUNKSIZEallows for more consistent handling of chunk sizes across different architectures while potentially improving performance by allowing more cases to fall through to the optimized SIMD implementations. Consider:
- Adding a comment explaining the rationale behind the minimum chunk size choice
- Documenting any performance implications in the relevant documentation
- Adding benchmarks to measure the impact of this change
Would you like me to help draft the documentation or create benchmark templates?
Also applies to: 44-44
arch/power/chunkset_power8.c (1)
16-16: Architectural improvement for vectorized operationsThis change enables more efficient vectorized operations while maintaining safety:
- Using
sizeof(chunk_t)automatically adapts to the architecture's vector size- Allows the compiler to optimize size checks better
- Maintains consistency with the VSX implementation's capabilities
Consider documenting these architectural assumptions and requirements in a comment above the macro, especially noting:
- The relationship with VSX vector size
- The connection to CHUNK_SIZE
- The usage in safety checks
+/* MIN_CHUNKSIZE defines the minimum safe chunk size for memory operations. + * For POWER8 VSX, this matches the vector size (16 bytes) and CHUNK_SIZE. + * This is used in safety checks for chunkmemset operations. + */ #define MIN_CHUNKSIZE sizeof(chunk_t)arch/riscv/chunkset_rvv.c (2)
Line range hint
89-89: Enhance the assertion message for better debugging.While the assertion correctly checks for non-zero length, the error message could be more descriptive to aid in debugging.
- Assert(len > 0, "chunkcopy should never have a length 0"); + Assert(len > 0, "CHUNKCOPY: Invalid zero length parameter provided");
Line range hint
71-86: Consider benchmarking the memory copying strategies.The implementation intelligently uses different strategies for overlapping and non-overlapping cases, leveraging glibc's RVV-optimized memcpy through IFUNC. While this approach is theoretically sound, it would be valuable to verify the performance characteristics:
- Non-overlapping case: Single large memcpy
- Overlapping case: N-chunk memcpy followed by chunk-by-chunk copying
Consider adding performance benchmarks to validate these strategies across different memory sizes and alignment scenarios.
chunkset_tpl.h (1)
Line range hint
253-285: Performance consideration: Branch prediction hints are properly usedThe use of
UNLIKELYmacro for the size checks helps the compiler optimize the branch prediction, minimizing the performance impact on the common path. This is especially important given the performance-critical nature of these functions, as noted in the previous learnings.The additional safety checks align well with the PR's objective of controlling chunk operation dispatch while maintaining good performance characteristics.
Consider adding comments explaining the performance implications of these checks for future maintainers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (8)
arch/arm/chunkset_neon.c(1 hunks)arch/generic/chunkset_c.c(1 hunks)arch/power/chunkset_power8.c(1 hunks)arch/riscv/chunkset_rvv.c(1 hunks)arch/x86/chunkset_avx2.c(1 hunks)arch/x86/chunkset_sse2.c(1 hunks)arch/x86/chunkset_ssse3.c(1 hunks)chunkset_tpl.h(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- arch/x86/chunkset_avx2.c
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
🔇 Additional comments (8)
arch/generic/chunkset_c.c (1)
13-13: LGTM! Verify consistent usage across architectures.
The MIN_CHUNKSIZE macro is correctly defined based on sizeof(chunk_t) (8 bytes), which aligns with the generic implementation's use of uint64_t. This provides a safe minimum boundary for chunk operations.
Let's verify the consistency of MIN_CHUNKSIZE definitions across different architectures:
✅ Verification successful
MIN_CHUNKSIZE definitions are architecture-aware and correct
The verification shows that MIN_CHUNKSIZE is consistently defined according to each architecture's requirements:
- Generic C implementation:
sizeof(chunk_t)wherechunk_tisuint64_t(8 bytes) - x86 SSE2/SSSE3:
sizeof(chunk_t)wherechunk_tis__m128i(16 bytes) - x86 AVX2:
sizeof(halfchunk_t)wherehalfchunk_tis__m128i(16 bytes) - ARM NEON:
sizeof(chunk_t)wherechunk_tisuint8x16_t(16 bytes) - POWER8:
sizeof(chunk_t)wherechunk_tisvector unsigned char(16 bytes) - RISC-V:
CHUNK_SIZEwhich is defined as 32 bytes
Each architecture correctly defines MIN_CHUNKSIZE based on its vector width and implementation requirements, maintaining proper alignment and size constraints.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across architecture-specific implementations
# Expected: All implementations should define MIN_CHUNKSIZE based on their chunk_t size
# Search for MIN_CHUNKSIZE definitions in architecture-specific files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "define\s+MIN_CHUNKSIZE" arch/
Length of output: 563
Script:
#!/bin/bash
# Let's check the chunk_t and halfchunk_t definitions in these files
echo "=== Checking chunk_t and related type definitions ==="
rg "typedef.*chunk_t" arch/ -B 2 -A 2
echo -e "\n=== Checking CHUNK_SIZE definitions ==="
rg "define.*CHUNK_SIZE" arch/ -B 2 -A 2
echo -e "\n=== Checking halfchunk_t definitions ==="
rg "typedef.*halfchunk_t" arch/ -B 2 -A 2
Length of output: 3547
arch/x86/chunkset_sse2.c (1)
Line range hint 44-44: Verify the usage of MIN_CHUNKSIZE in the template.
The inclusion of chunkset_tpl.h at this point is crucial as it will use the newly defined MIN_CHUNKSIZE macro.
Let's verify how MIN_CHUNKSIZE is used in the template:
✅ Verification successful
Let me gather more context about the usage of MIN_CHUNKSIZE in relation to CHUNKCOPY since that's what's being defined in the reviewed code.
MIN_CHUNKSIZE is correctly defined and used in the template
The verification shows that:
- MIN_CHUNKSIZE is properly defined in chunkset_sse2.c as
sizeof(chunk_t) - The template uses MIN_CHUNKSIZE appropriately for safety checks:
- Before small copies to prevent underflow
- In safe variants to ensure sufficient buffer size
- CHUNKCOPY (which is being defined to use SSE2) is properly integrated with these safety checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of MIN_CHUNKSIZE in the template file
# Expected: The template should use MIN_CHUNKSIZE for size comparisons
# Check how MIN_CHUNKSIZE is used in the template
echo "Checking MIN_CHUNKSIZE usage in template:"
rg -A 3 "MIN_CHUNKSIZE" chunkset_tpl.h
# Look for any size comparisons in the safe variants
echo -e "\nChecking size comparisons in safe variants:"
ast-grep --pattern 'CHUNKMEMSET_SAFE' chunkset_tpl.h
Length of output: 661
Script:
#!/bin/bash
# Check for CHUNKCOPY usage in the template
echo "Checking CHUNKCOPY usage in template:"
rg -A 5 "CHUNKCOPY" chunkset_tpl.h
# Check MIN_CHUNKSIZE definition in the SSE2 file
echo -e "\nChecking MIN_CHUNKSIZE definition:"
rg "MIN_CHUNKSIZE" arch/x86/chunkset_sse2.c
Length of output: 1899
arch/power/chunkset_power8.c (1)
16-16: LGTM! Verify consistency with template usage.
The MIN_CHUNKSIZE definition correctly uses sizeof(chunk_t) which aligns with the VSX vector size on POWER8 and matches the CHUNK_SIZE definition. This enables efficient vectorized operations.
Let's verify the consistency across architectures and template usage:
✅ Verification successful
Let me gather more information about the template usage since the previous command failed to locate the template file.
MIN_CHUNKSIZE definition is correct and consistent with template usage
The verification shows that:
- The
MIN_CHUNKSIZEdefinition inarch/power/chunkset_power8.ccorrectly matches the template usage inchunkset_tpl.hwhere it's used for safety checks before memory operations - It's consistent with other architecture implementations:
- Most architectures define it as
sizeof(chunk_t)matching their vector types - AVX2 uses
sizeof(halfchunk_t)due to its wider vectors - RISC-V uses explicit
CHUNK_SIZE(32)
- Most architectures define it as
- The Power8 implementation's
CHUNK_SIZEof 16 bytes aligns with itschunk_t(vector unsigned char) size
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify MIN_CHUNKSIZE definitions across architectures and its usage in templates
# Check MIN_CHUNKSIZE definitions across all architecture files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "MIN_CHUNKSIZE" arch/
# Verify usage in template implementation
echo -e "\nChecking template usage:"
rg "MIN_CHUNKSIZE" "*tpl.h"
# Verify chunk_t definitions to ensure size consistency
echo -e "\nChecking chunk_t definitions:"
rg "typedef.*chunk_t" arch/
Length of output: 1282
Script:
#!/bin/bash
# Find the template file first
echo "Locating template files:"
fd ".*tpl\.h"
# Check MIN_CHUNKSIZE usage in chunkset implementation
echo -e "\nChecking MIN_CHUNKSIZE usage in implementation:"
rg -A 3 "MIN_CHUNKSIZE" --type c
# Check if there are any other size-related checks in chunkset files
echo -e "\nChecking other size-related patterns:"
rg "CHUNK_SIZE|chunk_size" --type c arch/
Length of output: 3004
arch/x86/chunkset_ssse3.c (1)
17-17: LGTM! Consider adding a brief comment explaining the macro's purpose.
The MIN_CHUNKSIZE macro is correctly defined based on the SSSE3 vector size. This aligns with the PR's objective of standardizing chunk size handling across architectures.
Consider adding a brief comment explaining that this defines the minimum safe chunk size for SSSE3 operations:
+/* Minimum chunk size for safe SSSE3 vector operations */
#define MIN_CHUNKSIZE sizeof(chunk_t)Let's verify how this constant is used in the template file:
✅ Verification successful
Let me gather more context about how these MIN_CHUNKSIZE checks are used in the template file.
The MIN_CHUNKSIZE macro is correctly used as a safety threshold
The verification confirms that MIN_CHUNKSIZE is properly used in the template file for safety checks:
- Used as a threshold to fall back to byte-by-byte operations when remaining length is too small
- Ensures safe handling of small chunks in both chunkcopy and chunkmemset operations
- Correctly guards against potential buffer overruns when vector operations cannot be safely performed
The original review comment's suggestion to add documentation remains valid, but the technical implementation and usage are sound.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of MIN_CHUNKSIZE in the template implementation
# Expected: Should find usage in safety checks within chunkmemset/chunkcopy functions
# Search for MIN_CHUNKSIZE usage in template file
rg "MIN_CHUNKSIZE" "chunkset_tpl.h"
Length of output: 149
Script:
#!/bin/bash
# Get surrounding context of MIN_CHUNKSIZE usage
rg -B 5 -A 5 "MIN_CHUNKSIZE" "chunkset_tpl.h"
Length of output: 545
arch/arm/chunkset_neon.c (2)
16-16: LGTM! Macro definition aligns with NEON vector size.
The MIN_CHUNKSIZE definition as sizeof(chunk_t) is correct for NEON operations, as it maps to a 16-byte vector size which is optimal for NEON SIMD instructions.
16-16: Verify consistency with other architectures.
Let's ensure this definition maintains consistency with other architecture-specific implementations.
✅ Verification successful
MIN_CHUNKSIZE definition is consistent across architectures
The verification shows that MIN_CHUNKSIZE is consistently defined relative to the architecture-specific chunk sizes:
- Most architectures (ARM NEON, x86 SSE2/SSSE3, POWER8, Generic C):
MIN_CHUNKSIZE = sizeof(chunk_t) - x86 AVX2:
MIN_CHUNKSIZE = sizeof(halfchunk_t)(justified as AVX2 uses 256-bit vectors with halfchunk support) - RISC-V:
MIN_CHUNKSIZE = CHUNK_SIZE(uses a different but equivalent approach)
Each architecture appropriately defines its chunk_t type based on its vector capabilities:
- ARM NEON: uint8x16_t (128-bit)
- x86 SSE2/SSSE3: __m128i (128-bit)
- x86 AVX2/AVX512: __m256i (256-bit)
- POWER8: vector unsigned char
- Generic C: uint64_t
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across different architectures
# Expected: All architecture-specific implementations should define MIN_CHUNKSIZE
# consistently relative to their vector sizes
echo "Checking MIN_CHUNKSIZE definitions across architectures..."
rg "MIN_CHUNKSIZE" arch/*/chunk*.{c,h} -B 2 -A 2
echo -e "\nChecking chunk_t definitions..."
rg "typedef.*chunk_t" arch/*/chunk*.{c,h} -B 1 -A 1
Length of output: 3647
chunkset_tpl.h (2)
Line range hint 253-259: LGTM: Safe fallback for small buffer sizes
The check for MIN_CHUNKSIZE provides a safe fallback to byte-by-byte copy when the remaining buffer space is too small for chunk operations. This aligns with the PR objective of controlling chunk operation dispatch while maintaining safety.
Let's verify the definition of MIN_CHUNKSIZE across architectures:
✅ Verification successful
Let me verify the definition of chunk_t and halfchunk_t to ensure the MIN_CHUNKSIZE values are consistent with the architecture-specific requirements.
MIN_CHUNKSIZE check is correctly defined and used across architectures
The MIN_CHUNKSIZE check in the fallback logic is correctly implemented, with architecture-specific definitions that match their respective chunk_t sizes:
- x86 SSE2/SSSE3: 16 bytes (size of __m128i)
- x86 AVX2: 16 bytes (size of halfchunk_t/__m128i)
- ARM NEON: 16 bytes (size of uint8x16_t)
- Generic: 8 bytes (size of uint64_t)
- POWER8: 16 bytes (size of vector)
- RISC-V: 32 bytes (explicitly defined)
The fallback to byte-by-byte copy when buffer size is less than MIN_CHUNKSIZE is a safe and correct approach that maintains compatibility across all supported architectures.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check MIN_CHUNKSIZE definitions across architecture-specific files
# Expect: MIN_CHUNKSIZE should be defined consistently based on chunk_t or halfchunk_t
rg -A 2 "define.*MIN_CHUNKSIZE" arch/
Length of output: 1287
Script:
#!/bin/bash
# Check chunk_t and halfchunk_t definitions
rg -A 2 "typedef.*chunk_t|typedef.*halfchunk_t" --type c --type h
# Check CHUNK_SIZE definition for RISC-V
rg "define.*CHUNK_SIZE" arch/riscv/
Length of output: 1294
Line range hint 279-285: LGTM: Comprehensive safety checks for copy operations
The dual check against MIN_CHUNKSIZE for both from_dist and safelen ensures safe operation by falling back to byte-by-byte copy when either:
- The source and destination are too close (from_dist < MIN_CHUNKSIZE)
- The safe buffer space is too small (safelen < MIN_CHUNKSIZE)
This maintains safety while allowing more cases to fall through to CHUNKCOPY when conditions permit.
Let's verify the relationship between these checks and the chunk operations:
✅ Verification successful
Based on the search results, I can now verify the relationship between the safety checks and chunk operations. The codebase shows:
-
The
MIN_CHUNKSIZEcheck is consistently defined and used across different architectures (SSE2, AVX2, AVX512, NEON, Power8, RVV) to ensure safe chunk operations. -
The dual check against
MIN_CHUNKSIZEfor bothfrom_distandsafelenis used in the safe copy operations (CHUNKCOPY_SAFE) to prevent buffer overruns by falling back to byte-by-byte copy when:- The source and destination are too close (
from_dist < MIN_CHUNKSIZE) - The safe buffer space is too small (
safelen < MIN_CHUNKSIZE)
- The source and destination are too close (
-
This safety mechanism is consistently implemented across all architecture-specific implementations, with each having its own optimized chunk size but maintaining the same safety checks.
LGTM: The safety checks are correctly implemented and essential for safe chunk operations
The dual check against MIN_CHUNKSIZE is a critical safety mechanism that ensures:
- Safe fallback to byte-by-byte copy when buffers are too close
- Prevention of buffer overruns by checking available space
- Consistent safety across all architecture-specific implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential chunk size assumptions in the codebase
# Expect: No unsafe assumptions about minimum distances or buffer sizes
# Look for related size checks and chunk operations
rg -A 5 "chunk_t|halfchunk_t|CHUNKCOPY|CHUNKMEMSET"
# Look for potential safety-related comments or assertions
rg -A 2 "safe|assert.*chunk"
Length of output: 70984
| #define HAVE_CHUNKMEMSET_2 | ||
| #define HAVE_CHUNKMEMSET_4 | ||
| #define HAVE_CHUNKMEMSET_8 | ||
| #define MIN_CHUNKSIZE sizeof(chunk_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
The MIN_CHUNKSIZE definition is inconsistent with AVX2/AVX512 implementations
The verification revealed that while MIN_CHUNKSIZE is consistently defined as sizeof(chunk_t) across architectures, the underlying chunk_t type varies:
- SSE2/SSSE3:
__m128i(16 bytes) - AVX2/AVX512:
__m256i(32 bytes) - NEON:
uint8x16_t(16 bytes) - POWER8:
vector unsigned char(16 bytes) - Generic:
uint64_t(8 bytes)
This inconsistency in chunk_t sizes means the SSE2 implementation's MIN_CHUNKSIZE (16 bytes) differs from AVX2/AVX512 (32 bytes), which could lead to suboptimal performance or incorrect behavior when switching between implementations.
- Consider aligning
MIN_CHUNKSIZEwith the largest chunk size (32 bytes) across all x86 implementations - Review if the template code in
chunkset_tpl.hcorrectly handles these size differences
🔗 Analysis chain
LGTM! The MIN_CHUNKSIZE definition is correct and well-placed.
The macro definition correctly uses sizeof(chunk_t) which evaluates to 16 bytes (size of __m128i). This aligns with SSE2's natural vector size and matches the PR's objective of standardizing chunk size handling across architectures.
Let's verify the consistency of MIN_CHUNKSIZE across other architecture files:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify MIN_CHUNKSIZE definition consistency across architecture files
# Expected: All architecture files should define MIN_CHUNKSIZE similarly
# Search for MIN_CHUNKSIZE definitions in architecture files
echo "Checking MIN_CHUNKSIZE definitions:"
rg "MIN_CHUNKSIZE.*sizeof\(chunk_t\)" arch/
# Verify chunk_t definitions to ensure correct sizes
echo -e "\nChecking chunk_t definitions:"
rg "typedef.*chunk_t" arch/
Length of output: 1037
| #define HAVE_CHUNKMEMSET_2 | ||
| #define HAVE_CHUNKMEMSET_4 | ||
| #define HAVE_CHUNKMEMSET_8 | ||
| #define MIN_CHUNKSIZE CHUNK_SIZE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Inconsistent chunk size definition needs attention
The RISC-V implementation defines CHUNK_SIZE as 32 and MIN_CHUNKSIZE as CHUNK_SIZE, but this differs from other architectures where:
- Generic C: 8 bytes (uint64_t)
- x86 (SSE2/SSSE3): 16 bytes (__m128i)
- x86 (AVX2): 32 bytes (__m256i)
- ARM NEON: 16 bytes (uint8x16_t)
- POWER8: 16 bytes (vector unsigned char)
The RISC-V implementation uses a struct with 4 uint64_t (32 bytes) because "We don't have a 32-byte datatype for RISC-V arch". While the size is correctly matched to the struct, this inconsistency with other architectures' MIN_CHUNKSIZE definitions (which use sizeof(chunk_t)) should be reviewed.
🔗 Analysis chain
LGTM! Verify chunk size consistency.
The MIN_CHUNKSIZE macro definition correctly matches the CHUNK_SIZE of 32 bytes, which aligns with the chunk_t structure size (4 × uint64_t).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify consistent chunk size definitions across architecture-specific files
rg -g 'arch/*/*.c' -g 'arch/*/*.h' "MIN_CHUNKSIZE|CHUNK_SIZE"
Length of output: 901
Script:
#!/bin/bash
# Let's verify the chunk_t definitions to understand the size variations
rg -g 'arch/*/*.h' -g 'arch/*/*.c' "typedef.*chunk_t" -A 3
Length of output: 1436
Script:
#!/bin/bash
# Let's check the RISC-V specific chunk_t definition and structure
rg -g 'arch/riscv/*' "chunk_t|struct chunk" -B 2 -A 4
Length of output: 3430
|
Hmm, actually I see one case where this may not be true...I could remedy it but the cure might be worse than the disease. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1819 +/- ##
========================================
Coverage 32.18% 32.18%
========================================
Files 67 67
Lines 5752 5752
Branches 1237 1237
========================================
Hits 1851 1851
Misses 3644 3644
Partials 257 257 ☔ View full report in Codecov by Sentry. |
Develop#1819 Allow more cases to fall through to CHUNKMEMSET/CHUNKCOPYIt is not much, but it seems to be slightly faster in my AVX2 tests at least. |
|
For this to handle all the possible corner cases it'd be a bit messy (way messier than it is with AVX512 where a simple mask can be used unconditionally). It's probably not worth the bug surface area to do this. |
For AVX2 specifically, our chunkmemset call chain allows for safely dealing with half sized chunks. Because this doesn't alter the code size in any meaningful way, we may as well use macros to define the constant we use for comparison in the "safe" variety of functions.
Summary by CodeRabbit
New Features
MIN_CHUNKSIZEto standardize minimum chunk sizes across various architectures.MIN_CHUNKSIZEfor improved safety and consistency.Bug Fixes
Documentation