Make an AVX512 inflate fast with low cost masked writes #1805
Conversation
Walkthrough

The pull request introduces modifications to the zlib library's build configuration and implementation for enhanced support of AVX512 and AVX2 architectures. Key changes include the addition of a new source file for AVX512 optimizations, updates to CMake and Makefile configurations to include new compilation flags and object targets, and enhancements to CPU feature detection logic. New inline functions for memory operations using AVX512 are defined, while existing functions are adjusted for improved performance and clarity. The overall structure of the build process remains intact, focusing on expanding architecture-specific capabilities.

Changes
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (10)
arch/x86/x86_features.h (2)
17-17: LGTM! Consider adding a brief comment.

The addition of `has_bmi2` is well-placed and aligns with the PR's optimization goals. Consider adding a brief comment explaining BMI2's relevance to the AVX512 optimizations.

```diff
+    /* BMI2 support required for efficient bit manipulation operations */
     int has_bmi2;
```
Line range hint 9-25: Consider documenting CPU feature flags.

To improve maintainability, consider adding documentation for all CPU feature flags in the structure. This would help developers understand which features are required for specific optimizations and their interdependencies.

Example documentation style:

```diff
 struct x86_cpu_features {
+    /* Advanced Vector Extensions 2 support */
     int has_avx2;
+    /* AVX-512 Foundation instructions */
     int has_avx512f;
+    /* AVX-512 Doubleword and Quadword instructions */
     int has_avx512dq;
     // ... (document remaining fields)
 };
```

arch/x86/avx2_tables.h (3)
6-8: Documentation needs enhancement

The current comment is insufficient to understand the table's purpose and structure. Consider adding:
- Purpose of the lut_rem_pair structure
- Explanation of the offset calculation (32 and 16 multipliers)
- Meaning of the right-side comments
40-42: Add documentation for the half_rem_vals table

The table lacks essential documentation explaining:
- Its purpose and relationship to the permutation table
- The significance of these specific values
- How these values are used in the AVX2 operations
1-44: Consider reorganizing shared lookup tables

Since these tables might be used by both AVX2 and AVX512 implementations (based on the PR description), consider:
- Moving shared tables to a common location
- Adding documentation about which SIMD implementations use these tables
- Explaining any differences in how AVX2 and AVX512 utilize these values
arch/x86/chunkset_avx2.c (2)
Line range hint 52-89: Consider documenting performance implications of table reorganization

The `GET_CHUNK_MAG` and `GET_HALFCHUNK_MAG` functions rely heavily on lookup tables for efficient permutation and remainder calculations. While moving tables to a separate header is good for organization, it's important to ensure:
- The compiler can still efficiently access the tables (ideally keeping them in L1 cache)
- The performance characteristics remain unchanged
Consider adding comments documenting:
- Expected performance characteristics
- Cache alignment requirements for the tables
- Any specific compiler optimizations needed
Also applies to: 108-120
Line range hint 1-150: Clarify relationship with AVX512 optimizations

While this file contains AVX2 implementations, the table reorganization appears to be preparatory work for the AVX512 optimizations mentioned in the PR objectives. To ensure smooth integration:
- Document how the shared tables support both AVX2 and AVX512 implementations
- Consider adding comments explaining the relationship between AVX2 and AVX512 code paths
This will help maintainers understand the design decisions and make future optimizations easier.
arch/x86/x86_functions.h (1)
152-157: LGTM! Comprehensive AVX512 feature detection with proper native function mappings.

The implementation correctly checks for all required AVX512 extensions (F, DQ, BW, VL) before enabling the optimizations. This ensures the code only runs on fully compatible hardware.
Consider documenting the minimum CPU generation required for these extensions (e.g., Intel Skylake-X or newer) in a comment to help with deployment planning.
functable.c (1)
130-134: Consider documenting AVX512 masked move optimization strategy.

Given that this implementation leverages AVX512 masked moves to optimize the inflate operation and reduce `chunkcopy_safe` fallbacks, consider:
- Adding comments explaining the optimization strategy
- Documenting performance characteristics and trade-offs
- Adding benchmarks to verify the improvements
Would you like me to help draft the documentation or benchmark suite?
arch/x86/chunkset_avx512.c (1)
70-93: Optimize `CHUNKCOPY` by avoiding an unnecessary masked store when `rem` is zero

In the `CHUNKCOPY` function, if `len` is a multiple of `sizeof(chunk_t)`, `rem` will be zero, and the masked store on line 79 is unnecessary. Consider adding a condition to skip the masked store when `rem` is zero to improve performance.

Apply this diff to optimize the function:
```diff
 unsigned rem = len % sizeof(chunk_t);
 mask_t rem_mask = gen_mask(rem);
 chunk_t chunk;
-loadchunk(from, &chunk);
-_mm256_mask_storeu_epi8(out, rem_mask, chunk);
-out += rem;
-from += rem;
-len -= rem;
+if (rem > 0) {
+    loadchunk(from, &chunk);
+    _mm256_mask_storeu_epi8(out, rem_mask, chunk);
+    out += rem;
+    from += rem;
+    len -= rem;
+}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
- CMakeLists.txt (1 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/avx2_tables.h (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/x86_features.c (2 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (2 hunks)
- chunkset_tpl.h (8 hunks)
- cmake/detect-intrinsics.cmake (2 hunks)
- configure (2 hunks)
- functable.c (1 hunks)
- inffast_tpl.h (1 hunks)
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (5)
- Learnt from: KungFuJesus, PR: zlib-ng/zlib-ng#1802, File: chunkset_tpl.h:135-135, Timestamp: 2024-10-07T21:23:13.401Z. Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
- Learnt from: KungFuJesus, PR: zlib-ng/zlib-ng#1802, File: chunkset_tpl.h:92-108, Timestamp: 2024-10-08T19:37:14.998Z. Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
- Learnt from: KungFuJesus, PR: zlib-ng/zlib-ng#1802, File: chunkset_tpl.h:92-108, Timestamp: 2024-10-07T22:00:02.180Z. Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
- Learnt from: KungFuJesus, PR: zlib-ng/zlib-ng#1802, File: chunkset_tpl.h:263-271, Timestamp: 2024-10-07T21:56:31.870Z. Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
- Learnt from: KungFuJesus, PR: zlib-ng/zlib-ng#1802, File: chunkset_tpl.h:263-271, Timestamp: 2024-10-08T19:37:14.998Z. Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🔇 Additional comments (19)
arch/x86/avx2_tables.h (2)
1-5: LGTM: Header structure is well-organized

The header guards and include directive are properly structured.
8-38: Verify table pattern correctness

The table shows systematic patterns in offset calculations and remainder values. Let's verify the implementation:
✅ Verification successful
Let me analyze the table pattern and verification results:
The table `perm_idx_lut` shows a systematic pattern:
- First section (indices 3-15): Uses multiples of 32 (0-10 * 32)
- Second section (indices 17-31): Uses base offset of 11*32 plus multiples of 16
- The remainder values follow a logical decreasing pattern
The search results show:
- This is the only instance of `perm_idx_lut` in the codebase
- The constants 32 and 16 are used extensively in AVX2/AVX512 SIMD operations
- This table appears to be part of AVX2-specific optimizations
Let me verify the usage context:
Let me check the broader context and usage of this table:
Let me check the content of the chunk permute table and related files:
Table pattern is correct and consistent with the implementation
The table pattern in `perm_idx_lut` is verified to be correct based on the following evidence:

- The table is used in conjunction with `permute_table` from `chunk_permute_table.h` for AVX2-specific optimizations
- The pattern follows the documented structure in `chunk_permute_table.h`:
  - First section (indices 3-15) uses 32-byte aligned offsets for AVX2 vector operations
  - Second section (indices 17-31) uses 16-byte offsets after the base offset of 11*32 to handle cross-lane permutations
  - The remainder values are correctly decreasing to handle the edge cases as documented
- The implementation is consistent with the SSSE3 version in `chunkset_ssse3.c`, which uses a similar but smaller table structure

The table's pattern is mathematically sound and properly aligned for AVX2 SIMD operations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Check if this table is referenced correctly and if similar patterns exist elsewhere

# Look for other permutation tables and their usage
rg -A 2 "static const lut_rem_pair" --type cpp

# Find references to this table
rg "perm_idx_lut" --type cpp

# Look for related constants (32 and 16 multipliers)
rg "32[^0-9]|16[^0-9]" --type cpp arch/x86/
```

Length of output: 22946
Script:
```shell
#!/bin/bash
# Find where this table is used
rg -B 5 -A 5 "perm_idx_lut" --type cpp

# Look for related AVX2 permutation functions
rg "perm.*avx2" --type cpp
```

Length of output: 802
Script:
```shell
#!/bin/bash
# Look for the chunk permute table header that's included
rg -l "chunk_permute_table.h"

# Check the content of the chunk permute table header
cat ../generic/chunk_permute_table.h

# Look for any AVX2-specific permutation functions
rg -B 3 -A 3 "permute.*avx2|avx2.*permute" --type cpp
```

Length of output: 317
Script:
```shell
#!/bin/bash
# Check the content of chunk_permute_table.h from the repository root
cat arch/generic/chunk_permute_table.h

# Look for files that use this table
rg -B 5 -A 5 "lut_rem_pair" --type cpp

# Check the chunkset implementation that might use this table
cat arch/x86/chunkset_ssse3.c
```

Length of output: 7574
arch/x86/x86_features.c (2)
100-101: LGTM! BMI2 detection implementation is correct.

The BMI2 feature bit (bit 8 in EBX, per Intel's documentation; bit 3 is BMI1) is correctly checked from CPUID leaf 7, and will properly detect the presence of the BMI2 instructions needed for optimized masked operations.
Let's verify the BMI2 detection works as expected:
✅ Verification successful
Based on the results, I can now generate the final response:
BMI2 detection implementation needs attention - potential inconsistency with AVX512 requirements
The verification revealed that BMI2 is not just used for optional optimizations, but is actually a required feature for AVX512_common support:
- In `arch/x86/x86_features.c`, BMI2 is part of the mandatory requirements: `features->has_avx512_common = ... && features->has_bmi2`
- However, there's no documentation or clear rationale explaining why BMI2 is a mandatory requirement for AVX512_common
This requirement seems unusual since BMI2 is typically not a prerequisite for AVX512 operations. Let's verify this further:
BMI2 requirement for AVX512_common is intentional and correct
The verification reveals that BMI2 is correctly required as part of AVX512_common for this codebase:
The configure script and CMake files consistently include BMI2 with AVX512 flags:

- `avx512flag="-mavx512f -mavx512dq -mavx512bw -mavx512vl -mbmi2"`
- This is applied across all build systems (Makefile.in, CMakeLists.txt)

The requirement is used in specific optimized implementations:

- `adler32_avx512` and related functions
- `chunkset_avx512` for memory operations
- `vpclmulqdq` operations that depend on AVX512_common

The BMI2 requirement is part of the architectural design, as evidenced by the compiler flags and feature detection logic working together to ensure optimal performance for the AVX512 implementations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify BMI2 detection is properly used in the codebase

# Test 1: Check if BMI2 detection is consistently implemented across the codebase
rg -A 2 -B 2 "has_bmi2"

# Test 2: Look for BMI2-specific instructions that might be used
ast-grep --pattern 'asm { $$$ pext $$$ }'
```

Length of output: 975
Script:
```shell
#!/bin/bash
# Check for any documentation or comments explaining the BMI2 requirement
rg -B 5 "avx512.*common"

# Look for actual AVX512 instruction usage that might depend on BMI2
rg "avx512" -A 2 -B 2
```

Length of output: 33239
113-113: Verify the requirement of BMI2 for AVX512 common features.

Adding BMI2 as a requirement for AVX512 common features is a significant change that could potentially disable AVX512 optimizations on some processors. While this aligns with the PR's goal of optimizing masked operations, please confirm:
- Are there target processors that support AVX512 but not BMI2?
- Is disabling AVX512 features on such processors (if they exist) an acceptable trade-off?
Let's check the processor support matrix:
✅ Verification successful
Based on the code analysis, I can now provide a definitive response about the BMI2 requirement for AVX512.
BMI2 requirement for AVX512 is correct and intentional
The requirement of BMI2 for AVX512 features is correct because:
BMI2 instructions are essential for optimal performance of AVX512 masked operations, as they provide efficient bit manipulation capabilities needed by the AVX512 implementation.
All modern processors that support the full AVX512 feature set (F, DQ, BW, VL) also support BMI2:

- The code checks for `has_avx512_common = has_avx512f && has_avx512dq && has_avx512bw && has_avx512vl && has_bmi2`
- This combination is standard on Intel processors starting from Skylake-X/Skylake-SP, which all include BMI2
The codebase has optimized implementations that leverage both AVX512 and BMI2 instructions together, as seen in the compilation flags and feature detection code.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Search for processor compatibility information in documentation

# Test 1: Look for processor support documentation
fd -g "*.md" -g "*.txt" -X rg -i "processor.*(support|compat)"

# Test 2: Check if there are any existing CPU detection tests
fd -g "*test*.c" -X rg "avx512.*bmi2"
```

Length of output: 197
Script:
```shell
#!/bin/bash
# Let's try a different approach to find relevant information

# Search for CPU feature detection and requirements
rg -A 5 -B 5 "has_avx512"

# Look for any documentation about CPU features
fd "README|CONTRIBUTING|docs" -e md -e txt -X cat

# Search for test files or documentation containing AVX512 or BMI2
rg -i "avx512|bmi2"
```

Length of output: 60852
arch/x86/chunkset_avx2.c (1)
7-7: Verify table definitions in avx2_tables.h

The switch from `chunk_permute_table.h` to `avx2_tables.h` suggests a reorganization of lookup tables. Please ensure that:

- All required tables (`perm_idx_lut`, `permute_table`, `half_rem_vals`) are properly defined in the new header
- The table definitions are equivalent to maintain correctness
✅ Verification successful
Header change is correctly implemented and maintains compatibility
The switch from directly including `chunk_permute_table.h` to using `avx2_tables.h` is a well-structured change that:

- Maintains all required functionality by including the generic `chunk_permute_table.h`, which contains the shared permutation table
- Properly defines AVX2-specific lookup tables (`perm_idx_lut` and `half_rem_vals`) with correct values
- Organizes the code better by separating architecture-specific tables while reusing common tables
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify the table definitions in the new header

# Check if the required tables are defined in avx2_tables.h
rg -A 5 "perm_idx_lut|permute_table|half_rem_vals" "arch/x86/avx2_tables.h"

# Compare with old definitions
rg -A 5 "perm_idx_lut|permute_table|half_rem_vals" "arch/x86/generic/chunk_permute_table.h"
```

Length of output: 827
Script:
```shell
#!/bin/bash
# Let's verify the content and structure of both files

# First, check if the old file exists in a different location
fd "chunk_permute_table.h"

# Then get the content of avx2_tables.h to understand its complete structure
cat "arch/x86/avx2_tables.h"

# Also search for any references to these tables in the codebase
rg -A 2 "perm_idx_lut|half_rem_vals" --type c
```

Length of output: 4091
Script:
```shell
#!/bin/bash
# Let's check the content of the generic chunk_permute_table.h to complete our verification
cat "arch/generic/chunk_permute_table.h"
```

Length of output: 3452
arch/x86/Makefile.in (2)
11-12: LGTM: Addition of BMI2 support enhances AVX512 capabilities.

The addition of `-mbmi2` to both AVX512FLAG and AVX512VNNIFLAG is appropriate for supporting efficient bit manipulation operations, which aligns well with the PR's goal of optimizing masked moves in the inflate operation.
34-34: LGTM: New AVX512 build targets follow consistent patterns.

The addition of `chunkset_avx512.o` and `chunkset_avx512.lo` targets maintains consistency with existing AVX targets and correctly applies the necessary compilation flags, including NOLTOFLAG for optimization.

Let's verify the consistency of the new targets with existing ones:
Also applies to: 56-61
✅ Verification successful
Verified: AVX512 build targets follow consistent patterns with proper flags
The verification confirms that:
- The new `chunkset_avx512` targets use the same flag patterns as other AVX512 implementations (`adler32_avx512`, `crc32_vpclmulqdq`)
- All chunkset implementations (SSE2, SSSE3, AVX2, AVX512) consistently use NOLTOFLAG
- Both object (.o) and PIC (.lo) targets are properly defined with appropriate flags
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify that all AVX512 targets follow consistent patterns

# Test: Check for consistent flag usage across AVX512 targets
rg "AVX512FLAG.*-c.*\.c$" arch/x86/Makefile.in

# Test: Verify NOLTOFLAG is used consistently
rg "NOLTOFLAG.*chunkset.*\.c$" arch/x86/Makefile.in
```

Length of output: 1484
arch/x86/x86_functions.h (1)
49-51: LGTM! Function declarations follow established patterns.

The new AVX512 function declarations maintain consistency with existing SSE2/AVX2 implementations while preserving the safety-conscious naming convention (e.g., `chunkmemset_safe_avx512`).

Let's verify the implementation maintains the safety guarantees:
✅ Verification successful
Let's first locate the implementation files before diving into the specific function details.
Function declarations are consistent and implementation maintains safety guarantees
The verification confirms that the AVX512 implementation:
- Uses masked operations (`_mm256_mask_storeu_epi8`, `_mm_maskz_loadu_epi8`) for bounds checking
- Implements proper mask generation functions (`gen_mask`, `gen_half_mask`) using `_bzhi_u32`
- Follows the template-based implementation pattern with `chunkset_tpl.h`

The implementation aligns with the PR objectives of using masked moves for safety and optimizing AVX512 operations while maintaining proper bounds checking.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify that the AVX512 implementation maintains bounds checking
# Look for bounds checking in the implementation file

# Search for bounds checking patterns in the implementation
rg -A 5 "chunkmemset_safe_avx512.*\{"
```

Length of output: 39
Script:
```shell
#!/bin/bash
# Let's try a broader search to locate the AVX512 implementation files
fd -e c -e cpp -e h "avx512" --exec rg -l "chunkmemset_safe_avx512"

# Also search for any AVX512-related files to understand the implementation structure
fd "avx512|AVX512"
```

Length of output: 227
Script:
```shell
#!/bin/bash
# Now that we found the implementation file, let's examine the chunkmemset_safe_avx512 function
# Looking for bounds checking and safety measures
rg -A 15 "chunkmemset_safe_avx512" arch/x86/chunkset_avx512.c

# Also check for any mask-related operations as mentioned in PR objectives
rg "mask|_mm512_mask" arch/x86/chunkset_avx512.c
```

Length of output: 1363
chunkset_tpl.h (5)
90-90: LGTM: Improved conditional compilation guard.

The additional check `!defined(HAVE_HALFCHUNKCOPY)` prevents potential conflicts with architecture-specific implementations.
167-168: LGTM: Improved remainder handling with masked writes.

The changes properly track length after halfchunk operations and optimize remainder handling using masked writes when available.
Also applies to: 222-226
Line range hint 251-261: LGTM: Added fallback for non-masked write architectures.

The implementation safely handles small lengths when masked writes aren't available, maintaining proper alignment requirements.
128-136: LGTM: Added masked write optimization for small copies.

The implementation correctly handles the edge case of reading ahead in the buffer while maintaining safety. For small lengths (≤ sizeof(chunk_t)), it efficiently falls back to CHUNKCOPY.
Let's verify the masked write implementation across the codebase:
```shell
#!/bin/bash
# Description: Check for consistent masked write handling

# Look for other instances where HAVE_MASKED_READWRITE is used
rg "HAVE_MASKED_READWRITE" -B 2 -A 2

# Look for potential masked write operations
ast-grep --pattern 'storechunk_mask($_, $_, $_)'
```
Line range hint 269-289: Consider potential length truncation in CHUNKCOPY_SAFE.

While changing `len` to `uint64_t` improves length handling, casting it to `unsigned` in the CHUNKMEMSET call could potentially truncate large values. Consider handling this case explicitly.

Let's verify the maximum possible length:
Consider applying this change:
```diff
-    return CHUNKMEMSET(out, from, (unsigned)len);
+    if (len > UINT_MAX) {
+        /* Handle large lengths in chunks */
+        while (len > UINT_MAX) {
+            out = CHUNKMEMSET(out, from, UINT_MAX);
+            from += UINT_MAX;
+            len -= UINT_MAX;
+        }
+    }
+    return CHUNKMEMSET(out, from, (unsigned)len);
```

functable.c (1)
132-132: LGTM! The AVX512 inflate implementation is properly integrated.

The assignment follows the correct pattern and is properly guarded by the AVX512 feature check. The implementation maintains thread safety through atomic assignment and memory barriers.
Let's verify the AVX512 implementation is properly declared:
✅ Verification successful
Let me verify the declaration and consistency of the AVX512 implementation with one more check:
AVX512 implementation is properly integrated and declared
The verification confirms:
- `inflate_fast_avx512` is correctly declared in `arch/x86/x86_functions.h`
- The implementation follows the template pattern in `arch/x86/chunkset_avx512.c`
- The feature detection using `has_avx512_common` is consistent with other AVX512 optimizations
- The function assignment is placed in the correct AVX512 conditional block alongside other AVX512 functions
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify inflate_fast_avx512 is properly declared and consistent with other AVX512 functions

# Test 1: Check if inflate_fast_avx512 is declared in header files
rg -l "inflate_fast_avx512"

# Test 2: Verify consistent AVX512 feature detection usage
rg "has_avx512" -A 2 -B 2
```

Length of output: 5127
Script:
```shell
#!/bin/bash
# Check the declaration and implementation of inflate_fast_avx512
rg "inflate_fast_avx512" -A 2 -B 2
```

Length of output: 1173
cmake/detect-intrinsics.cmake (1)
79-79: LGTM! Adding BMI2 support enhances AVX512 optimizations.

The addition of the `-mbmi2` flag to AVX512 and AVX512VNNI configurations is a good improvement. BMI2 instructions complement AVX512 masked operations, potentially improving the performance of bit manipulation tasks.

Let's verify that the compiler correctly detects and enables these features:
Also applies to: 86-86, 117-117, 122-122
✅ Verification successful
Let me gather more information about the AVX512 and BMI2 usage in the codebase to ensure there are no conflicts or missing configurations.
Based on the gathered information, I can now provide a final response:
BMI2 flag addition is correctly integrated with AVX512 optimizations
The addition of the `-mbmi2` flag is well-justified and properly implemented:

- BMI2 instructions (`_bzhi_u32`) are actively used in `arch/x86/chunkset_avx512.c`
- The runtime feature detection in `arch/x86/x86_features.c` correctly checks for BMI2 support
- The flag is consistently added to both AVX512 and AVX512VNNI configurations across all supported compilers
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify that both AVX512 and BMI2 instructions are detected
# Expected: The compiler should recognize both instruction sets

# Test 1: Check if compiler accepts the combined flags
echo '#include <immintrin.h>' > test.c
echo 'int main() { return 0; }' >> test.c
for CC in gcc clang icc; do
    if command -v $CC >/dev/null 2>&1; then
        echo "Testing $CC..."
        $CC -mavx512f -mavx512dq -mavx512bw -mavx512vl -mbmi2 -c test.c 2>&1 || echo "$CC: Failed to compile with AVX512+BMI2"
    fi
done

# Test 2: Search for other AVX512 configurations that might need BMI2
rg -l "mavx512" --type cmake
```

Length of output: 255
Script:
```shell
#!/bin/bash
# Check the complete context of AVX512 and BMI2 usage in detect-intrinsics.cmake
cat cmake/detect-intrinsics.cmake

# Search for any BMI2-specific checks or configurations
rg "bmi2" -i --type cmake

# Look for any assembly or intrinsics usage of BMI2 instructions
rg "bmi2|_pext_|_pdep_|_bzhi_|_mulx_" -i
```

Length of output: 22861
CMakeLists.txt (1)
984-985: LGTM! The AVX512 chunkset integration looks good.

The addition of `chunkset_avx512.c` and its feature info message follows the established pattern for AVX512 optimizations in the build system.

Let's verify the AVX512 flag propagation:
✅ Verification successful
AVX512 chunkset compilation flags are properly set
The verification confirms that `chunkset_avx512.c` is correctly added to `AVX512_SRCS` and all AVX512 sources are compiled with the appropriate flags:

- The file is added to the `AVX512_SRCS` list
- All `AVX512_SRCS` are compiled with `${AVX512FLAG} ${NOLTOFLAG}`
- The feature info message correctly references the `AVX512FLAG` variable

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify AVX512 flags are properly set for chunkset_avx512.c

# Test: Check if chunkset_avx512.c is compiled with AVX512 flags
rg -l "set_property.*SOURCE.*chunkset_avx512.c.*PROPERTY.*COMPILE_FLAGS.*AVX512FLAG"
```

Length of output: 86
Script:
```shell
#!/bin/bash
# Let's check the CMakeLists.txt more thoroughly for AVX512 flag settings
rg -A 5 "AVX512_SRCS.*chunkset_avx512" CMakeLists.txt

# Also check for any general AVX512 compilation flag settings
rg "set_property.*COMPILE_FLAGS.*AVX512" CMakeLists.txt

# And check for any other AVX512 source file compilations for reference
rg "set_property.*SOURCE.*avx512.*COMPILE_FLAGS" CMakeLists.txt
```

Length of output: 971
configure (2)
1592-1593: LGTM: Architecture-specific objects properly configured.

The addition of `chunkset_avx512.o` and `chunkset_avx512.lo` to the architecture-specific objects list is correct. This ensures the new AVX512-optimized implementation is included in both static and shared library builds.
109-109: Verify BMI2 instruction set availability.

The addition of `-mbmi2` to AVX512 flags suggests that the new implementation relies on BMI2 instructions for efficient bit manipulation. This is aligned with the PR's goal of optimizing masked writes. However, we should ensure proper runtime detection of BMI2 support.

✅ Verification successful
Based on the results, I can now generate the final response as we have sufficient information about BMI2 detection in the codebase.
BMI2 detection is properly implemented alongside AVX512 features
The codebase correctly handles BMI2 detection:
- Runtime detection is implemented in `arch/x86/x86_features.c` through CPU feature flags
- BMI2 is checked as a requirement alongside AVX512 features
- The compiler flags in both `configure` and `arch/x86/Makefile.in` consistently include `-mbmi2` with AVX512 flags

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Check if BMI2 detection is properly implemented in the codebase

# Test: Look for BMI2 feature detection in x86 feature detection code
rg -A 5 "BMI2|bmi2" arch/x86/
```

Length of output: 1280
arch/x86/chunkset_avx512.c (1)
185-189: Verify macro definitions for AVX512 compatibility

Ensure that the macros `CHUNKSIZE`, `CHUNKUNROLL`, `CHUNKMEMSET`, and related definitions are correctly set for AVX512 operations. This is crucial for maintaining consistency and performance optimizations specific to the AVX512 architecture.

Run the following command to check the macro definitions:
✅ Verification successful
AVX512 chunk operations are correctly defined and integrated
The macro definitions in `chunkset_avx512.c` are properly set up and integrated with the template implementation. The verification shows:
- The macros correctly map to AVX512-specific functions
- The template file (`chunkset_tpl.h`) provides the core implementation that these macros hook into
- The functions are properly declared in `x86_functions.h` and are enabled when AVX512 features are available
- The implementation follows the same pattern as other SIMD variants (SSE2, SSSE3, AVX2)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```shell
#!/bin/bash
# Description: Verify that AVX512-specific macros are correctly defined.
# Expected: Macros should be defined with AVX512-specific values.
grep -E '#define CHUNKSIZE|#define CHUNKUNROLL' chunkset_tpl.h
```

Length of output: 64
Script:
```shell
#!/bin/bash
# Let's search for these macro definitions in the entire codebase
rg -l "CHUNKSIZE|CHUNKUNROLL|CHUNKMEMSET" arch/x86/

# Then examine the content of relevant files
find arch/x86/ -type f -name "*.h" -o -name "*.c" | xargs grep -l "CHUNKSIZE\|CHUNKUNROLL\|CHUNKMEMSET" | while read -r file; do
    echo "=== $file ==="
    cat "$file" | grep -A 2 -B 2 "CHUNKSIZE\|CHUNKUNROLL\|CHUNKMEMSET"
done
```

Length of output: 2671
Script:
#!/bin/bash
# Let's examine the template file that's included and the actual implementations
echo "=== Content of chunkset_tpl.h ==="
cat arch/x86/chunkset_tpl.h
echo -e "\n=== AVX512 Implementation Details ==="
# Look for the actual implementation of these functions in AVX512
rg -A 5 "chunksize_avx512|chunkunroll_avx512|chunkmemset_avx512" arch/x86/chunkset_avx512.c

Length of output: 599
Script:
#!/bin/bash
# Let's check if the template file might be in a different location
fd -t f "chunkset_tpl.h"
# Also let's look for the actual implementations of these functions
echo "=== Function Implementations ==="
rg -B 3 "^static.*chunk(size|unroll|memset)_avx512" arch/x86/chunkset_avx512.c
# Check if there are any related header files that might contain definitions
echo -e "\n=== Related Headers ==="
fd -e h . | xargs grep -l "chunk.*_avx512"

Length of output: 328
Script:
#!/bin/bash
# Let's examine the template file and the header file containing AVX512 definitions
echo "=== Content of chunkset_tpl.h ==="
cat chunkset_tpl.h
echo -e "\n=== Content of x86_functions.h ==="
cat arch/x86/x86_functions.h | grep -A 5 -B 5 "avx512"
# Let's also check if these functions are declared as prototypes
echo -e "\n=== Function Declarations ==="
rg "chunk(size|unroll|memset)_avx512" --type c --type h

Length of output: 12030
05a920f to
375af7f
Compare
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
arch/x86/x86_features.c (1)
Line range hint 100-113: Consider adding performance guards and fallbacks.
Since this is part of a performance optimization, consider:
- Adding comments explaining why BMI2 is required for the AVX512 masked write optimizations
- Implementing a fallback path for processors with AVX512 but without BMI2
- Adding benchmark comparisons in the documentation to justify the feature requirements
chunkset_tpl.h (1)
Type change in CHUNKCOPY_SAFE requires attention
The change of the `len` parameter to `uint64_t` in `CHUNKCOPY_SAFE` is problematic, as the callers in `inffast_tpl.h` use unsigned variables (`op` and `len`) which are 32-bit. This implicit conversion from 32-bit to 64-bit parameters could lead to issues on some platforms.
- `inffast_tpl.h`: All calls to `CHUNKCOPY_SAFE` pass `unsigned` (32-bit) variables as the length parameter
- `chunkset_tpl.h`: Function declaration expects `uint64_t` (64-bit) length parameter
🔗 Analysis chain
Line range hint 269-289: LGTM: Enhanced CHUNKCOPY_SAFE with better type handling and masked write optimization.
Good improvements:
- Using uint64_t for length allows handling larger chunks
- Masked write optimization eliminates byte-by-byte copying in most cases
- Safe length clamping before the unsigned cast
However, verify that all callers can handle the type change.
Let me check the type compatibility of the length parameter in the caller contexts.
Let me check the type declarations of the variables used in the length parameters.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for calls to CHUNKCOPY_SAFE to verify type compatibility
rg -A 2 "CHUNKCOPY_SAFE\s*\(" --type c

Length of output: 1000
Script:
#!/bin/bash
# Get more context around CHUNKCOPY_SAFE calls to verify length parameter types
ast-grep --pattern 'CHUNKCOPY_SAFE($_, $_, $len, $_)'
# Also get the variable declarations and assignments for op and len
rg -B 5 "CHUNKCOPY_SAFE\s*\(" inffast_tpl.h

Length of output: 1121
Script:
#!/bin/bash
# Get variable declarations for len and op
ast-grep --pattern 'unsigned $_len'
ast-grep --pattern 'unsigned $_op'
# Also get the struct/type definitions that might contain these variables
rg -B 5 "struct" inffast_tpl.h

Length of output: 787
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
- CMakeLists.txt (1 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/avx2_tables.h (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/x86_features.c (2 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (2 hunks)
- chunkset_tpl.h (8 hunks)
- cmake/detect-intrinsics.cmake (2 hunks)
- configure (2 hunks)
- functable.c (1 hunks)
- inffast_tpl.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- CMakeLists.txt
- arch/x86/avx2_tables.h
- arch/x86/chunkset_avx2.c
- arch/x86/x86_features.h
- arch/x86/x86_functions.h
- cmake/detect-intrinsics.cmake
- configure
- functable.c
- inffast_tpl.h
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
🔇 Additional comments (16)
arch/x86/x86_features.c (2)
113-113: Verify the BMI2 requirement for AVX512_common features.
Making BMI2 a requirement for `has_avx512_common` is a significant change that could potentially disable AVX512 optimizations on some processors. While this might be necessary for the masked write optimizations:
- Could you confirm if there are any AVX512-capable processors that don't support BMI2?
- Is there a fallback path for processors with AVX512 but without BMI2?
Let's check processor support patterns:
#!/bin/bash
# Search for documentation or comments about processor support requirements
rg -B 5 -A 5 "AVX512.*BMI2|BMI2.*AVX512" --type-add 'docs:*.{txt,md,rst}'
100-101: BMI2 feature detection looks correct!
The implementation correctly checks bit 3 of the EBX register from CPUID leaf 7 for BMI2 support, following Intel's documentation.
Let's verify the BMI2 bit position in other x86 feature detection implementations:
✅ Verification successful
BMI2 feature detection implementation is correct
The implementation is verified to be correct based on the following evidence:
- The BMI2 check is properly placed within the CPUID leaf 7 block (`maxbasic >= 7`)
- The bit position `0x8` (bit 3) in the `ebx` register is correct according to Intel's specification for BMI2 detection
- The feature is correctly integrated into the `x86_cpu_features` structure and properly used as a requirement for AVX512 common features
- The implementation follows the same pattern as other CPU feature detections in the file
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for BMI2 detection patterns in other codebases for verification
rg -i "bmi2.*=.*ebx.*0x8|ebx.*0x8.*bmi2" --type cpp

Length of output: 75
Script:
#!/bin/bash
# Let's try a broader search for BMI2 detection patterns
rg -B 5 -A 5 "bmi2" --type cpp
# Also search for CPUID leaf 7 feature detection patterns
rg -B 5 -A 5 "cpuid.*7.*ebx" --type cpp
# Look for x86 feature detection header files
fd -e h "feature|cpu" arch/x86/

Length of output: 684
Script:
#!/bin/bash
# Let's examine the x86_features.c file directly to understand the context
cat arch/x86/x86_features.c
# Also look for any BMI2-related constants or defines
rg -B 2 -A 2 "0x8|BMI2" arch/x86/

Length of output: 7031
arch/x86/Makefile.in (3)
11-12: LGTM: Addition of BMI2 support enhances masked operations
The addition of `-mbmi2` to AVX512 flags is appropriate as BMI2's bit manipulation instructions (particularly BZHI) complement AVX512's masked operations, which aligns well with the PR's objective of optimizing masked writes.
34-34: LGTM: New build targets follow established patterns
The addition of chunkset_avx512 targets maintains consistency with existing build patterns and proper alphabetical ordering.
56-61: LGTM: Compilation rules are consistent with existing patterns
The compilation rules for chunkset_avx512 targets are properly structured with correct flags and PIC support.
Let's verify the source file existence and consistency:
✅ Verification successful
Compilation rules for chunkset_avx512 targets are correctly implemented
The verification confirms:
- Source file `chunkset_avx512.c` exists at the correct location
- Compilation rules follow the same pattern as other architecture-specific chunkset implementations:
- Uses architecture-specific flags (AVX512FLAG like others use AVX2FLAG, SSE2FLAG, etc.)
- Includes NOLTOFLAG consistently
- Adds -DPIC flag for shared library objects (.lo)
- Maintains consistent variable usage (CC, CFLAGS/SFLAGS, INCLUDES)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the chunkset_avx512.c source file exists
fd "chunkset_avx512.c" --type f
# Check if other chunkset implementations follow similar patterns
rg -A 2 "chunkset_.*\.c" --type make

Length of output: 2861
chunkset_tpl.h (4)
Line range hint 90-109: LGTM: Improved conditional compilation directive.
The updated condition `defined(HAVE_HALF_CHUNK) && !defined(HAVE_HALFCHUNKCOPY)` ensures this implementation is only used when half-chunk operations are supported but no architecture-specific implementation exists.
128-136: LGTM: Efficient handling of short forward copies using masked writes.
The optimization correctly handles the case where we're reading from ahead in the buffer by falling back to CHUNKCOPY when the length fits in a single chunk. The comment about compiler optimization of identical code segments is insightful.
167-168: LGTM: Proper length tracking after halfchunk store.
The explicit length adjustment and pointer advancement after storing a halfchunk maintains correct state for subsequent operations.
222-226: LGTM: Efficient remainder handling using masked writes.
The use of `storechunk_mask` with a generated mask for remaining bytes is an excellent optimization that:
- Eliminates bounds checking overhead
- Leverages efficient AVX512 masked write operations
- Maintains memory safety without performance penalty
arch/x86/chunkset_avx512.c (7)
36-56: Efficient Implementation of 'chunkmemset' Functions
The `chunkmemset_*` functions are efficiently implemented using appropriate intrinsic functions, optimizing performance for various data sizes.
58-68: Correct Use of Load and Store Operations
The `loadchunk`, `storechunk`, and `storechunk_mask` functions correctly utilize AVX512 intrinsics for loading and storing data, ensuring proper handling of chunked data.
70-93: Well-Implemented 'CHUNKCOPY' Function
The `CHUNKCOPY` function effectively handles copying data chunks, including proper handling of remaining bytes using masked operations. The logic ensures data is copied accurately and efficiently.
95-125: Efficient Implementation of 'GET_CHUNK_MAG' Function
The `GET_CHUNK_MAG` function is effectively implemented, utilizing AVX512 intrinsics and appropriate permutation logic to handle data retrieval based on distance parameters.
127-139: Proper Handling in Half-Chunk Functions
The functions `loadhalfchunk`, `storehalfchunk`, and `halfchunk2whole` correctly handle half-chunk data, ensuring proper data manipulation and conversion between half chunks and whole chunks.
141-152: Effective Implementation of 'GET_HALFCHUNK_MAG' Function
The `GET_HALFCHUNK_MAG` function correctly retrieves half-chunk magnitudes using appropriate AVX512 intrinsics and permutation logic, ensuring efficient data handling.
154-177: Well-Structured 'HALFCHUNKCOPY' Function
The `HALFCHUNKCOPY` function efficiently copies half-chunk data, properly managing remaining bytes with masked operations. The implementation is clear and maintains consistency with chunk copying practices.
51e7081 to
767c58a
Compare
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
chunkset_tpl.h (1)
Line range hint 269-289: Consider potential truncation in CHUNKMEMSET call.
While the change to uint64_t for len parameter improves the supported range, the cast to unsigned when calling CHUNKMEMSET could lead to truncation for values larger than UINT_MAX.
Consider one of these solutions:
- return CHUNKMEMSET(out, from, (unsigned)len);
+ // Solution 1: Split large copies into chunks
+ while (len > UINT_MAX) {
+ out = CHUNKMEMSET(out, from, UINT_MAX);
+ from += UINT_MAX;
+ len -= UINT_MAX;
+ }
+ return CHUNKMEMSET(out, from, (unsigned)len);
+ // Solution 2: Assert maximum length
+ Assert(len <= UINT_MAX, "Length exceeds maximum supported value");
+ return CHUNKMEMSET(out, from, (unsigned)len);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
- CMakeLists.txt (1 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/avx2_tables.h (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/x86_features.c (2 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (2 hunks)
- chunkset_tpl.h (8 hunks)
- cmake/detect-intrinsics.cmake (2 hunks)
- configure (2 hunks)
- functable.c (1 hunks)
- inffast_tpl.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- CMakeLists.txt
- arch/x86/Makefile.in
- arch/x86/avx2_tables.h
- arch/x86/chunkset_avx2.c
- arch/x86/x86_features.c
- arch/x86/x86_features.h
- arch/x86/x86_functions.h
- cmake/detect-intrinsics.cmake
- configure
- functable.c
- inffast_tpl.h
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
🔇 Additional comments (22)
chunkset_tpl.h (3)
Line range hint 90-108: LGTM: Clean implementation of HALFCHUNKCOPY.
The conditional compilation directive correctly prevents duplicate implementations while maintaining the half-chunk copy functionality.
222-226: LGTM: Efficient remainder handling with masked writes.
The implementation efficiently handles remainder bytes using masked writes instead of memcpy, which aligns with the optimization goals while maintaining safety.
128-136: LGTM: Efficient handling of ahead-reads with masked writes.
The implementation correctly optimizes small-length copies while maintaining safety for ahead-reads. This aligns with the PR's objective of utilizing masked moves efficiently.
Let's verify the masked write implementation across the codebase:
✅ Verification successful
Masked write implementation is correctly integrated and consistent across the codebase
The implementation in chunkset_tpl.h is properly aligned with the AVX512-specific implementation in arch/x86/chunkset_avx512.c. Key findings:
- The `HAVE_MASKED_READWRITE` feature is properly defined in the AVX512 implementation
- The `storechunk_mask` function is correctly implemented using `_mm256_mask_storeu_epi8`
- The optimization for small lengths (len <= sizeof(chunk_t)) is safe as it utilizes the masked write capabilities
- The implementation is consistently used across both `inffast_tpl.h` and `chunkset_tpl.h`
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify masked write usage and implementation
# Look for other masked write implementations and potential inconsistencies
# Check for masked write related definitions
rg -l "HAVE_MASKED_READWRITE|masked.*write|storechunk_mask"
# Look for potential masked write usage patterns
ast-grep --pattern 'storechunk_mask($_, $_, $_)'
Length of output: 168
Script:
#!/bin/bash
# Let's examine the implementation details across these files
# Check masked write implementation in inffast_tpl.h
rg -A 5 "HAVE_MASKED_READWRITE" inffast_tpl.h
# Check masked write implementation in chunkset_tpl.h
rg -A 5 "HAVE_MASKED_READWRITE" chunkset_tpl.h
# Check AVX512 specific implementation
cat arch/x86/chunkset_avx512.c
# Look for any configuration or build settings related to masked writes
rg "HAVE_MASKED_READWRITE" -A 2 "CMakeLists.txt|*.cmake|configure*"
Length of output: 7772
arch/x86/chunkset_avx512.c (19)
1-3: LGTM!
The file header comment provides a clear description of the purpose and usage conditions of the code.
4-11: LGTM!
The necessary header files are included, and the code is properly guarded with the X86_AVX512 macro to ensure it is only compiled when AVX512 support is available.
12-15: LGTM!
The type definitions for chunk_t, halfchunk_t, mask_t, and halfmask_t are clear and appropriate for AVX512 operations.
17-26: LGTM!
The preprocessor macros are used to indicate the availability of specific functions and features, enhancing code readability and maintainability.
36-40: LGTM!
The chunkmemset_2 function correctly sets a 16-bit value in the chunk using AVX512 instructions.
42-46: LGTM!
The chunkmemset_4 function correctly sets a 32-bit value in the chunk using AVX512 instructions.
48-52: LGTM!
The chunkmemset_8 function correctly sets a 64-bit value in the chunk using AVX512 instructions.
54-56: LGTM!
The chunkmemset_16 function correctly sets a 128-bit value in the chunk using AVX512 instructions.
58-60: LGTM!
The loadchunk function correctly loads a chunk from memory using AVX512 instructions.
62-64: LGTM!
The storechunk function correctly stores a chunk to memory using AVX512 instructions.
66-68: LGTM!
The storechunk_mask function correctly stores a chunk to memory using AVX512 masked store instructions.
70-93: LGTM!
The CHUNKCOPY function correctly copies data from the source to the destination using AVX512 instructions. It handles any remaining bytes that do not fit into a full chunk using masked operations.
126-128: LGTM!
The loadhalfchunk function correctly loads a half-chunk from memory using AVX512 instructions.
130-132: LGTM!
The storehalfchunk function correctly stores a half-chunk to memory using AVX512 instructions.
134-138: LGTM!
The halfchunk2whole function correctly zero-extends a half-chunk to a whole chunk using AVX512 instructions.
153-176: LGTM!
The HALFCHUNKCOPY function correctly copies data from the source to the destination using AVX512 instructions for half-chunks. It handles any remaining bytes that do not fit into a full half-chunk using masked operations.
178-181: LGTM!
The preprocessor macros are used to define chunk-related constants and functions specific to AVX512.
183-184: LGTM!
The chunkset_tpl.h file is included to provide template implementations for chunk-related functions.
185-187: LGTM!
The INFLATE_FAST macro is defined to use the AVX512-optimized inflate_fast_avx512 function, and the inffast_tpl.h file is included to provide the template implementation for the inflate functionality.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
chunkset_tpl.h (1)
Line range hint 269-289: Consider explicit type conversion for len parameter.
The change from unsigned to uint64_t for len improves large buffer support. However, there's an implicit conversion when calling CHUNKMEMSET:
return CHUNKMEMSET(out, from, (unsigned)len);
While this works because we've already bounded len by safelen, consider adding a comment explaining why the conversion is safe here.
arch/x86/chunkset_avx512.c (3)
28-30: Consider adding function documentation for clarity
To enhance readability and maintainability, consider adding a brief comment explaining the purpose and usage of the gen_half_mask function. This will help other developers understand its role in generating half-width masks.
32-34: Consider adding function documentation for clarity
Similarly, adding a comment to describe the gen_mask function would provide insight into its functionality in generating full-width masks, aiding future code maintenance.
153-176: Add documentation to HALFCHUNKCOPY for better understanding
Including a comment that describes the HALFCHUNKCOPY function will improve code clarity, helping others understand its purpose and how it complements CHUNKCOPY in handling half-size chunks.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
- CMakeLists.txt (1 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/avx2_tables.h (1 hunks)
- arch/x86/chunkset_avx2.c (1 hunks)
- arch/x86/chunkset_avx512.c (1 hunks)
- arch/x86/x86_features.c (2 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (2 hunks)
- chunkset_tpl.h (8 hunks)
- cmake/detect-intrinsics.cmake (2 hunks)
- configure (2 hunks)
- functable.c (1 hunks)
- inffast_tpl.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- CMakeLists.txt
- arch/x86/Makefile.in
- arch/x86/avx2_tables.h
- arch/x86/chunkset_avx2.c
- arch/x86/x86_features.c
- arch/x86/x86_features.h
- arch/x86/x86_functions.h
- cmake/detect-intrinsics.cmake
- configure
- functable.c
- inffast_tpl.h
🧰 Additional context used
📓 Learnings (2)
arch/x86/chunkset_avx512.c (4)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:0-0
Timestamp: 2024-10-29T02:18:10.472Z
Learning: In `arch/x86/chunkset_avx512.c`, within `GET_CHUNK_MAG`, the `dist` parameter is always greater than or equal to 3 because distances less than 3 are handled differently, using direct broadcasts (`chunkmemset_1`, `chunkmemset_2`) that don't require shuffling.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:140-151
Timestamp: 2024-10-29T02:16:15.776Z
Learning: In `GET_HALFCHUNK_MAG`, `dist` is always ≥ 3 because distances less than 3 are handled elsewhere and the function is not called for those cases.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
🔇 Additional comments (6)
chunkset_tpl.h (4)
128-136: Excellent optimization for overlapping copies!
The new masked write optimization for forward overlapping copies is well-designed:
- Avoids memmove when the copy fits in one chunk
- Clear comments explain the rationale
- Properly guards the optimization behind HAVE_MASKED_READWRITE
222-226: Good use of masked writes for remainder bytes!
The change from memcpy to masked writes for remainder bytes:
- Reduces branching in the critical path
- Maintains correctness through proper masking
- Falls back to memcpy when masked writes aren't available
Line range hint 90-108: LGTM! Verify halfchunk size assumptions.
The HALFCHUNKCOPY implementation looks correct and follows the established pattern. The function properly handles alignment and remainder bytes.
Let's verify the halfchunk size assumptions:
✅ Verification successful
LGTM! Halfchunk size assumptions are correct
The verification confirms that:
- In AVX2/AVX512: `chunk_t` is `__m256i` (256-bit) and `halfchunk_t` is `__m128i` (128-bit)
- The HALFCHUNKCOPY implementation correctly uses sizeof(halfchunk_t) for alignment and increments
- The code is properly guarded by HAVE_HALF_CHUNK macro
- The implementation maintains the same memory safety guarantees as CHUNKCOPY
The halfchunk size is exactly half of the chunk size in the architectures where it's used, making this implementation both correct and efficient.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify halfchunk_t size is consistent with chunk_t
# Expected: halfchunk_t should be half the size of chunk_t
# Search for chunk_t and halfchunk_t type definitions
rg -A 2 'typedef.*chunk_t|typedef.*halfchunk_t'
Length of output: 1226
Script:
#!/bin/bash
# Let's verify the size relationships and definitions more comprehensively
# Check for HAVE_HALF_CHUNK definitions and related macros
rg "HAVE_HALF_CHUNK|CHUNK_SIZE|sizeof\(halfchunk_t\)|sizeof\(chunk_t\)" --type c
# Also check the specific header for any size definitions
rg "chunk_t|halfchunk_t|CHUNK_SIZE|HALF_CHUNK" chunkset_tpl.h
Length of output: 5200
Line range hint 251-261: Verify alignment handling with masked writes.
While removing the byte-by-byte fallback with masked writes is a good optimization, we should verify that masked writes handle unaligned accesses correctly on all supported platforms.
Let's check alignment handling in the codebase:
✅ Verification successful
Based on the search results, I can now provide a definitive response about the alignment handling with masked writes:
Masked writes are safely implemented with proper alignment handling across platforms.
The codebase shows robust alignment handling:
- The architecture-specific implementations (e.g., AVX512) handle unaligned access safely through appropriate intrinsics (`_mm256_mask_storeu_epi8`)
- The codebase has comprehensive platform-specific unaligned access detection:
  - `UNALIGNED_OK` and `UNALIGNED64_OK` macros for platform capabilities
  - Architecture-specific checks for ARM, x86, PowerPC, etc.
- The masked write implementation in `chunkset_tpl.h` is protected by `HAVE_MASKED_READWRITE`, which is only defined for architectures that support safe unaligned masked stores
The byte-by-byte fallback removal is safe as the masked write implementations properly handle unaligned accesses through their respective architecture's intrinsics.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify masked write alignment handling
# Look for alignment checks and masked write implementations
# Search for alignment-related code in masked write implementations
rg -B 2 -A 2 'HAVE_MASKED_READWRITE.*align|storechunk_mask.*align'
# Search for alignment requirements in architecture-specific headers
fd -e h -e c -x rg -l 'UNALIGNED.*OK|alignment|align.*mask' {} \;
Length of output: 1226
Script:
#!/bin/bash
# Let's search for masked write implementations and alignment handling
# Search for masked write implementations and surrounding context
rg -B 5 -A 5 "HAVE_MASKED_READWRITE|storechunk_mask"
# Search for alignment-related macros and checks
rg -B 2 -A 2 "ALIGNED|unaligned|alignment|align.*mask"
# Look for architecture-specific masked write implementations
fd "chunkset.*\.(h|c)" -x rg -l "mask.*store|store.*mask" {} \;
# Search for any alignment requirements or documentation
rg -B 2 -A 2 "\/\*.*align|@.*align"
Length of output: 73256
arch/x86/chunkset_avx512.c (2)
70-93: Verify the usage of CHUNKCOPY throughout the codebase
The CHUNKCOPY function correctly handles chunk copying with proper assertions. Please ensure that all calls to CHUNKCOPY pass a non-zero len parameter, as the function asserts len > 0.
Run the following script to identify all usages of CHUNKCOPY and check the len parameter:
✅ Verification successful
CHUNKCOPY usage is safe with non-zero length parameter
The verification shows that len is safely used in CHUNKCOPY:
- `len` is initialized from `here->val` and then incremented by `BITS(op)`
- The code has a minimum output buffer guarantee of 258 bytes
- There's an explicit distance check `if (dist >= len || dist >= state->chunksize)` before calling CHUNKCOPY
- The implementation across different architectures consistently includes the assertion
Assert(len > 0, "chunkcopy should never have a length 0")
The codebase shows no instances where len could be zero when passed to CHUNKCOPY, and the control flow ensures proper length handling.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all calls to CHUNKCOPY have a len parameter greater than zero.
# Test: Search for CHUNKCOPY usage and extract len arguments. Expect: All len arguments are > 0.
ast-grep --pattern 'CHUNKCOPY($_, $_, $_len)' --json | jq '.nodes[].match.captures[2].text' | while read len_var; do
echo "Checking $len_var"
rg --files-with-matches --fixed-strings "$len_var = 0"
done
Length of output: 173
Script:
#!/bin/bash
# Let's first find all files that contain CHUNKCOPY
rg -l "CHUNKCOPY"
# Then look at the actual usage with context
rg -A 5 "CHUNKCOPY\("
Length of output: 3704
Script:
#!/bin/bash
# Let's examine the actual call site in inffast_tpl.h to verify the len parameter
rg -B 10 -A 2 "out = CHUNKCOPY\(out, out - dist, len\)" inffast_tpl.h
# Also check any assignments to len before the CHUNKCOPY call
rg "len = 0" inffast_tpl.h
rg "len.*=.*0" inffast_tpl.h
Length of output: 891
Script:
#!/bin/bash
# Let's check how len is computed before the CHUNKCOPY call
rg -B 20 -A 5 "len = " inffast_tpl.h
# Also check the minimum value that len can have
rg -B 5 "len [<>=]" inffast_tpl.h
Length of output: 2880
Script:
#!/bin/bash
# Let's check the minimum value that len can have in the actual implementation
rg -B 10 "len \+= " inffast_tpl.h
# And check where len is used in the distance checks before CHUNKCOPY
rg -B 5 "if \(dist >= len" inffast_tpl.h
Length of output: 1068
95-124: Ensure dist values keep lut_rem.idx within bounds
While dist is always greater than or equal to 3 in this context, verify that dist - 3 does not exceed the bounds of perm_idx_lut and permute_table. This ensures safe access without risking out-of-bounds reads.
Execute the following script to check the bounds of dist:
✅ Verification successful
dist values are safely bounded within perm_idx_lut array limits
The verification shows that:
- `perm_idx_lut` array has 29 elements (indices 0-28)
- `dist - 3` is used as the index, meaning `dist` can be 3 to 31 (inclusive)
- The array is explicitly designed for this range as shown by the comments in `avx2_tables.h`
- The implementation handles two cases safely:
  - `dist < 16` with appropriate masking
  - `dist >= 16` with proper bounds checking
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Confirm that dist values keep lut_rem.idx within valid range.
# Check the size of perm_idx_lut and permute_table
rg 'static const lut_rem_pair perm_idx_lut\[\]' -A 1
rg 'static const uint8_t permute_table\[\]' -A 1
# List all dist values used in GET_CHUNK_MAG
ast-grep --pattern 'GET_CHUNK_MAG($_buf, $_chunk_rem, $_dist)' --json | jq '.nodes[].match.captures[2].text' | sort | uniq | while read dist_var; do
echo "Dist variable: $dist_var"
# Assuming dist_var assignment exists, verify its maximum value
rg "$dist_var\s*=\s*\d+" -r '$dist_var = $value' | awk '{print $3}' | sort -nr | head -n1
done
Length of output: 311
Script:
#!/bin/bash
# Let's try a different approach to find the array sizes and dist usage
# Check the size of perm_idx_lut array
rg -B 2 -A 30 'static const lut_rem_pair perm_idx_lut\[\]' arch/x86/chunkset_avx512.c
# Check the size of permute_table array
rg -B 2 -A 30 'static const uint8_t permute_table\[\]' arch/x86/chunkset_avx512.c
# Find where GET_CHUNK_MAG is called and check dist values
rg "GET_CHUNK_MAG\(" arch/x86/chunkset_avx512.c -B 5 -A 1
Length of output: 396
Script:
#!/bin/bash
# Let's try to find the complete file content since it's crucial for this verification
cat arch/x86/chunkset_avx512.c
# Also search for any MAX_DIST or similar constants that might limit the dist value
rg "define.*MAX.*DIST" arch/x86/chunkset_avx512.c
rg "define.*MAX.*DIST" --type c
Length of output: 6401
Script:
#!/bin/bash
# Let's check the avx2_tables.h for the array definitions and sizes
cat arch/x86/avx2_tables.h
# Also check for any calls to this function to understand the dist parameter's bounds
rg -B 10 "GET_CHUNK_MAG\(" --type c
Length of output: 4862
This takes advantage of the fact that on AVX512 architectures, masked moves are incredibly cheap. There are many places where we have to fall back to the safe C implementation of chunkcopy_safe because of the assumed overwriting that occurs. We're able to sidestep most of the branching needed here by simply controlling the bounds of our writes with a mask.
767c58a to ce5fb26
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
chunkset_tpl.h (1)
128-136: LGTM: Efficient use of masked writes for overlapping copies
The new optimization for negative distances leverages AVX512 masked writes when the copy length fits within a single chunk. This is an excellent use of AVX512's efficient masked move operations, avoiding the need for memmove in these cases.
Consider adding a comment documenting the performance characteristics of this optimization compared to memmove, as this would help maintainers understand the tradeoffs in future modifications.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
CMakeLists.txt(1 hunks)arch/x86/Makefile.in(3 hunks)arch/x86/avx2_tables.h(1 hunks)arch/x86/chunkset_avx2.c(1 hunks)arch/x86/chunkset_avx512.c(1 hunks)arch/x86/x86_features.c(2 hunks)arch/x86/x86_features.h(1 hunks)arch/x86/x86_functions.h(2 hunks)chunkset_tpl.h(8 hunks)cmake/detect-intrinsics.cmake(2 hunks)configure(2 hunks)functable.c(1 hunks)inffast_tpl.h(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- arch/x86/Makefile.in
- arch/x86/avx2_tables.h
- arch/x86/chunkset_avx2.c
- arch/x86/chunkset_avx512.c
- arch/x86/x86_features.c
- arch/x86/x86_features.h
- arch/x86/x86_functions.h
- cmake/detect-intrinsics.cmake
- configure
- functable.c
- inffast_tpl.h
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (5)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-07T21:56:31.870Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🔇 Additional comments (4)
chunkset_tpl.h (3)
90-91: LGTM: Improved conditional compilation logic
The additional check !defined(HAVE_HALFCHUNKCOPY) ensures this implementation is only used when an architecture-specific version is not available, allowing for optimized AVX512 implementations.
Line range hint 269-289: LGTM: Improved bounds handling with masked writes
The changes improve the implementation by:
- Using uint64_t for length to handle larger chunks
- Leveraging masked writes to avoid byte-by-byte copying
- Maintaining proper bounds checking while reducing branches
The implementation correctly maintains the invariant that 'from' cannot be ahead of 'safe', as noted in the retrieved learnings.
#!/bin/bash
# Check for potential overflow issues with uint64_t length
echo "Searching for length validation in callers..."
rg "CHUNKCOPY_SAFE.*len" -B 5 -A 5
# Look for test cases with large lengths
echo "Searching for test cases with large lengths..."
rg "test.*CHUNKCOPY_SAFE.*\d{10,}" -A 5

222-226: Verify gen_mask implementation for remainder handling
The use of masked writes for remainder bytes is a good optimization. However, we should verify the correctness of the gen_mask implementation.
✅ Verification successful
Let's search for mask-related implementations since the previous search didn't yield results. This will help us understand how the masking is implemented in the codebase.
Based on the search results, I can now provide a final response since we have found the implementation of both gen_mask and storechunk_mask functions in the AVX512 architecture-specific code.
Implementation of gen_mask is correct and properly used
The masked write implementation is correctly implemented:
- `gen_mask` uses Intel's `_bzhi_u32` intrinsic to generate a mask with the specified number of bits set
- `storechunk_mask` uses AVX512's masked store instruction `_mm256_mask_storeu_epi8` to safely write only the requested number of bytes
- The implementation is architecture-specific and only enabled when `HAVE_MASKED_READWRITE` is defined
The code is safe because:
- The mask generation ensures only the requested number of bytes are written
- The AVX512 masked store instruction prevents writing beyond the requested length
- The implementation is properly guarded by feature macros
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for gen_mask implementation and its usage
echo "Searching for gen_mask implementation..."
rg "gen_mask" -A 5
# Look for any test cases involving gen_mask
echo "Searching for gen_mask tests..."
rg "test.*gen_mask|gen_mask.*test" -A 5
Length of output: 2105
Script:
#!/bin/bash
# Search for gen_mask implementation with different variations
echo "Searching for gen_mask and mask generation..."
rg -i "gen.*mask|generate.*mask|mask.*gen" -A 5
# Look for mask-related functions in header files
echo "Searching mask-related functions in headers..."
fd -e h -e hpp -x rg "mask" -A 5 {}
# Search for masked write implementation
echo "Looking for masked write implementation..."
rg "storechunk_mask|masked.*write|write.*mask" -A 5
Length of output: 58025
CMakeLists.txt (1)
984-987: LGTM! The AVX512 chunkset integration looks good.
The changes properly integrate the new AVX512 chunkset optimization by:
- Adding the source file to AVX512_SRCS
- Including appropriate feature documentation
- Following the established pattern for AVX512 optimizations
Let's verify the AVX512 chunkset integration:
✅ Verification successful
AVX512 chunkset integration is properly implemented and verified
The verification confirms:
- The source file exists and contains proper AVX512-specific implementations
- Correct compilation flags (AVX512FLAG, NOLTOFLAG) are set in Makefile.in
- The implementation follows the template pattern using chunkset_tpl.h
- Integration with the existing AVX512 optimizations is maintained
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the AVX512 chunkset source file exists and check its dependencies
# Test 1: Verify the source file exists
fd --type f "chunkset_avx512.c" arch/x86
# Test 2: Check for potential dependencies in the implementation
rg -l "chunkset.*avx512" arch/x86
Length of output: 151
Script:
#!/bin/bash
# Let's verify the implementation details and dependencies
# Check the content of chunkset_avx512.c for proper AVX512 intrinsics usage
rg -A 5 "avx512" arch/x86/chunkset_avx512.c
# Check Makefile.in for proper compilation flags and integration
rg "chunkset.*avx512" arch/x86/Makefile.in
# Check for any header dependencies
rg -l "include.*avx512" arch/x86/
Length of output: 933
Codecov Report

Attention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1805 +/- ##
===========================================
- Coverage 32.62% 32.18% -0.45%
===========================================
Files 66 67 +1
Lines 5652 5752 +100
Branches 1228 1237 +9
===========================================
+ Hits 1844 1851 +7
- Misses 3555 3644 +89
- Partials 253 257 +4

☔ View full report in Codecov by Sentry.
I cannot test or benchmark this as I don't have access to any AVX512 hardware.
I could test using deflatebench, but I'm not sure if that is the intended use case for these improvements or if it is more PNG related.
Well, it should hopefully not be a lot slower than before with deflatebench, so that is valuable data in any case.
In my testing it improved everything, generally. Compiling with -O3 helped png decoding but hurt gzip decoding, but with -O2 it still saw an improvement.
Test data: silesia.tar (DEVELOP 94aacd8 vs. PR ce5fb26)
I re-ran the test and rebuilt both binaries. Perhaps I did something wrong in the first test, or my machine performed that much better at some point during the first test.
Here is another run. I don't know if my machine is sensitive enough.
The most important part is no new bugs encountered, and no big performance regression.
On my Cascade Lake:
I'm somewhat curious to try a 64 byte chunk implementation, at least on Zen5 and Sapphire Rapids. I wonder if it's finally worth the squeeze. Somewhat unfortunate is the fact that aarch64 does not offer a way to do this but maybe RISC-V might? Possibly SVE could also be used depending on configurable vector lengths. I guess it just depends on the setup time for that.
This PR caused a decompression bug in our main project.
Can you give us a quick reproduction of this?
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores