Make an AVX512 inflate fast with low cost masked writes #1805

KungFuJesus · 2024-10-28T22:34:53Z

This takes advantage of the fact that on AVX512 architectures, masked moves are incredibly cheap. There are many places where we have to fallback to the safe C implementation of chunkcopy_safe because of the assumed overwriting that occurs. We're to sidestep most of the branching needed here by simply controlling the bounds of our writes with a mask.

Summary by CodeRabbit

New Features
- Enhanced support for AVX512 architecture with new optimized functions for memory operations.
- Added BMI2 feature detection to improve performance on compatible CPUs.
Bug Fixes
- Improved handling of distance calculations in decompression routines to prevent errors.
Documentation
- Updated comments for clarity in memory handling functions.
Chores
- Updated configuration scripts to include new object files and flags for AVX512 support.
- Modified build scripts to enhance compiler flags for AVX512 optimizations.

coderabbitai · 2024-10-28T22:35:01Z

Walkthrough

The pull request introduces modifications to the zlib library's build configuration and implementation for enhanced support of AVX512 and AVX2 architectures. Key changes include the addition of a new source file for AVX512 optimizations, updates to CMake and Makefile configurations to include new compilation flags and object targets, and enhancements to CPU feature detection logic. New inline functions for memory operations using AVX512 are defined, while existing functions are adjusted for improved performance and clarity. The overall structure of the build process remains intact, focusing on expanding architecture-specific capabilities.

Changes

File	Change Summary
`CMakeLists.txt`	Added `chunkset_avx512.c` to AVX512 source files; defined feature info for AVX512 chunkset support.
`arch/x86/Makefile.in`	Updated `AVX512FLAG` and `AVX512VNNIFLAG` to include `-mbmi2`; added targets for `chunkset_avx512.o` and `chunkset_avx512.lo`.
`arch/x86/avx2_tables.h`	Introduced `perm_idx_lut` and `half_rem_vals` static arrays for AVX2 operations.
`arch/x86/chunkset_avx2.c`	Included `avx2_tables.h`; removed previous static arrays; no changes to inline functions.
`arch/x86/chunkset_avx512.c`	Added inline functions for AVX512 chunk memory operations, including `chunkmemset` and `CHUNKCOPY`.
`arch/x86/x86_features.c`	Added feature check for BMI2 support; updated AVX512 feature checks to include BMI2.
`arch/x86/x86_features.h`	Added `int has_bmi2;` to `struct x86_cpu_features`.
`arch/x86/x86_functions.h`	Added declarations for AVX512-specific functions: `chunksize_avx512`, `chunkmemset_safe_avx512`, `inflate_fast_avx512`.
`chunkset_tpl.h`	Updated `CHUNKCOPY_SAFE` to accept `uint64_t` for `len`; refined conditions in `CHUNKMEMSET`.
`cmake/detect-intrinsics.cmake`	Enhanced flags for detecting compiler intrinsics, including `-mbmi2`.
`configure`	Updated `avx512flag` to include `-mbmi2`; added `chunkset_avx512.o` and `chunkset_avx512.lo` to object lists.
`functable.c`	Assigned `inflate_fast_avx512` to `ft.inflate_fast` for AVX512 architecture.
`inffast_tpl.h`	Refined logic in `INFLATE_FAST` for output copying and distance handling.

Possibly related PRs

Improve chunkset_avx2 performance #1778: Enhances performance of chunk memory operations using AVX2, related to AVX512 optimizations in this PR.
Simplify inflate fast by dispatching to chunkmemset for all chunkcopy cases #1802: Simplifies handling of chunk copying and memory setting operations, aligning with improvements in AVX512 chunkset operations.
Reorder variables in inflate functions to reduce padding holes #1803: Focuses on reordering variables for padding reduction, relevant to overall memory management improvements in this PR.

Suggested labels

optimization, enhancement

Suggested reviewers

nmoinvaz

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (10)

arch/x86/x86_features.h (2)
17-17: LGTM! Consider adding a brief comment.

The addition of has_bmi2 is well-placed and aligns with the PR's optimization goals. Consider adding a brief comment explaining BMI2's relevance to the AVX512 optimizations.
+    /* BMI2 support required for efficient bit manipulation operations */
     int has_bmi2;
Line range hint 9-25: Consider documenting CPU feature flags.

To improve maintainability, consider adding documentation for all CPU feature flags in the structure. This would help developers understand which features are required for specific optimizations and their interdependencies.

Example documentation style:
 struct x86_cpu_features {
+    /* Advanced Vector Extensions 2 support */
     int has_avx2;
+    /* AVX-512 Foundation instructions */
     int has_avx512f;
+    /* AVX-512 Doubleword and Quadword instructions */
     int has_avx512dq;
     // ... (document remaining fields)
 };
arch/x86/avx2_tables.h (3)

6-8: Documentation needs enhancement

The current comment is insufficient to understand the table's purpose and structure. Consider adding:

Purpose of the lut_rem_pair structure

Explanation of the offset calculation (32 and 16 multipliers)

Meaning of the right-side comments

40-42: Add documentation for half_rem_vals table

The table lacks essential documentation explaining:

Its purpose and relationship to the permutation table

The significance of these specific values

How these values are used in the AVX2 operations

1-44: Consider reorganizing shared lookup tables

Since these tables might be used by both AVX2 and AVX512 implementations (based on the PR description), consider:

Moving shared tables to a common location

Adding documentation about which SIMD implementations use these tables

Explaining any differences in how AVX2 and AVX512 utilize these values

arch/x86/chunkset_avx2.c (2)

Line range hint 52-89: Consider documenting performance implications of table reorganization

The GET_CHUNK_MAG and GET_HALFCHUNK_MAG functions rely heavily on lookup tables for efficient permutation and remainder calculations. While moving tables to a separate header is good for organization, it's important to ensure:

The compiler can still efficiently access the tables (ideally keeping them in L1 cache)

The performance characteristics remain unchanged

Consider adding comments documenting:

Expected performance characteristics

Cache alignment requirements for the tables

Any specific compiler optimizations needed

Also applies to: 108-120

Line range hint 1-150: Clarify relationship with AVX512 optimizations

While this file contains AVX2 implementations, the table reorganization appears to be preparatory work for the AVX512 optimizations mentioned in the PR objectives. To ensure smooth integration:

Document how the shared tables support both AVX2 and AVX512 implementations

Consider adding comments explaining the relationship between AVX2 and AVX512 code paths

This will help maintainers understand the design decisions and make future optimizations easier.

arch/x86/x86_functions.h (1)

152-157: LGTM! Comprehensive AVX512 feature detection with proper native function mappings.

The implementation correctly checks for all required AVX512 extensions (F,DQ,BW,VL) before enabling the optimizations. This ensures the code only runs on fully compatible hardware.

Consider documenting the minimum CPU generation required for these extensions (e.g., Intel Skylake-X or newer) in a comment to help with deployment planning.

functable.c (1)

130-134: Consider documenting AVX512 masked move optimization strategy.

Given that this implementation leverages AVX512 masked moves to optimize the inflate operation and reduce chunkcopy_safe fallbacks, consider:

Adding comments explaining the optimization strategy

Documenting performance characteristics and trade-offs

Adding benchmarks to verify the improvements

Would you like me to help draft the documentation or benchmark suite?
arch/x86/chunkset_avx512.c (1)
70-93: Optimize CHUNKCOPY by avoiding unnecessary masked store when rem is zero

In the CHUNKCOPY function, if len is a multiple of sizeof(chunk_t), rem will be zero, and the masked store on line 79 is unnecessary. Consider adding a condition to skip the masked store when rem is zero to improve performance.

Apply this diff to optimize the function:
 unsigned rem = len % sizeof(chunk_t);
 mask_t rem_mask = gen_mask(rem);

 chunk_t chunk;
-loadchunk(from, &chunk);
-_mm256_mask_storeu_epi8(out, rem_mask, chunk);
-out += rem;
-from += rem;
-len -= rem;
+if (rem > 0) {
+    loadchunk(from, &chunk);
+    _mm256_mask_storeu_epi8(out, rem_mask, chunk);
+    out += rem;
+    from += rem;
+    len -= rem;
+}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 94aacd8 and 05a920f.

📒 Files selected for processing (13)

CMakeLists.txt (1 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/avx2_tables.h (1 hunks)
arch/x86/chunkset_avx2.c (1 hunks)
arch/x86/chunkset_avx512.c (1 hunks)
arch/x86/x86_features.c (2 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
chunkset_tpl.h (8 hunks)
cmake/detect-intrinsics.cmake (2 hunks)
configure (2 hunks)
functable.c (1 hunks)
inffast_tpl.h (1 hunks)

🧰 Additional context used

📓 Learnings (1)

chunkset_tpl.h (5)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-07T21:56:31.870Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.

🔇 Additional comments (19)

arch/x86/avx2_tables.h (2)
1-5: LGTM: Header structure is well-organized

The header guards and include directive are properly structured.

8-38: Verify table pattern correctness

The table shows systematic patterns in offset calculations and remainder values. Let's verify the implementation:

✅ Verification successful

Let me analyze the table pattern and verification results:

The table perm_idx_lut shows a systematic pattern:

First section (indices 3-15): Uses multiples of 32 (0-10 * 32)

Second section (indices 17-31): Uses base offset of 11*32 plus multiples of 16

The remainder values follow a logical decreasing pattern

The search results show:

This is the only instance of perm_idx_lut in the codebase

The constants 32 and 16 are used extensively in AVX2/AVX512 SIMD operations

This table appears to be part of AVX2-specific optimizations

Let me verify the usage context:

Let me check the broader context and usage of this table:

Let me check the content of the chunk permute table and related files:

Table pattern is correct and consistent with the implementation

The table pattern in perm_idx_lut is verified to be correct based on the following evidence:

The table is used in conjunction with permute_table from chunk_permute_table.h for AVX2-specific optimizations

The pattern follows the documented structure in chunk_permute_table.h:

First section (indices 3-15) uses 32-byte aligned offsets for AVX2 vector operations

Second section (indices 17-31) uses 16-byte offsets after the base offset of 11*32 to handle cross-lane permutations

The remainder values are correctly decreasing to handle the edge cases as documented

The implementation is consistent with the SSSE3 version in chunkset_ssse3.c which uses a similar but smaller table structure

The table's pattern is mathematically sound and properly aligned for AVX2 SIMD operations.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check if this table is referenced correctly and if similar patterns exist elsewhere

# Look for other permutation tables and their usage
rg -A 2 "static const lut_rem_pair" --type cpp

# Find references to this table
rg "perm_idx_lut" --type cpp

# Look for related constants (32 and 16 multipliers)
rg "32[^0-9]|16[^0-9]" --type cpp arch/x86/
Length of output: 22946

Script:
#!/bin/bash
# Find where this table is used
rg -B 5 -A 5 "perm_idx_lut" --type cpp

# Look for related AVX2 permutation functions
rg "perm.*avx2" --type cpp
Length of output: 802

Script:
#!/bin/bash
# Look for the chunk permute table header that's included
rg -l "chunk_permute_table.h"

# Check the content of the chunk permute table header
cat ../generic/chunk_permute_table.h

# Look for any AVX2-specific permutation functions
rg -B 3 -A 3 "permute.*avx2|avx2.*permute" --type cpp
Length of output: 317

Script:
#!/bin/bash
# Check the content of chunk_permute_table.h from the repository root
cat arch/generic/chunk_permute_table.h

# Look for files that use this table
rg -B 5 -A 5 "lut_rem_pair" --type cpp

# Check the chunkset implementation that might use this table
cat arch/x86/chunkset_ssse3.c
Length of output: 7574
arch/x86/x86_features.c (2)
100-101: LGTM! BMI2 detection implementation is correct.

The BMI2 feature bit (bit 3 in ebx) is correctly checked from CPUID leaf 7, which aligns with Intel's documentation and will properly detect the presence of BMI2 instructions needed for optimized masked operations.

Let's verify the BMI2 detection works as expected:

✅ Verification successful

Based on the results, I can now generate the final response:

BMI2 detection implementation needs attention - potential inconsistency with AVX512 requirements

The verification revealed that BMI2 is not just used for optional optimizations, but is actually a required feature for AVX512_common support:

In arch/x86/x86_features.c, BMI2 is part of the mandatory requirements: features->has_avx512_common = ... && features->has_bmi2

However, there's no documentation or clear rationale explaining why BMI2 is a mandatory requirement for AVX512_common

This requirement seems unusual since BMI2 is typically not a prerequisite for AVX512 operations. Let's verify this further:

BMI2 requirement for AVX512_common is intentional and correct

The verification reveals that BMI2 is correctly required as part of AVX512_common for this codebase:

The configure script and CMake files consistently include BMI2 with AVX512 flags:

avx512flag="-mavx512f -mavx512dq -mavx512bw -mavx512vl -mbmi2"

This is applied across all build systems (Makefile.in, CMakeLists.txt)

The requirement is used in specific optimized implementations:

adler32_avx512 and related functions

chunkset_avx512 for memory operations

vpclmulqdq operations that depend on AVX512_common

The BMI2 requirement is part of the architectural design, as evidenced by the compiler flags and feature detection logic working together to ensure optimal performance for the AVX512 implementations.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify BMI2 detection is properly used in the codebase
# Test 1: Check if BMI2 detection is consistently implemented across the codebase
rg -A 2 -B 2 "has_bmi2"

# Test 2: Look for BMI2-specific instructions that might be used
ast-grep --pattern 'asm {
  $$$
  pext
  $$$
}'
Length of output: 975

Script:
#!/bin/bash
# Check for any documentation or comments explaining the BMI2 requirement
rg -B 5 "avx512.*common" 

# Look for actual AVX512 instruction usage that might depend on BMI2
rg "avx512" -A 2 -B 2
Length of output: 33239
113-113: Verify the requirement of BMI2 for AVX512 common features.

Adding BMI2 as a requirement for AVX512 common features is a significant change that could potentially disable AVX512 optimizations on some processors. While this aligns with the PR's goal of optimizing masked operations, please confirm:

Are there target processors that support AVX512 but not BMI2?

Is disabling AVX512 features on such processors (if they exist) an acceptable trade-off?

Let's check the processor support matrix:

✅ Verification successful

Based on the code analysis, I can now provide a definitive response about the BMI2 requirement for AVX512.

BMI2 requirement for AVX512 is correct and intentional

The requirement of BMI2 for AVX512 features is correct because:

BMI2 instructions are essential for optimal performance of AVX512 masked operations, as they provide efficient bit manipulation capabilities needed by the AVX512 implementation.

All modern processors that support the full AVX512 feature set (F,DQ,BW,VL) also support BMI2:

The code checks for has_avx512_common = has_avx512f && has_avx512dq && has_avx512bw && has_avx512vl && has_bmi2

This combination is standard on Intel processors starting from Skylake-X/Skylake-SP, which all include BMI2

The codebase has optimized implementations that leverage both AVX512 and BMI2 instructions together, as seen in the compilation flags and feature detection code.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for processor compatibility information in documentation
# Test 1: Look for processor support documentation
fd -g "*.md" -g "*.txt" -X rg -i "processor.*(support|compat)"

# Test 2: Check if there are any existing CPU detection tests
fd -g "*test*.c" -X rg "avx512.*bmi2"
Length of output: 197

Script:
#!/bin/bash
# Let's try a different approach to find relevant information

# Search for CPU feature detection and requirements
rg -A 5 -B 5 "has_avx512" 

# Look for any documentation about CPU features
fd "README|CONTRIBUTING|docs" -e md -e txt -X cat

# Search for test files or documentation containing AVX512 or BMI2
rg -i "avx512|bmi2" 
Length of output: 60852
arch/x86/chunkset_avx2.c (1)
7-7: Verify table definitions in avx2_tables.h

The switch from chunk_permute_table.h to avx2_tables.h suggests a reorganization of lookup tables. Please ensure that:

All required tables (perm_idx_lut, permute_table, half_rem_vals) are properly defined in the new header

The table definitions are equivalent to maintain correctness

✅ Verification successful

Header change is correctly implemented and maintains compatibility

The switch from directly including chunk_permute_table.h to using avx2_tables.h is a well-structured change that:

Maintains all required functionality by including the generic chunk_permute_table.h which contains the shared permutation table

Properly defines AVX2-specific lookup tables (perm_idx_lut and half_rem_vals) with correct values

Organizes the code better by separating architecture-specific tables while reusing common tables
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the table definitions in the new header

# Check if the required tables are defined in avx2_tables.h
rg -A 5 "perm_idx_lut|permute_table|half_rem_vals" "arch/x86/avx2_tables.h"

# Compare with old definitions
rg -A 5 "perm_idx_lut|permute_table|half_rem_vals" "arch/x86/generic/chunk_permute_table.h"
Length of output: 827

Script:
#!/bin/bash
# Let's verify the content and structure of both files

# First, check if the old file exists in a different location
fd "chunk_permute_table.h"

# Then get the content of avx2_tables.h to understand its complete structure
cat "arch/x86/avx2_tables.h"

# Also search for any references to these tables in the codebase
rg -A 2 "perm_idx_lut|half_rem_vals" --type c
Length of output: 4091

Script:
#!/bin/bash
# Let's check the content of the generic chunk_permute_table.h to complete our verification
cat "arch/generic/chunk_permute_table.h"
Length of output: 3452
arch/x86/Makefile.in (2)
11-12: LGTM: Addition of BMI2 support enhances AVX512 capabilities.

The addition of -mbmi2 to both AVX512FLAG and AVX512VNNIFLAG is appropriate for supporting efficient bit manipulation operations, which aligns well with the PR's goal of optimizing masked moves in the inflate operation.

34-34: LGTM: New AVX512 build targets follow consistent patterns.

The addition of chunkset_avx512.o and chunkset_avx512.lo targets maintains consistency with existing AVX targets and correctly applies the necessary compilation flags including NOLTOFLAG for optimization.

Let's verify the consistency of the new targets with existing ones:

Also applies to: 56-61

✅ Verification successful

Verified: AVX512 build targets follow consistent patterns with proper flags

The verification confirms that:

The new chunkset_avx512 targets use the same flag patterns as other AVX512 implementations (adler32_avx512, crc32_vpclmulqdq)

All chunkset implementations (SSE2, SSSE3, AVX2, AVX512) consistently use NOLTOFLAG

Both object (.o) and PIC (.lo) targets are properly defined with appropriate flags
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that all AVX512 targets follow consistent patterns
# Test: Check for consistent flag usage across AVX512 targets
rg "AVX512FLAG.*-c.*\.c$" arch/x86/Makefile.in

# Test: Verify NOLTOFLAG is used consistently
rg "NOLTOFLAG.*chunkset.*\.c$" arch/x86/Makefile.in
Length of output: 1484
arch/x86/x86_functions.h (1)
49-51: LGTM! Function declarations follow established patterns.

The new AVX512 function declarations maintain consistency with existing SSE2/AVX2 implementations while preserving the safety-conscious naming convention (e.g., chunkmemset_safe_avx512).

Let's verify the implementation maintains the safety guarantees:

✅ Verification successful

Let's first locate the implementation files before diving into the specific function details.

Function declarations are consistent and implementation maintains safety guarantees

The verification confirms that the AVX512 implementation:

Uses masked operations (_mm256_mask_storeu_epi8, _mm_maskz_loadu_epi8) for bounds checking

Implements proper mask generation functions (gen_mask, gen_half_mask) using _bzhi_u32

Follows the template-based implementation pattern with chunkset_tpl.h

The implementation aligns with the PR objectives of using masked moves for safety and optimizing AVX512 operations while maintaining proper bounds checking.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the AVX512 implementation maintains bounds checking
# Look for bounds checking in the implementation file

# Search for bounds checking patterns in the implementation
rg -A 5 "chunkmemset_safe_avx512.*\{" 
Length of output: 39

Script:
#!/bin/bash
# Let's try a broader search to locate the AVX512 implementation files
fd -e c -e cpp -e h "avx512" --exec rg -l "chunkmemset_safe_avx512"

# Also search for any AVX512-related files to understand the implementation structure
fd "avx512|AVX512"
Length of output: 227

Script:
#!/bin/bash
# Now that we found the implementation file, let's examine the chunkmemset_safe_avx512 function
# Looking for bounds checking and safety measures
rg -A 15 "chunkmemset_safe_avx512" arch/x86/chunkset_avx512.c

# Also check for any mask-related operations as mentioned in PR objectives
rg "mask|_mm512_mask" arch/x86/chunkset_avx512.c
Length of output: 1363
chunkset_tpl.h (5)
90-90: LGTM: Improved conditional compilation guard.

The additional check !defined(HAVE_HALFCHUNKCOPY) prevents potential conflicts with architecture-specific implementations.

167-168: LGTM: Improved remainder handling with masked writes.

The changes properly track length after halfchunk operations and optimize remainder handling using masked writes when available.

Also applies to: 222-226

Line range hint 251-261: LGTM: Added fallback for non-masked write architectures.

The implementation safely handles small lengths when masked writes aren't available, maintaining proper alignment requirements.

128-136: LGTM: Added masked write optimization for small copies.

The implementation correctly handles the edge case of reading ahead in the buffer while maintaining safety. For small lengths (≤ chunk_t), it efficiently falls back to CHUNKCOPY.

Let's verify the masked write implementation across the codebase:
#!/bin/bash
# Description: Check for consistent masked write handling
# Look for other instances where HAVE_MASKED_READWRITE is used
rg "HAVE_MASKED_READWRITE" -B 2 -A 2

# Look for potential masked write operations
ast-grep --pattern 'storechunk_mask($_, $_, $_)'
Line range hint 269-289: Consider potential length truncation in CHUNKCOPY_SAFE.

While changing len to uint64_t improves length handling, casting it to unsigned in the CHUNKMEMSET call could potentially truncate large values. Consider handling this case explicitly.

Let's verify the maximum possible length:

Consider applying this change:
-    return CHUNKMEMSET(out, from, (unsigned)len);
+    if (len > UINT_MAX) {
+        /* Handle large lengths in chunks */
+        while (len > UINT_MAX) {
+            out = CHUNKMEMSET(out, from, UINT_MAX);
+            from += UINT_MAX;
+            len -= UINT_MAX;
+        }
+    }
+    return CHUNKMEMSET(out, from, (unsigned)len);
functable.c (1)
132-132: LGTM! The AVX512 inflate implementation is properly integrated.

The assignment follows the correct pattern and is properly guarded by the AVX512 feature check. The implementation maintains thread safety through atomic assignment and memory barriers.

Let's verify the AVX512 implementation is properly declared:

✅ Verification successful

Let me verify the declaration and consistency of the AVX512 implementation with one more check:

AVX512 implementation is properly integrated and declared

The verification confirms:

inflate_fast_avx512 is correctly declared in arch/x86/x86_functions.h

The implementation follows the template pattern in arch/x86/chunkset_avx512.c

The feature detection using has_avx512_common is consistent with other AVX512 optimizations

The function assignment is placed in the correct AVX512 conditional block alongside other AVX512 functions
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify inflate_fast_avx512 is properly declared and consistent with other AVX512 functions

# Test 1: Check if inflate_fast_avx512 is declared in header files
rg -l "inflate_fast_avx512"

# Test 2: Verify consistent AVX512 feature detection usage
rg "has_avx512" -A 2 -B 2
Length of output: 5127

Script:
#!/bin/bash
# Check the declaration and implementation of inflate_fast_avx512
rg "inflate_fast_avx512" -A 2 -B 2
Length of output: 1173
cmake/detect-intrinsics.cmake (1)
79-79: LGTM! Adding BMI2 support enhances AVX512 optimizations.

The addition of -mbmi2 flag to AVX512 and AVX512VNNI configurations is a good improvement. BMI2 instructions complement AVX512 masked operations, potentially improving the performance of bit manipulation tasks.

Let's verify that the compiler correctly detects and enables these features:

Also applies to: 86-86, 117-117, 122-122

✅ Verification successful

Let me gather more information about the AVX512 and BMI2 usage in the codebase to ensure there are no conflicts or missing configurations.

Based on the gathered information, I can now provide a final response:

BMI2 flag addition is correctly integrated with AVX512 optimizations

The addition of -mbmi2 flag is well-justified and properly implemented:

BMI2 instructions (_bzhi_u32) are actively used in arch/x86/chunkset_avx512.c

The runtime feature detection in arch/x86/x86_features.c correctly checks for BMI2 support

The flag is consistently added to both AVX512 and AVX512VNNI configurations across all supported compilers
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that both AVX512 and BMI2 instructions are detected
# Expected: The compiler should recognize both instruction sets

# Test 1: Check if compiler accepts the combined flags
echo '#include <immintrin.h>' > test.c
echo 'int main() { return 0; }' >> test.c

for CC in gcc clang icc; do
    if command -v $CC >/dev/null 2>&1; then
        echo "Testing $CC..."
        $CC -mavx512f -mavx512dq -mavx512bw -mavx512vl -mbmi2 -c test.c 2>&1 || echo "$CC: Failed to compile with AVX512+BMI2"
    fi
done

# Test 2: Search for other AVX512 configurations that might need BMI2
rg -l "mavx512" --type cmake
Length of output: 255

Script:
#!/bin/bash
# Check the complete context of AVX512 and BMI2 usage in detect-intrinsics.cmake
cat cmake/detect-intrinsics.cmake

# Search for any BMI2-specific checks or configurations
rg "bmi2" -i --type cmake

# Look for any assembly or intrinsics usage of BMI2 instructions
rg "bmi2|_pext_|_pdep_|_bzhi_|_mulx_" -i
Length of output: 22861
CMakeLists.txt (1)
984-985: LGTM! The AVX512 chunkset integration looks good.

The addition of chunkset_avx512.c and its feature info message follows the established pattern for AVX512 optimizations in the build system.

Let's verify the AVX512 flag propagation:

✅ Verification successful

AVX512 chunkset compilation flags are properly set

The verification confirms that chunkset_avx512.c is correctly added to AVX512_SRCS and all AVX512 sources are compiled with the appropriate flags:

The file is added to AVX512_SRCS list

All AVX512_SRCS are compiled with ${AVX512FLAG} ${NOLTOFLAG}

The feature info message correctly references the AVX512FLAG variable
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify AVX512 flags are properly set for chunkset_avx512.c
# Test: Check if chunkset_avx512.c is compiled with AVX512 flags
rg -l "set_property.*SOURCE.*chunkset_avx512.c.*PROPERTY.*COMPILE_FLAGS.*AVX512FLAG"
Length of output: 86

Script:
#!/bin/bash
# Let's check the CMakeLists.txt more thoroughly for AVX512 flag settings
rg -A 5 "AVX512_SRCS.*chunkset_avx512" CMakeLists.txt

# Also check for any general AVX512 compilation flag settings
rg "set_property.*COMPILE_FLAGS.*AVX512" CMakeLists.txt

# And check for any other AVX512 source file compilations for reference
rg "set_property.*SOURCE.*avx512.*COMPILE_FLAGS" CMakeLists.txt
Length of output: 971
configure (2)
1592-1593: LGTM: Architecture-specific objects properly configured.

The addition of chunkset_avx512.o and chunkset_avx512.lo to the architecture-specific objects list is correct. This ensures the new AVX512-optimized implementation is included in both static and shared library builds.

109-109: Verify BMI2 instruction set availability.

The addition of -mbmi2 to AVX512 flags suggests that the new implementation relies on BMI2 instructions for efficient bit manipulation. This is aligned with the PR's goal of optimizing masked writes. However, we should ensure proper runtime detection of BMI2 support.

✅ Verification successful

Based on the results, I can now generate the final response as we have sufficient information about BMI2 detection in the codebase.

BMI2 detection is properly implemented alongside AVX512 features

The codebase correctly handles BMI2 detection:

Runtime detection is implemented in arch/x86/x86_features.c through CPU feature flags

BMI2 is checked as a requirement alongside AVX512 features

The compiler flags in both configure and arch/x86/Makefile.in consistently include -mbmi2 with AVX512 flags
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check if BMI2 detection is properly implemented in the codebase

# Test: Look for BMI2 feature detection in x86 feature detection code
rg -A 5 "BMI2|bmi2" arch/x86/
Length of output: 1280
arch/x86/chunkset_avx512.c (1)
185-189: Verify macro definitions for AVX512 compatibility

Ensure that the macros CHUNKSIZE, CHUNKUNROLL, CHUNKMEMSET, and related definitions are correctly set for AVX512 operations. This is crucial for maintaining consistency and performance optimizations specific to the AVX512 architecture.

Run the following command to check the macro definitions:

✅ Verification successful

AVX512 chunk operations are correctly defined and integrated

The macro definitions in chunkset_avx512.c are properly set up and integrated with the template implementation. The verification shows:

The macros correctly map to AVX512-specific functions

The template file (chunkset_tpl.h) provides the core implementation that these macros hook into

The functions are properly declared in x86_functions.h and are enabled when AVX512 features are available

The implementation follows the same pattern as other SIMD variants (SSE2, SSSE3, AVX2)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that AVX512-specific macros are correctly defined.

# Expected: Macros should be defined with AVX512-specific values.
grep -E '#define CHUNKSIZE|#define CHUNKUNROLL' chunkset_tpl.h
Length of output: 64

Script:
#!/bin/bash
# Let's search for these macro definitions in the entire codebase
rg -l "CHUNKSIZE|CHUNKUNROLL|CHUNKMEMSET" arch/x86/

# Then examine the content of relevant files
find arch/x86/ -type f -name "*.h" -o -name "*.c" | xargs grep -l "CHUNKSIZE\|CHUNKUNROLL\|CHUNKMEMSET" | while read -r file; do
    echo "=== $file ==="
    cat "$file" | grep -A 2 -B 2 "CHUNKSIZE\|CHUNKUNROLL\|CHUNKMEMSET"
done
Length of output: 2671

Script:
#!/bin/bash
# Let's examine the template file that's included and the actual implementations
echo "=== Content of chunkset_tpl.h ==="
cat arch/x86/chunkset_tpl.h

echo -e "\n=== AVX512 Implementation Details ==="
# Look for the actual implementation of these functions in AVX512
rg -A 5 "chunksize_avx512|chunkunroll_avx512|chunkmemset_avx512" arch/x86/chunkset_avx512.c
Length of output: 599

Script:
#!/bin/bash
# Let's check if the template file might be in a different location
fd -t f "chunkset_tpl.h"

# Also let's look for the actual implementations of these functions
echo "=== Function Implementations ==="
rg -B 3 "^static.*chunk(size|unroll|memset)_avx512" arch/x86/chunkset_avx512.c

# Check if there are any related header files that might contain definitions
echo -e "\n=== Related Headers ==="
fd -e h . | xargs grep -l "chunk.*_avx512"
Length of output: 328

Script:
#!/bin/bash
# Let's examine the template file and the header file containing AVX512 definitions
echo "=== Content of chunkset_tpl.h ==="
cat chunkset_tpl.h

echo -e "\n=== Content of x86_functions.h ==="
cat arch/x86/x86_functions.h | grep -A 5 -B 5 "avx512"

# Let's also check if these functions are declared as prototypes
echo -e "\n=== Function Declarations ==="
rg "chunk(size|unroll|memset)_avx512" --type c --type h
Length of output: 12030

inffast_tpl.h

arch/x86/chunkset_avx512.c

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

arch/x86/x86_features.c (1)

Line range hint 100-113: Consider adding performance guards and fallbacks.

Since this is part of a performance optimization, consider:

Adding comments explaining why BMI2 is required for the AVX512 masked write optimizations

Implementing a fallback path for processors with AVX512 but without BMI2

Adding benchmark comparisons in the documentation to justify the feature requirements
chunkset_tpl.h (1)
Type change in CHUNKCOPY_SAFE requires attention

The change of len parameter to uint64_t in CHUNKCOPY_SAFE is problematic as the callers in inffast_tpl.h use unsigned variables (op and len) which are 32-bit. This implicit conversion from 32-bit to 64-bit parameters could lead to issues on some platforms.

inffast_tpl.h: All calls to CHUNKCOPY_SAFE pass unsigned (32-bit) variables as the length parameter

chunkset_tpl.h: Function declaration expects uint64_t (64-bit) length parameter

🔗 Analysis chain

Line range hint 269-289: LGTM: Enhanced CHUNKCOPY_SAFE with better type handling and masked write optimization.

Good improvements:

Using uint64_t for length allows handling larger chunks

Masked write optimization eliminates byte-by-byte copying in most cases

Safe length clamping before the unsigned cast

However, verify that all callers can handle the type change.

Let me check the type compatibility of the length parameter in the caller contexts.

Let me check the type declarations of the variables used in the length parameters.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for calls to CHUNKCOPY_SAFE to verify type compatibility
rg -A 2 "CHUNKCOPY_SAFE\s*\(" --type c
Length of output: 1000

Script:
#!/bin/bash
# Get more context around CHUNKCOPY_SAFE calls to verify length parameter types
ast-grep --pattern 'CHUNKCOPY_SAFE($_, $_, $len, $_)'

# Also get the variable declarations and assignments for op and len
rg -B 5 "CHUNKCOPY_SAFE\s*\(" inffast_tpl.h
Length of output: 1121

Script:
#!/bin/bash
# Get variable declarations for len and op
ast-grep --pattern 'unsigned $_len'
ast-grep --pattern 'unsigned $_op'

# Also get the struct/type definitions that might contain these variables
rg -B 5 "struct" inffast_tpl.h
Length of output: 787

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 05a920f and 375af7f.

📒 Files selected for processing (13)

CMakeLists.txt (1 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/avx2_tables.h (1 hunks)
arch/x86/chunkset_avx2.c (1 hunks)
arch/x86/chunkset_avx512.c (1 hunks)
arch/x86/x86_features.c (2 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
chunkset_tpl.h (8 hunks)
cmake/detect-intrinsics.cmake (2 hunks)
configure (2 hunks)
functable.c (1 hunks)
inffast_tpl.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (9)

CMakeLists.txt
arch/x86/avx2_tables.h
arch/x86/chunkset_avx2.c
arch/x86/x86_features.h
arch/x86/x86_functions.h
cmake/detect-intrinsics.cmake
configure
functable.c
inffast_tpl.h

🧰 Additional context used

📓 Learnings (1)

chunkset_tpl.h (1)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.

🔇 Additional comments (16)

arch/x86/x86_features.c (2)
113-113: Verify the BMI2 requirement for AVX512_common features.

Making BMI2 a requirement for has_avx512_common is a significant change that could potentially disable AVX512 optimizations on some processors. While this might be necessary for the masked write optimizations:

Could you confirm if there are any AVX512-capable processors that don't support BMI2?

Is there a fallback path for processors with AVX512 but without BMI2?

Let's check processor support patterns:
#!/bin/bash
# Search for documentation or comments about processor support requirements
rg -B 5 -A 5 "AVX512.*BMI2|BMI2.*AVX512" --type-add 'docs:*.{txt,md,rst}'
100-101: BMI2 feature detection looks correct!

The implementation correctly checks bit 3 of the EBX register from CPUID leaf 7 for BMI2 support, following Intel's documentation.

Let's verify the BMI2 bit position in other x86 feature detection implementations:

✅ Verification successful

BMI2 feature detection implementation is correct

The implementation is verified to be correct based on the following evidence:

The BMI2 check is properly placed within CPUID leaf 7 block (maxbasic >= 7)

The bit position 0x8 (bit 3) in ebx register is correct according to Intel's specification for BMI2 detection

The feature is correctly integrated into the x86_cpu_features structure and properly used as a requirement for AVX512 common features

The implementation follows the same pattern as other CPU feature detections in the file
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for BMI2 detection patterns in other codebases for verification
rg -i "bmi2.*=.*ebx.*0x8|ebx.*0x8.*bmi2" --type cpp
Length of output: 75

Script:
#!/bin/bash
# Let's try a broader search for BMI2 detection patterns
rg -B 5 -A 5 "bmi2" --type cpp

# Also search for CPUID leaf 7 feature detection patterns
rg -B 5 -A 5 "cpuid.*7.*ebx" --type cpp

# Look for x86 feature detection header files
fd -e h "feature|cpu" arch/x86/
Length of output: 684

Script:
#!/bin/bash
# Let's examine the x86_features.c file directly to understand the context
cat arch/x86/x86_features.c

# Also look for any BMI2-related constants or defines
rg -B 2 -A 2 "0x8|BMI2" arch/x86/
Length of output: 7031
arch/x86/Makefile.in (3)
11-12: LGTM: Addition of BMI2 support enhances masked operations

The addition of -mbmi2 to AVX512 flags is appropriate as BMI2's bit manipulation instructions (particularly BZHI) complement AVX512's masked operations, which aligns well with the PR's objective of optimizing masked writes.

34-34: LGTM: New build targets follow established patterns

The addition of chunkset_avx512 targets maintains consistency with existing build patterns and proper alphabetical ordering.

56-61: LGTM: Compilation rules are consistent with existing patterns

The compilation rules for chunkset_avx512 targets are properly structured with correct flags and PIC support.

Let's verify the source file existence and consistency:

✅ Verification successful

Compilation rules for chunkset_avx512 targets are correctly implemented

The verification confirms:

Source file chunkset_avx512.c exists at the correct location

Compilation rules follow the same pattern as other architecture-specific chunkset implementations:

Uses architecture-specific flags (AVX512FLAG like others use AVX2FLAG, SSE2FLAG, etc.)

Includes NOLTOFLAG consistently

Adds -DPIC flag for shared library objects (.lo)

Maintains consistent variable usage (CC, CFLAGS/SFLAGS, INCLUDES)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Verify the chunkset_avx512.c source file exists
fd "chunkset_avx512.c" --type f

# Check if other chunkset implementations follow similar patterns
rg -A 2 "chunkset_.*\.c" --type make
Length of output: 2861
chunkset_tpl.h (4)

Line range hint 90-109: LGTM: Improved conditional compilation directive.

The updated condition defined(HAVE_HALF_CHUNK) && !defined(HAVE_HALFCHUNKCOPY) ensures this implementation is only used when half-chunk operations are supported but no architecture-specific implementation exists.

128-136: LGTM: Efficient handling of short forward copies using masked writes.

The optimization correctly handles the case where we're reading from ahead in the buffer by falling back to CHUNKCOPY when the length fits in a single chunk. The comment about compiler optimization of identical code segments is insightful.

167-168: LGTM: Proper length tracking after halfchunk store.

The explicit length adjustment and pointer advancement after storing a halfchunk maintains correct state for subsequent operations.

222-226: LGTM: Efficient remainder handling using masked writes.

The use of storechunk_mask with a generated mask for remaining bytes is an excellent optimization that:

Eliminates bounds checking overhead

Leverages efficient AVX512 masked write operations

Maintains memory safety without performance penalty

arch/x86/chunkset_avx512.c (7)

36-56: Efficient Implementation of 'chunkmemset' Functions

The chunkmemset_* functions are efficiently implemented using appropriate intrinsic functions, optimizing performance for various data sizes.

58-68: Correct Use of Load and Store Operations

The loadchunk, storechunk, and storechunk_mask functions correctly utilize AVX512 intrinsics for loading and storing data, ensuring proper handling of chunked data.

70-93: Well-Implemented 'CHUNKCOPY' Function

The CHUNKCOPY function effectively handles copying data chunks, including proper handling of remaining bytes using masked operations. The logic ensures data is copied accurately and efficiently.

95-125: Efficient Implementation of 'GET_CHUNK_MAG' Function

The GET_CHUNK_MAG function is effectively implemented, utilizing AVX512 intrinsics and appropriate permutation logic to handle data retrieval based on distance parameters.

127-139: Proper Handling in Half-Chunk Functions

The functions loadhalfchunk, storehalfchunk, and halfchunk2whole correctly handle half-chunk data, ensuring proper data manipulation and conversion between half chunks and whole chunks.

141-152: Effective Implementation of 'GET_HALFCHUNK_MAG' Function

The GET_HALFCHUNK_MAG function correctly retrieves half-chunk magnitudes using appropriate AVX512 intrinsics and permutation logic, ensuring efficient data handling.

154-177: Well-Structured 'HALFCHUNKCOPY' Function

The HALFCHUNKCOPY function efficiently copies half-chunk data, properly managing remaining bytes with masked operations. The implementation is clear and maintains consistency with chunk copying practices.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

chunkset_tpl.h (1)

Line range hint 269-289: Consider potential truncation in CHUNKMEMSET call.

While the change to uint64_t for len parameter improves the supported range, the cast to unsigned when calling CHUNKMEMSET could lead to truncation for values larger than UINT_MAX.

Consider one of these solutions:

-    return CHUNKMEMSET(out, from, (unsigned)len);
+    // Solution 1: Split large copies into chunks
+    while (len > UINT_MAX) {
+        out = CHUNKMEMSET(out, from, UINT_MAX);
+        from += UINT_MAX;
+        len -= UINT_MAX;
+    }
+    return CHUNKMEMSET(out, from, (unsigned)len);

+    // Solution 2: Assert maximum length
+    Assert(len <= UINT_MAX, "Length exceeds maximum supported value");
+    return CHUNKMEMSET(out, from, (unsigned)len);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 375af7f and 51e7081.

📒 Files selected for processing (13)

CMakeLists.txt (1 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/avx2_tables.h (1 hunks)
arch/x86/chunkset_avx2.c (1 hunks)
arch/x86/chunkset_avx512.c (1 hunks)
arch/x86/x86_features.c (2 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
chunkset_tpl.h (8 hunks)
cmake/detect-intrinsics.cmake (2 hunks)
configure (2 hunks)
functable.c (1 hunks)
inffast_tpl.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (11)

CMakeLists.txt
arch/x86/Makefile.in
arch/x86/avx2_tables.h
arch/x86/chunkset_avx2.c
arch/x86/x86_features.c
arch/x86/x86_features.h
arch/x86/x86_functions.h
cmake/detect-intrinsics.cmake
configure
functable.c
inffast_tpl.h

🧰 Additional context used

📓 Learnings (1)

chunkset_tpl.h (1)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.

🔇 Additional comments (22)

chunkset_tpl.h (3)

Line range hint 90-108: LGTM: Clean implementation of HALFCHUNKCOPY.

The conditional compilation directive correctly prevents duplicate implementations while maintaining the half-chunk copy functionality.

222-226: LGTM: Efficient remainder handling with masked writes.

The implementation efficiently handles remainder bytes using masked writes instead of memcpy, which aligns with the optimization goals while maintaining safety.

128-136: LGTM: Efficient handling of ahead-reads with masked writes.

The implementation correctly optimizes small-length copies while maintaining safety for ahead-reads. This aligns with the PR's objective of utilizing masked moves efficiently.

Let's verify the masked write implementation across the codebase:

✅ Verification successful

Masked write implementation is correctly integrated and consistent across the codebase

The implementation in chunkset_tpl.h is properly aligned with the AVX512-specific implementation in arch/x86/chunkset_avx512.c. Key findings:

The HAVE_MASKED_READWRITE feature is properly defined in AVX512 implementation
The storechunk_mask function is correctly implemented using _mm256_mask_storeu_epi8
The optimization for small lengths (len <= sizeof(chunk_t)) is safe as it utilizes the masked write capabilities
The implementation is consistently used across both inffast_tpl.h and chunkset_tpl.h

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify masked write usage and implementation
# Look for other masked write implementations and potential inconsistencies

# Check for masked write related definitions
rg -l "HAVE_MASKED_READWRITE|masked.*write|storechunk_mask"

# Look for potential masked write usage patterns
ast-grep --pattern 'storechunk_mask($_, $_, $_)'

Length of output: 168

Script:

#!/bin/bash
# Let's examine the implementation details across these files

# Check masked write implementation in inffast_tpl.h
rg -A 5 "HAVE_MASKED_READWRITE" inffast_tpl.h

# Check masked write implementation in chunkset_tpl.h
rg -A 5 "HAVE_MASKED_READWRITE" chunkset_tpl.h

# Check AVX512 specific implementation
cat arch/x86/chunkset_avx512.c

# Look for any configuration or build settings related to masked writes
rg "HAVE_MASKED_READWRITE" -A 2 "CMakeLists.txt|*.cmake|configure*"

Length of output: 7772

arch/x86/chunkset_avx512.c (19)

1-3: LGTM!

The file header comment provides a clear description of the purpose and usage conditions of the code.

4-11: LGTM!

The necessary header files are included, and the code is properly guarded with the X86_AVX512 macro to ensure it is only compiled when AVX512 support is available.

12-15: LGTM!

The type definitions for chunk_t, halfchunk_t, mask_t, and halfmask_t are clear and appropriate for AVX512 operations.

17-26: LGTM!

The preprocessor macros are used to indicate the availability of specific functions and features, enhancing code readability and maintainability.

36-40: LGTM!

The chunkmemset_2 function correctly sets a 16-bit value in the chunk using AVX512 instructions.

42-46: LGTM!

The chunkmemset_4 function correctly sets a 32-bit value in the chunk using AVX512 instructions.

48-52: LGTM!

The chunkmemset_8 function correctly sets a 64-bit value in the chunk using AVX512 instructions.

54-56: LGTM!

The chunkmemset_16 function correctly sets a 128-bit value in the chunk using AVX512 instructions.

58-60: LGTM!

The loadchunk function correctly loads a chunk from memory using AVX512 instructions.

62-64: LGTM!

The storechunk function correctly stores a chunk to memory using AVX512 instructions.

66-68: LGTM!

The storechunk_mask function correctly stores a chunk to memory using AVX512 masked store instructions.

70-93: LGTM!

The CHUNKCOPY function correctly copies data from the source to the destination using AVX512 instructions. It handles any remaining bytes that do not fit into a full chunk using masked operations.

126-128: LGTM!

The loadhalfchunk function correctly loads a half-chunk from memory using AVX512 instructions.

130-132: LGTM!

The storehalfchunk function correctly stores a half-chunk to memory using AVX512 instructions.

134-138: LGTM!

The halfchunk2whole function correctly zero-extends a half-chunk to a whole chunk using AVX512 instructions.

153-176: LGTM!

The HALFCHUNKCOPY function correctly copies data from the source to the destination using AVX512 instructions for half-chunks. It handles any remaining bytes that do not fit into a full half-chunk using masked operations.

178-181: LGTM!

The preprocessor macros are used to define chunk-related constants and functions specific to AVX512.

183-184: LGTM!

The chunkset_tpl.h file is included to provide template implementations for chunk-related functions.

185-187: LGTM!

The INFLATE_FAST macro is defined to use the AVX512-optimized inflate_fast_avx512 function, and the inffast_tpl.h file is included to provide the template implementation for the inflate functionality.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)

chunkset_tpl.h (1)

Line range hint 269-289: Consider explicit type conversion for len parameter.

The change from unsigned to uint64_t for len improves large buffer support. However, there's an implicit conversion when calling CHUNKMEMSET:

return CHUNKMEMSET(out, from, (unsigned)len);

While this works because we've already bounded len by safelen, consider adding a comment explaining why the conversion is safe here.

arch/x86/chunkset_avx512.c (3)

28-30: Consider adding function documentation for clarity

To enhance readability and maintainability, consider adding a brief comment explaining the purpose and usage of the gen_half_mask function. This will help other developers understand its role in generating half-width masks.

32-34: Consider adding function documentation for clarity

Similarly, adding a comment to describe the gen_mask function would provide insight into its functionality in generating full-width masks, aiding future code maintenance.

153-176: Add documentation to HALFCHUNKCOPY for better understanding

Including a comment that describes the HALFCHUNKCOPY function will improve code clarity, helping others understand its purpose and how it complements CHUNKCOPY in handling half-size chunks.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 375af7f and 767c58a.

📒 Files selected for processing (13)

CMakeLists.txt (1 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/avx2_tables.h (1 hunks)
arch/x86/chunkset_avx2.c (1 hunks)
arch/x86/chunkset_avx512.c (1 hunks)
arch/x86/x86_features.c (2 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
chunkset_tpl.h (8 hunks)
cmake/detect-intrinsics.cmake (2 hunks)
configure (2 hunks)
functable.c (1 hunks)
inffast_tpl.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (11)

CMakeLists.txt
arch/x86/Makefile.in
arch/x86/avx2_tables.h
arch/x86/chunkset_avx2.c
arch/x86/x86_features.c
arch/x86/x86_features.h
arch/x86/x86_functions.h
cmake/detect-intrinsics.cmake
configure
functable.c
inffast_tpl.h

🧰 Additional context used

📓 Learnings (2)

arch/x86/chunkset_avx512.c (4)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:0-0
Timestamp: 2024-10-29T02:18:10.472Z
Learning: In `arch/x86/chunkset_avx512.c`, within `GET_CHUNK_MAG`, the `dist` parameter is always greater than or equal to 3 because distances less than 3 are handled differently, using direct broadcasts (`chunkmemset_1`, `chunkmemset_2`) that don't require shuffling.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:140-151
Timestamp: 2024-10-29T02:16:15.776Z
Learning: In `GET_HALFCHUNK_MAG`, `dist` is always ≥ 3 because distances less than 3 are handled elsewhere and the function is not called for those cases.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:28-30
Timestamp: 2024-10-29T02:18:25.966Z
Learning: In `chunkset_avx512.c`, the `gen_half_mask` function does not require validation for `len` since it will never exceed 16 due to computing the remainder for a 16-byte load.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1805
File: arch/x86/chunkset_avx512.c:32-34
Timestamp: 2024-10-29T02:22:55.489Z
Learning: In `arch/x86/chunkset_avx512.c`, the `gen_mask` function's `len` parameter cannot exceed 32 because it is only called on the remaining bytes from a 32-byte vector.

chunkset_tpl.h (1)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.

🔇 Additional comments (6)

chunkset_tpl.h (4)

128-136: Excellent optimization for overlapping copies!

The new masked write optimization for forward overlapping copies is well-designed:

Avoids memmove when the copy fits in one chunk
Clear comments explain the rationale
Properly guards the optimization behind HAVE_MASKED_READWRITE

222-226: Good use of masked writes for remainder bytes!

The change from memcpy to masked writes for remainder bytes:

Reduces branching in the critical path
Maintains correctness through proper masking
Falls back to memcpy when masked writes aren't available

Line range hint 90-108: LGTM! Verify halfchunk size assumptions.

The HALFCHUNKCOPY implementation looks correct and follows the established pattern. The function properly handles alignment and remainder bytes.

Let's verify the halfchunk size assumptions:

✅ Verification successful

LGTM! Halfchunk size assumptions are correct

The verification confirms that:

In AVX2/AVX512: chunk_t is __m256i (256-bit) and halfchunk_t is __m128i (128-bit)
The HALFCHUNKCOPY implementation correctly uses sizeof(halfchunk_t) for alignment and increments
The code is properly guarded by HAVE_HALF_CHUNK macro
The implementation maintains the same memory safety guarantees as CHUNKCOPY

The halfchunk size is exactly half of the chunk size in the architectures where it's used, making this implementation both correct and efficient.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify halfchunk_t size is consistent with chunk_t
# Expected: halfchunk_t should be half the size of chunk_t

# Search for chunk_t and halfchunk_t type definitions
rg -A 2 'typedef.*chunk_t|typedef.*halfchunk_t'

Length of output: 1226

Script:

#!/bin/bash
# Let's verify the size relationships and definitions more comprehensively

# Check for HAVE_HALF_CHUNK definitions and related macros
rg "HAVE_HALF_CHUNK|CHUNK_SIZE|sizeof\(halfchunk_t\)|sizeof\(chunk_t\)" --type c

# Also check the specific header for any size definitions
rg "chunk_t|halfchunk_t|CHUNK_SIZE|HALF_CHUNK" chunkset_tpl.h

Length of output: 5200

Line range hint 251-261: Verify alignment handling with masked writes.

While removing the byte-by-byte fallback with masked writes is a good optimization, we should verify that masked writes handle unaligned accesses correctly on all supported platforms.

Let's check alignment handling in the codebase:

✅ Verification successful

Based on the search results, I can now provide a definitive response about the alignment handling with masked writes:

Masked writes are safely implemented with proper alignment handling across platforms.

The codebase shows robust alignment handling:

The architecture-specific implementations (e.g., AVX512) handle unaligned access safely through appropriate intrinsics (_mm256_mask_storeu_epi8)
The codebase has comprehensive platform-specific unaligned access detection:
- UNALIGNED_OK and UNALIGNED64_OK macros for platform capabilities
- Architecture-specific checks for ARM, x86, PowerPC, etc.
The masked write implementation in chunkset_tpl.h is protected by HAVE_MASKED_READWRITE which is only defined for architectures that support safe unaligned masked stores

The byte-by-byte fallback removal is safe as the masked write implementations properly handle unaligned accesses through their respective architecture's intrinsics.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify masked write alignment handling
# Look for alignment checks and masked write implementations

# Search for alignment-related code in masked write implementations
rg -B 2 -A 2 'HAVE_MASKED_READWRITE.*align|storechunk_mask.*align'

# Search for alignment requirements in architecture-specific headers
fd -e h -e c -x rg -l 'UNALIGNED.*OK|alignment|align.*mask' {} \;

Length of output: 1226

Script:

#!/bin/bash
# Let's search for masked write implementations and alignment handling

# Search for masked write implementations and surrounding context
rg -B 5 -A 5 "HAVE_MASKED_READWRITE|storechunk_mask"

# Search for alignment-related macros and checks
rg -B 2 -A 2 "ALIGNED|unaligned|alignment|align.*mask"

# Look for architecture-specific masked write implementations
fd "chunkset.*\.(h|c)" -x rg -l "mask.*store|store.*mask" {} \;

# Search for any alignment requirements or documentation
rg -B 2 -A 2 "\/\*.*align|@.*align"

Length of output: 73256

arch/x86/chunkset_avx512.c (2)

70-93: Verify the usage of CHUNKCOPY throughout the codebase

The CHUNKCOPY function correctly handles chunk copying with proper assertions. Please ensure that all calls to CHUNKCOPY pass a non-zero len parameter, as the function asserts len > 0.

Run the following script to identify all usages of CHUNKCOPY and check the len parameter:

✅ Verification successful

CHUNKCOPY usage is safe with non-zero length parameter

The verification shows that len is safely used in CHUNKCOPY:

len is initialized from here->val and then incremented by BITS(op)
The code has a minimum output buffer guarantee of 258 bytes
There's an explicit distance check if (dist >= len || dist >= state->chunksize) before calling CHUNKCOPY
The implementation across different architectures consistently includes the assertion Assert(len > 0, "chunkcopy should never have a length 0")

The codebase shows no instances where len could be zero when passed to CHUNKCOPY, and the control flow ensures proper length handling.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all calls to CHUNKCOPY have a len parameter greater than zero.

# Test: Search for CHUNKCOPY usage and extract len arguments. Expect: All len arguments are > 0.
ast-grep --pattern 'CHUNKCOPY($_, $_, $_len)' --json | jq '.nodes[].match.captures[2].text' | while read len_var; do
    echo "Checking $len_var"
    rg --files-with-matches --fixed-strings "$len_var = 0"
done

Length of output: 173

Script:

#!/bin/bash
# Let's first find all files that contain CHUNKCOPY
rg -l "CHUNKCOPY"

# Then look at the actual usage with context
rg -A 5 "CHUNKCOPY\("

Length of output: 3704

Script:

#!/bin/bash
# Let's examine the actual call site in inffast_tpl.h to verify the len parameter
rg -B 10 -A 2 "out = CHUNKCOPY\(out, out - dist, len\)" inffast_tpl.h

# Also check any assignments to len before the CHUNKCOPY call
rg "len = 0" inffast_tpl.h
rg "len.*=.*0" inffast_tpl.h

Length of output: 891

Script:

#!/bin/bash
# Let's check how len is computed before the CHUNKCOPY call
rg -B 20 -A 5 "len = " inffast_tpl.h

# Also check the minimum value that len can have
rg -B 5 "len [<>=]" inffast_tpl.h

Length of output: 2880

Script:

#!/bin/bash
# Let's check the minimum value that len can have in the actual implementation
rg -B 10 "len \+= " inffast_tpl.h

# And check where len is used in the distance checks before CHUNKCOPY
rg -B 5 "if \(dist >= len" inffast_tpl.h

Length of output: 1068

95-124: Ensure dist values keep lut_rem.idx within bounds

While dist is always greater than or equal to 3 in this context, verify that dist - 3 does not exceed the bounds of perm_idx_lut and permute_table. This ensures safe access without risking out-of-bounds reads.

Execute the following script to check the bounds of dist:

✅ Verification successful

dist values are safely bounded within perm_idx_lut array limits

The verification shows that:

perm_idx_lut array has 29 elements (indices 0-28)
dist - 3 is used as index, meaning dist can be 3 to 31 (inclusive)
The array is explicitly designed for this range as shown by the comments in avx2_tables.h
The implementation handles two cases safely:
- dist < 16 with appropriate masking
- dist >= 16 with proper bounds checking

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Confirm that dist values keep lut_rem.idx within valid range.

# Check the size of perm_idx_lut and permute_table
rg 'static const lut_rem_pair perm_idx_lut\[\]' -A 1
rg 'static const uint8_t permute_table\[\]' -A 1

# List all dist values used in GET_CHUNK_MAG
ast-grep --pattern 'GET_CHUNK_MAG($_buf, $_chunk_rem, $_dist)' --json | jq '.nodes[].match.captures[2].text' | sort | uniq | while read dist_var; do
    echo "Dist variable: $dist_var"
    # Assuming dist_var assignment exists, verify its maximum value
    rg "$dist_var\s*=\s*\d+" -r '$dist_var = $value' | awk '{print $3}' | sort -nr | head -n1
done

Length of output: 311

Script:

#!/bin/bash
# Let's try a different approach to find the array sizes and dist usage

# Check the size of perm_idx_lut array
rg -B 2 -A 30 'static const lut_rem_pair perm_idx_lut\[\]' arch/x86/chunkset_avx512.c

# Check the size of permute_table array
rg -B 2 -A 30 'static const uint8_t permute_table\[\]' arch/x86/chunkset_avx512.c

# Find where GET_CHUNK_MAG is called and check dist values
rg "GET_CHUNK_MAG\(" arch/x86/chunkset_avx512.c -B 5 -A 1

Length of output: 396

Script:

#!/bin/bash
# Let's try to find the complete file content since it's crucial for this verification
cat arch/x86/chunkset_avx512.c

# Also search for any MAX_DIST or similar constants that might limit the dist value
rg "define.*MAX.*DIST" arch/x86/chunkset_avx512.c
rg "define.*MAX.*DIST" --type c

Length of output: 6401

Script:

#!/bin/bash
# Let's check the avx2_tables.h for the array definitions and sizes
cat arch/x86/avx2_tables.h

# Also check for any calls to this function to understand the dist parameter's bounds
rg -B 10 "GET_CHUNK_MAG\(" --type c

Length of output: 4862

functable.c

This takes advantage of the fact that on AVX512 architectures, masked moves are incredibly cheap. There are many places where we have to fallback to the safe C implementation of chunkcopy_safe because of the assumed overwriting that occurs. We're to sidestep most of the branching needed here by simply controlling the bounds of our writes with a mask.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

chunkset_tpl.h (1)

128-136: LGTM: Efficient use of masked writes for overlapping copies

The new optimization for negative distances leverages AVX512 masked writes when the copy length fits within a single chunk. This is an excellent use of AVX512's efficient masked move operations, avoiding the need for memmove in these cases.

Consider adding a comment documenting the performance characteristics of this optimization compared to memmove, as this would help maintainers understand the tradeoffs in future modifications.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 767c58a and ce5fb26.

📒 Files selected for processing (13)

CMakeLists.txt (1 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/avx2_tables.h (1 hunks)
arch/x86/chunkset_avx2.c (1 hunks)
arch/x86/chunkset_avx512.c (1 hunks)
arch/x86/x86_features.c (2 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
chunkset_tpl.h (8 hunks)
cmake/detect-intrinsics.cmake (2 hunks)
configure (2 hunks)
functable.c (1 hunks)
inffast_tpl.h (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (11)

arch/x86/Makefile.in
arch/x86/avx2_tables.h
arch/x86/chunkset_avx2.c
arch/x86/chunkset_avx512.c
arch/x86/x86_features.c
arch/x86/x86_features.h
arch/x86/x86_functions.h
cmake/detect-intrinsics.cmake
configure
functable.c
inffast_tpl.h

🧰 Additional context used

📓 Learnings (1)

chunkset_tpl.h (5)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-10-07T21:23:13.401Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-07T21:56:31.870Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.

🔇 Additional comments (4)

chunkset_tpl.h (3)

90-91: LGTM: Improved conditional compilation logic

The additional check !defined(HAVE_HALFCHUNKCOPY) ensures this implementation is only used when an architecture-specific version is not available, allowing for optimized AVX512 implementations.

Line range hint 269-289: LGTM: Improved bounds handling with masked writes

The changes improve the implementation by:

Using uint64_t for length to handle larger chunks
Leveraging masked writes to avoid byte-by-byte copying
Maintaining proper bounds checking while reducing branches

The implementation correctly maintains the invariant that 'from' cannot be ahead of 'safe', as noted in the retrieved learnings.

#!/bin/bash
# Check for potential overflow issues with uint64_t length
echo "Searching for length validation in callers..."
rg "CHUNKCOPY_SAFE.*len" -B 5 -A 5

# Look for test cases with large lengths
echo "Searching for test cases with large lengths..."
rg "test.*CHUNKCOPY_SAFE.*\d{10,}" -A 5

222-226: Verify gen_mask implementation for remainder handling

The use of masked writes for remainder bytes is a good optimization. However, we should verify the correctness of the gen_mask implementation.

✅ Verification successful

Let's search for mask-related implementations since the previous search didn't yield results. This will help us understand how the masking is implemented in the codebase.

Based on the search results, I can now provide a final response since we have found the implementation of both gen_mask and storechunk_mask functions in the AVX512 architecture-specific code.

Implementation of gen_mask is correct and properly used

The masked write implementation is correctly implemented:

gen_mask uses Intel's _bzhi_u32 intrinsic to generate a mask with the specified number of bits set
storechunk_mask uses AVX512's masked store instruction _mm256_mask_storeu_epi8 to safely write only the requested number of bytes
The implementation is architecture-specific and only enabled when HAVE_MASKED_READWRITE is defined

The code is safe because:

The mask generation ensures only the requested number of bytes are written
The AVX512 masked store instruction prevents writing beyond the requested length
The implementation is properly guarded by feature macros

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for gen_mask implementation and its usage
echo "Searching for gen_mask implementation..."
rg "gen_mask" -A 5

# Look for any test cases involving gen_mask
echo "Searching for gen_mask tests..."
rg "test.*gen_mask|gen_mask.*test" -A 5

Length of output: 2105

Script:

#!/bin/bash
# Search for gen_mask implementation with different variations
echo "Searching for gen_mask and mask generation..."
rg -i "gen.*mask|generate.*mask|mask.*gen" -A 5

# Look for mask-related functions in header files
echo "Searching mask-related functions in headers..."
fd -e h -e hpp -x rg "mask" -A 5 {}

# Search for masked write implementation
echo "Looking for masked write implementation..."
rg "storechunk_mask|masked.*write|write.*mask" -A 5

Length of output: 58025

CMakeLists.txt (1)

984-987: LGTM! The AVX512 chunkset integration looks good.

The changes properly integrate the new AVX512 chunkset optimization by:

Adding the source file to AVX512_SRCS
Including appropriate feature documentation
Following the established pattern for AVX512 optimizations

Let's verify the AVX512 chunkset integration:

✅ Verification successful

AVX512 chunkset integration is properly implemented and verified

The verification confirms:

The source file exists and contains proper AVX512-specific implementations
Correct compilation flags (AVX512FLAG, NOLTOFLAG) are set in Makefile.in
The implementation follows the template pattern using chunkset_tpl.h
Integration with the existing AVX512 optimizations is maintained

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the AVX512 chunkset source file exists and check its dependencies

# Test 1: Verify the source file exists
fd --type f "chunkset_avx512.c" arch/x86

# Test 2: Check for potential dependencies in the implementation
rg -l "chunkset.*avx512" arch/x86

Length of output: 151

Script:

#!/bin/bash
# Let's verify the implementation details and dependencies

# Check the content of chunkset_avx512.c for proper AVX512 intrinsics usage
rg -A 5 "avx512" arch/x86/chunkset_avx512.c

# Check Makefile.in for proper compilation flags and integration
rg "chunkset.*avx512" arch/x86/Makefile.in

# Check for any header dependencies
rg -l "include.*avx512" arch/x86/

Length of output: 933

codecov · 2024-10-31T01:19:32Z

Codecov Report

Attention: Patch coverage is 0.92593% with 107 lines in your changes missing coverage. Please review.

Project coverage is 32.18%. Comparing base (94aacd8) to head (ce5fb26).
Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
arch/x86/chunkset_avx512.c	0.00%	93 Missing ⚠️
chunkset_tpl.h	0.00%	8 Missing ⚠️
functable.c	0.00%	3 Missing ⚠️
inffast_tpl.h	0.00%	1 Missing and 1 partial ⚠️
arch/x86/x86_features.c	50.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1805      +/-   ##
===========================================
- Coverage    32.62%   32.18%   -0.45%     
===========================================
  Files           66       67       +1     
  Lines         5652     5752     +100     
  Branches      1228     1237       +9     
===========================================
+ Hits          1844     1851       +7     
- Misses        3555     3644      +89     
- Partials       253      257       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

Dead2 · 2024-11-09T00:18:21Z

I cannot test or benchmark this as I don't have access to any AVX512 hardware.
Anyone else up for the task? @nmoinvaz?

nmoinvaz · 2024-11-09T00:20:32Z

I could test using deflatebench but not sure if that is the intended use case for these improvements or if it is more png related.

Dead2 · 2024-11-09T00:22:26Z

I could test using deflatebench but not sure if that is the intended use case for these improvements or if it is more png related.

Well, it should hopefully not be a lot slower than before with deflatebench, so that is valuable data in any case.

KungFuJesus · 2024-11-09T01:59:57Z

In my testing it improved everything, generally. Compiling with -O3 helped png decoding but hurt gzip decoding, but with -O2 it still saw an improvement.

nmoinvaz · 2024-11-14T02:14:12Z

Test data: silesia.tar

DEVELOP `94aacd8`

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop/Release/minigzip.exe Size: 107,008 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.956/1.095/1.297/0.101        0.433/0.505/0.579/0.043       94,127,497
 2     35.519%      1.485/1.698/1.988/0.167        0.424/0.494/0.565/0.047       75,286,322
 3     33.844%      1.993/2.280/2.721/0.215        0.419/0.474/0.550/0.043       71,735,218
 4     33.146%      2.270/2.616/3.120/0.244        0.405/0.468/0.550/0.043       70,255,223
 5     32.642%      2.526/2.849/3.286/0.227        0.406/0.467/0.539/0.043       69,187,419
 6     32.483%      2.978/3.341/3.903/0.287        0.386/0.456/0.524/0.044       68,850,776
 7     32.255%      3.922/4.407/5.147/0.366        0.398/0.456/0.523/0.044       68,366,759
 8     32.167%      6.196/6.883/7.991/0.521        0.396/0.454/0.511/0.040       68,180,762
 9     31.887%      6.822/7.609/8.836/0.632        0.394/0.450/0.519/0.042       67,586,442

 avg1  34.261%                        3.642                          0.469
 tot                               1311.124                        168.990      653,576,418

PR `ce5fb26`

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop-jesus/Release/minigzip.exe Size: 110,080 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.940/0.965/0.987/0.012        0.432/0.442/0.455/0.006       94,127,497
 2     35.519%      1.457/1.497/1.532/0.023        0.417/0.433/0.448/0.009       75,286,322
 3     33.844%      1.960/2.012/2.070/0.032        0.400/0.412/0.420/0.006       71,735,218
 4     33.146%      2.256/2.306/2.367/0.037        0.393/0.409/0.420/0.007       70,255,223
 5     32.642%      2.476/2.537/2.592/0.040        0.384/0.401/0.412/0.006       69,187,419
 6     32.483%      2.914/2.993/3.064/0.048        0.384/0.401/0.411/0.006       68,850,776
 7     32.255%      3.903/3.980/4.077/0.058        0.384/0.402/0.414/0.008       68,366,759
 8     32.167%      6.162/6.259/6.384/0.069        0.394/0.403/0.411/0.005       68,180,762
 9     31.887%      6.783/6.898/7.058/0.092        0.378/0.393/0.402/0.006       67,586,442

 avg1  34.261%                        3.272                          0.411
 tot                               1177.830                        147.866      653,576,418

nmoinvaz · 2024-11-19T03:48:54Z

I re-ran the test. I re-build both binaries.

DEVELOP

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop/Release/minigzip.exe Size: 107,008 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.849/0.864/0.871/0.005        0.374/0.382/0.387/0.004       94,127,497
 2     35.519%      1.304/1.329/1.338/0.007        0.369/0.377/0.381/0.003       75,286,322
 3     33.844%      1.768/1.794/1.804/0.009        0.353/0.360/0.365/0.004       71,735,218
 4     33.146%      2.024/2.040/2.049/0.007        0.345/0.351/0.357/0.004       70,255,223
 5     32.642%      2.214/2.245/2.258/0.010        0.342/0.348/0.353/0.004       69,187,419
 6     32.483%      2.602/2.620/2.629/0.007        0.338/0.345/0.350/0.003       68,850,776
 7     32.255%      3.435/3.447/3.455/0.005        0.341/0.345/0.350/0.002       68,366,759
 8     32.167%      5.386/5.399/5.410/0.006        0.339/0.343/0.345/0.001       68,180,762
 9     31.887%      5.932/5.947/5.960/0.008        0.332/0.337/0.340/0.002       67,586,442

 avg1  34.261%                        2.854                          0.354
 tot                               1027.412                        127.566      653,576,418

PR

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop-jesus/Release/minigzip.exe Size: 110,080 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.842/0.858/0.863/0.005        0.379/0.387/0.390/0.002       94,127,497
 2     35.519%      1.311/1.333/1.342/0.007        0.372/0.382/0.385/0.003       75,286,322
 3     33.844%      1.764/1.785/1.793/0.007        0.358/0.366/0.369/0.003       71,735,218
 4     33.146%      1.998/2.030/2.039/0.010        0.349/0.357/0.361/0.003       70,255,223
 5     32.642%      2.188/2.231/2.243/0.012        0.345/0.353/0.355/0.002       69,187,419
 6     32.483%      2.553/2.602/2.612/0.013        0.344/0.351/0.354/0.002       68,850,776
 7     32.255%      3.393/3.446/3.456/0.016        0.343/0.351/0.353/0.002       68,366,759
 8     32.167%      5.309/5.390/5.404/0.026        0.343/0.351/0.353/0.002       68,180,762
 9     31.887%      5.839/5.933/5.953/0.027        0.339/0.344/0.346/0.002       67,586,442

 avg1  34.261%                        2.845                          0.360
 tot                               1024.356                        129.698      653,576,418

Perhaps I did something wrong in the first test or my machine performed that much better at some point in the first test.

nmoinvaz · 2024-11-19T06:38:49Z

He is another run:

DEVELOP

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop/Release/minigzip.exe Size: 107,008 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.837/0.859/0.864/0.006        0.382/0.386/0.388/0.002       94,127,497
 2     35.519%      1.281/1.321/1.330/0.011        0.374/0.379/0.383/0.002       75,286,322
 3     33.844%      1.735/1.783/1.793/0.012        0.360/0.365/0.367/0.002       71,735,218
 4     33.146%      1.986/2.025/2.035/0.011        0.352/0.355/0.358/0.001       70,255,223
 5     32.642%      2.189/2.230/2.240/0.012        0.349/0.352/0.355/0.002       69,187,419
 6     32.483%      2.577/2.602/2.609/0.007        0.346/0.350/0.352/0.001       68,850,776
 7     32.255%      3.412/3.430/3.436/0.004        0.347/0.350/0.351/0.001       68,366,759
 8     32.167%      5.368/5.376/5.381/0.003        0.347/0.349/0.351/0.001       68,180,762
 9     31.887%      5.898/5.919/5.926/0.006        0.341/0.344/0.345/0.001       67,586,442

 avg1  34.261%                        2.839                          0.359
 tot                               1021.871                        129.187      653,576,418

PR

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop-jesus/Release/minigzip.exe Size: 110,080 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.847/0.853/0.857/0.002        0.382/0.386/0.388/0.002       94,127,497
 2     35.519%      1.317/1.325/1.330/0.003        0.375/0.380/0.383/0.002       75,286,322
 3     33.844%      1.767/1.776/1.781/0.003        0.360/0.364/0.367/0.002       71,735,218
 4     33.146%      2.007/2.018/2.021/0.003        0.351/0.355/0.358/0.002       70,255,223
 5     32.642%      2.209/2.217/2.222/0.004        0.347/0.351/0.354/0.002       69,187,419
 6     32.483%      2.578/2.590/2.596/0.004        0.347/0.349/0.351/0.001       68,850,776
 7     32.255%      3.428/3.439/3.444/0.003        0.347/0.349/0.350/0.001       68,366,759
 8     32.167%      5.378/5.385/5.390/0.003        0.347/0.348/0.350/0.001       68,180,762
 9     31.887%      5.914/5.922/5.926/0.003        0.340/0.342/0.343/0.001       67,586,442

 avg1  34.261%                        2.836                          0.358
 tot                               1020.973                        128.934      653,576,418

I don't know if my machine is sensitive enough.

Dead2 · 2024-11-19T16:12:28Z

The most important part is no new bugs encountered, and no big performance regression.
This looks good to me now.

KungFuJesus · 2024-11-23T17:10:11Z

On my Cascade Lake:

OS: Linux 6.11.6-gentoo-x86_64 #3 SMP PREEMPT_DYNAMIC Mon Nov  4 10:41:27 EST 2024 x86_64
CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
Tool: /home/adam/scratch/zlib-ng/build_preavx512/minigzip Size: 150,528 B
Levels: 0-9       
Runs: 70         Trim worst: 30        

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.010/0.017/0.020/0.002        0.013/0.019/0.021/0.002      211,973,953
 1     44.409%      0.841/0.852/0.858/0.004        0.358/0.366/0.372/0.004       94,127,497
 2     35.519%      1.422/1.436/1.441/0.004        0.343/0.358/0.363/0.005       75,286,322
 3     33.844%      1.771/1.782/1.788/0.005        0.329/0.339/0.344/0.004       71,735,218
 4     33.146%      2.086/2.094/2.100/0.004        0.323/0.330/0.336/0.003       70,255,223
 5     32.642%      2.399/2.416/2.422/0.005        0.318/0.327/0.334/0.004       69,187,419
 6     32.483%      2.919/2.930/2.936/0.005        0.312/0.324/0.330/0.004       68,850,776
 7     32.255%      4.100/4.115/4.124/0.007        0.314/0.325/0.331/0.005       68,366,759
 8     32.167%      6.544/6.571/6.596/0.018        0.305/0.327/0.333/0.006       68,180,762
 9     31.887%      7.178/7.189/7.200/0.007        0.305/0.317/0.323/0.005       67,586,442

 avg1  40.836%                        2.940                          0.303
 avg2  45.373%                        3.267                          0.337
 tot                               1176.082                        121.288      865,550,371


OS: Linux 6.11.6-gentoo-x86_64 #3 SMP PREEMPT_DYNAMIC Mon Nov  4 10:41:27 EST 2024 x86_64
CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
Tool: /home/adam/scratch/zlib-ng/build_postavx512/minigzip Size: 154,848 B
Levels: 0-9       
Runs: 70         Trim worst: 30        

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.011/0.016/0.018/0.002        0.011/0.018/0.021/0.002      211,973,953
 1     44.409%      0.842/0.853/0.858/0.004        0.313/0.330/0.336/0.004       94,127,497
 2     35.519%      1.423/1.436/1.442/0.004        0.316/0.326/0.331/0.004       75,286,322
 3     33.844%      1.771/1.782/1.787/0.004        0.298/0.308/0.312/0.003       71,735,218
 4     33.146%      2.086/2.095/2.100/0.004        0.282/0.301/0.307/0.006       70,255,223
 5     32.642%      2.399/2.408/2.413/0.004        0.289/0.298/0.305/0.004       69,187,419
 6     32.483%      2.903/2.920/2.926/0.005        0.286/0.297/0.303/0.005       68,850,776
 7     32.255%      4.096/4.108/4.116/0.005        0.287/0.299/0.307/0.005       68,366,759
 8     32.167%      6.531/6.553/6.584/0.015        0.289/0.301/0.307/0.005       68,180,762
 9     31.887%      7.152/7.176/7.187/0.007        0.276/0.288/0.294/0.004       67,586,442

 avg1  40.836%                        2.935                          0.277
 avg2  45.373%                        3.261                          0.307
 tot                               1173.919                        110.632      865,550,371

KungFuJesus · 2024-11-23T17:33:34Z

I'm somewhat curious to try a 64 byte chunk implementation, at least on Zen5 and Sapphire Rapids. I wonder if it's finally worth the squeeze.

Somewhat unfortunate is the fact that aarch64 does not offer a way to do this but maybe RISC-V might? Possibly SVE could also be used depending on configurable vector lengths. I guess it just depends on the setup time for that.

nmoinvaz · 2025-03-06T00:28:09Z

This PR caused a decompression bug in our main project.

KungFuJesus · 2025-03-06T00:32:48Z

This PR caused a decompression bug in our main project.

Can you give you a quick reproduction of this?

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

inffast_tpl.h Show resolved Hide resolved

arch/x86/chunkset_avx512.c Show resolved Hide resolved

arch/x86/chunkset_avx512.c Show resolved Hide resolved

arch/x86/chunkset_avx512.c Show resolved Hide resolved

arch/x86/chunkset_avx512.c Show resolved Hide resolved

KungFuJesus force-pushed the avx512_chunking_experiments branch from 05a920f to 375af7f Compare October 28, 2024 22:47

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

KungFuJesus force-pushed the avx512_chunking_experiments branch 2 times, most recently from 51e7081 to 767c58a Compare October 29, 2024 02:19

coderabbitai bot reviewed Oct 29, 2024

View reviewed changes

phprus reviewed Oct 30, 2024

View reviewed changes

functable.c Show resolved Hide resolved

KungFuJesus force-pushed the avx512_chunking_experiments branch from 767c58a to ce5fb26 Compare October 31, 2024 00:43

coderabbitai bot reviewed Oct 31, 2024

View reviewed changes

Dead2 added the Needs testing Please help test this label Nov 9, 2024

Dead2 removed the Needs testing Please help test this label Nov 19, 2024

Dead2 approved these changes Nov 19, 2024

View reviewed changes

Dead2 merged commit 0ed5ac8 into zlib-ng:develop Nov 20, 2024
142 of 150 checks passed

This was referenced Nov 27, 2024

add env variable to control dfltcc usage #1813

Draft

Enable AVX2 functions to be built with BMI2 instructions #1816

Merged

Dead2 mentioned this pull request Dec 31, 2024

2.2.3 Release #1838

Merged

coderabbitai bot mentioned this pull request Feb 5, 2025

Fix shift overflow in inflate and send_code. #1859

Merged

nmoinvaz mentioned this pull request Mar 6, 2025

AVX512 decompression bug in 2.2.3 or later #1883

Closed

coderabbitai bot mentioned this pull request Jul 22, 2025

Disable NEON workaround on Clang 20 and above, and enable it for non-mobile platforms #1942

Merged

This was referenced Jul 29, 2025

X86_AVX512VNNI: test _mm256_dpbusd_epi32 too #1944

Merged

cmake: use bracket strings #1945

Draft

coderabbitai bot mentioned this pull request Aug 22, 2025

Inline read_buf and flush_pending #1952

Merged

This was referenced Sep 27, 2025

[DRAFT] Add support for trimming down the amount of unused arch functions #1972

Draft

Inline the CHUNKSIZE function #1974

Merged

Uh oh!

Make an AVX512 inflate fast with low cost masked writes #1805

Make an AVX512 inflate fast with low cost masked writes #1805

Uh oh!

Conversation

KungFuJesus commented Oct 28, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Oct 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Dead2 commented Nov 9, 2024

Uh oh!

nmoinvaz commented Nov 9, 2024

Uh oh!

Dead2 commented Nov 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KungFuJesus commented Nov 9, 2024

Uh oh!

nmoinvaz commented Nov 14, 2024

DEVELOP 94aacd8

PR ce5fb26

Uh oh!

nmoinvaz commented Nov 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

DEVELOP

PR

Uh oh!

nmoinvaz commented Nov 19, 2024

DEVELOP

PR

Uh oh!

Dead2 commented Nov 19, 2024

Uh oh!

Uh oh!

KungFuJesus commented Nov 23, 2024

Uh oh!

KungFuJesus commented Nov 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nmoinvaz commented Mar 6, 2025

Uh oh!

KungFuJesus commented Mar 6, 2025

KungFuJesus commented Oct 28, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 28, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov bot commented Oct 31, 2024 •

edited

Loading

Dead2 commented Nov 9, 2024 •

edited

Loading

DEVELOP `94aacd8`

PR `ce5fb26`

nmoinvaz commented Nov 19, 2024 •

edited

Loading

KungFuJesus commented Nov 23, 2024 •

edited

Loading