Remove UNALIGNED_OK checks #1828
Conversation
Walkthrough
This pull request introduces significant changes to the zlib library's memory handling and optimization strategies. The primary focus is on refining unaligned memory access and removing the UNALIGNED_OK checks.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Build System
    participant Compiler
    participant Memory Comparison Functions
    participant Sanitizer Runtime
    Build System->>Compiler: Set OPTIMAL_CMP based on architecture
    Compiler->>Memory Comparison Functions: Conditionally compile unaligned functions
    Memory Comparison Functions->>Sanitizer Runtime: Instrument memory access
    Sanitizer Runtime->>Memory Comparison Functions: Validate memory operations
```
The sequence diagram illustrates how the build system configures the compiler to select appropriate memory comparison functions based on the target architecture's OPTIMAL_CMP setting.
Actionable comments posted: 1
🧹 Nitpick comments (5)
chunkset_tpl.h (2)
230-234: Consider handling the OPTIMAL_CMP >= 64 case for the alignment mask
The alignment mask is defined for OPTIMAL_CMP < 32 and == 32, but there's no explicit handling for OPTIMAL_CMP >= 64. Consider whether a different alignment mask would be beneficial for 64-bit optimized architectures.
```diff
 #if OPTIMAL_CMP < 32
     static const uint32_t align_mask = 7;
 #elif OPTIMAL_CMP == 32
     static const uint32_t align_mask = 3;
+#elif OPTIMAL_CMP >= 64
+    static const uint32_t align_mask = 1;
 #endif
```
Line range hint 238-242: Consider vectorized alignment for OPTIMAL_CMP >= 64
Based on the learnings about performance in critical code sections, consider using vectorized operations for alignment when OPTIMAL_CMP >= 64 instead of byte-by-byte copying.
match_tpl.h (1)
Line range hint 141-166: Consider refactoring nested optimization conditions
While the logic is correct, the nested conditions for different optimization levels make the code harder to maintain. Consider extracting the comparison logic into separate functions for each optimization level.
Example structure:
```c
static inline bool compare_match_16(/* params */) {
    return mbase_end[cur_match] == scan_end[0] &&
           mbase_end[cur_match+1] == scan_end[1] &&
           mbase_start[cur_match] == scan[0] &&
           mbase_start[cur_match+1] == scan[1];
}
static inline bool compare_match_32(/* params */) {
    return zng_memcmp_4(mbase_end+cur_match, scan_end) == 0 &&
           zng_memcmp_4(mbase_start+cur_match, scan_start) == 0;
}
// ... similar for 64-bit
```
zbuild.h (1)
270-293: LGTM! Consider adding documentation for size choices.The architecture-specific OPTIMAL_CMP definitions look good and align well with the PR's objective of removing UNALIGNED_OK checks. The implementation properly handles different architectures and maintains backward compatibility through NO_UNALIGNED.
Consider adding comments explaining the rationale for choosing specific sizes (64/32/8) for each architecture to help future maintainers understand the performance implications.
arch/generic/generic_functions.h (1)
Line range hint 31-76: Excellent architectural improvements for handling unaligned access
The transition from UNALIGNED_OK to OPTIMAL_CMP provides several benefits:
- Better granularity in selecting optimal implementations based on architecture capabilities
- Safer handling of unaligned memory access through explicit size thresholds
- Clear separation between feature detection (CTZ/CTZLL) and performance optimization (OPTIMAL_CMP)
This change resolves the issue where distributions might disable safe unaligned reads due to confusion with zlib's potentially unsafe UNALIGNED_OK.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/cmake.yml (1 hunks)
- CMakeLists.txt (0 hunks)
- README.md (1 hunks)
- arch/generic/compare256_c.c (2 hunks)
- arch/generic/generic_functions.h (2 hunks)
- chunkset_tpl.h (1 hunks)
- cmake/detect-sanitizer.cmake (1 hunks)
- compare256_rle.h (2 hunks)
- configure (0 hunks)
- deflate_rle.c (1 hunks)
- insert_string_tpl.h (1 hunks)
- match_tpl.h (4 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
- configure
- CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- README.md
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🔇 Additional comments (20)
insert_string_tpl.h (1)
32-38: LGTM! The simplified HASH_CALC_READ macro improves clarity and safety.
The changes align well with the PR objectives by:
- Using memcpy consistently for safe unaligned reads
- Properly handling endianness without unnecessary complexity
- Removing the redundant UNALIGNED_OK checks
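For reference, a minimal sketch of the memcpy-based unaligned read pattern described above (illustrative only, not the actual zlib-ng macro; the helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Reads 4 bytes from a possibly unaligned pointer. Compilers recognize this
 * pattern and emit a single load instruction on architectures that allow
 * unaligned access, without invoking undefined behavior. */
static inline uint32_t load_u32_unaligned(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```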
Let's verify the impact across different architectures:
✅ Verification successful
The simplified HASH_CALC_READ macro is safe and consistent with the codebase architecture
The verification confirms that:
- The change aligns with the codebase's handling of unaligned access across architectures
- memcpy is consistently used for safe unaligned access throughout the codebase, including in architecture-specific implementations
- The UNALIGNED_OK definitions in zbuild.h are only used for feature detection and don't affect the actual implementation anymore
- The changes in insert_string_tpl.h are used by both generic (insert_string.c) and optimized (insert_string_roll.c) implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for architecture-specific implementations and potential performance impact
# Look for any remaining UNALIGNED_OK references or architecture-specific optimizations
# Search for any remaining UNALIGNED_OK checks
echo "Checking for remaining UNALIGNED_OK references:"
rg "UNALIGNED_OK"
# Look for architecture-specific implementations that might be affected
echo "Checking for architecture-specific implementations:"
fd -e h -e c | xargs rg -l "__(i386|x86_64|arm|aarch64|ppc|sparc)__"
Length of output: 661
Script:
#!/bin/bash
# Let's check the content of zbuild.h to understand the UNALIGNED_OK definitions
echo "=== Checking zbuild.h for UNALIGNED_OK context ==="
rg -A 5 -B 5 "UNALIGNED_OK" zbuild.h
# Let's also check if there are any other uses of memcpy for unaligned access
echo -e "\n=== Checking for similar memcpy patterns in architecture-specific files ==="
rg "memcpy.*sizeof" arch/
# Check the context of insert_string_tpl.h usage
echo -e "\n=== Checking where insert_string_tpl.h is included ==="
rg -l "insert_string_tpl\.h"
Length of output: 3454
cmake/detect-sanitizer.cmake (1)
114-114: LGTM: Alignment sanitizer check addition is appropriate
Adding the alignment check to UBSAN's known_checks is the correct approach, as it replaces the old UNALIGNED_OK checks with proper sanitizer-based validation. This change ensures that any unsafe unaligned memory access will be caught by the sanitizer.
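As a rough illustration of what the sanitizer now guards against (hypothetical snippet, not code from this PR): a cast-and-dereference of a misaligned pointer is flagged by -fsanitize=alignment, while the memcpy form used throughout zlib-ng is not.

```c
#include <stdint.h>
#include <string.h>

uint32_t read_both_ways(const unsigned char *p) {
    /* Flagged by UBSan's alignment check whenever p is not 4-byte aligned. */
    uint32_t direct = *(const uint32_t *)p;

    /* Never flagged: memcpy has no alignment requirement, and optimizing
     * compilers still lower it to a plain load where that is legal. */
    uint32_t via_memcpy;
    memcpy(&via_memcpy, p, sizeof(via_memcpy));

    return direct ^ via_memcpy;
}
```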
.github/workflows/cmake.yml (2)
84-88: LGTM: CI configuration properly updated
The job rename and removal of -DWITH_UNALIGNED=OFF correctly reflects the architectural change of making unaligned access checks unconditional through sanitizers rather than compile-time flags.
84-88: Verify removal of WITH_UNALIGNED flag across all jobs
Let's ensure consistency across all CI jobs.
✅ Verification successful
After reviewing the entire workflow file, I can see that there is no usage of WITH_UNALIGNED flag in any of the jobs. Since the review comment was concerned about checking for any remaining WITH_UNALIGNED flags across all jobs in the workflow, and we've confirmed there are none, I can generate the final response:
No remaining WITH_UNALIGNED flags found in workflow
The workflow file has been checked and there are no instances of the WITH_UNALIGNED flag being used in any of the CI jobs.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining WITH_UNALIGNED flags in the workflow
rg "WITH_UNALIGNED" .github/workflows/
Length of output: 38
Script:
#!/bin/bash
# Let's check the entire workflow file content to understand the context better
cat .github/workflows/cmake.yml
Length of output: 32965
arch/generic/compare256_c.c (2)
60-60: Conditional compilation updated to use OPTIMAL_CMP
The preprocessor check now uses OPTIMAL_CMP >= 32 instead of UNALIGNED_OK, which aligns with the objective to remove UNALIGNED_OK checks and simplifies optimization control.
141-142: Appropriate update for 64-bit comparison conditional
The condition has been correctly modified to #if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64, removing the redundant UNALIGNED64_OK check and ensuring consistency with the new optimization strategy.
test/test_compare256_rle.cc (1)
55-62: Test conditions updated to reflect codebase changes
The conditional directives for the RLE compare tests have been appropriately updated to use BYTE_ORDER == LITTLE_ENDIAN and OPTIMAL_CMP thresholds, ensuring that the tests align with the modified implementation.
test/benchmarks/benchmark_compare256_rle.cc (1)
64-71: Benchmark conditionals adjusted to new optimization parameters
The updates to the preprocessor conditions now correctly depend on OPTIMAL_CMP and built-in function availability, which ensures that benchmarks are compiled when appropriate and aligns with the removal of UNALIGNED_OK checks.
test/test_compare256.cc (1)
68-77: Conditional directives in tests updated to match new criteria
The conditional compilation for the compare256 tests now relies on OPTIMAL_CMP and checks for built-in functions, which is consistent with the updated codebase and the removal of UNALIGNED_OK checks.
deflate_rle.c (1)
Line range hint 13-22: LGTM! Well-structured function selection hierarchy.
The migration from UNALIGNED_OK to OPTIMAL_CMP provides a clear progression of optimizations:
- 64-bit comparisons when OPTIMAL_CMP >= 64 and CTZLL is available
- 32-bit comparisons when OPTIMAL_CMP >= 32 and CTZ is available
- 16-bit comparisons as a fallback when OPTIMAL_CMP >= 32
- Byte-by-byte comparison for OPTIMAL_CMP < 32
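A hedged sketch of that cascade as a preprocessor selection block (the real macro and function names in deflate_rle.c may differ; these are illustrative):

```c
#if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64
#  define compare_rle compare_rle_unaligned_64   /* 8-byte compares + ctzll */
#elif defined(HAVE_BUILTIN_CTZ) && OPTIMAL_CMP >= 32
#  define compare_rle compare_rle_unaligned_32   /* 4-byte compares + ctz */
#elif OPTIMAL_CMP >= 32
#  define compare_rle compare_rle_unaligned_16   /* 2-byte compares */
#else
#  define compare_rle compare_rle_c              /* byte-by-byte fallback */
#endif
```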
match_tpl.h (2)
62-70: Well-structured offset calculations for different optimization levels
The offset calculations are properly adjusted based on both OPTIMAL_CMP values and data sizes, ensuring optimal matching behavior for different architectures.
189-198: Consistent handling of best length updates
The offset adjustments and optimization level checks maintain consistency with the initial calculations, ensuring symmetric behavior throughout the matching process.
zbuild.h (2)
Line range hint 295-371: LGTM! Well-structured sanitizer support.
The sanitizer implementation is robust with proper compiler detection, clean fallbacks, and well-designed instrumentation macros. The code handles both C and C++ compilation contexts appropriately.
Let's verify the sanitizer integration:
✅ Verification successful
Let me check the CMakeLists.txt files to understand how the sanitizers are integrated into the build system.
Let me check the sanitizer detection and configuration in the cmake directory to complete the verification.
Sanitizer support is properly integrated and used
The verification confirms that the sanitizer implementation is well-integrated:
- Build system properly supports Address, Memory, Thread, and Undefined Behavior sanitizers through CMake options
- Sanitizer detection is robust with comprehensive compiler flag checks
- Instrumentation macros are actively used in the codebase, particularly in the s390 architecture implementation
- CMake configuration includes proper compiler flags and build options for each sanitizer type
- Memory Sanitizer integration includes special handling for Google Test with LLVM instrumented libraries
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if sanitizers are properly integrated in the build system
# and verify their usage across the codebase
# Check build system integration
echo "Checking build system integration..."
rg -l "ASAN|MSAN|sanitizer" CMakeLists.txt *.cmake
# Check usage of instrumentation macros
echo "Checking instrumentation macro usage..."
rg "instrument_(read|write)" --type c --type cpp --type h
# Look for existing sanitizer-related tests
echo "Checking for sanitizer tests..."
rg -l "ASAN|MSAN|sanitizer" test/
Length of output: 1424
Script:
#!/bin/bash
# Check CMakeLists.txt content for sanitizer configuration
echo "Main CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 CMakeLists.txt
echo -e "\nTest CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 test/CMakeLists.txt
# Check if there are any other sanitizer-related compiler flags
echo -e "\nCompiler flags related to sanitizers:"
rg "CMAKE_C_FLAGS|CMAKE_CXX_FLAGS" -A 1 -B 1 CMakeLists.txt test/CMakeLists.txt
Length of output: 3070
Script:
#!/bin/bash
# Check sanitizer detection and configuration
echo "Sanitizer detection configuration:"
cat cmake/detect-sanitizer.cmake
# Check if there are any other sanitizer-related files
echo -e "\nOther sanitizer-related files in cmake directory:"
fd sanitizer cmake/
Length of output: 5616
270-293: Verify performance impact across architectures.
Let's verify that the chosen OPTIMAL_CMP sizes provide optimal performance for each architecture.
✅ Verification successful
OPTIMAL_CMP values are properly optimized for architecture-specific performance
Based on the comprehensive analysis of the codebase:
The OPTIMAL_CMP values are carefully chosen based on:
- Architecture word sizes (64-bit for x86_64, aarch64, ppc64le)
- Hardware unaligned access capabilities (ARM with __ARM_FEATURE_UNALIGNED)
- Fallback to smaller sizes when unaligned access isn't supported
- Safe default of 8 bytes when architecture-specific optimization isn't available
The implementation validates these choices through:
- Comprehensive benchmarking suite (test/benchmarks/benchmark_compare256.cc)
- Architecture-specific optimized implementations (compare256_unaligned_16/32/64)
- Performance-focused code comments indicating careful optimization
- Conditional compilation ensuring the most efficient method is selected
The code shows evidence of performance tuning:
- Comments indicate optimization for STD_MAX_MATCH-2 multiple of 16
- Early exit optimizations for better performance
- Careful handling of unaligned access for each architecture
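An illustrative sketch of how such architecture-based OPTIMAL_CMP defaults can be expressed (not the verbatim zbuild.h block; the exact macro chain in the PR may differ):

```c
#if defined(NO_UNALIGNED)
#  define OPTIMAL_CMP 8      /* explicit opt-out: byte-at-a-time compares */
#elif defined(__x86_64__) || defined(_M_X64) || \
      defined(__aarch64__) || defined(_M_ARM64)
#  define OPTIMAL_CMP 64     /* fast unaligned 8-byte access */
#elif defined(__arm__) && defined(__ARM_FEATURE_UNALIGNED)
#  define OPTIMAL_CMP 32     /* 32-bit ARM with hardware unaligned support */
#else
#  define OPTIMAL_CMP 8      /* safe default */
#endif
```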
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any performance regression tests or benchmarks in the codebase
# that might validate the OPTIMAL_CMP size choices.
# Look for performance tests
echo "Searching for performance tests..."
rg -l "benchmark|perf_test|performance" --type c --type cpp --type h
# Look for related discussions or documentation
echo "Searching for related documentation..."
rg -l "OPTIMAL_CMP|comparison size|unaligned" README.md docs/
Length of output: 874
Script:
#!/bin/bash
# Let's examine the benchmark files and implementation details
# Check the benchmark implementation for compare256
echo "=== Checking compare256 benchmark implementation ==="
cat test/benchmarks/benchmark_compare256.cc
# Check if there are any comments or documentation about OPTIMAL_CMP in README
echo -e "\n=== Checking README for OPTIMAL_CMP documentation ==="
cat README.md | grep -A 5 -B 5 "OPTIMAL_CMP\|unaligned"
# Look for any configuration or build system tests related to OPTIMAL_CMP
echo -e "\n=== Checking for build system tests ==="
rg "OPTIMAL_CMP" configure
Length of output: 3893
Script:
#!/bin/bash
# Let's examine the actual implementation and usage of compare256 functions
# Check the compare256 implementation file
echo "=== Checking compare256 implementations ==="
rg -p "compare256_unaligned|compare256_c" --type c --type h -A 10
# Check where OPTIMAL_CMP is actually used
echo -e "\n=== Checking OPTIMAL_CMP usage ==="
rg "OPTIMAL_CMP" --type c --type h -A 5
# Look for any performance-related comments near the implementations
echo -e "\n=== Checking for performance-related comments ==="
rg -p "Performance|performance|optimized for|OPTIMAL" --type c --type h -B 3 -A 3
Length of output: 24919
test/benchmarks/benchmark_compare256.cc (3)
69-71: LGTM: Appropriate condition for unaligned_16 benchmark
The change from UNALIGNED_OK to OPTIMAL_CMP >= 32 aligns with the PR objectives. The condition correctly ensures that the 16-bit unaligned comparison benchmark is only included when the architecture supports efficient unaligned access.
71-73: LGTM: Proper dependency check for unaligned_32 benchmark
The condition appropriately checks for the built-in CTZ (count trailing zeros) function, which is required for the 32-bit implementation.
74-76: LGTM: Correct conditions for unaligned_64 benchmark
The conditions properly combine:
- Check for 64-bit CTZ support (HAVE_BUILTIN_CTZLL)
- Verification that the optimal comparison size is at least 64 bits (OPTIMAL_CMP >= 64)
arch/generic/generic_functions.h (3)
31-32: LGTM: Proper conditional compilation for compare256 unaligned variants
The conditions for unaligned comparison functions are well-structured:
- Base requirement: Little-endian and OPTIMAL_CMP >= 32
- Additional requirements for 32-bit (HAVE_BUILTIN_CTZ) and 64-bit (HAVE_BUILTIN_CTZLL) variants
Also applies to: 34-34, 36-37
46-47: LGTM: Well-structured conditions for longest_match variants
The conditions mirror those of compare256 functions, maintaining consistency across the codebase. Both regular and slow variants are properly declared for each bit width.
Also applies to: 49-50, 52-52, 54-55
62-63: LGTM: Appropriate implementation selection logic
The generic implementation selection uses a cascading approach to choose the most optimal variant based on:
- Architecture endianness
- OPTIMAL_CMP threshold
- Available CPU features (CTZ/CTZLL)
Let's verify the implementation selection logic is comprehensive:
✅ Verification successful
Implementation selection logic is comprehensive and well-structured
The verification confirms that the implementation selection logic is complete and handles all possible combinations:
- For Little Endian with OPTIMAL_CMP >= 32:
- 64-bit optimized variants when HAVE_BUILTIN_CTZLL and OPTIMAL_CMP >= 64
- 32-bit optimized variants when HAVE_BUILTIN_CTZ
- 16-bit variants as fallback
- For all other cases (Big Endian or OPTIMAL_CMP < 32):
- Falls back to the standard C implementation
The code also properly handles runtime CPU detection through the DISABLE_RUNTIME_CPU_DETECTION flag, mapping the generic implementations to native functions.
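For illustration, the kind of mapping this implies when runtime detection is compiled out (names hypothetical; a sketch rather than the actual header):

```c
/* With runtime CPU detection disabled, the generic picks simply become the
 * "native" entry points; no function-pointer dispatch is needed. */
#ifdef DISABLE_RUNTIME_CPU_DETECTION
#  define native_compare256     compare256_generic
#  define native_longest_match  longest_match_generic
#endif
```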
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all possible combinations of conditions are handled
# Check for any potential gaps in implementation selection
rg -l "longest_match.*generic|compare256.*generic" | xargs rg "define.*longest_match.*generic|define.*compare256.*generic" -A 5 -B 5
Length of output: 2819
Force-pushed from 6a4e0ad to 8ff2fec
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/generic/compare256_c.c (1)
141-142: Implementation looks good with efficient bit operations
The implementation is safe and efficient:
- Uses memcpy for safe unaligned access
- Leverages CPU's count trailing zeros instruction through __builtin_ctzll
- Properly handles 64-bit to 32-bit conversion in return value
Consider adding bounds checking for better safety:
```diff
 diff = sv ^ mv;
 if (diff) {
     uint64_t match_byte = __builtin_ctzll(diff) / 8;
+    /* Ensure we don't return more than the current length */
+    if (match_byte > UINT32_MAX - len)
+        match_byte = UINT32_MAX - len;
     return len + (uint32_t)match_byte;
 }
```
Also applies to: 143-166
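A minimal, self-contained sketch of the 64-bit comparison step being discussed (assumes little-endian byte order, as the surrounding code does; not the exact zlib-ng function):

```c
#include <stdint.h>
#include <string.h>

/* Returns how many of the first 8 bytes of a and b match (0..8). */
static inline uint32_t match_len_64(const unsigned char *a, const unsigned char *b) {
    uint64_t av, bv, diff;
    memcpy(&av, a, sizeof(av));   /* safe unaligned loads */
    memcpy(&bv, b, sizeof(bv));
    diff = av ^ bv;
    if (diff == 0)
        return 8;
    /* On little-endian, the lowest set bit of diff lies in the first
     * differing byte, so ctzll/8 gives that byte's index. */
    return (uint32_t)(__builtin_ctzll(diff) >> 3);
}
```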
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/cmake.yml (1 hunks)
- CMakeLists.txt (0 hunks)
- README.md (1 hunks)
- arch/generic/compare256_c.c (2 hunks)
- arch/generic/generic_functions.h (2 hunks)
- chunkset_tpl.h (1 hunks)
- cmake/detect-sanitizer.cmake (1 hunks)
- compare256_rle.h (2 hunks)
- configure (0 hunks)
- deflate_rle.c (1 hunks)
- insert_string_tpl.h (1 hunks)
- match_tpl.h (4 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
- configure
- CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (11)
- README.md
- test/test_compare256_rle.cc
- test/benchmarks/benchmark_compare256_rle.cc
- insert_string_tpl.h
- compare256_rle.h
- match_tpl.h
- test/test_compare256.cc
- test/benchmarks/benchmark_compare256.cc
- deflate_rle.c
- .github/workflows/cmake.yml
- arch/generic/generic_functions.h
🔇 Additional comments (5)
cmake/detect-sanitizer.cmake (1)
114-114: LGTM! The addition of unconditional alignment checks aligns with PR objectives.
The unconditional inclusion of the alignment sanitizer check supports the PR's goal of ensuring consistently safe unaligned reads across all configurations.
Let's verify the impact on CI builds:
✅ Verification successful
LGTM! The alignment sanitizer check is properly integrated with the existing sanitizer framework.
The code inspection confirms that:
- The alignment check is correctly added to the known_checks list in the add_undefined_sanitizer macro
- The check is properly integrated with the existing sanitizer detection and configuration framework
- The CI workflow includes multiple jobs with undefined behavior sanitizer enabled, which will now include alignment checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if any CI workflows explicitly disable alignment sanitizer checks
# or rely on the old WITH_UNALIGNED behavior
# Search for alignment sanitizer configurations in workflow files
fd -e yml -e yaml . .github/workflows/ --exec rg -l 'sanitizer|UBSAN|WITH_UNALIGNED'
# Search for any remaining WITH_UNALIGNED references that might need updating
rg 'WITH_UNALIGNED'
Length of output: 134
Script:
#!/bin/bash
# Let's check the content of cmake.yml to understand the sanitizer configuration
cat .github/workflows/cmake.yml
# Also check for any remaining references to alignment checks or sanitizers in CMake files
rg -g '*.cmake' 'alignment|sanitizer|UBSAN' -A 3 -B 3
Length of output: 40058
zbuild.h (1)
246-269: LGTM! Architecture-specific optimal comparison sizes are well-defined.
The OPTIMAL_CMP macro definitions provide a robust framework for determining optimal comparison sizes based on architecture capabilities. The progression from 64-bit to 32-bit to 8-bit is logical and includes proper safeguards.
Let's verify the architecture detection logic:
✅ Verification successful
Let me analyze the usage of OPTIMAL_CMP in more detail to ensure it's being used correctly with the architecture-specific values.
Architecture-specific OPTIMAL_CMP values are correctly implemented and used
The verification confirms that:
- The OPTIMAL_CMP macro is consistently used across the codebase with proper size-based conditionals
- The implementation in match_tpl.h and chunkset_tpl.h correctly handles different comparison sizes (8, 32, and 64 bits)
- No conflicting architecture-specific defines exist in zbuild.h
- The old UNALIGNED_OK checks have been completely removed
- The code gracefully degrades from 64-bit to 32-bit to 8-bit operations based on architecture capabilities
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that OPTIMAL_CMP is used consistently across the codebase
# and that there are no remaining UNALIGNED_OK checks
# Check for any remaining UNALIGNED_OK references
rg "UNALIGNED_OK" --type c --type h
# Verify OPTIMAL_CMP usage
rg "OPTIMAL_CMP" --type c --type h
Length of output: 1733
Script:
#!/bin/bash
# Check the context of OPTIMAL_CMP usage in match_tpl.h and chunkset_tpl.h
# to verify proper handling of different sizes
# Check match_tpl.h implementation
rg "OPTIMAL_CMP" match_tpl.h -B 2 -A 2
# Check chunkset_tpl.h implementation
rg "OPTIMAL_CMP" chunkset_tpl.h -B 2 -A 2
# Check if there are any other architecture-specific defines that might conflict
rg "#if.*defined\(__" zbuild.h -A 2
Length of output: 2550
chunkset_tpl.h (1)
230-232: Alignment handling properly adapted for optimal comparison sizes.
The alignment mask and handling logic correctly adapt to different OPTIMAL_CMP values, ensuring proper memory access patterns while maintaining performance.
Also applies to: 238-242
arch/generic/compare256_c.c (2)
Line range hint 61-89: Implementation looks good with efficient partial match handling
The implementation is safe and efficient:
- Uses zng_memcmp_2 for safe unaligned access
- Optimizes for partial matches with (*src0 == *src1)
- Properly processes 2 bytes at a time
60-60: Verify OPTIMAL_CMP threshold for 16-bit operations
The condition OPTIMAL_CMP >= 32 seems high for enabling 16-bit operations. This might prevent the 16-bit implementation from being used on architectures where it could be beneficial.
✅ Verification successful
The OPTIMAL_CMP >= 32 condition is correct for 16-bit operations
The condition is appropriate because:
- OPTIMAL_CMP defines the optimal word size for unaligned memory access on the architecture
- When OPTIMAL_CMP >= 32, it indicates the architecture efficiently supports unaligned access of at least 32 bits, which means 16-bit operations are also safe and efficient
- The implementation is part of a progressive optimization strategy where architectures with better unaligned access capabilities get more optimized implementations (8-bit → 16-bit → 32-bit → 64-bit)
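A hedged sketch of a 2-byte comparison step in that spirit (zng_memcmp_2-style; the actual helper's name and signature in zlib-ng may differ):

```c
#include <stdint.h>
#include <string.h>

/* Compares two bytes at a time using memcpy so the loads stay safe even when
 * the pointers are unaligned; returns non-zero if the pair differs. */
static inline int memcmp_2_sketch(const void *a, const void *b) {
    uint16_t av, bv;
    memcpy(&av, a, sizeof(av));
    memcpy(&bv, b, sizeof(bv));
    return av != bv;
}
```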
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for OPTIMAL_CMP definitions and usage
echo "Searching for OPTIMAL_CMP definitions:"
rg -A 2 "define.*OPTIMAL_CMP"
echo -e "\nSearching for other 16-bit operations with similar conditions:"
rg "OPTIMAL_CMP.*>=.*32.*16"
Length of output: 3580
Script:
#!/bin/bash
# Let's check the implementation context and other similar conditions
echo "Checking the implementation context:"
rg -B 5 -A 10 "BYTE_ORDER == LITTLE_ENDIAN && OPTIMAL_CMP >= 32" arch/generic/compare256_c.c
echo -e "\nChecking for other OPTIMAL_CMP >= 32 conditions:"
rg "OPTIMAL_CMP >= 32"
echo -e "\nChecking for other 16-bit related implementations:"
rg -A 3 "memcmp_2|16.*bit"
Length of output: 105796
Script:
#!/bin/bash
# Let's check zbuild.h for OPTIMAL_CMP definitions and their context
echo "Checking OPTIMAL_CMP definitions in zbuild.h:"
rg -B 5 -A 5 "define.*OPTIMAL_CMP" zbuild.h
echo -e "\nChecking compare256 implementations:"
rg -B 2 -A 5 "compare256.*16" arch/generic/compare256_c.c
Length of output: 2130
KungFuJesus
left a comment
Looks good. The alignment sanitizer coming back clean is a good sign. We should definitely test this on something a little more exotic like a G5.
```cmake
# Group sanitizer flag -fsanitize=undefined will automatically add alignment, even if
# it is not in our sanitize flag list, so we need to explicitly disable alignment sanitizing.
if(WITH_UNALIGNED)
    add_compile_options(-fno-sanitize=alignment)
endif()
```
Awesome, hopefully by removing this bit we'll have some coverage for anything we've missed with CI.
Yes, looks like we already found an invalid alignment access at crc32_acle.c:52
Hah yep. Easy fix, hopefully it generates the same code like I'd expect it to.
```diff
-    val = ((uint32_t)(strstart[0])); \
-    val |= ((uint32_t)(strstart[1]) << 8); \
-    val |= ((uint32_t)(strstart[2]) << 16); \
-    val |= ((uint32_t)(strstart[3]) << 24);
+    memcpy(&val, strstart, sizeof(val)); \
+    val = ZSWAP32(val);
```
Looks good, I wonder if GCC was missing this obvious byteswap idiom. If so, this should actually end up being faster, as I think GCC will probably translate this to movbe on x86.
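A rough sketch of the idiom in question (illustrative; zlib-ng's ZSWAP32 and the exact macro body may differ): loading via memcpy and then byteswapping is a pattern compilers readily collapse into a single byte-reversed load such as movbe on x86.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value and byte-swap it. GCC/Clang
 * typically fold the memcpy into one load and the swap into bswap/movbe. */
static inline uint32_t load_u32_swapped(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return __builtin_bswap32(v);   /* stands in for ZSWAP32 */
}
```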
```diff
-#if !defined(UNALIGNED64_OK)
+#if OPTIMAL_CMP < 64
     while (((uintptr_t)out & align_mask) && (len > 0)) {
         *out++ = *from++;
         --len;
```
I kind of speculate this loop here doesn't actually need to exist. There may be some performance benefit in a niche case for keeping it if say, the ISA has an unaligned write penalty but otherwise permits arbitrary alignment. Then the compiler would likely generate simple writes that happen to be at word aligned offsets. Of course, there's likely going to be unaligned access in one way or another, either the read, the write, or both.
Maybe I should see if things are impacted on the G5, which will use the C implementations of these things, supports unaligned writes from GPRs, and I think has a performance penalty for doing so.
If we could leave out this loop on some of the architectures, that'd certainly be nice for performance.
This was added due to a hang on macOS: #1031
Now that we don't directly access anything unaligned anymore, I suspect it's not needed, but I don't have a macOS machine to test it with. Was this with aarch64 or x86?
I'm not certain. It looks like it was in a code coverage build.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
```
@@            Coverage Diff             @@
##           develop    #1828      +/-  ##
===========================================
+ Coverage    32.20%   32.30%   +0.10%
===========================================
  Files           67       67
  Lines         5745     5745
  Branches      1239     1239
===========================================
+ Hits          1850     1856       +6
  Misses        3637     3637
+ Partials       258      252       -6
```
☔ View full report in Codecov by Sentry.
nmoinvaz
left a comment
I wonder if we should get rid of the *_16 versions of compare since we don't ever really target a 16-bit platform.
So this is the performance on develop: [benchmark chart] And this is this PR, as is: [benchmark chart] So seemingly no performance penalty. The dip at level 9 for inflate in both cases is curious but I guess not too unexpected.
Here's after removing the aforementioned alignment loop: [benchmark chart] So, basically the same.
And removing the arbitrary requirement of little endian for 8-byte comparisons on deflate maybe marginally helped level 9 performance: [benchmark chart] It seems to have surprisingly hurt level 8 compression run time.
With #1830 the _16 variants are the fallbacks if we don't have the CTZ variants.
Since we long ago (since 2.1.0-beta1) made unaligned reads safe by using memcpy or intrinsics,
it is time to replace the UNALIGNED_OK checks that have since really only been used to select the
optimal comparison sizes for the arch instead.
Distros that are used to zlib's potentially unsafe UNALIGNED_OK have been disabling it for zlib-ng too. Unfortunately, this has effectively resulted in disabling the safe unaligned reads, with parts of the code falling back to the slowest byte-by-byte comparison methods.
This PR should clear up that easily-misunderstood naming.
In my tests on x86-64 w/AVX2, comparing develop with UNALIGNED_OK disabled vs this PR:
Compression: 5.82% faster
Decompression: 0.72% faster