Remove UNALIGNED_OK checks #1828
Conversation
Walkthrough
This pull request introduces significant changes to the zlib library's memory handling and optimization strategies. The primary focus is on refining unaligned memory access and removing the UNALIGNED_OK checks.
Sequence Diagram
```mermaid
sequenceDiagram
    participant Build System
    participant Compiler
    participant Memory Comparison Functions
    participant Sanitizer Runtime
    Build System->>Compiler: Set OPTIMAL_CMP based on architecture
    Compiler->>Memory Comparison Functions: Conditionally compile unaligned functions
    Memory Comparison Functions->>Sanitizer Runtime: Instrument memory access
    Sanitizer Runtime->>Memory Comparison Functions: Validate memory operations
```
The sequence diagram illustrates how the build system configures the compiler to select appropriate memory comparison functions based on the target architecture's OPTIMAL_CMP setting.
Actionable comments posted: 1
🧹 Nitpick comments (5)
chunkset_tpl.h (2)
230-234: Consider handling the OPTIMAL_CMP >= 64 case for the alignment mask
The alignment mask is defined for OPTIMAL_CMP < 32 and == 32, but there's no explicit handling for OPTIMAL_CMP >= 64. Consider whether a different alignment mask would be beneficial for 64-bit optimized architectures.
```diff
 #if OPTIMAL_CMP < 32
     static const uint32_t align_mask = 7;
 #elif OPTIMAL_CMP == 32
     static const uint32_t align_mask = 3;
+#elif OPTIMAL_CMP >= 64
+    static const uint32_t align_mask = 1;
 #endif
```
Line range hint 238-242: Consider vectorized alignment for OPTIMAL_CMP >= 64
Based on the learnings about performance in critical code sections, consider using vectorized operations for alignment when OPTIMAL_CMP >= 64 instead of byte-by-byte copying.
match_tpl.h (1)
Line range hint 141-166: Consider refactoring nested optimization conditions
While the logic is correct, the nested conditions for different optimization levels make the code harder to maintain. Consider extracting the comparison logic into separate functions for each optimization level.
Example structure:
```c
static inline bool compare_match_16(/* params */) {
    return mbase_end[cur_match] == scan_end[0] &&
           mbase_end[cur_match+1] == scan_end[1] &&
           mbase_start[cur_match] == scan[0] &&
           mbase_start[cur_match+1] == scan[1];
}
static inline bool compare_match_32(/* params */) {
    return zng_memcmp_4(mbase_end+cur_match, scan_end) == 0 &&
           zng_memcmp_4(mbase_start+cur_match, scan_start) == 0;
}
// ... similar for 64-bit
```
zbuild.h (1)
270-293: LGTM! Consider adding documentation for size choices.The architecture-specific OPTIMAL_CMP definitions look good and align well with the PR's objective of removing UNALIGNED_OK checks. The implementation properly handles different architectures and maintains backward compatibility through NO_UNALIGNED.
Consider adding comments explaining the rationale for choosing specific sizes (64/32/8) for each architecture to help future maintainers understand the performance implications.
arch/generic/generic_functions.h (1)
Line range hint 31-76: Excellent architectural improvements for handling unaligned access
The transition from UNALIGNED_OK to OPTIMAL_CMP provides several benefits:
- Better granularity in selecting optimal implementations based on architecture capabilities
- Safer handling of unaligned memory access through explicit size thresholds
- Clear separation between feature detection (CTZ/CTZLL) and performance optimization (OPTIMAL_CMP)
This change resolves the issue where distributions might disable safe unaligned reads due to confusion with zlib's potentially unsafe UNALIGNED_OK.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/cmake.yml (1 hunks)
- CMakeLists.txt (0 hunks)
- README.md (1 hunks)
- arch/generic/compare256_c.c (2 hunks)
- arch/generic/generic_functions.h (2 hunks)
- chunkset_tpl.h (1 hunks)
- cmake/detect-sanitizer.cmake (1 hunks)
- compare256_rle.h (2 hunks)
- configure (0 hunks)
- deflate_rle.c (1 hunks)
- insert_string_tpl.h (1 hunks)
- match_tpl.h (4 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
- configure
- CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
- README.md
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🔇 Additional comments (20)
insert_string_tpl.h (1)
32-38: LGTM! The simplified HASH_CALC_READ macro improves clarity and safety.
The changes align well with the PR objectives by:
- Using memcpy consistently for safe unaligned reads
- Properly handling endianness without unnecessary complexity
- Removing the redundant UNALIGNED_OK checks
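For reference, a minimal sketch of the memcpy-based unaligned read pattern described above (illustrative only, not the actual zlib-ng macro; the helper name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Reads 4 bytes from a possibly unaligned pointer. Compilers recognize this
 * pattern and emit a single load instruction on architectures that allow
 * unaligned access, without invoking undefined behavior. */
static inline uint32_t load_u32_unaligned(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```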
Let's verify the impact across different architectures:
✅ Verification successful
The simplified HASH_CALC_READ macro is safe and consistent with the codebase architecture
The verification confirms that:
- The change aligns with the codebase's handling of unaligned access across architectures
- memcpy is consistently used for safe unaligned access throughout the codebase, including in architecture-specific implementations
- The UNALIGNED_OK definitions in zbuild.h are only used for feature detection and don't affect the actual implementation anymore
- The changes in insert_string_tpl.h are used by both generic (insert_string.c) and optimized (insert_string_roll.c) implementations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for architecture-specific implementations and potential performance impact
# Look for any remaining UNALIGNED_OK references or architecture-specific optimizations
# Search for any remaining UNALIGNED_OK checks
echo "Checking for remaining UNALIGNED_OK references:"
rg "UNALIGNED_OK"
# Look for architecture-specific implementations that might be affected
echo "Checking for architecture-specific implementations:"
fd -e h -e c | xargs rg -l "__(i386|x86_64|arm|aarch64|ppc|sparc)__"
Length of output: 661
Script:
#!/bin/bash
# Let's check the content of zbuild.h to understand the UNALIGNED_OK definitions
echo "=== Checking zbuild.h for UNALIGNED_OK context ==="
rg -A 5 -B 5 "UNALIGNED_OK" zbuild.h
# Let's also check if there are any other uses of memcpy for unaligned access
echo -e "\n=== Checking for similar memcpy patterns in architecture-specific files ==="
rg "memcpy.*sizeof" arch/
# Check the context of insert_string_tpl.h usage
echo -e "\n=== Checking where insert_string_tpl.h is included ==="
rg -l "insert_string_tpl\.h"
Length of output: 3454
cmake/detect-sanitizer.cmake (1)
114-114: LGTM: Alignment sanitizer check addition is appropriate
Adding the alignment check to UBSAN's known_checks is the correct approach, as it replaces the old UNALIGNED_OK checks with proper sanitizer-based validation. This change ensures that any unsafe unaligned memory access will be caught by the sanitizer.
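As a rough illustration of what the sanitizer now guards against (hypothetical snippet, not code from this PR): a cast-and-dereference of a misaligned pointer is flagged by -fsanitize=alignment, while the memcpy form used throughout zlib-ng is not.

```c
#include <stdint.h>
#include <string.h>

uint32_t read_both_ways(const unsigned char *p) {
    /* Flagged by UBSan's alignment check whenever p is not 4-byte aligned. */
    uint32_t direct = *(const uint32_t *)p;

    /* Never flagged: memcpy has no alignment requirement, and optimizing
     * compilers still lower it to a plain load where that is legal. */
    uint32_t via_memcpy;
    memcpy(&via_memcpy, p, sizeof(via_memcpy));

    return direct ^ via_memcpy;
}
```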
.github/workflows/cmake.yml (2)
84-88: LGTM: CI configuration properly updated
The job rename and removal of -DWITH_UNALIGNED=OFF correctly reflects the architectural change of making unaligned access checks unconditional through sanitizers rather than compile-time flags.
84-88: Verify removal of WITH_UNALIGNED flag across all jobs
Let's ensure consistency across all CI jobs.
✅ Verification successful
After reviewing the entire workflow file, I can see that there is no usage of WITH_UNALIGNED flag in any of the jobs. Since the review comment was concerned about checking for any remaining WITH_UNALIGNED flags across all jobs in the workflow, and we've confirmed there are none, I can generate the final response:
No remaining WITH_UNALIGNED flags found in workflow
The workflow file has been checked and there are no instances of the WITH_UNALIGNED flag being used in any of the CI jobs.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any remaining WITH_UNALIGNED flags in the workflow
rg "WITH_UNALIGNED" .github/workflows/
Length of output: 38
Script:
#!/bin/bash
# Let's check the entire workflow file content to understand the context better
cat .github/workflows/cmake.yml
Length of output: 32965
arch/generic/compare256_c.c (2)
60-60: Conditional compilation updated to use OPTIMAL_CMP
The preprocessor check now uses OPTIMAL_CMP >= 32 instead of UNALIGNED_OK, which aligns with the objective to remove UNALIGNED_OK checks and simplifies optimization control.
141-142: Appropriate update for 64-bit comparison conditional
The condition has been correctly modified to #if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64, removing the redundant UNALIGNED64_OK check and ensuring consistency with the new optimization strategy.
test/test_compare256_rle.cc (1)
55-62: Test conditions updated to reflect codebase changes
The conditional directives for the RLE compare tests have been appropriately updated to use BYTE_ORDER == LITTLE_ENDIAN and OPTIMAL_CMP thresholds, ensuring that the tests align with the modified implementation.
test/benchmarks/benchmark_compare256_rle.cc (1)
64-71: Benchmark conditionals adjusted to new optimization parameters
The updates to the preprocessor conditions now correctly depend on OPTIMAL_CMP and built-in function availability, which ensures that benchmarks are compiled when appropriate and aligns with the removal of UNALIGNED_OK checks.
test/test_compare256.cc (1)
68-77: Conditional directives in tests updated to match new criteria
The conditional compilation for the compare256 tests now relies on OPTIMAL_CMP and checks for built-in functions, which is consistent with the updated codebase and the removal of UNALIGNED_OK checks.
deflate_rle.c (1)
Line range hint 13-22: LGTM! Well-structured function selection hierarchy.
The migration from UNALIGNED_OK to OPTIMAL_CMP provides a clear progression of optimizations:
- 64-bit comparisons when OPTIMAL_CMP >= 64 and CTZLL is available
- 32-bit comparisons when OPTIMAL_CMP >= 32 and CTZ is available
- 16-bit comparisons as a fallback when OPTIMAL_CMP >= 32
- Byte-by-byte comparison for OPTIMAL_CMP < 32
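A hedged sketch of that cascade as a preprocessor selection block (the real macro and function names in deflate_rle.c may differ; these are illustrative):

```c
#if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64
#  define compare_rle compare_rle_unaligned_64   /* 8-byte compares + ctzll */
#elif defined(HAVE_BUILTIN_CTZ) && OPTIMAL_CMP >= 32
#  define compare_rle compare_rle_unaligned_32   /* 4-byte compares + ctz */
#elif OPTIMAL_CMP >= 32
#  define compare_rle compare_rle_unaligned_16   /* 2-byte compares */
#else
#  define compare_rle compare_rle_c              /* byte-by-byte fallback */
#endif
```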
match_tpl.h (2)
62-70: Well-structured offset calculations for different optimization levels
The offset calculations are properly adjusted based on both OPTIMAL_CMP values and data sizes, ensuring optimal matching behavior for different architectures.
189-198: Consistent handling of best length updates
The offset adjustments and optimization level checks maintain consistency with the initial calculations, ensuring symmetric behavior throughout the matching process.
zbuild.h (2)
Line range hint 295-371: LGTM! Well-structured sanitizer support.
The sanitizer implementation is robust with proper compiler detection, clean fallbacks, and well-designed instrumentation macros. The code handles both C and C++ compilation contexts appropriately.
Let's verify the sanitizer integration:
✅ Verification successful
Let me check the CMakeLists.txt files to understand how the sanitizers are integrated into the build system.
Let me check the sanitizer detection and configuration in the cmake directory to complete the verification.
Sanitizer support is properly integrated and used
The verification confirms that the sanitizer implementation is well-integrated:
- Build system properly supports Address, Memory, Thread, and Undefined Behavior sanitizers through CMake options
- Sanitizer detection is robust with comprehensive compiler flag checks
- Instrumentation macros are actively used in the codebase, particularly in the s390 architecture implementation
- CMake configuration includes proper compiler flags and build options for each sanitizer type
- Memory Sanitizer integration includes special handling for Google Test with LLVM instrumented libraries
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if sanitizers are properly integrated in the build system
# and verify their usage across the codebase
# Check build system integration
echo "Checking build system integration..."
rg -l "ASAN|MSAN|sanitizer" CMakeLists.txt *.cmake
# Check usage of instrumentation macros
echo "Checking instrumentation macro usage..."
rg "instrument_(read|write)" --type c --type cpp --type h
# Look for existing sanitizer-related tests
echo "Checking for sanitizer tests..."
rg -l "ASAN|MSAN|sanitizer" test/
Length of output: 1424
Script:
#!/bin/bash
# Check CMakeLists.txt content for sanitizer configuration
echo "Main CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 CMakeLists.txt
echo -e "\nTest CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 test/CMakeLists.txt
# Check if there are any other sanitizer-related compiler flags
echo -e "\nCompiler flags related to sanitizers:"
rg "CMAKE_C_FLAGS|CMAKE_CXX_FLAGS" -A 1 -B 1 CMakeLists.txt test/CMakeLists.txt
Length of output: 3070
Script:
#!/bin/bash
# Check sanitizer detection and configuration
echo "Sanitizer detection configuration:"
cat cmake/detect-sanitizer.cmake
# Check if there are any other sanitizer-related files
echo -e "\nOther sanitizer-related files in cmake directory:"
fd sanitizer cmake/
Length of output: 5616
270-293: Verify performance impact across architectures.
Let's verify that the chosen OPTIMAL_CMP sizes provide optimal performance for each architecture.
✅ Verification successful
OPTIMAL_CMP values are properly optimized for architecture-specific performance
Based on the comprehensive analysis of the codebase:
The OPTIMAL_CMP values are carefully chosen based on:
- Architecture word sizes (64-bit for x86_64, aarch64, ppc64le)
- Hardware unaligned access capabilities (ARM with __ARM_FEATURE_UNALIGNED)
- Fallback to smaller sizes when unaligned access isn't supported
- Safe default of 8 bytes when architecture-specific optimization isn't available
The implementation validates these choices through:
- Comprehensive benchmarking suite (test/benchmarks/benchmark_compare256.cc)
- Architecture-specific optimized implementations (compare256_unaligned_16/32/64)
- Performance-focused code comments indicating careful optimization
- Conditional compilation ensuring the most efficient method is selected
The code shows evidence of performance tuning:
- Comments indicate optimization for STD_MAX_MATCH-2 multiple of 16
- Early exit optimizations for better performance
- Careful handling of unaligned access for each architecture
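An illustrative sketch of how such architecture-based OPTIMAL_CMP defaults can be expressed (not the verbatim zbuild.h block; the exact macro chain in the PR may differ):

```c
#if defined(NO_UNALIGNED)
#  define OPTIMAL_CMP 8      /* explicit opt-out: byte-at-a-time compares */
#elif defined(__x86_64__) || defined(_M_X64) || \
      defined(__aarch64__) || defined(_M_ARM64)
#  define OPTIMAL_CMP 64     /* fast unaligned 8-byte access */
#elif defined(__arm__) && defined(__ARM_FEATURE_UNALIGNED)
#  define OPTIMAL_CMP 32     /* 32-bit ARM with hardware unaligned support */
#else
#  define OPTIMAL_CMP 8      /* safe default */
#endif
```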
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any performance regression tests or benchmarks in the codebase
# that might validate the OPTIMAL_CMP size choices.
# Look for performance tests
echo "Searching for performance tests..."
rg -l "benchmark|perf_test|performance" --type c --type cpp --type h
# Look for related discussions or documentation
echo "Searching for related documentation..."
rg -l "OPTIMAL_CMP|comparison size|unaligned" README.md docs/
Length of output: 874
Script:
#!/bin/bash
# Let's examine the benchmark files and implementation details
# Check the benchmark implementation for compare256
echo "=== Checking compare256 benchmark implementation ==="
cat test/benchmarks/benchmark_compare256.cc
# Check if there are any comments or documentation about OPTIMAL_CMP in README
echo -e "\n=== Checking README for OPTIMAL_CMP documentation ==="
cat README.md | grep -A 5 -B 5 "OPTIMAL_CMP\|unaligned"
# Look for any configuration or build system tests related to OPTIMAL_CMP
echo -e "\n=== Checking for build system tests ==="
rg "OPTIMAL_CMP" configure
Length of output: 3893
Script:
#!/bin/bash
# Let's examine the actual implementation and usage of compare256 functions
# Check the compare256 implementation file
echo "=== Checking compare256 implementations ==="
rg -p "compare256_unaligned|compare256_c" --type c --type h -A 10
# Check where OPTIMAL_CMP is actually used
echo -e "\n=== Checking OPTIMAL_CMP usage ==="
rg "OPTIMAL_CMP" --type c --type h -A 5
# Look for any performance-related comments near the implementations
echo -e "\n=== Checking for performance-related comments ==="
rg -p "Performance|performance|optimized for|OPTIMAL" --type c --type h -B 3 -A 3
Length of output: 24919
test/benchmarks/benchmark_compare256.cc (3)
69-71: LGTM: Appropriate condition for unaligned_16 benchmark
The change from UNALIGNED_OK to OPTIMAL_CMP >= 32 aligns with the PR objectives. The condition correctly ensures that the 16-bit unaligned comparison benchmark is only included when the architecture supports efficient unaligned access.
71-73: LGTM: Proper dependency check for unaligned_32 benchmark
The condition appropriately checks for the built-in CTZ (count trailing zeros) function, which is required for the 32-bit implementation.
74-76: LGTM: Correct conditions for unaligned_64 benchmark
The conditions properly combine:
- Check for 64-bit CTZ support (HAVE_BUILTIN_CTZLL)
- Verification that the optimal comparison size is at least 64 bits (OPTIMAL_CMP >= 64)
arch/generic/generic_functions.h (3)
31-32: LGTM: Proper conditional compilation for compare256 unaligned variants
The conditions for unaligned comparison functions are well-structured:
- Base requirement: Little-endian and OPTIMAL_CMP >= 32
- Additional requirements for 32-bit (HAVE_BUILTIN_CTZ) and 64-bit (HAVE_BUILTIN_CTZLL) variants
Also applies to: 34-34, 36-37
46-47: LGTM: Well-structured conditions for longest_match variants
The conditions mirror those of compare256 functions, maintaining consistency across the codebase. Both regular and slow variants are properly declared for each bit width.
Also applies to: 49-50, 52-52, 54-55
62-63: LGTM: Appropriate implementation selection logic
The generic implementation selection uses a cascading approach to choose the most optimal variant based on:
- Architecture endianness
- OPTIMAL_CMP threshold
- Available CPU features (CTZ/CTZLL)
Let's verify the implementation selection logic is comprehensive:
✅ Verification successful
Implementation selection logic is comprehensive and well-structured
The verification confirms that the implementation selection logic is complete and handles all possible combinations:
- For Little Endian with OPTIMAL_CMP >= 32:
- 64-bit optimized variants when HAVE_BUILTIN_CTZLL and OPTIMAL_CMP >= 64
- 32-bit optimized variants when HAVE_BUILTIN_CTZ
- 16-bit variants as fallback
- For all other cases (Big Endian or OPTIMAL_CMP < 32):
- Falls back to the standard C implementation
The code also properly handles runtime CPU detection through the DISABLE_RUNTIME_CPU_DETECTION flag, mapping the generic implementations to native functions.
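For illustration, the kind of mapping this implies when runtime detection is compiled out (names hypothetical; a sketch rather than the actual header):

```c
/* With runtime CPU detection disabled, the generic picks simply become the
 * "native" entry points; no function-pointer dispatch is needed. */
#ifdef DISABLE_RUNTIME_CPU_DETECTION
#  define native_compare256     compare256_generic
#  define native_longest_match  longest_match_generic
#endif
```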
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all possible combinations of conditions are handled
# Check for any potential gaps in implementation selection
rg -l "longest_match.*generic|compare256.*generic" | xargs rg "define.*longest_match.*generic|define.*compare256.*generic" -A 5 -B 5
Length of output: 2819
Force-pushed from 6a4e0ad to 8ff2fec
Actionable comments posted: 0
🧹 Nitpick comments (1)
arch/generic/compare256_c.c (1)
141-142: Implementation looks good with efficient bit operations
The implementation is safe and efficient:
- Uses memcpy for safe unaligned access
- Leverages CPU's count trailing zeros instruction through __builtin_ctzll
- Properly handles 64-bit to 32-bit conversion in return value
Consider adding bounds checking for better safety:
```diff
 diff = sv ^ mv;
 if (diff) {
     uint64_t match_byte = __builtin_ctzll(diff) / 8;
+    /* Ensure we don't return more than the current length */
+    if (match_byte > UINT32_MAX - len)
+        match_byte = UINT32_MAX - len;
     return len + (uint32_t)match_byte;
 }
```
Also applies to: 143-166
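A minimal, self-contained sketch of the 64-bit comparison step being discussed (assumes little-endian byte order, as the surrounding code does; not the exact zlib-ng function):

```c
#include <stdint.h>
#include <string.h>

/* Returns how many of the first 8 bytes of a and b match (0..8). */
static inline uint32_t match_len_64(const unsigned char *a, const unsigned char *b) {
    uint64_t av, bv, diff;
    memcpy(&av, a, sizeof(av));   /* safe unaligned loads */
    memcpy(&bv, b, sizeof(bv));
    diff = av ^ bv;
    if (diff == 0)
        return 8;
    /* On little-endian, the lowest set bit of diff lies in the first
     * differing byte, so ctzll/8 gives that byte's index. */
    return (uint32_t)(__builtin_ctzll(diff) >> 3);
}
```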
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/cmake.yml (1 hunks)
- CMakeLists.txt (0 hunks)
- README.md (1 hunks)
- arch/generic/compare256_c.c (2 hunks)
- arch/generic/generic_functions.h (2 hunks)
- chunkset_tpl.h (1 hunks)
- cmake/detect-sanitizer.cmake (1 hunks)
- compare256_rle.h (2 hunks)
- configure (0 hunks)
- deflate_rle.c (1 hunks)
- insert_string_tpl.h (1 hunks)
- match_tpl.h (4 hunks)
- test/benchmarks/benchmark_compare256.cc (1 hunks)
- test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
- test/test_compare256.cc (1 hunks)
- test/test_compare256_rle.cc (1 hunks)
- zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
- configure
- CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (11)
- README.md
- test/test_compare256_rle.cc
- test/benchmarks/benchmark_compare256_rle.cc
- insert_string_tpl.h
- compare256_rle.h
- match_tpl.h
- test/test_compare256.cc
- test/benchmarks/benchmark_compare256.cc
- deflate_rle.c
- .github/workflows/cmake.yml
- arch/generic/generic_functions.h
🔇 Additional comments (5)
cmake/detect-sanitizer.cmake (1)
114-114: LGTM! The addition of unconditional alignment checks aligns with PR objectives.
The unconditional inclusion of the alignment sanitizer check supports the PR's goal of ensuring consistently safe unaligned reads across all configurations.
Let's verify the impact on CI builds:
✅ Verification successful
LGTM! The alignment sanitizer check is properly integrated with the existing sanitizer framework.
The code inspection confirms that:
- The alignment check is correctly added to the known_checks list in the add_undefined_sanitizer macro
- The check is properly integrated with the existing sanitizer detection and configuration framework
- The CI workflow includes multiple jobs with undefined behavior sanitizer enabled, which will now include alignment checks
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if any CI workflows explicitly disable alignment sanitizer checks
# or rely on the old WITH_UNALIGNED behavior
# Search for alignment sanitizer configurations in workflow files
fd -e yml -e yaml . .github/workflows/ --exec rg -l 'sanitizer|UBSAN|WITH_UNALIGNED'
# Search for any remaining WITH_UNALIGNED references that might need updating
rg 'WITH_UNALIGNED'
Length of output: 134
Script:
#!/bin/bash
# Let's check the content of cmake.yml to understand the sanitizer configuration
cat .github/workflows/cmake.yml
# Also check for any remaining references to alignment checks or sanitizers in CMake files
rg -g '*.cmake' 'alignment|sanitizer|UBSAN' -A 3 -B 3
Length of output: 40058
zbuild.h (1)
246-269: LGTM! Architecture-specific optimal comparison sizes are well-defined.
The OPTIMAL_CMP macro definitions provide a robust framework for determining optimal comparison sizes based on architecture capabilities. The progression from 64-bit to 32-bit to 8-bit is logical and includes proper safeguards.
Let's verify the architecture detection logic:
✅ Verification successful
Let me analyze the usage of OPTIMAL_CMP in more detail to ensure it's being used correctly with the architecture-specific values.
Architecture-specific OPTIMAL_CMP values are correctly implemented and used
The verification confirms that:
- The OPTIMAL_CMP macro is consistently used across the codebase with proper size-based conditionals
- The implementation in match_tpl.h and chunkset_tpl.h correctly handles different comparison sizes (8, 32, and 64 bits)
- No conflicting architecture-specific defines exist in zbuild.h
- The old UNALIGNED_OK checks have been completely removed
- The code gracefully degrades from 64-bit to 32-bit to 8-bit operations based on architecture capabilities
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that OPTIMAL_CMP is used consistently across the codebase
# and that there are no remaining UNALIGNED_OK checks
# Check for any remaining UNALIGNED_OK references
rg "UNALIGNED_OK" --type c --type h
# Verify OPTIMAL_CMP usage
rg "OPTIMAL_CMP" --type c --type h
Length of output: 1733
Script:
#!/bin/bash
# Check the context of OPTIMAL_CMP usage in match_tpl.h and chunkset_tpl.h
# to verify proper handling of different sizes
# Check match_tpl.h implementation
rg "OPTIMAL_CMP" match_tpl.h -B 2 -A 2
# Check chunkset_tpl.h implementation
rg "OPTIMAL_CMP" chunkset_tpl.h -B 2 -A 2
# Check if there are any other architecture-specific defines that might conflict
rg "#if.*defined\(__" zbuild.h -A 2
Length of output: 2550
chunkset_tpl.h (1)
230-232: Alignment handling properly adapted for optimal comparison sizes.
The alignment mask and handling logic correctly adapt to different OPTIMAL_CMP values, ensuring proper memory access patterns while maintaining performance.
Also applies to: 238-242
arch/generic/compare256_c.c (2)
Line range hint 61-89: Implementation looks good with efficient partial match handling
The implementation is safe and efficient:
- Uses zng_memcmp_2 for safe unaligned access
- Optimizes for partial matches with (*src0 == *src1)
- Properly processes 2 bytes at a time
60-60: Verify OPTIMAL_CMP threshold for 16-bit operations
The condition OPTIMAL_CMP >= 32 seems high for enabling 16-bit operations. This might prevent the 16-bit implementation from being used on architectures where it could be beneficial.
✅ Verification successful
The OPTIMAL_CMP >= 32 condition is correct for 16-bit operations
The condition is appropriate because:
- OPTIMAL_CMP defines the optimal word size for unaligned memory access on the architecture
- When OPTIMAL_CMP >= 32, it indicates the architecture efficiently supports unaligned access of at least 32 bits, which means 16-bit operations are also safe and efficient
- The implementation is part of a progressive optimization strategy where architectures with better unaligned access capabilities get more optimized implementations (8-bit → 16-bit → 32-bit → 64-bit)
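A hedged sketch of a 2-byte comparison step in that spirit (zng_memcmp_2-style; the actual helper's name and signature in zlib-ng may differ):

```c
#include <stdint.h>
#include <string.h>

/* Compares two bytes at a time using memcpy so the loads stay safe even when
 * the pointers are unaligned; returns non-zero if the pair differs. */
static inline int memcmp_2_sketch(const void *a, const void *b) {
    uint16_t av, bv;
    memcpy(&av, a, sizeof(av));
    memcpy(&bv, b, sizeof(bv));
    return av != bv;
}
```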
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for OPTIMAL_CMP definitions and usage
echo "Searching for OPTIMAL_CMP definitions:"
rg -A 2 "define.*OPTIMAL_CMP"
echo -e "\nSearching for other 16-bit operations with similar conditions:"
rg "OPTIMAL_CMP.*>=.*32.*16"
Length of output: 3580
Script:
#!/bin/bash
# Let's check the implementation context and other similar conditions
echo "Checking the implementation context:"
rg -B 5 -A 10 "BYTE_ORDER == LITTLE_ENDIAN && OPTIMAL_CMP >= 32" arch/generic/compare256_c.c
echo -e "\nChecking for other OPTIMAL_CMP >= 32 conditions:"
rg "OPTIMAL_CMP >= 32"
echo -e "\nChecking for other 16-bit related implementations:"
rg -A 3 "memcmp_2|16.*bit"
Length of output: 105796
Script:
#!/bin/bash
# Let's check zbuild.h for OPTIMAL_CMP definitions and their context
echo "Checking OPTIMAL_CMP definitions in zbuild.h:"
rg -B 5 -A 5 "define.*OPTIMAL_CMP" zbuild.h
echo -e "\nChecking compare256 implementations:"
rg -B 2 -A 5 "compare256.*16" arch/generic/compare256_c.c
Length of output: 2130
KungFuJesus
left a comment
Looks good. The alignment sanitizer coming back clean is a good sign. We should definitely test this on something a little more exotic like a G5.
```cmake
# Group sanitizer flag -fsanitize=undefined will automatically add alignment, even if
# it is not in our sanitize flag list, so we need to explicitly disable alignment sanitizing.
if(WITH_UNALIGNED)
    add_compile_options(-fno-sanitize=alignment)
endif()
```
Awesome, hopefully by removing this bit we'll have some coverage for anything we've missed with CI.
Yes, looks like we already found an invalid alignment access at crc32_acle.c:52
Hah yep. Easy fix, hopefully it generates the same code like I'd expect it to.
```diff
-    val = ((uint32_t)(strstart[0])); \
-    val |= ((uint32_t)(strstart[1]) << 8); \
-    val |= ((uint32_t)(strstart[2]) << 16); \
-    val |= ((uint32_t)(strstart[3]) << 24);
+    memcpy(&val, strstart, sizeof(val)); \
+    val = ZSWAP32(val);
```
Looks good, I wonder if GCC was missing this obvious byteswap idiom. If so, this should actually end up being faster, as I think GCC will probably translate this to movbe on x86.
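A rough sketch of the idiom in question (illustrative; zlib-ng's ZSWAP32 and the exact macro body may differ): loading via memcpy and then byteswapping is a pattern compilers readily collapse into a single byte-reversed load such as movbe on x86.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value and byte-swap it. GCC/Clang
 * typically fold the memcpy into one load and the swap into bswap/movbe. */
static inline uint32_t load_u32_swapped(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return __builtin_bswap32(v);   /* stands in for ZSWAP32 */
}
```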
```diff
-#if !defined(UNALIGNED64_OK)
+#if OPTIMAL_CMP < 64
     while (((uintptr_t)out & align_mask) && (len > 0)) {
         *out++ = *from++;
         --len;
```
I kind of speculate this loop here doesn't actually need to exist. There may be some performance benefit in a niche case for keeping it if say, the ISA has an unaligned write penalty but otherwise permits arbitrary alignment. Then the compiler would likely generate simple writes that happen to be at word aligned offsets. Of course, there's likely going to be unaligned access in one way or another, either the read, the write, or both.
Maybe I should see if things are impacted on the G5, which will use the C implementations of these things, supports unaligned writes from GPRs, and I think has a performance penalty for doing so.
If we could leave out this loop on some of the architectures, that'd certainly be nice for performance.
This was added due to a hang on macOS: #1031
Now that we don't directly access anything unaligned anymore, I suspect it's not needed, but I don't have a macOS machine to test it with. Was this with aarch64 or x86?
I'm not certain. It looks like it was in a code coverage build.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
```
@@            Coverage Diff             @@
##           develop    #1828      +/-  ##
===========================================
+ Coverage    32.20%   32.30%   +0.10%
===========================================
  Files           67       67
  Lines         5745     5745
  Branches      1239     1239
===========================================
+ Hits          1850     1856       +6
  Misses        3637     3637
+ Partials       258      252       -6
```
☔ View full report in Codecov by Sentry.
nmoinvaz
left a comment
I wonder if we should get rid of the *_16 versions of compare since we don't ever really target a 16-bit platform.
So this is the performance on develop: [benchmark chart] And this is this PR, as is: [benchmark chart] So seemingly no performance penalty. The dip at level 9 for inflate in both cases is curious but I guess not too unexpected.
Here's after removing the aforementioned alignment loop: [benchmark chart] So, basically the same.
And removing the arbitrary requirement of little endian for 8-byte comparisons on deflate maybe marginally helped level 9 performance: [benchmark chart] It seems to have surprisingly hurt level 8 compression run time.
With #1830 the _16 variants are the fallbacks if we don't have the CTZ variants.
Since we long ago (since 2.1.0-beta1) made unaligned reads safe by using memcpy or intrinsics,
it is time to replace the UNALIGNED_OK checks that have since really only been used to select the
optimal comparison sizes for the arch instead.
Distros that are used to zlib's potentially unsafe UNALIGNED_OK have been disabling it for zlib-ng too. Unfortunately, this has effectively resulted in disabling the safe unaligned reads, with parts of the code falling back to the slowest byte-by-byte comparison methods.
This PR should clear up that easily-misunderstood naming.
In my tests on x86-64 w/AVX2, comparing develop with UNALIGNED_OK disabled vs this PR:
Compression: 5.82% faster
Decompression: 0.72% faster