Conversation

@Dead2 (Member) commented Dec 17, 2024

Since we made unaligned reads safe long ago (in 2.1.0-beta1) by using memcpy or intrinsics,
it is time to replace the UNALIGNED_OK checks, which since then have really only been used to
select the optimal comparison sizes for each architecture.

Distros that are used to zlib's potentially unsafe UNALIGNED_OK have been disabling it for zlib-ng too; unfortunately, this has effectively disabled the safe unaligned reads as well, causing parts of the code to fall back to the slowest byte-by-byte comparison methods.
This PR should clear up that easily misunderstood naming.

In my tests on x86-64 w/AVX2, comparing develop with UNALIGNED_OK disabled vs this PR:
Compression: 5.82% faster
Decompression: 0.72% faster
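
For context, a minimal sketch of the memcpy-based pattern that has made unaligned reads safe since 2.1.0-beta1 (the helper name here is hypothetical, not taken from the codebase):

#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rather than casting an arbitrary byte pointer to
 * uint32_t* (undefined behavior on strict-alignment targets), go through
 * memcpy. Compilers lower this to a single load wherever unaligned access
 * is cheap, so there is no speed cost on those architectures. */
static inline uint32_t load_u32_unaligned(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}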

@Dead2 added the "cleanup" label (Improving maintainability or removing code) on Dec 17, 2024
@coderabbitai bot (Contributor) commented Dec 17, 2024

Walkthrough

This pull request introduces significant changes to the zlib library's memory handling and optimization strategies. The primary focus is on refining unaligned memory access, removing the WITH_UNALIGNED configuration option, and introducing more granular control over memory comparisons through the new OPTIMAL_CMP macro. The changes span multiple files, including workflow configurations, CMake build scripts, and source code, with an emphasis on improving memory safety, compiler compatibility, and architecture-specific optimizations.

Changes

  • .github/workflows/cmake.yml: Updated job name and removed the -DWITH_UNALIGNED=OFF CMake argument
  • CMakeLists.txt: Removed the WITH_UNALIGNED option, added multiple architecture-specific options
  • README.md: Updated description of memory handling improvements
  • arch/generic/compare256_c.c: Added unaligned comparison functions with updated compilation conditions
  • arch/generic/generic_functions.h: Modified conditional compilation for comparison functions
  • chunkset_tpl.h: Updated memory manipulation functions, added CHUNKCOPY_SAFE
  • cmake/detect-sanitizer.cmake: Added alignment to known sanitizer checks
  • configure: Removed the unalignedok variable and related logic
  • test/benchmarks/benchmark_compare256.cc: Updated conditional compilation for benchmarks
  • test/benchmarks/benchmark_compare256_rle.cc: Updated conditional compilation for benchmarks
  • test/test_compare256.cc: Updated conditional compilation for tests
  • test/test_compare256_rle.cc: Updated conditional compilation for tests
  • zbuild.h: Added the OPTIMAL_CMP macro and sanitizer-related macros

Sequence Diagram

sequenceDiagram
    participant Build System
    participant Compiler
    participant Memory Comparison Functions
    participant Sanitizer Runtime

    Build System->>Compiler: Set OPTIMAL_CMP based on architecture
    Compiler->>Memory Comparison Functions: Conditionally compile unaligned functions
    Memory Comparison Functions->>Sanitizer Runtime: Instrument memory access
    Sanitizer Runtime->>Memory Comparison Functions: Validate memory operations

The sequence diagram illustrates how the build system configures the compiler to select appropriate memory comparison functions based on the OPTIMAL_CMP value, with integration of sanitizer runtime for enhanced memory safety.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
chunkset_tpl.h (2)

230-234: Consider handling OPTIMAL_CMP >= 64 case for alignment mask

The alignment mask is defined for OPTIMAL_CMP < 32 and == 32, but there's no explicit handling for OPTIMAL_CMP >= 64. Consider whether a different alignment mask would be beneficial for 64-bit optimized architectures.

#if OPTIMAL_CMP < 32
    static const uint32_t align_mask = 7;
#elif OPTIMAL_CMP == 32
    static const uint32_t align_mask = 3;
+#elif OPTIMAL_CMP >= 64
+    static const uint32_t align_mask = 1;
#endif

Line range hint 238-242: Consider vectorized alignment for OPTIMAL_CMP >= 64

Based on the learnings about performance in critical code sections, consider using vectorized operations for alignment when OPTIMAL_CMP >= 64 instead of byte-by-byte copying.

match_tpl.h (1)

Line range hint 141-166: Consider refactoring nested optimization conditions

While the logic is correct, the nested conditions for different optimization levels make the code harder to maintain. Consider extracting the comparison logic into separate functions for each optimization level.

Example structure:

static inline bool compare_match_16(/* params */) {
    return mbase_end[cur_match] == scan_end[0] && 
           mbase_end[cur_match+1] == scan_end[1] &&
           mbase_start[cur_match] == scan[0] && 
           mbase_start[cur_match+1] == scan[1];
}

static inline bool compare_match_32(/* params */) {
    return zng_memcmp_4(mbase_end+cur_match, scan_end) == 0 &&
           zng_memcmp_4(mbase_start+cur_match, scan_start) == 0;
}

// ... similar for 64-bit
zbuild.h (1)

270-293: LGTM! Consider adding documentation for size choices.

The architecture-specific OPTIMAL_CMP definitions look good and align well with the PR's objective of removing UNALIGNED_OK checks. The implementation properly handles different architectures and maintains backward compatibility through NO_UNALIGNED.

Consider adding comments explaining the rationale for choosing specific sizes (64/32/8) for each architecture to help future maintainers understand the performance implications.
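
To illustrate, a condensed, hypothetical sketch of such a commented definition block (the architecture list and values reflect what is discussed in this review, not verbatim zbuild.h contents):

/* OPTIMAL_CMP: widest comparison size (in bits) that is fast on this arch.
 * 64: 64-bit targets with cheap unaligned access (x86_64, aarch64, ppc64le)
 * 32: 32-bit x86, or ARM when __ARM_FEATURE_UNALIGNED is available
 *  8: strict-alignment targets, or when NO_UNALIGNED forces the safe default */
#ifndef OPTIMAL_CMP
#  if defined(NO_UNALIGNED)
#    define OPTIMAL_CMP 8
#  elif defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__)
#    define OPTIMAL_CMP 64
#  elif defined(__i386__) || defined(__ARM_FEATURE_UNALIGNED)
#    define OPTIMAL_CMP 32
#  else
#    define OPTIMAL_CMP 8
#  endif
#endif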

arch/generic/generic_functions.h (1)

Line range hint 31-76: Excellent architectural improvements for handling unaligned access

The transition from UNALIGNED_OK to OPTIMAL_CMP provides several benefits:

  1. Better granularity in selecting optimal implementations based on architecture capabilities
  2. Safer handling of unaligned memory access through explicit size thresholds
  3. Clear separation between feature detection (CTZ/CTZLL) and performance optimization (OPTIMAL_CMP)

This change resolves the issue where distributions might disable safe unaligned reads due to confusion with zlib's potentially unsafe UNALIGNED_OK.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 037ab0f and 6a4e0ad.

📒 Files selected for processing (17)
  • .github/workflows/cmake.yml (1 hunks)
  • CMakeLists.txt (0 hunks)
  • README.md (1 hunks)
  • arch/generic/compare256_c.c (2 hunks)
  • arch/generic/generic_functions.h (2 hunks)
  • chunkset_tpl.h (1 hunks)
  • cmake/detect-sanitizer.cmake (1 hunks)
  • compare256_rle.h (2 hunks)
  • configure (0 hunks)
  • deflate_rle.c (1 hunks)
  • insert_string_tpl.h (1 hunks)
  • match_tpl.h (4 hunks)
  • test/benchmarks/benchmark_compare256.cc (1 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (1 hunks)
  • test/test_compare256_rle.cc (1 hunks)
  • zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
  • configure
  • CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • README.md
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:135-135
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKMEMSET` function within `chunkset_tpl.h`, extra bounds checks are avoided to maintain performance in critical code sections. Branching is minimized to prevent negative impacts on speculative execution. The variable `len` is enforced with `safelen` early on.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:263-271
Timestamp: 2024-11-12T09:04:30.407Z
Learning: In the `CHUNKCOPY_SAFE` function within `chunkset_tpl.h`, `from` cannot be ahead of `safe` because `safe` marks the end of the window, and `from` is always within the allocated window bounds. Thus, the subtraction `safe - from` is valid without additional checks.
🔇 Additional comments (20)
insert_string_tpl.h (1)

32-38: LGTM! The simplified HASH_CALC_READ macro improves clarity and safety.

The changes align well with the PR objectives by:

  • Using memcpy consistently for safe unaligned reads
  • Properly handling endianness without unnecessary complexity
  • Removing the redundant UNALIGNED_OK checks

Let's verify the impact across different architectures:

✅ Verification successful

The simplified HASH_CALC_READ macro is safe and consistent with the codebase architecture

The verification confirms that:

  • The change aligns with the codebase's handling of unaligned access across architectures
  • memcpy is consistently used for safe unaligned access throughout the codebase, including in architecture-specific implementations
  • The UNALIGNED_OK definitions in zbuild.h are only used for feature detection and don't affect the actual implementation anymore
  • The changes in insert_string_tpl.h are used by both generic (insert_string.c) and optimized (insert_string_roll.c) implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for architecture-specific implementations and potential performance impact
# Look for any remaining UNALIGNED_OK references or architecture-specific optimizations

# Search for any remaining UNALIGNED_OK checks
echo "Checking for remaining UNALIGNED_OK references:"
rg "UNALIGNED_OK"

# Look for architecture-specific implementations that might be affected
echo "Checking for architecture-specific implementations:"
fd -e h -e c | xargs rg -l "__(i386|x86_64|arm|aarch64|ppc|sparc)__"

Length of output: 661


Script:

#!/bin/bash
# Let's check the content of zbuild.h to understand the UNALIGNED_OK definitions
echo "=== Checking zbuild.h for UNALIGNED_OK context ==="
rg -A 5 -B 5 "UNALIGNED_OK" zbuild.h

# Let's also check if there are any other uses of memcpy for unaligned access
echo -e "\n=== Checking for similar memcpy patterns in architecture-specific files ==="
rg "memcpy.*sizeof" arch/

# Check the context of insert_string_tpl.h usage
echo -e "\n=== Checking where insert_string_tpl.h is included ==="
rg -l "insert_string_tpl\.h"

Length of output: 3454

cmake/detect-sanitizer.cmake (1)

114-114: LGTM: Alignment sanitizer check addition is appropriate

Adding the alignment check to UBSAN's known_checks is the correct approach, as it replaces the old UNALIGNED_OK checks with proper sanitizer-based validation. This change ensures that any unsafe unaligned memory access will be caught by the sanitizer.

.github/workflows/cmake.yml (2)

84-88: LGTM: CI configuration properly updated

The job rename and removal of -DWITH_UNALIGNED=OFF correctly reflects the architectural change of making unaligned access checks unconditional through sanitizers rather than compile-time flags.


84-88: Verify removal of WITH_UNALIGNED flag across all jobs

Let's ensure consistency across all CI jobs.

✅ Verification successful

After reviewing the entire workflow file, I can see that there is no usage of the WITH_UNALIGNED flag in any of the jobs. Since the review comment was concerned with checking for any remaining WITH_UNALIGNED flags across all the jobs in the workflow, and we've confirmed there are none, I can generate the final response:

No remaining WITH_UNALIGNED flags found in workflow

The workflow file has been checked and there are no instances of the WITH_UNALIGNED flag being used in any of the CI jobs.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining WITH_UNALIGNED flags in the workflow
rg "WITH_UNALIGNED" .github/workflows/

Length of output: 38


Script:

#!/bin/bash
# Let's check the entire workflow file content to understand the context better
cat .github/workflows/cmake.yml

Length of output: 32965

arch/generic/compare256_c.c (2)

60-60: Conditional compilation updated to use OPTIMAL_CMP

The preprocessor check now uses OPTIMAL_CMP >= 32 instead of UNALIGNED_OK, which aligns with the objective to remove UNALIGNED_OK checks and simplifies optimization control.


141-142: Appropriate update for 64-bit comparison conditional

The condition has been correctly modified to #if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64, removing the redundant UNALIGNED64_OK check and ensuring consistency with the new optimization strategy.

test/test_compare256_rle.cc (1)

55-62: Test conditions updated to reflect codebase changes

The conditional directives for the RLE compare tests have been appropriately updated to use BYTE_ORDER == LITTLE_ENDIAN and OPTIMAL_CMP thresholds, ensuring that the tests align with the modified implementation.

test/benchmarks/benchmark_compare256_rle.cc (1)

64-71: Benchmark conditionals adjusted to new optimization parameters

The updates to the preprocessor conditions now correctly depend on OPTIMAL_CMP and built-in function availability, which ensures that benchmarks are compiled when appropriate and aligns with the removal of UNALIGNED_OK checks.

test/test_compare256.cc (1)

68-77: Conditional directives in tests updated to match new criteria

The conditional compilation for the compare256 tests now relies on OPTIMAL_CMP and checks for built-in functions, which is consistent with the updated codebase and the removal of UNALIGNED_OK checks.

deflate_rle.c (1)

Line range hint 13-22: LGTM! Well-structured function selection hierarchy.

The migration from UNALIGNED_OK to OPTIMAL_CMP provides a clear progression of optimizations (sketched after this list):

  1. 64-bit comparisons when OPTIMAL_CMP >= 64 and CTZLL is available
  2. 32-bit comparisons when OPTIMAL_CMP >= 32 and CTZ is available
  3. 16-bit comparisons as a fallback when OPTIMAL_CMP >= 32
  4. Byte-by-byte comparison for OPTIMAL_CMP < 32
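
A hypothetical, condensed sketch of that cascade (the function names follow the naming used elsewhere in this PR; the actual deflate_rle.c may differ in detail):

#if defined(HAVE_BUILTIN_CTZLL) && OPTIMAL_CMP >= 64
#  define compare256_rle compare256_rle_unaligned_64  /* 64-bit words + CTZLL */
#elif defined(HAVE_BUILTIN_CTZ) && OPTIMAL_CMP >= 32
#  define compare256_rle compare256_rle_unaligned_32  /* 32-bit words + CTZ */
#elif OPTIMAL_CMP >= 32
#  define compare256_rle compare256_rle_unaligned_16  /* 16-bit fallback */
#else
#  define compare256_rle compare256_rle_c             /* byte-by-byte */
#endif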
match_tpl.h (2)

62-70: Well-structured offset calculations for different optimization levels

The offset calculations are properly adjusted based on both OPTIMAL_CMP values and data sizes, ensuring optimal matching behavior for different architectures.


189-198: Consistent handling of best length updates

The offset adjustments and optimization level checks maintain consistency with the initial calculations, ensuring symmetric behavior throughout the matching process.

zbuild.h (2)

Line range hint 295-371: LGTM! Well-structured sanitizer support.

The sanitizer implementation is robust with proper compiler detection, clean fallbacks, and well-designed instrumentation macros. The code handles both C and C++ compilation contexts appropriately.

Let's verify the sanitizer integration:

✅ Verification successful

Let me check the CMakeLists.txt files to understand how the sanitizers are integrated into the build system.


Let me check the sanitizer detection and configuration in the cmake directory to complete the verification.


Sanitizer support is properly integrated and used

The verification confirms that the sanitizer implementation is well-integrated:

  • Build system properly supports Address, Memory, Thread, and Undefined Behavior sanitizers through CMake options
  • Sanitizer detection is robust with comprehensive compiler flag checks
  • Instrumentation macros are actively used in the codebase, particularly in the s390 architecture implementation
  • CMake configuration includes proper compiler flags and build options for each sanitizer type
  • Memory Sanitizer integration includes special handling for Google Test with LLVM instrumented libraries
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if sanitizers are properly integrated in the build system
# and verify their usage across the codebase

# Check build system integration
echo "Checking build system integration..."
rg -l "ASAN|MSAN|sanitizer" CMakeLists.txt *.cmake

# Check usage of instrumentation macros
echo "Checking instrumentation macro usage..."
rg "instrument_(read|write)" --type c --type cpp --type h

# Look for existing sanitizer-related tests
echo "Checking for sanitizer tests..."
rg -l "ASAN|MSAN|sanitizer" test/

Length of output: 1424


Script:

#!/bin/bash
# Check CMakeLists.txt content for sanitizer configuration
echo "Main CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 CMakeLists.txt

echo -e "\nTest CMakeLists.txt sanitizer configuration:"
rg "sanitizer|ASAN|MSAN" -A 3 -B 3 test/CMakeLists.txt

# Check if there are any other sanitizer-related compiler flags
echo -e "\nCompiler flags related to sanitizers:"
rg "CMAKE_C_FLAGS|CMAKE_CXX_FLAGS" -A 1 -B 1 CMakeLists.txt test/CMakeLists.txt

Length of output: 3070


Script:

#!/bin/bash
# Check sanitizer detection and configuration
echo "Sanitizer detection configuration:"
cat cmake/detect-sanitizer.cmake

# Check if there are any other sanitizer-related files
echo -e "\nOther sanitizer-related files in cmake directory:"
fd sanitizer cmake/

Length of output: 5616


270-293: Verify performance impact across architectures.

Let's verify that the chosen OPTIMAL_CMP sizes provide optimal performance for each architecture.

✅ Verification successful

OPTIMAL_CMP values are properly optimized for architecture-specific performance

Based on the comprehensive analysis of the codebase:

  1. The OPTIMAL_CMP values are carefully chosen based on:

    • Architecture word sizes (64-bit for x86_64, aarch64, ppc64le)
    • Hardware unaligned access capabilities (ARM with __ARM_FEATURE_UNALIGNED)
    • Fallback to smaller sizes when unaligned access isn't supported
    • Safe default of 8 bytes when architecture-specific optimization isn't available
  2. The implementation validates these choices through:

    • Comprehensive benchmarking suite (test/benchmarks/benchmark_compare256.cc)
    • Architecture-specific optimized implementations (compare256_unaligned_16/32/64)
    • Performance-focused code comments indicating careful optimization
    • Conditional compilation ensuring the most efficient method is selected
  3. The code shows evidence of performance tuning:

    • Comments indicate optimization for STD_MAX_MATCH-2 multiple of 16
    • Early exit optimizations for better performance
    • Careful handling of unaligned access for each architecture
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any performance regression tests or benchmarks in the codebase
# that might validate the OPTIMAL_CMP size choices.

# Look for performance tests
echo "Searching for performance tests..."
rg -l "benchmark|perf_test|performance" --type c --type cpp --type h

# Look for related discussions or documentation
echo "Searching for related documentation..."
rg -l "OPTIMAL_CMP|comparison size|unaligned" README.md docs/

Length of output: 874


Script:

#!/bin/bash
# Let's examine the benchmark files and implementation details

# Check the benchmark implementation for compare256
echo "=== Checking compare256 benchmark implementation ==="
cat test/benchmarks/benchmark_compare256.cc

# Check if there are any comments or documentation about OPTIMAL_CMP in README
echo -e "\n=== Checking README for OPTIMAL_CMP documentation ==="
cat README.md | grep -A 5 -B 5 "OPTIMAL_CMP\|unaligned"

# Look for any configuration or build system tests related to OPTIMAL_CMP
echo -e "\n=== Checking for build system tests ==="
rg "OPTIMAL_CMP" configure

Length of output: 3893


Script:

#!/bin/bash
# Let's examine the actual implementation and usage of compare256 functions

# Check the compare256 implementation file
echo "=== Checking compare256 implementations ==="
rg -p "compare256_unaligned|compare256_c" --type c --type h -A 10

# Check where OPTIMAL_CMP is actually used
echo -e "\n=== Checking OPTIMAL_CMP usage ==="
rg "OPTIMAL_CMP" --type c --type h -A 5

# Look for any performance-related comments near the implementations
echo -e "\n=== Checking for performance-related comments ==="
rg -p "Performance|performance|optimized for|OPTIMAL" --type c --type h -B 3 -A 3

Length of output: 24919

test/benchmarks/benchmark_compare256.cc (3)

69-71: LGTM: Appropriate condition for unaligned_16 benchmark

The change from UNALIGNED_OK to OPTIMAL_CMP >= 32 aligns with the PR objectives. The condition correctly ensures that the 16-bit unaligned comparison benchmark is only included when the architecture supports efficient unaligned access.


71-73: LGTM: Proper dependency check for unaligned_32 benchmark

The condition appropriately checks for the built-in CTZ (count trailing zeros) function, which is required for the 32-bit implementation.


74-76: LGTM: Correct conditions for unaligned_64 benchmark

The conditions properly combine:

  1. Check for 64-bit CTZ support (HAVE_BUILTIN_CTZLL)
  2. Verification that optimal comparison size is at least 64 bits (OPTIMAL_CMP >= 64)
arch/generic/generic_functions.h (3)

31-32: LGTM: Proper conditional compilation for compare256 unaligned variants

The conditions for unaligned comparison functions are well-structured:

  • Base requirement: Little-endian and OPTIMAL_CMP >= 32
  • Additional requirements for 32-bit (HAVE_BUILTIN_CTZ) and 64-bit (HAVE_BUILTIN_CTZLL) variants

Also applies to: 34-34, 36-37


46-47: LGTM: Well-structured conditions for longest_match variants

The conditions mirror those of compare256 functions, maintaining consistency across the codebase. Both regular and slow variants are properly declared for each bit width.

Also applies to: 49-50, 52-52, 54-55


62-63: LGTM: Appropriate implementation selection logic

The generic implementation selection uses a cascading approach to choose the most optimal variant based on:

  1. Architecture endianness
  2. OPTIMAL_CMP threshold
  3. Available CPU features (CTZ/CTZLL)

Let's verify the implementation selection logic is comprehensive:

✅ Verification successful

Implementation selection logic is comprehensive and well-structured

The verification confirms that the implementation selection logic is complete and handles all possible combinations:

  1. For Little Endian with OPTIMAL_CMP >= 32:
    • 64-bit optimized variants when HAVE_BUILTIN_CTZLL and OPTIMAL_CMP >= 64
    • 32-bit optimized variants when HAVE_BUILTIN_CTZ
    • 16-bit variants as fallback
  2. For all other cases (Big Endian or OPTIMAL_CMP < 32):
    • Falls back to the standard C implementation

The code also properly handles runtime CPU detection through the DISABLE_RUNTIME_CPU_DETECTION flag, mapping the generic implementations to native functions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all possible combinations of conditions are handled

# Check for any potential gaps in implementation selection
rg -l "longest_match.*generic|compare256.*generic" | xargs rg "define.*longest_match.*generic|define.*compare256.*generic" -A 5 -B 5

Length of output: 2819

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/generic/compare256_c.c (1)

141-142: Implementation looks good with efficient bit operations

The implementation is safe and efficient:

  • Uses memcpy for safe unaligned access
  • Leverages CPU's count trailing zeros instruction through __builtin_ctzll
  • Properly handles 64-bit to 32-bit conversion in return value

Consider adding bounds checking for better safety:

 diff = sv ^ mv;
 if (diff) {
     uint64_t match_byte = __builtin_ctzll(diff) / 8;
+    /* Ensure we don't return more than the current length */
+    if (match_byte > UINT32_MAX - len)
+        match_byte = UINT32_MAX - len;
     return len + (uint32_t)match_byte;
 }

Also applies to: 143-166

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a4e0ad and 8ff2fec.

📒 Files selected for processing (17)
  • .github/workflows/cmake.yml (1 hunks)
  • CMakeLists.txt (0 hunks)
  • README.md (1 hunks)
  • arch/generic/compare256_c.c (2 hunks)
  • arch/generic/generic_functions.h (2 hunks)
  • chunkset_tpl.h (1 hunks)
  • cmake/detect-sanitizer.cmake (1 hunks)
  • compare256_rle.h (2 hunks)
  • configure (0 hunks)
  • deflate_rle.c (1 hunks)
  • insert_string_tpl.h (1 hunks)
  • match_tpl.h (4 hunks)
  • test/benchmarks/benchmark_compare256.cc (1 hunks)
  • test/benchmarks/benchmark_compare256_rle.cc (1 hunks)
  • test/test_compare256.cc (1 hunks)
  • test/test_compare256_rle.cc (1 hunks)
  • zbuild.h (1 hunks)
💤 Files with no reviewable changes (2)
  • configure
  • CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (11)
  • README.md
  • test/test_compare256_rle.cc
  • test/benchmarks/benchmark_compare256_rle.cc
  • insert_string_tpl.h
  • compare256_rle.h
  • match_tpl.h
  • test/test_compare256.cc
  • test/benchmarks/benchmark_compare256.cc
  • deflate_rle.c
  • .github/workflows/cmake.yml
  • arch/generic/generic_functions.h
🔇 Additional comments (5)
cmake/detect-sanitizer.cmake (1)

114-114: LGTM! The addition of unconditional alignment checks aligns with PR objectives.

The unconditional inclusion of the alignment sanitizer check supports the PR's goal of ensuring consistently safe unaligned reads across all configurations.

Let's verify the impact on CI builds:

✅ Verification successful

LGTM! The alignment sanitizer check is properly integrated with the existing sanitizer framework.

The code inspection confirms that:

  • The alignment check is correctly added to the known_checks list in the add_undefined_sanitizer macro
  • The check is properly integrated with the existing sanitizer detection and configuration framework
  • The CI workflow includes multiple jobs with undefined behavior sanitizer enabled, which will now include alignment checks
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if any CI workflows explicitly disable alignment sanitizer checks
# or rely on the old WITH_UNALIGNED behavior

# Search for alignment sanitizer configurations in workflow files
fd -e yml -e yaml . .github/workflows/ --exec rg -l 'sanitizer|UBSAN|WITH_UNALIGNED'

# Search for any remaining WITH_UNALIGNED references that might need updating
rg 'WITH_UNALIGNED'

Length of output: 134


Script:

#!/bin/bash
# Let's check the content of cmake.yml to understand the sanitizer configuration
cat .github/workflows/cmake.yml

# Also check for any remaining references to alignment checks or sanitizers in CMake files
rg -g '*.cmake' 'alignment|sanitizer|UBSAN' -A 3 -B 3

Length of output: 40058

zbuild.h (1)

246-269: LGTM! Architecture-specific optimal comparison sizes are well-defined.

The OPTIMAL_CMP macro definitions provide a robust framework for determining optimal comparison sizes based on architecture capabilities. The progression from 64-bit to 32-bit to 8-bit is logical and includes proper safeguards.

Let's verify the architecture detection logic:

✅ Verification successful

Let me analyze the usage of OPTIMAL_CMP in more detail to ensure it's being used correctly with the architecture-specific values.


Architecture-specific OPTIMAL_CMP values are correctly implemented and used

The verification confirms that:

  • The OPTIMAL_CMP macro is consistently used across the codebase with proper size-based conditionals
  • The implementation in match_tpl.h and chunkset_tpl.h correctly handles different comparison sizes (8, 32, and 64 bits)
  • No conflicting architecture-specific defines exist in zbuild.h
  • The old UNALIGNED_OK checks have been completely removed
  • The code gracefully degrades from 64-bit to 32-bit to 8-bit operations based on architecture capabilities
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that OPTIMAL_CMP is used consistently across the codebase
# and that there are no remaining UNALIGNED_OK checks

# Check for any remaining UNALIGNED_OK references
rg "UNALIGNED_OK" --type c --type h

# Verify OPTIMAL_CMP usage
rg "OPTIMAL_CMP" --type c --type h

Length of output: 1733


Script:

#!/bin/bash
# Check the context of OPTIMAL_CMP usage in match_tpl.h and chunkset_tpl.h
# to verify proper handling of different sizes

# Check match_tpl.h implementation
rg "OPTIMAL_CMP" match_tpl.h -B 2 -A 2

# Check chunkset_tpl.h implementation
rg "OPTIMAL_CMP" chunkset_tpl.h -B 2 -A 2

# Check if there are any other architecture-specific defines that might conflict
rg "#if.*defined\(__" zbuild.h -A 2

Length of output: 2550

chunkset_tpl.h (1)

230-232: Alignment handling properly adapted for optimal comparison sizes.

The alignment mask and handling logic correctly adapt to different OPTIMAL_CMP values, ensuring proper memory access patterns while maintaining performance.

Also applies to: 238-242

arch/generic/compare256_c.c (2)

Line range hint 61-89: Implementation looks good with efficient partial match handling

The implementation is safe and efficient:

  • Uses zng_memcmp_2 for safe unaligned access
  • Optimizes for partial matches with (*src0 == *src1)
  • Properly processes 2 bytes at a time

60-60: Verify OPTIMAL_CMP threshold for 16-bit operations

The condition OPTIMAL_CMP >= 32 seems high for enabling 16-bit operations. This might prevent the 16-bit implementation from being used on architectures where it could be beneficial.

✅ Verification successful

The OPTIMAL_CMP >= 32 condition is correct for 16-bit operations

The condition is appropriate because:

  1. OPTIMAL_CMP defines the optimal word size for unaligned memory access on the architecture
  2. When OPTIMAL_CMP >= 32, it indicates the architecture efficiently supports unaligned access of at least 32 bits, which means 16-bit operations are also safe and efficient
  3. The implementation is part of a progressive optimization strategy where architectures with better unaligned access capabilities get more optimized implementations (8-bit → 16-bit → 32-bit → 64-bit)
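
To make the gating concrete, a hypothetical sketch of the kind of 16-bit helper involved (the real zng_memcmp_2 may differ in detail):

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: the memcpy keeps the 16-bit load well-defined even on
 * strict-alignment targets; on OPTIMAL_CMP >= 32 architectures it compiles to
 * a single unaligned load. Returns nonzero on mismatch, like memcmp. */
static inline int memcmp_2_sketch(const void *a, const void *b) {
    uint16_t va, vb;
    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));
    return va != vb;
}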
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for OPTIMAL_CMP definitions and usage
echo "Searching for OPTIMAL_CMP definitions:"
rg -A 2 "define.*OPTIMAL_CMP" 

echo -e "\nSearching for other 16-bit operations with similar conditions:"
rg "OPTIMAL_CMP.*>=.*32.*16"

Length of output: 3580


Script:

#!/bin/bash
# Let's check the implementation context and other similar conditions
echo "Checking the implementation context:"
rg -B 5 -A 10 "BYTE_ORDER == LITTLE_ENDIAN && OPTIMAL_CMP >= 32" arch/generic/compare256_c.c

echo -e "\nChecking for other OPTIMAL_CMP >= 32 conditions:"
rg "OPTIMAL_CMP >= 32"

echo -e "\nChecking for other 16-bit related implementations:"
rg -A 3 "memcmp_2|16.*bit"

Length of output: 105796


Script:

#!/bin/bash
# Let's check zbuild.h for OPTIMAL_CMP definitions and their context
echo "Checking OPTIMAL_CMP definitions in zbuild.h:"
rg -B 5 -A 5 "define.*OPTIMAL_CMP" zbuild.h

echo -e "\nChecking compare256 implementations:"
rg -B 2 -A 5 "compare256.*16" arch/generic/compare256_c.c

Length of output: 2130

@KungFuJesus (Contributor) left a comment

Looks good. The alignment sanitizer coming back clean is a good sign. We should definitely test this on something a little more exotic like a G5.

Comment on lines -156 to -161
# Group sanitizer flag -fsanitize=undefined will automatically add alignment, even if
# it is not in our sanitize flag list, so we need to explicitly disable alignment sanitizing.
if(WITH_UNALIGNED)
add_compile_options(-fno-sanitize=alignment)
endif()

Contributor:

Awesome, hopefully by removing this bit we'll have some coverage for anything we've missed with CI.

@Dead2 (Member Author) commented Dec 17, 2024:
Yes, looks like we already found an invalid alignment access at crc32_acle.c:52

Contributor:

Hah yep. Easy fix; hopefully it generates the same code as I'd expect it to.

Comment on lines -43 to +38

-    val = ((uint32_t)(strstart[0])); \
-    val |= ((uint32_t)(strstart[1]) << 8); \
-    val |= ((uint32_t)(strstart[2]) << 16); \
-    val |= ((uint32_t)(strstart[3]) << 24);
+    memcpy(&val, strstart, sizeof(val)); \
+    val = ZSWAP32(val);
Contributor:

Looks good. I wonder if GCC was missing this obvious byteswap idiom. If so, this should actually end up being faster, as I think GCC will probably translate this to movbe on x86.
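
For illustration, a standalone sketch of the idiom (hypothetical helper; ZSWAP32 is assumed to behave like __builtin_bswap32 here):

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: memcpy keeps the unaligned load well-defined, and
 * GCC/Clang recognize the load-plus-bswap pair, folding it into a single
 * instruction (e.g. movbe on x86 when built with -mmovbe). */
static inline uint32_t hash_read_swapped(const unsigned char *strstart) {
    uint32_t val;
    memcpy(&val, strstart, sizeof(val));
    return __builtin_bswap32(val);  /* stands in for ZSWAP32 */
}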

Comment on lines -240 to 241

-#if !defined(UNALIGNED64_OK)
+#if OPTIMAL_CMP < 64
     while (((uintptr_t)out & align_mask) && (len > 0)) {
         *out++ = *from++;
         --len;
Contributor:

I kind of suspect this loop doesn't actually need to exist. There may be some performance benefit in a niche case for keeping it if, say, the ISA has an unaligned write penalty but otherwise permits arbitrary alignment. Then the compiler would likely generate simple writes that happen to be at word-aligned offsets. Of course, there's likely going to be unaligned access in one way or another: either the read, the write, or both.

Maybe I should see if things are impacted on the G5, which will use the C implementations of these things, supports unaligned writes from GPRs, and I think has a performance penalty for doing so.

Member Author:

If we could leave out this loop on some of the architectures, that'd certainly be nice for performance.

Member:

This was added due to a hang on macOS: #1031

Contributor:

Now that we don't directly access anything unaligned anymore, I suspect it's not needed, but I don't have a macOS machine to test with. Was this with aarch64 or x86?

Member:

I'm not certain. It looks like it was in a code coverage build.

@codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 32.30%. Comparing base (037ab0f) to head (8ff2fec).
Report is 3 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1828      +/-   ##
===========================================
+ Coverage    32.20%   32.30%   +0.10%     
===========================================
  Files           67       67              
  Lines         5745     5745              
  Branches      1239     1239              
===========================================
+ Hits          1850     1856       +6     
  Misses        3637     3637              
+ Partials       258      252       -6     


@nmoinvaz (Member) left a comment

I wonder if we should get rid of the *_16 versions of compare since we don't ever really target a 16-bit platform.

@KungFuJesus (Contributor) commented:

So this is the performance on develop:

CPU: PPC970MP, altivec supported
Tool: /home/adam/zlib-ng/build/minigzip Size: 146,440 B
Levels: 0-9       
Runs: 4          Trim worst: 0         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.390/0.401/0.416/0.011        0.412/0.433/0.451/0.017      211,973,953
 1     44.409%      2.891/2.911/2.939/0.022        1.359/1.361/1.364/0.002       94,127,497
 2     35.521%      4.645/4.670/4.695/0.024        1.368/1.376/1.386/0.008       75,290,525
 3     33.851%      6.067/6.105/6.132/0.027        1.337/1.338/1.339/0.001       71,750,763
 4     33.189%      6.811/6.824/6.844/0.014        1.305/1.307/1.310/0.002       70,346,478
 5     32.642%      7.341/7.358/7.376/0.015        1.283/1.289/1.294/0.005       69,187,419
 6     32.483%      8.701/8.738/8.773/0.029        1.282/1.283/1.285/0.001       68,850,776
 7     32.255%   12.013/12.085/12.203/0.090        1.268/1.272/1.280/0.005       68,366,759
 8     32.167%   17.598/17.617/17.649/0.023        1.262/1.269/1.276/0.006       68,180,762
 9     31.887%   19.923/19.961/20.001/0.033        1.283/1.289/1.293/0.004       67,586,442

 avg1  40.841%                        8.667                          1.222
 avg2  45.379%                        9.630                          1.357
 tot                                346.672                         48.870      865,661,374

And this is this PR, as is:

OS: Linux 6.1.1-gentoo #1 SMP PREEMPT Wed Dec 28 01:35:23 EST 2022 ppc64
CPU: PPC970MP, altivec supported
Tool: /home/adam/zlib-ng/build/minigzip Size: 146,440 B
Levels: 0-9       
Runs: 4          Trim worst: 0         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.396/0.399/0.405/0.004        0.404/0.416/0.432/0.012      211,973,953
 1     44.409%      2.894/2.928/2.964/0.037        1.351/1.362/1.375/0.011       94,127,497
 2     35.521%      4.646/4.694/4.743/0.050        1.373/1.378/1.388/0.007       75,290,525
 3     33.851%      6.080/6.102/6.118/0.017        1.336/1.340/1.346/0.005       71,750,763
 4     33.189%      6.792/6.809/6.819/0.012        1.304/1.310/1.314/0.004       70,346,478
 5     32.642%      7.333/7.361/7.385/0.023        1.276/1.283/1.290/0.007       69,187,419
 6     32.483%      8.675/8.699/8.734/0.026        1.274/1.281/1.289/0.007       68,850,776
 7     32.255%   12.001/12.024/12.046/0.021        1.263/1.270/1.277/0.006       68,366,759
 8     32.167%   17.607/17.646/17.685/0.035        1.261/1.270/1.276/0.006       68,180,762
 9     31.887%   19.963/20.070/20.321/0.168        1.285/1.291/1.295/0.005       67,586,442

 avg1  40.841%                        8.673                          1.220
 avg2  45.379%                        9.637                          1.356
 tot                                346.924                         48.799      865,661,374

So seemingly no performance penalty. The dip at level 9 for inflate in both cases is curious but I guess not too unexpected.

@KungFuJesus (Contributor) commented:

Here's after removing the aforementioned alignment loop:

OS: Linux 6.1.1-gentoo #1 SMP PREEMPT Wed Dec 28 01:35:23 EST 2022 ppc64
CPU: PPC970MP, altivec supported
Tool: /home/adam/zlib-ng/build/minigzip Size: 146,440 B
Levels: 0-9       
Runs: 4          Trim worst: 0         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.388/0.396/0.404/0.007        0.415/0.421/0.427/0.005      211,973,953
 1     44.409%      2.878/2.899/2.927/0.021        1.352/1.358/1.364/0.006       94,127,497
 2     35.521%      4.642/4.665/4.708/0.031        1.367/1.372/1.375/0.003       75,290,525
 3     33.851%      6.066/6.094/6.126/0.026        1.336/1.340/1.347/0.005       71,750,763
 4     33.189%      6.798/6.838/6.871/0.034        1.308/1.312/1.317/0.004       70,346,478
 5     32.642%      7.299/7.340/7.385/0.035        1.280/1.283/1.287/0.003       69,187,419
 6     32.483%      8.698/8.741/8.798/0.043        1.281/1.285/1.288/0.003       68,850,776
 7     32.255%   12.011/12.052/12.137/0.057        1.268/1.275/1.279/0.005       68,366,759
 8     32.167%   17.586/17.616/17.705/0.059        1.269/1.272/1.274/0.002       68,180,762
 9     31.887%   20.076/20.126/20.196/0.056        1.290/1.292/1.294/0.002       67,586,442

 avg1  40.841%                        8.677                          1.221
 avg2  45.379%                        9.641                          1.357
 tot                                347.073                         48.844      865,661,374

So, basically the same.

@KungFuJesus (Contributor) commented:

And removing the arbitrary requirement of little-endian byte order for 8-byte comparisons in deflate maybe marginally helped level 9 performance:

OS: Linux 6.1.1-gentoo #1 SMP PREEMPT Wed Dec 28 01:35:23 EST 2022 ppc64
CPU: PPC970MP, altivec supported
Tool: /home/adam/zlib-ng/build/minigzip Size: 146,440 B
Levels: 0-9       
Runs: 4          Trim worst: 0         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.389/0.393/0.399/0.004        0.411/0.427/0.433/0.011      211,973,953
 1     44.409%      2.892/2.914/2.956/0.030        1.351/1.359/1.368/0.009       94,127,497
 2     35.519%      4.795/4.812/4.827/0.014        1.368/1.371/1.376/0.003       75,286,322
 3     33.844%      6.170/6.212/6.255/0.035        1.333/1.336/1.339/0.003       71,735,218
 4     33.146%      6.890/6.925/6.955/0.027        1.303/1.309/1.316/0.006       70,255,223
 5     32.642%      7.285/7.339/7.423/0.059        1.282/1.292/1.300/0.009       69,187,419
 6     32.483%      8.541/8.589/8.686/0.066        1.284/1.286/1.287/0.002       68,850,776
 7     32.255%   12.121/12.145/12.180/0.026        1.278/1.280/1.281/0.001       68,366,759
 8     32.167%   18.046/18.095/18.170/0.057        1.269/1.272/1.279/0.005       68,180,762
 9     31.887%   19.314/19.341/19.374/0.025        1.281/1.289/1.294/0.005       67,586,442

 avg1  40.836%                        8.677                          1.222
 avg2  45.373%                        9.641                          1.358
 tot                                347.065                         48.880      865,550,371

Surprisingly, it seems to have hurt level 8 compression run time.

@Dead2 (Member Author) commented Dec 20, 2024:

I wonder if we should get rid of the *_16 versions of compare since we don't ever really target a 16-bit platform.

With #1830, the _16 variants are the fallbacks when we don't have the CTZ variants.
The fallback can be rewritten to do whatever is most performant, though; if that means making it a _32 variant without CTZ, for example, that should be fine.
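
For instance, a hypothetical sketch of a CTZ-free _32 fallback (illustrative only; not code from this PR):

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: compare two 256-byte spans four bytes at a time with
 * safe memcpy loads; on a mismatching word, resolve the exact position with
 * a short byte loop instead of a count-trailing-zeros builtin. */
static uint32_t compare256_32_noctz(const uint8_t *src0, const uint8_t *src1) {
    uint32_t len = 0;
    do {
        uint32_t a, b;
        memcpy(&a, src0, sizeof(a));
        memcpy(&b, src1, sizeof(b));
        if (a != b) {
            /* Some byte in this word differs, so this loop ends within 4 steps. */
            while (*src0 == *src1) {
                src0++, src1++, len++;
            }
            return len;
        }
        src0 += 4, src1 += 4, len += 4;
    } while (len < 256);
    return 256;
}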

@Dead2 merged commit 509f6b5 into develop on Dec 20, 2024
278 of 292 checks passed
@Dead2 mentioned this pull request Dec 31, 2024
@Dead2 deleted the remove-unaligned branch February 10, 2025
