这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@KungFuJesus
Copy link
Contributor

@KungFuJesus KungFuJesus commented Mar 28, 2025

This is ~25-30% faster than the SSE2 variant on a core2 quad. The main reason for this has to do with the fact that, while incurring far fewer shifts, an entirely separate stack buffer has to be managed that is the size of the L1 cache on most CPUs. This was one of the main reasons the 32k specialized function was slower for the scalar counterpart, despite auto vectorizing. The auto vectorized loop was setting up the stack buffer at unaligned offsets, which is detrimental to performance pre-nehalem. Additionally, we were losing a fair bit of time to the zero initialization, which we are now doing more selectively.

There are a ton of loads and stores happening, and for sure we are bound on the fill buffer + store forwarding. An SSE2 version of this code is probably possible by simply replacing the shifts with unpacks with zero and the palignr's with shufpd's. I'm just not sure it'll be all that worth it, though. We are gating against SSE4.1 not because we are using specifically a 4.1 instruction but because that marks when Wolfdale came out and palignr became a lot faster.

Summary by CodeRabbit

  • New Features
    • Added support for SSE4.1 instruction set on x86 processors, enabling optimized CRC32 checksum calculations for improved performance on compatible hardware.
  • Bug Fixes
    • None.
  • Tests
    • Introduced new benchmarks and test cases for the SSE4.1-optimized CRC32 implementation, including tests for smaller buffer sizes.
  • Chores
    • Updated build scripts and configuration tools to detect and enable SSE4.1 support when available.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 28, 2025

Walkthrough

This change introduces support for the SSE4.1 instruction set across the build system, runtime feature detection, and CRC32 computation functionality for x86 architectures. It adds new build options and detection logic for SSE4.1 in CMake, configure scripts, and Makefiles. The runtime feature detection is extended to recognize SSE4.1, and a new SSE4.1-optimized CRC32 implementation is provided. Function pointer assignments and test/benchmark registration logic are updated to utilize the new SSE4.1 variant when available. Corresponding test cases and benchmarks are included to validate and measure the new implementation.

Changes

File(s) Change Summary
CMakeLists.txt, cmake/detect-intrinsics.cmake Add build option and detection macro for SSE4.1; update dependency chain for SSE-related options; add feature summary and reporting for SSE4.1.
configure, arch/x86/Makefile.in Add SSE4.1 compiler flag, detection logic, and object file rules; update architecture-specific build and feature checks for SSE4.1.
arch/x86/chorba_sse41.c New file implementing CRC32 calculation optimized for SSE4.1 using the Chorba algorithm, including specialized routines and vectorized processing.
arch/x86/x86_features.c, arch/x86/x86_features.h Add runtime detection and struct member for SSE4.1 CPU feature.
arch/x86/x86_functions.h Add declaration for crc32_chorba_sse41; update macro guards and native function mappings for SSE4.1 support.
functable.c Assign CRC32 function pointer to SSE4.1 variant if available at runtime; update macro guards for SSE2/SSE4.1.
test/benchmarks/benchmark_crc32.cc Register new benchmark for SSE4.1 CRC32 variant; update macro guards for SSE2/SSE4.1.
test/test_crc32.cc Add new test case for SSE4.1 CRC32; update macro guards; add smaller buffer test case.

Sequence Diagram(s)

sequenceDiagram
    participant BuildSystem
    participant CPU
    participant FeatureDetect
    participant CRC32API
    participant SSE2Impl
    participant SSE41Impl
    participant User

    User->>CRC32API: Call crc32(...)
    CRC32API->>FeatureDetect: Query CPU features
    FeatureDetect->>CPU: CPUID
    CPU-->>FeatureDetect: Feature bits (including SSE4.1)
    FeatureDetect-->>CRC32API: has_sse41, has_sse2, etc.
    alt SSE4.1 available
        CRC32API->>SSE41Impl: crc32_chorba_sse41(...)
    else SSE2 available
        CRC32API->>SSE2Impl: crc32_chorba_sse2(...)
    else
        CRC32API->>CRC32API: Use fallback implementation
    end
    CRC32API-->>User: Return CRC32 result
Loading

Suggested reviewers

  • nmoinvaz

Tip

⚡💬 Agentic Chat (Pro Plan, General Availability)
  • We're introducing multi-step agentic chat in review comments and issue comments, within and outside of PR's. This feature enhances review and issue discussions with the CodeRabbit agentic chat by enabling advanced interactions, including the ability to create pull requests directly from comments and add commits to existing pull requests.
✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@KungFuJesus KungFuJesus force-pushed the sse41_chorba branch 2 times, most recently from e251b42 to e1574e2 Compare March 28, 2025 22:15
@KungFuJesus
Copy link
Contributor Author

2025-03-28T18:16:29-04:00
Running ./benchmark_zlib
Run on (4 X 3003 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 6144 KiB (x2)
Load Average: 1.91, 1.36, 0.75
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
crc32/generic_chorba/32768      15754 ns        15762 ns        44401
crc32/braid/32768               24084 ns        24095 ns        29072
crc32/chorba_sse2/32768          8991 ns         8994 ns        77933
crc32/chorba_sse41/32768         6264 ns         6267 ns       111684

@codecov
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

Attention: Patch coverage is 95.07389% with 10 lines in your changes missing coverage. Please review.

Project coverage is 81.92%. Comparing base (00a3168) to head (5890b4b).
Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
arch/x86/chorba_sse41.c 96.41% 6 Missing and 1 partial ⚠️
functable.c 66.66% 0 Missing and 1 partial ⚠️
test/benchmarks/benchmark_crc32.cc 0.00% 1 Missing ⚠️
test/test_crc32.cc 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #1893       +/-   ##
============================================
+ Coverage    40.29%   81.92%   +41.63%     
============================================
  Files           73      146       +73     
  Lines         7684    13267     +5583     
  Branches      1281     2944     +1663     
============================================
+ Hits          3096    10869     +7773     
+ Misses        4363     1493     -2870     
- Partials       225      905      +680     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (8)
test/benchmarks/benchmark_crc32.cc (1)

74-76: Consider using a dedicated SSE4.1 compile guard

The code currently uses the same NO_CHORBA_SSE2 guard for both SSE2 and SSE4.1 implementations. Consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 implementation to allow more fine-grained control.

-#       if defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+#       if defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
        BENCHMARK_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41);
#       endif
arch/x86/Makefile.in (1)

83-88: Use SFLAGS for shared object compilation

For consistency with other architecture-specific rules in this file, you should use $(SFLAGS) instead of $(CFLAGS) for the .lo target, which represents the shared object compilation.

chorba_sse41.lo:
-	$(CC) $(CFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
+	$(CC) $(SFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
test/test_crc32.cc (1)

288-290: Consider using a dedicated SSE4.1 compile guard

Similar to the benchmark file, consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 test to allow more fine-grained control over which implementations are tested.

-#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
TEST_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41)
#endif
configure (4)

39-39: Remove trailing whitespace in these lines.

They are causing linter warnings. Consider removing the extra spaces at the end of each of these lines.

Apply this diff to remove trailing spaces:

-# We only need to zero out the bytes between the 128'th value and the 144th 
+# We only need to zero out the bytes between the 128'th value and the 144th
...
-    ;;  
+    ;;
...
-  esac  
+  esac
...
-      --help)  
+      --help)

Also applies to: 104-104, 111-111, 118-118


114-114: Add a comment describing SSE4.1 usage.

The newly introduced sse41flag="-msse4.1" is self-explanatory but it might be helpful to document the rationale for enabling SSE4.1 and any potential compatibility considerations with older compilers or CPUs.


1537-1552: Ensure error handling for SSE4.1 intrinsics check.

The function check_sse41_intrinsics() compiles a test source to detect availability of SSE4.1. While this is correct for most compilers, consider gracefully handling exotic or older compilers that may not support -msse4.1 or <smmintrin.h>.


1683-1687: Add extra logging or condition checks for SSE4.1 availability.

The conditional block enabling SSE4.1 is good. Consider logging a short user-facing message to confirm SSE4.1 detection (similar to other SSE checks) for consistency and debuggability.

arch/x86/chorba_sse41.c (1)

39-39: Remove trailing whitespace.

The linter warns about trailing spaces on these lines. Removing them keeps the codebase clean and consistent.

Apply this diff to remove trailing spaces:

-/* We only need to zero out the bytes between the 128'th value and the 144th 
+/* We only need to zero out the bytes between the 128'th value and the 144th
...
-        _mm_store_si128(bitbuf144++, a);  
+        _mm_store_si128(bitbuf144++, a);
...
-        _mm_store_si128(bitbuf182++, e);  
+        _mm_store_si128(bitbuf182++, e);
...
-        buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_);  
+        buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_);

Also applies to: 104-104, 111-111, 118-118

🧰 Tools
🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 724dc0c and e1574e2.

📒 Files selected for processing (13)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • arch/x86/x86_intrins.h (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (3 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
test/benchmarks/benchmark_crc32.cc (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
🧬 Code Definitions (3)
functable.c (1)
arch/x86/x86_functions.h (1)
  • crc32_chorba_sse41 (41-41)
test/benchmarks/benchmark_crc32.cc (1)
arch/x86/x86_functions.h (1)
  • crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
  • crc32 (14-14)
🪛 GitHub Actions: Lint
configure

[warning] 1681-1681: Trailing whitespace.

arch/x86/chorba_sse41.c

[warning] 39-39: Trailing whitespace.


[warning] 104-104: Trailing whitespace.


[warning] 111-111: Trailing whitespace.


[warning] 118-118: Trailing whitespace.

🪛 GitHub Check: CodeQL
arch/x86/chorba_sse41.c

[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).


[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 148-148: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 149-149: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 150-150: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 151-151: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 185-185: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 186-186: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 187-187: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 188-188: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 196-196: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 197-197: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 198-198: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 199-199: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 245-245: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 246-246: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 247-247: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 248-248: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 256-256: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 257-257: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 258-258: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 259-259: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 278-278: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 279-279: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 280-280: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 281-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

⏰ Context from checks skipped due to timeout of 90000ms (140)
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
  • GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Windows MSVC 2022 v142 Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2022 v143 Win32
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat No Opt ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: macOS GCC
🔇 Additional comments (21)
arch/x86/x86_intrins.h (1)

95-95: LGTM: Minor formatting change

The addition of a blank line after the include statement improves code readability without affecting functionality.

arch/x86/x86_features.h (1)

20-20: LGTM: Clean SSE4.1 support addition

The has_sse41 field is correctly placed in the CPU features structure between SSSE3 and SSE4.2, following the logical progression of Intel's instruction set extensions.

arch/x86/x86_features.c (1)

78-78: LGTM: Proper SSE4.1 detection implementation

The SSE4.1 feature detection uses the correct bitmask (0x80000) to check bit 19 of the ECX register, which aligns with Intel's documentation for identifying SSE4.1 support via CPUID.

functable.c (1)

99-106: LGTM: Well-structured SSE4.1 function dispatch

The SSE4.1 conditional block is properly positioned between SSSE3 and SSE4.2 sections, maintaining the logical progression of feature checks. The implementation follows the established pattern in the codebase for conditionally selecting optimized implementations based on CPU capabilities.

The performance improvement (25-30% over SSE2 variant per PR description) will be automatically utilized on compatible systems without disrupting existing code paths.

win32/Makefile.msc (2)

65-65: LGTM: Object file addition looks good

The new object file chorba_sse41.obj is correctly added to the OBJS list, following the same pattern as other architecture-specific implementations.


217-217: LGTM: Dependency rule correctly defined

The dependency rule for the new object file is correctly defined, specifying the source file and required headers.

test/benchmarks/benchmark_crc32.cc (1)

75-75: LGTM: Benchmark registration follows established pattern

The benchmark registration for the new SSE4.1 implementation correctly follows the established pattern, using the appropriate function and feature check.

arch/x86/Makefile.in (2)

16-16: LGTM: SSE4.1 flag correctly defined

The SSE4.1 flag is correctly defined with the appropriate compiler option -msse4.1.


39-39: LGTM: Object files correctly added

The new object files for the SSE4.1-optimized implementation are correctly added to the build targets.

test/test_crc32.cc (1)

196-197: LGTM: Additional test case for half-sized buffer

Adding a test case with a 16KB buffer size (half of the existing 32KB test) is a good approach to ensure the implementation works correctly with different buffer sizes.

arch/x86/x86_functions.h (2)

39-43: LGTM! SSE4.1 function declaration is properly guarded

The declaration of crc32_chorba_sse41 is appropriately guarded with the correct preprocessor directives, ensuring it's only available when SSE4.1 and SSE2 are enabled and the CHORBA implementation is not disabled.


123-126: LGTM! Native CRC32 redirection to SSE4.1 implementation

This change appropriately redirects the native_crc32 function to use the SSE4.1 implementation when SSE4.1 is available and the necessary prerequisites are met.

CMakeLists.txt (4)

129-129: LGTM! SSE41 dependency is correctly specified

The WITH_SSE41 option is properly defined as dependent on WITH_SSSE3, which matches the architecture dependency chain where SSE4.1 builds upon SSSE3 instructions.


149-150: LGTM! Advanced options updated to include SSE41

The mark_as_advanced section has been updated to include the WITH_SSE41 option, maintaining consistency with how other similar options are handled.


995-1005: LGTM! SSE41 build configuration implementation

The implementation for building with SSE4.1 follows the same pattern as other instruction set extensions:

  1. Checks for intrinsics support
  2. Sets up appropriate macros and source files
  3. Applies correct compilation flags

This change is well-structured and consistent with the project's coding standards.


1496-1496: LGTM! Feature summary for SSE41 added

The feature summary is properly updated to include information about the SSE41 build option, maintaining consistency with documentation for other instruction sets.

configure (1)

1680-1682: Maintain consistency with other checks.

You've seamlessly added check_sse41_intrinsics after check_ssse3_intrinsics. Ensure that the new check is also integrated into any relevant feature-detection logs or user-facing messages that summarize SSE capabilities.

🧰 Tools
🪛 GitHub Actions: Lint

[warning] 1681-1681: Trailing whitespace.

arch/x86/chorba_sse41.c (4)

1-12: Check for fallback scenarios.

You guard this code with #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && defined(X86_SSE2). Ensure that if either SSE2 or SSE4.1 is not available, the build system correctly compiles an alternative implementation or gracefully disables this feature.


13-25: Reassure consistent function declarations.

Make sure these external and macro definitions remain consistent with their declarations in crc32_braid_p.h and any additional headers (like crc32.h). If function signatures change, it could result in subtle linking issues.


51-51: Confirm pointer arithmetic for __m128i arrays.

Static analysis flagged “suspicious pointer scaling” at these lines. bitbuffer is an array of uint64_t, yet you cast and manipulate it as __m128i *. This is likely correct given the 16-byte alignment, but please confirm each offset multiplication matches your intended data layout.

Also applies to: 148-151, 185-188, 196-199, 245-248, 256-259, 278-281

🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


376-406: Validate alignment logic for SSE usage.

In crc32_chorba_sse41(), you compute algn_diff to align the buffer to a 16-byte boundary. Make sure no performance regressions arise if len is small or if buf is nearly aligned (especially edge cases where algn_diff == len).

@nmoinvaz
Copy link
Member

There are a bunch of these errors:
image

@nmoinvaz nmoinvaz added optimization Architecture Architecture specific labels Mar 28, 2025
@KungFuJesus
Copy link
Contributor Author

KungFuJesus commented Mar 29, 2025

There are a bunch of these errors: image

I mean that is intentional, 2 64 bit integers is a stride through a single __m128i. Maybe I modify the pointer type to appease codeQL, though.

Err actually no, that's me intentionally advancing an __m128i pointer. What else would it have me do? The intrinsic requires a __m128i pointer.

@KungFuJesus KungFuJesus force-pushed the sse41_chorba branch 4 times, most recently from 4ae7746 to cf88723 Compare March 29, 2025 02:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/x86/chorba_sse41.c (1)

28-374: 🛠️ Refactor suggestion

Refactor the large function for maintainability.
This function (crc32_chorba_32768_nondestructive_sse41) contains multiple loops and intricate pointer manipulation across a wide range of lines, making it difficult to follow and maintain. Splitting it into smaller, well-named helper routines (e.g., initialization, SSE mixing, tail processing) can significantly improve readability and manageability.

🧰 Tools
🪛 GitHub Check: CodeQL

[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).


[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 148-148: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 149-149: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 150-150: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 151-151: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 185-185: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 186-186: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 187-187: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 188-188: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 196-196: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 197-197: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 198-198: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 199-199: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 245-245: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 246-246: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 247-247: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 248-248: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 256-256: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 257-257: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 258-258: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 259-259: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 278-278: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 279-279: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 280-280: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 281-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace found.


[warning] 104-104: Trailing whitespace found.


[warning] 111-111: Trailing whitespace found.


[warning] 118-118: Trailing whitespace found.

🧹 Nitpick comments (3)
configure (2)

1537-1552: Validate usage of _mm_insert_epi64 in the SSE4.1 check
This test snippet effectively confirms compiler support for SSE4.1 intrinsics. However, __m128i b is not initialized, which could trigger warnings in stricter compiler settings. Consider _mm_setzero_si128() if such warnings become problematic.


1680-1688: Remove trailing whitespace and finalize SSE4.1 configuration
Line 1681 includes trailing whitespace, causing the lint warning. Removing it will clear the pipeline warning. The surrounding SSE4.1 build logic is coherent.

Apply this diff to remove trailing whitespace:

-            # (there are trailing spaces here on line 1681)
+            
🧰 Tools
🪛 GitHub Actions: Lint

[warning] 1681-1681: Trailing whitespace found.

arch/x86/chorba_sse41.c (1)

39-39: Remove trailing whitespace.
These lines include trailing whitespaces, as flagged by the lint pipeline. Removing them will keep the project coding style clean and consistent.

(Repeat similarly for lines 104, 111, and 118.)

Also applies to: 104-104, 111-111, 118-118

🧰 Tools
🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace found.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1574e2 and 4ae7746.

📒 Files selected for processing (13)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (2 hunks)
  • arch/x86/x86_intrins.h (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (3 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • arch/x86/x86_intrins.h
  • arch/x86/Makefile.in
  • arch/x86/x86_features.c
  • test/benchmarks/benchmark_crc32.cc
  • functable.c
  • arch/x86/x86_functions.h
  • arch/x86/x86_features.h
  • cmake/detect-intrinsics.cmake
  • CMakeLists.txt
  • test/test_crc32.cc
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
🧬 Code Definitions (1)
arch/x86/chorba_sse41.c (2)
arch/x86/x86_intrins.h (2)
  • _mm_cvtsi64x_si128 (108-110)
  • _mm_cvtsi64x_si128 (108-108)
arch/x86/x86_functions.h (1)
  • crc32_chorba_sse41 (41-41)
🪛 GitHub Actions: Lint
configure

[warning] 1681-1681: Trailing whitespace found.

arch/x86/chorba_sse41.c

[warning] 39-39: Trailing whitespace found.


[warning] 104-104: Trailing whitespace found.


[warning] 111-111: Trailing whitespace found.


[warning] 118-118: Trailing whitespace found.

🪛 GitHub Check: CodeQL
arch/x86/chorba_sse41.c

[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).


[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 148-148: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 149-149: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 150-150: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 151-151: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 185-185: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 186-186: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 187-187: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 188-188: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 196-196: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 197-197: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 198-198: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 199-199: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 245-245: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 246-246: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 247-247: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 248-248: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 256-256: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 257-257: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 258-258: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 259-259: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 278-278: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 279-279: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 280-280: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).


[failure] 281-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

⏰ Context from checks skipped due to timeout of 90000ms (174)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2022 v143 Win64
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Windows MSVC 2019 v140 Win32
  • GitHub Check: Windows MSVC 2019 v141 Win32
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC -m32
🔇 Additional comments (6)
win32/Makefile.msc (3)

36-36: Confirm SSE4.1 define
The define -DX86_SSE41 is consistent with the naming pattern for other SSE flags (e.g., -DX86_SSE2, -DX86_SSE42).


64-65: Ensure conditional SSE4.1 build
The addition of chorba_sse2.obj and chorba_sse41.obj references is consistent with SSE-based optimizations. Confirm that your build system conditionally includes chorba_sse41.obj only when SSE4.1 is supported or intended.


217-217: Add SSE4.1 build rule
The build rule for chorba_sse41.obj follows the same pattern as related SSE object rules and appears correct.

configure (1)

114-114: Define SSE4.1 compiler flag
The new sse41flag="-msse4.1" is appropriately aligned with the existing SSE flag definitions.

arch/x86/chorba_sse41.c (2)

376-406: Validate SSE fallback logic.
In the crc32_chorba_sse41 function, ensure that the fallbacks to crc32_braid_internal or chorba_small_nondestructive_sse2 are correct and consistent with your SSE4.1 detection. Verify that length thresholds are aligned with the intended performance trade-offs, and that the code correctly handles edge cases where data is shorter than the SSE4.1-optimized block size.


51-51:

❓ Verification inconclusive

Potential suspicious pointer scaling.
Several lines use pointer increments on __m128i* objects in ways that static analysis flags as “suspicious pointer scaling.” Verify that these increments (often with bitbufXXX++ + N) correctly align with 128-bit boundaries and do not accidentally skip or overlap data. If this behavior is intentional (e.g., skipping multiple SIMD registers), consider adding comments explaining why this is correct.

To confirm correctness, run the following script to highlight relevant pointer arithmetic usage for manual inspection:

Also applies to: 148-151, 185-188, 196-199, 245-248, 256-259, 278-281


🏁 Script executed:

#!/bin/bash
# Searching lines where bitbuf pointers are incremented or offset by numeric values
rg -A 3 'bitbuf\S*\+\+' arch/x86/chorba_sse41.c

Length of output: 5002


Action: Verify pointer arithmetic correctness in SIMD store operations

  • In arch/x86/chorba_sse41.c, line 51 shows the expression _mm_store_si128(bitbuf144++ + 8, z);. This usage combines a post-increment (bitbuf144++) with an added offset (+ 8), which is atypical compared to the other straightforward pointer increments (e.g., bitbuf128++, bitbuf182++, etc.).
  • Please verify that this arithmetic correctly accounts for 128-bit boundaries and intentionally skips the necessary SIMD registers without overlapping or misaligning the data.
  • As similar pointer arithmetic is seen throughout the file (notably in the ranges [148–151, 185–188, 196–199, 245–248, 256–259, 278–281]), if this behavior is deliberate (for example, to zero out or process a specific subrange), please add inline comments explaining the intent to aid maintainability.
🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

@KungFuJesus
Copy link
Contributor Author

KungFuJesus commented Mar 29, 2025

Can we whitelist this thing with codeQL or something? The alternative looks like this:

_mm_load_si128((__m128i*)((uint64_t*)bitbuffer182 + 2 * 2));

Hmm, evidently derefing with a [] is what it wants? https://codeql.github.com/codeql-query-help/cpp/cpp-suspicious-pointer-scaling/

That's rather silly, but whatever, it's the same thing.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)

39-39: Remove trailing whitespace.

The build pipeline flags trailing whitespace on lines 39, 104, 111, and 118. Eliminating trailing whitespace helps maintain a clean codebase.

Example fix:

-    for (size_t i = 0; i < 2; ++i) {··
+    for (size_t i = 0; i < 2; ++i) {

Also applies to: 104-104, 111-111, 118-118

🧰 Tools
🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace.

🛑 Comments failed to post (1)
configure (1)

1680-1688: ⚠️ Potential issue

Fix trailing whitespace on line 1681

There's a trailing whitespace character that should be removed to comply with the codebase style guidelines.

-            check_sse41_intrinsics
-            
+            check_sse41_intrinsics
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

            check_sse41_intrinsics
            if test ${HAVE_SSE41_INTRIN} -eq 1; then
                CFLAGS="${CFLAGS} -DX86_SSE41"
                SFLAGS="${SFLAGS} -DX86_SSE41"
                ARCH_STATIC_OBJS="${ARCH_STATIC_OBJS} chorba_sse41.o"
                ARCH_SHARED_OBJS="${ARCH_SHARED_OBJS} chorba_sse41.lo"
            fi
🧰 Tools
🪛 GitHub Actions: Lint

[warning] 1681-1681: Trailing whitespace.

@KungFuJesus KungFuJesus force-pushed the sse41_chorba branch 2 times, most recently from f1e0e89 to 0211aba Compare March 29, 2025 02:36
Comment on lines 28 to 309
static Z_FORCEINLINE uint32_t crc32_chorba_32768_nondestructive_sse41(uint32_t crc, const uint64_t* buf, size_t len) {
const uint64_t* input = buf;
ALIGNED_(16) uint64_t bitbuffer[32768 / sizeof(uint64_t)];
const uint8_t* bitbufferbytes = (const uint8_t*) bitbuffer;
__m128i z = _mm_setzero_si128();

__m128i *bitbuf128 = (__m128i*)(bitbuffer + 128);
__m128i *bitbuf144 = (__m128i*)(bitbuffer + 144);
__m128i *bitbuf182 = (__m128i*)(bitbuffer + 182);
__m128i *bitbuf0 = bitbuf128;

/* We only need to zero out the bytes between the 128'th value and the 144th
* that are actually read */
for (size_t i = 0; i < 2; ++i) {
_mm_store_si128(bitbuf128++, z);
_mm_store_si128(bitbuf128++, z);
_mm_store_si128(bitbuf128++, z);
_mm_store_si128(bitbuf128++, z);
}

/* We only need to zero out the bytes between the 144'th value and the 182nd that
* are actually read */
for (size_t i = 0; i < 11; ++i) {
_mm_store_si128(bitbuf144++ + 8, z);
}

/* We only need to zero out the bytes between the 182nd value and the 210th that
* are actually read. */
for (size_t i = 0; i < 4; ++i) {
_mm_store_si128(bitbuf182++, z);
_mm_store_si128(bitbuf182++, z);
_mm_store_si128(bitbuf182++, z);
_mm_store_si128(bitbuf182++, z);
}

/* We need to mix this in */
__m128i init_crc = _mm_cvtsi64x_si128(crc);
crc = 0;

size_t i = 0;
__m128i *inptr = (__m128i*)input;
bitbuf144 = (__m128i*)(bitbuffer + 144);
bitbuf182 = (__m128i*)(bitbuffer + 182);
__m128i *bitbuf210 = (__m128i*)(bitbuffer + 210);
__m128i *bitbuf300 = (__m128i*)(bitbuffer + 300);

/* Previous iteration runs carried over */
__m128i buf144 = z;
__m128i buf182 = z;
__m128i buf210 = z;

for(; i + 300*8+64 < len && i < 22 * 8; i += 64) {
__m128i in12, in34, in56, in78;

in12 = _mm_load_si128(inptr++);
in34 = _mm_load_si128(inptr++);
in56 = _mm_load_si128(inptr++);
in78 = _mm_load_si128(inptr++);

if (i == 0) {
in12 = _mm_xor_si128(in12, init_crc);
}

__m128i in_1 = _mm_slli_si128(in12, 8);

__m128i in23 = _mm_alignr_epi8(in34, in12, 8);
__m128i in45 = _mm_alignr_epi8(in56, in34, 8);
__m128i in67 = _mm_alignr_epi8(in78, in56, 8);
__m128i in8_ = _mm_srli_si128(in78, 8);

__m128i a = _mm_xor_si128(buf144, in_1);

_mm_store_si128(bitbuf144++, a);
_mm_store_si128(bitbuf144++, in23);
_mm_store_si128(bitbuf144++, in45);
_mm_store_si128(bitbuf144++, in67);
buf144 = in8_;

__m128i e = _mm_xor_si128(buf182, in_1);
_mm_store_si128(bitbuf182++, e);
_mm_store_si128(bitbuf182++, in23);
_mm_store_si128(bitbuf182++, in45);
_mm_store_si128(bitbuf182++, in67);
buf182 = in8_;

__m128i m = _mm_xor_si128(buf210, in_1);
_mm_store_si128(bitbuf210++, m);
_mm_store_si128(bitbuf210++, in23);
_mm_store_si128(bitbuf210++, in45);
_mm_store_si128(bitbuf210++, in67);
buf210 = in8_;

_mm_store_si128(bitbuf300++, in12);
_mm_store_si128(bitbuf300++, in34);
_mm_store_si128(bitbuf300++, in56);
_mm_store_si128(bitbuf300++, in78);
}

for(; i + 300*8+64 < len && i < 32 * 8; i += 64) {
__m128i in12, in34, in56, in78;
in12 = _mm_load_si128(inptr++);
in34 = _mm_load_si128(inptr++);
in56 = _mm_load_si128(inptr++);
in78 = _mm_load_si128(inptr++);

__m128i in_1 = _mm_slli_si128(in12, 8);
__m128i in23 = _mm_alignr_epi8(in34, in12, 8);
__m128i in45 = _mm_alignr_epi8(in56, in34, 8);
__m128i in67 = _mm_alignr_epi8(in78, in56, 8);
__m128i in8_ = _mm_srli_si128(in78, 8);

__m128i a = _mm_xor_si128(buf144, in_1);

_mm_store_si128(bitbuf144++, a);
_mm_store_si128(bitbuf144++, in23);
_mm_store_si128(bitbuf144++, in45);
_mm_store_si128(bitbuf144++, in67);
buf144 = in8_;

__m128i e = _mm_xor_si128(buf182, in_1);
__m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
__m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
__m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

_mm_store_si128(bitbuf182++, e);
_mm_store_si128(bitbuf182++, f);
_mm_store_si128(bitbuf182++, g);
_mm_store_si128(bitbuf182++, h);

__m128i m = _mm_xor_si128(buf210, in_1);
_mm_store_si128(bitbuf210++, m);
_mm_store_si128(bitbuf210++, in23);
_mm_store_si128(bitbuf210++, in45);
_mm_store_si128(bitbuf210++, in67);
buf210 = in8_;

_mm_store_si128(bitbuf300++, in12);
_mm_store_si128(bitbuf300++, in34);
_mm_store_si128(bitbuf300++, in56);
_mm_store_si128(bitbuf300++, in78);
}

for(; i + 300*8+64 < len && i < 84 * 8; i += 64) {
__m128i in12, in34, in56, in78;
in12 = _mm_load_si128(inptr++);
in34 = _mm_load_si128(inptr++);
in56 = _mm_load_si128(inptr++);
in78 = _mm_load_si128(inptr++);

__m128i in_1 = _mm_slli_si128(in12, 8);
__m128i in23 = _mm_alignr_epi8(in34, in12, 8);
__m128i in45 = _mm_alignr_epi8(in56, in34, 8);
__m128i in67 = _mm_alignr_epi8(in78, in56, 8);
__m128i in8_ = _mm_srli_si128(in78, 8);

__m128i a = _mm_xor_si128(buf144, in_1);
__m128i b = _mm_xor_si128(_mm_load_si128(&bitbuf144[1]), in23);
__m128i c = _mm_xor_si128(_mm_load_si128(&bitbuf144[2]), in45);
__m128i d = _mm_xor_si128(_mm_load_si128(&bitbuf144[3]), in67);
buf144 = _mm_xor_si128(_mm_load_si128(&bitbuf144[4]), in8_);

_mm_store_si128(bitbuf144++, a);
_mm_store_si128(bitbuf144++, b);
_mm_store_si128(bitbuf144++, c);
_mm_store_si128(bitbuf144++, d);

__m128i e = _mm_xor_si128(buf182, in_1);
__m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
__m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
__m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

_mm_store_si128(bitbuf182++, e);
_mm_store_si128(bitbuf182++, f);
_mm_store_si128(bitbuf182++, g);
_mm_store_si128(bitbuf182++, h);

__m128i m = _mm_xor_si128(buf210, in_1);
_mm_store_si128(bitbuf210++, m);
_mm_store_si128(bitbuf210++, in23);
_mm_store_si128(bitbuf210++, in45);
_mm_store_si128(bitbuf210++, in67);
buf210 = in8_;

_mm_store_si128(bitbuf300++, in12);
_mm_store_si128(bitbuf300++, in34);
_mm_store_si128(bitbuf300++, in56);
_mm_store_si128(bitbuf300++, in78);
}

for(; i + 300*8+64 < len; i += 64) {
__m128i in12, in34, in56, in78;

if (i < 128 * 8) {
in12 = _mm_load_si128(inptr++);
in34 = _mm_load_si128(inptr++);
in56 = _mm_load_si128(inptr++);
in78 = _mm_load_si128(inptr++);
} else {
in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
}

// [0, 145, 183, 211]

/* Pre Penryn CPUs the unpack should be faster */
__m128i in_1 = _mm_slli_si128(in12, 8);

__m128i in23 = _mm_alignr_epi8(in34, in12, 8);
__m128i in45 = _mm_alignr_epi8(in56, in34, 8);
__m128i in67 = _mm_alignr_epi8(in78, in56, 8);
__m128i in8_ = _mm_srli_si128(in78, 8);

__m128i a = _mm_xor_si128(buf144, in_1);
__m128i b = _mm_xor_si128(_mm_load_si128(&bitbuf144[1]), in23);
__m128i c = _mm_xor_si128(_mm_load_si128(&bitbuf144[2]), in45);
__m128i d = _mm_xor_si128(_mm_load_si128(&bitbuf144[3]), in67);
buf144 = _mm_xor_si128(_mm_load_si128(&bitbuf144[4]), in8_);

_mm_store_si128(bitbuf144++, a);
_mm_store_si128(bitbuf144++, b);
_mm_store_si128(bitbuf144++, c);
_mm_store_si128(bitbuf144++, d);

__m128i e = _mm_xor_si128(buf182, in_1);
__m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
__m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
__m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

_mm_store_si128(bitbuf182++, e);
_mm_store_si128(bitbuf182++, f);
_mm_store_si128(bitbuf182++, g);
_mm_store_si128(bitbuf182++, h);

__m128i n, o, p;
__m128i m = _mm_xor_si128(buf210, in_1);

/* Couldn't tell you why but despite knowing that this is always false,
* removing this branch with GCC makes things significantly slower. Some
* loop bodies must be being joined or something */
if (i < 84 * 8) {
n = in23;
o = in45;
p = in67;
buf210 = in8_;
} else {
n = _mm_xor_si128(_mm_load_si128(&bitbuf210[1]), in23);
o = _mm_xor_si128(_mm_load_si128(&bitbuf210[2]), in45);
p = _mm_xor_si128(_mm_load_si128(&bitbuf210[3]), in67);
buf210 = _mm_xor_si128(_mm_load_si128(&bitbuf210[4]), in8_);
}

_mm_store_si128(bitbuf210++, m);
_mm_store_si128(bitbuf210++, n);
_mm_store_si128(bitbuf210++, o);
_mm_store_si128(bitbuf210++, p);

_mm_store_si128(bitbuf300++, in12);
_mm_store_si128(bitbuf300++, in34);
_mm_store_si128(bitbuf300++, in56);
_mm_store_si128(bitbuf300++, in78);
}

/* Second half of stores bubbled out */
_mm_store_si128(bitbuf144, buf144);
_mm_store_si128(bitbuf182, buf182);
_mm_store_si128(bitbuf210, buf210);

/* We also have to zero out the tail */
size_t left_to_z = len - (300*8 + i);
__m128i *bitbuf_tail = (__m128i*)(bitbuffer + 300 + i/8);
while (left_to_z >= 64) {
_mm_store_si128(bitbuf_tail++, z);
_mm_store_si128(bitbuf_tail++, z);
_mm_store_si128(bitbuf_tail++, z);
_mm_store_si128(bitbuf_tail++, z);
left_to_z -= 64;
}

while (left_to_z >= 16) {
_mm_store_si128(bitbuf_tail++, z);
left_to_z -= 16;
}

uint8_t *tail_bytes = (uint8_t*)bitbuf_tail;
while (left_to_z--) {
*tail_bytes++ = 0;
}

ALIGNED_(16) uint64_t final[9] = {0};
__m128i next12, next34, next56;
next12 = z;
next34 = z;
next56 = z;

for(; (i + 72 < len); i += 32) {
__m128i in1in2, in3in4;
__m128i in1in2_, in3in4_;
__m128i ab1, ab2, ab3, ab4;
__m128i cd1, cd2, cd3, cd4;

READ_NEXT(input, i, in1in2, in3in4);
READ_NEXT(bitbuffer, i, in1in2_, in3in4_);

in1in2 = _mm_xor_si128(_mm_xor_si128(in1in2, in1in2_), next12);
in3in4 = _mm_xor_si128(in3in4, in3in4_);

NEXT_ROUND(in1in2, ab1, ab2, ab3, ab4);

__m128i a2_ = _mm_slli_si128(ab2, 8);
__m128i ab1_next34 = _mm_xor_si128(next34, ab1);
in3in4 = _mm_xor_si128(in3in4, ab1_next34);
in3in4 = _mm_xor_si128(a2_, in3in4);
NEXT_ROUND(in3in4, cd1, cd2, cd3, cd4);

__m128i b2c2 = _mm_alignr_epi8(cd2, ab2, 8);
__m128i a4_ = _mm_slli_si128(ab4, 8);
a4_ = _mm_xor_si128(b2c2, a4_);
next12 = _mm_xor_si128(ab3, a4_);
next12 = _mm_xor_si128(next12, cd1);

__m128i d2_ = _mm_srli_si128(cd2, 8);
__m128i b4c4 = _mm_alignr_epi8(cd4, ab4, 8);
next12 = _mm_xor_si128(next12, next56);
next34 = _mm_xor_si128(cd3, _mm_xor_si128(b4c4, d2_));
next56 = _mm_srli_si128(cd4, 8);
}

memcpy(final, input+(i / sizeof(uint64_t)), len-i);
__m128i *final128 = (__m128i*)final;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next12));
++final128;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next34));
++final128;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next56));

uint8_t* final_bytes = (uint8_t*) final;

for(size_t j = 0; j < (len-i); j++) {
crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i)]) & 0xff] ^ (crc >> 8);
}
return crc;
}

Check notice

Code scanning / CodeQL

Block with too many statements Note

Block with too many statements (5 complex statements in the block).
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 29, 2025

Caution

Review failed

An error occurred during the review process. Please try again later.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/x86/chorba_sse41.c (2)

13-15: Consider placing externs in a shared header.

These extern declarations (e.g., crc32_braid_base) might be better placed in dedicated header files instead of localizing them within this C file. That helps maintain a single source of truth.


63-293: Complex loop structure could benefit from modularization.

Multiple loop stages handle data ranges differently and rely on SSE alignment. Extracting repeating patterns or consistent steps into helper functions can improve maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf88723 and 85f3b6b.

📒 Files selected for processing (12)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (3 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • arch/x86/x86_features.h
  • functable.c
  • arch/x86/Makefile.in
  • cmake/detect-intrinsics.cmake
  • win32/Makefile.msc
  • arch/x86/x86_features.c
  • arch/x86/x86_functions.h
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
🧬 Code Definitions (3)
test/benchmarks/benchmark_crc32.cc (1)
arch/x86/x86_functions.h (2)
  • crc32_chorba_sse2 (29-29)
  • crc32_chorba_sse41 (41-41)
arch/x86/chorba_sse41.c (1)
arch/x86/x86_functions.h (1)
  • crc32_chorba_sse41 (41-41)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
  • crc32_chorba_sse2 (29-29)
  • crc32_chorba_sse41 (41-41)
🔇 Additional comments (16)
test/benchmarks/benchmark_crc32.cc (1)

72-76: LGTM! Successfully adds SSE4.1 benchmark support.

The changes properly integrate the new SSE4.1 implementation into the benchmark framework. You've added the conditional check for X86_SSE41 and registered the benchmark with the appropriate CPU feature check, maintaining the same pattern used for other architecture-specific benchmarks.

test/test_crc32.cc (2)

196-197: LGTM! Added additional test data length.

The new test case with a buffer length of 16384 extends test coverage to a more comprehensive range of input sizes, which is valuable for validating the CRC32 implementation across different workloads.


285-290: LGTM! Correctly added SSE4.1 test support.

The new test configuration for the SSE4.1 implementation is properly guarded by the appropriate preprocessor directives and follows the established pattern in the codebase. The test will be skipped gracefully when SSE4.1 support is not available.

configure (3)

114-114: LGTM! Added SSE4.1 compiler flag definition.

The addition of the SSE4.1 flag variable follows the existing pattern for other instruction set flags.


1537-1551: LGTM! Well-implemented SSE4.1 intrinsics detection.

The check_sse41_intrinsics() function properly tests for compiler support of SSE4.1 intrinsics by compiling a small test program that uses the _mm_insert_epi64 intrinsic, which is specific to SSE4.1. This approach is consistent with how other instruction set extensions are detected in the script.


1680-1687: LGTM! Correctly integrated SSE4.1 into the build system.

The code properly calls the SSE4.1 detection function and, if supported, updates the compiler flags and object files list to include the SSE4.1-optimized implementation. This approach is consistent with how other architecture-specific optimizations are enabled.

CMakeLists.txt (3)

128-129: SSE4.1 and SSE4.2 dependent options look consistent.

By making SSE4.2 dependent on SSE4.1, you ensure proper incremental enablement of instruction sets. This appears appropriate, as SSE4.2 builds upon SSE4.1.


149-149: Marking SSE4.1 as advanced is appropriate.

Including WITH_SSE41 in advanced options aligns it with other architecture-specific toggles like WITH_SSSE3 and WITH_SSE42.


1495-1495: Feature info addition is consistent.

Adding add_feature_info(WITH_SSE41 ...) ensures SSE4.1 availability is clearly reported in the CMake feature summary.

arch/x86/chorba_sse41.c (7)

1-2: Guarding SSE4.1 Chorba code.

The #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) condition cleanly isolates this specialized code and prevents accidental compilation on unsupported targets.


3-12: Includes and headers look valid.

All necessary headers are referenced, and SSE/SMID intrinsics are included. This is aligned with SSE4.1 usage.


28-62: Partial zeroing strategy is clear.

Zeroing out only the parts actually read is a good performance optimization. Ensure that offsets (128, 144, 182) align with the loops' usage elsewhere to avoid missing memory areas.


295-320: Tail zeroing logic appears correct.

The loop systematically stores zero vectors based on remaining length. This prevents processing stale data from the buffer.


321-374: Final block processing is coherent.

Loading the leftover bytes into final[] and XORing with SSE registers aligns with the partial-block approach. This final step ensures leftover data is integrated into the CRC.


376-406: Runtime fallback strategy is appropriate.

The function checks alignment, calls SSE4.1, SSE2, or fallback logic depending on buffer size. This layering avoids duplication and efficiently selects the right method.


16-26:

❓ Verification inconclusive

Macros perform SSE loads and transforms; double-check bounds.

READ_NEXT and NEXT_ROUND macros rely on _mm_load_si128 and shifting. Ensure calling loops do not exceed buffer bounds, as these macros have no built-in checks.

Would you like to verify usage of these macros across the codebase to confirm that every call site guarantees sufficient alignment and buffer size?


🏁 Script executed:

#!/bin/bash
# Searching for READ_NEXT and NEXT_ROUND usage outside this file (if any).
rg --multiline 'READ_NEXT|NEXT_ROUND'

Length of output: 2561


Ensure Calling Code Maintains Buffer Bounds and Alignment for SSE Macros

After running a repository-wide search, we see that both the READ_NEXT and NEXT_ROUND macros are used exclusively in the files below:

  • arch/x86/chorba_sse41.c
  • arch/x86/chorba_sse2.c

Please make sure that in each of these call sites the following are verified:

  • The pointer arguments (e.g., input, bitbuffer) are properly aligned for _mm_load_si128.
  • The loop index or offset (i) guarantees that accesses stay within the allocated buffer bounds.
  • Any commented-out variants (such as READ_NEXT_UNALIGNED) are intentionally omitted and that their removal will not affect safety guarantees.

These checks will prevent potential undefined behavior due to overreaching loads or misaligned memory accesses.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/x86/x86_functions.h (1)

123-126: 🛠️ Refactor suggestion

Consider adding the __SSE4_1__ predefined macro check for native builds.

When building natively, we should check both the compiler feature flag and the predefined macro to ensure SSE4.1 is actually available.

Apply this diff to improve the condition:

-#  if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE)
+#  if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && defined(__SSE4_1__) && !defined(NO_CHORBA_SSE)
#   undef native_crc32
#   define native_crc32 crc32_chorba_sse41
#   endif
🧹 Nitpick comments (2)
arch/x86/x86_functions.h (1)

14-14: Renamed macro could be improved for clarity.

The macro NO_CHORBA_SSE2 has been renamed to NO_CHORBA_SSE, which makes sense as it now covers all SSE variants (including the new SSE41 implementation). However, this change should be accompanied by documentation explaining the purpose of this macro for future maintainers.

Consider adding a brief comment explaining the purpose of this macro:

#if defined(_MSC_VER) && !defined(_M_AMD64) && _MSC_VER >= 1920 && _MSC_VER <= 1929
-#define NO_CHORBA_SSE
+/* Disable SSE-based CRC32 chorba implementations due to MSVC stack corruption bug */
+#define NO_CHORBA_SSE
#endif
arch/x86/chorba_sse41.c (1)

186-232: Complex loop logic readability.

The loop condition depends on i + 300*8+64 < len and includes branching for offsets. To maintain clarity, consider extracting loop bodies into smaller functions if possible. This would improve readability and testing.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adc43d3 and 6e5877f.

📒 Files selected for processing (12)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (4 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • arch/x86/x86_features.h
  • arch/x86/x86_features.c
  • functable.c
  • arch/x86/Makefile.in
  • test/benchmarks/benchmark_crc32.cc
  • win32/Makefile.msc
  • test/test_crc32.cc
  • cmake/detect-intrinsics.cmake
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (4)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-04-02T20:18:01.745Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Definitions (2)
arch/x86/chorba_sse41.c (1)
arch/x86/x86_functions.h (1)
  • crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
  • crc32 (14-14)
🔇 Additional comments (15)
arch/x86/x86_functions.h (1)

39-43: LGTM: Clean implementation of SSE4.1 function declaration.

The declaration of the new CRC32 function for SSE4.1 is properly conditional on WITHOUT_CHORBA matching the existing pattern for SSE2 implementation.

configure (4)

114-114: LGTM: Added compiler flag for SSE4.1.

The SSE4.1 compiler flag is correctly defined alongside other x86 instruction set flags.


1537-1551: LGTM: Well-implemented SSE4.1 intrinsics detection function.

The function correctly tests for SSE4.1 support by compiling a small test program that uses SSE4.1-specific intrinsics (_mm_min_epi32), following the same pattern as other instruction set detection functions in the script.


1680-1687: LGTM: Properly implemented SSE4.1 build configuration.

The implementation correctly sets up the build flags and adds the object files for the SSE4.1 implementation when support is detected. This follows the established pattern for other instruction set extensions in the script.


2235-2235: LGTM: Added SSE41FLAG variable in the build configuration.

The code correctly adds the SSE41FLAG variable to the Makefile generation, ensuring the compiler flags are properly passed when building the SSE4.1 implementation.

CMakeLists.txt (4)

128-129: Maintain consistency with instruction set dependencies.

Defining SSE4.1 to depend on SSSE3 and SSE4.2 to depend on SSE4.1 is correct since SSE4.1 is indeed a superset of SSSE3, and SSE4.2 is a superset of SSE4.1. This ensures that users can't accidentally enable incompatible options.


149-149: Add SSE41-related entries to advanced settings.

Including WITH_SSE41 in the advanced options is consistent with the other SSE variants. No further issues detected.


995-1005: Correctly conditionally building SSE4.1 code.

The logic to check for SSE4.1 intrinsics and define X86_SSE41 if supported (and SSSE3 is enabled) is sound. This properly avoids build issues on unsupported compilers.


1495-1495: Feature info for SSE41.

Adding the SSE41 feature info clarifies the build output and helps users confirm that SSE4.1 is enabled at configure time.

arch/x86/chorba_sse41.c (6)

1-2: Conditional compilation guard.

The #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) guard is appropriately limiting this file to SSE4.1 builds when Chorba is enabled.


28-34: Macro usage clarity.

The REALIGN_CHORBA macro now clearly references the correct parameters (in0, in1, … out0, out1, …). This resolves the prior naming mismatch that could cause confusion.


312-343:

✅ Verification successful

Alignment logic and threshold checks.

  1. The use of uintptr_t for alignment calculation in crc32_chorba_sse41 is correct to avoid type mismatch.
  2. The threshold checks for “large,” “medium,” and “small” Chorba variants appear consistent, but ensure they match project-wide definitions to avoid subtle performance or correctness issues.

Please confirm that CHORBA_LARGE_THRESHOLD, CHORBA_MEDIUM_LOWER_THRESHOLD, and CHORBA_MEDIUM_UPPER_THRESHOLD are defined consistently in the codebase:


🏁 Script executed:

#!/bin/bash
rg 'CHORBA_(LARGE_THRESHOLD|MEDIUM_LOWER_THRESHOLD|MEDIUM_UPPER_THRESHOLD)'

Length of output: 785


Verified: Alignment Logic and Thresholds Consistent
The CHORBA threshold macros in crc32.h are defined as follows:
CHORBA_LARGE_THRESHOLD as (sizeof(z_word_t) * 64 * 1024)
CHORBA_MEDIUM_LOWER_THRESHOLD as 8192
CHORBA_MEDIUM_UPPER_THRESHOLD as 32768

These definitions are used consistently in arch/x86/chorba_sse41.c (as well as in related files such as arch/x86/chorba_sse2.c and arch/generic/crc32_c.c). The alignment calculation using uintptr_t and the threshold checks for the “large,” “medium,” and “small” cases are therefore correct and adhere to the project-wide definitions.

No further changes are required for this section.


257-310:

❓ Verification inconclusive

Final partial CRC calculation with table lookups.

You combine leftover data from both the input buffer and the locally stored data in bitbuffer. Verify that (j + i) accesses in bitbufferbytes[(j + i)] never exceed the size of bitbuffer. The partial zeroing is correct, but it’s crucial to confirm that offsets remain within bounds.


🏁 Script executed:

#!/bin/bash
# Ensures j+i < 32768, given that bitbuffer is 32768 bytes and i max in loops
rg 'bitbufferbytes\[\(j\+i\)' -A 5 -B 5

Length of output: 1835


Ensure (j + i) Always Falls Within Bitbuffer Limits

The current x86 SSE4.1 implementation loops over the remaining bytes with the condition j < (len - i) and accesses bitbufferbytes[(j+i)] without using modulo arithmetic—unlike the generic CRC function, which sometimes applies % bitbuffersizebytes. Please confirm that the incoming length (len) is constrained (i.e., ≤32768 bytes) or that other preconditions guarantee (j + i) will never exceed the allocated bitbuffer size. This review is a reminder to verify that those assumptions are explicitly enforced elsewhere in the code or via input validation.


59-95:

✅ Verification successful

Zero-initialization blocks require careful boundary checks.

Multiple segments of partial zeroing are performed using _mm_store_si128. Ensure the computed pointers (bitbuf144, bitbuf182, etc.) remain within bitbuffer’s bounds for all possible data sizes and loop iterations.

Could you confirm these partial zeroing loops do not write out of bitbuffer?


🏁 Script executed:

#!/bin/bash
# Grep for bitbuf144, bitbuf182, etc. usage, verifying loop limits
rg 'bitbuf(144|182|300)' -A 5 -B 5

Length of output: 7694


Boundaries Confirmed for Partial Zeroing

After verifying the pointer arithmetic and loop iterations:

  • bitbuf144 is set to &bitbuffer_v[72] and the loop iterates 11 times, using _mm_store_si128(&z_cursor[8], z). Even on the last iteration, the write occurs at bitbuffer_v[82 + 8] (i.e., index 90).
  • bitbuf182 is set to &bitbuffer_v[91] and the loop iterates 4 times via the STORE4 macro, ensuring the maximum written index is around 97.
  • bitbuf300 is similarly computed with a constant offset.

Given that the underlying bitbuffer is sized to hold 32768 bytes (which, when reinterpreted as an array of __m128i, offers 2048 elements), all pointer operations remain safely within bounds. No out-of-bound writes are observed based on the constant offsets and loop counts.

Please ensure that any future modifications to these constants or the buffer size continue to uphold these boundary conditions.


16-19:

❓ Verification inconclusive

Ensure pointer arithmetic safety.

The macro uses (in + off / sizeof(uint64_t)) to compute addresses. Confirm that the caller always provides valid offsets so we never read past the buffer boundary.

Could you verify that all invocations of READ_NEXT are guaranteed in-bounds?


🏁 Script executed:

#!/bin/bash
# Searches for all lines calling READ_NEXT and checks for code paths ensuring 'off' remains valid.
rg 'READ_NEXT\s*\([^,]+,\s*[^,]+,\s*[^,]+,\s*[^,]+\)'

Length of output: 938


Action: Verify pointer arithmetic safety ensuring valid offsets

The READ_NEXT macro is used in both arch/x86/chorba_sse41.c and arch/x86/chorba_sse2.c (e.g. via invocations like READ_NEXT(input, i, in1in2, in3in4) and READ_NEXT(bitbuffer, i, in1in2_, in3in4_)). However, our search shows that the caller consistently passes an offset variable (typically i), but it isn’t immediately clear from the static context that every such usage is preceded by explicit checks ensuring that i (and the subsequent accesses computed as in + off / sizeof(uint64_t) and in + off / sizeof(uint64_t) + 2) remains in bounds.

Please verify that:

  • Every caller invoking READ_NEXT performs proper bounds checking to ensure the computed memory accesses stay within the buffer.
  • The offset variable (e.g. i) along with any arithmetic adjustments never leads to out-of-bound reads.

@phprus
Copy link
Contributor

phprus commented Apr 5, 2025

Missing X86_SSE41 macro in https://github.com/zlib-ng/zlib-ng/blob/develop/win32/Makefile.msc (WFLAGS variable)

@KungFuJesus
Copy link
Contributor Author

Missing X86_SSE41 macro in https://github.com/zlib-ng/zlib-ng/blob/develop/win32/Makefile.msc (WFLAGS variable)

Are you looking at the wrong branch? It's definitely in there:
https://github.com/KungFuJesus/zlib-ng/blob/sse41_chorba/win32/Makefile.msc#L36

@phprus
Copy link
Contributor

phprus commented Apr 6, 2025

@KungFuJesus
My apologies!
I got it wrong tabs in the browser where branches were open.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)

28-34: Validate macro parameter consistency.

REALIGN_CHORBA() manipulates four input vectors (in0, in1, in2, in3). Ensure that references within the macro consistently match the macro’s parameters. Typos or mismatched references can lead to subtle bugs in vector re-alignment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e5877f and 196ac09.

📒 Files selected for processing (12)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (4 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • arch/x86/x86_features.h
  • test/benchmarks/benchmark_crc32.cc
  • arch/x86/x86_features.c
  • functable.c
  • test/test_crc32.cc
  • cmake/detect-intrinsics.cmake
  • arch/x86/x86_functions.h
  • win32/Makefile.msc
  • arch/x86/Makefile.in
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (4)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-04-02T20:18:01.745Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
⏰ Context from checks skipped due to timeout of 90000ms (115)
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC ARM HF
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Fuzzing
🔇 Additional comments (7)
configure (2)

114-114: Good addition of the SSE4.1 compiler flag.

This is consistent with the existing approach for other SSE flags (e.g., sse2flag, sse42flag). It ensures that the build will correctly invoke the compiler to generate SSE4.1 instructions whenever available.


1537-1552: Consistent intrinsics check function.

This new check_sse41_intrinsics() function aligns well with the pattern used for checking other CPU intrinsics. It creates a test file that uses _mm_min_epi32 from <smmintrin.h> and tries to compile with the -msse4.1 flag. This approach is straightforward and ensures reliable auto-detection of SSE4.1 support.

CMakeLists.txt (3)

128-129: New build options for SSE4.1 and SSE4.2.

Declaring:

cmake_dependent_option(WITH_SSE41 "Build with SSE41" ON "WITH_SSSE3" OFF)
cmake_dependent_option(WITH_SSE42 "Build with SSE42" ON "WITH_SSE41" OFF)

is logical since SSE4.1 depends on SSSE3 (Intel’s partial superset path), and SSE4.2 depends on SSE4.1. This guards against impossible configs when toggling one without the other.


149-149: Properly marking SSE4.1 as advanced.

Adding WITH_SSE41 alongside the existing SSE selections keeps your CMake options consistent with the rest of the advanced CPU feature toggles.


495-495: Ensuring feature info display for SSE4.1.

The call to add_feature_info(WITH_SSE41 WITH_SSE41 "Build with SSE41") ensures the feature summary accurately reflects SSE4.1’s optional status.

arch/x86/chorba_sse41.c (2)

1-2: Build guard for SSE4.1-based Chorba CRC.

Wrapping this code with #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) is consistent with other SSE-based CRC implementations. It keeps the file excluded if SSE4.1 or Chorba is disabled, preventing unintended compilation on unsupported targets.


16-19:

❓ Verification inconclusive

Check potential pointer offset calculations.

In READ_NEXT(in, off, a, b), you compute pointer offsets with (in + off / sizeof(uint64_t)). Ensure that all callers pass offsets in multiples of 8 bytes to avoid partial reads and alignment issues. Otherwise, consider a safer boundary check or a clearer pointer arithmetic approach.

Would you like to run a script to search for all invocations of READ_NEXT to confirm that offsets maintain the required alignment or size assumptions?


🏁 Script executed:

#!/bin/bash
rg -A3 "READ_NEXT\s*\(" 

Length of output: 3777


Pointer Offset Alignment Caution in READ_NEXT Macro

The READ_NEXT(in, off, a, b) macro computes offsets via (in + off / sizeof(uint64_t)) and is invoked in both arch/x86/chorba_sse41.c and arch/x86/chorba_sse2.c with the offset variable i. Please verify that all callers guarantee that the offset (i) is always a multiple of 8 bytes. If there’s any chance that this assumption might be violated, consider adding an assertion, boundary check, or clearer pointer arithmetic to enforce proper alignment and prevent partial reads.

__m128i init_crc = _mm_cvtsi64x_si128(crc);
crc = 0;

size_t i = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate variable name with no restricting scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean, the loops above are the restricted scope of i, it does not live outside of the for loop declaration.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C, outer scope starts when function starts. To be valid ANSI C, it would require explicit block scope. Some compilers will issue warning if variable in inner scope shadows variable in outer scope, but that warning can be suppressed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly certain declaring the variables in the loop parameters is the explicit scope block. C89 at one point wouldn't allow you to declare variables in that but I don't know that we are trying to cater that far back. GNU's C extension and afterward anything C99 compliant accepts variable declarations there.

#include <stdlib.h>

int main(void)
{
    for (int i = 0; i < 20; ++i) {};
    return i;
}
[adam@thinkpad ~]$ gcc test_scope.c -o test.out
test_scope.c: In function ‘main’:
test_scope.c:6:12: error: ‘i’ undeclared (first use in this function)
    6 |     return i;
      |            ^
test_scope.c:6:12: note: each undeclared identifier is reported only once for each function it appears in

Comment on lines +218 to +219
/* Couldn't tell you why but despite knowing that this is always false,
* removing this branch with GCC makes things significantly slower. Some
* loop bodies must be being joined or something */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abusing for loop for ranges with unknown start value disables some optimisations as compiler can't deduce how many iterations are run. The if block(s) pretty much give the missing information even though the conditions are reversed.

Optimising compiler should in theory duplicate the loop with known start values from if loops.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, possibly. I haven't compared the compiled code to see exactly what's happening but yeah it could be that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, removing the condition there seems to be generating a lot more code so yeah I'm guessing one or more of the loops are being joined together.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
arch/x86/chorba_sse41.c (4)

59-309: Function is large and repetitive, consider modularizing.

crc32_chorba_32768_nondestructive_sse41 is quite large and includes multiple loops handling different ranges of i. Splitting it into smaller inline helper functions (e.g., chunk realignment, XOR mixing, tail zeroing) may improve clarity without hurting performance too much.


106-183: Repeated loop logic can be consolidated.

There are multiple loop segments (lines 106-133, 135-158, 160-183) that load, realign, and store data with minimal differences. Consider merging or parameterizing these blocks for a more concise implementation, balancing readability with performance.


217-220: Confirm necessity of this seemingly no-op branch.

The comment indicates removing the branch harms performance. This often hints at subtle compiler-specific optimizations. Keep it if beneficial, but add a clearer explanatory comment for maintainers.


324-333: Replace magic number 72 with a named constant.

Using a recognized constant (like CHORBA_SMALL_THRESHOLD_64BIT) instead of hardcoding 72 clarifies the boundary logic and aligns with other threshold macros.

- } else if (aligned_len > 72) {
+ } else if (aligned_len > CHORBA_SMALL_THRESHOLD_64BIT) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 196ac09 and f12e8e5.

📒 Files selected for processing (12)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (4 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • arch/x86/x86_features.h
  • arch/x86/x86_features.c
  • test/benchmarks/benchmark_crc32.cc
  • arch/x86/Makefile.in
  • cmake/detect-intrinsics.cmake
  • win32/Makefile.msc
  • functable.c
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Graph Analysis (2)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
  • crc32_chorba_sse2 (29-29)
  • crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
  • crc32 (14-14)
⏰ Context from checks skipped due to timeout of 90000ms (160)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
  • GitHub Check: Ubuntu MinGW x86_64
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC RISC-V
  • GitHub Check: Ubuntu Clang PPC64 Power9
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC MIPS
  • GitHub Check: Ubuntu GCC PPC64 Power9
  • GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu GCC SSSE3 UBSAN
  • GitHub Check: Ubuntu GCC SSE2 UBSAN
  • GitHub Check: Ubuntu GCC ARM SF ASAN
  • GitHub Check: Ubuntu GCC Compat Symbol Prefix
  • GitHub Check: Ubuntu GCC -O1 UBSAN
  • GitHub Check: Ubuntu GCC Native Instructions (AVX)
🔇 Additional comments (20)
test/test_crc32.cc (3)

196-197: Test suite now includes additional 16KB buffer size.

The new test case follows the existing pattern and validates the CRC32 function with a 16KB buffer in addition to the existing 32KB test.


285-287: Correctly adjusted SSE feature check condition.

The change from X86_SSE2 to X86_SSE ensures the test is properly guarded by the appropriate feature macro.


288-290: Added correctly guarded SSE4.1 test case.

The new test case for chorba_sse41 is properly added with the correct conditional compilation directive, ensuring it only runs when the CPU supports SSE4.1 instructions.

arch/x86/x86_functions.h (4)

14-14: Properly renamed SSE capability macro.

Renaming NO_CHORBA_SSE2 to NO_CHORBA_SSE makes the macro more generic, allowing it to be used for both SSE2 and SSE4.1 variations, which improves maintainability.


39-43: Well-structured SSE4.1 function declaration.

The new section correctly introduces the SSE4.1 function with proper conditional compilation and follows the same pattern used for other instruction sets.


108-108: Updated SSE2 native support with the new macro name.

The reference to NO_CHORBA_SSE is consistent with the macro renaming earlier in the file.


123-126: Added proper SSE4.1 capability detection.

The conditional definition correctly adds SSE4.1 support with the proper compiler detection macros, including __SSE4_1__ as suggested in previous reviews.

configure (3)

114-114: Added SSE4.1 compiler flag.

The sse41flag is correctly defined for detecting and enabling SSE4.1 support.


1537-1551: Well-implemented SSE4.1 intrinsics detection.

The check_sse41_intrinsics function follows the established pattern for intrinsics detection, using the appropriate header file and intrinsic function to test for SSE4.1 support.


1680-1687: Correctly integrated SSE4.1 detection in x86 optimization setup.

The code to check for SSE4.1 support and set up the necessary flags and object files is inserted at the appropriate position in the build flow, between SSSE3 and SSE4.2.

CMakeLists.txt (5)

128-129: Properly ordered CPU feature dependency chain.

The SSE4.1 option is correctly added with a dependency on SSSE3, and SSE4.2 now depends on SSE4.1 instead of SSSE3, reflecting the actual dependency chain of x86 instruction sets.


149-149: Updated advanced options list to include SSE4.1.

SSE4.1 is correctly added to the list of advanced options.


995-1005: Well-structured SSE4.1 intrinsics check and compilation setup.

The SSE4.1 detection and setup follow the existing pattern, checking for both the feature option and intrinsics availability before enabling the feature.


1008-1008: Properly updated SSE4.2 dependency check.

The condition correctly checks for WITH_SSE41 instead of WITH_SSSE3, maintaining the proper dependency chain.


1495-1495: Added SSE4.1 to feature summary.

SSE4.1 is correctly added to the feature summary, providing clear information about its availability in the build.

arch/x86/chorba_sse41.c (5)

16-19: Double-check pointer alignment in macro.

_mm_load_si128() requires 16-byte aligned pointers. Though buffers appear properly aligned through ALIGNED_(16) in the surrounding code, confirm that the calling code never violates this assumption, especially when expanding or refactoring in the future.

Would you like to run a script to search for all macro calls to READ_NEXT to ensure the pointers used are always aligned?


28-34: Consider naming consistency in macro parameters.

For better maintainability, ensure the readable association between the macro parameters (e.g., in0, in1, etc.) and the internally referenced variables. While the current implementation appears consistent, keep an eye on potential copy-paste issues like mismatched variable names (as previously flagged in SSE2 macros).


238-255: Zeroing tail logic looks correct.

Zeroing the remaining portion of the buffer prevents uninitialized reads in subsequent SSE operations. This carefully avoids out-of-bounds writes by decrementing left_to_z. Good approach.


268-271: Pointer arithmetic with offsets.

Calls to READ_NEXT(input, i, …) and then XOR with bitbuffer rely on consistent scaling. Ensure that both input and bitbuffer are equally sized (uint64_t*) and remain 16-byte aligned to avoid platform discrepancies.


316-317: Alignment calculation is correct.

Using uintptr_t for alignment offsets (line 317) is the recommended approach to avoid 32-bit truncation hazards. This matches best practices and effectively prevents overflow.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
arch/x86/x86_functions.h (1)

14-14: NO_CHORBA_SSE macro may reduce code duplication.

Defining NO_CHORBA_SSE resolves potential stack-smashing issues on certain compilers. Confirm that this macro doesn’t unnecessarily block SSE usage in safe compilers.

arch/x86/chorba_sse41.c (2)

21-26: NEXT_ROUND macro performs multi-shift XOR operations.

These bitwise transforms for the SSE pipeline appear correct. Double-check the shift counts for potential off-by-one or partial-lane mismatches.


36-41: STORE4 macro systematically stores four vectors.

Batch-storing with _mm_store_si128 in sequential pointer increments is efficient. Investigate block-based memory prefetch if performance needs further improvement.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f12e8e5 and 07ad6c6.

📒 Files selected for processing (12)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (4 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
  • win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • arch/x86/x86_features.h
  • arch/x86/x86_features.c
  • functable.c
  • test/benchmarks/benchmark_crc32.cc
  • arch/x86/Makefile.in
  • win32/Makefile.msc
  • cmake/detect-intrinsics.cmake
  • CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Graph Analysis (2)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
  • crc32_chorba_sse2 (29-29)
  • crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
  • crc32 (14-14)
🔇 Additional comments (14)
configure (3)

114-114: Add SSE4.1 flag for enhanced performance.

Declaring sse41flag="-msse4.1" is correct for enabling SSE4.1 optimizations. Ensure consumers of this flag properly guard against environments lacking SSE4.1 support.


1537-1552: Check for SSE4.1 intrinsics looks good.

This added function cleanly tests for SSE4.1 by compiling a minimal _mm_min_epi32 snippet. Ensure that subsequent build steps use HAVE_SSE41_INTRIN consistently to avoid silent fallback in unsupported compilers.


2235-2235: SSE41FLAG substitution verified.

This substitution step for SSE41FLAG in the generated Makefile helps ensure the SSE4.1 compiler flag is applied. Check that the generated Makefile references do not conflict with other SSE flags.

test/test_crc32.cc (2)

196-197: Added mid-size buffer CRC test.

Great to see test coverage for a 16K buffer. This ensures coverage for an intermediate size range and helps verify correctness of SSE-based CRC functions.


285-290: New SSE tests for chorba_sse2 and chorba_sse41.

These conditional test macros extend coverage for SSE-based CRC calculations. Make sure future expansions (e.g., SSE4.2 variants) also follow a consistent naming approach.

arch/x86/x86_functions.h (3)

39-43: Conditional compilation for SSE4.1 chorba function.

Segregating SSE4.1 logic under X86_SSE41 is a clean approach. Ensure the fallback path remains valid for systems without SSE4.1.


108-112: Conditionally define native_crc32 to crc32_chorba_sse2.

This block properly gates SSE2-based CRC under the new NO_CHORBA_SSE. Ensure that other SSE definitions do not override or conflict inadvertently (e.g., SSE4.1).


123-126: Conditionally define native_crc32 to crc32_chorba_sse41.

This ensures a direct mapping to SSE4.1-based CRC when available. Verify that runtime CPU detection (if enabled) and macro-based detection remain in sync for consistent SSE4.1 usage.

arch/x86/chorba_sse41.c (6)

1-2: SSE4.1 Chorba file inclusion guard.

Good use of both WITHOUT_CHORBA and X86_SSE41 to restrict compilation to the correct builds.


16-19: Macro READ_NEXT introduces clarity.

Using _mm_load_si128 with offset arithmetic is readable. Confirm that the buffer alignment assumptions (16-byte boundaries) always hold when used in other contexts.


28-34: REALIGN_CHORBA macro re-bases four vectors.

The _mm_alignr_epi8 usage here looks correct. Ensure consistent usage of shift values and confirm that partial alignment cases are handled upstream.


43-48: READ4 macro systematically loads four vectors.

Corresponds well with STORE4. This consistency reduces code duplication. Great naming alignment and approach.


50-57: READ4_WITHXOR macro merges existing data.

Compactly applies _mm_xor_si128. Double-check that the indexing offset for in[] is always valid, especially for boundary cases.


311-341: crc32_chorba_sse41 offers a fully integrated SSE4.1 path.

  • Alignment offset logic is straightforward and robust.
  • The multi-threshold approach (large/medium/small) is a neat strategy for bridging different CRC methods.

Keep an eye on boundary tests to ensure correct transitions between “medium” and “small” thresholds.

Comment on lines +59 to +309
static Z_FORCEINLINE uint32_t crc32_chorba_32768_nondestructive_sse41(uint32_t crc, const uint64_t* buf, size_t len) {
const uint64_t* input = buf;
ALIGNED_(16) uint64_t bitbuffer[32768 / sizeof(uint64_t)];
__m128i *bitbuffer_v = (__m128i*)bitbuffer;
const uint8_t* bitbufferbytes = (const uint8_t*) bitbuffer;
__m128i z = _mm_setzero_si128();

__m128i *bitbuf128 = &bitbuffer_v[64];
__m128i *bitbuf144 = &bitbuffer_v[72];
__m128i *bitbuf182 = &bitbuffer_v[91];
__m128i *bitbuf210 = &bitbuffer_v[105];
__m128i *bitbuf300 = &bitbuffer_v[150];
__m128i *bitbuf0 = bitbuf128;
__m128i *inptr = (__m128i*)input;

/* We only need to zero out the bytes between the 128'th value and the 144th
* that are actually read */
__m128i *z_cursor = bitbuf128;
for (size_t i = 0; i < 2; ++i) {
STORE4(z, z, z, z, z_cursor);
}

/* We only need to zero out the bytes between the 144'th value and the 182nd that
* are actually read */
z_cursor = bitbuf144 + 8;
for (size_t i = 0; i < 11; ++i) {
_mm_store_si128(z_cursor++, z);
}

/* We only need to zero out the bytes between the 182nd value and the 210th that
* are actually read. */
z_cursor = bitbuf182;
for (size_t i = 0; i < 4; ++i) {
STORE4(z, z, z, z, z_cursor);
}

/* We need to mix this in */
__m128i init_crc = _mm_cvtsi64x_si128(crc);
crc = 0;

size_t i = 0;

/* Previous iteration runs carried over */
__m128i buf144 = z;
__m128i buf182 = z;
__m128i buf210 = z;

for(; i + 300*8+64 < len && i < 22 * 8; i += 64) {
__m128i in12, in34, in56, in78,
in_1, in23, in45, in67, in8_;

READ4(in12, in34, in56, in78, inptr);

if (i == 0) {
in12 = _mm_xor_si128(in12, init_crc);
}

REALIGN_CHORBA(in12, in34, in56, in78,
in_1, in23, in45, in67, in8_, 8);

__m128i a = _mm_xor_si128(buf144, in_1);

STORE4(a, in23, in45, in67, bitbuf144);
buf144 = in8_;

__m128i e = _mm_xor_si128(buf182, in_1);
STORE4(e, in23, in45, in67, bitbuf182);
buf182 = in8_;

__m128i m = _mm_xor_si128(buf210, in_1);
STORE4(m, in23, in45, in67, bitbuf210);
buf210 = in8_;

STORE4(in12, in34, in56, in78, bitbuf300);
}

for(; i + 300*8+64 < len && i < 32 * 8; i += 64) {
__m128i in12, in34, in56, in78,
in_1, in23, in45, in67, in8_;
READ4(in12, in34, in56, in78, inptr);

REALIGN_CHORBA(in12, in34, in56, in78,
in_1, in23, in45, in67, in8_, 8);

__m128i a = _mm_xor_si128(buf144, in_1);

STORE4(a, in23, in45, in67, bitbuf144);
buf144 = in8_;

__m128i e, f, g, h;
e = _mm_xor_si128(buf182, in_1);
READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
STORE4(e, f, g, h, bitbuf182);

__m128i m = _mm_xor_si128(buf210, in_1);
STORE4(m, in23, in45, in67, bitbuf210);
buf210 = in8_;

STORE4(in12, in34, in56, in78, bitbuf300);
}

for(; i + 300*8+64 < len && i < 84 * 8; i += 64) {
__m128i in12, in34, in56, in78,
in_1, in23, in45, in67, in8_;
READ4(in12, in34, in56, in78, inptr);

REALIGN_CHORBA(in12, in34, in56, in78,
in_1, in23, in45, in67, in8_, 8);

__m128i a, b, c, d;
a = _mm_xor_si128(buf144, in_1);
READ4_WITHXOR(b, c, d, buf144, in23, in45, in67, in8_, bitbuf144);
STORE4(a, b, c, d, bitbuf144);

__m128i e, f, g, h;
e = _mm_xor_si128(buf182, in_1);
READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
STORE4(e, f, g, h, bitbuf182);

__m128i m = _mm_xor_si128(buf210, in_1);
STORE4(m, in23, in45, in67, bitbuf210);
buf210 = in8_;

STORE4(in12, in34, in56, in78, bitbuf300);
}

for(; i + 300*8+64 < len; i += 64) {
__m128i in12, in34, in56, in78,
in_1, in23, in45, in67, in8_;

if (i < 128 * 8) {
READ4(in12, in34, in56, in78, inptr);
} else {
in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
}

// [0, 145, 183, 211]

/* Pre Penryn CPUs the unpack should be faster */
REALIGN_CHORBA(in12, in34, in56, in78,
in_1, in23, in45, in67, in8_, 8);

__m128i a, b, c, d;
a = _mm_xor_si128(buf144, in_1);
READ4_WITHXOR(b, c, d, buf144, in23, in45, in67, in8_, bitbuf144);
STORE4(a, b, c, d, bitbuf144);

__m128i e, f, g, h;
e = _mm_xor_si128(buf182, in_1);
READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
STORE4(e, f, g, h, bitbuf182);

__m128i n, o, p;
__m128i m = _mm_xor_si128(buf210, in_1);

/* Couldn't tell you why but despite knowing that this is always false,
* removing this branch with GCC makes things significantly slower. Some
* loop bodies must be being joined or something */
if (i < 84 * 8) {
n = in23;
o = in45;
p = in67;
buf210 = in8_;
} else {
READ4_WITHXOR(n, o, p, buf210, in23, in45, in67, in8_, bitbuf210);
}

STORE4(m, n, o, p, bitbuf210);
STORE4(in12, in34, in56, in78, bitbuf300);
}

/* Second half of stores bubbled out */
_mm_store_si128(bitbuf144, buf144);
_mm_store_si128(bitbuf182, buf182);
_mm_store_si128(bitbuf210, buf210);

/* We also have to zero out the tail */
size_t left_to_z = len - (300*8 + i);
__m128i *bitbuf_tail = (__m128i*)(bitbuffer + 300 + i/8);
while (left_to_z >= 64) {
STORE4(z, z, z, z, bitbuf_tail);
left_to_z -= 64;
}

while (left_to_z >= 16) {
_mm_store_si128(bitbuf_tail++, z);
left_to_z -= 16;
}

uint8_t *tail_bytes = (uint8_t*)bitbuf_tail;
while (left_to_z--) {
*tail_bytes++ = 0;
}

ALIGNED_(16) uint64_t final[9] = {0};
__m128i next12, next34, next56;
next12 = z;
next34 = z;
next56 = z;

for(; (i + 72 < len); i += 32) {
__m128i in1in2, in3in4;
__m128i in1in2_, in3in4_;
__m128i ab1, ab2, ab3, ab4;
__m128i cd1, cd2, cd3, cd4;

READ_NEXT(input, i, in1in2, in3in4);
READ_NEXT(bitbuffer, i, in1in2_, in3in4_);

in1in2 = _mm_xor_si128(_mm_xor_si128(in1in2, in1in2_), next12);
in3in4 = _mm_xor_si128(in3in4, in3in4_);

NEXT_ROUND(in1in2, ab1, ab2, ab3, ab4);

__m128i a2_ = _mm_slli_si128(ab2, 8);
__m128i ab1_next34 = _mm_xor_si128(next34, ab1);
in3in4 = _mm_xor_si128(in3in4, ab1_next34);
in3in4 = _mm_xor_si128(a2_, in3in4);
NEXT_ROUND(in3in4, cd1, cd2, cd3, cd4);

__m128i b2c2 = _mm_alignr_epi8(cd2, ab2, 8);
__m128i a4_ = _mm_slli_si128(ab4, 8);
a4_ = _mm_xor_si128(b2c2, a4_);
next12 = _mm_xor_si128(ab3, a4_);
next12 = _mm_xor_si128(next12, cd1);

__m128i d2_ = _mm_srli_si128(cd2, 8);
__m128i b4c4 = _mm_alignr_epi8(cd4, ab4, 8);
next12 = _mm_xor_si128(next12, next56);
next34 = _mm_xor_si128(cd3, _mm_xor_si128(b4c4, d2_));
next56 = _mm_srli_si128(cd4, 8);
}

memcpy(final, input+(i / sizeof(uint64_t)), len-i);
__m128i *final128 = (__m128i*)final;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next12));
++final128;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next34));
++final128;
_mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next56));

uint8_t* final_bytes = (uint8_t*) final;

for(size_t j = 0; j < (len-i); j++) {
crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i)]) & 0xff] ^ (crc >> 8);
}
return crc;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

crc32_chorba_32768_nondestructive_sse41 large function mechanism.

  1. The zero-initialization of partial segments is well-labeled, but quite verbose.
  2. The branch to handle partial usage of bitbuf0 might be tricky to maintain.
  3. The multi-block realign loops with variable i ranges are sophisticated but can hamper readability.

Consider splitting into smaller helper functions for maintainability.

Copy link
Member

@Dead2 Dead2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Dead2 Dead2 added the Rebase needed Please do a 'git rebase develop yourbranch' label Apr 14, 2025
@Dead2
Copy link
Member

Dead2 commented Apr 14, 2025

@KungFuJesus Sorry, this needs a rebase now.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
arch/x86/chorba_sse41.c (2)

106-133: Consider extracting loop phases for better maintainability.

The main processing loop is split into multiple phases (lines 106, 135, 160, 185) with similar but slightly different implementations. While this likely optimizes for specific iteration counts, it makes the code harder to maintain. Consider extracting common patterns into helper functions.


328-328: Replace hardcoded constant with macro for consistency.

Line 329 uses CHORBA_SMALL_THRESHOLD_64BIT for the small buffer threshold, but other parts of the codebase might use the direct value (72). For consistency and maintainability, consider using the macro everywhere.

test/benchmarks/benchmark_compare256.cc (1)

62-62: Consider broader or boundary-specific argument coverage.
Switching from a continuous range to fixed arguments is valid, though you might want to add edge cases (e.g., 2, 256, 257 if applicable) for better stress testing.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07ad6c6 and 0039ea0.

📒 Files selected for processing (14)
  • CMakeLists.txt (5 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/compare256_avx512.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (6 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (5 hunks)
  • functable.c (3 hunks)
  • test/benchmarks/benchmark_compare256.cc (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_compare256.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • arch/x86/x86_features.h
  • arch/x86/x86_features.c
  • test/benchmarks/benchmark_crc32.cc
  • functable.c
  • cmake/detect-intrinsics.cmake
  • arch/x86/x86_functions.h
  • test/test_crc32.cc
  • configure
  • arch/x86/Makefile.in
  • CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (7)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
🧬 Code Graph Analysis (2)
test/test_compare256.cc (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (70-70)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
  • compare256_avx512 (70-70)
🔇 Additional comments (18)
arch/x86/chorba_sse41.c (10)

1-1: Correct guard condition.

The preprocessor condition correctly guards the SSE4.1 implementation based on both feature support and CRC implementation choice.


28-34: LGTM: Macro implementation looks good.

The REALIGN_CHORBA macro correctly uses SSE4.1's _mm_alignr_epi8 instruction which is a key performance improvement over the SSE2 implementation, as mentioned in the PR objectives.


61-71: Consider L1 cache-friendly buffer sizing.

The stack buffer allocation of 32KB (bitbuffer[32768 / sizeof(uint64_t)]) aligns with the PR objective of optimizing for L1 cache sizes. The fixed offsets into this buffer are carefully chosen to optimize memory access patterns.


74-95: Selective zero initialization is a good optimization.

The selective zeroing of only the parts of the buffer that will be read, rather than the entire buffer, matches the PR's objective of reducing initialization overhead.


217-231: Keep the performance-critical branch condition.

The comment at line 217 explains why removing an apparently unnecessary condition degrades performance. This is a valuable insight about compiler optimization behavior and should be preserved.


238-254: Thorough tail zeroing implementation.

The code carefully handles zeroing the tail bytes with progressively smaller granularity (64-byte, 16-byte, single-byte). This approach maximizes SIMD utilization while ensuring correctness.


262-293: Efficient SSE4.1 vectorized folding algorithm.

This section implements the Chorba folding algorithm using SSE4.1 intrinsics like _mm_alignr_epi8 instead of shifts, which aligns with the PR objective of improving performance by reducing shift operations.


316-317: Use uintptr_t consistently for pointer alignment arithmetic.

The code correctly uses uintptr_t for pointer alignment arithmetic, which is safer than using integer types that might not match the platform's pointer width.


324-333: Well-structured dispatching logic based on buffer size.

The function correctly dispatches to different implementations based on buffer size thresholds:

  • Large buffers (>64KB): crc32_chorba_118960_nondestructive
  • Medium buffers (8KB-32KB): crc32_chorba_32768_nondestructive_sse41 (this implementation)
  • Small buffers (>72 bytes): chorba_small_nondestructive_sse2
  • Tiny buffers: crc32_braid_internal

This segmentation allows for optimal handling of different buffer sizes.


52-57:

✅ Verification successful

Verify comment and indexing in READ4_WITHXOR macro.

The comment indicates this macro is "intentionally shifted one down", but make sure the array indexing starting from 1 (in[1], in[2], etc.) instead of 0 is intentional and matches the expected behavior when reading with XOR operations.


🏁 Script executed:

#!/bin/bash
# Search for other usages of similar indexing patterns
rg -B 3 -A 3 "_mm_xor_si128\(in\[\d\]" --glob "*.c"

Length of output: 783


Indexing in READ4_WITHXOR Macro Confirmed

The search output indicates that the only usage of the _mm_xor_si128(in[<index>], ...) pattern is in arch/x86/chorba_sse41.c. The file contains a comment explaining that the array is intentionally shifted by one to compensate for the deferred store from the last iteration. This confirms that the use of in[1], in[2], in[3], and in[4] is deliberate and consistent with the intended behavior.

No changes are necessary.

test/test_compare256.cc (1)

82-84: Good coverage for AVX512.
This test definition cleanly parallels existing SSE2 and AVX2 variants. No issues found.

test/benchmarks/benchmark_compare256.cc (1)

83-85: Benchmark definition for AVX512 is consistent.
This mirrors the handling of other SIMD variants.

arch/x86/compare256_avx512.c (6)

1-5: Validate copyright year.
It's set to “2025.” Please confirm that this is accurate or if you want to adopt a more flexible approach.


7-11: Includes look consistent.
No concerns or redundant imports.


12-19: Confirm macro definitions.
Double-check that X86_AVX512 and HAVE_BUILTIN_CTZLL are reliably set to prevent unexpected build failures.


19-80: Efficient unrolled comparison logic.
Splitting the comparison into a 16-byte chunk followed by 64-byte chunks (with overlap) is a well-reasoned approach for performance. The use of _mm512_cmpeq_epu8_mask and __builtin_ctzll to compute the mismatch offset appears correct.


82-85: Wrapper function is straightforward.
Clean pass-through to the static function, consistent with other architecture-specific compare256 implementations.


86-96: Template-based match generation.
Referencing match_tpl.h twice to produce both fast and slow path functions aligns well with existing patterns.

This is ~25-30% faster than the SSE2 variant on a core2 quad. The main reason
for this has to do with the fact that, while incurring far fewer shifts,
an entirely separate stack buffer has to be managed that is the size of
the L1 cache on most CPUs. This was one of the main reasons the 32k
specialized function was slower for the scalar counterpart, despite auto
vectorizing. The auto vectorized loop was setting up the stack buffer at
unaligned offsets, which is detrimental to performance pre-nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.

There are a ton of loads and stores happening, and for sure we are bound
on the fill buffer + store forwarding. An SSE2 version of this code is
probably possible by simply replacing the shifts with unpacks with zero
and the palignr's with shufpd's. I'm just not sure it'll be all that worth
it, though. We are gating against SSE4.1 not because we are using specifically
a 4.1 instruction but because that marks when Wolfdale came out and palignr
became a lot faster.
@KungFuJesus
Copy link
Contributor Author

Rebased

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
arch/x86/chorba_sse41.c (1)

59-309: Consider decomposing large function for maintainability.

The crc32_chorba_32768_nondestructive_sse41 function is 250+ lines long with multiple complex sections. While the current implementation may be optimized for performance, consider decomposing it into smaller helper functions for better maintainability:

  1. Initialization function for buffer setup
  2. Main processing function for different iteration phases
  3. Tail handling function for remaining bytes

This would improve readability while allowing the compiler to still inline these functions when appropriate.

🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)

189-196: Consider compiler-friendly pointer access pattern.

The code uses explicit casting and pointer arithmetic in multiple places. While functionally correct, consider a more compiler-friendly approach:

- in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
+ in12 = _mm_xor_si128(inptr[0], bitbuf0[0]);
+ in34 = _mm_xor_si128(inptr[1], bitbuf0[1]);
+ in56 = _mm_xor_si128(inptr[2], bitbuf0[2]);
+ in78 = _mm_xor_si128(inptr[3], bitbuf0[3]);
+ inptr += 4;
+ bitbuf0 += 4;

This approach can help with register allocation and might prevent false positives in static analyzers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0039ea0 and 5890b4b.

📒 Files selected for processing (11)
  • CMakeLists.txt (4 hunks)
  • arch/x86/Makefile.in (3 hunks)
  • arch/x86/chorba_sse41.c (1 hunks)
  • arch/x86/x86_features.c (1 hunks)
  • arch/x86/x86_features.h (1 hunks)
  • arch/x86/x86_functions.h (4 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (4 hunks)
  • functable.c (2 hunks)
  • test/benchmarks/benchmark_crc32.cc (1 hunks)
  • test/test_crc32.cc (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • arch/x86/Makefile.in
🚧 Files skipped from review as they are similar to previous changes (9)
  • arch/x86/x86_features.c
  • functable.c
  • cmake/detect-intrinsics.cmake
  • arch/x86/x86_functions.h
  • test/test_crc32.cc
  • arch/x86/x86_features.h
  • test/benchmarks/benchmark_crc32.cc
  • configure
  • CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (7)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
⏰ Context from checks skipped due to timeout of 90000ms (130)
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
  • GitHub Check: Analyze (cpp)
  • GitHub Check: Windows MSVC ARM64 No Test
  • GitHub Check: macOS GCC UBSAN (ARM64)
  • GitHub Check: macOS GCC UBSAN
  • GitHub Check: macOS Clang (Target 10.10)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: Ubuntu GCC SSE4.2 UBSAN
  • GitHub Check: Ubuntu Clang
  • GitHub Check: Ubuntu Clang No Optim
  • GitHub Check: Ubuntu GCC AARCH64
  • GitHub Check: Ubuntu Clang No Threads
🔇 Additional comments (12)
arch/x86/chorba_sse41.c (12)

1-12: Properly gated code for SSE4.1 support.

The code begins with appropriate conditional compilation directives to ensure it's only built when Chorba is enabled and SSE4.1 support is available. All necessary headers are included for SSE intrinsics, CRC32 functionality, and platform-specific optimizations.


28-34: REALIGN_CHORBA macro has correct parameter usage.

The REALIGN_CHORBA macro implementation properly uses the parameter names (in0, in1, in2, in3) in the body of the macro, which was fixed from an earlier version where parameter names didn't match the usage.


71-93: Selective zero initialization is correctly implemented.

The code optimizes performance by only zeroing out specific portions of the buffer that will be read later, rather than zeroing the entire buffer. This aligns with the PR objective of "selectively applying zero initialization to reduce overhead."


217-229: Good documentation for counter-intuitive optimization.

The comment explaining why a seemingly redundant branch check is maintained for performance reasons is valuable. This kind of documentation helps prevent future optimizations that might inadvertently degrade performance.


238-254: Efficient tail zeroing with progressively finer granularity.

The code efficiently zeroes out the tail portion by starting with large 64-byte chunks, then 16-byte chunks, and finally handling individual bytes. This approach minimizes the number of operations while ensuring complete zeroing.


296-302: XOR final values with accumulated state.

The code correctly combines the accumulated state with the final values using XOR operations, which is essential for proper CRC calculation.


305-308: Final CRC calculation with table lookup.

The code finalizes the CRC calculation by combining bytes from both the input and bitbuffer with a table-based approach, ensuring correct handling of the remaining bytes.


316-337: Dispatch strategy aligns with PR objectives.

The dispatch logic selects the appropriate implementation based on input size and alignment:

  1. Uses the new SSE4.1 implementation for medium-sized inputs (8192-32768 bytes)
  2. Falls back to other optimized implementations for other input sizes
  3. Handles unaligned initial bytes with a fallback method

This strategy aligns with the PR objective of leveraging SSE4.1 instructions for better performance in specific cases.


328-328: SSE4.1 optimization targets appropriate buffer size range.

The SSE4.1 implementation is specifically targeted at medium-sized buffers between 8KB and 32KB, which aligns with the PR objective of optimizing for L1 cache sizes on most CPUs. Benchmark results in the PR comments confirm a significant performance gain (~30%) over the SSE2 variant in this range.


329-329: Use CHORBA_SMALL_THRESHOLD_64BIT macro instead of literal value.

Line 329 correctly uses the macro CHORBA_SMALL_THRESHOLD_64BIT for consistency with the rest of the codebase. This fixed a previous version that used a magic number (72).


317-317: Pointer alignment calculation uses uintptr_t.

The code correctly uses uintptr_t for pointer alignment arithmetic to ensure portability across different platforms where the size of unsigned long might differ from pointer width.


59-62:

❓ Verification inconclusive

Consider stack size implications of large stack buffer.

The function allocates a 32KB aligned buffer on the stack. While this is intentionally sized to match L1 cache on most CPUs (as mentioned in PR objectives), be aware that this large stack allocation could cause issues on platforms with limited stack space.


🏁 Script executed:

#!/bin/bash
# Check if there are any stack size limitations set in the build system
rg -A 2 "stack-limit|stack-size|stacksize" --include="*.{cmake,mk,mak,in}"

Length of output: 518


Action Required: Manually Verify Stack Allocation Impact

In the function defined in arch/x86/chorba_sse41.c (lines 59–62), a 32KB aligned stack buffer is intentionally allocated to match the L1 cache size on most platforms. However, no build configuration settings (e.g., stack size limits) were automatically detected by our initial search. Since the command to check for such settings returned an error and provided no insight, please manually verify that this large stack allocation is acceptable on all target platforms—especially those with limited stack space.

  • Location: arch/x86/chorba_sse41.c, lines 59–62.
  • Recommendation: Review your build configurations and platform documentation for any potential stack limitations. Ensure that the intentional choice of a 32KB allocation does not risk stack overflow or other issues on platforms with lower default stack sizes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Architecture Architecture specific optimization Rebase needed Please do a 'git rebase develop yourbranch'

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants