SSE4.1 optimized chorba #1893

KungFuJesus · 2025-03-28T22:04:51Z

This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main reason has to do with the fact that, while this approach incurs far fewer shifts, it has to manage an entirely separate stack buffer that is the size of the L1 cache on most CPUs. That buffer was one of the main reasons the 32k specialized function was slower for the scalar counterpart, despite auto-vectorizing: the auto-vectorized loop was setting up the stack buffer at unaligned offsets, which is detrimental to performance on pre-Nehalem CPUs. Additionally, we were losing a fair bit of time to the zero initialization, which we now do more selectively.
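
To illustrate the alignment and selective-zeroing points, here is a minimal C11 sketch. The buffer size, the `sketch_` names, and the zeroed range passed in by the caller are assumptions for illustration; they are not taken from `chorba_sse41.c`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size: 32 KiB of 64-bit words, roughly one L1 data cache. */
#define SKETCH_WORDS (32768 / sizeof(uint64_t))

/* Returns 1 if the scratch buffer is 16-byte aligned and the requested
 * slice [zero_from, zero_to) reads back as zero. */
static int sketch_scratch_ok(size_t zero_from, size_t zero_to) {
    /* alignas(16) permits aligned 128-bit loads and stores on the buffer. */
    alignas(16) uint64_t scratch[SKETCH_WORDS];

    assert(zero_from <= zero_to && zero_to <= SKETCH_WORDS);

    /* Zero only the slice that is read before it is written,
     * instead of clearing all 32 KiB up front. */
    memset(scratch + zero_from, 0, (zero_to - zero_from) * sizeof(uint64_t));

    int aligned = ((uintptr_t)scratch % 16) == 0;
    int zeroed = 1;
    for (size_t i = zero_from; i < zero_to; i++)
        zeroed &= (scratch[i] == 0);
    return aligned && zeroed;
}
```

Forcing the alignment in the declaration avoids relying on wherever the compiler happens to place the buffer, which is the situation described for the auto-vectorized loop.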

There are a ton of loads and stores happening, and we are almost certainly bound on the fill buffer and store forwarding. An SSE2 version of this code is probably possible by simply replacing the shifts with unpacks against zero and the palignr instructions with shufpd. I'm just not sure it would be worth it, though. We are gating on SSE4.1 not because we specifically use a 4.1 instruction, but because that marks when Wolfdale came out and palignr became a lot faster.
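
The palignr-to-shufpd substitution can be checked with a scalar model. The sketch below is not the PR's code: it models a 128-bit register as two 64-bit lanes and assumes the palignr offset is exactly 8 bytes, which is the only offset where a 64-bit lane shuffle is equivalent. All `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 128-bit register: lane[0] is the low 64 bits. */
typedef struct { uint64_t lane[2]; } v128_model;

/* Model of _mm_alignr_epi8(hi, lo, 8): concatenate hi:lo,
 * shift right by 8 bytes, keep the low 128 bits. */
static v128_model model_alignr8(v128_model hi, v128_model lo) {
    v128_model r = { { lo.lane[1], hi.lane[0] } };
    return r;
}

/* Model of _mm_shuffle_pd(a, b, imm): bit 0 of imm selects the lane of a
 * for the low half, bit 1 selects the lane of b for the high half. */
static v128_model model_shufpd(v128_model a, v128_model b, int imm) {
    assert(imm >= 0 && imm <= 3);
    v128_model r = { { a.lane[imm & 1], b.lane[(imm >> 1) & 1] } };
    return r;
}

/* Model of _mm_unpackhi_epi64(x, zero), which matches a byte shift
 * right by 8 (_mm_srli_si128(x, 8)). */
static v128_model model_unpackhi_zero(v128_model x) {
    v128_model r = { { x.lane[1], 0 } };
    return r;
}

/* Returns 1 when the SSE2-only shuffle reproduces the 8-byte palignr. */
static int model_alignr8_matches_shufpd(v128_model hi, v128_model lo) {
    v128_model x = model_alignr8(hi, lo);
    v128_model y = model_shufpd(lo, hi, 1);
    return x.lane[0] == y.lane[0] && x.lane[1] == y.lane[1];
}
```

With real intrinsics this would correspond to replacing `_mm_alignr_epi8(hi, lo, 8)` with `_mm_shuffle_pd` on the operands cast to `__m128d`, using an immediate of 1. Whether every palignr in the new file uses an 8-byte offset is not stated in this thread.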

Summary by CodeRabbit

New Features
- Added support for SSE4.1 instruction set on x86 processors, enabling optimized CRC32 checksum calculations for improved performance on compatible hardware.
Bug Fixes
- None.
Tests
- Introduced new benchmarks and test cases for the SSE4.1-optimized CRC32 implementation, including tests for smaller buffer sizes.
Chores
- Updated build scripts and configuration tools to detect and enable SSE4.1 support when available.

coderabbitai · 2025-03-28T22:04:58Z

Walkthrough

This change introduces support for the SSE4.1 instruction set across the build system, runtime feature detection, and CRC32 computation functionality for x86 architectures. It adds new build options and detection logic for SSE4.1 in CMake, configure scripts, and Makefiles. The runtime feature detection is extended to recognize SSE4.1, and a new SSE4.1-optimized CRC32 implementation is provided. Function pointer assignments and test/benchmark registration logic are updated to utilize the new SSE4.1 variant when available. Corresponding test cases and benchmarks are included to validate and measure the new implementation.

Changes

File(s)	Change Summary
CMakeLists.txt, cmake/detect-intrinsics.cmake	Add build option and detection macro for SSE4.1; update dependency chain for SSE-related options; add feature summary and reporting for SSE4.1.
configure, arch/x86/Makefile.in	Add SSE4.1 compiler flag, detection logic, and object file rules; update architecture-specific build and feature checks for SSE4.1.
arch/x86/chorba_sse41.c	New file implementing CRC32 calculation optimized for SSE4.1 using the Chorba algorithm, including specialized routines and vectorized processing.
arch/x86/x86_features.c, arch/x86/x86_features.h	Add runtime detection and struct member for SSE4.1 CPU feature.
arch/x86/x86_functions.h	Add declaration for `crc32_chorba_sse41`; update macro guards and native function mappings for SSE4.1 support.
functable.c	Assign CRC32 function pointer to SSE4.1 variant if available at runtime; update macro guards for SSE2/SSE4.1.
test/benchmarks/benchmark_crc32.cc	Register new benchmark for SSE4.1 CRC32 variant; update macro guards for SSE2/SSE4.1.
test/test_crc32.cc	Add new test case for SSE4.1 CRC32; update macro guards; add smaller buffer test case.

Sequence Diagram(s)

sequenceDiagram
    participant BuildSystem
    participant CPU
    participant FeatureDetect
    participant CRC32API
    participant SSE2Impl
    participant SSE41Impl
    participant User

    User->>CRC32API: Call crc32(...)
    CRC32API->>FeatureDetect: Query CPU features
    FeatureDetect->>CPU: CPUID
    CPU-->>FeatureDetect: Feature bits (including SSE4.1)
    FeatureDetect-->>CRC32API: has_sse41, has_sse2, etc.
    alt SSE4.1 available
        CRC32API->>SSE41Impl: crc32_chorba_sse41(...)
    else SSE2 available
        CRC32API->>SSE2Impl: crc32_chorba_sse2(...)
    else
        CRC32API->>CRC32API: Use fallback implementation
    end
    CRC32API-->>User: Return CRC32 result

Suggested reviewers

nmoinvaz


KungFuJesus · 2025-03-28T22:17:02Z

2025-03-28T18:16:29-04:00
Running ./benchmark_zlib
Run on (4 X 3003 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 6144 KiB (x2)
Load Average: 1.91, 1.36, 0.75
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
crc32/generic_chorba/32768      15754 ns        15762 ns        44401
crc32/braid/32768               24084 ns        24095 ns        29072
crc32/chorba_sse2/32768          8991 ns         8994 ns        77933
crc32/chorba_sse41/32768         6264 ns         6267 ns       111684
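
For reading the numbers above, a small sketch of the arithmetic follows. The helper names are hypothetical; the inputs are the CPU-time column and the 32768-byte buffer size from the benchmark output.

```c
#include <assert.h>

/* Fraction of run time saved when going from t_old_ns to t_new_ns. */
static double time_saved_fraction(double t_old_ns, double t_new_ns) {
    assert(t_old_ns > 0.0);
    return (t_old_ns - t_new_ns) / t_old_ns;
}

/* Throughput in bytes per nanosecond, numerically equal to GB/s (10^9 bytes/s). */
static double throughput_gb_per_s(double bytes, double t_ns) {
    assert(t_ns > 0.0);
    return bytes / t_ns;
}
```

With 8994 ns for chorba_sse2 and 6267 ns for chorba_sse41, `time_saved_fraction` comes out at roughly 0.30, which matches the upper end of the 25-30% claim for this single run. The corresponding throughputs are about 3.6 GB/s and 5.2 GB/s. The output also warns that CPU scaling was enabled, so these figures may be noisy.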

codecov · 2025-03-28T22:18:58Z

Codecov Report

Attention: Patch coverage is 95.07389% with 10 lines in your changes missing coverage. Please review.

Project coverage is 81.92%. Comparing base (00a3168) to head (5890b4b).
Report is 2 commits behind head on develop.

Files with missing lines	Patch %	Lines
arch/x86/chorba_sse41.c	96.41%	6 Missing and 1 partial ⚠️
functable.c	66.66%	0 Missing and 1 partial ⚠️
test/benchmarks/benchmark_crc32.cc	0.00%	1 Missing ⚠️
test/test_crc32.cc	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #1893       +/-   ##
============================================
+ Coverage    40.29%   81.92%   +41.63%     
============================================
  Files           73      146       +73     
  Lines         7684    13267     +5583     
  Branches      1281     2944     +1663     
============================================
+ Hits          3096    10869     +7773     
+ Misses        4363     1493     -2870     
- Partials       225      905      +680


github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (8)

test/benchmarks/benchmark_crc32.cc (1)
74-76: Consider using a dedicated SSE4.1 compile guard

The code currently uses the same NO_CHORBA_SSE2 guard for both SSE2 and SSE4.1 implementations. Consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 implementation to allow more fine-grained control.
-#       if defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+#       if defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
        BENCHMARK_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41);
#       endif
arch/x86/Makefile.in (1)
83-88: Use SFLAGS for shared object compilation

For consistency with other architecture-specific rules in this file, you should use $(SFLAGS) instead of $(CFLAGS) for the .lo target, which represents the shared object compilation.
chorba_sse41.lo:
-	$(CC) $(CFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
+	$(CC) $(SFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
test/test_crc32.cc (1)
288-290: Consider using a dedicated SSE4.1 compile guard

Similar to the benchmark file, consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 test to allow more fine-grained control over which implementations are tested.
-#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
TEST_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41)
#endif
configure (4)
39-39: Remove trailing whitespace in these lines.

They are causing linter warnings. Consider removing the extra spaces at the end of each of these lines.

Apply this diff to remove trailing spaces:
-# We only need to zero out the bytes between the 128'th value and the 144th 
+# We only need to zero out the bytes between the 128'th value and the 144th
...
-    ;;  
+    ;;
...
-  esac  
+  esac
...
-      --help)  
+      --help)
Also applies to: 104-104, 111-111, 118-118

114-114: Add a comment describing SSE4.1 usage.

The newly introduced sse41flag="-msse4.1" is self-explanatory but it might be helpful to document the rationale for enabling SSE4.1 and any potential compatibility considerations with older compilers or CPUs.

1537-1552: Ensure error handling for SSE4.1 intrinsics check.

The function check_sse41_intrinsics() compiles a test source to detect availability of SSE4.1. While this is correct for most compilers, consider gracefully handling exotic or older compilers that may not support -msse4.1 or <smmintrin.h>.

1683-1687: Add extra logging or condition checks for SSE4.1 availability.

The conditional block enabling SSE4.1 is good. Consider logging a short user-facing message to confirm SSE4.1 detection (similar to other SSE checks) for consistency and debuggability.
arch/x86/chorba_sse41.c (1)
39-39: Remove trailing whitespace.

The linter warns about trailing spaces on these lines. Removing them keeps the codebase clean and consistent.

Apply this diff to remove trailing spaces:
-/* We only need to zero out the bytes between the 128'th value and the 144th 
+/* We only need to zero out the bytes between the 128'th value and the 144th
...
-        _mm_store_si128(bitbuf144++, a);  
+        _mm_store_si128(bitbuf144++, a);
...
-        _mm_store_si128(bitbuf182++, e);  
+        _mm_store_si128(bitbuf182++, e);
...
-        buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_);  
+        buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_);
Also applies to: 104-104, 111-111, 118-118

🧰 Tools

🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 724dc0c and e1574e2.

📒 Files selected for processing (13)

CMakeLists.txt (4 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/chorba_sse41.c (1 hunks)
arch/x86/x86_features.c (1 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
arch/x86/x86_intrins.h (1 hunks)
cmake/detect-intrinsics.cmake (1 hunks)
configure (3 hunks)
functable.c (1 hunks)
test/benchmarks/benchmark_crc32.cc (1 hunks)
test/test_crc32.cc (2 hunks)
win32/Makefile.msc (3 hunks)

🧰 Additional context used

🧠 Learnings (2)

test/benchmarks/benchmark_crc32.cc (1)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

arch/x86/chorba_sse41.c (3)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

🧬 Code Definitions (3)

functable.c (1)

arch/x86/x86_functions.h (1)

crc32_chorba_sse41 (41-41)

test/benchmarks/benchmark_crc32.cc (1)

arch/x86/x86_functions.h (1)

crc32_chorba_sse41 (41-41)

arch/x86/x86_functions.h (1)

arch/generic/generic_functions.h (1)

crc32 (14-14)

🪛 GitHub Actions: Lint

configure

[warning] 1681-1681: Trailing whitespace.

arch/x86/chorba_sse41.c

[warning] 39-39: Trailing whitespace.

[warning] 104-104: Trailing whitespace.

[warning] 111-111: Trailing whitespace.

[warning] 118-118: Trailing whitespace.

🪛 GitHub Check: CodeQL

arch/x86/chorba_sse41.c

[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).

[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 148-148: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 149-149: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 150-150: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 151-151: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 185-185: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 186-186: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 187-187: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 188-188: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 196-196: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 197-197: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 198-198: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 199-199: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 245-245: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 246-246: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 247-247: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 248-248: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 256-256: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 257-257: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 258-258: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 259-259: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 278-278: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 279-279: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 280-280: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

[failure] 281-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

⏰ Context from checks skipped due to timeout of 90000ms (140)

GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
GitHub Check: macOS GCC UBSAN (ARM64)
GitHub Check: Ubuntu GCC SPARC64
GitHub Check: Windows MSVC 2022 v142 Win64
GitHub Check: Windows ClangCl Win32
GitHub Check: Windows ClangCl Win64
GitHub Check: Windows MSVC 2019 v141 Win32
GitHub Check: Windows GCC Compat No Opt
GitHub Check: Windows MSVC 2019 v140 Win64
GitHub Check: Windows MSVC 2022 v143 Win32
GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
GitHub Check: Ubuntu GCC Compat No Opt ASAN
GitHub Check: Ubuntu GCC Compat Symbol Prefix
GitHub Check: macOS GCC

🔇 Additional comments (21)

arch/x86/x86_intrins.h (1)

95-95: LGTM: Minor formatting change

The addition of a blank line after the include statement improves code readability without affecting functionality.

arch/x86/x86_features.h (1)

20-20: LGTM: Clean SSE4.1 support addition

The has_sse41 field is correctly placed in the CPU features structure between SSSE3 and SSE4.2, following the logical progression of Intel's instruction set extensions.

arch/x86/x86_features.c (1)

78-78: LGTM: Proper SSE4.1 detection implementation

The SSE4.1 feature detection uses the correct bitmask (0x80000) to check bit 19 of the ECX register, which aligns with Intel's documentation for identifying SSE4.1 support via CPUID.
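
A minimal sketch of the bit test described here. It assumes the ECX value has already been obtained from CPUID leaf 1; the function and macro names are hypothetical and do not come from `x86_features.c`.

```c
#include <stdint.h>

/* CPUID leaf 1 reports SSE4.1 in ECX bit 19, i.e. mask 0x80000. */
#define SKETCH_CPUID1_ECX_SSE41 (UINT32_C(1) << 19)

/* Returns 1 if the given CPUID leaf 1 ECX value advertises SSE4.1, else 0. */
static int sketch_has_sse41(uint32_t cpuid1_ecx) {
    return (cpuid1_ecx & SKETCH_CPUID1_ECX_SSE41) != 0;
}
```

Bit 20 of the same register reports SSE4.2, so an off-by-one in the shift would silently test the wrong feature.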

functable.c (1)

99-106: LGTM: Well-structured SSE4.1 function dispatch

The SSE4.1 conditional block is properly positioned between SSSE3 and SSE4.2 sections, maintaining the logical progression of feature checks. The implementation follows the established pattern in the codebase for conditionally selecting optimized implementations based on CPU capabilities.

The performance improvement (25-30% over SSE2 variant per PR description) will be automatically utilized on compatible systems without disrupting existing code paths.
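
The dispatch pattern can be sketched as follows. This is an illustration only: the struct, the `sketch_` names, and the stub implementations are hypothetical, and the stubs simply forward to a bitwise reference CRC-32 rather than any vectorized code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sketch_crc32_fn)(uint32_t crc, const uint8_t *buf, size_t len);

/* Hypothetical feature flags; the real feature struct has more members. */
struct sketch_cpu_features { int has_sse2; int has_sse41; };

/* Bitwise reference CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t sketch_crc32_generic(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Stubs standing in for the optimized variants. */
static uint32_t sketch_crc32_sse2(uint32_t crc, const uint8_t *buf, size_t len) {
    return sketch_crc32_generic(crc, buf, len);
}
static uint32_t sketch_crc32_sse41(uint32_t crc, const uint8_t *buf, size_t len) {
    return sketch_crc32_generic(crc, buf, len);
}

/* Later, more capable checks override earlier ones. */
static sketch_crc32_fn sketch_select_crc32(const struct sketch_cpu_features *cf) {
    sketch_crc32_fn fn = sketch_crc32_generic;
    assert(cf != NULL);
    if (cf->has_sse2)
        fn = sketch_crc32_sse2;
    if (cf->has_sse41)
        fn = sketch_crc32_sse41;
    return fn;
}
```

Ordering the checks from least to most capable means the last matching assignment wins, so callers pick up the fastest available variant without any change on their side.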

win32/Makefile.msc (2)

65-65: LGTM: Object file addition looks good

The new object file chorba_sse41.obj is correctly added to the OBJS list, following the same pattern as other architecture-specific implementations.

217-217: LGTM: Dependency rule correctly defined

The dependency rule for the new object file is correctly defined, specifying the source file and required headers.

test/benchmarks/benchmark_crc32.cc (1)

75-75: LGTM: Benchmark registration follows established pattern

The benchmark registration for the new SSE4.1 implementation correctly follows the established pattern, using the appropriate function and feature check.

arch/x86/Makefile.in (2)

16-16: LGTM: SSE4.1 flag correctly defined

The SSE4.1 flag is correctly defined with the appropriate compiler option -msse4.1.

39-39: LGTM: Object files correctly added

The new object files for the SSE4.1-optimized implementation are correctly added to the build targets.

test/test_crc32.cc (1)

196-197: LGTM: Additional test case for half-sized buffer

Adding a test case with a 16KB buffer size (half of the existing 32KB test) is a good approach to ensure the implementation works correctly with different buffer sizes.

arch/x86/x86_functions.h (2)

39-43: LGTM! SSE4.1 function declaration is properly guarded

The declaration of crc32_chorba_sse41 is appropriately guarded with the correct preprocessor directives, ensuring it's only available when SSE4.1 and SSE2 are enabled and the CHORBA implementation is not disabled.

123-126: LGTM! Native CRC32 redirection to SSE4.1 implementation

This change appropriately redirects the native_crc32 function to use the SSE4.1 implementation when SSE4.1 is available and the necessary prerequisites are met.

CMakeLists.txt (4)

129-129: LGTM! SSE41 dependency is correctly specified

The WITH_SSE41 option is properly defined as dependent on WITH_SSSE3, which matches the architecture dependency chain where SSE4.1 builds upon SSSE3 instructions.

149-150: LGTM! Advanced options updated to include SSE41

The mark_as_advanced section has been updated to include the WITH_SSE41 option, maintaining consistency with how other similar options are handled.

995-1005: LGTM! SSE41 build configuration implementation

The implementation for building with SSE4.1 follows the same pattern as other instruction set extensions:

- Checks for intrinsics support
- Sets up appropriate macros and source files
- Applies correct compilation flags

This change is well-structured and consistent with the project's coding standards.

1496-1496: LGTM! Feature summary for SSE41 added

The feature summary is properly updated to include information about the SSE41 build option, maintaining consistency with documentation for other instruction sets.

configure (1)

1680-1682: Maintain consistency with other checks.

You've seamlessly added check_sse41_intrinsics after check_ssse3_intrinsics. Ensure that the new check is also integrated into any relevant feature-detection logs or user-facing messages that summarize SSE capabilities.

🧰 Tools

🪛 GitHub Actions: Lint

[warning] 1681-1681: Trailing whitespace.

arch/x86/chorba_sse41.c (4)

1-12: Check for fallback scenarios.

You guard this code with #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && defined(X86_SSE2). Ensure that if either SSE2 or SSE4.1 is not available, the build system correctly compiles an alternative implementation or gracefully disables this feature.

13-25: Reassure consistent function declarations.

Make sure these external and macro definitions remain consistent with their declarations in crc32_braid_p.h and any additional headers (like crc32.h). If function signatures change, it could result in subtle linking issues.

51-51: Confirm pointer arithmetic for __m128i arrays.

Static analysis flagged “suspicious pointer scaling” at these lines. bitbuffer is an array of uint64_t, yet you cast and manipulate it as __m128i *. This is likely correct given the 16-byte alignment, but please confirm each offset multiplication matches your intended data layout.

Also applies to: 148-151, 185-188, 196-199, 245-248, 256-259, 278-281

🧰 Tools

🪛 GitHub Check: CodeQL

[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

376-406: Validate alignment logic for SSE usage.

In crc32_chorba_sse41(), you compute algn_diff to align the buffer to a 16-byte boundary. Make sure no performance regressions arise if len is small or if buf is nearly aligned (especially edge cases where algn_diff == len).
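The edge cases mentioned here can be made concrete with a small portable sketch. It assumes `algn_diff` is the distance to the next 16-byte boundary; `align_diff16` and `scalar_prefix_len` are hypothetical helpers, not functions from chorba_sse41.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to skip so that (buf + result) is 16-byte aligned; range 0..15.
 * Unsigned wraparound makes (0 - addr) & 15 equal to (16 - addr % 16) % 16. */
static size_t align_diff16(const void *buf) {
    return (size_t)((0 - (uintptr_t)buf) & 15u);
}

/* Hypothetical split: number of leading bytes handled by a scalar path
 * before any aligned 16-byte loads. Clamping to len covers short buffers
 * that end at or before the alignment boundary. */
static size_t scalar_prefix_len(const void *buf, size_t len) {
    assert(buf != NULL);
    size_t diff = align_diff16(buf);
    return diff < len ? diff : len;
}
```

For an address one byte past a 16-byte boundary the prefix is 15 bytes, so a 15-byte buffer at that address is consumed entirely by the scalar path and the vector path sees nothing. That is the `algn_diff == len` case the comment asks about.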


nmoinvaz · 2025-03-28T23:30:36Z

There are a bunch of these errors:

KungFuJesus · 2025-03-29T02:05:02Z

There are a bunch of these errors:

I mean, that is intentional: two 64-bit integers are the stride through a single __m128i. Maybe I'll modify the pointer type to appease CodeQL, though.

Err, actually no, that's me intentionally advancing an __m128i pointer. What else would it have me do? The intrinsic requires a __m128i pointer.
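The stride argument can be checked with a small portable sketch. Here `vec128` is a hypothetical stand-in for `__m128i` (two 64-bit lanes, 16 bytes) so the example builds without SSE headers, and `element_start` is an illustrative helper rather than code from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __m128i: two 64-bit lanes, 16 bytes total. */
typedef struct { uint64_t lane[2]; } vec128;

/* Address of the i-th 16-byte element of a buffer declared as uint64_t.
 * The addition is scaled by sizeof(vec128), so advancing by i elements
 * skips 2 * i uint64_t entries. This is the scaling CodeQL reports as
 * suspicious; it is correct when 16-byte steps are what is intended. */
static const uint64_t *element_start(const uint64_t *bitbuffer, size_t i) {
    const vec128 *p = (const vec128 *)bitbuffer;
    return (const uint64_t *)(p + i);
}
```

The check is a heuristic for accidental scaling, for example adding a byte offset to a typed pointer. Here the offsets are already expressed in 16-byte units, which matches what the load and store intrinsics expect.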

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

arch/x86/chorba_sse41.c (1)

28-374: 🛠️ Refactor suggestion

Refactor the large function for maintainability.
This function (crc32_chorba_32768_nondestructive_sse41) contains multiple loops and intricate pointer manipulation across a wide range of lines, making it difficult to follow and maintain. Splitting it into smaller, well-named helper routines (e.g., initialization, SSE mixing, tail processing) can significantly improve readability and manageability.

🧰 Tools

🪛 GitHub Check: CodeQL

[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).

[failure] 51, 148-151, 185-188, 196-199, 245-248, 256-259, 278-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).

🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace found.

[warning] 104-104: Trailing whitespace found.

[warning] 111-111: Trailing whitespace found.

[warning] 118-118: Trailing whitespace found.

🧹 Nitpick comments (3)

configure (2)
1537-1552: Validate usage of _mm_insert_epi64 in the SSE4.1 check
This test snippet confirms compiler support for SSE4.1 intrinsics. However, __m128i b is not initialized, which could trigger warnings under stricter compiler settings. Consider initializing it with _mm_setzero_si128() if such warnings become problematic.

1680-1688: Remove trailing whitespace and finalize SSE4.1 configuration
Line 1681 includes trailing whitespace, causing the lint warning. Removing it will clear the pipeline warning. The surrounding SSE4.1 build logic is coherent.

Apply this diff to remove trailing whitespace:
-            # (there are trailing spaces here on line 1681)
+            
🧰 Tools

🪛 GitHub Actions: Lint

[warning] 1681-1681: Trailing whitespace found.
arch/x86/chorba_sse41.c (1)
39-39: Remove trailing whitespace.
This line includes trailing whitespace, as flagged by the lint pipeline; lines 104, 111, and 118 have the same issue. Removing it will keep the project coding style clean and consistent.

Also applies to: 104-104, 111-111, 118-118

🧰 Tools

🪛 GitHub Actions: Lint

[warning] 39-39: Trailing whitespace found.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e1574e2 and 4ae7746.

📒 Files selected for processing (13)

CMakeLists.txt (4 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/chorba_sse41.c (1 hunks)
arch/x86/x86_features.c (1 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
arch/x86/x86_intrins.h (1 hunks)
cmake/detect-intrinsics.cmake (1 hunks)
configure (3 hunks)
functable.c (1 hunks)
test/benchmarks/benchmark_crc32.cc (1 hunks)
test/test_crc32.cc (2 hunks)
win32/Makefile.msc (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (10)

arch/x86/x86_intrins.h
arch/x86/Makefile.in
arch/x86/x86_features.c
test/benchmarks/benchmark_crc32.cc
functable.c
arch/x86/x86_functions.h
arch/x86/x86_features.h
cmake/detect-intrinsics.cmake
CMakeLists.txt
test/test_crc32.cc

🧰 Additional context used

🧠 Learnings (1)

arch/x86/chorba_sse41.c (3)

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.

Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.

🧬 Code Definitions (1)

arch/x86/chorba_sse41.c (2)

arch/x86/x86_intrins.h (2)

_mm_cvtsi64x_si128 (108-110)

_mm_cvtsi64x_si128 (108-108)

arch/x86/x86_functions.h (1)

crc32_chorba_sse41 (41-41)
