SSE4.1 optimized chorba #1893
Conversation
Walkthrough
This change introduces support for the SSE4.1 instruction set across the build system, runtime feature detection, and CRC32 computation functionality for x86 architectures. It adds new build options and detection logic for SSE4.1 in CMake, configure scripts, and Makefiles. The runtime feature detection is extended to recognize SSE4.1, and a new SSE4.1-optimized CRC32 implementation is provided. Function pointer assignments and test/benchmark registration logic are updated to utilize the new SSE4.1 variant when available. Corresponding test cases and benchmarks are included to validate and measure the new implementation.
Sequence Diagram(s)
sequenceDiagram
participant BuildSystem
participant CPU
participant FeatureDetect
participant CRC32API
participant SSE2Impl
participant SSE41Impl
participant User
User->>CRC32API: Call crc32(...)
CRC32API->>FeatureDetect: Query CPU features
FeatureDetect->>CPU: CPUID
CPU-->>FeatureDetect: Feature bits (including SSE4.1)
FeatureDetect-->>CRC32API: has_sse41, has_sse2, etc.
alt SSE4.1 available
CRC32API->>SSE41Impl: crc32_chorba_sse41(...)
else SSE2 available
CRC32API->>SSE2Impl: crc32_chorba_sse2(...)
else
CRC32API->>CRC32API: Use fallback implementation
end
CRC32API-->>User: Return CRC32 result
Force-pushed from e251b42 to e1574e2
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #1893 +/- ##
============================================
+ Coverage 40.29% 81.92% +41.63%
============================================
Files 73 146 +73
Lines 7684 13267 +5583
Branches 1281 2944 +1663
============================================
+ Hits 3096 10869 +7773
+ Misses 4363 1493 -2870
- Partials 225 905 +680

☔ View full report in Codecov by Sentry.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Actionable comments posted: 3
🧹 Nitpick comments (8)
test/benchmarks/benchmark_crc32.cc (1)
74-76: Consider using a dedicated SSE4.1 compile guard
The code currently uses the same NO_CHORBA_SSE2 guard for both SSE2 and SSE4.1 implementations. Consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 implementation to allow more fine-grained control.
-# if defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+# if defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
 BENCHMARK_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41);
 # endif
arch/x86/Makefile.in (1)
83-88: Use SFLAGS for shared object compilation
For consistency with other architecture-specific rules in this file, you should use $(SFLAGS) instead of $(CFLAGS) for the .lo target, which represents the shared object compilation.
chorba_sse41.lo:
-	$(CC) $(CFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
+	$(CC) $(SFLAGS) $(SSE41FLAG) $(NOLTOFLAG) -DPIC $(INCLUDES) -c -o $@ $(SRCDIR)/chorba_sse41.c
test/test_crc32.cc (1)
288-290: Consider using a dedicated SSE4.1 compile guard
Similar to the benchmark file, consider using a separate guard like NO_CHORBA_SSE41 for the SSE4.1 test to allow more fine-grained control over which implementations are tested.
-#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE2)
+#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE41)
 TEST_CRC32(chorba_sse41, crc32_chorba_sse41, test_cpu_features.x86.has_sse41)
 #endif
configure (4)
39-39: Remove trailing whitespace in these lines
They are causing linter warnings. Consider removing the extra spaces at the end of each of these lines.
Apply this diff to remove trailing spaces:
-# We only need to zero out the bytes between the 128'th value and the 144th 
+# We only need to zero out the bytes between the 128'th value and the 144th
...
-  ;; 
+  ;;
...
-  esac 
+  esac
...
-  --help) 
+  --help)
Also applies to: 104-104, 111-111, 118-118
114-114: Add a comment describing SSE4.1 usage
The newly introduced sse41flag="-msse4.1" is self-explanatory, but it might be helpful to document the rationale for enabling SSE4.1 and any potential compatibility considerations with older compilers or CPUs.
1537-1552: Ensure error handling for SSE4.1 intrinsics check
The function check_sse41_intrinsics() compiles a test source to detect availability of SSE4.1. While this is correct for most compilers, consider gracefully handling exotic or older compilers that may not support -msse4.1 or <smmintrin.h>.
1683-1687: Add extra logging or condition checks for SSE4.1 availability
The conditional block enabling SSE4.1 is good. Consider logging a short user-facing message to confirm SSE4.1 detection (similar to other SSE checks) for consistency and debuggability.
arch/x86/chorba_sse41.c (1)
39-39: Remove trailing whitespace
The linter warns about trailing spaces on these lines. Removing them keeps the codebase clean and consistent.
Apply this diff to remove trailing spaces:
-/* We only need to zero out the bytes between the 128'th value and the 144th 
+/* We only need to zero out the bytes between the 128'th value and the 144th
...
-    _mm_store_si128(bitbuf144++, a); 
+    _mm_store_si128(bitbuf144++, a);
...
-    _mm_store_si128(bitbuf182++, e); 
+    _mm_store_si128(bitbuf182++, e);
...
-    buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_); 
+    buf210 = _mm_xor_si128(_mm_load_si128(bitbuf210+4), in8_);
Also applies to: 104-104, 111-111, 118-118
🧰 Tools
🪛 GitHub Actions: Lint
[warning] 39-39: Trailing whitespace.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
CMakeLists.txt (4 hunks)
arch/x86/Makefile.in (3 hunks)
arch/x86/chorba_sse41.c (1 hunks)
arch/x86/x86_features.c (1 hunks)
arch/x86/x86_features.h (1 hunks)
arch/x86/x86_functions.h (2 hunks)
arch/x86/x86_intrins.h (1 hunks)
cmake/detect-intrinsics.cmake (1 hunks)
configure (3 hunks)
functable.c (1 hunks)
test/benchmarks/benchmark_crc32.cc (1 hunks)
test/test_crc32.cc (2 hunks)
win32/Makefile.msc (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
test/benchmarks/benchmark_crc32.cc (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
🧬 Code Definitions (3)
functable.c (1)
arch/x86/x86_functions.h (1)
crc32_chorba_sse41(41-41)
test/benchmarks/benchmark_crc32.cc (1)
arch/x86/x86_functions.h (1)
crc32_chorba_sse41(41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
crc32(14-14)
🪛 GitHub Actions: Lint
configure
[warning] 1681-1681: Trailing whitespace.
arch/x86/chorba_sse41.c
[warning] 39-39: Trailing whitespace.
[warning] 104-104: Trailing whitespace.
[warning] 111-111: Trailing whitespace.
[warning] 118-118: Trailing whitespace.
🪛 GitHub Check: CodeQL
arch/x86/chorba_sse41.c
[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).
[failure] Suspicious pointer scaling at lines 51, 148-151, 185-188, 196-199, 245-248, 256-259, 278-281 (25 occurrences of the same finding):
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
⏰ Context from checks skipped due to timeout of 90000ms (140)
- GitHub Check: Windows MSVC 2022 v143 Win64 Native Instructions (AVX)
- GitHub Check: macOS GCC UBSAN (ARM64)
- GitHub Check: Ubuntu GCC SPARC64
- GitHub Check: Windows MSVC 2022 v142 Win64
- GitHub Check: Windows ClangCl Win32
- GitHub Check: Windows ClangCl Win64
- GitHub Check: Windows MSVC 2019 v141 Win32
- GitHub Check: Windows GCC Compat No Opt
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: Windows MSVC 2022 v143 Win32
- GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
- GitHub Check: Ubuntu GCC Compat No Opt ASAN
- GitHub Check: Ubuntu GCC Compat Symbol Prefix
- GitHub Check: macOS GCC
🔇 Additional comments (21)
arch/x86/x86_intrins.h (1)
95-95: LGTM: Minor formatting change
The addition of a blank line after the include statement improves code readability without affecting functionality.
arch/x86/x86_features.h (1)
20-20: LGTM: Clean SSE4.1 support addition
The has_sse41 field is correctly placed in the CPU features structure between SSSE3 and SSE4.2, following the logical progression of Intel's instruction set extensions.
arch/x86/x86_features.c (1)
78-78: LGTM: Proper SSE4.1 detection implementation
The SSE4.1 feature detection uses the correct bitmask (0x80000) to check bit 19 of the ECX register, which aligns with Intel's documentation for identifying SSE4.1 support via CPUID.
functable.c (1)
99-106: LGTM: Well-structured SSE4.1 function dispatch
The SSE4.1 conditional block is properly positioned between SSSE3 and SSE4.2 sections, maintaining the logical progression of feature checks. The implementation follows the established pattern in the codebase for conditionally selecting optimized implementations based on CPU capabilities.
The performance improvement (25-30% over SSE2 variant per PR description) will be automatically utilized on compatible systems without disrupting existing code paths.
win32/Makefile.msc (2)
65-65: LGTM: Object file addition looks good
The new object file chorba_sse41.obj is correctly added to the OBJS list, following the same pattern as other architecture-specific implementations.
217-217: LGTM: Dependency rule correctly defined
The dependency rule for the new object file is correctly defined, specifying the source file and required headers.
test/benchmarks/benchmark_crc32.cc (1)
75-75: LGTM: Benchmark registration follows established pattern
The benchmark registration for the new SSE4.1 implementation correctly follows the established pattern, using the appropriate function and feature check.
arch/x86/Makefile.in (2)
16-16: LGTM: SSE4.1 flag correctly defined
The SSE4.1 flag is correctly defined with the appropriate compiler option -msse4.1.
39-39: LGTM: Object files correctly added
The new object files for the SSE4.1-optimized implementation are correctly added to the build targets.
test/test_crc32.cc (1)
196-197: LGTM: Additional test case for half-sized buffer
Adding a test case with a 16KB buffer size (half of the existing 32KB test) is a good approach to ensure the implementation works correctly with different buffer sizes.
arch/x86/x86_functions.h (2)
39-43: LGTM! SSE4.1 function declaration is properly guarded
The declaration of crc32_chorba_sse41 is appropriately guarded with the correct preprocessor directives, ensuring it's only available when SSE4.1 and SSE2 are enabled and the CHORBA implementation is not disabled.
123-126: LGTM! Native CRC32 redirection to SSE4.1 implementation
This change appropriately redirects the native_crc32 function to use the SSE4.1 implementation when SSE4.1 is available and the necessary prerequisites are met.
CMakeLists.txt (4)
129-129: LGTM! SSE41 dependency is correctly specified
The WITH_SSE41 option is properly defined as dependent on WITH_SSSE3, which matches the architecture dependency chain where SSE4.1 builds upon SSSE3 instructions.
149-150: LGTM! Advanced options updated to include SSE41
The mark_as_advanced section has been updated to include the WITH_SSE41 option, maintaining consistency with how other similar options are handled.
995-1005: LGTM! SSE41 build configuration implementation
The implementation for building with SSE4.1 follows the same pattern as other instruction set extensions:
- Checks for intrinsics support
- Sets up appropriate macros and source files
- Applies correct compilation flags
This change is well-structured and consistent with the project's coding standards.
1496-1496: LGTM! Feature summary for SSE41 added
The feature summary is properly updated to include information about the SSE41 build option, maintaining consistency with documentation for other instruction sets.
configure (1)
1680-1682: Maintain consistency with other checks
You've seamlessly added check_sse41_intrinsics after check_ssse3_intrinsics. Ensure that the new check is also integrated into any relevant feature-detection logs or user-facing messages that summarize SSE capabilities.
🧰 Tools
🪛 GitHub Actions: Lint
[warning] 1681-1681: Trailing whitespace.
arch/x86/chorba_sse41.c (4)
1-12: Check for fallback scenarios
You guard this code with #if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && defined(X86_SSE2). Ensure that if either SSE2 or SSE4.1 is not available, the build system correctly compiles an alternative implementation or gracefully disables this feature.
13-25: Ensure consistent function declarations
Make sure these external and macro definitions remain consistent with their declarations in crc32_braid_p.h and any additional headers (like crc32.h). If function signatures change, it could result in subtle linking issues.
51-51: Confirm pointer arithmetic for __m128i arrays
Static analysis flagged "suspicious pointer scaling" at these lines. bitbuffer is an array of uint64_t, yet you cast and manipulate it as __m128i *. This is likely correct given the 16-byte alignment, but please confirm each offset multiplication matches your intended data layout.
Also applies to: 148-151, 185-188, 196-199, 245-248, 256-259, 278-281
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
376-406: Validate alignment logic for SSE usage
In crc32_chorba_sse41(), you compute algn_diff to align the buffer to a 16-byte boundary. Make sure no performance regressions arise if len is small or if buf is nearly aligned (especially edge cases where algn_diff == len).
|
I mean that is intentional, 2 64 bit integers is a stride through a single __m128i. Maybe I modify the pointer type to appease codeQL, though. Err actually no, that's me intentionally advancing an __m128i pointer. What else would it have me do? The intrinsic requires a __m128i pointer. |
Force-pushed from 4ae7746 to cf88723
Actionable comments posted: 0
♻️ Duplicate comments (1)
arch/x86/chorba_sse41.c (1)
28-374: 🛠️ Refactor suggestion
Refactor the large function for maintainability.
This function (crc32_chorba_32768_nondestructive_sse41) contains multiple loops and intricate pointer manipulation across a wide range of lines, making it difficult to follow and maintain. Splitting it into smaller, well-named helper routines (e.g., initialization, SSE mixing, tail processing) can significantly improve readability and manageability.
🧰 Tools
🪛 GitHub Check: CodeQL
[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).
[failure] Suspicious pointer scaling at lines 51, 148-151, 185-188, 196-199, 245-248, 256-259, 278-281 (25 occurrences of the same finding):
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
🪛 GitHub Actions: Lint
[warning] 39-39: Trailing whitespace found.
[warning] 104-104: Trailing whitespace found.
[warning] 111-111: Trailing whitespace found.
[warning] 118-118: Trailing whitespace found.
🧹 Nitpick comments (3)
configure (2)
1537-1552: Validate usage of _mm_insert_epi64 in the SSE4.1 check
This test snippet effectively confirms compiler support for SSE4.1 intrinsics. However, __m128i b is not initialized, which could trigger warnings in stricter compiler settings. Consider _mm_setzero_si128() if such warnings become problematic.
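A configure-time probe along the lines the review suggests might look like the sketch below. This is a hedged illustration, not zlib-ng's actual configure code: the file name `conftest_sse41.c` and the commented compile command are assumptions, and the vector is initialized with `_mm_setzero_si128()` per the review's recommendation.

```shell
# Write an SSE4.1 probe; _mm_insert_epi64 only exists with SSE4.1 enabled.
cat > conftest_sse41.c <<'EOF'
#include <smmintrin.h>
int main(void) {
    __m128i b = _mm_setzero_si128();      /* initialized, per review */
    b = _mm_insert_epi64(b, 42, 1);       /* SSE4.1-only intrinsic */
    return _mm_extract_epi32(b, 0);
}
EOF
# A real check would then attempt: $CC -msse4.1 -c conftest_sse41.c
# and treat a successful compile as "SSE4.1 intrinsics available".
grep -c '_mm_setzero_si128' conftest_sse41.c
```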
1680-1688: Remove trailing whitespace and finalize SSE4.1 configuration
Line 1681 includes trailing whitespace, causing the lint warning. Removing it will clear the pipeline warning. The surrounding SSE4.1 build logic is coherent.
Apply this diff to remove trailing whitespace:
- # (there are trailing spaces here on line 1681)
+
🧰 Tools
🪛 GitHub Actions: Lint
[warning] 1681-1681: Trailing whitespace found.
arch/x86/chorba_sse41.c (1)
39-39: Remove trailing whitespace.
These lines include trailing whitespaces, as flagged by the lint pipeline. Removing them will keep the project coding style clean and consistent.
(Repeat similarly for lines 104, 111, and 118.)
Also applies to: 104-104, 111-111, 118-118
🧰 Tools
🪛 GitHub Actions: Lint
[warning] 39-39: Trailing whitespace found.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunks)
- arch/x86/x86_features.c (1 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (2 hunks)
- arch/x86/x86_intrins.h (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- configure (3 hunks)
- functable.c (1 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- arch/x86/x86_intrins.h
- arch/x86/Makefile.in
- arch/x86/x86_features.c
- test/benchmarks/benchmark_crc32.cc
- functable.c
- arch/x86/x86_functions.h
- arch/x86/x86_features.h
- cmake/detect-intrinsics.cmake
- CMakeLists.txt
- test/test_crc32.cc
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
🧬 Code Definitions (1)
arch/x86/chorba_sse41.c (2)
arch/x86/x86_intrins.h (2)
_mm_cvtsi64x_si128(108-110)_mm_cvtsi64x_si128(108-108)arch/x86/x86_functions.h (1)
crc32_chorba_sse41(41-41)
🪛 GitHub Actions: Lint
configure
[warning] 1681-1681: Trailing whitespace found.
arch/x86/chorba_sse41.c
[warning] 39-39: Trailing whitespace found.
[warning] 104-104: Trailing whitespace found.
[warning] 111-111: Trailing whitespace found.
[warning] 118-118: Trailing whitespace found.
🪛 GitHub Check: CodeQL
arch/x86/chorba_sse41.c
[notice] 28-374: Block with too many statements
Block with too many statements (5 complex statements in the block).
[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 148-148: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 149-149: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 150-150: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 151-151: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 185-185: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 186-186: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 187-187: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 188-188: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 196-196: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 197-197: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 198-198: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 199-199: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 245-245: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 246-246: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 247-247: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 248-248: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 256-256: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 257-257: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 258-258: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 259-259: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 278-278: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 279-279: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 280-280: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
[failure] 281-281: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
⏰ Context from checks skipped due to timeout of 90000ms (174)
- GitHub Check: macOS Clang Native Instructions (ARM64)
- GitHub Check: Windows GCC Compat No Opt
- GitHub Check: macOS Clang (Target 10.10)
- GitHub Check: Windows MSVC 2022 v143 Win64
- GitHub Check: macOS Clang ASAN
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: Windows MSVC 2019 v140 Win32
- GitHub Check: Ubuntu GCC SPARC64
- GitHub Check: Ubuntu GCC No PCLMULQDQ UBSAN
- GitHub Check: Ubuntu GCC Compat Symbol Prefix
- GitHub Check: Ubuntu GCC -O1 UBSAN
- GitHub Check: Ubuntu GCC AARCH64
- GitHub Check: macOS Clang Symbol Prefix
- GitHub Check: Ubuntu GCC
- GitHub Check: Ubuntu GCC MIPS
- GitHub Check: Ubuntu GCC MIPS64
- GitHub Check: Ubuntu GCC -m32
- GitHub Check: Windows MSVC 2019 v141 Win32
🔇 Additional comments (6)
win32/Makefile.msc (3)
36-36: Confirm SSE4.1 define
The define `-DX86_SSE41` is consistent with the naming pattern for other SSE flags (e.g., `-DX86_SSE2`, `-DX86_SSE42`).
64-65: Ensure conditional SSE4.1 build
The addition of `chorba_sse2.obj` and `chorba_sse41.obj` references is consistent with SSE-based optimizations. Confirm that your build system conditionally includes `chorba_sse41.obj` only when SSE4.1 is supported or intended.
217-217: Add SSE4.1 build rule
The build rule for `chorba_sse41.obj` follows the same pattern as related SSE object rules and appears correct.
configure (1)
114-114: Define SSE4.1 compiler flag
The new `sse41flag="-msse4.1"` assignment is appropriately aligned with the existing SSE flag definitions.
376-406: Validate SSE fallback logic.
In the `crc32_chorba_sse41` function, ensure that the fallbacks to `crc32_braid_internal` or `chorba_small_nondestructive_sse2` are correct and consistent with your SSE4.1 detection. Verify that length thresholds are aligned with the intended performance trade-offs, and that the code correctly handles edge cases where data is shorter than the SSE4.1-optimized block size.
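To make the threshold review concrete, the selection logic has roughly this shape. The stub names and cutoff values below are illustrative stand-ins for exercising the branch structure, not the actual values in the PR:

```c
#include <stddef.h>

/* Stubs standing in for the real CRC kernels, so the length-based
 * selection can be exercised on its own. */
static int crc32_braid_stub(void)  { return 1; }  /* scalar fallback */
static int chorba_sse2_stub(void)  { return 2; }  /* small nondestructive path */
static int chorba_sse41_stub(void) { return 3; }  /* full 32 KiB bitbuffer kernel */

/* Pick a kernel by input length; thresholds here are illustrative. */
int select_kernel(size_t len) {
    if (len < 64)                 /* too small to amortize SIMD setup */
        return crc32_braid_stub();
    if (len < 32768)              /* below the big kernel's working set */
        return chorba_sse2_stub();
    return chorba_sse41_stub();
}
```

The edge cases worth checking are exactly the lengths just below and above each cutoff, since that is where a mis-tuned threshold silently routes data to the wrong kernel.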
51-51: ❓ Verification inconclusive
Potential suspicious pointer scaling.
Several lines use pointer increments on `__m128i*` objects in ways that static analysis flags as "suspicious pointer scaling". Verify that these increments (often with `bitbufXXX++ + N`) correctly align with 128-bit boundaries and do not accidentally skip or overlap data. If this behavior is intentional (e.g., skipping multiple SIMD registers), consider adding comments explaining why this is correct. To confirm correctness, run the following script to highlight relevant pointer arithmetic usage for manual inspection:
Also applies to: 148-151, 185-188, 196-199, 245-248, 256-259, 278-281
🏁 Script executed:
#!/bin/bash # Searching lines where bitbuf pointers are incremented or offset by numeric values rg -A 3 'bitbuf\S*\+\+' arch/x86/chorba_sse41.cLength of output: 5002
Action: Verify pointer arithmetic correctness in SIMD store operations
- In
arch/x86/chorba_sse41.c, line 51 shows the expression_mm_store_si128(bitbuf144++ + 8, z);. This usage combines a post-increment (bitbuf144++) with an added offset (+ 8), which is atypical compared to the other straightforward pointer increments (e.g.,bitbuf128++,bitbuf182++, etc.).- Please verify that this arithmetic correctly accounts for 128-bit boundaries and intentionally skips the necessary SIMD registers without overlapping or misaligning the data.
- As similar pointer arithmetic is seen throughout the file (notably in the ranges [148–151, 185–188, 196–199, 245–248, 256–259, 278–281]), if this behavior is deliberate (for example, to zero out or process a specific subrange), please add inline comments explaining the intent to aid maintainability.
🧰 Tools
🪛 GitHub Check: CodeQL
[failure] 51-51: Suspicious pointer scaling
This pointer might have type unsigned long (size 8), but this pointer arithmetic is done with type __m128i * (size 16).
|
Can we whitelist this thing with codeQL or something? The alternative looks like this: Hmm, evidently derefing with a [] is what it wants? https://codeql.github.com/codeql-query-help/cpp/cpp-suspicious-pointer-scaling/ That's rather silly, but whatever, it's the same thing. |
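For reference, the two spellings address the same slot; only the surface syntax differs, which is what the query keys on. A small standalone sketch (buffer and names are illustrative, not the PR's code):

```c
#include <emmintrin.h>  /* SSE2 only; baseline on x86-64 */
#include <stdint.h>

/* Zero two 128-bit slots of a scratch buffer, once with pointer arithmetic
 * (the form CodeQL flags as cpp/suspicious-pointer-scaling) and once with
 * the indexed form it accepts. Returns 1 if both slots read back as zero. */
int scaling_demo(void) {
    _Alignas(16) uint64_t buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = ~0ULL;                 /* nonzero fill pattern */

    __m128i z = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;

    _mm_store_si128(p + 8, z);          /* flagged: "suspicious pointer scaling" */
    _mm_store_si128(&p[9], z);          /* same scaling, indexed form, not flagged */

    return buf[16] == 0 && buf[17] == 0 &&   /* 128-bit slot 8 */
           buf[18] == 0 && buf[19] == 0;     /* 128-bit slot 9 */
}
```

Both stores scale the offset by `sizeof(__m128i)`, so switching to `&p[N]` is purely a cosmetic change to satisfy the query.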
cf88723 to
7a7fd32
Compare
Actionable comments posted: 1
🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)
39-39: Remove trailing whitespace.
The build pipeline flags trailing whitespace on lines 39, 104, 111, and 118. Eliminating trailing whitespace helps maintain a clean codebase.
Example fix:

-    for (size_t i = 0; i < 2; ++i) {··
+    for (size_t i = 0; i < 2; ++i) {

Also applies to: 104-104, 111-111, 118-118
🧰 Tools
🪛 GitHub Actions: Lint
[warning] 39-39: Trailing whitespace.
🛑 Comments failed to post (1)
configure (1)
1680-1688:
⚠️ Potential issue: Fix trailing whitespace on line 1681
There's a trailing whitespace character that should be removed to comply with the codebase style guidelines.

-    check_sse41_intrinsics 
+    check_sse41_intrinsics

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

check_sse41_intrinsics
if test ${HAVE_SSE41_INTRIN} -eq 1; then
    CFLAGS="${CFLAGS} -DX86_SSE41"
    SFLAGS="${SFLAGS} -DX86_SSE41"
    ARCH_STATIC_OBJS="${ARCH_STATIC_OBJS} chorba_sse41.o"
    ARCH_SHARED_OBJS="${ARCH_SHARED_OBJS} chorba_sse41.lo"
fi

🧰 Tools
🪛 GitHub Actions: Lint
[warning] 1681-1681: Trailing whitespace.
f1e0e89 to
0211aba
Compare
static Z_FORCEINLINE uint32_t crc32_chorba_32768_nondestructive_sse41(uint32_t crc, const uint64_t* buf, size_t len) {
    const uint64_t* input = buf;
    ALIGNED_(16) uint64_t bitbuffer[32768 / sizeof(uint64_t)];
    const uint8_t* bitbufferbytes = (const uint8_t*) bitbuffer;
    __m128i z = _mm_setzero_si128();

    __m128i *bitbuf128 = (__m128i*)(bitbuffer + 128);
    __m128i *bitbuf144 = (__m128i*)(bitbuffer + 144);
    __m128i *bitbuf182 = (__m128i*)(bitbuffer + 182);
    __m128i *bitbuf0 = bitbuf128;

    /* We only need to zero out the bytes between the 128'th value and the 144th
     * that are actually read */
    for (size_t i = 0; i < 2; ++i) {
        _mm_store_si128(bitbuf128++, z);
        _mm_store_si128(bitbuf128++, z);
        _mm_store_si128(bitbuf128++, z);
        _mm_store_si128(bitbuf128++, z);
    }

    /* We only need to zero out the bytes between the 144'th value and the 182nd that
     * are actually read */
    for (size_t i = 0; i < 11; ++i) {
        _mm_store_si128(bitbuf144++ + 8, z);
    }

    /* We only need to zero out the bytes between the 182nd value and the 210th that
     * are actually read. */
    for (size_t i = 0; i < 4; ++i) {
        _mm_store_si128(bitbuf182++, z);
        _mm_store_si128(bitbuf182++, z);
        _mm_store_si128(bitbuf182++, z);
        _mm_store_si128(bitbuf182++, z);
    }

    /* We need to mix this in */
    __m128i init_crc = _mm_cvtsi64x_si128(crc);
    crc = 0;

    size_t i = 0;
    __m128i *inptr = (__m128i*)input;
    bitbuf144 = (__m128i*)(bitbuffer + 144);
    bitbuf182 = (__m128i*)(bitbuffer + 182);
    __m128i *bitbuf210 = (__m128i*)(bitbuffer + 210);
    __m128i *bitbuf300 = (__m128i*)(bitbuffer + 300);

    /* Previous iteration runs carried over */
    __m128i buf144 = z;
    __m128i buf182 = z;
    __m128i buf210 = z;

    for(; i + 300*8+64 < len && i < 22 * 8; i += 64) {
        __m128i in12, in34, in56, in78;

        in12 = _mm_load_si128(inptr++);
        in34 = _mm_load_si128(inptr++);
        in56 = _mm_load_si128(inptr++);
        in78 = _mm_load_si128(inptr++);

        if (i == 0) {
            in12 = _mm_xor_si128(in12, init_crc);
        }

        __m128i in_1 = _mm_slli_si128(in12, 8);

        __m128i in23 = _mm_alignr_epi8(in34, in12, 8);
        __m128i in45 = _mm_alignr_epi8(in56, in34, 8);
        __m128i in67 = _mm_alignr_epi8(in78, in56, 8);
        __m128i in8_ = _mm_srli_si128(in78, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);

        _mm_store_si128(bitbuf144++, a);
        _mm_store_si128(bitbuf144++, in23);
        _mm_store_si128(bitbuf144++, in45);
        _mm_store_si128(bitbuf144++, in67);
        buf144 = in8_;

        __m128i e = _mm_xor_si128(buf182, in_1);
        _mm_store_si128(bitbuf182++, e);
        _mm_store_si128(bitbuf182++, in23);
        _mm_store_si128(bitbuf182++, in45);
        _mm_store_si128(bitbuf182++, in67);
        buf182 = in8_;

        __m128i m = _mm_xor_si128(buf210, in_1);
        _mm_store_si128(bitbuf210++, m);
        _mm_store_si128(bitbuf210++, in23);
        _mm_store_si128(bitbuf210++, in45);
        _mm_store_si128(bitbuf210++, in67);
        buf210 = in8_;

        _mm_store_si128(bitbuf300++, in12);
        _mm_store_si128(bitbuf300++, in34);
        _mm_store_si128(bitbuf300++, in56);
        _mm_store_si128(bitbuf300++, in78);
    }

    for(; i + 300*8+64 < len && i < 32 * 8; i += 64) {
        __m128i in12, in34, in56, in78;
        in12 = _mm_load_si128(inptr++);
        in34 = _mm_load_si128(inptr++);
        in56 = _mm_load_si128(inptr++);
        in78 = _mm_load_si128(inptr++);

        __m128i in_1 = _mm_slli_si128(in12, 8);
        __m128i in23 = _mm_alignr_epi8(in34, in12, 8);
        __m128i in45 = _mm_alignr_epi8(in56, in34, 8);
        __m128i in67 = _mm_alignr_epi8(in78, in56, 8);
        __m128i in8_ = _mm_srli_si128(in78, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);

        _mm_store_si128(bitbuf144++, a);
        _mm_store_si128(bitbuf144++, in23);
        _mm_store_si128(bitbuf144++, in45);
        _mm_store_si128(bitbuf144++, in67);
        buf144 = in8_;

        __m128i e = _mm_xor_si128(buf182, in_1);
        __m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
        __m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
        __m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
        buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

        _mm_store_si128(bitbuf182++, e);
        _mm_store_si128(bitbuf182++, f);
        _mm_store_si128(bitbuf182++, g);
        _mm_store_si128(bitbuf182++, h);

        __m128i m = _mm_xor_si128(buf210, in_1);
        _mm_store_si128(bitbuf210++, m);
        _mm_store_si128(bitbuf210++, in23);
        _mm_store_si128(bitbuf210++, in45);
        _mm_store_si128(bitbuf210++, in67);
        buf210 = in8_;

        _mm_store_si128(bitbuf300++, in12);
        _mm_store_si128(bitbuf300++, in34);
        _mm_store_si128(bitbuf300++, in56);
        _mm_store_si128(bitbuf300++, in78);
    }

    for(; i + 300*8+64 < len && i < 84 * 8; i += 64) {
        __m128i in12, in34, in56, in78;
        in12 = _mm_load_si128(inptr++);
        in34 = _mm_load_si128(inptr++);
        in56 = _mm_load_si128(inptr++);
        in78 = _mm_load_si128(inptr++);

        __m128i in_1 = _mm_slli_si128(in12, 8);
        __m128i in23 = _mm_alignr_epi8(in34, in12, 8);
        __m128i in45 = _mm_alignr_epi8(in56, in34, 8);
        __m128i in67 = _mm_alignr_epi8(in78, in56, 8);
        __m128i in8_ = _mm_srli_si128(in78, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);
        __m128i b = _mm_xor_si128(_mm_load_si128(&bitbuf144[1]), in23);
        __m128i c = _mm_xor_si128(_mm_load_si128(&bitbuf144[2]), in45);
        __m128i d = _mm_xor_si128(_mm_load_si128(&bitbuf144[3]), in67);
        buf144 = _mm_xor_si128(_mm_load_si128(&bitbuf144[4]), in8_);

        _mm_store_si128(bitbuf144++, a);
        _mm_store_si128(bitbuf144++, b);
        _mm_store_si128(bitbuf144++, c);
        _mm_store_si128(bitbuf144++, d);

        __m128i e = _mm_xor_si128(buf182, in_1);
        __m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
        __m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
        __m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
        buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

        _mm_store_si128(bitbuf182++, e);
        _mm_store_si128(bitbuf182++, f);
        _mm_store_si128(bitbuf182++, g);
        _mm_store_si128(bitbuf182++, h);

        __m128i m = _mm_xor_si128(buf210, in_1);
        _mm_store_si128(bitbuf210++, m);
        _mm_store_si128(bitbuf210++, in23);
        _mm_store_si128(bitbuf210++, in45);
        _mm_store_si128(bitbuf210++, in67);
        buf210 = in8_;

        _mm_store_si128(bitbuf300++, in12);
        _mm_store_si128(bitbuf300++, in34);
        _mm_store_si128(bitbuf300++, in56);
        _mm_store_si128(bitbuf300++, in78);
    }

    for(; i + 300*8+64 < len; i += 64) {
        __m128i in12, in34, in56, in78;

        if (i < 128 * 8) {
            in12 = _mm_load_si128(inptr++);
            in34 = _mm_load_si128(inptr++);
            in56 = _mm_load_si128(inptr++);
            in78 = _mm_load_si128(inptr++);
        } else {
            in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
        }

        // [0, 145, 183, 211]

        /* Pre Penryn CPUs the unpack should be faster */
        __m128i in_1 = _mm_slli_si128(in12, 8);

        __m128i in23 = _mm_alignr_epi8(in34, in12, 8);
        __m128i in45 = _mm_alignr_epi8(in56, in34, 8);
        __m128i in67 = _mm_alignr_epi8(in78, in56, 8);
        __m128i in8_ = _mm_srli_si128(in78, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);
        __m128i b = _mm_xor_si128(_mm_load_si128(&bitbuf144[1]), in23);
        __m128i c = _mm_xor_si128(_mm_load_si128(&bitbuf144[2]), in45);
        __m128i d = _mm_xor_si128(_mm_load_si128(&bitbuf144[3]), in67);
        buf144 = _mm_xor_si128(_mm_load_si128(&bitbuf144[4]), in8_);

        _mm_store_si128(bitbuf144++, a);
        _mm_store_si128(bitbuf144++, b);
        _mm_store_si128(bitbuf144++, c);
        _mm_store_si128(bitbuf144++, d);

        __m128i e = _mm_xor_si128(buf182, in_1);
        __m128i f = _mm_xor_si128(_mm_load_si128(&bitbuf182[1]), in23);
        __m128i g = _mm_xor_si128(_mm_load_si128(&bitbuf182[2]), in45);
        __m128i h = _mm_xor_si128(_mm_load_si128(&bitbuf182[3]), in67);
        buf182 = _mm_xor_si128(_mm_load_si128(&bitbuf182[4]), in8_);

        _mm_store_si128(bitbuf182++, e);
        _mm_store_si128(bitbuf182++, f);
        _mm_store_si128(bitbuf182++, g);
        _mm_store_si128(bitbuf182++, h);

        __m128i n, o, p;
        __m128i m = _mm_xor_si128(buf210, in_1);

        /* Couldn't tell you why but despite knowing that this is always false,
         * removing this branch with GCC makes things significantly slower. Some
         * loop bodies must be being joined or something */
        if (i < 84 * 8) {
            n = in23;
            o = in45;
            p = in67;
            buf210 = in8_;
        } else {
            n = _mm_xor_si128(_mm_load_si128(&bitbuf210[1]), in23);
            o = _mm_xor_si128(_mm_load_si128(&bitbuf210[2]), in45);
            p = _mm_xor_si128(_mm_load_si128(&bitbuf210[3]), in67);
            buf210 = _mm_xor_si128(_mm_load_si128(&bitbuf210[4]), in8_);
        }

        _mm_store_si128(bitbuf210++, m);
        _mm_store_si128(bitbuf210++, n);
        _mm_store_si128(bitbuf210++, o);
        _mm_store_si128(bitbuf210++, p);

        _mm_store_si128(bitbuf300++, in12);
        _mm_store_si128(bitbuf300++, in34);
        _mm_store_si128(bitbuf300++, in56);
        _mm_store_si128(bitbuf300++, in78);
    }

    /* Second half of stores bubbled out */
    _mm_store_si128(bitbuf144, buf144);
    _mm_store_si128(bitbuf182, buf182);
    _mm_store_si128(bitbuf210, buf210);

    /* We also have to zero out the tail */
    size_t left_to_z = len - (300*8 + i);
    __m128i *bitbuf_tail = (__m128i*)(bitbuffer + 300 + i/8);
    while (left_to_z >= 64) {
        _mm_store_si128(bitbuf_tail++, z);
        _mm_store_si128(bitbuf_tail++, z);
        _mm_store_si128(bitbuf_tail++, z);
        _mm_store_si128(bitbuf_tail++, z);
        left_to_z -= 64;
    }

    while (left_to_z >= 16) {
        _mm_store_si128(bitbuf_tail++, z);
        left_to_z -= 16;
    }

    uint8_t *tail_bytes = (uint8_t*)bitbuf_tail;
    while (left_to_z--) {
        *tail_bytes++ = 0;
    }

    ALIGNED_(16) uint64_t final[9] = {0};
    __m128i next12, next34, next56;
    next12 = z;
    next34 = z;
    next56 = z;

    for(; (i + 72 < len); i += 32) {
        __m128i in1in2, in3in4;
        __m128i in1in2_, in3in4_;
        __m128i ab1, ab2, ab3, ab4;
        __m128i cd1, cd2, cd3, cd4;

        READ_NEXT(input, i, in1in2, in3in4);
        READ_NEXT(bitbuffer, i, in1in2_, in3in4_);

        in1in2 = _mm_xor_si128(_mm_xor_si128(in1in2, in1in2_), next12);
        in3in4 = _mm_xor_si128(in3in4, in3in4_);

        NEXT_ROUND(in1in2, ab1, ab2, ab3, ab4);

        __m128i a2_ = _mm_slli_si128(ab2, 8);
        __m128i ab1_next34 = _mm_xor_si128(next34, ab1);
        in3in4 = _mm_xor_si128(in3in4, ab1_next34);
        in3in4 = _mm_xor_si128(a2_, in3in4);
        NEXT_ROUND(in3in4, cd1, cd2, cd3, cd4);

        __m128i b2c2 = _mm_alignr_epi8(cd2, ab2, 8);
        __m128i a4_ = _mm_slli_si128(ab4, 8);
        a4_ = _mm_xor_si128(b2c2, a4_);
        next12 = _mm_xor_si128(ab3, a4_);
        next12 = _mm_xor_si128(next12, cd1);

        __m128i d2_ = _mm_srli_si128(cd2, 8);
        __m128i b4c4 = _mm_alignr_epi8(cd4, ab4, 8);
        next12 = _mm_xor_si128(next12, next56);
        next34 = _mm_xor_si128(cd3, _mm_xor_si128(b4c4, d2_));
        next56 = _mm_srli_si128(cd4, 8);
    }

    memcpy(final, input+(i / sizeof(uint64_t)), len-i);
    __m128i *final128 = (__m128i*)final;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next12));
    ++final128;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next34));
    ++final128;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next56));

    uint8_t* final_bytes = (uint8_t*) final;

    for(size_t j = 0; j < (len-i); j++) {
        crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i)]) & 0xff] ^ (crc >> 8);
    }
    return crc;
}
Check notice (Code scanning / CodeQL): Block with too many statements.
Caution: Review failed. An error occurred during the review process. Please try again later.
0211aba to 85f3b6b (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (2)
arch/x86/chorba_sse41.c (2)
13-15: Consider placing externs in a shared header. These `extern` declarations (e.g., `crc32_braid_base`) might be better placed in dedicated header files instead of localizing them within this C file. That helps maintain a single source of truth.
63-293: Complex loop structure could benefit from modularization. Multiple loop stages handle data ranges differently and rely on SSE alignment. Extracting repeating patterns or consistent steps into helper functions can improve maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (3 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- arch/x86/x86_features.h
- functable.c
- arch/x86/Makefile.in
- cmake/detect-intrinsics.cmake
- win32/Makefile.msc
- arch/x86/x86_features.c
- arch/x86/x86_functions.h
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (3)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-03-26T15:10:53.588Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-03-26T15:10:53.588Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
🧬 Code Definitions (3)
test/benchmarks/benchmark_crc32.cc (1)
arch/x86/x86_functions.h (2)
crc32_chorba_sse2 (29-29), crc32_chorba_sse41 (41-41)
arch/x86/chorba_sse41.c (1)
arch/x86/x86_functions.h (1)
crc32_chorba_sse41 (41-41)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
crc32_chorba_sse2 (29-29), crc32_chorba_sse41 (41-41)
🔇 Additional comments (16)
test/benchmarks/benchmark_crc32.cc (1)
72-76: LGTM! Successfully adds SSE4.1 benchmark support. The changes properly integrate the new SSE4.1 implementation into the benchmark framework. You've added the conditional check for X86_SSE41 and registered the benchmark with the appropriate CPU feature check, maintaining the same pattern used for other architecture-specific benchmarks.
test/test_crc32.cc (2)
196-197: LGTM! Added additional test data length. The new test case with a buffer length of 16384 extends test coverage to a more comprehensive range of input sizes, which is valuable for validating the CRC32 implementation across different workloads.
285-290: LGTM! Correctly added SSE4.1 test support. The new test configuration for the SSE4.1 implementation is properly guarded by the appropriate preprocessor directives and follows the established pattern in the codebase. The test will be skipped gracefully when SSE4.1 support is not available.
configure (3)
114-114: LGTM! Added SSE4.1 compiler flag definition. The addition of the SSE4.1 flag variable follows the existing pattern for other instruction set flags.
1537-1551: LGTM! Well-implemented SSE4.1 intrinsics detection. The `check_sse41_intrinsics()` function properly tests for compiler support of SSE4.1 intrinsics by compiling a small test program that uses the `_mm_insert_epi64` intrinsic, which is specific to SSE4.1. This approach is consistent with how other instruction set extensions are detected in the script.
1680-1687: LGTM! Correctly integrated SSE4.1 into the build system. The code properly calls the SSE4.1 detection function and, if supported, updates the compiler flags and object files list to include the SSE4.1-optimized implementation. This approach is consistent with how other architecture-specific optimizations are enabled.
CMakeLists.txt (3)
128-129: SSE4.1 and SSE4.2 dependent options look consistent. By making SSE4.2 dependent on SSE4.1, you ensure proper incremental enablement of instruction sets. This appears appropriate, as SSE4.2 builds upon SSE4.1.
149-149: Marking SSE4.1 as advanced is appropriate. Including `WITH_SSE41` in advanced options aligns it with other architecture-specific toggles like `WITH_SSSE3` and `WITH_SSE42`.
1495-1495: Feature info addition is consistent. Adding `add_feature_info(WITH_SSE41 ...)` ensures SSE4.1 availability is clearly reported in the CMake feature summary.
arch/x86/chorba_sse41.c (7)
1-2: Guarding SSE4.1 Chorba code. The `#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41)` condition cleanly isolates this specialized code and prevents accidental compilation on unsupported targets.
3-12: Includes and headers look valid. All necessary headers are referenced, and SSE/SIMD intrinsics are included. This is aligned with SSE4.1 usage.
28-62: Partial zeroing strategy is clear. Zeroing out only the parts actually read is a good performance optimization. Ensure that offsets (128, 144, 182) align with the loops' usage elsewhere to avoid missing memory areas.
295-320: Tail zeroing logic appears correct. The loop systematically stores zero vectors based on remaining length. This prevents processing stale data from the buffer.
321-374: Final block processing is coherent. Loading the leftover bytes into `final[]` and XORing with SSE registers aligns with the partial-block approach. This final step ensures leftover data is integrated into the CRC.
376-406: Runtime fallback strategy is appropriate. The function checks alignment, calls SSE4.1, SSE2, or fallback logic depending on buffer size. This layering avoids duplication and efficiently selects the right method.
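The selection layer described here, and the functable assignments elsewhere in this PR, follow a common pattern: detect features once, then publish the best implementation through a function pointer. A plain-C sketch of that pattern (the names `select_crc32` and the stub implementations are illustrative, not zlib-ng's actual functable symbols):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the real implementations; each returns a
 * distinct value so the selection is easy to observe. */
static uint32_t crc32_generic_stub(uint32_t crc, const uint8_t *buf, size_t len) {
    (void)buf; (void)len; return crc ^ 1u;
}
static uint32_t crc32_sse2_stub(uint32_t crc, const uint8_t *buf, size_t len) {
    (void)buf; (void)len; return crc ^ 2u;
}
static uint32_t crc32_sse41_stub(uint32_t crc, const uint8_t *buf, size_t len) {
    (void)buf; (void)len; return crc ^ 4u;
}

typedef uint32_t (*crc32_func)(uint32_t, const uint8_t *, size_t);

/* One-time selection: newer feature sets are checked first so the most
 * capable variant wins when the CPU supports it. */
static crc32_func select_crc32(int has_sse2, int has_sse41) {
    if (has_sse41) return crc32_sse41_stub;
    if (has_sse2)  return crc32_sse2_stub;
    return crc32_generic_stub;
}
```

In the real functable the feature flags come from the CPUID-based detection in `arch/x86/x86_features.c`; this sketch just takes them as parameters.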
16-26: ❓ Verification inconclusive
Macros perform SSE loads and transforms; double-check bounds. The `READ_NEXT` and `NEXT_ROUND` macros rely on `_mm_load_si128` and shifting. Ensure calling loops do not exceed buffer bounds, as these macros have no built-in checks. Would you like to verify usage of these macros across the codebase to confirm that every call site guarantees sufficient alignment and buffer size?
🏁 Script executed:
```shell
#!/bin/bash
# Searching for READ_NEXT and NEXT_ROUND usage outside this file (if any).
rg --multiline 'READ_NEXT|NEXT_ROUND'
```
Length of output: 2561
Ensure Calling Code Maintains Buffer Bounds and Alignment for SSE Macros
After running a repository-wide search, we see that both the `READ_NEXT` and `NEXT_ROUND` macros are used exclusively in the files below:
- arch/x86/chorba_sse41.c
- arch/x86/chorba_sse2.c
Please make sure that in each of these call sites the following are verified:
- The pointer arguments (e.g., `input`, `bitbuffer`) are properly aligned for `_mm_load_si128`.
- The loop index or offset (`i`) guarantees that accesses stay within the allocated buffer bounds.
- Any commented-out variants (such as `READ_NEXT_UNALIGNED`) are intentionally omitted and their removal will not affect safety guarantees.
These checks will prevent potential undefined behavior due to overreaching loads or misaligned memory accesses.
85f3b6b to 6eed76a (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
arch/x86/x86_functions.h (1)
123-126: 🛠️ Refactor suggestion
Consider adding the `__SSE4_1__` predefined macro check for native builds. When building natively, we should check both the compiler feature flag and the predefined macro to ensure SSE4.1 is actually available.
Apply this diff to improve the condition:
```diff
-# if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && !defined(NO_CHORBA_SSE)
+# if !defined(WITHOUT_CHORBA) && defined(X86_SSE41) && defined(__SSE4_1__) && !defined(NO_CHORBA_SSE)
 #  undef native_crc32
 #  define native_crc32 crc32_chorba_sse41
 # endif
```
🧹 Nitpick comments (2)
arch/x86/x86_functions.h (1)
14-14: Renamed macro could be improved for clarity. The macro `NO_CHORBA_SSE2` has been renamed to `NO_CHORBA_SSE`, which makes sense as it now covers all SSE variants (including the new SSE41 implementation). However, this change should be accompanied by documentation explaining the purpose of this macro for future maintainers. Consider adding a brief comment explaining the purpose of this macro:
```diff
 #if defined(_MSC_VER) && !defined(_M_AMD64) && _MSC_VER >= 1920 && _MSC_VER <= 1929
-#define NO_CHORBA_SSE
+/* Disable SSE-based CRC32 chorba implementations due to MSVC stack corruption bug */
+#define NO_CHORBA_SSE
 #endif
```
arch/x86/chorba_sse41.c (1)
186-232: Complex loop logic readability. The loop condition depends on `i + 300*8+64 < len` and includes branching for offsets. To maintain clarity, consider extracting loop bodies into smaller functions if possible. This would improve readability and testing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (4 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- arch/x86/x86_features.h
- arch/x86/x86_features.c
- functable.c
- arch/x86/Makefile.in
- test/benchmarks/benchmark_crc32.cc
- win32/Makefile.msc
- test/test_crc32.cc
- cmake/detect-intrinsics.cmake
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (4)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-04-02T20:18:01.745Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Definitions (2)
arch/x86/chorba_sse41.c (1)
arch/x86/x86_functions.h (1)
crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
crc32 (14-14)
🔇 Additional comments (15)
arch/x86/x86_functions.h (1)
39-43: LGTM: Clean implementation of SSE4.1 function declaration. The declaration of the new CRC32 function for SSE4.1 is properly conditional on `WITHOUT_CHORBA`, matching the existing pattern for the SSE2 implementation.
configure (4)
114-114: LGTM: Added compiler flag for SSE4.1. The SSE4.1 compiler flag is correctly defined alongside other x86 instruction set flags.
1537-1551: LGTM: Well-implemented SSE4.1 intrinsics detection function. The function correctly tests for SSE4.1 support by compiling a small test program that uses SSE4.1-specific intrinsics (`_mm_min_epi32`), following the same pattern as other instruction set detection functions in the script.
1680-1687: LGTM: Properly implemented SSE4.1 build configuration. The implementation correctly sets up the build flags and adds the object files for the SSE4.1 implementation when support is detected. This follows the established pattern for other instruction set extensions in the script.
2235-2235: LGTM: Added SSE41FLAG variable in the build configuration. The code correctly adds the SSE41FLAG variable to the Makefile generation, ensuring the compiler flags are properly passed when building the SSE4.1 implementation.
CMakeLists.txt (4)
128-129: Maintain consistency with instruction set dependencies. Defining SSE4.1 to depend on SSSE3 and SSE4.2 to depend on SSE4.1 is correct since SSE4.1 is indeed a superset of SSSE3, and SSE4.2 is a superset of SSE4.1. This ensures that users can't accidentally enable incompatible options.
149-149: Add SSE41-related entries to advanced settings. Including `WITH_SSE41` in the advanced options is consistent with the other SSE variants. No further issues detected.
995-1005: Correctly conditionally building SSE4.1 code. The logic to check for SSE4.1 intrinsics and define `X86_SSE41` if supported (and SSSE3 is enabled) is sound. This properly avoids build issues on unsupported compilers.
1495-1495: Feature info for SSE41. Adding the SSE41 feature info clarifies the build output and helps users confirm that SSE4.1 is enabled at configure time.
arch/x86/chorba_sse41.c (6)
arch/x86/chorba_sse41.c (6)
1-2: Conditional compilation guard. The `#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41)` condition cleanly isolates this specialized code and prevents accidental compilation on unsupported targets.
28-34: Macro usage clarity. The `REALIGN_CHORBA` macro now clearly references the correct parameters (in0, in1, … out0, out1, …). This resolves the prior naming mismatch that could cause confusion.
312-343: ✅ Verification successful
Alignment logic and threshold checks.
- The use of `uintptr_t` for the alignment calculation in `crc32_chorba_sse41` is correct to avoid type mismatch.
- The threshold checks for "large," "medium," and "small" Chorba variants appear consistent, but ensure they match project-wide definitions to avoid subtle performance or correctness issues.
Please confirm that `CHORBA_LARGE_THRESHOLD`, `CHORBA_MEDIUM_LOWER_THRESHOLD`, and `CHORBA_MEDIUM_UPPER_THRESHOLD` are defined consistently in the codebase:
🏁 Script executed:
```shell
#!/bin/bash
rg 'CHORBA_(LARGE_THRESHOLD|MEDIUM_LOWER_THRESHOLD|MEDIUM_UPPER_THRESHOLD)'
```
Length of output: 785
Verified: Alignment Logic and Thresholds Consistent
The CHORBA threshold macros in `crc32.h` are defined as follows:
- `CHORBA_LARGE_THRESHOLD` as `(sizeof(z_word_t) * 64 * 1024)`
- `CHORBA_MEDIUM_LOWER_THRESHOLD` as `8192`
- `CHORBA_MEDIUM_UPPER_THRESHOLD` as `32768`
These definitions are used consistently in `arch/x86/chorba_sse41.c` (as well as in related files such as `arch/x86/chorba_sse2.c` and `arch/generic/crc32_c.c`). The alignment calculation using `uintptr_t` and the threshold checks for the "large," "medium," and "small" cases are therefore correct and adhere to the project-wide definitions. No further changes are required for this section.
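The thresholds confirmed above can be turned into a small size-based dispatch sketch. Only the constant values come from `crc32.h`; the enum, the `pick_variant` helper, and the assumption that `z_word_t` is 8 bytes are illustrative, and the actual checks in `crc32_chorba_sse41` may order or combine them differently:

```c
#include <stddef.h>

#define CHORBA_MEDIUM_LOWER_THRESHOLD 8192
#define CHORBA_MEDIUM_UPPER_THRESHOLD 32768
/* In zlib-ng this is sizeof(z_word_t) * 64 * 1024; assume 8-byte words here. */
#define CHORBA_LARGE_THRESHOLD (8 * 64 * 1024)

enum chorba_variant { CHORBA_SMALL, CHORBA_MEDIUM, CHORBA_LARGE };

/* Hypothetical selector mirroring the kind of length checks the SSE4.1
 * entry point performs before choosing a Chorba variant. */
static enum chorba_variant pick_variant(size_t len) {
    if (len >= CHORBA_LARGE_THRESHOLD)
        return CHORBA_LARGE;
    if (len >= CHORBA_MEDIUM_LOWER_THRESHOLD && len <= CHORBA_MEDIUM_UPPER_THRESHOLD)
        return CHORBA_MEDIUM;
    return CHORBA_SMALL;  /* small variant or braid fallback */
}
```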
257-310: ❓ Verification inconclusive
Final partial CRC calculation with table lookups. You combine leftover data from both the input buffer and the locally stored data in `bitbuffer`. Verify that `(j + i)` accesses in `bitbufferbytes[(j + i)]` never exceed the size of `bitbuffer`. The partial zeroing is correct, but it's crucial to confirm that offsets remain within bounds.
🏁 Script executed:
```shell
#!/bin/bash
# Ensures j+i < 32768, given that bitbuffer is 32768 bytes and i max in loops
rg 'bitbufferbytes\[\(j\+i\)' -A 5 -B 5
```
Length of output: 1835
Ensure (j + i) Always Falls Within Bitbuffer Limits
The current x86 SSE4.1 implementation loops over the remaining bytes with the condition `j < (len - i)` and accesses `bitbufferbytes[(j+i)]` without using modulo arithmetic, unlike the generic CRC function, which sometimes applies `% bitbuffersizebytes`. Please confirm that the incoming length (`len`) is constrained (i.e., ≤ 32768 bytes) or that other preconditions guarantee `(j + i)` will never exceed the allocated bitbuffer size. This review is a reminder to verify that those assumptions are explicitly enforced elsewhere in the code or via input validation.
59-95: ✅ Verification successful
Zero-initialization blocks require careful boundary checks. Multiple segments of partial zeroing are performed using `_mm_store_si128`. Ensure the computed pointers (`bitbuf144`, `bitbuf182`, etc.) remain within `bitbuffer`'s bounds for all possible data sizes and loop iterations. Could you confirm these partial zeroing loops do not write out of `bitbuffer`?
🏁 Script executed:
```shell
#!/bin/bash
# Grep for bitbuf144, bitbuf182, etc. usage, verifying loop limits
rg 'bitbuf(144|182|300)' -A 5 -B 5
```
Length of output: 7694
Boundaries Confirmed for Partial Zeroing
After verifying the pointer arithmetic and loop iterations:
- `bitbuf144` is set to `&bitbuffer_v[72]` and the loop iterates 11 times, using `_mm_store_si128(&z_cursor[8], z)`. Even on the last iteration, the write occurs at `bitbuffer_v[82 + 8]` (i.e., index 90).
- `bitbuf182` is set to `&bitbuffer_v[91]` and the loop iterates 4 times via the `STORE4` macro, ensuring the maximum written index is around 97.
- `bitbuf300` is similarly computed with a constant offset.
Given that the underlying `bitbuffer` is sized to hold 32768 bytes (which, when reinterpreted as an array of `__m128i`, offers 2048 elements), all pointer operations remain safely within bounds. No out-of-bound writes are observed based on the constant offsets and loop counts. Please ensure that any future modifications to these constants or the buffer size continue to uphold these boundary conditions.
16-19: ❓ Verification inconclusive
Ensure pointer arithmetic safety. The macro uses `(in + off / sizeof(uint64_t))` to compute addresses. Confirm that the caller always provides valid offsets so we never read past the buffer boundary. Could you verify that all invocations of `READ_NEXT` are guaranteed in-bounds?
🏁 Script executed:
```shell
#!/bin/bash
# Searches for all lines calling READ_NEXT and checks for code paths ensuring 'off' remains valid.
rg 'READ_NEXT\s*\([^,]+,\s*[^,]+,\s*[^,]+,\s*[^,]+\)'
```
Length of output: 938
Action: Verify pointer arithmetic safety ensuring valid offsets
The `READ_NEXT` macro is used in both `arch/x86/chorba_sse41.c` and `arch/x86/chorba_sse2.c` (e.g. via invocations like `READ_NEXT(input, i, in1in2, in3in4)` and `READ_NEXT(bitbuffer, i, in1in2_, in3in4_)`). However, our search shows that the caller consistently passes an offset variable (typically `i`), but it isn't immediately clear from the static context that every such usage is preceded by explicit checks ensuring that `i` (and the subsequent accesses computed as `in + off / sizeof(uint64_t)` and `in + off / sizeof(uint64_t) + 2`) remains in bounds.
Please verify that:
- Every caller invoking `READ_NEXT` performs proper bounds checking to ensure the computed memory accesses stay within the buffer.
- The offset variable (e.g. `i`) along with any arithmetic adjustments never leads to out-of-bound reads.
Missing
Are you looking at the wrong branch? It's definitely in there:
@KungFuJesus
6e5877f to 196ac09 (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)
28-34: Validate macro parameter consistency. `REALIGN_CHORBA()` manipulates four input vectors (in0, in1, in2, in3). Ensure that references within the macro consistently match the macro's parameters. Typos or mismatched references can lead to subtle bugs in vector re-alignment.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (4 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- arch/x86/x86_features.h
- test/benchmarks/benchmark_crc32.cc
- arch/x86/x86_features.c
- functable.c
- test/test_crc32.cc
- cmake/detect-intrinsics.cmake
- arch/x86/x86_functions.h
- win32/Makefile.msc
- arch/x86/Makefile.in
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (4)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-04-02T20:18:01.745Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-04-02T20:18:01.745Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
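The claim about this alignment expression can be checked exhaustively in a few lines. With the trailing `& 15`, the computed pad always falls in [0, 15] and `p + pad` lands on a 16-byte boundary; the helper name below is just for illustration:

```c
#include <stdint.h>

/* Pad needed to reach the next 16-byte boundary (0 when already aligned),
 * using the same expression as the Chorba SSE entry points. */
static uintptr_t align16_pad(uintptr_t p) {
    return ((uintptr_t)16 - (p & 15)) & 15;
}
```

Because `p & 15` only has 16 possible values, the full input space of the expression can be covered by testing any 16 consecutive addresses.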
⏰ Context from checks skipped due to timeout of 90000ms (115)
- GitHub Check: macOS GCC UBSAN (ARM64)
- GitHub Check: macOS Clang Symbol Prefix
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: Ubuntu GCC Compat Symbol Prefix
- GitHub Check: Ubuntu GCC PPC
- GitHub Check: Ubuntu GCC MIPS
- GitHub Check: Ubuntu GCC PPC64LE
- GitHub Check: Ubuntu GCC
- GitHub Check: Ubuntu GCC MIPS64
- GitHub Check: Ubuntu GCC ARM HF
- GitHub Check: Ubuntu GCC AARCH64
- GitHub Check: Fuzzing
🔇 Additional comments (7)
configure (2)
114-114: Good addition of the SSE4.1 compiler flag. This is consistent with the existing approach for other SSE flags (e.g., sse2flag, sse42flag). It ensures that the build will correctly invoke the compiler to generate SSE4.1 instructions whenever available.
1537-1552: Consistent intrinsics check function. This new `check_sse41_intrinsics()` function aligns well with the pattern used for checking other CPU intrinsics. It creates a test file that uses `_mm_min_epi32` from `<smmintrin.h>` and tries to compile with the `-msse4.1` flag. This approach is straightforward and ensures reliable auto-detection of SSE4.1 support.
CMakeLists.txt (3)
128-129: New build options for SSE4.1 and SSE4.2. Declaring:
```cmake
cmake_dependent_option(WITH_SSE41 "Build with SSE41" ON "WITH_SSSE3" OFF)
cmake_dependent_option(WITH_SSE42 "Build with SSE42" ON "WITH_SSE41" OFF)
```
is logical since SSE4.1 depends on SSSE3 (Intel's partial superset path), and SSE4.2 depends on SSE4.1. This guards against impossible configs when toggling one without the other.
149-149: Properly marking SSE4.1 as advanced. Adding `WITH_SSE41` alongside the existing SSE selections keeps your CMake options consistent with the rest of the advanced CPU feature toggles.
495-495: Ensuring feature info display for SSE4.1. The call to `add_feature_info(WITH_SSE41 WITH_SSE41 "Build with SSE41")` ensures the feature summary accurately reflects SSE4.1's optional status.
arch/x86/chorba_sse41.c (2)
1-2: Build guard for SSE4.1-based Chorba CRC. Wrapping this code with `#if !defined(WITHOUT_CHORBA) && defined(X86_SSE41)` is consistent with other SSE-based CRC implementations. It keeps the file excluded if SSE4.1 or Chorba is disabled, preventing unintended compilation on unsupported targets.
16-19: ❓ Verification inconclusive
Check potential pointer offset calculations. In `READ_NEXT(in, off, a, b)`, you compute pointer offsets with `(in + off / sizeof(uint64_t))`. Ensure that all callers pass offsets in multiples of 8 bytes to avoid partial reads and alignment issues. Otherwise, consider a safer boundary check or a clearer pointer arithmetic approach. Would you like to run a script to search for all invocations of `READ_NEXT` to confirm that offsets maintain the required alignment or size assumptions?
🏁 Script executed:
```shell
#!/bin/bash
rg -A3 "READ_NEXT\s*\("
```
Length of output: 3777
Pointer Offset Alignment Caution in READ_NEXT Macro
The `READ_NEXT(in, off, a, b)` macro computes offsets via `(in + off / sizeof(uint64_t))` and is invoked in both `arch/x86/chorba_sse41.c` and `arch/x86/chorba_sse2.c` with the offset variable `i`. Please verify that all callers guarantee that the offset (`i`) is always a multiple of 8 bytes. If there's any chance that this assumption might be violated, consider adding an assertion, boundary check, or clearer pointer arithmetic to enforce proper alignment and prevent partial reads.
```c
__m128i init_crc = _mm_cvtsi64x_si128(crc);
crc = 0;

size_t i = 0;
```
Duplicate variable name with no restricting scope.
I'm not sure what you mean, the loops above are the restricted scope of i, it does not live outside of the for loop declaration.
In C, outer scope starts when function starts. To be valid ANSI C, it would require explicit block scope. Some compilers will issue warning if variable in inner scope shadows variable in outer scope, but that warning can be suppressed.
I'm fairly certain declaring the variables in the loop parameters is the explicit scope block. C89 at one point wouldn't allow you to declare variables there, but I don't know that we are trying to cater that far back. GNU's C extension and anything C99-compliant accept variable declarations there.
```c
#include <stdlib.h>
int main(void)
{
    for (int i = 0; i < 20; ++i) {};
    return i;
}
```
```
[adam@thinkpad ~]$ gcc test_scope.c -o test.out
test_scope.c: In function 'main':
test_scope.c:6:12: error: 'i' undeclared (first use in this function)
    6 |     return i;
      |            ^
test_scope.c:6:12: note: each undeclared identifier is reported only once for each function it appears in
```
```c
/* Couldn't tell you why but despite knowing that this is always false,
 * removing this branch with GCC makes things significantly slower. Some
 * loop bodies must be being joined or something */
```
Abusing for loop for ranges with unknown start value disables some optimisations as compiler can't deduce how many iterations are run. The if block(s) pretty much give the missing information even though the conditions are reversed.
Optimising compiler should in theory duplicate the loop with known start values from if loops.
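The transformation being discussed can be sketched in plain C (this is an illustrative example, not code from the PR): a loop whose start value is only known at runtime has an unknown trip count, while splitting off a phase with a compile-time-known bound lets the compiler unroll or vectorize that phase. Both forms compute the same result.

```c
#include <assert.h>
#include <stddef.h>

/* One loop with a runtime start value: trip count unknown at compile time. */
static long sum_single(const long *v, size_t start, size_t n) {
    long s = 0;
    for (size_t i = start; i < n; i++)
        s += v[i];
    return s;
}

/* Same work split into phases. The first loop has a constant upper bound
 * (8, chosen arbitrarily here), which the optimizer can fully unroll. */
static long sum_split(const long *v, size_t start, size_t n) {
    long s = 0;
    size_t i = start;
    for (; i < 8 && i < n; i++)
        s += v[i];
    for (; i < n; i++)
        s += v[i];
    return s;
}
```

The chorba loops do the reverse of this sketch (bounded early phases followed by a general loop), but the compiler-visibility argument is the same.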
Hmm, possibly. I haven't compared the compiled code to see exactly what's happening but yeah it could be that.
Yeah, removing the condition there seems to be generating a lot more code so yeah I'm guessing one or more of the loops are being joined together.
Force-pushed 196ac09 to f12e8e5 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (4)
arch/x86/chorba_sse41.c (4)
59-309: Function is large and repetitive, consider modularizing.
crc32_chorba_32768_nondestructive_sse41 is quite large and includes multiple loops handling different ranges of i. Splitting it into smaller inline helper functions (e.g., chunk realignment, XOR mixing, tail zeroing) may improve clarity without hurting performance too much.
106-183: Repeated loop logic can be consolidated. There are multiple loop segments (lines 106-133, 135-158, 160-183) that load, realign, and store data with minimal differences. Consider merging or parameterizing these blocks for a more concise implementation, balancing readability with performance.
217-220: Confirm necessity of this seemingly no-op branch. The comment indicates removing the branch harms performance. This often hints at subtle compiler-specific optimizations. Keep it if beneficial, but add a clearer explanatory comment for maintainers.
324-333: Replace magic number 72 with a named constant. Using a recognized constant (like CHORBA_SMALL_THRESHOLD_64BIT) instead of hardcoding 72 clarifies the boundary logic and aligns with other threshold macros.

-    } else if (aligned_len > 72) {
+    } else if (aligned_len > CHORBA_SMALL_THRESHOLD_64BIT) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (4 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- arch/x86/x86_features.h
- arch/x86/x86_features.c
- test/benchmarks/benchmark_crc32.cc
- arch/x86/Makefile.in
- cmake/detect-intrinsics.cmake
- win32/Makefile.msc
- functable.c
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Graph Analysis (2)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
crc32_chorba_sse2 (29-29), crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
crc32(14-14)
⏰ Context from checks skipped due to timeout of 90000ms (160)
- GitHub Check: Ubuntu MinGW x86_64
- GitHub Check: Ubuntu GCC SPARC64
- GitHub Check: Ubuntu GCC PPC
- GitHub Check: Ubuntu GCC RISC-V
- GitHub Check: Ubuntu Clang PPC64 Power9
- GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
- GitHub Check: Ubuntu GCC MIPS
- GitHub Check: Ubuntu GCC PPC64 Power9
- GitHub Check: Ubuntu GCC ARM HF No NEON ASAN
- GitHub Check: Ubuntu GCC SSE4.2 UBSAN
- GitHub Check: Ubuntu GCC SSSE3 UBSAN
- GitHub Check: Ubuntu GCC SSE2 UBSAN
- GitHub Check: Ubuntu GCC ARM SF ASAN
- GitHub Check: Ubuntu GCC Compat Symbol Prefix
- GitHub Check: Ubuntu GCC -O1 UBSAN
- GitHub Check: Ubuntu GCC Native Instructions (AVX)
🔇 Additional comments (20)
test/test_crc32.cc (3)
196-197: Test suite now includes additional 16KB buffer size. The new test case follows the existing pattern and validates the CRC32 function with a 16KB buffer in addition to the existing 32KB test.
285-287: Correctly adjusted SSE feature check condition. The change from X86_SSE2 to X86_SSE ensures the test is properly guarded by the appropriate feature macro.
288-290: Added correctly guarded SSE4.1 test case. The new test case for chorba_sse41 is properly added with the correct conditional compilation directive, ensuring it only runs when the CPU supports SSE4.1 instructions.
arch/x86/x86_functions.h (4)
14-14: Properly renamed SSE capability macro. Renaming NO_CHORBA_SSE2 to NO_CHORBA_SSE makes the macro more generic, allowing it to be used for both SSE2 and SSE4.1 variations, which improves maintainability.
39-43: Well-structured SSE4.1 function declaration. The new section correctly introduces the SSE4.1 function with proper conditional compilation and follows the same pattern used for other instruction sets.
108-108: Updated SSE2 native support with the new macro name. The reference to NO_CHORBA_SSE is consistent with the macro renaming earlier in the file.
123-126: Added proper SSE4.1 capability detection. The conditional definition correctly adds SSE4.1 support with the proper compiler detection macros, including __SSE4_1__ as suggested in previous reviews.
configure (3)
114-114: Added SSE4.1 compiler flag. The sse41flag variable is correctly defined for detecting and enabling SSE4.1 support.
1537-1551: Well-implemented SSE4.1 intrinsics detection. The check_sse41_intrinsics function follows the established pattern for intrinsics detection, using the appropriate header file and intrinsic function to test for SSE4.1 support.
1680-1687: Correctly integrated SSE4.1 detection in x86 optimization setup. The code to check for SSE4.1 support and set up the necessary flags and object files is inserted at the appropriate position in the build flow, between SSSE3 and SSE4.2.
CMakeLists.txt (5)
128-129: Properly ordered CPU feature dependency chain. The SSE4.1 option is correctly added with a dependency on SSSE3, and SSE4.2 now depends on SSE4.1 instead of SSSE3, reflecting the actual dependency chain of x86 instruction sets.
149-149: Updated advanced options list to include SSE4.1. SSE4.1 is correctly added to the list of advanced options.
995-1005: Well-structured SSE4.1 intrinsics check and compilation setup. The SSE4.1 detection and setup follow the existing pattern, checking for both the feature option and intrinsics availability before enabling the feature.
1008-1008: Properly updated SSE4.2 dependency check. The condition correctly checks for WITH_SSE41 instead of WITH_SSSE3, maintaining the proper dependency chain.
1495-1495: Added SSE4.1 to feature summary. SSE4.1 is correctly added to the feature summary, providing clear information about its availability in the build.
arch/x86/chorba_sse41.c (5)
16-19: Double-check pointer alignment in macro. _mm_load_si128() requires 16-byte aligned pointers. Though buffers appear properly aligned through ALIGNED_(16) in the surrounding code, confirm that the calling code never violates this assumption, especially when expanding or refactoring in the future. Would you like to run a script to search for all macro calls to READ_NEXT to ensure the pointers used are always aligned?
28-34: Consider naming consistency in macro parameters. For better maintainability, ensure the readable association between the macro parameters (e.g., in0, in1, etc.) and the internally referenced variables. While the current implementation appears consistent, keep an eye on potential copy-paste issues like mismatched variable names (as previously flagged in SSE2 macros).
238-255: Zeroing tail logic looks correct. Zeroing the remaining portion of the buffer prevents uninitialized reads in subsequent SSE operations. This carefully avoids out-of-bounds writes by decrementing left_to_z. Good approach.
268-271: Pointer arithmetic with offsets. Calls to READ_NEXT(input, i, …) and the subsequent XOR with bitbuffer rely on consistent scaling. Ensure that both input and bitbuffer are equally sized (uint64_t*) and remain 16-byte aligned to avoid platform discrepancies.
316-317: Alignment calculation is correct. Using uintptr_t for alignment offsets (line 317) is the recommended approach to avoid 32-bit truncation hazards. This matches best practices and effectively prevents overflow.
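The alignment calculation referenced here can be demonstrated in isolation. A minimal sketch (the helper name is illustrative, not from the PR): the expression returns the byte offset to the next 16-byte boundary, and after the final mask the result is always in [0, 15], so the uintptr_t arithmetic cannot overflow.

```c
#include <stdint.h>

/* Offset from buf to the next 16-byte boundary. Before the trailing "& 15"
 * the intermediate value lies in [1, 16]; the mask folds 16 back to 0,
 * yielding a result in [0, 15]. */
static uintptr_t align16_offset(const void *buf) {
    return ((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15;
}
```

An already-aligned pointer yields 0, and a pointer one byte past a boundary yields 15.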
Force-pushed f12e8e5 to 07ad6c6 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (3)
arch/x86/x86_functions.h (1)
14-14: NO_CHORBA_SSE macro may reduce code duplication. Defining NO_CHORBA_SSE resolves potential stack-smashing issues on certain compilers. Confirm that this macro doesn't unnecessarily block SSE usage in safe compilers.
arch/x86/chorba_sse41.c (2)
21-26: NEXT_ROUND macro performs multi-shift XOR operations. These bitwise transforms for the SSE pipeline appear correct. Double-check the shift counts for potential off-by-one or partial-lane mismatches.
36-41: STORE4 macro systematically stores four vectors. Batch-storing with _mm_store_si128 in sequential pointer increments is efficient. Investigate block-based memory prefetch if performance needs further improvement.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (4 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
- win32/Makefile.msc (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- arch/x86/x86_features.h
- arch/x86/x86_features.c
- functable.c
- test/benchmarks/benchmark_crc32.cc
- arch/x86/Makefile.in
- win32/Makefile.msc
- cmake/detect-intrinsics.cmake
- CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
🧬 Code Graph Analysis (2)
test/test_crc32.cc (1)
arch/x86/x86_functions.h (2)
crc32_chorba_sse2 (29-29), crc32_chorba_sse41 (41-41)
arch/x86/x86_functions.h (1)
arch/generic/generic_functions.h (1)
crc32(14-14)
🔇 Additional comments (14)
configure (3)
114-114: Add SSE4.1 flag for enhanced performance. Declaring sse41flag="-msse4.1" is correct for enabling SSE4.1 optimizations. Ensure consumers of this flag properly guard against environments lacking SSE4.1 support.
1537-1552: Check for SSE4.1 intrinsics looks good. This added function cleanly tests for SSE4.1 by compiling a minimal _mm_min_epi32 snippet. Ensure that subsequent build steps use HAVE_SSE41_INTRIN consistently to avoid silent fallback in unsupported compilers.
2235-2235: SSE41FLAG substitution verified. This substitution step for SSE41FLAG in the generated Makefile helps ensure the SSE4.1 compiler flag is applied. Check that the generated Makefile references do not conflict with other SSE flags.
test/test_crc32.cc (2)
196-197: Added mid-size buffer CRC test. Great to see test coverage for a 16K buffer. This ensures coverage for an intermediate size range and helps verify correctness of SSE-based CRC functions.
285-290: New SSE tests for chorba_sse2 and chorba_sse41. These conditional test macros extend coverage for SSE-based CRC calculations. Make sure future expansions (e.g., SSE4.2 variants) also follow a consistent naming approach.
arch/x86/x86_functions.h (3)
39-43: Conditional compilation for SSE4.1 chorba function. Segregating SSE4.1 logic under X86_SSE41 is a clean approach. Ensure the fallback path remains valid for systems without SSE4.1.
108-112: Conditionally define native_crc32 to crc32_chorba_sse2. This block properly gates SSE2-based CRC under the new NO_CHORBA_SSE. Ensure that other SSE definitions do not override or conflict inadvertently (e.g., SSE4.1).
123-126: Conditionally define native_crc32 to crc32_chorba_sse41. This ensures a direct mapping to SSE4.1-based CRC when available. Verify that runtime CPU detection (if enabled) and macro-based detection remain in sync for consistent SSE4.1 usage.
arch/x86/chorba_sse41.c (6)
1-2: SSE4.1 Chorba file inclusion guard. Good use of both WITHOUT_CHORBA and X86_SSE41 to restrict compilation to the correct builds.
16-19: Macro READ_NEXT introduces clarity. Using _mm_load_si128 with offset arithmetic is readable. Confirm that the buffer alignment assumptions (16-byte boundaries) always hold when used in other contexts.
28-34: REALIGN_CHORBA macro re-bases four vectors. The _mm_alignr_epi8 usage here looks correct. Ensure consistent usage of shift values and confirm that partial alignment cases are handled upstream.
43-48: READ4 macro systematically loads four vectors. Corresponds well with STORE4. This consistency reduces code duplication. Great naming alignment and approach.
50-57: READ4_WITHXOR macro merges existing data. Compactly applies _mm_xor_si128. Double-check that the indexing offset for in[] is always valid, especially for boundary cases.
311-341: crc32_chorba_sse41 offers a fully integrated SSE4.1 path.
- Alignment offset logic is straightforward and robust.
- The multi-threshold approach (large/medium/small) is a neat strategy for bridging different CRC methods.
Keep an eye on boundary tests to ensure correct transitions between “medium” and “small” thresholds.
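The large/medium/small dispatch mentioned above can be sketched as a plain size classifier. This is a hypothetical illustration: the threshold constants (32768 and 72, both of which appear elsewhere in this review) and the path names are illustrative, not zlib-ng's actual dispatch code.

```c
#include <stddef.h>

/* Illustrative size-based dispatch for an aligned CRC buffer. */
enum crc_path { PATH_SMALL, PATH_MEDIUM, PATH_LARGE };

static enum crc_path choose_path(size_t aligned_len) {
    if (aligned_len >= 32768)
        return PATH_LARGE;   /* e.g. the 32768-byte chorba kernel */
    else if (aligned_len > 72)
        return PATH_MEDIUM;  /* e.g. a small-chorba SSE variant */
    else
        return PATH_SMALL;   /* e.g. a table-driven fallback */
}
```

Boundary tests should pin down exactly the values 72, 73, and 32768 so the medium/small transition can't silently shift.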
static Z_FORCEINLINE uint32_t crc32_chorba_32768_nondestructive_sse41(uint32_t crc, const uint64_t* buf, size_t len) {
    const uint64_t* input = buf;
    ALIGNED_(16) uint64_t bitbuffer[32768 / sizeof(uint64_t)];
    __m128i *bitbuffer_v = (__m128i*)bitbuffer;
    const uint8_t* bitbufferbytes = (const uint8_t*) bitbuffer;
    __m128i z = _mm_setzero_si128();

    __m128i *bitbuf128 = &bitbuffer_v[64];
    __m128i *bitbuf144 = &bitbuffer_v[72];
    __m128i *bitbuf182 = &bitbuffer_v[91];
    __m128i *bitbuf210 = &bitbuffer_v[105];
    __m128i *bitbuf300 = &bitbuffer_v[150];
    __m128i *bitbuf0 = bitbuf128;
    __m128i *inptr = (__m128i*)input;

    /* We only need to zero out the bytes between the 128'th value and the 144th
     * that are actually read */
    __m128i *z_cursor = bitbuf128;
    for (size_t i = 0; i < 2; ++i) {
        STORE4(z, z, z, z, z_cursor);
    }

    /* We only need to zero out the bytes between the 144'th value and the 182nd that
     * are actually read */
    z_cursor = bitbuf144 + 8;
    for (size_t i = 0; i < 11; ++i) {
        _mm_store_si128(z_cursor++, z);
    }

    /* We only need to zero out the bytes between the 182nd value and the 210th that
     * are actually read. */
    z_cursor = bitbuf182;
    for (size_t i = 0; i < 4; ++i) {
        STORE4(z, z, z, z, z_cursor);
    }

    /* We need to mix this in */
    __m128i init_crc = _mm_cvtsi64x_si128(crc);
    crc = 0;

    size_t i = 0;

    /* Previous iteration runs carried over */
    __m128i buf144 = z;
    __m128i buf182 = z;
    __m128i buf210 = z;

    for(; i + 300*8+64 < len && i < 22 * 8; i += 64) {
        __m128i in12, in34, in56, in78,
                in_1, in23, in45, in67, in8_;

        READ4(in12, in34, in56, in78, inptr);

        if (i == 0) {
            in12 = _mm_xor_si128(in12, init_crc);
        }

        REALIGN_CHORBA(in12, in34, in56, in78,
                       in_1, in23, in45, in67, in8_, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);

        STORE4(a, in23, in45, in67, bitbuf144);
        buf144 = in8_;

        __m128i e = _mm_xor_si128(buf182, in_1);
        STORE4(e, in23, in45, in67, bitbuf182);
        buf182 = in8_;

        __m128i m = _mm_xor_si128(buf210, in_1);
        STORE4(m, in23, in45, in67, bitbuf210);
        buf210 = in8_;

        STORE4(in12, in34, in56, in78, bitbuf300);
    }

    for(; i + 300*8+64 < len && i < 32 * 8; i += 64) {
        __m128i in12, in34, in56, in78,
                in_1, in23, in45, in67, in8_;
        READ4(in12, in34, in56, in78, inptr);

        REALIGN_CHORBA(in12, in34, in56, in78,
                       in_1, in23, in45, in67, in8_, 8);

        __m128i a = _mm_xor_si128(buf144, in_1);

        STORE4(a, in23, in45, in67, bitbuf144);
        buf144 = in8_;

        __m128i e, f, g, h;
        e = _mm_xor_si128(buf182, in_1);
        READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
        STORE4(e, f, g, h, bitbuf182);

        __m128i m = _mm_xor_si128(buf210, in_1);
        STORE4(m, in23, in45, in67, bitbuf210);
        buf210 = in8_;

        STORE4(in12, in34, in56, in78, bitbuf300);
    }

    for(; i + 300*8+64 < len && i < 84 * 8; i += 64) {
        __m128i in12, in34, in56, in78,
                in_1, in23, in45, in67, in8_;
        READ4(in12, in34, in56, in78, inptr);

        REALIGN_CHORBA(in12, in34, in56, in78,
                       in_1, in23, in45, in67, in8_, 8);

        __m128i a, b, c, d;
        a = _mm_xor_si128(buf144, in_1);
        READ4_WITHXOR(b, c, d, buf144, in23, in45, in67, in8_, bitbuf144);
        STORE4(a, b, c, d, bitbuf144);

        __m128i e, f, g, h;
        e = _mm_xor_si128(buf182, in_1);
        READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
        STORE4(e, f, g, h, bitbuf182);

        __m128i m = _mm_xor_si128(buf210, in_1);
        STORE4(m, in23, in45, in67, bitbuf210);
        buf210 = in8_;

        STORE4(in12, in34, in56, in78, bitbuf300);
    }

    for(; i + 300*8+64 < len; i += 64) {
        __m128i in12, in34, in56, in78,
                in_1, in23, in45, in67, in8_;

        if (i < 128 * 8) {
            READ4(in12, in34, in56, in78, inptr);
        } else {
            in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
            in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
        }

        // [0, 145, 183, 211]

        /* Pre Penryn CPUs the unpack should be faster */
        REALIGN_CHORBA(in12, in34, in56, in78,
                       in_1, in23, in45, in67, in8_, 8);

        __m128i a, b, c, d;
        a = _mm_xor_si128(buf144, in_1);
        READ4_WITHXOR(b, c, d, buf144, in23, in45, in67, in8_, bitbuf144);
        STORE4(a, b, c, d, bitbuf144);

        __m128i e, f, g, h;
        e = _mm_xor_si128(buf182, in_1);
        READ4_WITHXOR(f, g, h, buf182, in23, in45, in67, in8_, bitbuf182);
        STORE4(e, f, g, h, bitbuf182);

        __m128i n, o, p;
        __m128i m = _mm_xor_si128(buf210, in_1);

        /* Couldn't tell you why but despite knowing that this is always false,
         * removing this branch with GCC makes things significantly slower. Some
         * loop bodies must be being joined or something */
        if (i < 84 * 8) {
            n = in23;
            o = in45;
            p = in67;
            buf210 = in8_;
        } else {
            READ4_WITHXOR(n, o, p, buf210, in23, in45, in67, in8_, bitbuf210);
        }

        STORE4(m, n, o, p, bitbuf210);
        STORE4(in12, in34, in56, in78, bitbuf300);
    }

    /* Second half of stores bubbled out */
    _mm_store_si128(bitbuf144, buf144);
    _mm_store_si128(bitbuf182, buf182);
    _mm_store_si128(bitbuf210, buf210);

    /* We also have to zero out the tail */
    size_t left_to_z = len - (300*8 + i);
    __m128i *bitbuf_tail = (__m128i*)(bitbuffer + 300 + i/8);
    while (left_to_z >= 64) {
        STORE4(z, z, z, z, bitbuf_tail);
        left_to_z -= 64;
    }

    while (left_to_z >= 16) {
        _mm_store_si128(bitbuf_tail++, z);
        left_to_z -= 16;
    }

    uint8_t *tail_bytes = (uint8_t*)bitbuf_tail;
    while (left_to_z--) {
        *tail_bytes++ = 0;
    }

    ALIGNED_(16) uint64_t final[9] = {0};
    __m128i next12, next34, next56;
    next12 = z;
    next34 = z;
    next56 = z;

    for(; (i + 72 < len); i += 32) {
        __m128i in1in2, in3in4;
        __m128i in1in2_, in3in4_;
        __m128i ab1, ab2, ab3, ab4;
        __m128i cd1, cd2, cd3, cd4;

        READ_NEXT(input, i, in1in2, in3in4);
        READ_NEXT(bitbuffer, i, in1in2_, in3in4_);

        in1in2 = _mm_xor_si128(_mm_xor_si128(in1in2, in1in2_), next12);
        in3in4 = _mm_xor_si128(in3in4, in3in4_);

        NEXT_ROUND(in1in2, ab1, ab2, ab3, ab4);

        __m128i a2_ = _mm_slli_si128(ab2, 8);
        __m128i ab1_next34 = _mm_xor_si128(next34, ab1);
        in3in4 = _mm_xor_si128(in3in4, ab1_next34);
        in3in4 = _mm_xor_si128(a2_, in3in4);
        NEXT_ROUND(in3in4, cd1, cd2, cd3, cd4);

        __m128i b2c2 = _mm_alignr_epi8(cd2, ab2, 8);
        __m128i a4_ = _mm_slli_si128(ab4, 8);
        a4_ = _mm_xor_si128(b2c2, a4_);
        next12 = _mm_xor_si128(ab3, a4_);
        next12 = _mm_xor_si128(next12, cd1);

        __m128i d2_ = _mm_srli_si128(cd2, 8);
        __m128i b4c4 = _mm_alignr_epi8(cd4, ab4, 8);
        next12 = _mm_xor_si128(next12, next56);
        next34 = _mm_xor_si128(cd3, _mm_xor_si128(b4c4, d2_));
        next56 = _mm_srli_si128(cd4, 8);
    }

    memcpy(final, input+(i / sizeof(uint64_t)), len-i);
    __m128i *final128 = (__m128i*)final;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next12));
    ++final128;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next34));
    ++final128;
    _mm_store_si128(final128, _mm_xor_si128(_mm_load_si128(final128), next56));

    uint8_t* final_bytes = (uint8_t*) final;

    for(size_t j = 0; j < (len-i); j++) {
        crc = crc_table[(crc ^ final_bytes[j] ^ bitbufferbytes[(j+i)]) & 0xff] ^ (crc >> 8);
    }
    return crc;
}
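The final loop in the function above is the classic reflected, table-driven CRC-32 byte update. A standalone sketch of just that update step (the table here is generated locally; zlib-ng's crc_table is precomputed, but its first 256 entries come from the same 0xEDB88320 polynomial):

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t table[256];

/* Build the reflected CRC-32 lookup table for polynomial 0xEDB88320. */
static void make_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        table[n] = c;
    }
}

/* Byte-at-a-time CRC-32 with the standard pre/post inversion. The loop body
 * has the same shape as the final fold in the SSE4.1 function:
 *   crc = table[(crc ^ byte) & 0xff] ^ (crc >> 8);
 */
static uint32_t crc32_bytes(uint32_t crc, const uint8_t *p, size_t len) {
    crc = ~crc;
    while (len--)
        crc = table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return ~crc;
}
```

The standard check value for this CRC ("123456789" hashing to 0xCBF43926) makes a convenient sanity test.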
🛠️ Refactor suggestion
crc32_chorba_32768_nondestructive_sse41 is a large, monolithic function.
- The zero-initialization of partial segments is well-labeled, but quite verbose.
- The branch to handle partial usage of bitbuf0 might be tricky to maintain.
- The multi-block realign loops with variable i ranges are sophisticated but can hamper readability.
Consider splitting into smaller helper functions for maintainability.
Dead2 left a comment:
LGTM
@KungFuJesus Sorry, this needs a rebase now.
Force-pushed 07ad6c6 to 0039ea0 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (3)
arch/x86/chorba_sse41.c (2)
106-133: Consider extracting loop phases for better maintainability. The main processing loop is split into multiple phases (lines 106, 135, 160, 185) with similar but slightly different implementations. While this likely optimizes for specific iteration counts, it makes the code harder to maintain. Consider extracting common patterns into helper functions.
328-328: Replace hardcoded constant with macro for consistency. Line 329 uses CHORBA_SMALL_THRESHOLD_64BIT for the small buffer threshold, but other parts of the codebase might use the direct value (72). For consistency and maintainability, consider using the macro everywhere.
test/benchmarks/benchmark_compare256.cc (1)
62-62: Consider broader or boundary-specific argument coverage.
Switching from a continuous range to fixed arguments is valid, though you might want to add edge cases (e.g., 2, 256, 257 if applicable) for better stress testing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- CMakeLists.txt (5 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunk)
- arch/x86/compare256_avx512.c (1 hunk)
- arch/x86/x86_features.c (1 hunk)
- arch/x86/x86_features.h (1 hunk)
- arch/x86/x86_functions.h (6 hunks)
- cmake/detect-intrinsics.cmake (1 hunk)
- configure (5 hunks)
- functable.c (3 hunks)
- test/benchmarks/benchmark_compare256.cc (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunk)
- test/test_compare256.cc (1 hunk)
- test/test_crc32.cc (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- arch/x86/x86_features.h
- arch/x86/x86_features.c
- test/benchmarks/benchmark_crc32.cc
- functable.c
- cmake/detect-intrinsics.cmake
- arch/x86/x86_functions.h
- test/test_crc32.cc
- configure
- arch/x86/Makefile.in
- CMakeLists.txt
🧰 Additional context used
🧠 Learnings (1)
arch/x86/chorba_sse41.c (7)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-21T01:42:40.488Z
Learning: In the SSE2-optimized Chorba CRC implementation (chorba_small_nondestructive_sse), the input buffer length is enforced to be a multiple of 16 bytes due to SSE2 operations, making additional checks for smaller alignments (like 8 bytes) redundant.
Learnt from: Dead2
PR: zlib-ng/zlib-ng#1837
File: arch/generic/crc32_c.c:19-29
Timestamp: 2025-01-23T22:01:53.422Z
Learning: The Chorba CRC32 functions (crc32_chorba_118960_nondestructive, crc32_chorba_32768_nondestructive, crc32_chorba_small_nondestructive, crc32_chorba_small_nondestructive_32bit) are declared in crc32_c.h.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:26-28
Timestamp: 2025-02-21T01:44:03.996Z
Learning: The alignment requirements for chorba_small_nondestructive_sse2 (16-byte alignment and multiple of 8 length) are enforced by its calling function, making additional checks redundant.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:830-831
Timestamp: 2025-02-21T01:37:54.508Z
Learning: The alignment calculation `((uintptr_t)16 - ((uintptr_t)buf & 15)) & 15` is safe from overflow as it's mathematically bounded between 1 and 16, making the use of uintptr_t appropriate for this pointer arithmetic.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1872
File: arch/x86/chorba_sse2.c:0-0
Timestamp: 2025-02-23T16:49:52.043Z
Learning: In zlib-ng, bounds checking for CRC32 computation is handled by the caller, not within the individual CRC32 implementation functions like `crc32_chorba_sse2`.
🧬 Code Graph Analysis (2)
test/test_compare256.cc (1)
arch/x86/x86_functions.h (1)
compare256_avx512(70-70)
test/benchmarks/benchmark_compare256.cc (1)
arch/x86/x86_functions.h (1)
compare256_avx512(70-70)
🔇 Additional comments (18)
arch/x86/chorba_sse41.c (10)
1-1: Correct guard condition. The preprocessor condition correctly guards the SSE4.1 implementation based on both feature support and CRC implementation choice.
28-34: LGTM: Macro implementation looks good. The `REALIGN_CHORBA` macro correctly uses SSE4.1's `_mm_alignr_epi8` instruction, which is a key performance improvement over the SSE2 implementation, as mentioned in the PR objectives.
61-71: Consider L1 cache-friendly buffer sizing. The stack buffer allocation of 32KB (`bitbuffer[32768 / sizeof(uint64_t)]`) aligns with the PR objective of optimizing for L1 cache sizes. The fixed offsets into this buffer are carefully chosen to optimize memory access patterns.
74-95: Selective zero initialization is a good optimization. The selective zeroing of only the parts of the buffer that will be read, rather than the entire buffer, matches the PR's objective of reducing initialization overhead.
217-231: Keep the performance-critical branch condition. The comment at line 217 explains why removing an apparently unnecessary condition degrades performance. This is a valuable insight about compiler optimization behavior and should be preserved.
238-254: Thorough tail zeroing implementation. The code carefully handles zeroing the tail bytes with progressively smaller granularity (64-byte, 16-byte, single-byte). This approach maximizes SIMD utilization while ensuring correctness.
262-293: Efficient SSE4.1 vectorized folding algorithm. This section implements the Chorba folding algorithm using SSE4.1 intrinsics like `_mm_alignr_epi8` instead of shifts, which aligns with the PR objective of improving performance by reducing shift operations.
316-317: Use uintptr_t consistently for pointer alignment arithmetic. The code correctly uses `uintptr_t` for pointer alignment arithmetic, which is safer than using integer types that might not match the platform's pointer width.
324-333: Well-structured dispatching logic based on buffer size. The function correctly dispatches to different implementations based on buffer size thresholds:
- Large buffers (>64KB): crc32_chorba_118960_nondestructive
- Medium buffers (8KB-32KB): crc32_chorba_32768_nondestructive_sse41 (this implementation)
- Small buffers (>72 bytes): chorba_small_nondestructive_sse2
- Tiny buffers: crc32_braid_internal

This segmentation allows for optimal handling of different buffer sizes.
52-57: ✅ Verification successful
Verify comment and indexing in READ4_WITHXOR macro.
The comment indicates this macro is "intentionally shifted one down", but make sure the array indexing starting from 1 (`in[1]`, `in[2]`, etc.) instead of 0 is intentional and matches the expected behavior when reading with XOR operations.
🏁 Script executed:
#!/bin/bash
# Search for other usages of similar indexing patterns
rg -B 3 -A 3 "_mm_xor_si128\(in\[\d\]" --glob "*.c"

Length of output: 783
Indexing in READ4_WITHXOR Macro Confirmed
The search output indicates that the only usage of the `_mm_xor_si128(in[<index>], ...)` pattern is in arch/x86/chorba_sse41.c. The file contains a comment explaining that the array is intentionally shifted by one to compensate for the deferred store from the last iteration. This confirms that the use of `in[1]`, `in[2]`, `in[3]`, and `in[4]` is deliberate and consistent with the intended behavior.

No changes are necessary.
test/test_compare256.cc (1)
82-84: Good coverage for AVX512.
This test definition cleanly parallels existing SSE2 and AVX2 variants. No issues found.
test/benchmarks/benchmark_compare256.cc (1)
83-85: Benchmark definition for AVX512 is consistent.
This mirrors the handling of other SIMD variants.
arch/x86/compare256_avx512.c (6)
1-5: Validate copyright year.
It's set to “2025.” Please confirm that this is accurate or if you want to adopt a more flexible approach.
7-11: Includes look consistent.
No concerns or redundant imports.
12-19: Confirm macro definitions.
Double-check that `X86_AVX512` and `HAVE_BUILTIN_CTZLL` are reliably set to prevent unexpected build failures.
19-80: Efficient unrolled comparison logic.
Splitting the comparison into a 16-byte chunk followed by 64-byte chunks (with overlap) is a well-reasoned approach for performance. The use of `_mm512_cmpeq_epu8_mask` and `__builtin_ctzll` to compute the mismatch offset appears correct.
82-85: Wrapper function is straightforward.
Clean pass-through to the static function, consistent with other architecture-specific compare256 implementations.
86-96: Template-based match generation.
Referencing `match_tpl.h` twice to produce both fast and slow path functions aligns well with existing patterns.
This is ~25-30% faster than the SSE2 variant on a core2 quad. The main reason for this has to do with the fact that, while incurring far fewer shifts, an entirely separate stack buffer has to be managed that is the size of the L1 cache on most CPUs. This was one of the main reasons the 32k specialized function was slower for the scalar counterpart, despite auto vectorizing. The auto vectorized loop was setting up the stack buffer at unaligned offsets, which is detrimental to performance pre-nehalem. Additionally, we were losing a fair bit of time to the zero initialization, which we are now doing more selectively. There are a ton of loads and stores happening, and for sure we are bound on the fill buffer + store forwarding. An SSE2 version of this code is probably possible by simply replacing the shifts with unpacks with zero and the palignr's with shufpd's. I'm just not sure it'll be all that worth it, though. We are gating against SSE4.1 not because we are using specifically a 4.1 instruction but because that marks when Wolfdale came out and palignr became a lot faster.
0039ea0 to 5890b4b Compare
Rebased
Actionable comments posted: 0
♻️ Duplicate comments (1)
arch/x86/chorba_sse41.c (1)
59-309: Consider decomposing large function for maintainability. The `crc32_chorba_32768_nondestructive_sse41` function is 250+ lines long with multiple complex sections. While the current implementation may be optimized for performance, consider decomposing it into smaller helper functions for better maintainability:
- Initialization function for buffer setup
- Main processing function for different iteration phases
- Tail handling function for remaining bytes
This would improve readability while allowing the compiler to still inline these functions when appropriate.
🧹 Nitpick comments (1)
arch/x86/chorba_sse41.c (1)
189-196: Consider compiler-friendly pointer access pattern. The code uses explicit casting and pointer arithmetic in multiple places. While functionally correct, consider a more compiler-friendly approach:
- in12 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in34 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in56 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
- in78 = _mm_xor_si128(_mm_load_si128(inptr++), _mm_load_si128(bitbuf0++));
+ in12 = _mm_xor_si128(inptr[0], bitbuf0[0]);
+ in34 = _mm_xor_si128(inptr[1], bitbuf0[1]);
+ in56 = _mm_xor_si128(inptr[2], bitbuf0[2]);
+ in78 = _mm_xor_si128(inptr[3], bitbuf0[3]);
+ inptr += 4;
+ bitbuf0 += 4;

This approach can help with register allocation and might prevent false positives in static analyzers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- CMakeLists.txt (4 hunks)
- arch/x86/Makefile.in (3 hunks)
- arch/x86/chorba_sse41.c (1 hunks)
- arch/x86/x86_features.c (1 hunks)
- arch/x86/x86_features.h (1 hunks)
- arch/x86/x86_functions.h (4 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- configure (4 hunks)
- functable.c (2 hunks)
- test/benchmarks/benchmark_crc32.cc (1 hunks)
- test/test_crc32.cc (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- arch/x86/Makefile.in
🚧 Files skipped from review as they are similar to previous changes (9)
- arch/x86/x86_features.c
- functable.c
- cmake/detect-intrinsics.cmake
- arch/x86/x86_functions.h
- test/test_crc32.cc
- arch/x86/x86_features.h
- test/benchmarks/benchmark_crc32.cc
- configure
- CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms (130)
- GitHub Check: Analyze (cpp)
- GitHub Check: Windows MSVC ARM64 No Test
- GitHub Check: macOS GCC UBSAN (ARM64)
- GitHub Check: macOS GCC UBSAN
- GitHub Check: macOS Clang (Target 10.10)
- GitHub Check: Windows MSVC 2019 v140 Win64
- GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
- GitHub Check: Ubuntu GCC MIPS64
- GitHub Check: Ubuntu GCC SSE4.2 UBSAN
- GitHub Check: Ubuntu Clang
- GitHub Check: Ubuntu Clang No Optim
- GitHub Check: Ubuntu GCC AARCH64
- GitHub Check: Ubuntu Clang No Threads
🔇 Additional comments (12)
arch/x86/chorba_sse41.c (12)
1-12: Properly gated code for SSE4.1 support. The code begins with appropriate conditional compilation directives to ensure it's only built when Chorba is enabled and SSE4.1 support is available. All necessary headers are included for SSE intrinsics, CRC32 functionality, and platform-specific optimizations.
28-34: REALIGN_CHORBA macro has correct parameter usage. The REALIGN_CHORBA macro implementation properly uses the parameter names (in0, in1, in2, in3) in the body of the macro, which was fixed from an earlier version where parameter names didn't match the usage.
71-93: Selective zero initialization is correctly implemented. The code optimizes performance by only zeroing out specific portions of the buffer that will be read later, rather than zeroing the entire buffer. This aligns with the PR objective of "selectively applying zero initialization to reduce overhead."
217-229: Good documentation for counter-intuitive optimization. The comment explaining why a seemingly redundant branch check is maintained for performance reasons is valuable. This kind of documentation helps prevent future optimizations that might inadvertently degrade performance.
238-254: Efficient tail zeroing with progressively finer granularity. The code efficiently zeroes out the tail portion by starting with large 64-byte chunks, then 16-byte chunks, and finally handling individual bytes. This approach minimizes the number of operations while ensuring complete zeroing.
296-302: XOR final values with accumulated state. The code correctly combines the accumulated state with the final values using XOR operations, which is essential for proper CRC calculation.
305-308: Final CRC calculation with table lookup. The code finalizes the CRC calculation by combining bytes from both the input and bitbuffer with a table-based approach, ensuring correct handling of the remaining bytes.
316-337: Dispatch strategy aligns with PR objectives. The dispatch logic selects the appropriate implementation based on input size and alignment:
- Uses the new SSE4.1 implementation for medium-sized inputs (8192-32768 bytes)
- Falls back to other optimized implementations for other input sizes
- Handles unaligned initial bytes with a fallback method
This strategy aligns with the PR objective of leveraging SSE4.1 instructions for better performance in specific cases.
328-328: SSE4.1 optimization targets appropriate buffer size range. The SSE4.1 implementation is specifically targeted at medium-sized buffers between 8KB and 32KB, which aligns with the PR objective of optimizing for L1 cache sizes on most CPUs. Benchmark results in the PR comments confirm a significant performance gain (~30%) over the SSE2 variant in this range.
329-329: Use CHORBA_SMALL_THRESHOLD_64BIT macro instead of literal value. Line 329 correctly uses the macro `CHORBA_SMALL_THRESHOLD_64BIT` for consistency with the rest of the codebase. This fixed a previous version that used a magic number (72).
317-317: Pointer alignment calculation uses uintptr_t. The code correctly uses `uintptr_t` for pointer alignment arithmetic to ensure portability across different platforms where the size of `unsigned long` might differ from pointer width.
59-62: ❓ Verification inconclusive
Consider stack size implications of large stack buffer.
The function allocates a 32KB aligned buffer on the stack. While this is intentionally sized to match L1 cache on most CPUs (as mentioned in PR objectives), be aware that this large stack allocation could cause issues on platforms with limited stack space.
🏁 Script executed:
#!/bin/bash
# Check if there are any stack size limitations set in the build system
rg -A 2 "stack-limit|stack-size|stacksize" --include="*.{cmake,mk,mak,in}"

Length of output: 518
Action Required: Manually Verify Stack Allocation Impact
In the function defined in arch/x86/chorba_sse41.c (lines 59–62), a 32KB aligned stack buffer is intentionally allocated to match the L1 cache size on most platforms. However, no build configuration settings (e.g., stack size limits) were automatically detected by our initial search. Since the command to check for such settings returned an error and provided no insight, please manually verify that this large stack allocation is acceptable on all target platforms, especially those with limited stack space.
- Location: arch/x86/chorba_sse41.c, lines 59–62.
- Recommendation: Review your build configurations and platform documentation for any potential stack limitations. Ensure that the intentional choice of a 32KB allocation does not risk stack overflow or other issues on platforms with lower default stack sizes.