这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@KungFuJesus
Copy link
Contributor

@KungFuJesus KungFuJesus commented Feb 12, 2025

So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix #1824.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Enhanced checksum processing on ARM platforms leveraging advanced SIMD support with an optional data copying mode.
    • Introduced a new inline macro for improved function inlining across different compilers.
  • Tests

    • Updated performance benchmarks to evaluate the enhanced checksum routine, ensuring robust performance across supported architectures.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 12, 2025

Walkthrough

The changes update the NEON-optimized Adler32 implementation to optionally perform data copying during the accumulation process. A new inline function adler32_fold_copy_impl is introduced to handle both checksum computation and, when requested, copying data from the source to the destination. Additionally, a new wrapper function adler32_fold_copy_neon is added. Related declarations, a macro definition in the ARM header, a function table pointer update, and benchmark modifications ensure that the new functionality is properly integrated and benchmarked.

Changes

File(s) Change Summary
arch/arm/adler32_neon.c Added new inline function adler32_fold_copy_impl with an extra destination buffer and COPY flag parameter; replaced adler32_neon to call adler32_fold_copy_impl with COPY=0; added wrapper function adler32_fold_copy_neon that calls the implementation with COPY=1.
arch/arm/arm_functions.h Introduced declaration for adler32_fold_copy_neon and defined a new macro native_adler32_fold_copy mapping to it.
functable.c Added function pointer assignment ft.adler32_fold_copy = &adler32_fold_copy_neon; within the ARM-specific init_functable function.
test/benchmarks/benchmark_adler32_copy.cc Updated benchmark definition for NEON to use adler32_fold_copy_neon instead of adler32_neon; removed commented-out benchmark lines for other architectures.
zbuild.h Added new macro Z_FORCEINLINE for forced inline decoration based on compiler detection.
arch/arm/neon_intrins.h Modified conditional compilation directives for alignment requirements, broadening checks for both Android and iOS platforms.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Wrapper
    participant Implementation

    %% For non-copy flow (original behavior)
    Caller->>Wrapper: Call adler32_neon (adler, buf, len)
    Wrapper->>Implementation: Call adler32_fold_copy_impl with COPY=0
    Implementation-->>Wrapper: Return checksum
    Wrapper-->>Caller: Return checksum

    %% For copy flow (new behavior)
    Caller->>Wrapper: Call adler32_fold_copy_neon (adler, dst, src, len)
    Wrapper->>Implementation: Call adler32_fold_copy_impl with COPY=1
    Implementation-->>Wrapper: Return checksum
    Wrapper-->>Caller: Return checksum
Loading

Suggested labels

optimization, bug, Architecture

Suggested reviewers

  • Dead2
  • nmoinvaz
✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@neheb
Copy link

neheb commented Feb 12, 2025

Nope

[4/41] Compiling C object libneon.a.p/arch_arm_slide_hash_neon.c.o
FAILED: libneon.a.p/arch_arm_slide_hash_neon.c.o 
clang -Ilibneon.a.p -I. -I.. -fcolor-diagnostics -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O2 -g -fPIC -D_LARGEFILE64_SOURCE=1 -D__USE_LARGEFILE64 -DHAVE_LINUX_AUXVEC_H -DHAVE_SYS_AUXV_H -DHAVE_ARM_ACLE_H -DHAVE_BUILTIN_ASSUME_ALIGNED -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DHAVE_ALIGNED_ALLOC -DHAVE_POSIX_MEMALIGN -DHAVE_ATTRIBUTE_ALIGNED -DHAVE_VISIBILITY_HIDDEN -DHAVE_VISIBILITY_INTERNAL -DWITH_GZFILEOP -DARM_FEATURES -DARM_AUXV_HAS_CRC32 -DARM_ASM_HWCAP -DARM_AUXV_HAS_NEON -DARM_ACLE -DARM_NEON -DARM_NEON_HASLD4 -mfpu=neon -MD -MQ libneon.a.p/arch_arm_slide_hash_neon.c.o -MF libneon.a.p/arch_arm_slide_hash_neon.c.o.d -o libneon.a.p/arch_arm_slide_hash_neon.c.o -c ../arch/arm/slide_hash_neon.c
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: clang -Ilibneon.a.p -I. -I.. -fcolor-diagnostics -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O2 -g -fPIC -D_LARGEFILE64_SOURCE=1 -D__USE_LARGEFILE64 -DHAVE_LINUX_AUXVEC_H -DHAVE_SYS_AUXV_H -DHAVE_ARM_ACLE_H -DHAVE_BUILTIN_ASSUME_ALIGNED -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DHAVE_ALIGNED_ALLOC -DHAVE_POSIX_MEMALIGN -DHAVE_ATTRIBUTE_ALIGNED -DHAVE_VISIBILITY_HIDDEN -DHAVE_VISIBILITY_INTERNAL -DWITH_GZFILEOP -DARM_FEATURES -DARM_AUXV_HAS_CRC32 -DARM_ASM_HWCAP -DARM_AUXV_HAS_NEON -DARM_ACLE -DARM_NEON -DARM_NEON_HASLD4 -mfpu=neon -MD -MQ libneon.a.p/arch_arm_slide_hash_neon.c.o -MF libneon.a.p/arch_arm_slide_hash_neon.c.o.d -o libneon.a.p/arch_arm_slide_hash_neon.c.o -c ../arch/arm/slide_hash_neon.c 
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module '../arch/arm/slide_hash_neon.c'.
4.	Running pass 'ARM Instruction Selection' on function '@slide_hash_neon'
clang: error: clang frontend command failed due to signal (use -v to see invocation)


[1/37] Compiling C object libneon.a.p/arch_arm_adler32_neon.c.o
FAILED: libneon.a.p/arch_arm_adler32_neon.c.o 
clang -Ilibneon.a.p -I. -I.. -fcolor-diagnostics -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O2 -g -fPIC -D_LARGEFILE64_SOURCE=1 -D__USE_LARGEFILE64 -DHAVE_LINUX_AUXVEC_H -DHAVE_SYS_AUXV_H -DHAVE_ARM_ACLE_H -DHAVE_BUILTIN_ASSUME_ALIGNED -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DHAVE_ALIGNED_ALLOC -DHAVE_POSIX_MEMALIGN -DHAVE_ATTRIBUTE_ALIGNED -DHAVE_VISIBILITY_HIDDEN -DHAVE_VISIBILITY_INTERNAL -DWITH_GZFILEOP -DARM_FEATURES -DARM_AUXV_HAS_CRC32 -DARM_ASM_HWCAP -DARM_AUXV_HAS_NEON -DARM_ACLE -DARM_NEON -DARM_NEON_HASLD4 -mfpu=neon -MD -MQ libneon.a.p/arch_arm_adler32_neon.c.o -MF libneon.a.p/arch_arm_adler32_neon.c.o.d -o libneon.a.p/arch_arm_adler32_neon.c.o -c ../arch/arm/adler32_neon.c
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: clang -Ilibneon.a.p -I. -I.. -fcolor-diagnostics -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c11 -O2 -g -fPIC -D_LARGEFILE64_SOURCE=1 -D__USE_LARGEFILE64 -DHAVE_LINUX_AUXVEC_H -DHAVE_SYS_AUXV_H -DHAVE_ARM_ACLE_H -DHAVE_BUILTIN_ASSUME_ALIGNED -DHAVE_BUILTIN_CTZ -DHAVE_BUILTIN_CTZLL -DHAVE_ALIGNED_ALLOC -DHAVE_POSIX_MEMALIGN -DHAVE_ATTRIBUTE_ALIGNED -DHAVE_VISIBILITY_HIDDEN -DHAVE_VISIBILITY_INTERNAL -DWITH_GZFILEOP -DARM_FEATURES -DARM_AUXV_HAS_CRC32 -DARM_ASM_HWCAP -DARM_AUXV_HAS_NEON -DARM_ACLE -DARM_NEON -DARM_NEON_HASLD4 -mfpu=neon -MD -MQ libneon.a.p/arch_arm_adler32_neon.c.o -MF libneon.a.p/arch_arm_adler32_neon.c.o.d -o libneon.a.p/arch_arm_adler32_neon.c.o -c ../arch/arm/adler32_neon.c 
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module '../arch/arm/adler32_neon.c'.
4.	Running pass 'ARM Instruction Selection' on function '@adler32_neon'
clang: error: clang frontend command failed due to signal (use -v to see invocation)
Debian clang version 11.0.1-2
Target: armv7l-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
clang: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/adler32_neon-19e03b.c
clang: note: diagnostic msg: /tmp/adler32_neon-19e03b.sh
clang: note: diagnostic msg: 

@KungFuJesus
Copy link
Contributor Author

KungFuJesus commented Feb 12, 2025

Hmm, the slide hash function should be pointed at the buffer window, which should be properly aligned. Do you have more error context for that one?

As for the adler, the only thing we're using the multi-register variant with ld1 with is something that we control the alignment of on the stack. Everything else is using plain old ld1 (really 2x ldp's, as that's usually what both clang and GCC compile to):

 ldp    q4, q3, [x2]
 ldp    q2, q1, [x2, #32]

Of course this is aarch64, not necessarily armv7.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
arch/arm/adler32_neon.c (4)

13-13: Clarify the COPY parameter usage in the function signature.
Declaring static inline void NEON_accum32(uint32_t *s, uint8_t *dst, const uint8_t *buf, size_t len, const int COPY) introduces a conditional copy mechanism. It would be helpful to add a brief comment explaining that len here refers to the number of 16-byte blocks (not total bytes), and that COPY toggles storing to dst.


56-56: Remove or justify the commented-out vst1q_u8_x4 call.
Line 56 has //vst1q_u8_x4(dst, d0_d3); commented out. If this code path is obsolete, removing it would clean up the function. Otherwise, add a comment explaining why it remains commented for potential future use.


168-168: Remove trailing whitespace.
The pipeline warns about trailing whitespace on line 168. Removing it helps maintain a clean codebase.

Apply the following fix:

-           *dst = *src; 
+           *dst = *src;
🧰 Tools
🪛 GitHub Actions: Lint

[warning] 168-168: Trailing whitespace found.


230-230: Clarify block-based length usage.
NEON_accum32(pair, dst + done, src + done, n >> 4, COPY); is called with n >> 4, where n is in bytes. The function then treats this as the number of 16-byte blocks. This is correct but can be confusing. Consider documenting that len inside NEON_accum32 is interpreted as “count of 16-byte blocks.”

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 860e4cf and ebdb4b9.

📒 Files selected for processing (4)
  • arch/arm/adler32_neon.c (9 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint
arch/arm/adler32_neon.c

[warning] 168-168: Trailing whitespace found.

⏰ Context from checks skipped due to timeout of 90000ms (60)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
🔇 Additional comments (6)
arch/arm/adler32_neon.c (3)

51-58: Confirm sufficient buffer space before writing to dst.
When COPY is enabled, the code writes 64 bytes (d0, d1, d2, d3) to dst without explicit bounds checks. Although the caller logic appears to ensure there's enough space, consider documenting or asserting that dst can indeed accommodate these writes, to prevent out-of-bounds writes in case of misuse.


79-86: Ensure no overflow risks in wide accumulations.
Although the internal comment states that the maximum values won’t exceed uint16 limits, ensure that any changes to the iteration bounds (e.g., if NMAX or the data size changes in the future) do not inadvertently cause overflow in these partial sums. Maintaining a test that specifically checks large inputs might help.


256-262: Wrapper functions for copy vs non-copy look appropriate.
The adler32_neon and adler32_fold_copy_neon wrappers calling adler32_fold_copy_impl with COPY=0 and COPY=1, respectively, cleanly separate the two pathways. This is an elegant solution that avoids code duplication. No issues found.

arch/arm/arm_functions.h (1)

10-10: Validate new function declaration and macro definition.
Introducing uint32_t adler32_fold_copy_neon(...) and mapping native_adler32_fold_copy to it (line 42) is consistent with how other NEON-based definitions are set. Ensure all references expect the same parameter types and order.

Also applies to: 42-42

test/benchmarks/benchmark_adler32_copy.cc (1)

95-96: Benchmarking the new folded-copy function.
Switching from adler32_neon to adler32_fold_copy_neon for benchmarking is consistent with the new copy-aware Adler32 logic. This should accurately reflect performance changes introduced by optional copying.

functable.c (1)

175-175: Verify the NEON implementation's interface compatibility.

The integration of adler32_fold_copy_neon into the function table looks correct. However, given the reported compilation issues with ARM architecture files, let's verify the function declaration and interface compatibility.

Run the following script to check the function declaration:

If the function declaration matches the interface of other implementations and follows the expected signature uint32_t (*adler32_fold_copy)(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len), then the integration is correct.

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Verify the function declaration matches the expected interface

# Check the function declaration in the header file
rg -A 1 "adler32_fold_copy_neon" arch/arm/

# Check if the function signature matches other implementations
rg -A 1 "adler32_fold_copy_\w+" | grep -v "neon"

Length of output: 13737


NEON Implementation Interface Verified

The function declaration and definition of adler32_fold_copy_neon in arch/arm/arm_functions.h and arch/arm/adler32_neon.c match the expected signature, consistent with the other implementations in the codebase. The assignment in functable.c is correct and well-integrated.

  • Confirmed function signature: uint32_t adler32_fold_copy_neon(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len)
  • Signature matches the interface used by generic and alternative implementations

@neheb
Copy link

neheb commented Feb 12, 2025

@KungFuJesus nope. This is Debian 11 with clang-11.

Interestingly enough, if I use an Alpine Linux container with distrobox, it compiles and it works fine after this PR. Before this PR, example.c would crash when ran.

@KungFuJesus
Copy link
Contributor Author

Ok so the failing to compile is probably a separate issue. There's probably something weird going on with your build toolchain.

@neheb
Copy link

neheb commented Feb 12, 2025

huh? it's stock Debian. clang was installed with sudo apt install clang.

@KungFuJesus
Copy link
Contributor Author

Let me try to reproduce with my potato pi on a similar vintage of debian:

pi@pitemp:/mnt/nasshare/storage/armv7-zlib-ng $ lsb_release  -a
No LSB modules are available.
Distributor ID:	Raspbian
Description:	Raspbian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye
pi@pitemp:/mnt/nasshare/storage/armv7-zlib-ng $ clang --version
Raspbian clang version 11.0.1-2+rpi1
Target: armv6k-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin

The armv6 abi but I'll give it a go.

@KungFuJesus
Copy link
Contributor Author

Hmm, this might be too potato, it's just compiling functions with the fallback v6 asm:

[ 23%] Building C object CMakeFiles/zlibstatic.dir/arch/arm/arm_features.c.o
[ 25%] Building C object CMakeFiles/zlibstatic.dir/arch/arm/crc32_acle.c.o
[ 25%] Building C object CMakeFiles/zlibstatic.dir/arch/arm/slide_hash_armv6.c.o

I don't really have any aarch32 installs to test this on, everything I have that's aarch64 capable is running on aarch64. I might have a second gen pi laying around somewhere but I'd have to track it down. Suffice to say though if the compiler is crashing and spitting a stack trace out it's probably a compiler bug. Either that or your clang binaries are somehow corrupt.

@neheb
Copy link

neheb commented Feb 12, 2025

Correct. That stack trace is indeed an internal LLVM bug.

@codecov
Copy link

codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.57%. Comparing base (b33ba96) to head (b50216e).
Report is 11 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1870      +/-   ##
===========================================
+ Coverage    82.37%   82.57%   +0.19%     
===========================================
  Files          141      141              
  Lines        12558    12672     +114     
  Branches      2894     2909      +15     
===========================================
+ Hits         10345    10464     +119     
- Misses        1239     1240       +1     
+ Partials       974      968       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)

56-56: Consider using vst1q_u8_x4 for optimized copying.

The commented out vst1q_u8_x4 instruction could potentially offer better performance than individual vst1q_u8 calls.

Let's verify if the compiler supports this instruction:

❌ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Description: Check if vst1q_u8_x4 is supported by the compiler
echo '#include <arm_neon.h>' > test.c
echo 'void test() { uint8x16x4_t data; uint8_t* dst; vst1q_u8_x4(dst, data); }' >> test.c
cc -c test.c -march=armv8-a+simd 2>&1 || true

Length of output: 247


I'll run a new check using clang instead of cc to verify whether the vst1q_u8_x4 intrinsic is supported by the available toolchain:

Once this script executes, we can determine if the NEON intrinsic vst1q_u8_x4 is supported. If the compilation succeeds (i.e. no intrinsic-related errors are emitted), then using vst1q_u8_x4 for optimized copying should be considered. Otherwise, if the compiler errors out due to lack of support, then leaving the code as-is is appropriate.

Please review the result of this script and verify that your build environment includes a toolchain (with clang or an equivalent compiler) that supports ARM NEON intrinsics.


🏁 Script executed:

[run_scripts]
#!/bin/bash
# Check if clang is available and compile test using clang for vst1q_u8_x4 support
which clang || echo "clang not installed"

echo '#include <arm_neon.h>' > test_clang.c
echo 'void test() { uint8x16x4_t data; uint8_t* dst; vst1q_u8_x4(dst, data); }' >> test_clang.c
clang -c test_clang.c -march=armv8-a+simd 2>&1 || true

Length of output: 540


Action Needed: Confirm Compiler Support for vst1q_u8_x4

The script couldn’t verify support due to the absence of a suitable compiler (clang) in the testing environment. Please ensure that your build toolchain supports the vst1q_u8_x4 intrinsic—either by testing compilation locally or reviewing your toolchain documentation. If supported, consider enabling it for optimized copying; otherwise, the current implementation remains acceptable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebdb4b9 and 4f9c6cc.

📒 Files selected for processing (4)
  • arch/arm/adler32_neon.c (9 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/benchmarks/benchmark_adler32_copy.cc
  • functable.c
  • arch/arm/arm_functions.h
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Ubuntu 20.04 Clang 6
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
🔇 Additional comments (4)
arch/arm/adler32_neon.c (4)

13-150: LGTM! The NEON accumulation function is well-implemented.

The function correctly handles both checksum computation and optional data copying using NEON intrinsics. The implementation maintains good performance characteristics by:

  1. Using aligned NEON loads and stores
  2. Efficiently handling data in 64-byte chunks
  3. Properly managing overflow conditions

205-206: Good use of restrict pointers for optimizing the copy loop.

The __restrict qualifier helps the compiler generate more efficient code by guaranteeing that the source and destination pointers don't alias.


160-254: LGTM! The implementation is robust and handles edge cases well.

The function:

  1. Correctly handles special cases (len == 1, len < 16)
  2. Maintains alignment requirements for optimal NEON operations
  3. Properly manages the checksum computation with optional copying

256-262: LGTM! Clean API design with good separation of concerns.

The public API functions provide a clear interface while delegating to the shared implementation, making the code maintainable and efficient.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)

56-56: Remove commented out code.

The commented out vst1q_u8_x4 instruction should be removed as it's not being used.

-            //vst1q_u8_x4(dst, d0_d3);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f9c6cc and 04988a1.

📒 Files selected for processing (4)
  • arch/arm/adler32_neon.c (9 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • arch/arm/arm_functions.h
  • test/benchmarks/benchmark_adler32_copy.cc
  • functable.c
⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)

13-150: LGTM! Well-optimized NEON implementation.

The function efficiently combines checksum calculation with conditional copying, maintaining optimal performance through:

  • Proper use of NEON intrinsics for vectorized operations
  • Minimal branching in the main loop
  • Efficient memory access patterns

160-254: LGTM! Well-structured implementation with proper optimizations.

The function effectively:

  • Handles all edge cases
  • Uses restrict pointers for better compiler optimization
  • Maintains proper memory alignment for SIMD operations

256-262: LGTM! Clean wrapper implementations.

The wrapper functions provide a clear interface to the core implementation with appropriate parameter passing.

@KungFuJesus KungFuJesus force-pushed the fold_adler_neon_copy branch 3 times, most recently from 3f4e99f to da961da Compare February 12, 2025 03:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)

13-14: Consider using vst1q_u8_x4 for more efficient copying.

The commented out vst1q_u8_x4 instruction could potentially be more efficient than four separate vst1q_u8 calls. Consider benchmarking both approaches.

Also applies to: 52-59

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04988a1 and 3f4e99f.

📒 Files selected for processing (5)
  • arch/arm/adler32_neon.c (9 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
  • zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/benchmarks/benchmark_adler32_copy.cc
  • functable.c
  • arch/arm/arm_functions.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (44)
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: macOS GCC
  • GitHub Check: macOS GCC Symbol Prefix & Compat
🔇 Additional comments (5)
arch/arm/adler32_neon.c (4)

161-184: Well-structured implementation with optimized edge cases.

The implementation efficiently handles different length cases while maintaining the original adler32 logic. The use of static inline with Z_INLINE and special handling for small lengths helps optimize performance.


206-214: Effective use of restrict pointers for copy optimization.

The use of __restrict qualifiers helps the compiler optimize the copy operation by guaranteeing non-overlapping memory regions.


258-260: Clean wrapper implementation maintaining backward compatibility.

The function correctly delegates to adler32_fold_copy_impl with appropriate parameters.


262-264: Efficient implementation with implicit validation.

Using dst != NULL as the COPY flag is an elegant way to handle validation implicitly.

zbuild.h (1)

73-78: Well-implemented cross-compiler forced inline support.

The Z_INLINE macro correctly handles both MSVC (__forceinline) and GCC (__attribute__((always_inline))) compilers, which is essential for the performance optimizations in the adler32 implementation.

#else
#define Z_INLINE
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking):
Wrapping this in a test for WITHOUT_FORCEINLINE or NO_FORCEINLINE could be helpful for testing and debugging, and maybe also for the few ones that want to minimize library size for use in embedded and such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how that might be useful if we come to use it in many places. In this circumstance, there's a lot of stack churn going into and out of NEON_accum32, it's quite a bit slower without inlining. Maybe we can do that in a later PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of allowing to disable forced inlining, we should consider using x4 versions only for non-copy version... This would mean we would need to duplicate the function instead of using a parameter to decide if we need to copy... On a lot of known platforms where x4 versions don't work, we already convert them to equivalent 4 x1 calls... For the remaining, we need to fix the issue anyways instead of masking it.

Easiest way to avoid inlining is to build debug version as it will disable all inlining.

Having two versions of the function also makes it easier to test for regressions as there is always "spare" version to compare against especially when the plain C version is not included for reducing binary size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are avoiding that due to the 64 byte alignment requirement.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 64-byte alignment requirement seems to be dictated by limitation of specific ABI, not generic restriction... On default ABI there is no such restriction and using x1 version should be somewhat slower and rely too much on compiler's ability to reorder code to avoid unnecessary pipeline stalls and stack spills.

As mentioned elsewhere, the other non-ABI related issues with x4 intrinsics were caused by using incompatible combination of compiler command-line options. As such, fix for them warrants own PR.

Copy link
Contributor Author

@KungFuJesus KungFuJesus Feb 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the x4 load improvement is quite minor to begin with but if we're going to have 2 separate function paths anyway, I don't see much real harm. I do find it funny that coderabbit tried to suggest adding it back, claiming regurgitated benefits from whatever lore it may have read from stack overflow, but then failed to do it because its copy of clang to verify that claims didn't support it. I've added it back and injected what I hope is a good workaround for detecting iOS. That should be all the more that's needed.

We could instead enforce 32 byte alignment in all scenarios but then we're going to have 1-31 byte checksums done a byte at a time which is less than ideal. The alignment benefits everywhere are touted to be transparent and automatically realized by many ARM SIPs, yet the x2,3,&4 variants of this instruction still support encoding an alignment hint into the instruction so...clearly something wants that. In any case clang and GCC will not ever put anything there when targeting Linux, even with an alignment suggestion (__builtin_assume_aligned).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My knowledge of ARM architecture is pretty much limited to compilers for embedded programming and that implies utilizing limited set of instructions... As I've mentioned before, sometimes we need to have to find middle ground between using as many registers (in a single instruction or code block) as possible and avoiding stalls. This pretty much means we should in such cases benchmark 1x, 2x and 4x variations, due to us deducing that 3x variants cause more trouble than have benefits. Compilers will almost always try to reorder the code if there is chance of performance penalty and the optimisation level prefers speed over size.

What works for 32-bit ARM might not be the best option for 64-bit ARM and that was one of the reasons I didn't back then want to unify the code even though a lot of code was shared between 32-bit and 64-bit versions.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)

56-56: Consider using vst1q_u8_x4 for better performance.

The commented out vst1q_u8_x4 instruction suggests a potential optimization. This instruction could potentially improve performance by reducing the number of store operations.

✅ Verification successful

🌐 Web query:

What are the performance benefits of using vst1q_u8_x4 over individual vst1q_u8 calls in ARM NEON?

💡 Result:

Using vst1q_u8_x4 (which stores four 128-bit NEON registers consecutively) instead of four separate vst1q_u8 calls provides several performance advantages on ARM NEON:

  1. Reduced Instruction Overhead
    A single vst1q_u8_x4 instruction replaces four vst1q_u8 calls, minimizing loop iterations and branching. This reduces pipeline stalls and improves instruction throughput[1][5].

  2. Memory Access Efficiency
    Combining stores into a single operation improves cache utilization and memory bandwidth efficiency by consolidating scattered writes into contiguous blocks[3][6]. This is critical for data-heavy workloads like image/video processing.

  3. Latency Hiding
    Bulk stores allow better instruction scheduling to hide memory latency. For example:

    // Individual stores (inefficient)
    vst1q_u8(ptr, data0);
    vst1q_u8(ptr+16, data1);
    vst1q_u8(ptr+32, data2);
    vst1q_u8(ptr+48, data3);
    
    // Combined store (optimized)
    vst1q_u8_x4(ptr, {data0, data1, data2, data3});

    The combined version reduces dependencies between store operations[1][5].

  4. Alignment Benefits
    Bulk stores are more likely to align with cache line boundaries (typically 64 bytes), avoiding partial-line writes that incur extra cycles[3][6].

Approach Instructions Cache Utilization Latency Impact
4x vst1q_u8 calls 4 Moderate High
vst1q_u8_x4 1 Optimal Low

While benchmarks vary by hardware[10], tests on Cortex-A9 show similar optimizations improving memory-bound workloads by ~30%[1]. For ARMv8+ CPUs with wider pipelines (e.g., Grace CPU), bulk stores better utilize memory subsystems designed for large contiguous accesses[2][7].

Citations:


Optimize NEON Data Stores with vst1q_u8_x4
The web query confirms that using vst1q_u8_x4 instead of multiple vst1q_u8 calls can reduce instruction overhead, improve memory access efficiency (thanks to better cache and alignment behavior), and lower latency through bulk store operations. The commented out code on line 56 indicates that this optimization was considered. If performance profiling in your target workload confirms these benefits without adverse effects, you might consider un-commenting (or integrating) the use of vst1q_u8_x4.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da961da and 23ee5b6.

📒 Files selected for processing (5)
  • arch/arm/adler32_neon.c (9 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
  • zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • functable.c
  • arch/arm/arm_functions.h
  • test/benchmarks/benchmark_adler32_copy.cc
  • zbuild.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (96)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Symbol Prefix
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC (ARM64)
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)

13-150: LGTM! Efficient NEON implementation with optional copy functionality.

The NEON implementation correctly handles both checksum calculation and data copying, with proper alignment and efficient use of SIMD instructions.


204-212: Excellent use of restrict pointers for copy optimization.

The use of __restrict qualifiers helps the compiler generate more efficient code by guaranteeing that the source and destination pointers don't alias.


260-262: LGTM! Smart implementation of the copy flag.

Using dst != NULL as the COPY flag is an elegant way to handle the copy functionality while maintaining backward compatibility.

@KungFuJesus
Copy link
Contributor Author

@neheb could you check to see if my iOS check fixes your issue? I don't have a cross compilation setup to test this sort of thing, it'd be good to know.

I had half a mind to force 32 byte alignment unconditionally for the non-copying variant of the checksum. Then we actually could actually not use the fallback 4x loads. But this would probably make a lot of small checksums really slow.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/arm/neon_intrins.h (1)

171-171: Consider removing the alignment assumption.

The __builtin_assume_aligned hint might be unnecessary since the alignment is already handled by the caller.

-        uint8x16x4_t d0_d3 = vld1q_u8_x4(__builtin_assume_aligned(buf, 32));
+        uint8x16x4_t d0_d3 = vld1q_u8_x4(buf);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23ee5b6 and d706600.

📒 Files selected for processing (6)
  • arch/arm/adler32_neon.c (7 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • arch/arm/neon_intrins.h (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
  • zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/benchmarks/benchmark_adler32_copy.cc
  • arch/arm/arm_functions.h
  • functable.c
  • zbuild.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (166)
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
  • GitHub Check: macOS Clang ASAN (ARM64)
  • GitHub Check: macOS Clang ASAN
  • GitHub Check: Windows GCC Compat No Opt
  • GitHub Check: Windows ClangCl Win64 Native Instructions (AVX)
  • GitHub Check: Windows ClangCl Win64
  • GitHub Check: Windows ClangCl Win32
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: Ubuntu MinGW i686
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC SPARC64
  • GitHub Check: Ubuntu GCC PPC64LE
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No NEON UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 No ARMv8 UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu GCC PPC
  • GitHub Check: Ubuntu GCC MIPS64
🔇 Additional comments (11)
arch/arm/neon_intrins.h (6)

28-30: LGTM! iOS platform detection is correctly implemented.

The macro definition for iOS platform detection is well-implemented using standard platform-specific macros.


32-34: LGTM! Alignment requirement check is correctly expanded.

The condition now properly handles both Android and iOS platforms, with an accurate comment explaining the strict alignment requirements for x4 NEON intrinsics.


13-21: LGTM! Taps array is correctly aligned for NEON operations.

The array is properly aligned to 64 bytes using ALIGNED_ macro, which is essential for optimal NEON vector operations.


23-146: LGTM! NEON_accum32_copy implementation is well-optimized.

The function efficiently combines checksum computation with data copying:

  • Uses NEON intrinsics for vectorized operations
  • Properly handles alignment and buffer copying
  • Maintains the same performance characteristics as the non-copying version

269-291: LGTM! Short length handling is properly implemented.

The function correctly handles edge cases:

  • Single byte case with conditional copying
  • Short lengths (< 16 bytes) with appropriate helper functions

379-385: LGTM! Public API functions are well-designed.

The functions provide a clean interface:

  • adler32_neon: Maintains backward compatibility
  • adler32_fold_copy_neon: New function with copy support
  • Both use the shared implementation with appropriate flags
arch/arm/adler32_neon.c (5)

13-21: LGTM! Well-structured coefficient array.

The taps array is properly aligned for SIMD operations and contains the correct coefficient values.


23-146: Excellent optimization for combined copy and checksum calculation!

The implementation efficiently interleaves load, store, and accumulation operations for better instruction-level parallelism. The decision to avoid vst1q_u8_x4 in favor of individual stores is well-justified for platforms where stores can block instruction-level parallelism.


269-377: Well-structured implementation with thorough edge case handling!

The implementation:

  • Efficiently handles alignment requirements
  • Uses restrict pointers for better optimization
  • Includes detailed comments about platform-specific considerations
  • Properly handles all edge cases (single byte, short lengths, tails)

379-385: Clean and efficient API design!

The API functions provide a simple interface while maintaining efficiency:

  • adler32_neon: Maintains backward compatibility
  • adler32_fold_copy_neon: Elegantly handles the copy case by checking if dst is NULL

171-171:

❓ Verification inconclusive

Consider platform-specific alignment requirements.

The use of __builtin_assume_aligned(buf, 32) assumes 32-byte alignment. While this works well for most platforms, some ARM platforms might benefit from 64-byte alignment for cache line optimization.


🏁 Script executed:

#!/bin/bash
# Check if there are any alignment-related performance notes in the codebase
rg -i "align|cacheline" --type c

Length of output: 17472


Action: Revisit Buffer Alignment for Optimal Cache Performance

The current use of __builtin_assume_aligned(buf, 32) hardcodes a 32-byte alignment assumption. While 32-byte alignment is generally sufficient for NEON loads, note that several parts of the codebase (e.g., the taps array defined with ALIGNED_(64)) demonstrate a preference for 64-byte alignment on some ARM targets. Please verify that the buffer’s allocation meets the intended alignment requirements. If the target platform benefits from a 64-byte cache line (or stricter alignment, as seen in related modules), consider either changing the alignment parameter (or using an appropriate macro like HINT_ALIGNED_64) or ensuring that the buffer allocation consistently provides the necessary alignment to avoid potential performance hits or misaligned memory accesses.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
arch/arm/adler32_neon.c (2)

51-55: Consider documenting the ILP optimization choice.

The code uses individual vst1q_u8 operations instead of vst1q_u8_x4 to avoid instruction-level parallelism (ILP) blocking. Consider adding a comment explaining this optimization decision, similar to the alignment explanation in the code.


324-331: Consider using memcpy for small copies.

For small copies in the alignment adjustment phase, consider using memcpy instead of a manual loop. Modern compilers can optimize memcpy calls effectively.

-            while (cpy_len--) {
-                *dst_noalias++ = *src_noalias++;
-            }
+            memcpy(dst_noalias, src_noalias, cpy_len);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d706600 and cc91055.

📒 Files selected for processing (6)
  • arch/arm/adler32_neon.c (7 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • arch/arm/neon_intrins.h (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
  • zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • arch/arm/neon_intrins.h
  • functable.c
  • arch/arm/arm_functions.h
  • zbuild.h
  • test/benchmarks/benchmark_adler32_copy.cc
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (57)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
  • GitHub Check: macOS Clang Native Instructions (ARM64)
  • GitHub Check: Windows MSVC 2019 v140 Win64
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: macOS GCC
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: macOS GCC Symbol Prefix (ARM64)
🔇 Additional comments (3)
arch/arm/adler32_neon.c (3)

13-21: LGTM! Well-structured coefficient array.

The taps array is properly aligned for optimal NEON vector operations and contains the correct coefficients for Adler32 calculation.


379-385: LGTM! Clean and efficient public interface.

The implementation provides a clean interface with efficient NULL handling for the copy operation. The reuse of the implementation function with a boolean flag is elegant.


306-316: Excellent documentation of platform-specific optimizations.

The detailed comments about platform-specific alignment requirements and optimization choices are valuable for maintainability. This level of documentation helps future developers understand the performance implications of the implementation.

@neheb
Copy link

neheb commented Feb 17, 2025

neheb

as is, this PR fixes runtime on Alpine Linux + clang.

So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
arch/arm/adler32_neon.c (1)

268-290: Consider adding early-exit optimization for zero-length input.

While the implementation is solid, consider adding an early return for len == 0 before the len == 1 check to avoid unnecessary processing.

 static Z_FORCEINLINE uint32_t adler32_fold_copy_impl(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len, const int COPY) {
+    if (len == 0)
+        return adler;
+
     /* split Adler-32 into component sums */
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc91055 and b50216e.

📒 Files selected for processing (6)
  • arch/arm/adler32_neon.c (6 hunks)
  • arch/arm/arm_functions.h (2 hunks)
  • arch/arm/neon_intrins.h (1 hunks)
  • functable.c (1 hunks)
  • test/benchmarks/benchmark_adler32_copy.cc (1 hunks)
  • zbuild.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • test/benchmarks/benchmark_adler32_copy.cc
  • arch/arm/arm_functions.h
  • zbuild.h
  • functable.c
  • arch/arm/neon_intrins.h
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
arch/arm/adler32_neon.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:51:23.250Z
Learning: Internal functions in performance-critical paths don't require the same level of defensive programming (like NULL checks) as public APIs, especially when the usage is well-controlled and the parameters are validated by the caller.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1870
File: arch/arm/adler32_neon.c:173-174
Timestamp: 2025-02-12T02:53:59.567Z
Learning: In performance-critical code paths, parameter validation (like NULL checks) should be done at the API boundary rather than in internal implementations to maintain both safety and performance.
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
  • GitHub Check: EL9 Clang S390X DFLTCC MSAN
🔇 Additional comments (4)
arch/arm/adler32_neon.c (4)

13-21: LGTM! Well-structured coefficient array.

The taps array is properly aligned and organized for optimal SIMD access.


23-145: Excellent optimization for combined checksum and copy operations!

The implementation intelligently avoids x4 load instructions when copying is involved, which helps prevent pipeline stalls. The code maintains numerical accuracy while adding copy functionality.


260-266: LGTM! Clean implementation of tail handling.

The helper function correctly processes remaining bytes that can't be handled by SIMD operations.


378-384: LGTM! Clean public interface implementation.

The wrapper functions provide a clear interface while delegating the actual work to the optimized implementation. The dst != NULL check in adler32_fold_copy_neon is an elegant way to control the copy behavior.

@Dead2 Dead2 changed the title Fold a copy into the adler32 function for UPDATEWINDOW ARM Neon: Fold a copy into the adler32 function for UPDATEWINDOW Mar 3, 2025
Copy link
Member

@Dead2 Dead2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Dead2 Dead2 merged commit 50e9ca0 into zlib-ng:develop Mar 5, 2025
151 checks passed
@Un1q32
Copy link
Contributor

Un1q32 commented Mar 30, 2025

Hey, this seems to not be needed with Clang 20. Can anyone else confirm?

@neheb
Copy link

neheb commented Apr 7, 2025

correct.

I just tested versions 2.2.4 and 2.2.3. Both succeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants