这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@am11
Copy link
Contributor

@am11 am11 commented Dec 11, 2024

/runtime/src/native/external/zlib-ng/arch/arm/neon_intrins.h:38:28: error: expected identifier or '('
   38 | static inline uint16x8x4_t vld1q_u16_x4(uint16_t const *a) {
      |                            ^
/usr/lib/llvm19/lib/clang/19/include/arm_neon.h:13458:28: note: expanded from macro 'vld1q_u16_x4'
 13458 | #define vld1q_u16_x4(__p0) __extension__ ({ \
       |                            ^

with this delta, the library code builds. Ideally, we should avoid naming collision with toolchain and system headers.

Summary by CodeRabbit

  • Bug Fixes

    • Adjusted conditional compilation to enhance compatibility with ARM NEON intrinsics across various platforms.
  • Refactor

    • Simplified preprocessor directives for broader applicability, improving code maintainability.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 11, 2024

Walkthrough

The changes in this pull request modify the arch/arm/neon_intrins.h header file by adjusting conditional compilation directives related to ARM NEON intrinsics. Specifically, the condition for Android platform checks when using Clang for 32-bit ARM architectures has been refined to include cases where the GNU C Library checks are not defined. This results in the broader conditional context for several macros, while the definitions of inline functions for vector loading and storing remain unchanged.

Changes

File Change Summary
arch/arm/neon_intrins.h Updated conditional compilation to include cases where GNU C Library is not defined, affecting the undefining of several macros.

Suggested labels

Architecture

Suggested reviewers

  • Dead2

Possibly related PRs

  • Fix native detection of ARM CRC instruction #1818: This PR modifies conditional compilation directives in arch/arm/arm_functions.h, similar to the changes in arch/arm/neon_intrins.h regarding ARM architecture, specifically focusing on defining macros based on compiler features.
  • Improved setting of OPTIMAL_CMP on ARM #1835: This PR adjusts the OPTIMAL_CMP macro for ARM architectures in zbuild.h, which relates to the broader context of ARM architecture support, akin to the changes made in the main PR.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea9e691 and b7852fb.

📒 Files selected for processing (1)
  • arch/arm/neon_intrins.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • arch/arm/neon_intrins.h
⏰ Context from checks skipped due to timeout of 90000ms (34)
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt
  • GitHub Check: EL9 GCC S390X DFLTCC ASAN
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt UBSAN
  • GitHub Check: Ubuntu GCC AARCH64 ASAN
  • GitHub Check: Ubuntu Emscripten WASM32
  • GitHub Check: Ubuntu GCC AARCH64 Compat No Opt

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@codecov
Copy link

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.32%. Comparing base (43d74a2) to head (b7852fb).
Report is 56 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #1824       +/-   ##
============================================
+ Coverage    32.25%   80.32%   +48.07%     
============================================
  Files           67      138       +71     
  Lines         5745    11147     +5402     
  Branches      1239     2860     +1621     
============================================
+ Hits          1853     8954     +7101     
+ Misses        3637     1218     -2419     
- Partials       255      975      +720     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Dead2
Copy link
Member

Dead2 commented Dec 21, 2024

This basically forces the fallback functions for all clang + arm compiles, that is likely hurting performance for users on those platforms that do not use Android or Alpine.

I think the correct solution here might be to make sure that CMake/Configure sets ARM_NEON_HASLD4 during configure, that way those functions do not get defined because we know that the system already provides them.

Investigating why that define does not get set would be the first step to understanding this error.

@am11
Copy link
Contributor Author

am11 commented Dec 21, 2024

This basically forces the fallback functions for all clang + arm compiles, that is likely hurting performance for users on those platforms that do not use Android or Alpine.

I have added a guard so it's only applied to Android or non-GNU libc (since musl, by design, does not predefine any special macro as a way to enforce POSIX-ness / correctness in software ecosystems).

Investigating why that define does not get set would be the first step to understanding this error.

I agree that it shouldn't be using the platform-detection and instead a feature introspection in configure. As a stop-gap solution unblocker, this is the minimal change which is relatively safe / conservative.

@Dead2
Copy link
Member

Dead2 commented Jan 26, 2025

This patch does not look bad anymore, but it still looks wrong.

From the original description, I would think this does not actually fix the error, since that #if would have already been true and this change only adds another restriction to it, making it more unlikely to trigger, while the description indicates it is instead needed to trigger in a case where it does not.

Since the functions actually collide with existing compiler intrinsics, I would think the NEON_HAS_LD4 define should instead have been set by CMake or Configure, and that way this code would not try to override the intrinsics and would not error but instead use the compiler-provided intrinsics.
I suspect the problem lies in this test:

macro(check_neon_ld4_intrinsics)

The overridden intrinsics here are there to work around alignment restrictions, but the error does not indicate an alignment problem, it indicates a CMake / Configure problem.

Does anyone have access to alpine arm32 to dig a little deeper into this?

@am11
Copy link
Contributor Author

am11 commented Jan 26, 2025

WITH_NEON is on by default:

option(WITH_NEON "Build with NEON intrinsics" ON)

if consumer project has -fpu=vfpv3 it causes this failure. Either the consumer project needs to disable WITH_NEON or it needs better introspection, i.e. revisit:

macro(check_neon_compiler_flag)

@neheb
Copy link

neheb commented Feb 11, 2025

@Dead2 as mentioned in the recent issue, clang-11 fails to compile on Debian-11 unless the ANDROID macro check is removed. It looks like an internal compiler crash.

On Alpine Linux with clang-17, it compiles but crashes at runtime.

I don't think this is related to libc.

edit: also crashes on iOS. Definitely not related to libc.

@Dead2
Copy link
Member

Dead2 commented Feb 11, 2025

The current PR does not look bad, but I am not feeling entirely confident yet that it is the Right Way™ to fix it without adding to technical debt. So reviews are needed.

@KungFuJesus You know this problem better than I do, what is your take on how this should be fixed?
I am reluctant to go for fallbacks on all arm cpus obviously, but how do we best select the ones that do need this?

@ccawley2011 You helped work out the unaligned access rewrite on arm, so I think you might also have thoughts about this.

@am11
Copy link
Contributor Author

am11 commented Feb 11, 2025

@Dead2, the problem I was having is clear now and I described it in previous comment: #1824 (comment). The downstream project has -fpu=vfpv3 set which check_neon_compiler_flag macro seems to ignore. I think we can fix it in cmake and revert neon_intrins.h.

@Dead2
Copy link
Member

Dead2 commented Feb 11, 2025

@Dead2, the problem I was having is clear now and I described it in previous comment: #1824 (comment). The downstream project has -fpu=vfpv3 set which check_neon_compiler_flag macro seems to ignore. I think we can fix it in cmake and revert neon_intrins.h.

This does sound plausible, unfortunately I don't myself know the arm ecosystem well enough to know the intricacies around its fpu implementations, and also the details around reproducing this bug are unclear.

If I understand your comment correctly; Do you mean to say the check_neon_compiler_flag macro should also run the check with NEONFLAG=-mfpu=vfpv3 (falling back to that after failing with -mfpu=neon), and that it would then succeed where it currently does not?

@am11
Copy link
Contributor Author

am11 commented Feb 11, 2025

If I understand your comment correctly; Do you mean to say the check_neon_compiler_flag macro should also run the check with NEONFLAG=-mfpu=vfpv3 (falling back to that after failing with -mfpu=neon), and that it would then succeed where it currently does not?

Right now, the macro simply checks for -mfpu=neon and adds it to the CFLAGS, which overwrites any -mfpu value a downstream project might have set (e.g., -mfpu=vfpv3, -mfpu=vfpv3-neon, or even -mfpu=neon). Ideally, if the user has already set -mfpu, the macro should skip introspection and respect that choice. If no -mfpu is set, then it can proceed with detecting and setting an appropriate value. That way, the code doesn’t need additional platform detection as shown in the current state of this PR.

@Dead2
Copy link
Member

Dead2 commented Feb 11, 2025

Right now, the macro simply checks for -mfpu=neon and adds it to the CFLAGS, which overwrites any -mfpu value a downstream project might have set (e.g., -mfpu=vfpv3, -mfpu=vfpv3-neon, or even -mfpu=neon). Ideally, if the user has already set -mfpu, the macro should skip introspection and respect that choice. If no -mfpu is set, then it can proceed with detecting and setting an appropriate value. That way, the code doesn’t need additional platform detection as shown in the current state of this PR.

So detecting any -mfpu= in CFLAGS and using that, yeah I guess that could work.
Not sure what the benefit is to doing that instead of adding our own check though. We do the detection without using the user-provided CFLAGs because they can (and have) easily mess up the compiler checks.

@am11
Copy link
Contributor Author

am11 commented Feb 11, 2025

Not sure what the benefit is to doing that instead of adding our own check though.

To fix the error shown in PR description; mfpu vfpv3 and neon are conflicting.

@Dead2
Copy link
Member

Dead2 commented Feb 11, 2025

Sorry, there was some confusion on my part, and I forgot that the discussion in #1842 was actually about this bug (somehow misremembered that discussion as being PPC related. I just have too many things to keep track of). Sorry @KungFuJesus and @ccawley2011 for asking you on the same topic again.

@am11 Does the fix proposed in #1869 work for you as well? I suspect not, since I assume Alpine does not set that define.
I am seeing a trend here though that 32bit arm is a bit of a mess to detect automatically, so perhaps this PR is in fact the best tradeoff of losing some (unspecified amount of) performance on some machines to gain correctness on others.

@Un1q32 From the look of it, I am guessing this PR would solve your problem as well?

@neheb @bgilbert Would this also fix the problems you are having?

@Un1q32
Copy link
Contributor

Un1q32 commented Feb 11, 2025

Yes this would make #1869 pointless, but if you're doing this for any non-glibc system you should probably check if platforms like FreeBSD have this problem. Also does Android define __GLIBC__ because if not then you could remove the check for __ANDROID__ as well.

@KungFuJesus
Copy link
Contributor

KungFuJesus commented Feb 11, 2025

Honestly I'd be half tempted to just revert to 4 separate 128 bit loads (which actually compile to 8 64 bit loads, usually). The strict alignment requirement on these systems that you might lose more than you're saving doing the scalar sums up to the alignment. Though, for a whole window I think we do have the perfect alignment. The compiler seems to prefer to do scalar 64 bit loads to avoid a transition penalty into NEON domain, when using the ld1 instruction into 4 registers it can't do this. In this circumstance, that transition penalty I think is happening regardless because it's using those vector sums as opposed to just copying bytes. @ccawley2011 probably understands this restriction a little bit better, I'd be curious what his take on it is. I do know when messing around with this with a copy elision, it was significantly slower to do a complimenting st1 (x4) than doing 4x loads and 4x stores (I assume the pipeline was stalled and the loads and stores couldn't overlap).

@neheb
Copy link

neheb commented Feb 11, 2025

@Dead2 no

Compilation under Debian 11 would still fail. And even if compilation succeeds, it breaks at runtime.

BTW, I still have no idea where this even works properly with 32-bit ARM and clang. Also keep in mind that ARM 32 is a deprecated platform. Various OSes, containers, etc... have stopped offering ARM32 binaries.

See:
https://www.linuxserver.io/blog/a-farewell-to-arm-hf
microsoft/WindowsAppSDK#406

edit: I should also mention. Because of this issue, zlib-ng in meson's WrapDB has disabled NEON for ARM32 + clang:
mesonbuild/wrapdb@b45f5e9

I assume suboptimal NEON code is preferable to no NEON code.

@Dead2
Copy link
Member

Dead2 commented Feb 11, 2025

Thank you for your replies. I was wondering whether Glibc was really somehow safe, and as I suspected, it was not.

Ok, looks like the remaining alternatives right now are:

  • Somehow detect the situations where the fallback is needed (Looking increasingly unlikely)
  • Always using the fallback code for Clang + arm32 (Easiest)
  • The solution @KungFuJesus suggested above

What do you think @KungFuJesus, should we just go for the fallback for now so it at least works, then any improvements can come later?

@am11 I suggest updating the PR to something like this: (oh, and please rebase on current Develop too, that way CI should pass successfully too)

#  if defined(__clang__) && defined(__arm__)
/* Clang for 32-bit Android can have strict alignment requirements (:256) for x4 NEON intrinsics */

@KungFuJesus
Copy link
Contributor

We do use it one other place (namely here: https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/slide_hash_neon.c#L30). Because that's only ever strictly from the start of the window buffer though the alignment is guaranteed (I believe we align to 64 bytes).

Here's what I had to do to anyway to elide the copy with the checksum:

https://github.com/KungFuJesus/zlib-ng/blob/fold_adler_neon_copy/arch/arm/adler32_neon.c

The improvement doesn't become apparent until much larger sizes, though. Given that it solves this issue, I'm half tempted to put it in for an PR now. Results:

adler32_copy/neon_copy_baseline/8192           722 ns          722 ns       970203
adler32_copy/neon_copy_baseline/32768         2860 ns         2860 ns       244733
adler32_copy/neon_copy_baseline/262144       23637 ns        23636 ns        29602
adler32_copy/neon_copy_baseline/2097152     527776 ns       527767 ns         1319
adler32_copy/neon_copy_baseline/4194304    1110215 ns      1110213 ns          647
adler32_copy/neon/8192                         673 ns          672 ns      1041015
adler32_copy/neon/32768                       2665 ns         2664 ns       262706
adler32_copy/neon/262144                     21470 ns        21469 ns        32741
adler32_copy/neon/2097152                   273652 ns       273651 ns         2359
adler32_copy/neon/4194304                   774135 ns       774090 ns          926

So, the savings are ~100ns until 262k sized buffers.

@Un1q32
Copy link
Contributor

Un1q32 commented Feb 11, 2025

Do we know why the code works in GCC? If this is a bug in Clang we should maybe get it fixed there.

@KungFuJesus
Copy link
Contributor

KungFuJesus commented Feb 11, 2025

The impact on the basic adler routine as a whole from switching to scalar 64 bit loads:

Post Separate Loads

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
adler32/neon/1             6.27 ns         6.27 ns    111682787
adler32/neon/8             10.0 ns         10.0 ns     69975017
adler32/neon/12            13.3 ns         13.3 ns     52482521
adler32/neon/16            18.8 ns         18.8 ns     37319559
adler32/neon/32            19.8 ns         19.8 ns     35356502
adler32/neon/64            21.5 ns         21.5 ns     32504872
adler32/neon/512           48.4 ns         48.4 ns     14477778
adler32/neon/4096           257 ns          257 ns      2723813
adler32/neon/32768         1994 ns         1994 ns       350823
adler32/neon/262144       16013 ns        16013 ns        43719
adler32/neon/4194304     271203 ns       271186 ns         2581

Pre separate loads

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
adler32/neon/1             6.27 ns         6.27 ns    111603089
adler32/neon/8             10.4 ns         10.4 ns     67175829
adler32/neon/12            13.8 ns         13.8 ns     50890743
adler32/neon/16            18.4 ns         18.4 ns     37929933
adler32/neon/32            19.4 ns         19.4 ns     36030444
adler32/neon/64            19.9 ns         19.9 ns     35099318
adler32/neon/512           46.5 ns         46.5 ns     15061949
adler32/neon/4096           253 ns          253 ns      2764379
adler32/neon/32768         1986 ns         1986 ns       352460
adler32/neon/262144       15940 ns        15940 ns        43926
adler32/neon/4194304     269802 ns       269792 ns         2591

So I mean, it's not placebo, it is faster by a tiny margin. Though I do wonder if in fact we're losing a lot of the wins from this by not being 512 bit aligned (the benchmark is, since it's aligned with our alloc).

@KungFuJesus
Copy link
Contributor

Do we know why the code works in GCC? If this is a bug in Clang we should maybe get it fixed there.

The code works on aarch64 because I don't think it has the strict alignment requirement for that variant of ld1. Clang for whatever reason imposes a hint at compile time that decorates the instruction with an alignment hint, regardless of whether or not it's needed architecturally. There may in fact be reasons for that with the ARM ABI but I don't pretend to know what they are. @ccawley2011 might, though.

KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
@KungFuJesus
Copy link
Contributor

@Un1q32 does #1870 work for you?

KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
@Un1q32
Copy link
Contributor

Un1q32 commented Feb 12, 2025

@Un1q32 does #1870 work for you?

Yes, this doesn't crash

@am11
Copy link
Contributor Author

am11 commented Feb 12, 2025

@am11 Does the fix proposed in #1869 work for you as well? I suspect not, since I assume Alpine does not set that define.

The issue is not limited to alpine, we had -mfpu=vfpv3 set in alpine that's why it was observed there. I realized it later at #1824 (comment) and since then I've stopped mentioning alpine. :)

If downstream project has CFLAGS=-mfpu=vfpv3 before including zlib-ng cmake project, the build fails with the error shown in the top post, because check_neon_compiler_flag will add -mfpu=neon (in addition to the existing vfpv3) and things start to go downhill from there. The workaround is to explicitly turn off neon set(WITH_NEON OFF) for arm32 in downstream consumer project. This should not be necessary if check_neon_compiler_flag was considering the existing value of -mfpu and making the right choices.

KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
@Dead2 Dead2 changed the title Fix alpine arm32 build Fix Clang arm32 build Feb 12, 2025
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 12, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 15, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 15, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 16, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
KungFuJesus added a commit to KungFuJesus/zlib-ng that referenced this pull request Feb 23, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix zlib-ng#1824.
Dead2 pushed a commit that referenced this pull request Mar 5, 2025
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly when done
with size of our window or larger. There's also an added benefit that
this will fix #1824.
@Dead2 Dead2 closed this in #1870 Mar 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants