-
-
Notifications
You must be signed in to change notification settings - Fork 308
Fix Clang arm32 build #1824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Clang arm32 build #1824
Conversation
WalkthroughThe changes in this pull request modify the Changes
Suggested labels
Suggested reviewers
Possibly related PRs
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (34)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #1824 +/- ##
============================================
+ Coverage 32.25% 80.32% +48.07%
============================================
Files 67 138 +71
Lines 5745 11147 +5402
Branches 1239 2860 +1621
============================================
+ Hits 1853 8954 +7101
+ Misses 3637 1218 -2419
- Partials 255 975 +720 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This basically forces the fallback functions for all clang + arm compiles, that is likely hurting performance for users on those platforms that do not use Android or Alpine. I think the correct solution here might be to make sure that CMake/Configure sets Investigating why that define does not get set would be the first step to understanding this error. |
I have added a guard so it's only applied to Android or non-GNU libc (since musl, by design, does not predefine any special macro as a way to enforce POSIX-ness / correctness in software ecosystems).
I agree that it shouldn't be using the platform-detection and instead a feature introspection in configure. As a stop-gap solution unblocker, this is the minimal change which is relatively safe / conservative. |
|
This patch does not look bad anymore, but it still looks wrong. From the original description, I would think this does not actually fix the error, since that Since the functions actually collide with existing compiler intrinsics, I would think the zlib-ng/cmake/detect-intrinsics.cmake Line 241 in 5e3510e
The overridden intrinsics here are there to work around alignment restrictions, but the error does not indicate an alignment problem, it indicates a CMake / Configure problem. Does anyone have access to alpine arm32 to dig a little deeper into this? |
|
WITH_NEON is on by default: Line 107 in 5e3510e
if consumer project has zlib-ng/cmake/detect-intrinsics.cmake Line 178 in 5e3510e
|
|
@Dead2 as mentioned in the recent issue, clang-11 fails to compile on Debian-11 unless the ANDROID macro check is removed. It looks like an internal compiler crash. On Alpine Linux with clang-17, it compiles but crashes at runtime. I don't think this is related to libc. edit: also crashes on iOS. Definitely not related to libc. |
|
The current PR does not look bad, but I am not feeling entirely confident yet that it is the Right Way™ to fix it without adding to technical debt. So reviews are needed. @KungFuJesus You know this problem better than I do, what is your take on how this should be fixed? @ccawley2011 You helped work out the unaligned access rewrite on arm, so I think you might also have thoughts about this. |
|
@Dead2, the problem I was having is clear now and I described it in previous comment: #1824 (comment). The downstream project has |
This does sound plausible, unfortunately I don't myself know the arm ecosystem well enough to know the intricacies around its fpu implementations, and also the details around reproducing this bug are unclear. If I understand your comment correctly; Do you mean to say the |
Right now, the macro simply checks for |
So detecting any |
To fix the error shown in PR description; mfpu |
|
Sorry, there was some confusion on my part, and I forgot that the discussion in #1842 was actually about this bug (somehow misremembered that discussion as being PPC related. I just have too many things to keep track of). Sorry @KungFuJesus and @ccawley2011 for asking you on the same topic again. @am11 Does the fix proposed in #1869 work for you as well? I suspect not, since I assume Alpine does not set that define. @Un1q32 From the look of it, I am guessing this PR would solve your problem as well? @neheb @bgilbert Would this also fix the problems you are having? |
|
Yes this would make #1869 pointless, but if you're doing this for any non-glibc system you should probably check if platforms like FreeBSD have this problem. Also does Android define |
|
Honestly I'd be half tempted to just revert to 4 separate 128 bit loads (which actually compile to 8 64 bit loads, usually). The strict alignment requirement on these systems that you might lose more than you're saving doing the scalar sums up to the alignment. Though, for a whole window I think we do have the perfect alignment. The compiler seems to prefer to do scalar 64 bit loads to avoid a transition penalty into NEON domain, when using the ld1 instruction into 4 registers it can't do this. In this circumstance, that transition penalty I think is happening regardless because it's using those vector sums as opposed to just copying bytes. @ccawley2011 probably understands this restriction a little bit better, I'd be curious what his take on it is. I do know when messing around with this with a copy elision, it was significantly slower to do a complimenting st1 (x4) than doing 4x loads and 4x stores (I assume the pipeline was stalled and the loads and stores couldn't overlap). |
|
@Dead2 no Compilation under Debian 11 would still fail. And even if compilation succeeds, it breaks at runtime. BTW, I still have no idea where this even works properly with 32-bit ARM and clang. Also keep in mind that ARM 32 is a deprecated platform. Various OSes, containers, etc... have stopped offering ARM32 binaries. See: edit: I should also mention. Because of this issue, zlib-ng in meson's WrapDB has disabled NEON for ARM32 + clang: I assume suboptimal NEON code is preferable to no NEON code. |
|
Thank you for your replies. I was wondering whether Glibc was really somehow safe, and as I suspected, it was not. Ok, looks like the remaining alternatives right now are:
What do you think @KungFuJesus, should we just go for the fallback for now so it at least works, then any improvements can come later? @am11 I suggest updating the PR to something like this: (oh, and please rebase on current Develop too, that way CI should pass successfully too) |
|
We do use it one other place (namely here: https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/slide_hash_neon.c#L30). Because that's only ever strictly from the start of the window buffer though the alignment is guaranteed (I believe we align to 64 bytes). Here's what I had to do to anyway to elide the copy with the checksum: https://github.com/KungFuJesus/zlib-ng/blob/fold_adler_neon_copy/arch/arm/adler32_neon.c The improvement doesn't become apparent until much larger sizes, though. Given that it solves this issue, I'm half tempted to put it in for an PR now. Results: So, the savings are ~100ns until 262k sized buffers. |
|
Do we know why the code works in GCC? If this is a bug in Clang we should maybe get it fixed there. |
|
The impact on the basic adler routine as a whole from switching to scalar 64 bit loads: Post Separate LoadsPre separate loadsSo I mean, it's not placebo, it is faster by a tiny margin. Though I do wonder if in fact we're losing a lot of the wins from this by not being 512 bit aligned (the benchmark is, since it's aligned with our alloc). |
The code works on aarch64 because I don't think it has the strict alignment requirement for that variant of ld1. Clang for whatever reason imposes a hint at compile time that decorates the instruction with an alignment hint, regardless of whether or not it's needed architecturally. There may in fact be reasons for that with the ARM ABI but I don't pretend to know what they are. @ccawley2011 might, though. |
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
The issue is not limited to alpine, we had If downstream project has |
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix zlib-ng#1824.
So a lot of alterations had to be done to make this not worse and so far, it's not really better, either. I had to force inlining for the adler routine, I had to remove the x4 load instruction otherwise pipelining stalled, and I had to use restrict pointers with a copy idiom for GCC to inline a copy routine for the tail. Still, we see a small benefit in benchmarks, particularly when done with size of our window or larger. There's also an added benefit that this will fix #1824.
with this delta, the library code builds. Ideally, we should avoid naming collision with toolchain and system headers.
Summary by CodeRabbit
Bug Fixes
Refactor