-
-
Notifications
You must be signed in to change notification settings - Fork 308
Use GCC's may_alias attribute for unaligned memory access #1548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files☔ View full report in Codecov by Sentry. |
|
Ah cool, this might actually improve the purely scalar performance I saw testing this on a Sun Fire V240 (depressingly slower than stock zlib). I'll bet the extra function calls here per aliasing load were a big part of that. |
|
Fails a lot of CI tests, need to investigate why... |
8737e10 to
6adb2fc
Compare
zmemory.h
Outdated
|
|
||
| static inline uint32_t zng_memread_4(const void *ptr) { | ||
| #if defined(UNALIGNED_OK) | ||
| return *(const uint32_t *)ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a ton of work to avoid this kind of thing...
Compiler will convert memcpy to unaligned access if it is supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These UNALIGNED_OK branches should be removed. So there is only HAVE_MAY_ALIAS and the memcpy path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, UNALIGNED_OK branch should be removed.
We for sure don't want to default to that, if it was a new default-off setting like FORCE_UNALIGNED or something, then there would at least be room for discussion.
The HAVE_MAY_ALIAS branch is interesting, I wonder what kind of performance that gets compared to memcpy on the various platforms/compilers. If we do want to use that, detection would preferably be in configure/CMake for cleaner and wider compiler support, or at the very least moved to zbuild.h.
Do you have a Compiler Explorer example where this is the case? |
|
Does it revert PR #1309? |
This example showcases the difference on ARM with modern GCC when unaligned memory access is and isn't available. It also covers SPARC as well, where |
zmemory.h
Outdated
|
|
||
| static inline uint32_t zng_memread_4(const void *ptr) { | ||
| #if defined(UNALIGNED_OK) | ||
| return *(const uint32_t *)ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These UNALIGNED_OK branches should be removed. So there is only HAVE_MAY_ALIAS and the memcpy path.
zmemory.h
Outdated
| } | ||
|
|
||
| static inline int32_t zng_memcmp_4(const void *src0, const void *src1) { | ||
| #if defined(UNALIGNED_OK) || defined(HAVE_MAY_ALIAS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should remove the defined(UNALIGNED_OK) from all these. Only HAVE_MAY_ALIAS. We error on the side of "the compiler knows best" about converting memcpy to unaligned access. This is better than undefined behavior that can occur. There are other GitHub PRs related to this including #1100 that might be worth reading.
6adb2fc to
afdacc1
Compare
|
OK, I've rebased the branch and removed the special cases for |
afdacc1 to
5a9d9e3
Compare
5a9d9e3 to
28509bd
Compare
| #if defined(HAVE_MAY_ALIAS) | ||
| return zng_memread_2(src0) != zng_memread_2(src1); | ||
| #else | ||
| return memcmp(src0, src1, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original comment suggests it should be using memcpy and not memcmp.
Perhaps it should be: return zng_memread_2(src0) != zng_memread_2(src1); no matter what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code now does return zng_memread_2(src0) != zng_memread_2(src1); for cases where UNALIGNED_OK is defined, which should cover MSVC.
I'm not sure about using it for all cases since the worst case scenario there would be to emit two memcpy calls instead of one memcmp call, but since this PR covers the most common compilers (GCC >= 4 (and compatible) and MSVC), this should be OK for the time being.
28509bd to
5dc5794
Compare
5dc5794 to
d8da4c7
Compare
|
I'll have to do some in-depth testing of this, but I am leaving on vacation now with very limited access, so it'll unfortunately have to wait for a while. |
|
X86-64 i9900K, GCC 13.2 Develop Mar 5 af494fcPR 1548No changes noticeable. Anyone able to do MSVC builds and benchmarks? |
|
This PR must be rebased. There are some conflicts in the nmake files. |
silesia.tarMSVC 19.38.33134.0 DEVELOP af494fcPR #1548 rebased on top of DEVELOPI forgot to turn on decompression in the tests so I will check that out later. It is interesting to see a reduction in code size even though it should compile to the same thing. |
|
With decompression silesia.tarMSVC 19.38.33134.0 DEVELOP af494fcPR #1548 rebased on top of DEVELOP |
820046b to
f295cca
Compare
|
So, we have established that new gcc and msvc (both on x86-64) do not get regressions with this. |
f295cca to
d28c71a
Compare
WalkthroughThis pull request introduces a comprehensive refactoring of memory reading, writing, and comparison functions across multiple architecture-specific files in the zlib-ng library. The primary change is the introduction of a new header file Changes
Suggested labels
Suggested reviewers
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
d28c71a to
2f4631d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 2
🧹 Nitpick comments (14)
zmemory.h (4)
23-32: Review performance impact of zng_memread_4
Using a packed struct can be efficient if the compiler inlines these reads, but in tight loops or for large-scale usage, it may be worth benchmarking against an optimized memcpy or compiler intrinsics. Consider a fallback or specialized approach for non-GCC compilers.
54-61: Possible code duplication in zng_memwrite_4
The logic here is nearly identical to zng_memwrite_2, except for the data size. Consider templating or macro expansion to reduce code duplication for read/write pairs.
75-81: Check UNALIGNED_OK logic in zng_memcmp_2
You’ve maintained a fallback to memcmp if neither HAVE_MAY_ALIAS nor UNALIGNED_OK is defined. However, there is a risk that compilers might still insert unaligned reads. Consider making this logic more explicit.
83-89: Inconsistent usage of UNALIGNED_OK in zng_memcmp_4
Unlike zng_memcmp_2, which allows UNALIGNED_OK, consider whether your platform can handle 4-byte unaligned reads or if separate macros (e.g., UNALIGNED64_OK vs. UNALIGNED32_OK) are needed for future clarity.deflate_quick.c (1)
21-21: Validate that zmemory.h is necessary in this file
Including "zmemory.h" here suggests usage of the new memory functions. Although your current usage relies on zng_memcmp_2, consider whether inlined definitions from zmemory.h might cause code bloat if included in too many places.match_tpl.h (2)
78-79: Use of zng_memread_8 for scan_start and scan_end
Reading 8 bytes at a time can be beneficial for searching long matches, but it may be overkill if the typical match length is shorter. Evaluate performance on smaller or embedded platforms.
157-158: Use of zng_memcmp_8 for large comparisons
This is an improvement for performance, but confirm that older toolchains do not miscompile it.arch/generic/chunkset_c.c (3)
6-6: Consider narrower include scope
You only appear to use zng_memread_* and zng_memwrite_* in a few functions here. If build times matter, consider grouping these functions or limiting includes.
21-21: 8-byte chunk setting
Similarly, using zng_memread_8 can speed up chunk transfers. Evaluate on 32-bit platforms if alignment is suboptimal or if fallback is needed.
25-25: loadchunk usage
This function is straightforward, but ensure it’s not called in performance-critical hotspots that rely on potential compiler auto-vectorization or intrinsics.compare256_rle.h (1)
78-84: Consider caching the memory read resultThe 32-bit comparison could benefit from caching the result of
zng_memread_4if the compiler doesn't optimize it automatically.Consider this alternative implementation:
- src0_cmp = zng_memread_2(src0); - sv = ((uint32_t)src0_cmp << 16) | src0_cmp; + uint32_t src0_val = zng_memread_4(src0); + sv = src0_val;arch/x86/chunkset_avx2.c (1)
23-23: Excellent optimization for AVX2 memory operationsThe implementation efficiently combines unaligned memory reads with AVX2 vector operations. The use of
zng_memread_*functions with_mm256_set1_epi*is an optimal approach for loading and broadcasting values.This implementation:
- Eliminates temporary variables
- Reduces memory operations
- Leverages AVX2's efficient broadcast instructions
Also applies to: 27-27, 31-31
inflate_p.h (1)
178-181: Consider tracking the TODO comment as a separate issue.The comment identifies a potential performance optimization opportunity regarding memcpy inlining on architectures with strict memory alignment.
Would you like me to create a GitHub issue to track this TODO for future optimization?
arch/x86/chunkset_avx512.c (1)
37-45: Consider adding performance benchmarksWhile the changes look good from a technical perspective, it would be valuable to have performance benchmarks specifically for these chunked memory operations across different architectures.
Would you like me to help create a benchmark suite for these operations?
🛑 Comments failed to post (2)
zmemory.h (1)
45-52:
⚠️ Potential issueCheck for atomicity needs in zng_memwrite_2
This function writes 16 bits using direct assignment to a packed struct if HAVE_MAY_ALIAS is defined. If the surrounding code relies on atomic 16-bit writes, confirm whether the hardware and compiler guarantee atomicity here.arch/generic/chunkset_c.c (1)
29-29:
⚠️ Potential issuestorechunk atomicity
zng_memwrite_8 is used for storing chunks. For concurrency or lock-free data structures, confirm that partial writes are not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
zmemory.h (2)
72-81: zng_memcmp_2 function might benefit from naming clarifications.
The function returns an int32_t but does a boolean check (inequality). The usage is correct, but the name “memcmp_2” could imply returning a memcmp-style result rather than a boolean. Consider clarifying the docs if future maintainers might be confused.
83-89: zng_memcmp_4 function uses the same approach as zng_memcmp_2.
No correctness issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
arch/arm/chunkset_neon.c(2 hunks)arch/arm/compare256_neon.c(1 hunks)arch/generic/Makefile.in(1 hunks)arch/generic/chunkset_c.c(2 hunks)arch/generic/compare256_c.c(3 hunks)arch/power/chunkset_power8.c(2 hunks)arch/power/compare256_power9.c(1 hunks)arch/riscv/compare256_rvv.c(1 hunks)arch/x86/chunkset_avx2.c(2 hunks)arch/x86/chunkset_avx512.c(2 hunks)arch/x86/chunkset_sse2.c(2 hunks)arch/x86/chunkset_ssse3.c(2 hunks)arch/x86/compare256_avx2.c(1 hunks)arch/x86/compare256_sse2.c(1 hunks)compare256_rle.h(4 hunks)deflate.h(6 hunks)deflate_quick.c(1 hunks)inflate_p.h(3 hunks)insert_string_tpl.h(2 hunks)match_tpl.h(4 hunks)win32/Makefile.a64(2 hunks)win32/Makefile.arm(2 hunks)win32/Makefile.msc(2 hunks)zmemory.h(1 hunks)zutil_p.h(0 hunks)
💤 Files with no reviewable changes (1)
- zutil_p.h
🚧 Files skipped from review as they are similar to previous changes (22)
- inflate_p.h
- win32/Makefile.msc
- arch/x86/compare256_sse2.c
- arch/x86/chunkset_avx512.c
- insert_string_tpl.h
- arch/x86/chunkset_sse2.c
- arch/generic/Makefile.in
- arch/arm/chunkset_neon.c
- deflate_quick.c
- deflate.h
- arch/power/compare256_power9.c
- win32/Makefile.a64
- arch/x86/chunkset_ssse3.c
- win32/Makefile.arm
- arch/riscv/compare256_rvv.c
- arch/arm/compare256_neon.c
- match_tpl.h
- arch/x86/chunkset_avx2.c
- arch/x86/compare256_avx2.c
- arch/generic/chunkset_c.c
- compare256_rle.h
- arch/power/chunkset_power8.c
🔇 Additional comments (14)
zmemory.h (11)
1-3: File header looks solid and provides necessary context.
Great job including copyright and distribution references.
5-7: Guard macros appear correct and well-named.
Using "_ZMEMORY_H" is consistent and helps avoid collisions.
12-21: zng_memread_2 function is clear and concise.
Utilizing a packed, may_alias struct is a standard approach for unaligned loads. The fallback to memcpy for compilers without may_alias ensures portability.
23-32: zng_memread_4 function mirrors zng_memread_2 pattern.
This consistent approach is easy to maintain across multiple data sizes.
34-43: zng_memread_8 function continues the same pattern.
No issues spotted. Adheres to the PR’s objective of optimizing unaligned reads.
45-52: zng_memwrite_2 function is straightforward.
Ensuring two-byte writes handle alignment carefully is beneficial.
54-61: zng_memwrite_4 function.
Again, consistent with the prior approach. No concerns.
63-70: zng_memwrite_8 function mirrors the smaller-size approach.
Everything looks good.
91-97: zng_memcmp_8 function remains consistent with the smaller-size versions.
No issues. Good unaligned handling.
99-99: End of file macro
Everything looks properly closed.
8-11: Conditional check for GCC version.
Defining HAVE_MAY_ALIAS meets the PR objective of utilizing GCC’s may_alias attribute. Ensure that the build system sets this macro consistently.
✅ Verification successful
Let me verify if HAVE_MAY_ALIAS is used consistently across the codebase.
Let me check the context where HAVE_MAY_ALIAS is used in zmemory.h to ensure it's used consistently with the GCC version check.
HAVE_MAY_ALIAS macro is correctly defined and consistently used
The macro is properly defined based on GCC version (>= 4) and is consistently used throughout zmemory.h for optimized memory operations. It enables the use of the __may_alias__ attribute for unaligned memory access when available, with proper fallbacks to memcpy/memcmp when the attribute is not supported. The implementation is clean and follows good practices for compiler-specific optimizations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Confirm that HAVE_MAY_ALIAS is set only when __GNUC__ >= 4
rg -A 3 '__GNUC__.*>= 4'
Length of output: 140
Script:
#!/bin/bash
# Check for other occurrences of HAVE_MAY_ALIAS
rg "HAVE_MAY_ALIAS"
# Check for other GCC version checks
rg "__GNUC__"
Length of output: 1786
Script:
#!/bin/bash
# Show the content of zmemory.h to see how HAVE_MAY_ALIAS is used
cat zmemory.h
Length of output: 3039
arch/generic/compare256_c.c (3)
7-7: Switch to zmemory.h.
This aligns with the new unaligned memory approach and centralizes memory operations. Good practice to reduce duplication.
110-111: Use of zng_memread_4 in compare256_unaligned_32_static.
This is consistent with the PR’s intent to replace memcpy with direct unaligned reads. Potential performance benefits on platforms without native unaligned access.
154-155: Use of zng_memread_8 in compare256_unaligned_64_static.
Similar pattern for 64-bit reads. Looks correct and consistent.
nmoinvaz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me.
On platforms that don't allow unaligned memory access, calls to
memcpydon't always get inlined in cases where they would on platforms with it. Using themay_aliasattribute ensures that the code for reading and writing one byte at a time is inlined, and should also handle cases where unaligned memory access is allowed (although checkingUNALIGNED_OKworks better in that regard).Summary by CodeRabbit
New Features
zng_memread_*,zng_memwrite_*) for optimized memory operations.zmemory.h, which consolidates memory management functions.Bug Fixes
memcpycalls with direct memory operations to enhance performance and reduce overhead.Documentation
Chores