Enable AVX2 functions to be built with BMI2 instructions #1816
Walkthrough
The pull request updates the handling of AVX2 optimizations in the codebase. It modifies the
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
functable.c (1)
113-113: Document the BMI2 dependency in a comment.
Consider adding a comment explaining why BMI2 is required alongside AVX2 to help future maintainers understand the dependency.
+ /* BMI2 is required as AVX2 implementations use flagless versions of shifts */
  if (cf.x86.has_avx2 && cf.x86.has_bmi2) {
cmake/detect-intrinsics.cmake (2)
Line range hint 165-173: Consider enhancing the AVX2 intrinsics test to verify BMI2 support.
The current test only verifies AVX2 support through _mm256_subs_epu16. Since we're adding BMI2 flags, it would be good to also verify BMI2 support in the same test.
Here's a suggested enhancement:
  check_c_source_compiles(
      "#include <immintrin.h>
      __m256i f(__m256i x) {
          const __m256i y = _mm256_set1_epi16(1);
-         return _mm256_subs_epu16(x, y);
+         // Test both AVX2 and BMI2 intrinsics
+         const __m256i z = _mm256_subs_epu16(x, y);
+         unsigned long long a = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32); // BMI2 intrinsic
+         return _mm256_set1_epi64x(a);
      }
      int main(void) { return 0; }"
      HAVE_AVX2_INTRIN
  )
153-162: Consider adding a comment explaining the BMI2 requirement.
It would be helpful to document why BMI2 is enabled by default with AVX2, explaining that all CPUs supporting AVX2 also support BMI2, and how this improves performance.
Add a comment like this before the flag settings:
  if(NOT NATIVEFLAG)
      if(CMAKE_C_COMPILER_ID MATCHES "Intel")
          if(CMAKE_HOST_UNIX OR APPLE)
+             # BMI2 is enabled by default with AVX2 as all AVX2-capable CPUs support BMI2.
+             # This allows for better performance through flagless shifts and reduced GPR operations.
              set(AVX2FLAG "-mavx2 -mbmi2")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- arch/x86/Makefile.in (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- functable.c (1 hunks)
🔇 Additional comments (4)
arch/x86/Makefile.in (2)
13-13: LGTM! Verify runtime CPU feature detection.
The addition of -mbmi2 to AVX2FLAG aligns with the existing pattern where AVX512-related flags already include BMI2. This change is technically sound since all CPUs with AVX2 support also have BMI2 capabilities.
Let's verify the runtime CPU feature detection code:
✅ Verification successful
BMI2 detection is properly handled alongside AVX2
The CPU feature detection code in x86_features.c correctly handles BMI2 detection. The code:
- Detects BMI2 support via CPUID (bit 8 in EBX from leaf 7)
- Stores it in the has_bmi2 field
- Already requires BMI2 for AVX512 optimizations (has_avx512_common checks for has_bmi2)
Since the runtime detection is properly implemented and BMI2 is already a requirement for higher SIMD features, adding -mbmi2 to AVX2FLAG is safe.
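As a rough illustration of the detection summarized above, the following C sketch reads CPUID leaf 7 and decodes the AVX2 (EBX bit 5) and BMI2 (EBX bit 8) bits. It is not the code from x86_features.c: every demo_ name and the struct are hypothetical, __get_cpuid_count is assumed to come from the GCC/Clang header cpuid.h, and the OS-support check (XGETBV) that a complete AVX2 test also needs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper header; assumed available */
#endif

/* Bit positions in EBX of CPUID leaf 7, subleaf 0. */
#define DEMO_CPUID7_EBX_AVX2 5
#define DEMO_CPUID7_EBX_BMI2 8

/* Hypothetical feature record, not the project's actual struct. */
struct demo_x86_features {
    int has_avx2;
    int has_bmi2;
};

/* Decode the two feature bits from a raw EBX value. */
static void demo_decode_leaf7_ebx(uint32_t ebx, struct demo_x86_features *f) {
    assert(f != NULL);
    f->has_avx2 = (int)((ebx >> DEMO_CPUID7_EBX_AVX2) & 1u);
    f->has_bmi2 = (int)((ebx >> DEMO_CPUID7_EBX_BMI2) & 1u);
}

/* Query the running CPU; reports no features on non-x86 targets
 * or when leaf 7 is not supported. */
static void demo_detect(struct demo_x86_features *f) {
    uint32_t ebx = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d))
        ebx = b;
#endif
    demo_decode_leaf7_ebx(ebx, f);
}
```

Keeping the decode step separate from the CPUID call makes the bit positions easy to check on any host, including ones without AVX2.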
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if BMI2 detection is properly handled alongside AVX2
# Search for CPU feature detection code
rg -A 5 "BMI2.*AVX2|AVX2.*BMI2" --type cpp
Length of output: 44
Script:
#!/bin/bash
# Let's search for CPU feature detection code more broadly
# First, let's check for x86 CPU feature detection files
fd -e c -e h "cpu.*feat|x86" --type f
# Then let's look for BMI2 detection
rg "BMI2" --type cpp -A 5
# And AVX2 detection
rg "AVX2" --type cpp -A 5
Length of output: 6574
Script:
#!/bin/bash
# Let's examine the CPU feature detection code in the identified files
cat arch/x86/x86_features.h
cat arch/x86/x86_features.c
Length of output: 4364
13-13: Verify impact on AVX2-optimized files.
This change affects the compilation of all AVX2-optimized files. Let's verify the implementation details of affected files to ensure they can benefit from BMI2 instructions:
functable.c (1)
Line range hint 113-125: LGTM! Verify BMI2 usage in AVX2 functions.
The addition of the BMI2 check is correct and aligns with the PR objectives. The implementation maintains thread-safety through atomic operations and proper memory barriers.
Let's verify that the AVX2 functions actually use BMI2 instructions:
cmake/detect-intrinsics.cmake (1)
154-154: LGTM! The addition of BMI2 flag aligns with the PR objectives.
The changes correctly add the -mbmi2 flag alongside -mavx2 for both Intel and GNU/Clang compilers. This is consistent with the fact that all CPUs supporting AVX2 also support BMI2, and will enable better performance through flagless shifts.
Let's verify the consistency of BMI2 flag usage across other intrinsics:
Also applies to: 159-159
✅ Verification successful
BMI2 flag usage is consistent across AVX2 and AVX512 intrinsics
The verification confirms that -mbmi2 flag is consistently used with both AVX2 and AVX512 intrinsics in the codebase. This is correct because:
- AVX2 functions are being enabled with BMI2 instructions as per PR objectives
- AVX512 implementations already include BMI2 flag, showing established pattern
- The consistency in flag usage across different instruction set extensions maintains proper build configuration
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify BMI2 flag usage in other AVX-related intrinsics checks
# Test: Check if BMI2 is consistently used with AVX512 flags
rg "set\(AVX512.*FLAG.*mbmi2" cmake/detect-intrinsics.cmake
Length of output: 423
4ca7e64 to bfec601
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (6)
functable.c (1)
113-113: Add a comment explaining the AVX2/BMI2 relationship.
Consider adding a code comment explaining why BMI2 is checked alongside AVX2, referencing the architectural relationship between these instruction sets.
+ // BMI2 is checked alongside AVX2 as all CPUs supporting AVX2 also support BMI2.
+ // This enables optimized implementations using flagless versions of shifts.
  if (cf.x86.has_avx2 && cf.x86.has_bmi2) {
configure (5)
Line range hint 223-246: Enhance check_avx2_intrinsics() for better compatibility
To prevent scenarios where AVX2 optimizations are unintentionally disabled due to lack of BMI2 support, modify the check_avx2_intrinsics() function to check for AVX2 support without BMI2 as a fallback.
Apply this diff to improve the function:
  check_avx2_intrinsics() {
      # Check whether compiler supports AVX2 intrinsics
      cat > $test.c << EOF
  #include <immintrin.h>
  __m256i f(__m256i x) {
      const __m256i y = _mm256_set1_epi16(1);
      return _mm256_subs_epu16(x, y);
  }
  int main(void) { return 0; }
  EOF
-     if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
+     if try ${CC} ${CFLAGS} -mavx2 -mbmi2 $test.c; then
          echo "Checking for AVX2 intrinsics ... Yes." | tee -a configure.log
          HAVE_AVX2_INTRIN=1
+         avx2flag="-mavx2 -mbmi2"
+     elif try ${CC} ${CFLAGS} -mavx2 $test.c; then
+         echo "Checking for AVX2 intrinsics without BMI2 ... Yes." | tee -a configure.log
+         HAVE_AVX2_INTRIN=1
+         avx2flag="-mavx2"
      else
          echo "Checking for AVX2 intrinsics ... No." | tee -a configure.log
          HAVE_AVX2_INTRIN=0
      fi
  }
Line range hint 487-494: Indentation inconsistency in check_mtune_cascadelake_compiler_flag()
The indentation within the if and else blocks of the check_mtune_cascadelake_compiler_flag() function is inconsistent, which affects the readability of the code.
Consider adjusting the indentation for clarity:
check_mtune_cascadelake_compiler_flag() {
    # Check whether -mtune=cascadelake works correctly
    cat > $test.c << EOF
int main() { return 0; }
EOF
    if try $CC -c $CFLAGS -mtune=cascadelake $test.c; then
        MTUNE_CASCADELAKE_AVAILABLE=1
        echo "Check whether -mtune=cascadelake works ... Yes." | tee -a configure.log
    else
        echo "Check whether -mtune=cascadelake works ... No." | tee -a configure.log
        MTUNE_CASCADELAKE_AVAILABLE=0
        check_mtune_skylake_avx512_compiler_flag
    fi
}
Line range hint 1670-1670: Typo in variable assignment
There's a possible typo in the variable name within the check_mtune_skylake_avx512_compiler_flag() function. The variable MTUNE_SKYLAKE_AVX512_AVAILABLE might have been misspelled or inconsistently used elsewhere.
Please verify the variable name for consistency to ensure that the script functions correctly.
Line range hint 1479-1489: Avoid hardcoding architecture-specific values
In the architecture detection section for x86/amd64, the script hardcodes values for ARCHDIR, CFLAGS, and object files. This can reduce maintainability and adaptability for future architecture expansions.
Consider abstracting architecture-specific settings into variables or a configuration file to improve modularity and ease future updates.
Separate AVX2 and BMI2 flags for better compatibility
The verification confirms that avx2flag is defined as -mavx2 -mbmi2 and used directly in check_avx2_intrinsics(). The current implementation bundles AVX2 and BMI2 support checks together, which could indeed prevent AVX2 optimizations from being used on systems where the compiler supports AVX2 but not BMI2.
- In configure, modify line where avx2flag is defined to only include -mavx2
- Create a separate bmi2flag variable if BMI2 instructions are needed elsewhere
- Update check_avx2_intrinsics() to test BMI2 support separately if required
🔗 Analysis chain
Line range hint 223-233: Ensure robust detection in check_avx2_intrinsics()
The check_avx2_intrinsics() function checks for compiler support of AVX2 intrinsics using both -mavx2 and -mbmi2 flags. While most CPUs supporting AVX2 also support BMI2, and compilers generally handle both flags together, there might be edge cases where a compiler supports AVX2 but not BMI2.
To ensure broader compatibility, consider modifying the function to separately check for AVX2 support without BMI2, falling back if necessary. This ensures that AVX2 optimizations are not skipped due to lack of BMI2 support.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify compiler support for AVX2 with and without BMI2.
cat > $test.c << EOF
#include <immintrin.h>
__m256i f(__m256i x) {
    const __m256i y = _mm256_set1_epi16(1);
    return _mm256_subs_epu16(x, y);
}
int main(void) { return 0; }
EOF
# Test AVX2 with BMI2
if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
    echo "Compiler supports AVX2 with BMI2."
    HAVE_AVX2_INTRIN=1
else
    # Test AVX2 without BMI2
    if try ${CC} ${CFLAGS} -mavx2 $test.c; then
        echo "Compiler supports AVX2 without BMI2."
        HAVE_AVX2_INTRIN=1
        avx2flag="-mavx2"
    else
        echo "Compiler does not support AVX2."
        HAVE_AVX2_INTRIN=0
    fi
fi
Length of output: 314
Script:
#!/bin/bash
# Let's first check if the check_avx2_intrinsics function exists and analyze its implementation
rg -A 10 "check_avx2_intrinsics" configure
Length of output: 844
Script:
#!/bin/bash
# Let's check where avx2flag is defined and how it's used
rg -B 5 -A 5 "avx2flag=" configure
Length of output: 443
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- arch/x86/Makefile.in (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- configure (1 hunks)
- functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- arch/x86/Makefile.in
- cmake/detect-intrinsics.cmake
🔇 Additional comments (2)
functable.c (1)
Line range hint 113-124: LGTM! Verify CPU feature support assumption.
The addition of the BMI2 check alongside AVX2 is aligned with the PR objectives. However, let's verify the assumption that all CPUs with AVX2 support BMI2.
Let's check CPU feature databases to verify this assumption:
configure (1)
111-111: Inclusion of BMI2 support in avx2flag
The addition of -mbmi2 to the avx2flag ensures that BMI2 instructions are enabled when compiling AVX2 code. This aligns with the PR objective to enhance AVX2 functions by incorporating BMI2 instructions for performance improvements.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1816      +/-   ##
===========================================
+ Coverage    32.18%   32.24%   +0.06%
===========================================
  Files           67       67
  Lines         5752     5753       +1
  Branches      1237     1237
===========================================
+ Hits          1851     1855       +4
- Misses        3644     3652       +8
+ Partials       257      246      -11
[Benchmark results not preserved: "Develop 28 Nov" and "PR #1816"]
Wow.. I had to retest that because it was such a big difference. Inflate: Perf test is 16% faster. Deflatebench is 21% faster. 😮
@KungFuJesus If this speedup is the real deal, I wonder what the AVX512 is in comparison now, it feels like the benefit must have been reduced by quite a bit.
When doing an initial improvement, I tend to compare against -march=native builds, which would have had this enabled. I don't believe it will be significantly undercut, but let me check.
So here's this PR, as is with inflate_fast_avx512 and all its chunksetting:
Here's with inflate_fast_avx512 and chunkmemset etc carved out (keeping the adler stuff in).
So the plot thickens. It's not a huge net loss, but it is marginally slower. It might be hard to attribute that specifically to any one thing, though.
Could you do one without this PR as well?
I can but it's going to take the avx512 path which will have BMI2 enabled as well. I can also use this as the baseline for #1821
Oh, right. I was hoping one with
Doing that then precludes the faster checksums with it, though. The numbers for the version with inflate_fast_avx512 and chunkmemset_safe_avx512 commented out from the function table are what I think you want.
But that is with this PR included, or did I misunderstand? What I want to see is before and after for the AVX2 codepath, similar to what I did, to see whether the speedup is there for others as well and how big.
Ah fair enough, let me try this on a U series Haswell CPU since that's what I'm using at the moment.
[Benchmark results not preserved: "Develop" and "PR1816"]
Nice, a 12.5% speedup. I think that confirms that it is a really good optimization. Makes me wonder how performance compares wrt MinGW vs MSVC compilers, since MSVC does not actually let us choose whether to enable BMI2 or not, so is it used since it always exists along with AVX2 or is it never used because that might not be true in the future? I suspect the latter, meaning that MinGW might result in significantly faster inflate.
Dead2 left a comment
Looks awesome to me ;)
@Dead2
@KungFuJesus Could you add this comment, it does seem like a really good idea.
Yeah, sorry had to step away for a bit. Will add a comment now.
bfec601 to 28ab687
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
configure (1)
Line range hint 486-505: Add BMI2 intrinsics verification to the AVX2 check.
While the function verifies AVX2 support, it should also verify BMI2 intrinsics since we're enabling both features together.
  check_avx2_intrinsics() {
      # Check whether compiler supports AVX2 intrinsics
      cat > $test.c << EOF
  #include <immintrin.h>
+ /* Test both AVX2 and BMI2 intrinsics */
  __m256i f(__m256i x) {
      const __m256i y = _mm256_set1_epi16(1);
-     return _mm256_subs_epu16(x, y);
+     /* AVX2 test */
+     __m256i avx2_result = _mm256_subs_epu16(x, y);
+     /* BMI2 test */
+     unsigned long long bmi2_result = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32);
+     return avx2_result;
  }
  int main(void) { return 0; }
  EOF
      if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
-         echo "Checking for AVX2 intrinsics ... Yes." | tee -a configure.log
+         echo "Checking for AVX2+BMI2 intrinsics ... Yes." | tee -a configure.log
          HAVE_AVX2_INTRIN=1
      else
-         echo "Checking for AVX2 intrinsics ... No." | tee -a configure.log
+         echo "Checking for AVX2+BMI2 intrinsics ... No." | tee -a configure.log
          HAVE_AVX2_INTRIN=0
      fi
  }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- arch/x86/Makefile.in (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- configure (1 hunks)
- functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- arch/x86/Makefile.in
- functable.c
- cmake/detect-intrinsics.cmake
🔇 Additional comments (2)
configure (2)
Line range hint 1124-1196: LGTM! Architecture-specific handling is correct.
The x86/amd64 architecture section properly handles the modified avx2flag, ensuring that BMI2 instructions will be enabled when building AVX2-optimized object files.
111-111: LGTM! Verify compiler support for combined flags.
The addition of -mbmi2 to avx2flag aligns with the PR objective. Since all CPUs supporting AVX2 also support BMI2, this change is logical.
✅ Verification successful
Let me try a different approach to verify the compiler flag support by examining the codebase for existing usage and checks.
Let me check the specific compiler test implementation for AVX2 to verify how the flags are handled.
Let me check the x86 feature detection code more thoroughly and examine how the AVX2 code is used.
Based on the comprehensive code analysis, I can now provide the final response:
The combined AVX2 and BMI2 flags are safe and correctly implemented
The codebase shows that:
- The x86 feature detection properly checks for both AVX2 and BMI2 support via CPUID (x86_features.c)
- The AVX2 implementations don't rely on BMI2 instructions directly, making the -mbmi2 flag a safe addition
- The configure script already has proper compiler feature detection for AVX2 through check_avx2_intrinsics()
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that major compilers support the combined -mavx2 -mbmi2 flags
# Test with both GCC and Clang if available
echo "Testing GCC support for combined flags..."
echo "int main() { return 0; }" | gcc -c -x c - -mavx2 -mbmi2 2>&1 || echo "GCC: Failed"
echo "Testing Clang support for combined flags..."
echo "int main() { return 0; }" | clang -c -x c - -mavx2 -mbmi2 2>&1 || echo "Clang: Failed"
Length of output: 579
Script:
#!/bin/bash
# Search for existing compiler flag checks and BMI2/AVX2 usage
echo "Checking configure script for compiler flag tests..."
rg "gcc.*-mavx2|clang.*-mavx2|check.*avx2|check.*bmi2" configure -A 2
echo -e "\nChecking for BMI2/AVX2 related code..."
fd -e c -e h . --exec rg -l "BMI2|AVX2|__AVX2__|__BMI2__"
echo -e "\nChecking for existing compiler tests..."
rg "compiler.*test|test.*compiler" configure
Length of output: 1103
Script:
#!/bin/bash
# Check the AVX2 intrinsics test implementation
echo "Examining AVX2 compiler test implementation..."
rg -A 15 "check_avx2_intrinsics\(\)" configure
echo -e "\nChecking how compiler flags are tested..."
cat configure | grep -A 5 "try_compile()"
echo -e "\nChecking x86 feature detection code..."
rg "bmi2.*avx2|avx2.*bmi2" arch/x86/x86_features.c
Length of output: 973
Script:
#!/bin/bash
# Check x86 feature detection implementation
echo "Examining x86 feature detection..."
cat arch/x86/x86_features.c
echo -e "\nChecking AVX2 implementation files..."
cat arch/x86/chunkset_avx2.c
echo -e "\nChecking build system's feature detection..."
rg "HAVE_BMI2|HAVE_AVX2" configure
Length of output: 8992
While these are technically different instructions, no such CPU exists that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to eliminate several flag stalls by having flagless versions of shifts, and allows us to not clobber and move around GPRs so much in scalar code. There's usually a sizeable benefit for enabling it. Since we're building with BMI2 for AVX2 functions, let's also just make sure the CPU claims to support it (just to cover our bases).
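To make the "flagless shifts" point concrete, here is a small C sketch of the kind of scalar bit-buffer code that appears in an inflate-style loop. The demo_bitbuf type and demo_take_bits function are hypothetical and not taken from this codebase. The shift count is only known at run time, so without BMI2 an x86 compiler has to use SHR with the count in CL, which also writes the flags register; when built with -mbmi2, a compiler may instead choose SHRX (count in any register, flags untouched) and possibly BZHI for the mask. Whether it actually does so depends on the compiler and version, so treat the comments as expectations to verify in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit buffer in the style of an inflate inner loop. */
typedef struct {
    uint64_t hold; /* pending bits, least significant first */
    unsigned bits; /* number of valid bits in hold */
} demo_bitbuf;

/* Return the low n bits and drop them from the buffer.
 * n must be smaller than 64 and no larger than b->bits. */
static uint64_t demo_take_bits(demo_bitbuf *b, unsigned n) {
    assert(n < 64 && n <= b->bits);
    /* Variable-width mask: a candidate for BZHI under -mbmi2. */
    uint64_t value = b->hold & ((UINT64_C(1) << n) - 1);
    /* Variable-count shift: a candidate for SHRX under -mbmi2. */
    b->hold >>= n;
    b->bits -= n;
    return value;
}
```

The source is identical with and without -mbmi2; only the instruction selection changes, which is why the PR can enable the flag without touching the AVX2 implementations themselves.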
28ab687 to 160b3a1
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
configure (1)
Line range hint 1012-1029: Enhance AVX2 intrinsics test to verify BMI2 support
The current test only verifies AVX2 intrinsics. Since we're now including BMI2 support, consider enhancing the test to also verify BMI2 intrinsics.
  cat > $test.c << EOF
  #include <immintrin.h>
  __m256i f(__m256i x) {
      const __m256i y = _mm256_set1_epi16(1);
-     return _mm256_subs_epu16(x, y);
+     // Test both AVX2 and BMI2 intrinsics
+     unsigned long long z = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32); // BMI2 intrinsic
+     __m256i result = _mm256_subs_epu16(x, y); // AVX2 intrinsic
+     return result;
  }
  int main(void) { return 0; }
  EOF
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- arch/x86/Makefile.in (1 hunks)
- cmake/detect-intrinsics.cmake (1 hunks)
- configure (1 hunks)
- functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- arch/x86/Makefile.in
- cmake/detect-intrinsics.cmake
- functable.c
🔇 Additional comments (2)
configure (2)
111-111: LGTM: Addition of BMI2 support for AVX2 functions
The addition of -mbmi2 to avx2flag correctly enables BMI2 instructions for AVX2 functions, which aligns with the PR objectives. This is safe since all CPUs that support AVX2 also support BMI2.
Line range hint 1104-1113: Consider adding runtime BMI2 feature detection
Since we're enabling BMI2 instructions alongside AVX2, consider adding explicit runtime BMI2 feature detection in the x86_features implementation to ensure proper CPU support.
✅ Verification successful
Let me gather more context about the BMI2 feature detection implementation.
Let me check the header file and functable.c to get a complete picture of BMI2 usage.
BMI2 feature detection is already properly implemented
The codebase already has proper BMI2 feature detection and runtime handling:
- BMI2 detection is implemented in x86_features.c using CPUID instruction (bit 8 of EBX register)
- BMI2 is checked alongside AVX2 in functable.c for optimal performance
- The code correctly enables BMI2-dependent optimizations only when both AVX2 and BMI2 are available at runtime
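The runtime gating described in these points can be sketched in C as follows. Only the condition on has_avx2 and has_bmi2 mirrors the check quoted from functable.c in this review; the struct, the demo_ function names, and the single-pointer "table" are hypothetical simplifications, and the real function table initialization (including its atomics) is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature flags, filled in by CPU detection elsewhere. */
struct demo_cpu_features {
    int has_sse2;
    int has_avx2;
    int has_bmi2;
};

typedef const char *(*demo_impl_fn)(void);

/* Stand-ins for a generic, an SSE2, and an AVX2 implementation. */
static const char *demo_impl_c(void)    { return "c"; }
static const char *demo_impl_sse2(void) { return "sse2"; }
static const char *demo_impl_avx2(void) { return "avx2"; }

/* Pick the best implementation the CPU reports support for. */
static demo_impl_fn demo_select_impl(const struct demo_cpu_features *cf) {
    assert(cf != NULL);
    demo_impl_fn fn = demo_impl_c;
    if (cf->has_sse2)
        fn = demo_impl_sse2;
    /* The AVX2 objects are built with -mbmi2, so the compiler may have
     * emitted BMI2 instructions in them: require both feature bits. */
    if (cf->has_avx2 && cf->has_bmi2)
        fn = demo_impl_avx2;
    return fn;
}
```

With this shape, a CPU that somehow reported AVX2 without BMI2 would fall back to the SSE2 path instead of hitting an illegal instruction.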
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if BMI2 feature detection is implemented
# Test: Search for BMI2 feature detection in x86_features implementation
rg -A 5 "bmi2|BMI2" arch/x86/x86_features.c
Length of output: 427
Script:
#!/bin/bash
# Check the complete x86_features implementation and structure
cat arch/x86/x86_features.c
# Also check if there are any BMI2-related runtime checks in the codebase
rg -l "bmi2|BMI2" --type c --type cpp
Length of output: 3700
Script:
#!/bin/bash
# Check the header file for BMI2-related structures and usage
cat arch/x86/x86_features.h
# Check functable.c for BMI2 runtime usage
cat functable.c
Length of output: 11643