这是indexloc提供的服务,不要输入任何密码
Skip to content

Conversation

@KungFuJesus
Copy link
Contributor

@KungFuJesus KungFuJesus commented Nov 28, 2024

While these are technically different instructions, no such CPU exists that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to eliminate several flag stalls by having flagless versions of shifts, and allows us to not clobber and move around GPRs so much in scalar code. There's usually a sizeable benefit for enabling it. Since we're building with BMI2 for AVX2 functions, let's also just make sure the CPU claims to support it (just to cover our bases).

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features
    • Enhanced AVX2 optimizations by adding support for the BMI2 instruction set, improving performance on compatible CPUs.
    • Introduced new functions for checking compiler capabilities related to various intrinsics and optimizations, expanding support for modern architectures.
  • Bug Fixes
    • Improved error handling and logging mechanisms in the configuration process for better troubleshooting.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 28, 2024

Walkthrough

The pull request updates the handling of AVX2 optimizations in the codebase. It modifies the AVX2FLAG variable in both the Makefile and a CMake script to include the -mbmi2 flag alongside the existing -mavx2 flag. The logic in functable.c is adjusted to ensure that both AVX2 and BMI2 support are required for assigning optimized functions. Additionally, the configure script is enhanced to support advanced compiler flags and introduces new functions for checking various intrinsics.

Changes

File Change Summary
arch/x86/Makefile.in Updated AVX2FLAG from -mavx2 to -mavx2 -mbmi2.
cmake/detect-intrinsics.cmake Updated AVX2FLAG from set(AVX2FLAG "-mavx2") to set(AVX2FLAG "-mavx2 -mbmi2").
functable.c Modified init_functable to require both cf.x86.has_avx2 and cf.x86.has_bmi2 for AVX2 optimizations.
configure Updated avx2flag to -mavx2 -mbmi2 and added multiple functions for checking various intrinsics and compiler flags.

Possibly related PRs

  • Improve chunkset_avx2 performance #1778: This PR modifies the AVX2FLAG variable in the arch/x86/chunkset_avx2.c file to enhance performance for AVX2 operations, which is directly related to the changes made in the main PR that also updates the AVX2FLAG variable in the arch/x86/Makefile.in file.

Suggested labels

Build Env

Suggested reviewers

  • Dead2

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
functable.c (1)

113-113: Document the BMI2 dependency in a comment.

Consider adding a comment explaining why BMI2 is required alongside AVX2 to help future maintainers understand the dependency.

+    /* BMI2 is required as AVX2 implementations use flagless versions of shifts */
     if (cf.x86.has_avx2 && cf.x86.has_bmi2) {
cmake/detect-intrinsics.cmake (2)

Line range hint 165-173: Consider enhancing the AVX2 intrinsics test to verify BMI2 support.

The current test only verifies AVX2 support through _mm256_subs_epu16. Since we're adding BMI2 flags, it would be good to also verify BMI2 support in the same test.

Here's a suggested enhancement:

 check_c_source_compiles(
     "#include <immintrin.h>
     __m256i f(__m256i x) {
         const __m256i y = _mm256_set1_epi16(1);
-        return _mm256_subs_epu16(x, y);
+        // Test both AVX2 and BMI2 intrinsics
+        const __m256i z = _mm256_subs_epu16(x, y);
+        unsigned long long a = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32);  // BMI2 intrinsic
+        return _mm256_set1_epi64x(a);
     }
     int main(void) { return 0; }"
     HAVE_AVX2_INTRIN
 )

153-162: Consider adding a comment explaining the BMI2 requirement.

It would be helpful to document why BMI2 is enabled by default with AVX2, explaining that all CPUs supporting AVX2 also support BMI2, and how this improves performance.

Add a comment like this before the flag settings:

 if(NOT NATIVEFLAG)
     if(CMAKE_C_COMPILER_ID MATCHES "Intel")
         if(CMAKE_HOST_UNIX OR APPLE)
+            # BMI2 is enabled by default with AVX2 as all AVX2-capable CPUs support BMI2.
+            # This allows for better performance through flagless shifts and reduced GPR operations.
             set(AVX2FLAG "-mavx2 -mbmi2")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0ed5ac8 and 4ca7e64.

📒 Files selected for processing (3)
  • arch/x86/Makefile.in (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • functable.c (1 hunks)
🔇 Additional comments (4)
arch/x86/Makefile.in (2)

13-13: LGTM! Verify runtime CPU feature detection.

The addition of -mbmi2 to AVX2FLAG aligns with the existing pattern where AVX512-related flags already include BMI2. This change is technically sound since all CPUs with AVX2 support also have BMI2 capabilities.

Let's verify the runtime CPU feature detection code:

✅ Verification successful

BMI2 detection is properly handled alongside AVX2

The CPU feature detection code in x86_features.c correctly handles BMI2 detection. The code:

  1. Detects BMI2 support via CPUID (bit 8 in EBX from leaf 7)
  2. Stores it in has_bmi2 field
  3. Already requires BMI2 for AVX512 optimizations (has_avx512_common checks for has_bmi2)

Since the runtime detection is properly implemented and BMI2 is already a requirement for higher SIMD features, adding -mbmi2 to AVX2FLAG is safe.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if BMI2 detection is properly handled alongside AVX2

# Search for CPU feature detection code
rg -A 5 "BMI2.*AVX2|AVX2.*BMI2" --type cpp

Length of output: 44


Script:

#!/bin/bash
# Let's search for CPU feature detection code more broadly
# First, let's check for x86 CPU feature detection files
fd -e c -e h "cpu.*feat|x86" --type f

# Then let's look for BMI2 detection
rg "BMI2" --type cpp -A 5

# And AVX2 detection
rg "AVX2" --type cpp -A 5

Length of output: 6574


Script:

#!/bin/bash
# Let's examine the CPU feature detection code in the identified files
cat arch/x86/x86_features.h
cat arch/x86/x86_features.c

Length of output: 4364


13-13: Verify impact on AVX2-optimized files.

This change affects the compilation of all AVX2-optimized files. Let's verify the implementation details of affected files to ensure they can benefit from BMI2 instructions:

functable.c (1)

Line range hint 113-125: LGTM! Verify BMI2 usage in AVX2 functions.

The addition of the BMI2 check is correct and aligns with the PR objectives. The implementation maintains thread-safety through atomic operations and proper memory barriers.

Let's verify that the AVX2 functions actually use BMI2 instructions:

cmake/detect-intrinsics.cmake (1)

154-154: LGTM! The addition of BMI2 flag aligns with the PR objectives.

The changes correctly add the -mbmi2 flag alongside -mavx2 for both Intel and GNU/Clang compilers. This is consistent with the fact that all CPUs supporting AVX2 also support BMI2, and will enable better performance through flagless shifts.

Let's verify the consistency of BMI2 flag usage across other intrinsics:

Also applies to: 159-159

✅ Verification successful

BMI2 flag usage is consistent across AVX2 and AVX512 intrinsics

The verification confirms that -mbmi2 flag is consistently used with both AVX2 and AVX512 intrinsics in the codebase. This is correct because:

  • AVX2 functions are being enabled with BMI2 instructions as per PR objectives
  • AVX512 implementations already include BMI2 flag, showing established pattern
  • The consistency in flag usage across different instruction set extensions maintains proper build configuration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify BMI2 flag usage in other AVX-related intrinsics checks
# Test: Check if BMI2 is consistently used with AVX512 flags
rg "set\(AVX512.*FLAG.*mbmi2" cmake/detect-intrinsics.cmake

Length of output: 423

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (6)
functable.c (1)

113-113: Add a comment explaining the AVX2/BMI2 relationship.

Consider adding a code comment explaining why BMI2 is checked alongside AVX2, referencing the architectural relationship between these instruction sets.

+    // BMI2 is checked alongside AVX2 as all CPUs supporting AVX2 also support BMI2.
+    // This enables optimized implementations using flagless versions of shifts.
     if (cf.x86.has_avx2 && cf.x86.has_bmi2) {
configure (5)

Line range hint 223-246: Enhance check_avx2_intrinsics() for better compatibility

To prevent scenarios where AVX2 optimizations are unintentionally disabled due to lack of BMI2 support, modify the check_avx2_intrinsics() function to check for AVX2 support without BMI2 as a fallback.

Apply this diff to improve the function:

 check_avx2_intrinsics() {
     # Check whether compiler supports AVX2 intrinsics
     cat > $test.c << EOF
 #include <immintrin.h>
 __m256i f(__m256i x) {
     const __m256i y = _mm256_set1_epi16(1);
     return _mm256_subs_epu16(x, y);
 }
 int main(void) { return 0; }
 EOF
-    if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
+    if try ${CC} ${CFLAGS} -mavx2 -mbmi2 $test.c; then
         echo "Checking for AVX2 intrinsics ... Yes." | tee -a configure.log
         HAVE_AVX2_INTRIN=1
+        avx2flag="-mavx2 -mbmi2"
+    elif try ${CC} ${CFLAGS} -mavx2 $test.c; then
+        echo "Checking for AVX2 intrinsics without BMI2 ... Yes." | tee -a configure.log
+        HAVE_AVX2_INTRIN=1
+        avx2flag="-mavx2"
     else
         echo "Checking for AVX2 intrinsics ... No." | tee -a configure.log
         HAVE_AVX2_INTRIN=0
     fi
 }

Line range hint 487-494: Indentation inconsistency in check_mtune_cascadelake_compiler_flag()

The indentation within the if and else blocks of the check_mtune_cascadelake_compiler_flag() function is inconsistent, which affects the readability of the code.

Consider adjusting the indentation for clarity:

 check_mtune_cascadelake_compiler_flag() {
     # Check whether -mtune=cascadelake works correctly
     cat > $test.c << EOF
 int main() { return 0; }
 EOF
     if try $CC -c $CFLAGS -mtune=cascadelake $test.c; then
         MTUNE_CASCADELAKE_AVAILABLE=1
         echo "Check whether -mtune=cascadelake works ... Yes." | tee -a configure.log
     else
         echo "Check whether -mtune=cascadelake works ... No." | tee -a configure.log
         MTUNE_CASCADELAKE_AVAILABLE=0
         check_mtune_skylake_avx512_compiler_flag
     fi
 }

Line range hint 1670-1670: Typo in variable assignment

There's a possible typo in the variable name within the check_mtune_skylake_avx512_compiler_flag() function. The variable MTUNE_SKYLAKE_AVX512_AVAILABLE might have been misspelled or inconsistently used elsewhere.

Please verify the variable name for consistency to ensure that the script functions correctly.


Line range hint 1479-1489: Avoid hardcoding architecture-specific values

In the architecture detection section for x86/amd64, the script hardcodes values for ARCHDIR, CFLAGS, and object files. This can reduce maintainability and adaptability for future architecture expansions.

Consider abstracting architecture-specific settings into variables or a configuration file to improve modularity and ease future updates.


Separate AVX2 and BMI2 flags for better compatibility

The verification confirms that avx2flag is defined as -mavx2 -mbmi2 and used directly in check_avx2_intrinsics(). The current implementation bundles AVX2 and BMI2 support checks together, which could indeed prevent AVX2 optimizations from being used on systems where the compiler supports AVX2 but not BMI2.

  • In configure, modify line where avx2flag is defined to only include -mavx2
  • Create a separate bmi2flag variable if BMI2 instructions are needed elsewhere
  • Update check_avx2_intrinsics() to test BMI2 support separately if required
🔗 Analysis chain

Line range hint 223-233: Ensure robust detection in check_avx2_intrinsics()

The check_avx2_intrinsics() function checks for compiler support of AVX2 intrinsics using both -mavx2 and -mbmi2 flags. While most CPUs supporting AVX2 also support BMI2, and compilers generally handle both flags together, there might be edge cases where a compiler supports AVX2 but not BMI2.

To ensure broader compatibility, consider modifying the function to separately check for AVX2 support without BMI2, falling back if necessary. This ensures that AVX2 optimizations are not skipped due to lack of BMI2 support.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify compiler support for AVX2 with and without BMI2.

cat > $test.c << EOF
#include <immintrin.h>
__m256i f(__m256i x) {
    const __m256i y = _mm256_set1_epi16(1);
    return _mm256_subs_epu16(x, y);
}
int main(void) { return 0; }
EOF

# Test AVX2 with BMI2
if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
    echo "Compiler supports AVX2 with BMI2."
    HAVE_AVX2_INTRIN=1
else
    # Test AVX2 without BMI2
    if try ${CC} ${CFLAGS} -mavx2 $test.c; then
        echo "Compiler supports AVX2 without BMI2."
        HAVE_AVX2_INTRIN=1
        avx2flag="-mavx2"
    else
        echo "Compiler does not support AVX2."
        HAVE_AVX2_INTRIN=0
    fi
fi

Length of output: 314


Script:

#!/bin/bash
# Let's first check if the check_avx2_intrinsics function exists and analyze its implementation
rg -A 10 "check_avx2_intrinsics" configure

Length of output: 844


Script:

#!/bin/bash
# Let's check where avx2flag is defined and how it's used
rg -B 5 -A 5 "avx2flag=" configure

Length of output: 443

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4ca7e64 and bfec601.

📒 Files selected for processing (4)
  • arch/x86/Makefile.in (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • arch/x86/Makefile.in
  • cmake/detect-intrinsics.cmake
🔇 Additional comments (2)
functable.c (1)

Line range hint 113-124: LGTM! Verify CPU feature support assumption.

The addition of the BMI2 check alongside AVX2 is aligned with the PR objectives. However, let's verify the assumption that all CPUs with AVX2 support BMI2.

Let's check CPU feature databases to verify this assumption:

configure (1)

111-111: Inclusion of BMI2 support in avx2flag

The addition of -mbmi2 to the avx2flag ensures that BMI2 instructions are enabled when compiling AVX2 code. This aligns with the PR objective to enhance AVX2 functions by incorporating BMI2 instructions for performance improvements.

@codecov
Copy link

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 32.24%. Comparing base (0ed5ac8) to head (160b3a1).
Report is 7 commits behind head on develop.

Files with missing lines Patch % Lines
functable.c 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1816      +/-   ##
===========================================
+ Coverage    32.18%   32.24%   +0.06%     
===========================================
  Files           67       67              
  Lines         5752     5753       +1     
  Branches      1237     1237              
===========================================
+ Hits          1851     1855       +4     
- Misses        3644     3652       +8     
+ Partials       257      246      -11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nmoinvaz nmoinvaz added optimization Architecture Architecture specific labels Nov 28, 2024
@Dead2
Copy link
Member

Dead2 commented Nov 29, 2024

Develop 28 Nov

          6,155.37 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               144      page-faults:u                    #   23.394 /sec                        ( +-  0.34% )
    21,365,062,998      cycles:u                         #    3.471 GHz                         ( +-  0.01% )  (62.48%)
    35,582,552,199      instructions:u                   #    1.67  insn per cycle              ( +-  0.02% )  (74.99%)
     3,754,368,604      branches:u                       #  609.934 M/sec                       ( +-  0.02% )  (75.01%)
       174,548,734      branch-misses:u                  #    4.65% of all branches             ( +-  0.02% )  (75.01%)
     6,678,724,547      L1-dcache-loads:u                #    1.085 G/sec                       ( +-  0.00% )  (75.02%)
       233,855,267      L1-dcache-load-misses:u          #    3.50% of all L1-dcache accesses   ( +-  0.08% )  (75.01%)
        17,771,941      LLC-loads:u                      #    2.887 M/sec                       ( +-  0.75% )  (49.98%)
                64      LLC-load-misses:u                #    0.00% of all LL-cache accesses    ( +- 43.98% )  (49.97%)

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.106/0.110/0.111/0.001        0.036/0.039/0.042/0.002        8,526,745
 2     43.871%      0.198/0.202/0.204/0.001        0.034/0.038/0.040/0.001        6,903,702
 3     42.390%      0.247/0.250/0.253/0.002        0.032/0.038/0.040/0.002        6,670,664
 4     41.644%      0.272/0.278/0.281/0.002        0.034/0.037/0.038/0.001        6,553,205
 5     41.215%      0.303/0.311/0.315/0.003        0.034/0.037/0.039/0.001        6,485,659
 6     41.032%      0.354/0.360/0.363/0.003        0.035/0.036/0.037/0.001        6,456,912
 7     40.778%      0.494/0.501/0.506/0.003        0.030/0.035/0.037/0.002        6,416,941
 8     40.704%      0.609/0.616/0.621/0.003        0.030/0.035/0.037/0.002        6,405,249
 9     40.409%      0.904/0.911/0.917/0.004        0.030/0.035/0.037/0.002        6,358,951

 avg1  42.914%                        0.393                          0.037
 tot                                106.149                          9.901       60,778,028

   text    data     bss     dec     hex filename
 138830    1312       8  140150   22376 libz-ng.so.2

PR #1816

          5,154.35 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   27.744 /sec                        ( +-  0.28% )
    17,778,636,206      cycles:u                         #    3.449 GHz                         ( +-  0.01% )  (62.48%)
    32,149,287,502      instructions:u                   #    1.81  insn per cycle              ( +-  0.01% )  (75.00%)
     3,756,493,111      branches:u                       #  728.801 M/sec                       ( +-  0.01% )  (75.01%)
       173,624,532      branch-misses:u                  #    4.62% of all branches             ( +-  0.02% )  (75.01%)
     6,365,551,872      L1-dcache-loads:u                #    1.235 G/sec                       ( +-  0.00% )  (75.01%)
       231,548,653      L1-dcache-load-misses:u          #    3.64% of all L1-dcache accesses   ( +-  0.03% )  (75.01%)
        17,957,764      LLC-loads:u                      #    3.484 M/sec                       ( +-  1.01% )  (49.98%)
                39      LLC-load-misses:u                #    0.00% of all LL-cache accesses    ( +- 48.35% )  (49.98%)

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.107/0.109/0.111/0.001        0.028/0.030/0.032/0.001        8,526,745
 2     43.871%      0.196/0.201/0.203/0.002        0.028/0.031/0.033/0.001        6,903,702
 3     42.390%      0.247/0.251/0.253/0.002        0.025/0.029/0.031/0.002        6,670,664
 4     41.644%      0.274/0.279/0.282/0.002        0.023/0.029/0.031/0.002        6,553,205
 5     41.215%      0.306/0.311/0.315/0.003        0.024/0.028/0.032/0.002        6,485,659
 6     41.032%      0.352/0.359/0.363/0.003        0.025/0.029/0.030/0.002        6,456,912
 7     40.778%      0.496/0.501/0.505/0.003        0.024/0.028/0.030/0.002        6,416,941
 8     40.704%      0.610/0.617/0.621/0.003        0.023/0.028/0.031/0.002        6,405,249
 9     40.409%      0.906/0.910/0.912/0.002        0.026/0.028/0.030/0.001        6,358,951

 avg1  42.914%                        0.393                          0.029
 tot                                106.131                          7.810       60,778,028
 
   text    data     bss     dec     hex filename
 138702    1312       8  140022   222f6 libz-ng.so.2

# Retest with different unrelated code added to make sure we didn't just hit a rare golden cacheline alignment
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     54.185%      0.105/0.109/0.111/0.002        0.029/0.031/0.033/0.001        8,526,745
 2     43.871%      0.198/0.202/0.204/0.002        0.028/0.031/0.033/0.001        6,903,702
 3     42.390%      0.246/0.250/0.253/0.002        0.026/0.030/0.032/0.002        6,670,664
 4     41.644%      0.276/0.279/0.282/0.002        0.025/0.029/0.031/0.002        6,553,205
 5     41.215%      0.301/0.310/0.314/0.003        0.025/0.029/0.031/0.002        6,485,659
 6     41.032%      0.353/0.358/0.361/0.002        0.024/0.028/0.030/0.002        6,456,912
 7     40.778%      0.493/0.502/0.507/0.004        0.026/0.028/0.031/0.001        6,416,941
 8     40.704%      0.609/0.616/0.620/0.003        0.023/0.028/0.031/0.002        6,405,249
 9     40.409%      0.895/0.909/0.916/0.005        0.024/0.029/0.030/0.001        6,358,951

 avg1  42.914%                        0.393                          0.029
 tot                                106.081                          7.854       60,778,028

Wow.. I had to retest that because it was such a big difference.
@KungFuJesus Can you do a comparison test with AVX512 disabled also? I am still a little unsure whether this is actually real or not.
In the perf test we can see that the count of instructions executed is down by 9.65%, that is pretty major, so this indicates this might actually be a real speedup.

Inflate: Perf test is 16% faster. Deflatebench is 21% faster. 😮
(Perf and Deflatebench use different datasets)
Deflate: No difference.

@Dead2
Copy link
Member

Dead2 commented Nov 29, 2024

@KungFuJesus If this speedup is the real deal, I wonder what the AVX512 is in comparison now, it feels like the benefit must have been reduced by quite a bit.

@KungFuJesus
Copy link
Contributor Author

@KungFuJesus If this speedup is the real deal, I wonder what the AVX512 is in comparison now, it feels like the benefit must have been reduced by quite a bit.

When doing an initial improvement, I tend to compare against -march=native builds, which would have had this enabled. I don't believe it will be significantly undercut, but let me check.

@KungFuJesus
Copy link
Contributor Author

So here's this PR, as is with inflate_fast_avx512 and all its chunksetting:

OS: Linux 6.11.6-gentoo-x86_64 #3 SMP PREEMPT_DYNAMIC Mon Nov  4 10:41:27 EST 2024 x86_64
CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
Tool: /home/adam/scratch/zlib-ng/build_clean/minigzip Size: 154,848 B
Levels: 0-9       
Runs: 10         Trim worst: 2         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.014/0.017/0.021/0.003        0.016/0.020/0.023/0.003      211,973,953
 1     44.409%      0.843/0.852/0.864/0.007        0.325/0.332/0.339/0.004       94,127,497
 2     35.519%      1.423/1.430/1.434/0.004        0.318/0.327/0.333/0.005       75,286,322
 3     33.844%      1.780/1.795/1.805/0.008        0.307/0.313/0.318/0.004       71,735,218
 4     33.146%      2.091/2.097/2.102/0.004        0.295/0.302/0.309/0.005       70,255,223
 5     32.642%      2.410/2.423/2.454/0.014        0.290/0.302/0.314/0.007       69,187,419
 6     32.483%      2.929/2.942/2.961/0.012        0.292/0.297/0.303/0.004       68,850,776
 7     32.255%      4.092/4.129/4.175/0.032        0.297/0.303/0.310/0.005       68,366,759
 8     32.167%      6.545/6.579/6.633/0.031        0.294/0.302/0.308/0.004       68,180,762
 9     31.887%      7.177/7.198/7.239/0.023        0.282/0.288/0.293/0.004       67,586,442

 avg1  40.836%                        2.946                          0.279
 avg2  45.373%                        3.274                          0.309
 tot                                235.701                         22.281      865,550,371

Here's with inflate_fast_avx512 and chunkmemset etc carved out (keeping the adler stuff in).

OS: Linux 6.11.6-gentoo-x86_64 #3 SMP PREEMPT_DYNAMIC Mon Nov  4 10:41:27 EST 2024 x86_64
CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
Tool: /home/adam/scratch/zlib-ng/build_noavx512/minigzip Size: 150,488 B
Levels: 0-9       
Runs: 10         Trim worst: 2         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.012/0.016/0.021/0.003        0.017/0.020/0.023/0.002      211,973,953
 1     44.409%      0.840/0.852/0.860/0.006        0.314/0.323/0.330/0.007       94,127,497
 2     35.519%      1.432/1.440/1.445/0.004        0.314/0.323/0.331/0.005       75,286,322
 3     33.844%      1.784/1.793/1.803/0.007        0.295/0.305/0.309/0.004       71,735,218
 4     33.146%      2.098/2.109/2.128/0.010        0.290/0.296/0.303/0.005       70,255,223
 5     32.642%      2.408/2.421/2.434/0.009        0.286/0.296/0.301/0.005       69,187,419
 6     32.483%      2.918/2.929/2.935/0.007        0.285/0.296/0.301/0.005       68,850,776
 7     32.255%      4.102/4.129/4.202/0.032        0.285/0.298/0.307/0.007       68,366,759
 8     32.167%      6.548/6.602/6.668/0.034        0.285/0.294/0.299/0.005       68,180,762
 9     31.887%      7.168/7.190/7.212/0.016        0.281/0.285/0.288/0.002       67,586,442

 avg1  40.836%                        2.948                          0.274
 avg2  45.373%                        3.276                          0.304
 tot                                235.849                         21.881      865,550,371

So the plot thickens. It's not a huge net loss, but it is marginally slower. It might be hard to attribute that specifically to any one thing, though.

@Dead2
Copy link
Member

Dead2 commented Dec 1, 2024

So here's this PR, as is with inflate_fast_avx512 and all its chunksetting:

Could you do one without this PR as well?

@KungFuJesus
Copy link
Contributor Author

So here's this PR, as is with inflate_fast_avx512 and all its chunksetting:

Could you do one without this PR as well?

I can but it's going to take the avx512 path which will have BMI2 enabled as well. I can also use this as the baseline for #1821

OS: Linux 6.11.6-gentoo-x86_64 #3 SMP PREEMPT_DYNAMIC Mon Nov  4 10:41:27 EST 2024 x86_64
CPU: Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
Tool: /home/adam/scratch/zlib-ng/build_clean/minigzip Size: 154,848 B
Levels: 0-9       
Runs: 10         Trim worst: 2         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.017/0.019/0.022/0.002        0.015/0.018/0.021/0.002      211,973,953
 1     44.409%      0.848/0.858/0.863/0.006        0.332/0.336/0.342/0.004       94,127,497
 2     35.519%      1.427/1.441/1.447/0.008        0.324/0.329/0.337/0.004       75,286,322
 3     33.844%      1.774/1.786/1.792/0.006        0.304/0.312/0.318/0.005       71,735,218
 4     33.146%      2.089/2.099/2.116/0.008        0.295/0.309/0.319/0.008       70,255,223
 5     32.642%      2.407/2.411/2.417/0.004        0.293/0.302/0.307/0.004       69,187,419
 6     32.483%      2.918/2.927/2.935/0.005        0.294/0.302/0.309/0.006       68,850,776
 7     32.255%      4.094/4.115/4.131/0.013        0.297/0.304/0.310/0.005       68,366,759
 8     32.167%      6.582/6.598/6.627/0.018        0.291/0.305/0.310/0.006       68,180,762
 9     31.887%      7.177/7.192/7.218/0.014        0.278/0.287/0.297/0.008       67,586,442

 avg1  40.836%                        2.945                          0.280
 avg2  45.373%                        3.272                          0.312
 tot                                235.577                         22.438      865,550,371

@Dead2
Copy link
Member

Dead2 commented Dec 1, 2024

I can but it's going to take the avx512 path which will have BMI2 enabled as well. I can also use this as the baseline for #1821

Oh, right. I was hoping one with -DWITH_AVX512=OFF -DWITH_AVX512VNNI=OFF

@KungFuJesus
Copy link
Contributor Author

I can but it's going to take the avx512 path which will have BMI2 enabled as well. I can also use this as the baseline for #1821

Oh, right. I was hoping one with -DWITH_AVX512=OFF -DWITH_AVX512VNNI=OFF

Doing that then precludes the faster checksums with it, though. The numbers for the version with inflate_fast_avx512 and chunkmemset_safe_avx512 commented out from the function table are what I think you want.

@Dead2
Copy link
Member

Dead2 commented Dec 2, 2024

Oh, right. I was hoping one with -DWITH_AVX512=OFF -DWITH_AVX512VNNI=OFF

Doing that then precludes the faster checksums with it, though. The numbers for the version with inflate_fast_avx512 and chunkmemset_safe_avx512 commented out from the function table are what I think you want.

But that is with this PR included, or did I misunderstand?

What I want to see is before and after for the AVX2 codepath, similar to what I did, to see whether the speedup is there for others as well and how big.

@KungFuJesus
Copy link
Contributor Author

Reference i

Ah fair enough, let me try this on a U series Haswell CPU since that's what I'm using at the moment.

Develop

 Tool: /home/adam/zlib-ng/build-develop/minigzip Levels: 0-9       
 Runs: 10         Trim worst: 2         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.060/0.093/0.112/0.019        0.099/0.115/0.129/0.010      211,973,953
 1     44.409%      2.808/2.838/2.859/0.016        1.279/1.301/1.312/0.012       94,127,497
 2     35.519%      5.073/5.145/5.186/0.036        1.255/1.307/1.337/0.028       75,286,322
 3     33.844%      6.666/6.701/6.744/0.023        1.214/1.237/1.265/0.019       71,735,218
 4     33.146%      7.530/7.584/7.661/0.052        1.183/1.196/1.209/0.012       70,255,223
 5     32.642%      8.303/8.392/8.555/0.085        1.155/1.191/1.232/0.025       69,187,419
 6     32.483%      9.691/9.750/9.842/0.051        1.150/1.171/1.183/0.012       68,850,776
 7     32.255%   12.765/12.837/12.900/0.042        1.153/1.170/1.190/0.015       68,366,759
 8     32.167%   19.484/19.566/19.822/0.107        1.140/1.156/1.174/0.013       68,180,762
 9     31.887%   22.513/22.583/22.632/0.039        1.168/1.181/1.203/0.012       67,586,442

 avg1  40.836%                        9.549                          1.102
 avg2  45.373%                       10.610                          1.225
 tot                                763.916                         88.199      865,550,371

PR1816

 Tool: /home/adam/zlib-ng/build-1816/minigzip Levels: 0-9       
 Runs: 10         Trim worst: 2         

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 0    100.008%      0.070/0.097/0.117/0.016        0.090/0.101/0.120/0.012      211,973,953
 1     44.409%      2.806/2.906/3.091/0.088        1.119/1.140/1.165/0.018       94,127,497
 2     35.519%      5.108/5.232/5.564/0.139        1.108/1.143/1.182/0.024       75,286,322
 3     33.844%      6.684/6.802/7.089/0.140        1.057/1.091/1.118/0.019       71,735,218
 4     33.146%      7.506/7.607/7.702/0.067        1.019/1.042/1.064/0.014       70,255,223
 5     32.642%      8.305/8.502/9.036/0.236        0.994/1.032/1.054/0.022       69,187,419
 6     32.483%     9.628/9.882/10.274/0.241        1.006/1.043/1.092/0.024       68,850,776
 7     32.255%   12.741/12.945/13.256/0.175        0.977/1.013/1.037/0.019       68,366,759
 8     32.167%   19.496/19.832/20.398/0.344        0.992/1.024/1.047/0.018       68,180,762
 9     31.887%   22.440/22.695/23.209/0.310        0.985/1.013/1.042/0.018       67,586,442

 avg1  40.836%                        9.650                          0.964
 avg2  45.373%                       10.722                          1.071
 tot                                771.992                         77.135      865,550,371

@Dead2
Copy link
Member

Dead2 commented Dec 3, 2024

Nice, a 12.5% speedup.

I think that confirms that it is a really good optimization.

Makes me wonder how performance compares wrt MinGW vs MSVC compilers, since MSVC does not actually let us choose whether to enable BMI2 or not, so is it used since it always exists along with AVX2 or is it never used because that might not be true in the future? I suspect the latter, meaning that MinGW might result in significantly faster inflate.

Copy link
Member

@Dead2 Dead2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome to me ;)

@phprus
Copy link
Contributor

phprus commented Dec 3, 2024

@Dead2
As far as I know, /arch:AVX2 in MSVC enables support for BMI1/2.

@nmoinvaz
Copy link
Member

nmoinvaz commented Dec 3, 2024

Code rabbit makes a good suggestion:
image

@Dead2
Copy link
Member

Dead2 commented Dec 6, 2024

Code rabbit makes a good suggestion: image

@KungFuJesus Could you add this comment, it does seem like a really good idea.

@KungFuJesus
Copy link
Contributor Author

@KungFuJesus Could you add this comment, it does seem like a really good idea.

Yeah, sorry had to step away for a bit. Will add a comment now.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
configure (1)

Line range hint 486-505: Add BMI2 intrinsics verification to the AVX2 check.

While the function verifies AVX2 support, it should also verify BMI2 intrinsics since we're enabling both features together.

 check_avx2_intrinsics() {
     # Check whether compiler supports AVX2 intrinsics
     cat > $test.c << EOF
 #include <immintrin.h>
+/* Test both AVX2 and BMI2 intrinsics */
 __m256i f(__m256i x) {
     const __m256i y = _mm256_set1_epi16(1);
-    return _mm256_subs_epu16(x, y);
+    /* AVX2 test */
+    __m256i avx2_result = _mm256_subs_epu16(x, y);
+    /* BMI2 test */
+    unsigned long long bmi2_result = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32);
+    return avx2_result;
 }
 int main(void) { return 0; }
 EOF
     if try ${CC} ${CFLAGS} ${avx2flag} $test.c; then
-        echo "Checking for AVX2 intrinsics ... Yes." | tee -a configure.log
+        echo "Checking for AVX2+BMI2 intrinsics ... Yes." | tee -a configure.log
         HAVE_AVX2_INTRIN=1
     else
-        echo "Checking for AVX2 intrinsics ... No." | tee -a configure.log
+        echo "Checking for AVX2+BMI2 intrinsics ... No." | tee -a configure.log
         HAVE_AVX2_INTRIN=0
     fi
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bfec601 and 28ab687.

📒 Files selected for processing (4)
  • arch/x86/Makefile.in (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • arch/x86/Makefile.in
  • functable.c
  • cmake/detect-intrinsics.cmake
🔇 Additional comments (2)
configure (2)

Line range hint 1124-1196: LGTM! Architecture-specific handling is correct.

The x86/amd64 architecture section properly handles the modified avx2flag, ensuring that BMI2 instructions will be enabled when building AVX2-optimized object files.


111-111: LGTM! Verify compiler support for combined flags.

The addition of -mbmi2 to avx2flag aligns with the PR objective. Since all CPUs supporting AVX2 also support BMI2, this change is logical.

✅ Verification successful

Let me try a different approach to verify the compiler flag support by examining the codebase for existing usage and checks.


Let me check the specific compiler test implementation for AVX2 to verify how the flags are handled.


Let me check the x86 feature detection code more thoroughly and examine how the AVX2 code is used.


Based on the comprehensive code analysis, I can now provide the final response:

The combined AVX2 and BMI2 flags are safe and correctly implemented

The codebase shows that:

  1. The x86 feature detection properly checks for both AVX2 and BMI2 support via CPUID (x86_features.c)
  2. The AVX2 implementations don't rely on BMI2 instructions directly, making the -mbmi2 flag a safe addition
  3. The configure script already has proper compiler feature detection for AVX2 through check_avx2_intrinsics()
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that major compilers support the combined -mavx2 -mbmi2 flags
# Test with both GCC and Clang if available

echo "Testing GCC support for combined flags..."
echo "int main() { return 0; }" | gcc -c -x c - -mavx2 -mbmi2 2>&1 || echo "GCC: Failed"

echo "Testing Clang support for combined flags..."
echo "int main() { return 0; }" | clang -c -x c - -mavx2 -mbmi2 2>&1 || echo "Clang: Failed"

Length of output: 579


Script:

#!/bin/bash
# Search for existing compiler flag checks and BMI2/AVX2 usage
echo "Checking configure script for compiler flag tests..."
rg "gcc.*-mavx2|clang.*-mavx2|check.*avx2|check.*bmi2" configure -A 2

echo -e "\nChecking for BMI2/AVX2 related code..."
fd -e c -e h . --exec rg -l "BMI2|AVX2|__AVX2__|__BMI2__"

echo -e "\nChecking for existing compiler tests..."
rg "compiler.*test|test.*compiler" configure

Length of output: 1103


Script:

#!/bin/bash
# Check the AVX2 intrinsics test implementation
echo "Examining AVX2 compiler test implementation..."
rg -A 15 "check_avx2_intrinsics\(\)" configure

echo -e "\nChecking how compiler flags are tested..."
cat configure | grep -A 5 "try_compile()"

echo -e "\nChecking x86 feature detection code..."
rg "bmi2.*avx2|avx2.*bmi2" arch/x86/x86_features.c

Length of output: 973


Script:

#!/bin/bash
# Check x86 feature detection implementation
echo "Examining x86 feature detection..."
cat arch/x86/x86_features.c

echo -e "\nChecking AVX2 implementation files..."
cat arch/x86/chunkset_avx2.c

echo -e "\nChecking build system's feature detection..."
rg "HAVE_BMI2|HAVE_AVX2" configure

Length of output: 8992

While these are technically different instructions, no such CPU exists
that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to
eliminate several flag stalls by having flagless versions of shifts, and
allows us to not clobber and move around GPRs so much in scalar code.
There's usually a sizeable benefit for enabling it. Since we're building
with BMI2 for AVX2 functions, let's also just make sure the CPU claims
to support it (just to cover our bases).
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
configure (1)

Line range hint 1012-1029: Enhance AVX2 intrinsics test to verify BMI2 support

The current test only verifies AVX2 intrinsics. Since we're now including BMI2 support, consider enhancing the test to also verify BMI2 intrinsics.

 cat > $test.c << EOF
 #include <immintrin.h>
 __m256i f(__m256i x) {
     const __m256i y = _mm256_set1_epi16(1);
-    return _mm256_subs_epu16(x, y);
+    // Test both AVX2 and BMI2 intrinsics
+    unsigned long long z = _bzhi_u64(0xFFFFFFFFFFFFFFFF, 32);  // BMI2 intrinsic
+    __m256i result = _mm256_subs_epu16(x, y);                  // AVX2 intrinsic
+    return result;
 }
 int main(void) { return 0; }
 EOF
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 28ab687 and 160b3a1.

📒 Files selected for processing (4)
  • arch/x86/Makefile.in (1 hunks)
  • cmake/detect-intrinsics.cmake (1 hunks)
  • configure (1 hunks)
  • functable.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • arch/x86/Makefile.in
  • cmake/detect-intrinsics.cmake
  • functable.c
🔇 Additional comments (2)
configure (2)

111-111: LGTM: Addition of BMI2 support for AVX2 functions

The addition of -mbmi2 to avx2flag correctly enables BMI2 instructions for AVX2 functions, which aligns with the PR objectives. This is safe since all CPUs that support AVX2 also support BMI2.


Line range hint 1104-1113: Consider adding runtime BMI2 feature detection

Since we're enabling BMI2 instructions alongside AVX2, consider adding explicit runtime BMI2 feature detection in the x86_features implementation to ensure proper CPU support.

✅ Verification successful

Let me gather more context about the BMI2 feature detection implementation.


Let me check the header file and functable.c to get a complete picture of BMI2 usage.


BMI2 feature detection is already properly implemented

The codebase already has proper BMI2 feature detection and runtime handling:

  • BMI2 detection is implemented in x86_features.c using CPUID instruction (bit 8 of EBX register)
  • BMI2 is checked alongside AVX2 in functable.c for optimal performance
  • The code correctly enables BMI2-dependent optimizations only when both AVX2 and BMI2 are available at runtime
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if BMI2 feature detection is implemented

# Test: Search for BMI2 feature detection in x86_features implementation
rg -A 5 "bmi2|BMI2" arch/x86/x86_features.c

Length of output: 427


Script:

#!/bin/bash
# Check the complete x86_features implementation and structure
cat arch/x86/x86_features.c

# Also check if there are any BMI2-related runtime checks in the codebase
rg -l "bmi2|BMI2" --type c --type cpp

Length of output: 3700


Script:

#!/bin/bash
# Check the header file for BMI2-related structures and usage
cat arch/x86/x86_features.h

# Check functable.c for BMI2 runtime usage
cat functable.c

Length of output: 11643

@Dead2 Dead2 merged commit 7020cb3 into zlib-ng:develop Dec 7, 2024
143 of 149 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Architecture Architecture specific optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants