
Conversation

@KungFuJesus
Contributor

@KungFuJesus KungFuJesus commented Sep 12, 2024

Make chunkset_avx2 half chunk aware

This gives us appreciable gains on a number of fronts. The first is that
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
smaller distance, so CHUNKCOPY gets its teeth back here for smaller
sizes, without having to do another dispatch to a function.

We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.

Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross-lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make sense.

Summary by CodeRabbit

  • New Features

    • Introduced new inline functions for handling smaller data chunks using AVX2 instructions.
    • Added support for halfchunk_t type to optimize operations on 128-bit chunks.
    • Implemented a new function HALFCHUNKCOPY for efficient data copying of half-sized chunks.
  • Bug Fixes

    • Adjusted memory operation handling in the inflate_fast function for improved safety.
  • Refactor

    • Removed the CHUNK_SIZE macro across multiple files to streamline chunk size references.
    • Updated CHUNKMEMSET function to a static inline function for better linkage control.

@KungFuJesus force-pushed the develop branch 2 times, most recently from 6b5bf9a to ebc0ef6 on September 12, 2024 at 22:10
@codecov

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 46.98795% with 44 lines in your changes missing coverage. Please review.

Project coverage is 33.03%. Comparing base (18af700) to head (a90e06e).
Report is 5 commits behind head on develop.

Current head a90e06e differs from pull request most recent head f7a2f9a

Please upload reports for the commit f7a2f9a to get more accurate results.

Files with missing lines Patch % Lines
chunkset_tpl.h 33.96% 32 Missing and 3 partials ⚠️
arch/x86/chunkset_avx2.c 70.00% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1778      +/-   ##
===========================================
- Coverage    33.26%   33.03%   -0.23%     
===========================================
  Files           66       66              
  Lines         5481     5555      +74     
  Branches      1222     1227       +5     
===========================================
+ Hits          1823     1835      +12     
- Misses        3399     3460      +61     
- Partials       259      260       +1     


@Dead2
Member

Dead2 commented Sep 13, 2024

I tested current develop (with split chunkcopy_safe) with and without this PR on an i9-9900K.

Develop:

   text    data     bss     dec     hex filename
 135966    1312       8  137286   21846 libz-ng.so.2
          5,733.90 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.939 /sec                        ( +-  0.30% )
    19,908,943,906      cycles:u                         #    3.472 GHz                         ( +-  0.01% )
    35,633,452,503      instructions:u                   #    1.79  insn per cycle              ( +-  0.00% )
     3,749,361,374      branches:u                       #  653.894 M/sec                       ( +-  0.00% )
       181,427,372      branch-misses:u                  #    4.84% of all branches             ( +-  0.04% )

          5.734291 +- 0.000584 seconds time elapsed  ( +-  0.01% )

This PR:

 Performance counter stats for 'build/minigzip -c -d -k ../cov-analysis-linux64-2023.6.2.tar.gz' (10 runs):

          5,396.14 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.02% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               144      page-faults:u                    #   26.686 /sec                        ( +-  0.33% )
    18,752,218,475      cycles:u                         #    3.475 GHz                         ( +-  0.02% )
    35,633,297,945      instructions:u                   #    1.90  insn per cycle              ( +-  0.00% )
     3,748,055,372      branches:u                       #  694.580 M/sec                       ( +-  0.00% )
       175,467,241      branch-misses:u                  #    4.68% of all branches             ( +-  0.04% )

          5.396606 +- 0.000948 seconds time elapsed  ( +-  0.02% )

Pretty nice improvement to instructions/cycle.
Inflate is a whopping 5.8% faster on average.

Benchmark of this PR without #1776: (Correct build used)

 Performance counter stats for 'build/minigzip -c -d -k ../cov-analysis-linux64-2023.6.2.tar.gz' (10 runs):

          5,888.93 msec task-clock:u                     #    1.000 CPUs utilized               ( +-  0.01% )
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               143      page-faults:u                    #   24.283 /sec                        ( +-  0.21% )
    20,457,192,322      cycles:u                         #    3.474 GHz                         ( +-  0.01% )
    35,730,996,426      instructions:u                   #    1.75  insn per cycle              ( +-  0.00% )
     3,757,474,157      branches:u                       #  638.058 M/sec                       ( +-  0.00% )
       181,224,621      branch-misses:u                  #    4.82% of all branches             ( +-  0.01% )

          5.889362 +- 0.000682 seconds time elapsed  ( +-  0.01% )

Develop:

# x86-64 split_chunkcopy
 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.203/1.221/1.229/0.008        0.415/0.432/0.438/0.006       94,127,485
 2     35.519%      2.253/2.272/2.283/0.009        0.427/0.435/0.439/0.004       75,286,310
 3     33.844%      2.830/2.847/2.857/0.010        0.404/0.410/0.415/0.003       71,735,206
 4     33.146%      3.147/3.185/3.203/0.016        0.391/0.399/0.405/0.004       70,255,211
 5     32.642%      3.508/3.551/3.583/0.024        0.383/0.394/0.402/0.006       69,187,407
 6     32.483%      4.025/4.105/4.132/0.025        0.384/0.394/0.398/0.004       68,850,764
 7     32.255%      5.981/6.024/6.058/0.023        0.387/0.394/0.397/0.003       68,366,747
 8     32.167%      8.673/8.730/8.764/0.028        0.387/0.392/0.396/0.003       68,180,750
 9     31.887%   12.039/12.099/12.165/0.036        0.377/0.389/0.395/0.005       67,586,430

 avg1  34.261%                        4.892                          0.404
 tot                                660.487                         54.599      653,576,310

   text    data     bss     dec     hex filename
 135966    1312       8  137286   21846 libz-ng.so.2

This PR:

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.203/1.219/1.231/0.008        0.421/0.429/0.433/0.003       94,127,485
 2     35.519%      2.261/2.281/2.289/0.007        0.420/0.433/0.439/0.005       75,286,310
 3     33.844%      2.828/2.850/2.865/0.012        0.402/0.407/0.412/0.003       71,735,206
 4     33.146%      3.138/3.189/3.217/0.025        0.394/0.403/0.408/0.004       70,255,211
 5     32.642%      3.493/3.549/3.578/0.025        0.384/0.393/0.398/0.005       69,187,407
 6     32.483%      4.051/4.109/4.140/0.027        0.388/0.395/0.400/0.004       68,850,764
 7     32.255%      5.957/6.024/6.059/0.030        0.383/0.392/0.397/0.004       68,366,747
 8     32.167%      8.717/8.740/8.761/0.013        0.380/0.390/0.394/0.004       68,180,750
 9     31.887%   12.054/12.108/12.153/0.030        0.383/0.388/0.391/0.003       67,586,430

 avg1  34.261%                        4.897                          0.403
 tot                                661.036                         54.447      653,576,310

   text    data     bss     dec     hex filename
 136350    1312       8  137670   219c6 libz-ng.so.2

For some reason the improvements were much smaller here, only 0.28% faster inflate on average.

The difference in speeds seems to be data-dependent, not cacheline-aliasing or anything like that.
(I verified this using different builds to ensure things get compiled and placed slightly differently, as I often do when in doubt)


Benchmark of this PR without #1776: (Correct build used)

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      1.209/1.223/1.231/0.007        0.442/0.459/0.465/0.007       94,127,485
 2     35.519%      2.262/2.279/2.289/0.008        0.447/0.460/0.464/0.005       75,286,310
 3     33.844%      2.838/2.862/2.876/0.011        0.428/0.436/0.440/0.003       71,735,206
 4     33.146%      3.156/3.206/3.235/0.023        0.412/0.422/0.427/0.004       70,255,211
 5     32.642%      3.523/3.560/3.581/0.018        0.408/0.418/0.423/0.004       69,187,407
 6     32.483%      4.071/4.119/4.145/0.023        0.404/0.416/0.422/0.005       68,850,764
 7     32.255%      5.951/6.028/6.085/0.045        0.405/0.414/0.420/0.005       68,366,747
 8     32.167%      8.692/8.747/8.779/0.031        0.404/0.414/0.419/0.005       68,180,750
 9     31.887%   12.071/12.130/12.171/0.033        0.410/0.419/0.422/0.004       67,586,430

 avg1  34.261%                        4.906                          0.429
 tot                                662.290                         57.871      653,576,310

   text    data     bss     dec     hex filename
 136006    1312       8  137326   2186e libz-ng.so.2

@KungFuJesus
Contributor Author

The difference in speeds seems to be data-dependent, not cacheline-aliasing or anything like that. (I verified this using different builds to ensure things get compiled and placed slightly differently, as I often do when in doubt)

That's more or less what I'd expect. Certain data streams might hit the chunk_memcpy path harder than others, it just depends on what the decompression stream presents. If you had a sequence of something like 17 or so bytes behind the destination pointer that repeated, I think you end up dispatching to plain old chunkcopy more in chunk_memcpy now that there are smaller unit sizes. The inlining alone in my testing bought us like 3%, even without smaller chunks to work with. Being able to share the last len byte copy loop shrank some of the code as well.

I have plans to make the GET_CHUNK_MAG code better for VBMI platforms as well, but I'm still waiting on hardware for that.

@nmoinvaz
Member

Can this strategy also be used for AVX512?

@KungFuJesus
Contributor Author

Can this strategy also be used for AVX512?

It certainly could (with a quarter-size chunk being represented with __m128i's). However, on Intel ISAs, it's been repeatedly proven that the clock penalty tends to hurt more than the throughput gain from the extra register width (particularly for something like a copy operation). For Zen 5 that may not be the case, but I don't have any Zen 5 hardware to test with.

With Icelake and VBMI2, that chunk magazine logic gets quite a bit simpler since they finally allowed cross lane single byte permutes. I plan to go back and add that in for 256 bit vector lengths once I can actually test it. That might also be a good time to revisit 64 byte chunks to see if that juice is worth the squeeze.


/* Only AVX2 */
#ifdef HAVE_HALF_CHUNK
if (len <= sizeof(halfchunk_t)) {
Member


Why wouldn't we just call chunkmemset_ssse3? Is it because it is inlined?

Contributor Author


Correct. And we're able to shrink the code size by merging common elements

Contributor Author


The inlining bought us a lot here by itself and by allowing for a half chunk length we get the best of both worlds.

@Dead2
Member

Dead2 commented Sep 14, 2024

I did another benchmark with this PR applied to develop before #1776, added it to the post above, so it is easier to compare them.
Edit: Will have to re-do it, just noticed it was done with the AVX512 code compiled in, the others were without. It might make a difference.

@KungFuJesus
Contributor Author

I did another benchmark with this PR applied to develop before #1776, added it to the post above, so it is easier to compare them. Edit: Will have to re-do it, just noticed it was done with the AVX512 code compiled in, the others were without. It might make a difference.

Which version of gcc are you using? I feel like chunk_copy_safe has been inlining for me since at least gcc 11, but I'll try a myriad of compilation options and compilers to be sure. Earliest I can go back is probably 11 or 12.

@Dead2
Member

Dead2 commented Sep 14, 2024

I did another benchmark with this PR applied to develop before #1776, added it to the post above, so it is easier to compare them. Edit: Will have to re-do it, just noticed it was done with the AVX512 code compiled in, the others were without. It might make a difference.

Which version of gcc are you using? I feel like chunk_copy_safe has been inlining for me since at least gcc 11, but I'll try a myriad of compilation options and compilers to be sure. Earliest I can go back is probably 11 or 12.

I don't think it has ever been inlining for me with the default CMake config. I remember disabling -Winline a couple of years ago because it was so noisy, then re-discovered it this summer when we were having a hard look at optional warnings.

I get exactly the same behavior with GCC 12.2 from Debian on Aarch64 as I do with GCC 13.2.1 from Fedora on x86-64.
I also got the same behavior with GCC 11.x before I upgraded Fedora release recently.

You don't compile using custom flags or something? -O3 for example? That always seems to slow things down for me though.

PS: Updated the benchmarks above, so now they are the correct ones. Hardly any difference from the incorrect build though.

@KungFuJesus
Contributor Author

I do build with O3 and that might be the difference. Let me mess around with more stock flags and see if I see the same inlining behavior.

@mtl1979
Collaborator

mtl1979 commented Sep 15, 2024

I do build with O3 and that might be the difference. Let me mess around with more stock flags and see if I see the same inlining behavior.

As far as I know and remember, -O3 enables some quite aggressive optimisations which, for example, (re-)vectorize the code, and that might well result in the code getting short enough to be inlined. This pretty much works for anything newer than SSE3 with modern compilers, as optimizer support for SSE3 and earlier iterations was dropped a long time ago.

@KungFuJesus
Contributor Author

I do build with O3 and that might be the difference. Let me mess around with more stock flags and see if I see the same inlining behavior.

As far as I know and remember, -O3 enables some quite aggressive optimisations which, for example, (re-)vectorize the code, and that might well result in the code getting short enough to be inlined. This pretty much works for anything newer than SSE3 with modern compilers, as optimizer support for SSE3 and earlier iterations was dropped a long time ago.

I don't think that's an issue as the minimum subset of extensions on all x86-64 CPUs contains sse3.

@KungFuJesus
Contributor Author

I did another benchmark with this PR applied to develop before #1776, added it to the post above, so it is easier to compare them. Edit: Will have to re-do it, just noticed it was done with the AVX512 code compiled in, the others were without. It might make a difference.

Which version of gcc are you using? I feel like chunk_copy_safe has been inlining for me since at least gcc 11, but I'll try a myriad of compilation options and compilers to be sure. Earliest I can go back is probably 11 or 12.

I don't think it has ever been inlining for me with the default CMake config. I remember disabling -Winline a couple of years ago because it was so noisy, then re-discovered it this summer when we were having a hard look at optional warnings.

I get exactly the same behavior with GCC 12.2 from Debian on Aarch64 as I do with GCC 13.2.1 from Fedora on x86-64. I also got the same behavior with GCC 11.x before I upgraded Fedora release recently.

You don't compile using custom flags or something? -O3 for example? That always seems to slow things down for me though.

PS: Updated the benchmarks above, so now they are the correct ones. Hardly any difference from the incorrect build though.

Yep, so I confirmed it's -O3 that allows this to be inlined (with that commit reverted) and in my testing, it's a significant improvement.

@Dead2
Member

Dead2 commented Sep 15, 2024

Yep, so I confirmed it's -O3 that allows this to be inlined (with that commit reverted) and in my testing, it's a significant improvement.

Problem is, -O3 is not the default for most distros, so we cannot rely on that being the case.

@KungFuJesus
Contributor Author

Yep, so I confirmed it's -O3 that allows this to be inlined (with that commit reverted) and in my testing, it's a significant improvement.

Problem is, -O3 is not the default for most distros, so we cannot rely on that being the case.

Right, but apart from manually forcing the same decisions the compiler makes at -O3, I don't know a better way to get that performance back. Without reverting 6b8efe7, it's a net loss for me.

@Dead2
Member

Dead2 commented Sep 15, 2024

Right but apart from manually forcing the same decisions the compiler is doing with O3 I don't know a better way to get that performance back. Without reverting 6b8efe7, it's a net loss for me.

I am open to reverting that change again, but only if we can find a good way that also improves performance on an -O2 compile.
Worst case, we could use CMake options to select between the two methods, but IMHO that is not exactly a great solution.

@KungFuJesus
Contributor Author

Here's the comparison of O2 and O3, with 6b8efe7 reverted:

Comparing decode_real (from ./benchmark_zlib_apps_O2) to decode_real (from ./benchmark_zlib_apps_O3)
Benchmark                                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_pvalue                 0.4735          0.4903      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_mean                  +0.0029         +0.0028          9707          9736          9712          9739
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_median                +0.0001         -0.0000          9729          9730          9733          9733
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_stddev                -0.1475         -0.1467           110            94           110            94
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_cv                    -0.1500         -0.1491             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_mean                  -0.0917         -0.0918         58686         53304         58714         53324
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_median                -0.0905         -0.0907         58636         53333         58665         53346
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_stddev                +2.2214         +2.2434           134           431           134           434
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_cv                    +2.5467         +2.5713             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_mean                  -0.0885         -0.0886         56373         51382         56399         51400
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_median                -0.0893         -0.0895         56351         51320         56377         51333
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_stddev                +1.9267         +1.9454            95           278            95           279
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_cv                    +2.2110         +2.2318             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_mean                  -0.0888         -0.0888         51371         46811         51392         46828
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_median                -0.0873         -0.0873         51332         46849         51357         46871
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_stddev                +1.5714         +1.6452           122           314           118           313
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_cv                    +1.8219         +1.9030             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_mean                  -0.0856         -0.0856         47698         43615         47714         43630
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_median                -0.0842         -0.0844         47670         43654         47690         43665
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_stddev                +1.9497         +1.9987            93           274            92           275
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_cv                    +2.2258         +2.2795             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_mean                  -0.0817         -0.0817         46265         42483         46280         42498
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_median                -0.0816         -0.0816         46218         42446         46228         42456
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_stddev                +1.6292         +1.7346           115           302           111           305
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_cv                    +1.8632         +1.9779             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_mean                  -0.0799         -0.0798         43589         40108         43603         40123
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_median                -0.0782         -0.0782         43555         40148         43576         40168
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_stddev                +0.7956         +0.8423           117           210           114           209
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_cv                    +0.9514         +1.0021             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_mean                  -0.0753         -0.0752         42907         39678         42921         39693
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_median                -0.0737         -0.0735         42873         39715         42875         39726
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_stddev                +1.2346         +1.2600            92           207            92           208
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_cv                    +1.4165         +1.4438             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_mean                  -0.0720         -0.0720         41604         38611         41620         38623
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_median                -0.0717         -0.0718         41594         38609         41608         38619
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_stddev                +1.0171         +1.0073            90           181            90           180
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_cv                    +1.1735         +1.1631             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_mean                  -0.0772         -0.0772         41126         37951         41140         37964
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_median                -0.0760         -0.0762         41058         37936         41077         37945
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_stddev                +0.6152         +0.6330           125           202           124           203
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_cv                    +0.7503         +0.7696             0             0             0             0
OVERALL_GEOMEAN                                                                                       -0.0741         -0.0742             0             0             0             0

Here's O2 with and without the revert compared:

Comparing decode_real (from ./benchmark_zlib_apps_O2) to decode_real (from ./benchmark_zlib_apps_splitchunkcopy)
Benchmark                                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_pvalue                 0.0337          0.0337      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_mean                  +0.0046         +0.0046          9717          9761          9721          9765
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_median                +0.0078         +0.0078          9685          9760          9687          9762
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_stddev                -0.1403         -0.1412            76            65            76            65
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/0_cv                    -0.1442         -0.1451             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_pvalue                 0.0084          0.0071      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_mean                  -0.0019         -0.0018         58619         58509         58639         58531
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_median                -0.0010         -0.0012         58579         58520         58604         58536
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_stddev                -0.1814         -0.1930           124           101           124           100
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/1_cv                    -0.1799         -0.1915             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_mean                  -0.0029         -0.0029         56388         56223         56410         56244
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_median                -0.0027         -0.0025         56368         56215         56378         56235
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_stddev                -0.3585         -0.3613           111            71           115            73
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/2_cv                    -0.3566         -0.3594             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_mean                  +0.0029         +0.0029         51431         51579         51450         51598
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_median                +0.0046         +0.0042         51341         51578         51363         51582
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_stddev                -0.7115         -0.7044           309            89           305            90
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/3_cv                    -0.7123         -0.7053             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_mean                  -0.0054         -0.0055         47771         47511         47791         47527
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_median                -0.0048         -0.0049         47741         47510         47763         47529
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_stddev                -0.5606         -0.5436           116            51           116            53
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/4_cv                    -0.5582         -0.5410             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_mean                  -0.0107         -0.0107         46309         45812         46326         45829
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_median                -0.0108         -0.0107         46296         45798         46304         45810
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_stddev                +0.0024         +0.0119            72            72            70            71
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/5_cv                    +0.0133         +0.0229             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_mean                  -0.0177         -0.0177         43591         42820         43607         42837
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_median                -0.0177         -0.0177         43559         42789         43578         42809
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_stddev                +0.1162         +0.0773            87            97            88            95
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/6_cv                    +0.1363         +0.0967             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_mean                  -0.0191         -0.0191         42958         42139         42973         42153
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_median                -0.0191         -0.0188         42929         42109         42931         42124
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_stddev                +0.0696         +0.0752            96           103            96           103
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/7_cv                    +0.0904         +0.0961             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_mean                  -0.0198         -0.0198         41611         40788         41626         40802
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_median                -0.0184         -0.0187         41565         40798         41585         40807
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_stddev                +0.0662         +0.0655           104           111           103           110
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/8_cv                    +0.0877         +0.0871             0             0             0             0
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_mean                  -0.0196         -0.0196         41083         40278         41097         40293
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_median                -0.0183         -0.0185         41035         40285         41054         40295
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_stddev                -0.0955         -0.0654           111           101           109           102
png_[decode_real vs. decode_real]istic/png_[decode_real vs. decode_real]istic/9_cv                    -0.0774         -0.0467             0             0             0             0
OVERALL_GEOMEAN                                                                                       -0.0090         -0.0090             0             0             0             0

@KungFuJesus
Contributor Author

KungFuJesus commented Sep 15, 2024

Couldn't we just alter our cmake configuration to not override O3 with O2 for release? I don't know how many distros override these things to begin with but I imagine most don't try to override the default build arguments. For me, the difference between O2 and O3 is ~7% and the work to split chunk copy so that it partially inlines hurts the O3 performance by another 2ish%.
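For reference, one way such a CMake-side override could look is sketched below (a hypothetical snippet, not taken from zlib-ng's CMakeLists.txt; note that MSVC has no /O3, its highest optimization level is /O2):

```cmake
# Hypothetical: force -O3 into the Release C flags. CMake's built-in
# Release flags already use "-O3 -DNDEBUG" for GCC/Clang, but a project
# or distro override can downgrade that; this puts -O3 back.
if(NOT MSVC)
  string(REGEX REPLACE "-O[012s]" "-O3" CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE}")
  if(NOT CMAKE_C_FLAGS_RELEASE MATCHES "-O3")
    set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3")
  endif()
endif()
```

Whether this helps in practice depends on the distro: a packager who sets CFLAGS explicitly or uses a None build type bypasses the Release flags entirely.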

@nmoinvaz
Member

I would be in favor of forcing CMake/configure to use -O3 or /O3 for this project.

@KungFuJesus KungFuJesus force-pushed the develop branch 3 times, most recently from 4ddaef4 to d2a9451 Compare September 21, 2024 14:50
@ptr1337

ptr1337 commented Sep 27, 2024

I would be in favor of forcing CMake/configure to use -O3 or /O3 for this project.

Arch Linux actually tends to use None as the build type, so perhaps the maintainer there should be contacted.
See here:
https://gitlab.archlinux.org/archlinux/packaging/packages/zlib-ng/-/blob/main/PKGBUILD?ref_type=heads#L31

@coderabbitai
Contributor

coderabbitai bot commented Oct 3, 2024

Walkthrough

The pull request includes changes across multiple files focused on memory chunk handling in ARM and x86 architectures. The CHUNK_SIZE macro has been removed from several files, which may impact references to chunk sizes. New functionality has been added in chunkset_avx2.c, introducing a halfchunk_t type and several inline functions for handling smaller data chunks. Additionally, modifications to the chunkset_tpl.h file enhance chunk memory operations by incorporating support for half-sized chunks and refining existing logic.

Changes

File Change Summary
arch/arm/chunkset_neon.c Removed #define CHUNK_SIZE 16.
arch/x86/chunkset_avx2.c Added halfchunk_t type and multiple inline functions for 128-bit chunk operations. Modified GET_CHUNK_MAG logic. Added chunkmemset_16.
arch/x86/chunkset_sse2.c Removed #define CHUNK_SIZE 16.
arch/x86/chunkset_ssse3.c Removed #define CHUNK_SIZE 16.
chunkset_tpl.h Removed chunkmemset_ssse3 declaration. Changed CHUNKMEMSET to static inline. Added HALFCHUNKCOPY function. Refined CHUNKMEMSET logic.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ChunkHandler as Chunk Handler
    participant Memory as Memory Operations

    User->>ChunkHandler: Request chunk operation
    ChunkHandler->>Memory: Perform chunk memory set
    Memory-->>ChunkHandler: Return success
    ChunkHandler-->>User: Operation completed

Possibly related PRs

  • Compute the "safe" distance properly #1801: The changes in this PR involve memory operations related to the CHUNKMEMSET_SAFE function, which may be indirectly affected by the removal of the CHUNK_SIZE definition in the main PR, as both deal with chunk memory operations.

Suggested reviewers

  • Dead2


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (1)
chunkset_tpl.h (1)

197-210: Consider loop unrolling for performance optimization

In the loop starting at line 197, unrolling the loop could enhance performance by reducing the loop overhead for large values of len. This can be particularly beneficial in performance-critical code paths.

Consider unrolling the loop:

while (len >= 4 * adv_amount) {
    storechunk(out, &chunk_load);
    storechunk(out + adv_amount, &chunk_load);
    storechunk(out + 2 * adv_amount, &chunk_load);
    storechunk(out + 3 * adv_amount, &chunk_load);
    out += 4 * adv_amount;
    len -= 4 * adv_amount;
}

Adjust the unrolling factor based on benchmarking results to find the optimal balance between code size and performance gain.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8d10c30 and d6150b2.

📒 Files selected for processing (6)
  • arch/arm/chunkset_neon.c (0 hunks)
  • arch/x86/chunkset_avx2.c (5 hunks)
  • arch/x86/chunkset_sse2.c (0 hunks)
  • arch/x86/chunkset_ssse3.c (0 hunks)
  • chunkset_tpl.h (3 hunks)
  • inffast_tpl.h (1 hunks)
💤 Files with no reviewable changes (3)
  • arch/arm/chunkset_neon.c
  • arch/x86/chunkset_sse2.c
  • arch/x86/chunkset_ssse3.c
🔇 Additional comments (10)
inffast_tpl.h (1)

263-263: Potential performance optimization - please clarify safety implications

The modification of the CHUNKMEMSET_SAFE function call by removing the + 1 from (unsigned)((safe - out) + 1) could potentially lead to performance improvements, especially for smaller chunk sizes. This aligns with the PR objectives.

However, I have some concerns:

  1. Safety: Can you please clarify the safety implications of this change? We need to ensure that removing the + 1 doesn't introduce any risk of buffer overflows or other memory-related issues.

  2. Testing: Have you conducted thorough testing with various data sets and chunk sizes to verify that this change consistently improves performance without introducing any regressions or edge cases?

  3. Documentation: If this change is intentional and proven safe, please update the function's documentation to reflect this modification and explain the reasoning behind it.

To help verify the impact of this change, could you run the following benchmarks and provide the results?

This script will help us understand the performance impact across different input sizes, which is crucial given the PR's focus on improving performance for smaller chunk sizes.

arch/x86/chunkset_avx2.c (8)

12-12: Consistent Typedef for halfchunk_t

Defining halfchunk_t as __m128i aligns with the existing chunk_t typedef and maintains consistency in the codebase.


55-58: Initialization of half_rem_vals Array

The half_rem_vals array is correctly initialized to support half-chunk operations, ensuring proper indexing and functionality for distances less than the chunk size.


77-80: Correct Implementation of chunkmemset_16 Function

The chunkmemset_16 function efficiently broadcasts a 128-bit value into a 256-bit AVX2 register using _mm256_broadcastsi128_si256, enabling effective handling of 16-byte chunks.


Line range hint 112-127: Optimized Logic in GET_CHUNK_MAG Function

The modifications in the GET_CHUNK_MAG function streamline the handling of distances less than 16. The use of intrinsics and permutation vectors is appropriate and enhances performance for smaller distances.


128-145: Implementation of halfchunkmemset_* Functions

The halfchunkmemset_2, halfchunkmemset_4, and halfchunkmemset_8 functions correctly use memcpy and appropriate intrinsics (_mm_set1_epi16, _mm_set1_epi32, _mm_set1_epi64x) to safely broadcast values into 128-bit vectors. This ensures efficient handling of small data sizes with potential unaligned input.


146-152: Accurate Loading and Storing with loadhalfchunk and storehalfchunk

The loadhalfchunk and storehalfchunk functions correctly utilize unaligned load and store intrinsics (_mm_loadu_si128 and _mm_storeu_si128), ensuring safe operations regardless of data alignment.


160-171: Proper Implementation of GET_HALFCHUNK_MAG Function

The GET_HALFCHUNK_MAG function is correctly implemented to handle half-chunk magnitudes for distances less than the chunk size. The use of _mm_shuffle_epi8 with appropriate permutation vectors efficiently rearranges bytes for the required output.


154-158: ⚠️ Potential issue

Unused Static Function halfchunk2whole

The function halfchunk2whole appears to be defined but not used within this file or elsewhere in the codebase. This might indicate that it's unnecessary or was intended for future use.

Run the following script to check for usages of halfchunk2whole in the codebase:

chunkset_tpl.h (1)

90-107: HALFCHUNKCOPY function implementation looks correct

The HALFCHUNKCOPY function efficiently handles copying data in half-chunk sizes. The logic is sound, and the implementation aligns with the intended functionality for systems where HAVE_HALF_CHUNK is defined.

Member

@nmoinvaz nmoinvaz left a comment

Code looks clean to me.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
arch/x86/chunkset_avx2.c (3)

55-58: LGTM: New static array for half chunk remainder values

The addition of half_rem_vals is a good optimization for the new half chunk operations. It should help reduce computation time in the GET_HALFCHUNK_MAG function.

Consider adding a brief comment explaining the purpose and derivation of these values for better maintainability.


Line range hint 112-127: LGTM: Simplified logic in GET_CHUNK_MAG function

The modifications to GET_CHUNK_MAG align well with the PR objective of simplifying the "half chunk mag" process. The removal of the specific dist == 16 case suggests a more streamlined approach, which should contribute to better performance.

Consider adding a brief comment explaining why the dist == 16 case is no longer needed, to help future maintainers understand the rationale behind this change.


146-158: LGTM: Essential half chunk operations added

The new functions loadhalfchunk, storehalfchunk, and halfchunk2whole provide crucial operations for working with half chunks. These additions are consistent with the PR's goal of improving performance through half chunk awareness.

In the comment for halfchunk2whole, consider clarifying the specific memory sanitizer issues this addresses, if known. This could help future maintainers understand the rationale behind the zero extension.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d6150b2 and efb588d.

📒 Files selected for processing (5)
  • arch/arm/chunkset_neon.c (0 hunks)
  • arch/x86/chunkset_avx2.c (5 hunks)
  • arch/x86/chunkset_sse2.c (0 hunks)
  • arch/x86/chunkset_ssse3.c (0 hunks)
  • chunkset_tpl.h (3 hunks)
💤 Files with no reviewable changes (3)
  • arch/arm/chunkset_neon.c
  • arch/x86/chunkset_sse2.c
  • arch/x86/chunkset_ssse3.c
🧰 Additional context used
📓 Learnings (1)
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: chunkset_tpl.h:164-165
Timestamp: 2024-10-04T03:17:24.773Z
Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
🔇 Additional comments (11)
arch/x86/chunkset_avx2.c (5)

9-19: LGTM: New typedefs and macros for half chunk operations

The addition of halfchunk_t typedef and related macros aligns well with the PR objective of making the function half chunk aware. These changes provide a solid foundation for the performance improvements described in the PR summary.


77-80: LGTM: Efficient 16-byte chunk memset function

The new chunkmemset_16 function is a valuable addition that aligns with the PR's performance improvement goals. It efficiently sets 16-byte chunks using AVX2 instructions, which should contribute to better performance for smaller chunk sizes as mentioned in the PR summary.


128-145: LGTM: New halfchunkmemset functions for various sizes

The addition of halfchunkmemset_2, halfchunkmemset_4, and halfchunkmemset_8 functions provides efficient ways to set half chunks of different sizes. These functions align well with the PR's objective of improving performance for smaller chunk sizes and making the implementation half chunk aware.


160-171: LGTM: Efficient GET_HALFCHUNK_MAG function added

The new GET_HALFCHUNK_MAG function is a crucial addition for efficient half chunk handling. It leverages the half_rem_vals array and simplifies the process of obtaining the magnitude for half chunks. This aligns perfectly with the PR's goal of improving performance for smaller chunk sizes.

To ensure the performance improvements, consider running the following benchmark:

This script compiles two versions of the code (with and without the new half chunk optimizations) and runs benchmarks for various small chunk sizes. The results should show improved performance for the new version, especially for smaller chunk sizes.


Line range hint 1-172: Comprehensive benchmarking recommended

The changes implemented in this PR appear to be well-designed and should improve performance for smaller chunk sizes, as intended. However, to ensure that these optimizations don't negatively impact performance for larger chunk sizes, I recommend running a comprehensive benchmark across a wide range of chunk sizes.

Consider running the following benchmark script:

This script will provide a comprehensive view of the performance impact across various chunk sizes, helping to ensure that the optimizations provide benefits without unexpected regressions.

chunkset_tpl.h (6)

90-108: LGTM: New HALFCHUNKCOPY function improves performance for smaller chunk sizes

The introduction of the HALFCHUNKCOPY function aligns well with the PR objective of making the code half chunk aware. This function efficiently handles copying of smaller data chunks, which should lead to performance improvements, especially for smaller chunk sizes.


Line range hint 112-129: LGTM: Improved function signature and initial checks

The changes to the CHUNKMEMSET function enhance its performance and flexibility:

  1. Changing to static inline allows for better compiler optimization.
  2. New variables chunk_load, chunk_mod, and adv_amount prepare for the half chunk aware implementation.
  3. Using CHUNKCOPY for dist > sizeof(chunk_t) simplifies the code and potentially improves performance.

These modifications align well with the PR objectives of improving performance and making the function half chunk aware.

🧰 Tools
🪛 ast-grep

[warning] 122-122: Avoid 'memset()' function, it does not consider buffer boundaries and can lead to buffer overflows. Use 'memset_s()' instead.
Context: memset(out, *from, len);
Note: [CWE-14]: Compiler Removal of Code to Clear Buffers [OWASP A04:2021] Insecure Design [REFERENCES]
- https://cwe.mitre.org/data/definitions/14.html
- https://owasp.org/Top10/A02_2021-Cryptographic_Failures/


136-165: LGTM: Efficient half chunk handling improves performance for smaller sizes

The new half chunk handling code significantly enhances the function's performance for smaller chunk sizes:

  1. Specific handling for common dist values (2, 4, 8, 16) optimizes frequent cases.
  2. The use of GET_HALFCHUNK_MAG for other dist values ensures flexibility.
  3. The loop for copying half chunks efficiently handles the main copy operation.

Note: The use of goto at line 164, while generally discouraged, is retained here to aid compiler optimization for inlining, as previously discussed.

These changes align perfectly with the PR objective of making the function half chunk aware and improving performance for smaller chunk sizes.


183-200: LGTM: Optimized main chunk handling for improved performance

The changes to the main chunk handling code introduce several performance optimizations:

  1. Addition of a specific case for dist == 16 aligns with the PR objective of optimizing common cases.
  2. Use of adv_amount allows for more efficient copying when dist doesn't divide evenly into chunk_t.
  3. The main loop now handles two chunk copies simultaneously, leveraging CPU store forwarding as mentioned in the PR objectives.

These optimizations should lead to significant performance improvements, especially for smaller chunk sizes and common distance values.


213-215: Acknowledged: Remaining bytes handling for half chunk support

The addition of the rem_bytes: label, conditionally compiled for half chunk support, allows the half chunk code to efficiently handle remaining bytes. This aligns with the overall optimization strategy of the PR.

Note: As discussed earlier, the use of goto here aids compiler optimization for inlining.


Line range hint 90-215: LGTM: Comprehensive performance improvements for chunkset operations

The changes in chunkset_tpl.h represent a significant optimization effort that aligns well with the PR objectives:

  1. Introduction of half chunk awareness improves performance for smaller chunk sizes.
  2. Optimizations for common distance values (2, 4, 8, 16) enhance efficiency in frequent scenarios.
  3. The main loop now leverages CPU store forwarding by handling two chunks simultaneously.
  4. The use of adv_amount allows for more efficient copying when the distance doesn't divide evenly into chunk size.

These optimizations are expected to yield substantial performance improvements, particularly for smaller chunk sizes and common distance values. The implementation is thorough and considers various edge cases, ensuring robust performance across different scenarios.

Great job on these optimizations! The changes should significantly enhance the performance of the zlib-ng library.

@nmoinvaz
Member

nmoinvaz commented Oct 4, 2024

Benchmark

Dataset: silesia.tar

Develop 8d10c30

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop/Release/minigzip.exe Size: 105,472 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.952/0.983/1.000/0.010        0.451/0.470/0.488/0.011       94,127,497
 2     35.519%      1.472/1.507/1.529/0.014        0.446/0.460/0.487/0.010       75,286,322
 3     33.844%      1.993/2.041/2.064/0.018        0.420/0.437/0.447/0.007       71,735,218
 4     33.146%      2.257/2.323/2.355/0.019        0.411/0.430/0.444/0.008       70,255,223
 5     32.642%      2.500/2.553/2.581/0.019        0.414/0.428/0.456/0.010       69,187,419
 6     32.483%      2.948/3.015/3.048/0.023        0.405/0.423/0.441/0.009       68,850,776
 7     32.255%      3.920/3.996/4.039/0.028        0.408/0.423/0.435/0.006       68,366,759
 8     32.167%      6.149/6.315/6.386/0.041        0.414/0.426/0.442/0.008       68,180,762
 9     31.887%      6.790/6.951/7.012/0.046        0.409/0.418/0.432/0.007       67,586,442

 avg1  34.261%                        3.298                          0.435
 tot                               1187.367                        156.614      653,576,418

PR efb588d

OS: Windows 10 10.0.22631 AMD64
CPU: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Tool: ../zlib-ng/build-develop-jesus/Release/minigzip.exe Size: 105,984 B
Levels: 1-9
Runs: 70         Trim worst: 30

 Level   Comp   Comptime min/avg/max/stddev  Decomptime min/avg/max/stddev  Compressed size
 1     44.409%      0.964/0.980/0.996/0.009        0.450/0.460/0.472/0.006       94,127,497
 2     35.519%      1.488/1.502/1.520/0.010        0.444/0.452/0.463/0.005       75,286,322
 3     33.844%      2.015/2.035/2.053/0.012        0.425/0.434/0.441/0.005       71,735,218
 4     33.146%      2.301/2.327/2.346/0.012        0.418/0.426/0.433/0.004       70,255,223
 5     32.642%      2.527/2.557/2.578/0.012        0.415/0.423/0.432/0.006       69,187,419
 6     32.483%      2.992/3.017/3.034/0.012        0.411/0.419/0.426/0.005       68,850,776
 7     32.255%      3.966/4.004/4.031/0.015        0.414/0.422/0.431/0.005       68,366,759
 8     32.167%      6.274/6.310/6.350/0.020        0.414/0.419/0.425/0.003       68,180,762
 9     31.887%      6.858/6.920/6.951/0.020        0.406/0.415/0.422/0.005       67,586,442

 avg1  34.261%                        3.295                          0.430
 tot                               1186.072                        154.739      653,576,418

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
arch/x86/chunkset_avx2.c (1)

55-57: Add documentation for the half_rem_vals array

The half_rem_vals array has been added, which is used in the GET_HALFCHUNK_MAG function. While its purpose can be inferred from its usage, it would be beneficial to add a brief comment explaining the significance of these values and how they were derived.

Could you please add a comment explaining the purpose and derivation of the values in the half_rem_vals array?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between efb588d and a90e06e.

📒 Files selected for processing (5)
  • arch/arm/chunkset_neon.c (0 hunks)
  • arch/x86/chunkset_avx2.c (5 hunks)
  • arch/x86/chunkset_sse2.c (0 hunks)
  • arch/x86/chunkset_ssse3.c (0 hunks)
  • chunkset_tpl.h (3 hunks)
💤 Files with no reviewable changes (3)
  • arch/arm/chunkset_neon.c
  • arch/x86/chunkset_sse2.c
  • arch/x86/chunkset_ssse3.c
🧰 Additional context used
📓 Learnings (2)
arch/x86/chunkset_avx2.c (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T03:09:54.717Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.
chunkset_tpl.h (1)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: chunkset_tpl.h:164-165
Timestamp: 2024-10-04T03:17:24.773Z
Learning: In `chunkset_tpl.h`, using `goto` in the `CHUNKMEMSET` function aids the compiler in inlining the function, so it should be retained.
🔇 Additional comments (11)
arch/x86/chunkset_avx2.c (5)

12-12: LGTM: Appropriate type definition for half-chunks

The addition of halfchunk_t as __m128i is a good choice for handling 128-bit (half-chunk) operations. This aligns well with the PR's objective of improving performance for smaller chunk sizes.


17-19: LGTM: New macros for extended functionality

The addition of HAVE_CHUNKMEMSET_16 and HAVE_HALF_CHUNK macros is consistent with the existing code structure. These will allow for conditional compilation of the new 16-byte chunk operations and half-chunk functionality, which is in line with the PR's performance improvement goals.


77-79: LGTM: Efficient implementation of chunkmemset_16

The chunkmemset_16 function is a well-implemented addition for handling 16-byte chunks. The use of _mm256_broadcastsi128_si256 is an efficient way to set a 256-bit chunk from a 16-byte input, which aligns with the PR's performance improvement goals.


128-158: LGTM: Well-implemented half-chunk operations

The addition of half-chunk related functions (halfchunkmemset_2, halfchunkmemset_4, halfchunkmemset_8, loadhalfchunk, storehalfchunk, halfchunk2whole) is well-implemented. These functions provide efficient operations for 128-bit data, which is consistent with the PR's goal of improving performance for smaller chunk sizes. The implementations are analogous to their full-chunk counterparts, ensuring consistency in the codebase.


Line range hint 91-124: Verify correctness of simplified GET_CHUNK_MAG function

The simplification of the GET_CHUNK_MAG function, particularly for dist < 16, should lead to better performance. However, the removal of the specific branch for dist == 16 warrants careful verification to ensure correctness is maintained for all input cases.

Please run the following script to verify the function's behavior:

chunkset_tpl.h (6)

90-108: LGTM: New HALFCHUNKCOPY function for optimized half-chunk operations

The addition of the HALFCHUNKCOPY function is a good optimization for handling smaller data sizes. It follows the same pattern as CHUNKCOPY, which maintains consistency in the codebase. This function will likely improve performance for operations on data sizes that are half of the standard chunk size.
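As a rough portable model of that pattern (the helper name and exact advance logic here are assumptions based on the CHUNKCOPY convention, not the template's actual code):

```c
#include <string.h>

/* Portable sketch of a CHUNKCOPY-style helper at half width: it always
 * issues whole 16-byte stores, so the caller must guarantee up to 15
 * bytes of writable slack past len (the usual CHUNKCOPY contract).
 * The first store is advanced by only the misaligned remainder so that
 * every later store stays 16-byte-strided. */
static unsigned char *halfchunkcopy_sketch(unsigned char *out,
                                           const unsigned char *from,
                                           size_t len) {
    size_t align = ((len - 1) % 16) + 1;  /* 1..16 */
    memcpy(out, from, 16);
    out += align;
    from += align;
    len -= align;
    while (len > 0) {
        memcpy(out, from, 16);
        out += 16;
        from += 16;
        len -= 16;
    }
    return out;
}
```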


112-120: LGTM: Improved function signature and new variable declarations

The change of CHUNKMEMSET to static inline is a good optimization that allows the compiler to inline the function, potentially improving performance. The new variables chunk_load and chunk_mod are necessary for the updated logic in the function.


136-165: LGTM: Efficient half-chunk handling logic

The new half-chunk handling logic is a valuable addition that allows for more efficient processing of smaller data sizes. The specialized handling for common dist values (2, 4, 8, 16) is likely to improve performance in frequent use cases.

Note: The use of goto rem_bytes at line 164 is intentional and aids in function inlining, as previously discussed.


194-200: LGTM: Optimized main loop with double chunk copy

The updated main loop in CHUNKMEMSET now handles two chunk copies at once, which can significantly improve performance by reducing the number of loop iterations. The use of adv_amount ensures correct handling for cases where the chunk size doesn't evenly divide the vector register size, maintaining accuracy while optimizing for speed.
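A portable sketch of the doubled loop, using `memcpy` stores and a 16-byte stand-in chunk width; `adv_amount` follows the review's description, everything else is illustrative:

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* stand-in width; the AVX2 build works on 32-byte chunks */

/* Sketch of the doubled main loop: two overlapping chunk-wide stores per
 * iteration, advancing by adv_amount -- the largest multiple of dist that
 * fits in a chunk -- and letting the CPU's store forwarding absorb the
 * overlap. Assumes 1 <= dist <= CHUNK and that the dist-byte pattern
 * already sits immediately before out. */
static void chunkmemset_sketch(unsigned char *out, unsigned dist, size_t len) {
    unsigned adv_amount = (CHUNK / dist) * dist;
    unsigned char chunk[CHUNK];
    for (unsigned i = 0; i < CHUNK; i++)
        chunk[i] = out[(ptrdiff_t)(i % dist) - (ptrdiff_t)dist];
    while (len >= 2 * CHUNK) {
        memcpy(out, chunk, CHUNK);
        memcpy(out + adv_amount, chunk, CHUNK);  /* overlapping store */
        out += 2 * adv_amount;
        len -= 2 * adv_amount;
    }
    while (len--) {               /* byte-wise tail for the remainder */
        *out = *(out - dist);
        out++;
    }
}
```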


Line range hint 213-220: LGTM: Consistent remainder handling with half-chunk support

The addition of the rem_bytes: label, conditionally compiled for half-chunk support, maintains consistency with the earlier goto statement. The unchanged remainder handling logic continues to appropriately deal with any leftover bytes, ensuring complete and accurate data copying.


Line range hint 1-255: Overall assessment: Significant performance optimizations

The changes in this file represent a substantial effort to optimize chunk memory operations, particularly for smaller data sizes. The introduction of half-chunk handling, the optimized main loop in CHUNKMEMSET, and the new HALFCHUNKCOPY function are all likely to contribute to improved performance.

The code maintains good readability and follows consistent patterns, which is crucial for maintainability. The careful handling of different chunk sizes and the use of conditional compilation (#ifdef HAVE_HALF_CHUNK) allow for flexibility across different architectures.

These optimizations align well with the PR objectives of improving the performance of chunk operations, especially for smaller chunk sizes. The changes appear to be well-thought-out and implemented with attention to both performance and code quality.

Comment on lines +160 to +171
static inline halfchunk_t GET_HALFCHUNK_MAG(uint8_t *buf, uint32_t *chunk_rem, uint32_t dist) {
lut_rem_pair lut_rem = perm_idx_lut[dist - 3];
__m128i perm_vec, ret_vec;
__msan_unpoison(buf + dist, 16 - dist);
ret_vec = _mm_loadu_si128((__m128i*)buf);
*chunk_rem = half_rem_vals[dist - 3];

perm_vec = _mm_load_si128((__m128i*)(permute_table + lut_rem.idx));
ret_vec = _mm_shuffle_epi8(ret_vec, perm_vec);

return ret_vec;
}
Contributor

@coderabbitai coderabbitai bot Oct 8, 2024

🛠️ Refactor suggestion

LGTM: Efficient implementation of GET_HALFCHUNK_MAG with minor optimization potential

The GET_HALFCHUNK_MAG function is a well-implemented half-chunk version of GET_CHUNK_MAG. The use of half_rem_vals for setting chunk_rem simplifies the remainder calculation, and the SIMD instructions are used appropriately for efficient operation.

A minor optimization could be considered:

Consider using _mm_loadl_epi64 instead of _mm_loadu_si128 when dist <= 8. This might be slightly more efficient for smaller distances. Here's a potential implementation:

- ret_vec = _mm_loadu_si128((__m128i*)buf);
+ ret_vec = (dist <= 8) ? _mm_loadl_epi64((__m128i*)buf) : _mm_loadu_si128((__m128i*)buf);

This change could potentially improve performance for very small distances without affecting correctness.

Committable suggestion was skipped due to low confidence.

Contributor Author

The branching cost far outweighs the savings from the load.

Contributor

@KungFuJesus, understood. Thank you for the clarification.

(。•́︿•̀。)


✏️ Learnings added
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1778
File: arch/x86/chunkset_avx2.c:160-171
Timestamp: 2024-10-08T21:51:45.150Z
Learning: In `arch/x86/chunkset_avx2.c`, within the `GET_HALFCHUNK_MAG` function, using a conditional branch to select between `_mm_loadl_epi64` and `_mm_loadu_si128` is not recommended because the branching cost outweighs the savings from the load.


🧠 Learnings used
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:56-59
Timestamp: 2024-10-08T03:09:54.717Z
Learning: The array `half_rem_vals` in `arch/x86/chunkset_avx2.c` is used in the function `GET_HALFCHUNK_MAG` on line 148.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:82-85
Timestamp: 2024-10-07T21:18:37.806Z
Learning: In `arch/x86/chunkset_avx2.c`, when working with AVX2-capable x86 CPUs, unaligned memory access using `_mm_loadu_si128` is acceptable since there is no performance penalty on architectures after Nehalem. Ensuring alignment may introduce unnecessary overhead due to arbitrary offsets into the window.

Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: arch/x86/chunkset_avx2.c:168-168
Timestamp: 2024-10-07T21:21:10.570Z
Learning: In the function `GET_HALFCHUNK_MAG` in `arch/x86/chunkset_avx2.c`, `dist` is always less than 16, so `16 - dist` in `__msan_unpoison` is always positive.


Put length 16 in the length checking ladder and take care of it there
since it's also a simple case to handle. We kind of went out of our way
to pretend 128 bit vectors didn't exist when using avx2 but this can be
handled in a single instruction. Strangely the intrinsic uses vector
register operands but the instruction itself assumes a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.
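A portable stand-in for the behavior described (the real path is a single `_mm256_broadcastsi128_si256` / `vbroadcasti128`; the helper name here mirrors the one mentioned in the review but the body is illustrative):

```c
#include <string.h>

/* Models chunkmemset_16 for the AVX2 chunk type: duplicate a 16-byte
 * source into both 128-bit lanes of a 32-byte chunk. The real code does
 * this with one vbroadcasti128, whose source is a memory operand even
 * though the intrinsic takes a vector argument. */
static void chunkmemset_16_sketch(unsigned char chunk[32],
                                  const unsigned char *from) {
    memcpy(chunk, from, 16);
    memcpy(chunk + 16, from, 16);
}
```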
This gives us appreciable gains on a number of fronts.  The first being
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
distance that is smaller, so CHUNKCOPY gets its teeth back here for
smaller sizes, without having to do another dispatch to a function.

We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.

Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
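A scalar model of the simpler half-chunk magazine described above (the `16 % dist` remainder is an assumed convention; the real code does the tiling with one `pshufb` through `permute_table`):

```c
/* Scalar model of GET_HALFCHUNK_MAG: tile a dist-byte pattern across a
 * 16-byte half chunk. Because a single SSE byte shuffle reaches all 16
 * lanes, no cross-lane permute has to be synthesized with a blend,
 * unlike the 32-byte case. */
static void half_chunk_mag_sketch(unsigned char out[16],
                                  const unsigned char *buf, unsigned dist,
                                  unsigned *chunk_rem) {
    for (unsigned i = 0; i < 16; i++)
        out[i] = buf[i % dist];   /* perm_vec[i] = i % dist */
    *chunk_rem = 16u % dist;      /* assumed remainder convention */
}
```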
Member

@Dead2 Dead2 left a comment

LGTM

@Dead2 Dead2 merged commit e874b34 into zlib-ng:develop Oct 12, 2024
141 of 147 checks passed
@Dead2 Dead2 mentioned this pull request Dec 31, 2024
This was referenced Jan 27, 2025