General optimized chunkset #1568
Conversation
Codecov Report: Patch coverage has no change and project coverage change: +0.24%

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #1568      +/-  ##
===========================================
+ Coverage    83.88%   84.13%   +0.24%
===========================================
  Files          132      129       -3
  Lines        10843    10470     -373
  Branches      2801     2692     -109
===========================================
- Hits          9096     8809     -287
+ Misses        1048     1002      -46
+ Partials       699      659      -40
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
```c
 * After using a single memcpy to copy N chunks, we have to use series of
 * loadchunk and storechunk to ensure the result is correct.
 */
static inline uint8_t* CHUNKCOPY(uint8_t *out, uint8_t const *from, unsigned len) {
```
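To make the idea in that comment concrete, here is a rough sketch of the shape such a CHUNKCOPY could take. It is only my reading of the description, not the PR's actual code: chunk_t, loadchunk and storechunk are the helpers from chunkset_tpl.h, and it assumes the usual CHUNKCOPY preconditions (len > 0, from lags out by at least one chunk, and both buffers have a chunk of slack past len):

```c
static inline uint8_t *CHUNKCOPY_sketch(uint8_t *out, uint8_t const *from, unsigned len) {
    chunk_t chunk;
    /* Copy the partial leading chunk so the remainder is a whole number of chunks. */
    unsigned align = ((len - 1) % sizeof(chunk_t)) + 1;
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += align;
    from += align;
    len -= align;

    /* One memcpy for as many whole chunks as sit entirely behind the write
     * position, i.e. where the source cannot alias the bytes being written. */
    unsigned dist = (unsigned)(out - from);
    unsigned bulk = len < dist ? len : (dist / (unsigned)sizeof(chunk_t)) * (unsigned)sizeof(chunk_t);
    memcpy(out, from, bulk);
    out += bulk;
    from += bulk;
    len -= bulk;

    /* Whatever remains reads bytes written above, so copy it chunk by chunk. */
    while (len > 0) {
        loadchunk(from, &chunk);
        storechunk(out, &chunk);
        out += sizeof(chunk_t);
        from += sizeof(chunk_t);
        len -= sizeof(chunk_t);
    }
    return out;
}
```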
Why not just use the CHUNKCOPY in chunkset_tpl.h? Could these changes be moved into that CHUNKCOPY?
Since I have confirmed that this is an optimization specifically for RISC-V using RVV-optimized glibc, I've decided to implement this special CHUNKCOPY only for RISC-V.
I'm not sure whether it's a general optimization for other platforms. I'll test it on x86/ARM when I have free time.
arch/riscv/chunkset_rvv.c
```c
#include "zbuild.h"

/*
 * It's not an optimized implementation using RISC-V RVV, but a general optimized one.
```
I am wondering if this isn't better to be put into arch/generic and all references to RISC-V removed since there are no RISC-V intrinsics being used.
It's a very interesting problem. I'll try it on x86/ARM when I have free time.

This looks an awful lot like the CHUNKCOPY_SAFE code I wrote a while back.
```c
 */
#define CHUNK_SIZE 32

/* We don't have a 32-byte datatype for RISC-V arch. */
```
Is using an 8-bit data type the fastest? On most architectures, using either a 32-bit or 64-bit type is faster and still usable for 256-bit chunks...
Thanks. I will try 32-bit and 64-bit data types.
It should be relatively trivial to provide implementations of chunkmemset_2/chunkmemset_4/chunkmemset_8 if uint64_t is used here instead of uint8_t.
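For illustration only, here is roughly what chunkmemset_8/chunkmemset_4 could look like if the 32-byte chunk were held as four uint64_t lanes instead of bytes. The type and function names below are hypothetical placeholders, not zlib-ng's actual chunkset_tpl.h signatures:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t v[4]; } chunk64_t;   /* hypothetical 32-byte chunk */

/* Broadcast an 8-byte pattern across the whole chunk. */
static inline void chunkmemset_8_sketch(const uint8_t *from, chunk64_t *chunk) {
    uint64_t pat;
    memcpy(&pat, from, sizeof(pat));
    for (int i = 0; i < 4; i++)
        chunk->v[i] = pat;
}

/* Broadcast a 4-byte pattern: double it into an 8-byte lane, then fill as above. */
static inline void chunkmemset_4_sketch(const uint8_t *from, chunk64_t *chunk) {
    uint32_t half;
    memcpy(&half, from, sizeof(half));
    uint64_t pat = ((uint64_t)half << 32) | half;
    for (int i = 0; i < 4; i++)
        chunk->v[i] = pat;
}
```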
ccawley2011
left a comment
I suspect that the RVV implementation of CHUNKCOPY would benefit other architectures that aren't using SIMD, especially if unaligned memory access is unavailable (such as on older ARM platforms).
It might be worth looking at chunkcopy_safe as well, since RISC-V doesn't seem to inline memcpy calls for more than 8 bytes.
```c
while (len > 0) {
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += sizeof(chunk_t);
    from += sizeof(chunk_t);
    len -= sizeof(chunk_t);
}
```
Would something like this be beneficial here?
```c
while (len > 0) {
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += sizeof(chunk_t);
    from += sizeof(chunk_t);
    len -= sizeof(chunk_t);
}
if (len > 0) {
    len = ((len + sizeof(chunk_t) - 1) / sizeof(chunk_t)) * sizeof(chunk_t);
    memcpy(out, from, len);
    out += len;
    from += len;
}
```
It might also help to have a single memcpy call that combines this with the previous one.
Unfortunately, the tests would fail. We have to copy a single chunk at a time to ensure the correctness of the result.
Assuming that len >= 3 * sizeof(chunk), sizeof(chunk) = 4 and align = 2:
Copy a single chunk at a time:

```
ABCDDDDDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDABDDDDDDDDDDDDDDDDD
* *
ABCDABCDABCDABDDDDDDDDDDDDD
* *
ABCDABCDABCDABCDABDDDDDDDDD
* *
ABCDABCDABCDABCDABCDABDDDDD
* *
```

Copy the whole remaining memory at once:

```
ABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDABDDDDDDDDDDDDDDDDD
* *
if (len > 0) { // ...copy the rest }
ABCDABCDABCDABDDDDDDDDDDDDD
* *
```

It would copy CDABDDDDDD...
I think I understand what you're describing. Would it still help to do incrementally larger copies as more chunks are filled in, or would that not make much of a difference?
arch/riscv/chunkset_rvv.c
```c
loadchunk(from, &chunk);
storechunk(out, &chunk);
out += align;
from += align;
len -= align;
```
I think this can be merged with the other memcpy calls as well.
According to the assumption, from lags out by at least a chunk, so using memcpy is safe here.
It's done, thanks.
```c
 */
#define CHUNK_SIZE 32

/* We don't have a 32-byte datatype for RISC-V arch. */
```
It should be relatively trivial to provide implementations of chunkmemset_2/chunkmemset_4/chunkmemset_8 if uint64_t is used here instead of uint8_t.
Again, CHUNKCOPY_SAFE kinda sorta already does this:
I tested adding just the two if blocks to the standard chunkcopy, not sure whether anything else would be needed. So unless there is something to magically speed that up massively, this does not work in the standard chunkcopy implementation and should indeed be an RVV-specific implementation. (I am amazed that this does speed things up there though; I assume you did verify that just upping the chunksize didn't account for the speedup.)
We used
It's not a general optimization, sad :(
Implement chunk memset for specific length
Dead2
left a comment
LGTM
I suspect that what makes the difference here is probably one of two things:
I haven't done any proper benchmarks though, so this is all just speculation. It's likely that the discussion in PR #1548 is related to a degree.
My feeling here is that slotting in to fit the chunk-copy idiom is not a path which RISC-V Vector should be taking, and this might generalise to SVE and AVX as well. Trying to squeeze this hardware into chunks surrounded by safety checks which it doesn't need can only slow it down.

Looking at the current structure of inffast_tpl.h, it has quite a few conditional branches deciding which fast and slow paths can be taken, and my suspicion is that some architectures can probably do better with fewer conditional branches and more generic implementations instead. Perhaps if the architecture-specific implementations are more generic, inline memcpy functions, they can be passed a constant bitmap of safe assumptions for each call point, like whether source or destination might overlap or if each buffer has extra runway at the end. Then the necessary tests can be performed within the architecture-specific code rather than common code (and removed by the compiler if the relevant flags are passed), and architectures that don't care just don't do the test either way. But, importantly, inffast_tpl.h and inflate_p.h aren't generating spurious code paths (and tests) for implementations that needn't differ.

I'll try to illustrate how simple the main copy loop might be if only the overall library were structured to accommodate it. I'm looking at RISC-V Vector, but SVE and AVX also have masked load and store operations, I believe, so they might benefit from a similar approach. Here's how simple RISC-V tried to make efficient memcpy:

```c
do {
    size_t vl = __riscv_vsetvl_e8m1(len);
    vuint8m1_t chunk = __riscv_vle8_v_u8m1(src, vl);
    __riscv_vse8_v_u8m1(dst, chunk, vl);
    dst += vl;
    src += vl;
    len -= vl;
} while (len > 0);
```

Nothing special for aligned, unaligned, or safe or unsafe destinations. Nothing special for different hardware with wider or narrower register widths. Not so far, at least. And it won't over-read or over-write any buffers; we don't just use

In fact for zlib, the chances are that for many cases it won't even iterate, and will drop straight through because

The other case, where the backreference results in an overlapping copy with pattern unrolling and all that, also works out to be fairly simple:

```c
size_t vl = __riscv_vsetvl_e8m1(len);
vuint8m1_t chunk = __riscv_vle8_v_u8m1(src, vl);
if (offset < vl) {
    vuint8m1_t idx = __riscv_vid_v_u8m1(vl);
    idx = __riscv_vremu(idx, offset, vl);
    chunk = __riscv_vrgather(chunk, idx, vl);
    int unroll = vl - vl % offset;
    // if offset >= vl, unroll becomes 0, so no-op below:
    offset += unroll;
    src -= unroll;
}
__riscv_vse8_v_u8m1(dst, chunk, vl);
dst += vl;
src += vl;
len -= vl;
// continue with regular copy iff there's anything left...
```

Here

But that code comes with caveats.
So AVX's masked load and store have gigantic variable penalties, particularly if a page fault is incurred. Now AVX-512 gets around that a bit, but I suspect you are not going to see a massive speedup over the progressively unrolled chunks approach.
Also note that some of the chunking copies, particularly chunkcopy_safe, need to preserve self-modifying copy semantics, meaning in the case of overlap you need the side effects to occur as if you were doing a byte-by-byte copy.
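A tiny example of those semantics (my own illustration, not zlib-ng code): with a back-reference distance of 1, the byte-by-byte loop smears the last written byte across the whole run, which is exactly the side effect the LZ decoder relies on and which a plain memcpy is not required to reproduce:

```c
#include <stdio.h>

int main(void) {
    char buf[8] = "Axxxxxx";
    /* Self-modifying copy with distance 1: each byte reads the byte just written. */
    for (int i = 1; i < 7; i++)
        buf[i] = buf[i - 1];
    printf("%s\n", buf);   /* prints "AAAAAAA" */
    return 0;
}
```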
I totally agree that fixed-size chunk-copy is not suitable for RISC-V RVV. However, I didn't modify the design, because it has worked well for a long time. Maybe you can benchmark this prototype first. You can do it without any software design, without any clean code, just by hardcoding the
So I'm currently doing experiments on a Cloudflare fork (realising way too late I should have started from @dougallj's fork, but I'm there now). I don't have anything intelligible just yet, but while I was testing I noticed that zlib-ng is calling

Is it worth a patch just to take control of the quality of the memcpy implementation that gets used, maybe?
A lot of effort was expended to make use of the standard memcpy, particularly for cases where there are alignment concerns. GCC, Clang, and any reasonable compiler should be subbing those calls for compiler builtins. Additionally, where memcpy might be called for variable lengths, glibc and many other libcs provide an ERMS-aware memcpy that is very efficient for particularly large strings on x86 (and even some medium-length strings on Ice Lake or newer). Feel free to try to find a place where memcpy is punitive (maybe there are poorly optimized implementations for more nascent ISAs out there where the compiler can't infer a builtin). But the memcpys were sprinkled in there for a reason, at least where I put them. In fact, my modifications to chunkcopy_safe had some pretty sizeable savings on very compressible things, and some pretty good savings on things that weren't, too.
I think only variable-length copies are worth discussing. Almost all the fixed-length ones are replaced with sensible code and won't show in the profile. zlib has no use for large copies except for window maintenance at the end of a block of input. Otherwise the limit is 258 bytes, and the median is much lower. The whole point of the chunk copy optimisation, for example, is to eliminate all the redundant work that a generic, variable-length memcpy goes through and to replace that with a compiler-provided substitute for fixed-length memcpy. If libc memcpy() was that smart it'd surely be used by default in inffast.c after confirming it's not an unroll operation.

But for RVV it's only three instructions to do a modest variable-length memcpy -- plus a branch and pointer updates (pointer updates were going to happen anyway) if it happens to be a longer case. So there's a question about whether there could be anything in the libc version which is going to do better, even after spending extra operations deciding what kind of optimisations are relevant for the particular case and after the cost of register marshalling to do any function call at all.

Do you have a godbolt link with a case where variable-length memcpy() is inlined, btw? I had a very superficial poke at x86 and then gave up, because I don't know all the flags I'd need to use. If I had an x86 example to contrast with then I'd have something tangible to point at when I ask for the same feature in RISC-V.

Mention of ERMS raises an interesting point. The FSRM feature (not ERMS, because that seems to be for larger copies) seems to have pretty much the same application as RVV's variable-length vle/vse pair, which circles back to my original comment that maybe x86 can also benefit from a simplified main loop where length overflow checks are elided in favour of hardware features, and we have fewer branch mispredicts to deal with. But I don't know what the caveats of FSRM are. Maybe it's less useful than it looks (just like masked load/store).
That's the thing: for the most part the memcpys are being translated into small copy loops with the compiler builtins. By using memcpy we usually end up with proper alignment semantics and defined behavior, particularly when something aliases. E.g. what I did here (https://github.com/zlib-ng/zlib-ng/blob/develop/inflate_p.h#L183) was to intentionally use meaningful copy lengths so that this loop body is a handful of move instructions with jumps (arguably a case/switch table might have been better here). The case above it, where it is a single memcpy call at an indefinite but possibly short length, I still found beneficial here in place of what the old chunkcopy_safe had been. Simply jumping with a goto into the copy loop below was not beneficial versus that single memcpy statement.
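Roughly, the shape of that idea is a ladder of constant-length memcpy calls, so each one lowers to a couple of plain loads and stores. This is only a paraphrase of the technique, not the actual inflate_p.h code:

```c
#include <stdint.h>
#include <string.h>

/* Dispatch on the (small, known-bounded) remaining length so every memcpy has a
 * compile-time-constant size and the compiler emits straight-line moves. */
static inline uint8_t *copy_ladder(uint8_t *out, const uint8_t *from, unsigned len) {
    switch (len) {
    case 8: memcpy(out, from, 8); break;
    case 7: memcpy(out, from, 7); break;
    case 6: memcpy(out, from, 6); break;
    case 5: memcpy(out, from, 5); break;
    case 4: memcpy(out, from, 4); break;
    case 3: memcpy(out, from, 3); break;
    case 2: memcpy(out, from, 2); break;
    case 1: *out = *from; break;
    }
    return out + len;
}
```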
But these memcpy operations don't appear in the profile, do they? They're part of the calling function. What I'm saying is that I see a huge chunk of the profile appearing in the libc memcpy function. Part of that is the naive scalar implementation that I have, but just making a function call (any at all) is more complex than a short memcpy on RVV, so even calling a good implementation is probably a bad idea. Checking the disassembly, they do all appear to be variable-length copies. BTW, if you want to use a pattern like that you might want to experiment with doing something like this:
You may want to capture the entire call graph with your profile. I suspect a fair number of memmove/memcpy calls are going to be those massive window-sized copies toward the end. FWIW those incidental copies are done at the same time as an adler checksum on x86, since the additional instructions slot in well with most CPU pipelines and added very minor complexity.
Doesn't seem to be that. In the trace I have handy
Just to be certain I implemented fold_copy for RVV:
And that reduced the
Oh cool, I encourage you to open up a PR with that change. It's been a while since I rewrote chunkcopy_safe with that memcpy, and I wish I had kept more breadcrumbs from my benchmarking and optimization efforts. I do recall it being faster, but if removing the explicit memcpy of indeterminate length does improve performance measurably, it's definitely worth considering. That's the only place I can recall that's not going to translate to a builtin (the other chunk-based copies translate directly on most ISAs to unrolled SIMD-based copies in some form or another). Somewhat tricky is that if you do capture a call graph, any memcpy calls are likely to be shown being called from inflate_fast, since they themselves are likely embedded in inlined functions. You may have to systematically substitute memcpy for some stubbed routine until you figure out where the volume of explicit libc calls is coming from.
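One way to do that stubbing (an illustrative helper of my own, not something in zlib-ng): a never-inlined byte-loop wrapper that shows up under its own symbol in the profile, swapped in for memcpy one suspect file at a time:

```c
#include <stddef.h>

/* Plain byte loop so no libc memcpy is involved; noinline keeps it visible in perf. */
__attribute__((noinline)) void *memcpy_probe(void *dst, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* In the file under suspicion, after the system headers:
 *   #define memcpy memcpy_probe
 */
```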
So @sh1boot, a significant amount of profiling time does show up for me in glibc's memcpy/memmove; however the call graph, if it's to be believed, shows it is not called from the fast path (inflate_fast). A large amount of it is also called from within libpng, which we can't do much about. But perhaps this helps narrow your search window? I don't think inflate_fast is being inlined in inflate anymore, now that it's a separate dispatch in the function table.
@sh1boot possibly here? https://github.com/zlib-ng/zlib-ng/blob/develop/inflate.c#L724
It seems to be a 50:50 split between something in chunkset_rvv.c and inflate_p.h. Fixing chunkset_rvv.c itself halved the number of calls, changing inffast_chunk_tpl.h didn't do much of anything, and then the count dropped to something sensible when I got to inflate_p.h.

I do not know how this is reflected on x86. Your profile suggests you have hardware which finds its way into an 'erms' path, so maybe you could try extending this patch to implement a memcpy_erms() which does a simple, inline

A quick Google suggests the full syntax might be:

(updated to reflect my current understanding --
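For reference, a minimal inline REP MOVSB sequence along those lines might look like the sketch below (my own guess at such a memcpy_erms() helper, GCC/Clang inline asm, x86-64 only; not the glibc implementation):

```c
#include <stddef.h>

/* REP MOVSB copies RCX bytes from [RSI] to [RDI], updating all three registers. */
static inline void *memcpy_erms_sketch(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
}
```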
I found an x86 machine to try it on, and it was something like a 10% speed-up, by the way. The same test was nearer 25% on RISC-V, but I think a lot of that is because the memcpy implementation I'm linking is not great.
If it is inflate_p.h, then that's certainly chunkcopy_safe. I'll do some benchmarking with that tonight and see if I can see a measurable gain.
Does anyone know if there's an upper bound for the sizes fed into chunkcopy_safe? I'm trying to do some mental calculus here for whether it's even worth attempting to be FSRM-aware. I kind of suspect it isn't, because it seems like even with FSRM there are a bunch of asterisks for when it actually wins.

Edit: hmm, looks like it's maybe 255 bytes. FSRM and even ERMS were supposed to be fast around 128 bytes and higher. That might explain why on highly compressible stuff I was seeing gains for using memcpy. There is still one memcpy call of variable length; maybe eliminating one of them lets us have our cake and eat it too? Only for x86 of course, but... something? I can try to do the ERMS/FSRM sequence with an inlined function too perhaps, but glibc does a whole lot of analysis before determining that it's worth it.
So, I definitely didn't hallucinate the gain; it gets decidedly worse on x86 for all cases if I cut out the memcpy and jump straight to the copy ladder at the bottom (similar to what this function looked like prior to me getting to it). And actually, from what I recall this function didn't actually behave correctly and we were just getting lucky with copy sizes, as it didn't properly emulate a self-modifying copy loop like the LZ decoding loop expects. Let me try to conditionally do these copies, though.
That example you pointed to earlier from Cloudflare: I think I get the gist of what it's doing, but man, it is not clear how it works. I think it's trying to copy through the overlap and then rewind such that the periodicity of the copy is a modulus of a typical chunk size. Maybe I can experiment with that technique, too. At least in doing that, the unravelling copies require far less branching.
So, that unroll function starts with the observation that if offset is 1, then after the first byte copied there's enough room to double the offset (i.e., set it to 2) and copy byte pairs from then on. Then, if offset is 2, then after the first byte-pair copy there's enough room to double it again. But if offset was 3 or more then you can't do that. So it proceeds exponentially, trying to double offset after more room is made, but if there's not enough room then it has to dial offset back a bit.

Lucky thing is, if doubling offset overflows it won't overflow by more than the original value of offset. At the beginning offset is the same as the initial offset, and for as long as it overflows it'll double and then retreat back to its original value, until the doubling is finally larger than the original offset and there's room to double the offset as well. From then on you can either double or double-minus-one the offset, depending on what its initial value was. If the initial value is a power of two then you always double. If it's three then you can double it once you have four bytes unrolled (plus the original three makes seven) for an offset of six. Then at an unroll of eight (plus three makes eleven) you can't double six, but you can double and knock three off for nine. Each step means you have at least enough unrolled data for the next copy stage.

Not really sure about performance, though. It's a rapid cycle of reading the data just written at a different offset and size, which can be hard for the CPU to untangle. Hopefully the number of iterations is few enough that it doesn't cause a serious blockage, because the data isn't critical path.

As for

In fact, the definition of that instruction is that it must behave "as if" it's a trivial byte copy loop even when source and destination overlap. That means that technically you can use it for unrolling short patterns. But I think what will actually happen is that the CPU will get mad with you and do it really slowly. You never know, though. It eliminates more conditional branches.
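A stripped-down sketch of that exponential idea, in my own simplified form rather than the Cloudflare code: each memcpy duplicates everything unrolled so far, so the contiguous pattern roughly doubles per step and none of the memcpy calls overlap:

```c
#include <stddef.h>
#include <string.h>

/* Fill len bytes at dst by repeating the `offset` bytes immediately before it,
 * growing the unrolled region exponentially instead of copying byte by byte. */
static void pattern_fill_sketch(unsigned char *dst, size_t offset, size_t len) {
    const unsigned char *src = dst - offset;
    size_t avail = offset;              /* contiguous pattern bytes behind dst */
    while (len > avail) {
        memcpy(dst, src, avail);        /* duplicate everything unrolled so far */
        dst += avail;
        len -= avail;
        avail *= 2;                     /* the unrolled region just doubled */
    }
    memcpy(dst, src, len);              /* final partial copy */
}
```

Because dst always sits a whole number of periods past src, each memcpy extends the repeating pattern exactly as a byte-by-byte loop would, but in O(log(len/offset)) calls.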
Well, eh, this makes things even more interesting: https://lock.cmpxchg8b.com/reptar.html
I don't have a CPU affected by that microcode fix, but I wonder how much that alters the FSRM calculus. I haven't forgotten about this; I do intend to try your suggestion of just doing the REP MOVSB sequence unconditionally to see what happens. I suspect slow things, but you never know.
@sh1boot a recent bug with regard to chunk-based memcpy'ing reminded me to revisit this. It's not better, though not as appreciably worse as I thought it would be to force an ERMS sequence. This is on a thin U-series Haswell CPU, so it's not exactly a great test bed since it lacks FSRM. Interesting, nonetheless. That benchmark is a trivial-to-compress load. The more realistically compressed imagery shows it's appreciably worse:
@Adenilson hey, do you have an FSRM device to try this on? I'm really curious to see when complicated memory implementations become the "legacy" path for old chips.
We should probably start an FSRM thread or something... Here:
We try to:
to make good use of the RVV advantage.
We get a ~5.4% performance gain in decompression compared to loading/storing a single 8-byte chunk on the SiFive internal FPGA.
Benchmark tool: deflatebench