General optimized chunkset #1568
Conversation
Codecov Report: Patch coverage has no change and project coverage change: +0.24%

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #1568      +/-  ##
===========================================
+ Coverage    83.88%   84.13%   +0.24%
===========================================
  Files          132      129       -3
  Lines        10843    10470     -373
  Branches      2801     2692     -109
===========================================
- Hits          9096     8809     -287
+ Misses        1048     1002      -46
+ Partials       699      659      -40
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
```c
 * After using a single memcpy to copy N chunks, we have to use series of
 * loadchunk and storechunk to ensure the result is correct.
 */
static inline uint8_t* CHUNKCOPY(uint8_t *out, uint8_t const *from, unsigned len) {
```
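To make the idea in that comment concrete, here is a rough sketch of the shape such a CHUNKCOPY could take. It is only my reading of the description, not the PR's actual code: chunk_t, loadchunk and storechunk are the helpers from chunkset_tpl.h, and it assumes the usual CHUNKCOPY preconditions (len > 0, from lags out by at least one chunk, and both buffers have a chunk of slack past len):

```c
static inline uint8_t *CHUNKCOPY_sketch(uint8_t *out, uint8_t const *from, unsigned len) {
    chunk_t chunk;
    /* Copy the partial leading chunk so the remainder is a whole number of chunks. */
    unsigned align = ((len - 1) % sizeof(chunk_t)) + 1;
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += align;
    from += align;
    len -= align;

    /* One memcpy for as many whole chunks as sit entirely behind the write
     * position, i.e. where the source cannot alias the bytes being written. */
    unsigned dist = (unsigned)(out - from);
    unsigned bulk = len < dist ? len : (dist / (unsigned)sizeof(chunk_t)) * (unsigned)sizeof(chunk_t);
    memcpy(out, from, bulk);
    out += bulk;
    from += bulk;
    len -= bulk;

    /* Whatever remains reads bytes written above, so copy it chunk by chunk. */
    while (len > 0) {
        loadchunk(from, &chunk);
        storechunk(out, &chunk);
        out += sizeof(chunk_t);
        from += sizeof(chunk_t);
        len -= sizeof(chunk_t);
    }
    return out;
}
```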
Why not just use the CHUNKCOPY in chunkset_tpl.h? Could these changes be moved into that CHUNKCOPY?
Since I have confirmed that this is an optimization specifically for RISC-V using RVV-optimized glibc, I've decided to implement this special CHUNKCOPY only for RISC-V.
I'm not sure whether it's a general optimization for other platforms. I'll test it on x86/ARM when I have free time.
arch/riscv/chunkset_rvv.c
```c
#include "zbuild.h"

/*
 * It's not an optimized implementation using RISC-V RVV, but a general optimized one.
```
I am wondering if this isn't better to be put into arch/generic and all references to RISC-V removed since there are no RISC-V intrinsics being used.
It's a very interesting problem. I'll try it on x86/ARM when I have free time.

This looks an awful lot like the CHUNKCOPY_SAFE code I wrote a while back.
```c
 */
#define CHUNK_SIZE 32

/* We don't have a 32-byte datatype for RISC-V arch. */
```
Is using an 8-bit data type the fastest? On most architectures, using either a 32-bit or 64-bit type is faster and still usable for 256-bit chunks...
Thanks. I will try 32-bit and 64-bit data types.
It should be relatively trivial to provide implementations of chunkmemset_2/chunkmemset_4/chunkmemset_8 if uint64_t is used here instead of uint8_t.
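For illustration only, here is roughly what chunkmemset_8/chunkmemset_4 could look like if the 32-byte chunk were held as four uint64_t lanes instead of bytes. The type and function names below are hypothetical placeholders, not zlib-ng's actual chunkset_tpl.h signatures:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t v[4]; } chunk64_t;   /* hypothetical 32-byte chunk */

/* Broadcast an 8-byte pattern across the whole chunk. */
static inline void chunkmemset_8_sketch(const uint8_t *from, chunk64_t *chunk) {
    uint64_t pat;
    memcpy(&pat, from, sizeof(pat));
    for (int i = 0; i < 4; i++)
        chunk->v[i] = pat;
}

/* Broadcast a 4-byte pattern: double it into an 8-byte lane, then fill as above. */
static inline void chunkmemset_4_sketch(const uint8_t *from, chunk64_t *chunk) {
    uint32_t half;
    memcpy(&half, from, sizeof(half));
    uint64_t pat = ((uint64_t)half << 32) | half;
    for (int i = 0; i < 4; i++)
        chunk->v[i] = pat;
}
```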
ccawley2011
left a comment
I suspect that the RVV implementation of CHUNKCOPY would benefit other architectures that aren't using SIMD, especially if unaligned memory access is unavailable (such as on older ARM platforms).
It might be worth looking at chunkcopy_safe as well, since RISC-V doesn't seem to inline memcpy calls for more than 8 bytes.
```c
while (len > 0) {
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += sizeof(chunk_t);
    from += sizeof(chunk_t);
    len -= sizeof(chunk_t);
}
```
Would something like this be beneficial here?
```c
while (len > 0) {
    loadchunk(from, &chunk);
    storechunk(out, &chunk);
    out += sizeof(chunk_t);
    from += sizeof(chunk_t);
    len -= sizeof(chunk_t);
}
if (len > 0) {
    len = ((len + sizeof(chunk_t) - 1) / sizeof(chunk_t)) * sizeof(chunk_t);
    memcpy(out, from, len);
    out += len;
    from += len;
}
```
It might also help to have a single memcpy call that combines this with the previous one.
Unfortunately, the tests would fail. We have to copy a single chunk at a time to ensure the correctness of the result.
Assuming that len >= 3 * sizeof(chunk), sizeof(chunk) = 4 and align = 2:
Copy a single chunk at a time:

```
ABCDDDDDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDABDDDDDDDDDDDDDDDDD
* *
ABCDABCDABCDABDDDDDDDDDDDDD
* *
ABCDABCDABCDABCDABDDDDDDDDD
* *
ABCDABCDABCDABCDABCDABDDDDD
* *
```

Copy the whole remaining memory at once:

```
ABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDDDDDDDDDDDDDDDDDDDD
* *
ABCDABCDABDDDDDDDDDDDDDDDDD
* *
if (len > 0) { // ...copy the rest }
ABCDABCDABCDABDDDDDDDDDDDDD
* *
```

It would copy CDABDDDDDD...
I think I understand what you're describing. Would it still help to do incrementally larger copies as more chunks are filled in, or would that not make much of a difference?
arch/riscv/chunkset_rvv.c
```c
loadchunk(from, &chunk);
storechunk(out, &chunk);
out += align;
from += align;
len -= align;
```
I think this can be merged with the other memcpy calls as well.
According to the assumption, from lags out by at least a chunk, so using memcpy is safe here.
It's done, thanks.
```c
 */
#define CHUNK_SIZE 32

/* We don't have a 32-byte datatype for RISC-V arch. */
```
It should be relatively trivial to provide implementations of chunkmemset_2/chunkmemset_4/chunkmemset_8 if uint64_t is used here instead of uint8_t.
Again, CHUNKCOPY_SAFE kinda sorta already does this:
I tested adding just the two if blocks to the standard chunkcopy, not sure whether anything else would be needed. So unless there is something to magically speed that up massively, this does not work in the standard chunkcopy implementation and should indeed be an RVV-specific implementation. (I am amazed that this does speed things up there though; I assume you did verify that just upping the chunksize didn't account for the speedup.)
We used
It's not a general optimization, sad :(
Implement chunk memset for specific length
Dead2
left a comment
LGTM
I suspect that what makes the difference here is probably one of two things:
I haven't done any proper benchmarks though, so this is all just speculation. It's likely that the discussion in PR #1548 is related to a degree.
My feeling here is that slotting in to fit the chunk-copy idiom is not a path which RISC-V Vector should be taking, and this might generalise to SVE and AVX as well. Trying to squeeze this hardware into chunks surrounded by safety checks which it doesn't need can only slow it down.

Looking at the current structure of inffast_tpl.h, it has quite a few conditional branches deciding which fast and slow paths can be taken, and my suspicion is that some architectures can probably do better with fewer conditional branches and more generic implementations instead. Perhaps if the architecture-specific implementations are more generic, inline memcpy functions, they can be passed a constant bitmap of safe assumptions for each call point, like whether source or destination might overlap or if each buffer has extra runway at the end. Then the necessary tests can be performed within the architecture-specific code rather than common code (and removed by the compiler if the relevant flags are passed), and architectures that don't care just don't do the test either way. But, importantly, inffast_tpl.h and inflate_p.h aren't generating spurious code paths (and tests) for implementations that needn't differ.

I'll try to illustrate how simple the main copy loop might be if only the overall library were structured to accommodate it. I'm looking at RISC-V Vector, but SVE and AVX also have masked load and store operations, I believe, so they might benefit from a similar approach. Here's how simple RISC-V tried to make efficient memcpy:

```c
do {
    size_t vl = __riscv_vsetvl_e8m1(len);
    vuint8m1_t chunk = __riscv_vle8_v_u8m1(src, vl);
    __riscv_vse8_v_u8m1(dst, chunk, vl);
    dst += vl;
    src += vl;
    len -= vl;
} while (len > 0);
```

Nothing special for aligned, unaligned, or safe or unsafe destinations. Nothing special for different hardware with wider or narrower register widths. Not so far, at least. And it won't over-read or over-write any buffers; we don't just use

In fact for zlib, the chances are that for many cases it won't even iterate, and will drop straight through because

The other case, where the backreference results in an overlapping copy with pattern unrolling and all that, also works out to be fairly simple:

```c
size_t vl = __riscv_vsetvl_e8m1(len);
vuint8m1_t chunk = __riscv_vle8_v_u8m1(src, vl);
if (offset < vl) {
    vuint8m1_t idx = __riscv_vid_v_u8m1(vl);
    idx = __riscv_vremu(idx, offset, vl);
    chunk = __riscv_vrgather(chunk, idx, vl);
    int unroll = vl - vl % offset;
    // if offset >= vl, unroll becomes 0, so no-op below:
    offset += unroll;
    src -= unroll;
}
__riscv_vse8_v_u8m1(dst, chunk, vl);
dst += vl;
src += vl;
len -= vl;
// continue with regular copy iff there's anything left...
```

Here

But that code comes with caveats.
So AVX's masked load and store have gigantic variable penalties, particularly if a page fault is incurred. Now AVX-512 gets around that a bit, but I suspect you are not going to see a massive speedup over the progressively unrolled chunks approach.
Also note that some of the chunking copies, particularly chunkcopy_safe, need to preserve self-modifying copy semantics, meaning in the case of overlap you need the side effects to occur as if you were doing a byte-by-byte copy.
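A tiny example of those semantics (my own illustration, not zlib-ng code): with a back-reference distance of 1, the byte-by-byte loop smears the last written byte across the whole run, which is exactly the side effect the LZ decoder relies on and which a plain memcpy is not required to reproduce:

```c
#include <stdio.h>

int main(void) {
    char buf[8] = "Axxxxxx";
    /* Self-modifying copy with distance 1: each byte reads the byte just written. */
    for (int i = 1; i < 7; i++)
        buf[i] = buf[i - 1];
    printf("%s\n", buf);   /* prints "AAAAAAA" */
    return 0;
}
```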
I totally agree that fixed-size chunk-copy is not suitable for RISC-V RVV. However, I didn't modify the design, because it has worked well for a long time. Maybe you can benchmark this prototype first. You can do it without any software design, without any clean code, just by hardcoding the
So I'm currently doing experiments on a Cloudflare fork (realising way too late I should have started from @dougallj's fork, but I'm there now). I don't have anything intelligible just yet, but while I was testing I noticed that zlib-ng is calling

Is it worth a patch just to take control of the quality of the memcpy implementation that gets used, maybe?
A lot of effort was expended to make use of the standard memcpy, particularly for cases where there are alignment concerns. GCC, Clang, and any reasonable compiler should be subbing those calls for compiler builtins. Additionally, where memcpy might be called for variable lengths, glibc and many other libcs provide an ERMS-aware memcpy that is very efficient for particularly large strings on x86 (and even some medium-length strings on Ice Lake or newer). Feel free to try to find a place where memcpy is punitive (maybe there are poorly optimized implementations for more nascent ISAs out there where the compiler can't infer a builtin). But the memcpys were sprinkled in there for a reason, at least where I put them. In fact, my modifications to chunkcopy_safe had some pretty sizeable savings on very compressible things, and some pretty good savings on things that weren't, too.
I think only variable-length copies are worth discussing. Almost all the fixed-length ones are replaced with sensible code and won't show in the profile. zlib has no use for large copies except for window maintenance at the end of a block of input. Otherwise the limit is 258 bytes, and the median is much lower. The whole point of the chunk copy optimisation, for example, is to eliminate all the redundant work that a generic, variable-length memcpy goes through and to replace that with a compiler-provided substitute for fixed-length memcpy. If libc memcpy() was that smart it'd surely be used by default in inffast.c after confirming it's not an unroll operation.

But for RVV it's only three instructions to do a modest variable-length memcpy -- plus a branch and pointer updates (pointer updates were going to happen anyway) if it happens to be a longer case. So there's a question about whether there could be anything in the libc version which is going to do better, even after spending extra operations deciding what kind of optimisations are relevant for the particular case and after the cost of register marshalling to do any function call at all.

Do you have a godbolt link with a case where variable-length memcpy() is inlined, btw? I had a very superficial poke at x86 and then gave up, because I don't know all the flags I'd need to use. If I had an x86 example to contrast with then I'd have something tangible to point at when I ask for the same feature in RISC-V.

Mention of ERMS raises an interesting point. The FSRM feature (not ERMS, because that seems to be for larger copies) seems to have pretty much the same application as RVV's variable-length vle/vse pair, which circles back to my original comment that maybe x86 can also benefit from a simplified main loop where length overflow checks are elided in favour of hardware features, and we have fewer branch mispredicts to deal with. But I don't know what the caveats of FSRM are. Maybe it's less useful than it looks (just like masked load/store).
That's the thing: for the most part the memcpys are being translated into small copy loops with the compiler builtins. By using memcpy we usually end up with proper alignment semantics and defined behavior, particularly when something aliases. E.g. what I did here (https://github.com/zlib-ng/zlib-ng/blob/develop/inflate_p.h#L183) was to intentionally use meaningful copy lengths so that this loop body is a handful of move instructions with jumps (arguably a case/switch table might have been better here). The case above it, where it is a single memcpy call at an indefinite but possibly short length, I still found beneficial here in place of what the old chunkcopy_safe had been. Simply jumping with a goto into the copy loop below was not beneficial versus that single memcpy statement.
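Roughly, the shape of that idea is a ladder of constant-length memcpy calls, so each one lowers to a couple of plain loads and stores. This is only a paraphrase of the technique, not the actual inflate_p.h code:

```c
#include <stdint.h>
#include <string.h>

/* Dispatch on the (small, known-bounded) remaining length so every memcpy has a
 * compile-time-constant size and the compiler emits straight-line moves. */
static inline uint8_t *copy_ladder(uint8_t *out, const uint8_t *from, unsigned len) {
    switch (len) {
    case 8: memcpy(out, from, 8); break;
    case 7: memcpy(out, from, 7); break;
    case 6: memcpy(out, from, 6); break;
    case 5: memcpy(out, from, 5); break;
    case 4: memcpy(out, from, 4); break;
    case 3: memcpy(out, from, 3); break;
    case 2: memcpy(out, from, 2); break;
    case 1: *out = *from; break;
    }
    return out + len;
}
```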
But these memcpy operations don't appear in the profile, do they? They're part of the calling function. What I'm saying is that I see a huge chunk of the profile appearing in the libc memcpy function. Part of that is the naive scalar implementation that I have, but just making a function call (any at all) is more complex than a short memcpy on RVV, so even calling a good implementation is probably a bad idea. Checking the disassembly, they do all appear to be variable-length copies. BTW, if you want to use a pattern like that you might want to experiment with doing something like this:
You may want to capture the entire call graph with your profile. I suspect a fair number of memmove/memcpy calls are going to be those massive window-sized copies toward the end. FWIW those incidental copies are done at the same time as an adler checksum on x86, since the additional instructions slot in well with most CPU pipelines and added very minor complexity.
Doesn't seem to be that. In the trace I have handy
Just to be certain I implemented fold_copy for RVV:
And that reduced the
Oh cool, I encourage you to open up a PR with that change. It's been a while since I rewrote chunkcopy_safe with that memcpy, and I wish I had kept more breadcrumbs from my benchmarking and optimization efforts. I do recall it being faster, but if removing the explicit memcpy of indeterminate length does improve performance measurably, it's definitely worth considering. That's the only place I can recall that's not going to translate to a builtin (the other chunk-based copies translate directly on most ISAs to unrolled SIMD-based copies in some form or another). Somewhat tricky is that if you do capture a call graph, any memcpy calls are likely to be shown being called from inflate_fast, since they themselves are likely embedded in inlined functions. You may have to systematically substitute memcpy for some stubbed routine until you figure out where the volume of explicit libc calls is coming from.
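One way to do that stubbing (an illustrative helper of my own, not something in zlib-ng): a never-inlined byte-loop wrapper that shows up under its own symbol in the profile, swapped in for memcpy one suspect file at a time:

```c
#include <stddef.h>

/* Plain byte loop so no libc memcpy is involved; noinline keeps it visible in perf. */
__attribute__((noinline)) void *memcpy_probe(void *dst, const void *src, size_t n) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* In the file under suspicion, after the system headers:
 *   #define memcpy memcpy_probe
 */
```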
So @sh1boot, a significant amount of profiling time does show up for me in glibc's memcpy/memmove; however the call graph, if it's to be believed, shows it is not called from the fast path (inflate_fast). A large amount of it is also called from within libpng, which we can't do much about. But perhaps this helps narrow your search window? I don't think inflate_fast is being inlined in inflate anymore, now that it's a separate dispatch in the function table.
@sh1boot possibly here? https://github.com/zlib-ng/zlib-ng/blob/develop/inflate.c#L724
It seems to be a 50:50 split between something in chunkset_rvv.c and inflate_p.h. Fixing chunkset_rvv.c itself halved the number of calls, changing inffast_chunk_tpl.h didn't do much of anything, and then the count dropped to something sensible when I got to inflate_p.h.

I do not know how this is reflected on x86. Your profile suggests you have hardware which finds its way into an 'erms' path, so maybe you could try extending this patch to implement a memcpy_erms() which does a simple, inline

A quick Google suggests the full syntax might be:

(updated to reflect my current understanding --
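For reference, a minimal inline REP MOVSB sequence along those lines might look like the sketch below (my own guess at such a memcpy_erms() helper, GCC/Clang inline asm, x86-64 only; not the glibc implementation):

```c
#include <stddef.h>

/* REP MOVSB copies RCX bytes from [RSI] to [RDI], updating all three registers. */
static inline void *memcpy_erms_sketch(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
}
```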
I found an x86 machine to try it on, and it was something like a 10% speed-up, by the way. The same test was nearer 25% on RISC-V, but I think a lot of that is because the memcpy implementation I'm linking is not great.
If it is inflate_p.h, then that's certainly chunkcopy_safe. I'll do some benchmarking with that tonight and see if I can see a measurable gain.
Does anyone know if there's an upper bound for the sizes fed into chunkcopy_safe? I'm trying to do some mental calculus here for whether it's even worth attempting to be FSRM-aware. I kind of suspect it isn't, because it seems like even with FSRM there are a bunch of asterisks for when it actually wins.

Edit: hmm, looks like it's maybe 255 bytes. FSRM and even ERMS were supposed to be fast around 128 bytes and higher. That might explain why on highly compressible stuff I was seeing gains for using memcpy. There is still one memcpy call of variable length; maybe eliminating one of them lets us have our cake and eat it too? Only for x86 of course, but... something? I can try to do the ERMS/FSRM sequence with an inlined function too perhaps, but glibc does a whole lot of analysis before determining that it's worth it.
So, I definitely didn't hallucinate the gain; it gets decidedly worse on x86 for all cases if I cut out the memcpy and jump straight to the copy ladder at the bottom (similar to what this function looked like prior to me getting to it). And actually, from what I recall this function didn't actually behave correctly and we were just getting lucky with copy sizes, as it didn't properly emulate a self-modifying copy loop like the LZ decoding loop expects. Let me try to conditionally do these copies, though.
That example you pointed to earlier from Cloudflare: I think I get the gist of what it's doing, but man, it is not clear how it works. I think it's trying to copy through the overlap and then rewind such that the periodicity of the copy is a modulus of a typical chunk size. Maybe I can experiment with that technique, too. At least in doing that, the unravelling copies require far less branching.
So, that unroll function starts with the observation that if offset is 1, then after the first byte copied there's enough room to double the offset (i.e., set it to 2) and copy byte pairs from then on. Then, if offset is 2, then after the first byte-pair copy there's enough room to double it again. But if offset was 3 or more then you can't do that. So it proceeds exponentially, trying to double offset after more room is made, but if there's not enough room then it has to dial offset back a bit.

Lucky thing is, if doubling offset overflows it won't overflow by more than the original value of offset. At the beginning offset is the same as the initial offset, and for as long as it overflows it'll double and then retreat back to its original value, until the doubling is finally larger than the original offset and there's room to double the offset as well. From then on you can either double or double-minus-one the offset, depending on what its initial value was. If the initial value is a power of two then you always double. If it's three then you can double it once you have four bytes unrolled (plus the original three makes seven) for an offset of six. Then at an unroll of eight (plus three makes eleven) you can't double six, but you can double and knock three off for nine. Each step means you have at least enough unrolled data for the next copy stage.

Not really sure about performance, though. It's a rapid cycle of reading the data just written at a different offset and size, which can be hard for the CPU to untangle. Hopefully the number of iterations is few enough that it doesn't cause a serious blockage, because the data isn't critical path.

As for

In fact, the definition of that instruction is that it must behave "as if" it's a trivial byte copy loop even when source and destination overlap. That means that technically you can use it for unrolling short patterns. But I think what will actually happen is that the CPU will get mad with you and do it really slowly. You never know, though. It eliminates more conditional branches.
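A stripped-down sketch of that exponential idea, in my own simplified form rather than the Cloudflare code: each memcpy duplicates everything unrolled so far, so the contiguous pattern roughly doubles per step and none of the memcpy calls overlap:

```c
#include <stddef.h>
#include <string.h>

/* Fill len bytes at dst by repeating the `offset` bytes immediately before it,
 * growing the unrolled region exponentially instead of copying byte by byte. */
static void pattern_fill_sketch(unsigned char *dst, size_t offset, size_t len) {
    const unsigned char *src = dst - offset;
    size_t avail = offset;              /* contiguous pattern bytes behind dst */
    while (len > avail) {
        memcpy(dst, src, avail);        /* duplicate everything unrolled so far */
        dst += avail;
        len -= avail;
        avail *= 2;                     /* the unrolled region just doubled */
    }
    memcpy(dst, src, len);              /* final partial copy */
}
```

Because dst always sits a whole number of periods past src, each memcpy extends the repeating pattern exactly as a byte-by-byte loop would, but in O(log(len/offset)) calls.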
Well, eh, this makes things even more interesting: https://lock.cmpxchg8b.com/reptar.html
I don't have a CPU affected by that microcode fix, but I wonder how much that alters the FSRM calculus. I haven't forgotten about this; I do intend to try your suggestion of just doing the REP MOVSB sequence unconditionally to see what happens. I suspect slow things, but you never know.
@sh1boot a recent bug with regard to chunk-based memcpy'ing reminded me to revisit this. It's not better, though not as appreciably worse as I thought it would be to force an ERMS sequence. This is on a thin U-series Haswell CPU, so it's not exactly a great test bed since it lacks FSRM. Interesting, nonetheless. That benchmark is a trivial-to-compress load. The more realistically compressed imagery shows it's appreciably worse:
@Adenilson hey, do you have an FSRM device to try this on? I'm really curious to see when complicated memory implementations become the "legacy" path for old chips.
We should probably start an FSRM thread or something... Here:
We try to:
to make good use of the RVV advantage.
We get a ~5.4% performance gain in decompression compared to loading/storing a single 8-byte chunk on the SiFive internal FPGA.
Benchmark tool: deflatebench