I've recently taken another close look at the question of whether piet-gpu can be ported to run on WebGPU. The good news is that I think it is possible to do that now. The bad news is that it exposes a number of rough edges. Some of those have to do with the spec itself, but I'm also seeing some significant challenges in implementing the spec. Many of these issues are cross-cutting, and no doubt some should be split out into different bugs or filed against different components, but I'm putting it all in one place here because I think it might be a useful discussion.
I started by porting my prefix sum algorithm. To review, this uses a decoupled look-back algorithm, which involves communication between workgroups so the task can be completed in a single dispatch. That is a relatively advanced use of atomics, but I think it can be made to run on DX11. Thus, I think it's a good benchmark of compute capabilities.
This port does not run successfully on my AMD 5700 XT. I believe that's a problem with the AMD Vulkan driver (Windows 10), but even so, it exposes an alarming gap in validation of the atomic memory model on which WebGPU relies. There are multiple issues, and I'll break them out below.
Message passing litmus test failures
One of the core mechanics of decoupled look-back is for a workgroup to mark a partition as "done," so that another workgroup can consider its dependency on that partition's result resolved, and continue work. This is a classic application of the message passing atomic pattern: store the data, store a flag with release semantics, then another thread will load the flag with acquire semantics, then load the data. This ordering ensures that the data load will be valid.
My development machine does not pass this test. I find this surprising, because there is a message passing test in the memory model section of the Vulkan CTS, and there is also a test in the webgpu litmus project. I am still working on isolating the exact cause of the difference between those results and mine, but I believe it is fundamentally that the existing tests are based on a scalar model (adapted from CPU land), while my test performs many such atomic interactions in parallel.
My code is basically the vanilla message passing test run in parallel, with each thread performing first the store role and then the load role sequentially. Results don't change if that order is reversed, or if the test is reorganized so that half the threads take one role and half the other. A note of caution, however: when writing such tests one must be careful not to put the storageBarrier() in divergent control flow, which is easy to do if there's a branch to select the role. When I tested this, I made sure the role was uniform at workgroup granularity. The current version of my test has no control flow at all; I consider that a significant simplification.
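To make the shape of the test concrete, here is a hedged sketch of such a parallel message-passing litmus test in WGSL; the buffer names, partner mapping, and workgroup size are illustrative, not my actual test code. WGSL atomics are relaxed, so storageBarrier() stands in for the release/acquire fences on each side.

```wgsl
// Hedged sketch, not the actual test. `flags` is assumed to be
// zero-initialized by the host before the dispatch.
@group(0) @binding(0) var<storage, read_write> data: array<u32>;
@group(0) @binding(1) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(2) var<storage, read_write> results: array<vec2<u32>>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    // Store role: publish the data, then set the flag.
    data[i] = i + 1u;
    storageBarrier();   // stands in for a release fence
    atomicStore(&flags[i], 1u);

    // Load role: read a partner's flag, then its data. The partner
    // is in a different workgroup, so each pair is an
    // inter-workgroup message-passing interaction.
    let j = (i + 64u) % arrayLength(&data);
    let flag = atomicLoad(&flags[j]);
    storageBarrier();   // stands in for an acquire fence
    let value = data[j];

    // If flag == 1u, the memory model requires value == j + 1u;
    // the host scans `results` for pairs that violate this.
    results[i] = vec2<u32>(flag, value);
}
```

Note that there is no control flow at all, so there is no question of the barriers being placed in divergent branches.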
While it's hard to be sure, I believe this is an AMD driver/GPU issue. I've looked hard at the SPV (translated by both naga and tint), and also the gfx1010 ISA from Radeon Graphics Analyzer. It's hard to see what's going wrong there; there is a buffer_gl0_inv and buffer_gl1_inv between the flag load and data load, which is what I expect as the correct translation of the storage barrier.
Another thing to say about this test: the current version has the flag in an atomic<u32> and the data in non-atomic memory, which would seem to require some form of coherence, or marking the non-atomic load/store as nonprivate (see #1621); however, the test also fails in the same way if the data is moved to an atomic type. My personal feeling on coherence is that it seems like a good idea if it can be implemented efficiently and reliably, but I think that's a risk. I would also be fine with coherent behavior being completely opt-in, if the upside is that non-atomic memory access reliably translates to ISA that utilizes the GPU's cache. (I've been bitten by that not happening, and in any case feel that a performance test suite is in order; that's something I might work on.)
I think the main path forward on this sub-issue is to make sure that there is adequate testing in place; I think the spec is basically fine, though there may be some tradeoffs which hopefully I've helped to illuminate.
Coherence of relaxed atomic loads
This is potentially a complex issue, and my understanding may not be complete. It's possible that the Vulkan memory model addresses this, and that relying on it by reference is adequate. But I have yet to be convinced.
My understanding of relaxed atomic loads and stores is that they're always considered coherent. Specifically, if one workgroup does a store and another does a load after a reasonable amount of time (where reasonable is within an order of magnitude or two of a microsecond), then the load should see the fresh data. In my prefix sum code, I'm seeing delays of around 10ms, and I don't think it's 100% reliable even at that.
Again I looked at the SPV and ISA. The relaxed atomic load is translated to an OpAtomicLoad with (1, 64) semantics, then into a buffer_load_dword with the slc bit set (but not dlc or glc; see the "GLC, DLC and SLC Bits Explained" section of the RDNA 1.0 ISA doc). I believe that's a mistranslation, and, looking at the LLVM docs, I think it may be treating "relaxed" in the SPV as "unordered" in LLVM, where "monotonic" is more accurate. See the "AMDHSA Memory Model Code Sequences GFX10" table in the linked LLVM doc for more discussion. I believe the correct translation has dlc and glc set, and if I apply workarounds to get that behavior, this part of the test passes.
(One workaround is to do an atomicOr of 0. Another, which produces the dlc + glc ISA, is to mark the buffer as coherent using OpMemberDecorate, then do the load with OpLoad instead of OpAtomicLoad; this is what GLSL produces when a "coherent" qualifier is placed on the buffer and a regular load is done.)
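For concreteness, the first workaround looks like this in WGSL (the names flags and ix are illustrative):

```wgsl
// Instead of a relaxed load, which lowers to the stale slc-only load:
//     let flag = atomicLoad(&flags[ix]);
// do a read-modify-write that returns the old value and leaves
// memory unchanged, but takes the atomic RMW path in the generated ISA:
let flag = atomicOr(&flags[ix], 0u);
```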
I'm going into some detail here, because I can't find the language that specifies precisely what it means to be coherent. Obviously a short (submicrosecond) propagation delay is acceptable with relaxed semantics, so one interpretation that I haven't seen ruled out yet is that the exact magnitude of the propagation delay is a performance issue and not a correctness issue. In this interpretation, the load with slc is valid ISA.
I'm still trying to wrap my head around how to write a test for this behavior. It's more of a forward progress concern, and we're very careful to avoid forward progress guarantees. (Indeed, we now have empirical evidence that not all existing GPUs can provide a strong forward progress guarantee). It's possible this will end up as a performance test, effectively measuring a forward progress expectation rather than guarantee.
To be clear, what I'm going for in my code is a bounded spinlock, where some progress is made each "spin," so the shader completes even when there is a forward progress failure. I would like to be able to rely on short inter-workgroup propagation delays most of the time.
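To make that concrete, here is a hedged sketch of a look-back loop with that shape; the flag values, buffer names, partition size, and the recompute fallback are illustrative, not piet-gpu's actual code. The point is the structure: every path either finishes, steps back one partition, or falls back after a bounded number of spins, so the shader terminates even if a flag never becomes visible.

```wgsl
const FLAG_AGGREGATE: u32 = 1u;
const FLAG_PREFIX: u32 = 2u;
const MAX_SPINS: u32 = 64u;        // illustrative bound
const PARTITION_SIZE: u32 = 256u;  // illustrative

@group(0) @binding(0) var<storage, read> inputs: array<u32>;
@group(0) @binding(1) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(2) var<storage, read_write> aggregates: array<u32>;
@group(0) @binding(3) var<storage, read_write> prefixes: array<u32>;

// Hypothetical fallback: re-sum a partition's inputs directly.
fn recompute_aggregate(part: u32) -> u32 {
    var sum = 0u;
    for (var k = 0u; k < PARTITION_SIZE; k = k + 1u) {
        sum = sum + inputs[part * PARTITION_SIZE + k];
    }
    return sum;
}

// Look back from partition part_ix, returning its exclusive prefix.
fn look_back(part_ix: u32) -> u32 {
    var ix = part_ix;
    var exclusive = 0u;
    var spins = 0u;
    loop {
        if (ix == 0u) { break; }
        let flag = atomicLoad(&flags[ix - 1u]);
        // (The acquire-side ordering between this flag load and the
        // payload loads below is elided here; see the next section.)
        if (flag == FLAG_PREFIX) {
            exclusive = exclusive + prefixes[ix - 1u];
            break;              // full prefix available: done
        }
        if (flag == FLAG_AGGREGATE) {
            exclusive = exclusive + aggregates[ix - 1u];
            ix = ix - 1u;       // progress: look back one partition
            spins = 0u;
            continue;
        }
        spins = spins + 1u;
        if (spins >= MAX_SPINS) {
            // Forward progress has failed: recompute the missing
            // aggregate rather than spinning forever.
            exclusive = exclusive + recompute_aggregate(ix - 1u);
            ix = ix - 1u;
            spins = 0u;
        }
    }
    return exclusive;
}
```

The fallback path is what bounds the spin, but it is much slower than consuming a published aggregate, which is why short propagation delays matter so much in practice.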
storageBarrier forces uniform control flow
My prefix sum shader fails to compile on DX12 on wgpu, and the reason is a bit involved. For correctness, I need a memory barrier between the flag load and the subsequent data load. All of this needs to be in a loop, which spins until the flag reaches a particular value. The best way to specify this is to annotate the flag load with acquire semantics; the second best is to put an acquire/release memory barrier after the flag load. But neither of those is available in WGSL, and the only way to get this correct is to use a storageBarrier(), which is also a control barrier and therefore requires workgroup-uniform control flow.
My control flow is in fact workgroup uniform, but a simplistic program analysis (as is done by D3DCompiler / FXC) cannot infer that. The reason it's uniform is that one thread is loading the flag into shared memory, then there's a workgroup barrier, and then all threads participate in a loop where the predicate is controlled by that flag.
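Here is a hedged sketch of the shape of that control flow (simplified, with illustrative names; my actual shader does more per iteration):

```wgsl
const FLAG_NOT_READY: u32 = 0u;

@group(0) @binding(0) var<storage, read_write> flags: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> payload: array<u32>;

var<workgroup> wg_flag: u32;

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) lid: vec3<u32>,
        @builtin(workgroup_id) wg_id: vec3<u32>) {
    let part_ix = wg_id.x;
    loop {
        // One thread polls the flag into workgroup shared memory.
        if (lid.x == 0u) {
            wg_flag = atomicLoad(&flags[part_ix]);
        }
        // After this barrier every thread sees the same wg_flag, so
        // the branch below is uniform at workgroup granularity.
        workgroupBarrier();
        if (wg_flag != FLAG_NOT_READY) {
            // Needed for correctness: order the flag load above
            // before the data load below. This control flow is in
            // fact uniform, but FXC cannot prove that, and rejects
            // the DeviceMemoryBarrierWithGroupSync it lowers to.
            storageBarrier();
            let x = payload[part_ix];
            // ... consume x ...
            break;
        }
        // Keep the next iteration's flag write from racing this
        // iteration's reads of wg_flag.
        workgroupBarrier();
    }
}
```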
DXC has no problem with the naga-generated HLSL. FXC would also accept it if the DeviceMemoryBarrierWithGroupSync were replaced by a plain DeviceMemoryBarrier, but there's no way to express that in WGSL. (I can get exactly that from GLSL via spirv-cross, by starting with memoryBarrierBuffer.)
I'm not sure this should motivate a spec change; I appreciate the simplicity of having only two barriers, with workgroup and device scope. But it does make it more challenging to translate WGSL to existing GPUs. I personally think the long term solution is for naga to emit both DXIL (for DX12) and DXBC (if DX11 compatibility is sought).
atomic<> as a type, not an operation
WGSL follows C++ and Rust by having an atomic type, rather than the GLSL (and HLSL) approach of allowing atomic operations on ordinary memory locations. I see the value in this approach, but it also causes problems.
The first problem is that it's much harder to translate between different shader languages; both naga and tint just give up when they see the atomics in the piet-gpu SPV. It might be possible to brute-force a translation, but there would be compromises. Such a GLSL (SPV) to WGSL translation would probably end up with the memory buffer being declared as array<atomic<u32>> and all loads and stores rewritten as atomic operations; that in turn may lead to performance loss, as the generated ISA would bypass the L0 and L1 caches.
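As a hedged illustration of what such a brute-forced translation might look like (the buffer name and accesses are made up for the example):

```wgsl
// A GLSL `uint memory[]` buffer becomes an array of atomics, and
// every plain access is rewritten as a relaxed atomic op, losing
// the cached (non-atomic) access path.
@group(0) @binding(0) var<storage, read_write> memory: array<atomic<u32>>;

fn demo(base: u32, offset: u32) {
    // GLSL: uint x = memory[base + offset];
    let x = atomicLoad(&memory[base + offset]);
    // GLSL: memory[base + offset] = x + 1u;
    atomicStore(&memory[base + offset], x + 1u);
}
```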
The second problem is that piet-gpu currently uses an architecture of a large memory pool (uint[] buffer) containing a diverse assortment of actual object types. We use auto-generated accessor code to provide a higher level interface to these objects, so you're not manually writing memory[base + offset]; allocation is done in the shader using an atomic bump allocator.
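As a hedged sketch (illustrative names, overflow checking elided), the allocator itself is just an atomicAdd on a bump pointer:

```wgsl
@group(0) @binding(0) var<storage, read_write> memory: array<u32>;
@group(0) @binding(1) var<storage, read_write> alloc_ptr: atomic<u32>;

// Returns the base index (in u32 words) of a freshly allocated
// object; generated accessor code then reads and writes
// memory[base + field_offset] on the caller's behalf.
fn malloc_words(size_in_words: u32) -> u32 {
    return atomicAdd(&alloc_ptr, size_in_words);
}
```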
Basically, accurate types for objects work in C++, Rust, and related languages because it's possible to situate those types in memory in a variety of ways. Since WGSL lacks those allocation mechanisms, we have to treat memory as untyped at the shader-code level (though not necessarily in a higher-level language that targets WGSL), and there the insistence on more precise types gets in the way.
The main place piet-gpu needs these atomics in the main memory pool is in coarse path rasterization, where it builds per-tile linked lists of path segments. I think this is not terribly unusual in advanced shaders; for example, a similar technique is used in order-independent transparency. And I wouldn't be surprised if there were other advanced game renderers that used atomics in a similar way.
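To make the technique concrete, here is a hedged sketch of pushing onto a per-tile linked list (names are illustrative, not piet-gpu's actual code). In GLSL the head pointers can live in the same uint[] pool; in WGSL they must either be split out into a separate array<atomic<u32>> binding, as here, or the whole pool must become atomic.

```wgsl
const NIL: u32 = 0xffffffffu;   // empty-list sentinel; the host
                                // initializes tile_heads to NIL

@group(0) @binding(0) var<storage, read_write> memory: array<u32>;
@group(0) @binding(1) var<storage, read_write> alloc_ptr: atomic<u32>;
@group(0) @binding(2) var<storage, read_write> tile_heads: array<atomic<u32>>;

// Push one segment onto a tile's list. The lists are consumed by a
// later dispatch, so only the head swap itself needs to be atomic.
fn push_segment(tile_ix: u32, seg0: u32, seg1: u32) {
    // Bump-allocate a 3-word node: next pointer + two payload words.
    let node = atomicAdd(&alloc_ptr, 3u);
    memory[node + 1u] = seg0;
    memory[node + 2u] = seg1;
    // Swap this node in as the new head, linking it to the old one.
    let prev = atomicExchange(&tile_heads[tile_ix], node);
    memory[node] = prev;
}
```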
I don't have a real recommendation here; I can see how backing off the current atomic approach to make it more like existing shader languages is probably a nonstarter. But I do think it's something people should be aware of.
Conclusion
Thanks for reading this far. I continue to be excited about WebGPU, and would love it to be the main way we ship piet-gpu. However, it doesn't feel quite ready yet, and I hope I've explained clearly some of the ways in which it seems less suitable for our needs than our current path of building our own GPU abstraction. Ideally, some of those rough edges can be smoothed down.