mapAsync(WRITE) and zero-filling #2926
This is required to make UMA and non-UMA devices behave portably.

It would be unfortunate if we had to run memset() on the CPU for triply-mapped buffers on UMA (GPU process, web process, and GPU). The only way to offload the zero-filling work to the GPU is to do it in mapAsync(), and if we do it in mapAsync() then it has to clear the whole mapped region (not just the regions getMappedRange() is called on).

(I also didn't see anything in the definition of getMappedRange() about zero-filling, so I'm assuming this is just an oversight.)
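For illustration, a small sketch of the distinction the description draws between zero-filling in mapAsync() versus in getMappedRange() (assumes an existing device; the sizes are made up):

```js
const buffer = device.createBuffer({
  size: 1024 * 1024,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});

// Map the whole megabyte for writing, but only expose a small window to script.
await buffer.mapAsync(GPUMapMode.WRITE, 0, 1024 * 1024);
const view = new Uint8Array(buffer.getMappedRange(0, 256));
view.fill(0xff);
buffer.unmap();

// Zero-filling in mapAsync() would have to clear the entire mapped megabyte;
// zero-filling in getMappedRange() would only have to clear the 256 bytes requested.
```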
I just realized that https://github.com/gpuweb/cts/blob/main/src/webgpu/api/operation/buffers/map.spec.ts#L318 explicitly populates a buffer, maps the buffer for WRITE, then reads the data from the WRITE mapping. This is pretty unfortunate because it means that, on non-UMA systems, mapping for WRITE has to actually download the data from the device, just for it to be clobbered by JavaScript.
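A minimal sketch of the kind of pattern that test exercises, purely for illustration (assumes an existing device; the values and the use of mappedAtCreation to populate the buffer are illustrative, not copied from the test):

```js
const buffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Uint32Array(buffer.getMappedRange()).set([1, 2, 3, 4]); // populate the buffer
buffer.unmap();

await buffer.mapAsync(GPUMapMode.WRITE);
const data = new Uint32Array(buffer.getMappedRange());
// Expecting to read [1, 2, 3, 4] here forces a non-UMA implementation to keep
// (or download) the buffer's current contents just to service a WRITE mapping.
buffer.unmap();
```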
Can you just map any memory, as long as that memory came from the same page? In other words, could you keep a cache of ArrayBuffers behind the scenes and hand one of them back whenever a buffer is mapped? Is that a solution, or is it scary because it means potentially different behavior across implementations if the way they implement their cache is different?
Oh, maybe this is why this validation rule exists during buffer creation: if a buffer's usage includes MAP_WRITE, the only other usage it may have is COPY_SRC.
I guess the idea is that, if you can guarantee that the only way data gets into the buffer is via buffer mapping (or writeBuffer()), then the web process can just keep a shadow ArrayBuffer around, holding the contents that were last uploaded into the buffer. That seems like a pretty high penalty, though, just to enforce portability between UMA and non-UMA devices. Zero-filling would also achieve that portability without the requirement that MAP_WRITE buffers can't be used for anything else.
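A rough sketch of that shadow-copy scheme, purely for illustration (the class and the uploadToGpu hook are made up; a real implementation lives inside the browser, not in script):

```js
// Hypothetical bookkeeping for a MAP_WRITE | COPY_SRC buffer on an implementation
// that cannot triple-map. The shadow ArrayBuffer lives in the web process.
class ShadowedMapWriteBuffer {
  constructor(size, uploadToGpu /* hypothetical upload hook */) {
    this.shadow = new ArrayBuffer(size); // last-known contents of the buffer
    this.uploadToGpu = uploadToGpu;
  }
  async mapAsync() {
    // No readback needed: mapping (or writeBuffer) is the only way data ever got
    // into this buffer, so the shadow copy is already up to date.
  }
  getMappedRange() { return this.shadow; }
  unmap() { this.uploadToGpu(this.shadow); } // push new contents to the GPU-visible allocation
}
```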
Exactly, and the intent is that a shadow copy is kept, as you guessed. There is a single way for data to get into a MAP_WRITE buffer, so the shadow copy can stay in sync with the buffer's contents. If we were to fill the buffers with zeroes on mapping, then we would essentially perform half a copy more for uploads (a memset to 0 costs about as much memory bandwidth as half a memcpy). That copy would happen on the CPU in the web process when buffers aren't triply mapped (which should be common; triple mapping will stress the OS and is not available everywhere anyway), so it would be quite expensive. Note that in the short proposal for UMA mapping,
Presumably, when people map, they want to overwrite data. Zeroing that data would let us not have to keep a shadow copy around, but it would be kinda pessimal for large uploads like images or other large buffers to zero them first and then, only after zeroing, overwrite (almost?) all of the buffer again with the data to upload. If an author is fine with zero-filled data, I believe that's what mappedAtCreation: true gives you during createBuffer, so they might have the opportunity to choose that tradeoff, if using a new buffer object as a one-shot upload is an option. We can also talk about the behavior of getMappedRange, if doing zero-filling there would be useful instead. Having to zero-fill the whole mapAsync range feels heavy.
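For illustration, that one-shot upload path might look like this (assumes an existing device, a dst buffer with COPY_DST usage, and a data Uint8Array whose length is a multiple of 4):

```js
// A buffer created with mappedAtCreation: true starts out zero-filled, so an
// author who is happy with zeroed contents never needs mapAsync(WRITE) at all.
const staging = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Uint8Array(staging.getMappedRange()).set(data);
staging.unmap();

const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, dst, 0, data.byteLength);
device.queue.submit([encoder.finish()]);
staging.destroy(); // one-shot: the staging allocation can go away immediately
```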
For background for tomorrow's discussion: the original premise of this PR was that getMappedRange was zeroing instead of mapAsync - and that it should be moved to mapAsync. However, this was incorrect, as no zeroing occurs in either place. Mapping for WRITE gives you the current data in the buffer, which is usually expected to be kept in a shadow copy on non-triple-mapping systems. So the tradeoff to discuss here, I think, is to add the cost of zeroing on all systems, and remove the shadow copy memory cost on non-triple-mapping systems.
Note that browsers could remove shadow copies in GPU-renderer process shmem if they have memory pressure, and recreate the shmem the next time the buffer is used. When that happens, a memcpy is needed instead of a memset to 0, so it's more expensive. We could also have a
Sorry, I meant to write this last week but got sidetracked. This issue is incorrectly titled, and muddled in what it's asking for.

Current state

Right now, 2 things are required of a conformant implementation:

1. Mapping a buffer for writing must expose the buffer's current contents to script.
2. A buffer with MAP_WRITE usage may carry no other usage except COPY_SRC, and implementations must reject anything else at createBuffer() time.
On a discrete GPU, implementations would want to avoid downloading the data in the buffer just to service a WRITE mapping. So, the implementation ends up keeping a shadow copy of the buffer's contents around in CPU-visible memory and hands that out as the mapping.

Analysis

Both of the requirements listed above are unfortunate: On mobile devices, it's incredibly wasteful to allocate 3x as much memory as is necessary.

Proposal

It doesn't have to be this way, though! The contents of the WRITE mapping could instead be defined to be zero-filled.

The downside to this is that, yes, the mapping operation would have to zero-fill some memory (or run a system call to get pre-zero'ed memory). This can be somewhat mitigated by having the GPU zero-fill the memory, rather than the CPU literally calling memset(). The upside, however, is a 33% (or possibly 66%) memory reduction for common usage patterns.

By zero-ing out the data, we don't need to have a shadow allocation; we just need to cause either the GPU or the CPU to zero-fill the mappable destination buffer directly. For a triply-mapped buffer (which we expect to be the common case), the GPU can do that zero-filling without any CPU-side work.

But it gets better, because, if we don't need to maintain shadow buffers, we can open up write access to buffers with usages beyond COPY_SRC. On a discrete GPU, an application might not want to map such a buffer directly, since its memory has to stay CPU-visible, but on UMA devices it removes a whole copy.
If we did something like this, then instead of allocating 3x as much memory as is necessary, we'd only be allocating 1x as much memory as is necessary. And, not only that, but it becomes way easier for authors to write WebGPU applications - they don't have to write code to make twice as many buffers as they need and shuffle data between them.
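A sketch of what that could look like for authors; the usage combination below is rejected by validation in the current spec and is shown only as a hypothetical (vertexData is assumed to exist):

```js
// Hypothetical: today this createBuffer() call fails validation because MAP_WRITE
// may only be combined with COPY_SRC, forcing a separate staging buffer plus a
// copyBufferToBuffer. Under the proposal, the real buffer could be mapped directly.
const vertices = device.createBuffer({
  size: 64 * 1024,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.VERTEX, // invalid today
});
await vertices.mapAsync(GPUMapMode.WRITE);                  // zero-filled under the proposal
new Float32Array(vertices.getMappedRange()).set(vertexData);
vertices.unmap();                                           // no second buffer, no copy
```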
Very useful post, thanks. I'm confused about one thing. You say zeroing would benefit triple-mapping implementations because the clearing could be done on the GPU. But wouldn't keeping the original contents be even better? Then triple-mapping implementations don't have to clear at all. That said, I think in theory, zeroing could still be better, if there is memory compression for zeroed pages. Then you could avoid flushing all those zeroes out to main memory (and loading them back in if they get read, regardless of whether by the GPU or the CPU). I have no idea whether anything like this is possible though.
Right, that’s a good point. On a UMA system, the triply-mapped buffer gets the data for free. I’m imagining a world like this:
Pros:
Cons:
I think the pros outweigh the cons.
Ah, got it, what I got confused about was triple-mapping on non-UMA.
GPU Web 2022-10-05
There are two aspects I see to the proposal:

1. Allowing usages other than COPY_SRC on MAP_WRITE buffers.
2. Returning a zero-filled mapping instead of the buffer's current contents.
We don't think we should pursue either of these changes and should keep the current spec as-is. We could improve memory usage and reduce copies in a UMA extension.

Part 1:

Allowing usages other than COPY_SRC with MAP_WRITE is just like issue #2388. While it would be great to reduce the number of copies on unified memory architectures, lifting this restriction should be done in a fuller proposal for a UMA extension. If done poorly, lifting the usage rules would make it easy for developers to use an extremely inefficient path on discrete GPUs. Myles had some good ideas here - maybe a bake() step so that flushing updates is explicit. Regardless though - we need a fully fleshed out proposal for how this (or something else) will work that will allow developers to get good behavior on both UMA and discrete systems without performance pitfalls.

Part 2:

I'm not convinced by the purported memory savings of returning a zero-filled mapping. As discussed in today's meeting, at least Safari/Chrome expect to be able to use "triple mapping" everywhere, so there should never be a need for a CPU-side shadow allocation. On all platforms, a MAP_WRITE|COPY_SRC buffer is backed by a single buffer allocation that is both visible to the web page and accessible to the GPU.

The remaining memory savings argued in Myles' proposal come from the fact that, since the contents of the mapping are zero-filled, the implementation is free to make their storage temporary and transient. While this is true, it is exactly the same memory savings the application would get if they explicitly destroyed their staging buffer and recreated it when they needed it again. It is better to give control of allocations to the application so they can manage it themselves; the current design of buffer mapping already allows that.

So, given we don't see memory savings to zero-filling, we don't think it's worth doing. It also has a few downsides.
I do think zero-filling could be part of the future UMA extension - perhaps adding a GPUMapMode.ZERO. Zero-filling will be valuable in some situations on a discrete GPU. It would allow the implementation to avoid reading back data into the mappable staging buffer. This could happen in situations where the buffer has MAP_WRITE|STORAGE usage and the GPU modifies the contents in a shader. In the current spec, mapping the buffer after storage writes means those writes should be visible in the mapping. There are options here, but I think they should be left to the UMA extension.
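For concreteness, a hypothetical sketch of such a flag; GPUMapMode.ZERO does not exist in the current spec, and the MAP_WRITE|STORAGE usage shown also assumes the UMA extension (nextFrameData is assumed to exist):

```js
const buf = device.createBuffer({
  size: 4096,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE, // UMA-extension only
});
// ... the GPU writes to buf from a compute shader ...
await buf.mapAsync(GPUMapMode.WRITE | GPUMapMode.ZERO);     // hypothetical ZERO flag
// With ZERO, the implementation could hand back zero-filled memory instead of
// reading the shader's writes back into the mappable allocation.
new Uint32Array(buf.getMappedRange()).set(nextFrameData);
buf.unmap();
```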
I just want to clarify my understanding based on @austinEng's comments above. It sounds like a buffer created with usage: MAP_WRITE | COPY_SRC is not intended to be a GPU buffer at all. It's entirely a single allocation, ideally one that both the CPU and the GPU process can see. Those 2 flags make it a CPU->GPU staging buffer, and that's all it's useful for. If an implementation can do some optimization where it's actually allocated in memory the GPU can see, that's great, but without that optimization it's a single CPU RAM buffer, ideally in shared memory between the CPU and GPU process. It's possibly not a Metal/Vulkan/D3D buffer at all. Correct, or am I misunderstanding?
Sorry if that last post didn't make any sense. In my mind, given the restrictions of MAP_WRITE | COPY_SRC, I thought one valid implementation would be something like this: the buffer is purely CPU-side (ideally shared) memory, and the actual upload to the GPU happens when it's used as the source of a copy. I guess this is wrong though.
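A rough sketch of that idea, purely for illustration (the class and the enqueueCopyToGpu call are made up):

```js
// The "buffer" is nothing but (ideally process-shared) CPU memory; no
// Metal/Vulkan/D3D object needs to back it at all.
class CpuOnlyStagingBuffer {
  constructor(size) {
    this.shmem = new ArrayBuffer(size); // ideally shared between web and GPU process
  }
  async mapAsync() {}                    // nothing to download or wait for
  getMappedRange() { return this.shmem; }
  unmap() {}
}

// copyBufferToBuffer(src, srcOffset, dst, dstOffset, size) would then be the point
// where the bytes actually reach a real GPU allocation, e.g. (hypothetically):
//   enqueueCopyToGpu(src.shmem, srcOffset, dst, dstOffset, size);
```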
I believe your understanding is correct. And, you're right that the implementation may implement this buffer as a CPU-only shared memory buffer. Whether or not it is GPU-accessible is an implementation detail. However, we expect to be able to share the shared memory directly with the GPU everywhere. This is what we've been referring to in this issue as "triple mapping".
GPU Web Meeting 2022-10-12/13 APAC-timed
GPU Web meeting 2022-11-02/03 APAC-timed