This is an evolution of #154, initiated by the discussion in #418 (comment).
It can be seen as an alternative to #481. Also related to #488.
Problem
Data transfers are hard. Different platforms have different physical memory properties, as well as different API restrictions on how memory can be used. For example, there are unified (UMA) and non-unified memory architectures, and there are "hybrid" platforms that are non-unified in general but expose a small chunk of unified memory. In the API space, D3D12 has a notion of "upload" and "download" heaps that are only supposed to be used for staging data going in and out of the fast local VRAM.
The native APIs expose memory access to the CPU via mapping. On the Web, the situation is different:
- the client code and data may be separated from the GPU by a process boundary
- operations can't block
- concurrent access to the data hurts application portability
These differing needs ask us to consider different solutions for the API. In particular, it is appealing to provide an API that hides the UMA/non-UMA differences while still using the most efficient code path on each platform.
Solution
One of the important building blocks for a performant streaming/transfer system is a "staging belt" (the name is invented, don't search for it). It's a ring buffer of CPU-visible, persistently mapped buffers that temporarily host data (the "staging") while it's on the way to the GPU. The structure is FIFO, and allocation is linear (the "belt"), which makes the implementation fairly straightforward. This was originally done in Dawn for data uploads via the `GPUBuffer::setSubData` method.
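For intuition, here is a minimal sketch of the data structure. This is hypothetical code, not part of the proposal; GPU fencing and chunk recycling are elided:

```js
// Hypothetical sketch of a "staging belt": a FIFO ring of persistently
// mapped, CPU-visible chunks with linear (bump) allocation.
class StagingBelt {
    constructor(chunkSize) {
        this.chunkSize = chunkSize;
        this.chunks = [];   // FIFO: oldest chunk first, newest last
        this.offset = 0;    // bump pointer into the newest chunk
    }

    // Reserves `size` bytes of staging space, returning which chunk
    // holds them and at what offset.
    allocate(size) {
        const last = this.chunks[this.chunks.length - 1];
        if (!last || this.offset + size > this.chunkSize) {
            // The newest chunk is full: recycle the oldest one if the GPU
            // has finished with it, otherwise map a fresh chunk.
            this.chunks.push(this.tryRecycleOldest() || this.mapNewChunk());
            this.offset = 0;
        }
        const alloc = { chunk: this.chunks[this.chunks.length - 1], offset: this.offset };
        this.offset += size;
        return alloc;
    }

    tryRecycleOldest() {
        // A real implementation would check a GPU fence here and
        // `shift()` the oldest chunk; this sketch never recycles.
        return null;
    }

    mapNewChunk() {
        // Stand-in for allocating a persistently mapped, CPU-visible buffer.
        return { memory: new ArrayBuffer(this.chunkSize) };
    }
}
```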
In order to control the lifetime of the staging blocks, we need to associate the upload operations with a `GPUQueue`. This is the main difference from #154. The API could look like this:
```webidl
partial interface GPUQueue {
    // Start uploading data. The pass has to be finished before anything can be
    // done to the queue, such as submits or other upload passes.
    GPUUploadPass beginUpload();
};

dictionary GPUTextureDataLayout {
    unsigned long rowPitch;
    unsigned long imageHeight;
};

dictionary GPUUserData {
    ArrayBuffer data;
    unsigned long offset;
};

interface GPUUploadPass {
    // Stops recording the upload operations.
    // Detaches all the temporary `ArrayBuffer` objects.
    void finish();

    // Upload the CPU data into the specified destination buffer.
    // If `userData` is provided, its contents are used for the upload, and `null` is returned.
    // If `userData` is not provided, a new `ArrayBuffer` object is returned. In this case,
    // the `ArrayBuffer` gets detached at the point where `finish` is called.
    ArrayBuffer? uploadBuffer(
        GPUBuffer destination,
        GPUBufferSize destinationOffset,
        GPUBufferSize size,
        GPUUserData? userData);

    // Upload the CPU data into the specified destination texture.
    ArrayBuffer? uploadTexture(
        GPUTextureDataLayout layout,
        GPUTextureCopyView destination,
        GPUExtent3D size,
        GPUUserData? userData);

    // This may also need a staging buffer.
    void copyImageBitmapToTexture(
        GPUImageBitmapCopyView source,
        GPUTextureCopyView destination,
        GPUExtent3D copySize);
};
```
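To illustrate the two calling conventions of `uploadBuffer`, here is a hypothetical usage sketch (the `queue`, buffers, and data are assumed to already exist):

```js
const pass = queue.beginUpload();

// Convention 1: provide userData; the implementation copies it into staging.
pass.uploadBuffer(vertexBuffer, 0, vertexData.byteLength,
                  { data: vertexData, offset: 0 });

// Convention 2: omit userData; write into the returned staging view directly.
const view = pass.uploadBuffer(uniformBuffer, 0, 256, null);
new Float32Array(view).set(matrixValues);

pass.finish(); // queues the copies and detaches `view`
```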
Explanation
The implementation would manage the "staging belt" internally, with `GPUUploadPass` as an interface to it.

From the user's perspective, the transfer operations are queued on the queue at `GPUUploadPass::finish()` time. It's easy to reason about "when" things happen: after all the previous queue operations (e.g. `submit()` calls) and before all the subsequent ones. When an upload operation is recorded, the implementation can decide to execute it instantly, since it knows the order of operations.
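For example, under these rules the upload below is observed after the first submit and before the second, regardless of when the implementation actually performs the copy (the `commandsA`/`commandsB`/`bytes` names are illustrative):

```js
queue.submit([commandsA]);   // (1) executes first

const pass = queue.beginUpload();
pass.uploadBuffer(buffer, 0, bytes.byteLength, { data: bytes, offset: 0 });
pass.finish();               // (2) the upload takes effect here

queue.submit([commandsB]);   // (3) observes the uploaded data
```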
Fast path
The fast path of data transfers is writing directly into the buffer that is persistently mapped by the browser. If the platform supports mapping across the process boundary (or has no IPC), the client can also have a view into the "staging belt", possibly without any means to allocate new buffers in the ring buffer.
Upon calling an `uploadXxx` method, the client finds the "staging space":
- if the client sees the mapped staging belt, and it has enough space (or can be extended by the client), this mapped area is the staging;
- otherwise, the shared memory between the client and the GPU process is used as the staging.
If `userData` is provided, its contents are copied into the selected staging space. Otherwise, a new `ArrayBuffer` object is created as a view into the staging space and returned to the user.
After the staging space is filled, the implementation can internally queue a copy operation from this staging memory into the target resource.
In the case where the target resource is CPU-visible and is not used by the GPU, the implementation can write directly into it, which gives us the zero-copy fast path. A rough sketch of this selection logic follows.
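This is a hand-wavy sketch of how an implementation might pick staging space inside `uploadBuffer()`; every name here is invented for illustration:

```js
// Hypothetical staging selection inside uploadBuffer(). The `destination`,
// `belt`, and `sharedMemory` objects are assumed helpers, not real API.
function pickStaging(destination, size, belt, sharedMemory) {
    if (destination.cpuVisible && !destination.gpuBusy) {
        // Zero-copy fast path: write straight into the destination resource.
        return { space: destination, needsCopy: false };
    }
    if (belt.mapped && belt.hasSpace(size)) {
        // The client has a view of the staging belt: write there,
        // then queue an internal staging-to-destination copy.
        return { space: belt.allocate(size), needsCopy: true };
    }
    // Fall back to client<->GPU-process shared memory as the staging.
    return { space: sharedMemory.allocate(size), needsCopy: true };
}
```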
setSubData()
With `GPUUploadPass` in the API, it would be easy for users to get the most efficient path for `setSubData` (of #418):
```js
GPUQueue.prototype.setBufferSubData = function(buffer, offset, data) {
    const pass = this.beginUpload();
    pass.uploadBuffer(buffer, offset, data.byteLength, { data: data });
    pass.finish();
};
```
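With this shim, a call like `queue.setBufferSubData(someBuffer, 0, bytes)` (where `bytes` is an `ArrayBuffer`) transparently goes through the staging belt and hits the zero-copy path whenever the destination allows it.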
FAQ
Q: Why do we need a new pass interface?
A: No strong reason. I just thought it's useful to separate the semantics of submission from transfers at the API level.
Q: Why are we only talking about uploads?
A: Just to reduce the scope for now, given that downloads are less critical to make fast.
Q: How exactly do `ArrayBuffer` objects get detached?
A: (missing answer)
Finally, this isn't a polished or finalized proposal. It's quite hand-wavy, inviting us to explore this distinct API direction.