
Multi-queue proposal with explicit transfers #1073

@Kangz


Multiqueue proposal

This is a separate issue from #1065 so as not to pollute the investigation issue with all the discussion that may happen on this proposal. Please read the investigation for motivation and background on multi-queue in all APIs. Also see #1066, which is a different multi-queue proposal with implicit transfers. (Unfortunately this proposal comes a bit later because I waited for internal reviews.)

A complete multiqueue proposal for WebGPU needs to have all of these sub-proposals:

  • Discovery and requesting mechanisms for queues, if any.
  • Synchronization mechanisms between queues.
  • Semantics for resource sharing / transfer between queues that prevent data races.
  • Mechanisms / thoughts on how resources are allowed to be used on one queue or another.
  • How non-resource concepts relate to queues (in particular command encoders, but possibly also buffer mapping and pipelines).

Synchronization proposal.

The hardest part to come up with a proposal for is the synchronization. That's because the execution of commands on queues needs to follow a DAG, and we need to guarantee that if a resource is used as writeable in node W and used (read or write) in node U, then there must be a path from node W to node U or from node U to node W (edges are execution dependencies and prevent the execution from overlapping).

Both D3D12 and Vulkan have concepts of exclusive and concurrent access, where exclusive access means a single queue can access the resource at a time, while in concurrent access all queues can access it (but data races are still disallowed).

Proposal for exclusive access

Exclusive access synchronization is a bit simpler because there's a single owner at a time, so let's start with it. Vulkan requires a handshake between the two queues involved in an ownership transfer, so WebGPU needs the transfer command to be initiated from the giving queue. The simplest version would be an API like the following (note that it is just an intermediate step and not part of the proposal itself):

partial interface GPUQueue {
    void transferOwnership(GPUBuffer | GPUTexture resource, GPUQueue receivingQueue);
};

Each resource would have an internal reference to the queue that currently owns it. transferOwnership would update that reference, and ensure proper synchronization happens on the giving and receiving queues. Of course a resource would only be allowed to be used on the GPUQueue that currently owns it.
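
As a sketch of how this intermediate API would be used (commandBufferWriting and commandBufferReading are hypothetical helpers standing in for encoded work that uses the buffer):

// buffer is assumed to start out owned by queue0.
queue0.submit([commandBufferWriting(buffer)]);

// Hand the buffer over; the implementation synchronizes the two queues.
queue0.transferOwnership(buffer, queue1);

// Valid: queue1 is now the owner.
queue1.submit([commandBufferReading(buffer)]);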

The problem is that there might be many resources going back and forth between queues, and for the developer, it isn't very apparent what parallelism will be possible. Ideally we want the DAG of operations to be very apparent to them so they can reason about it.

The key idea is to tie resource transfers to fence signal values (remember that a GPUFence can only be signaled on the queue it was created from), and to batch ownership transfers to avoid having tons of fence signals.

dictionary GPUResourceOwnershipTransfer {
    required GPUQueue receivingQueue;
    required sequence<GPUBuffer | GPUTexture> resources;
};

partial interface GPUQueue {
    void transferOwnership(GPUFence fence,
                           unsigned long long signalValue,
                           sequence<GPUResourceOwnershipTransfer> transfers);
                           
    void wait(GPUFence fence, unsigned long long waitValue);
};

In this version GPUQueue.transferOwnership acts like a GPUQueue.signal (with the same validation), but also stores the transfers on the fence for signalValue. It puts all the resources in the BeingTransferred state (where they aren't owned by any queue).

GPUQueue.wait has the usual validation that the fence must not be an error and the waitValue must be less than or equal to the last value signaled. It also finishes the transfer of all the resources that were stored for transfer on the fence for this GPUQueue with a signalValue <= waitValue.
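
For example, a direct transfer between two queues would look like this (commandBufferUsing is a hypothetical helper, as in the counter-example below):

let fence = queue0.createFence();

// Signals the fence with value 1 and puts buffer in the "BeingTransferred" state.
queue0.transferOwnership(fence, 1, [{receivingQueue: queue1, resources: [buffer]}]);

// Finishes the transfer stored on the fence for queue1 with signalValue <= 1.
queue1.wait(fence, 1);

// Valid: buffer is now owned by queue1.
queue1.submit([commandBufferUsing(buffer)]);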

Finishing the transfer requires a GPUQueue.wait on the fence that stored the transfer, so the following indirect transfer wouldn't work (for implementation simplicity, and for clarity in the API):

let fence0 = queue0.createFence();
let fence1 = queue1.createFence();

queue0.transferOwnership(fence0, 1, [{receivingQueue: queue2, resources: [buffer]}]);
queue1.wait(fence0, 1);
queue1.signal(fence1, 1);
queue2.wait(fence1, 1);

// Error: buffer is still in the "BeingTransferred" state because queue2
// never waited on fence0, which is the fence that stored the transfer.
queue2.submit([commandBufferUsing(buffer)]);

Proposal for concurrent access

Buffers can be used freely with concurrent access in D3D12, and in all the Vulkan drivers I looked at. However, textures using concurrent access can be deoptimized, so they would need a flag at creation to allow concurrent access. Vulkan requires the list of all queues used for concurrent access, but that list doesn't seem to be used in the drivers I looked at, so in WebGPU concurrent access applies to all queues (we can easily add a queue list later if needed).

partial dictionary GPUTextureDescriptor {
    // default to false, so that by default textures are optimized.
    boolean allowConcurrentQueueAccess = false;
    
    // Undefined means the device's default queue, so that by default nothing
    // changes compared to the current WebGPU API.
    // (note that starting owned by the device is useless because it only
    // allows readonly access, and the texture is newly created).
    GPUQueue initialQueueOwner;
};

partial dictionary GPUBufferDescriptor {
    // Same as for GPUTexture but must be set to null if
    // mappedAtCreation is true.
    GPUQueue initialQueueOwner;
};
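
A sketch of resource creation with these additions (the size, format, and usage values are purely illustrative):

// A texture that accepts the concurrent-access deoptimization, initially
// owned by computeQueue (a queue assumed to have been created by the application).
const texture = device.createTexture({
    size: [1024, 1024],
    format: "rgba8unorm",
    usage: GPUTextureUsage.SAMPLED | GPUTextureUsage.COPY_DST,
    allowConcurrentQueueAccess: true,
    initialQueueOwner: computeQueue,
});

// A buffer initially owned by computeQueue as well.
const buffer = device.createBuffer({
    size: 1024,
    usage: GPUBufferUsage.STORAGE,
    initialQueueOwner: computeQueue,
});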

Then we change GPUResourceOwnershipTransfer to make the GPUQueue optional: a receivingQueue of undefined or null instead means that ownership is transferred to the device, i.e. all queues, for read-only access.

dictionary GPUResourceOwnershipTransfer {
    GPUQueue receivingQueue;
    required sequence<GPUBuffer | GPUTexture> resources;
};

Each resource has an internal slot that contains the list of queues it is allowed to be used on (instead of a single owning queue); it is updated on exclusive ownership transfers in GPUQueue.wait as described above. When a GPUQueue.wait completes a transfer to the device, the waiting queue is added as an allowed queue on all the resources in that transfer.
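
A sketch of a transfer to the device (commandBufferSamplingFrom is a hypothetical helper; texture is assumed to have been created with allowConcurrentQueueAccess set to true):

let fence = queue0.createFence();

// No receivingQueue: the texture is transferred to the device (read-only).
queue0.transferOwnership(fence, 1, [{resources: [texture]}]);

// Each waiting queue is appended to the texture's list of allowed queues.
queue1.wait(fence, 1);
queue2.wait(fence, 1);

// Valid: both queues can now read the texture concurrently.
queue1.submit([commandBufferSamplingFrom(texture)]);
queue2.submit([commandBufferSamplingFrom(texture)]);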

(Implementation notes: lazy clearing will need to happen on transfer to the device, because clearing later in concurrent mode could otherwise cause data races. The Vulkan layout / D3D12 state will need to be the most general read-only state/layout allowed for that texture.)

Moving from concurrent ownership to exclusive ownership is done via GPUDevice.transferOwnership:

partial interface GPUDevice {
    void transferOwnership(sequence<GPUResourceOwnershipTransfer> transfers);
};

This would cause any further operation on a queue that received a transfer to wait on all previous operations on all queues (a loss of parallelism; see the alternatives below for potential fixes), and the resource would be immediately owned by that queue.
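
Continuing the sketch above, moving the texture back to exclusive ownership could look like this (commandBufferWritingTo is another hypothetical helper):

// The texture immediately becomes exclusively owned by queue0; further work
// on queue0 waits on all previously submitted work on all queues.
device.transferOwnership([{receivingQueue: queue0, resources: [texture]}]);

// Valid: queue0 owns the texture and may write to it again.
queue0.submit([commandBufferWritingTo(texture)]);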

Interaction with GPUBuffer.unmap.

What's the queue ownership of unmapped buffers? Buffers with MAP_WRITE aren't writeable on the GPU (for now?) so they could go back to concurrent access, while MAP_READ buffers are only writeable on the GPU so they would go back to being owned by a queue. GPUBuffer.unmap would take an optional argument:

dictionary GPUUnmapOptions {
    GPUQueue owningQueue; // concurrent access if undefined
};
partial interface GPUBuffer {
    void unmap(GPUUnmapOptions? options);
};
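
A sketch of both cases (readbackBuffer is assumed to have MAP_READ usage, uploadBuffer MAP_WRITE):

// MAP_READ buffer: only writeable on the GPU, so hand it back to a single
// owning queue when unmapping.
readbackBuffer.unmap({owningQueue: queue1});

// MAP_WRITE buffer: not writeable on the GPU, so omitting owningQueue puts
// it back into concurrent (read-only) access on all queues.
uploadBuffer.unmap();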

Alternatives

Implicit synchronization

WebGPU has enough information that it could theoretically do implicit synchronization between queues when it sees they both use the same resource. This is natural to think about because WebGPU already does implicit memory synchronization for execution inside a single queue.

However, the goal of multi-queue is to take advantage of the high-level parallelism in an application's computation graph. Making synchronization implicit means that small changes to the application's code could silently introduce additional synchronization that completely serializes the command flow. Likewise, when building a multi-queue application, it would be extremely hard to know whether it actually takes advantage of parallelism. The combination of these two things would make it almost impossible to write multi-queue applications, and the feature mostly useless.

Automatic multi-queue

Similarly to implicit synchronization, WebGPU has enough information that it could theoretically automatically schedule work on multiple queues and magically make people's code faster!

This has the same drawbacks as implicit synchronization, and would be extremely difficult to implement efficiently. Also, moving tiny chunks of work to a different queue would be a deoptimization due to the synchronization overhead, so the WebGPU implementation would need to guess how expensive commands are, which is extremely hard to do.

Have the list of queues for concurrent access be explicit

This would change GPUResourceOwnershipTransfer to take a sequence<GPUQueue> instead of a single queue; an exclusive ownership transfer would happen if the sequence contains a single element. It could help avoid waiting on every single queue when doing a GPUDevice.transferOwnership.

Have "device" fences that wait for all previously submitted commands

This would allow the concurrent -> exclusive transfer to not happen immediately: instead, an application could signal a fence in the GPUDevice.transferOwnership and GPUQueue.wait for it later, increasing parallelism.

Queue discovery and requesting.

This is mostly orthogonal to the synchronization proposal, and any proposal of the following form could work, with no constraints on whether compute-only or copy-only queues can be created:

partial interface GPUDevice {
    GPUQueue createQueue(GPUQueueDescriptor descriptor);
    GPUQueue defaultQueue;
};

Although for Vulkan it might be useful to know in advance how many queues the application would like for the device.

(Note: for efficiency, would it be best to have the number of queues known at device creation, so that maps of queue->T can just be vectors of T?)
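
A sketch of what queue creation might look like (the contents of GPUQueueDescriptor are not pinned down by this proposal; a label is assumed here purely for illustration):

// Additional queues requested by the application.
const computeQueue = device.createQueue({label: "async compute"});
const copyQueue = device.createQueue({label: "uploads"});

// The queue that exists in today's API remains the default.
const mainQueue = device.defaultQueue;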

Interaction with the command encoder.

Vulkan and D3D12 need to know which queue family commands will be encoded for, so we could add a new member to GPUCommandEncoderDescriptor:

partial dictionary GPUCommandEncoderDescriptor {
    GPUQueue executionQueue; // undefined means the default queue.
};

Likewise for the render bundle descriptor.
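
A sketch of encoding for a specific queue (presumably submitting the resulting command buffer to a different queue would be a validation error):

// Commands are encoded for computeQueue's queue family.
const encoder = device.createCommandEncoder({executionQueue: computeQueue});
// ... record passes and copies ...
const commandBuffer = encoder.finish();

// Submit on the queue the encoder was created for.
computeQueue.submit([commandBuffer]);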
