Memory barriers investigations

A memory barrier is an abstraction provided to the graphics API user that allows controlling the internal mutable state of otherwise immutable objects. Such state is device/driver dependent and may include:
  - cache flushes
  - memory layout/access changes
  - compression states

Two failure cases (from [AMD GDC 2016 presentation](http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2016/03/d3d12_vulkan_lessons_learned.pdf)):
 - too many or too broad barriers: bad performance
 - missing barriers: corruption (*)

## General information

### Metal

Memory barriers are inserted automatically by the runtime/driver.

### Direct3D 12

Quote from [MSDN](https://msdn.microsoft.com/en-us/library/windows/desktop/dn899226(v=vs.85).aspx):
> In Direct3D 11, drivers were required to track this state in the background. This is expensive from a CPU perspective and significantly complicates any sort of multi-threaded design.

Direct3D 12 has [3 kinds](https://msdn.microsoft.com/en-us/library/windows/desktop/dn986740(v=vs.85).aspx) of barriers:
  1. Transition (state) barrier: declares that a resource needs to transition into a different state.
  2. Aliasing barrier: declares that one alias of a resource is going to be used instead of another.
  3. UAV barrier: waits for all operations on a UAV to finish before another operation on this UAV begins.

#### Resource states

A (sub-)resource can be either in a single read-write state, or in a combination of read-only states. Read-write states are:
  - `D3D12_RESOURCE_STATE_RENDER_TARGET`
  - `D3D12_RESOURCE_STATE_STREAM_OUT`
  - `D3D12_RESOURCE_STATE_COPY_DEST`
  - `D3D12_RESOURCE_STATE_UNORDERED_ACCESS`
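
The rule above (one read-write state, or any combination of read-only states) can be sketched as a small check. This is only a model: `ResourceState` is a hypothetical stand-in for `D3D12_RESOURCE_STATES`, its bit values are illustrative, and only the read-write states listed above are treated as exclusive.

```python
from enum import IntFlag


class ResourceState(IntFlag):
    """Hypothetical stand-in for D3D12_RESOURCE_STATES (illustrative values)."""
    COMMON = 0
    VERTEX_AND_CONSTANT_BUFFER = 0x1
    INDEX_BUFFER = 0x2
    RENDER_TARGET = 0x4
    UNORDERED_ACCESS = 0x8
    PIXEL_SHADER_RESOURCE = 0x80
    STREAM_OUT = 0x100
    COPY_DEST = 0x400
    COPY_SOURCE = 0x800


# The read-write states listed in this section.
READ_WRITE_STATES = (
    ResourceState.RENDER_TARGET,
    ResourceState.STREAM_OUT,
    ResourceState.COPY_DEST,
    ResourceState.UNORDERED_ACCESS,
)


def is_valid_state(state):
    """True for a single read-write state or any mix of read-only states."""
    write_bits = [s for s in READ_WRITE_STATES if state & s]
    if not write_bits:
        return True  # COMMON, or read-only states only
    # Exactly one read-write bit, and no other bit set next to it.
    return len(write_bits) == 1 and state == write_bits[0]


ok = ResourceState.COPY_SOURCE | ResourceState.PIXEL_SHADER_RESOURCE
bad = ResourceState.RENDER_TARGET | ResourceState.COPY_SOURCE
print(is_valid_state(ok), is_valid_state(bad))  # True False
```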

For presentation, a resource must be in `D3D12_RESOURCE_STATE_PRESENT` state, which is equal to `D3D12_RESOURCE_STATE_COMMON`.

There are [special rules](https://msdn.microsoft.com/en-us/library/windows/desktop/dn899226(v=vs.85).aspx#implicit_state_transitions) for resource state promotion from the `COMMON` state and decay into `COMMON`. These transitions are implicit and specified to incur no GPU cost.

A barrier can also be split into a begin half and an end half, so that it spans multiple draw calls:
> Split barriers provide hints to the GPU that a resource in state A will next be used in state B sometime later. This gives the GPU the option to optimize the transition workload, possibly reducing or eliminating execution stalls.
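
To make the split-barrier idea concrete, here is a minimal model of a command stream with begin/end halves. It is not D3D12 code: `Cmd` and `check_split_barriers` are hypothetical names, and the sketch assumes that a resource must not be used between the two halves of its own split barrier. The returned number is how many unrelated commands the transition may overlap with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cmd:
    kind: str          # "draw", "begin" (first half) or "end" (second half)
    resource: str
    before: str = ""   # state before the transition (barrier halves only)
    after: str = ""    # state after the transition (barrier halves only)


def check_split_barriers(cmds):
    """Pair begin/end halves; return commands overlapped per resource."""
    pending = {}   # resource -> (before, after, index of the begin half)
    overlap = {}
    for i, c in enumerate(cmds):
        if c.kind == "begin":
            if c.resource in pending:
                raise ValueError(f"{c.resource}: split barrier already pending")
            pending[c.resource] = (c.before, c.after, i)
        elif c.kind == "end":
            before, after, start = pending.pop(c.resource, (None, None, None))
            if (before, after) != (c.before, c.after):
                raise ValueError(f"{c.resource}: end half has no matching begin")
            overlap[c.resource] = i - start - 1
        elif c.resource in pending:
            # Assumption of this sketch: no access while the transition is in flight.
            raise ValueError(f"{c.resource}: used while its transition is pending")
    if pending:
        raise ValueError(f"unfinished split barriers: {sorted(pending)}")
    return overlap


stream = [
    Cmd("begin", "shadow_map", "DEPTH_WRITE", "PIXEL_SHADER_RESOURCE"),
    Cmd("draw", "gbuffer"),
    Cmd("draw", "gbuffer"),
    Cmd("end", "shadow_map", "DEPTH_WRITE", "PIXEL_SHADER_RESOURCE"),
    Cmd("draw", "shadow_map"),
]
print(check_split_barriers(stream))  # {'shadow_map': 2}
```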

### Vulkan

[Typical synchronization use-cases](https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples)

#### Pipeline barriers

Vulkan has a lot of knobs for configuring barriers in fine detail. For example, the user provides separate masks for the source and destination [pipeline stages](https://www.khronos.org/registry/vulkan/specs/1.0/man/html/VkPipelineStageFlagBits.html). By placing the source stage as early and the destination stage as late in the pipeline as the dependency allows, we give the GPU/driver more time to do the actual transition and minimize stalls.

There are 3 types of barriers:
  1. _Global_ memory barrier: specifies [access flags](https://www.khronos.org/registry/vulkan/specs/1.0/man/html/VkAccessFlagBits.html) for all memory objects that exist at the time of its execution.
  2. _Buffer_ memory barrier: similar to a global barrier, but limited to a specified sub-range of buffer memory.
  3. _Image_ memory barrier: similar to a global barrier, but limited to a sub-range of image memory. In addition to changing the access flags, an image barrier also includes the transition between [image layouts](https://vulkan.lunarg.com/doc/view/1.0.30.0/linux/vkspec.chunked/ch11s04.html).
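
The difference in scope between the three barrier types can be modeled roughly as follows. This is a simplified sketch, not the Vulkan structures: `Barrier` and `covers` are hypothetical, and an image sub-range is reduced to a run of mip levels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Barrier:
    kind: str                          # "global", "buffer" or "image"
    resource: Optional[str] = None     # None for a global barrier
    first: int = 0                     # first byte (buffer) or mip level (image)
    count: int = 0                     # number of bytes or mip levels covered
    old_layout: Optional[str] = None   # image barriers only
    new_layout: Optional[str] = None   # image barriers only


def covers(barrier, resource, index):
    """Does the barrier apply to byte/mip `index` of `resource`?"""
    if barrier.kind == "global":
        return True  # applies to all memory objects
    in_range = barrier.first <= index < barrier.first + barrier.count
    return barrier.resource == resource and in_range


everything = Barrier("global")
vertex_part = Barrier("buffer", "vertices", first=256, count=512)
mip0_to_sampled = Barrier(
    "image", "albedo", first=0, count=1,
    old_layout="TRANSFER_DST_OPTIMAL", new_layout="SHADER_READ_ONLY_OPTIMAL",
)
```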

Similarities with D3D12:
  - explicit barriers
  - both source and destination layout/states are requested, i.e. the driver doesn't track the current layout and expects/trusts the user to insert optimal barriers/transitions
  - image sub-resources carry independent layouts that can be changed individually or in bulk

Vulkan can transition an image to any layout if the current contents are discarded, by specifying `VK_IMAGE_LAYOUT_UNDEFINED` as the old layout.

Note: barriers also allow resource transitions between queue families.

#### Implicit barriers

Barriers are inserted automatically between sub-passes of a render pass, based on the following information:
  - initial and final layouts provided for each attachment
  - a layout provided for each attachment for each sub-pass
  - set of sub-pass dependencies, each specifying what parts of what destination sub-pass stages depend on some results of some stages of a source sub-pass

The Vulkan implementation also automatically inserts layout transitions for read-only layouts of a resource used in multiple sub-passes.
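
As a rough illustration of how the layout inputs listed above determine the implicit transitions of one attachment, consider the sketch below. `implicit_transitions` is a hypothetical helper; it deliberately ignores sub-pass dependencies (which decide when each transition may execute) and only derives which layout changes are needed.

```python
def implicit_transitions(initial, subpass_layouts, final):
    """List (where, old_layout, new_layout) for one attachment.

    `subpass_layouts[i]` is the layout used in sub-pass i, or None if the
    attachment is not used there.
    """
    transitions = []
    current = initial
    for index, layout in enumerate(subpass_layouts):
        if layout is None:
            continue  # attachment untouched by this sub-pass
        if layout != current:
            transitions.append((f"before sub-pass {index}", current, layout))
            current = layout
    if final != current:
        transitions.append(("end of render pass", current, final))
    return transitions


for step in implicit_transitions(
    "UNDEFINED",
    ["COLOR_ATTACHMENT_OPTIMAL", None, "SHADER_READ_ONLY_OPTIMAL"],
    "PRESENT_SRC_KHR",
):
    print(step)
```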

#### Events

A Vulkan event is a synchronization primitive that can be used to define memory dependencies within a command queue. The arguments of `vkCmdWaitEvents` are almost identical to those of `vkCmdPipelineBarrier`. The difference is the ability to move the start of the transition earlier in the queue, which is similar in concept to D3D12 split barriers.


## Analysis

Tips for best performance (for AMD):
  - combine transitions
  - use the most specific state, but also combine states
  - give the driver time to handle the transition
    - D3D12: split barriers
    - Vulkan: `vkCmdSetEvent` + `vkCmdWaitEvents`
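
The "combine transitions" tip can be sketched as a pass over a recorded command list that merges adjacent barriers into one batch. The command tuples and `batch_barriers` are hypothetical; the sketch only merges barriers that are already next to each other and never reorders them across other commands.

```python
def batch_barriers(commands):
    """Merge runs of adjacent ("barrier", x) commands into ("barriers", [x, ...])."""
    batched = []
    for cmd in commands:
        if cmd[0] != "barrier":
            batched.append(cmd)
        elif batched and batched[-1][0] == "barriers":
            batched[-1][1].append(cmd[1])  # extend the current batch
        else:
            batched.append(("barriers", [cmd[1]]))  # start a new batch
    return batched


recorded = [
    ("barrier", "albedo: COPY_DEST -> PIXEL_SHADER_RESOURCE"),
    ("barrier", "normals: COPY_DEST -> PIXEL_SHADER_RESOURCE"),
    ("draw", "scene"),
    ("barrier", "backbuffer: RENDER_TARGET -> PRESENT"),
]
for cmd in batch_barriers(recorded):
    print(cmd)
```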

Nitrous engine (Oxide Games, [GDC 2017 presentation](https://www.khronos.org/assets/uploads/developers/library/2017-gdc/GDC_Vulkan-on-Desktop_Feb17.pdf) slide 36) approach:
  - the engine auto-tracks the current state; the user only requests the new state
  - the resource state range is extended (relative to D3D12) and maps to Vulkan barriers
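
The auto-tracking idea can be sketched as follows. This is our own minimal model, not the Nitrous implementation: `StateTracker` is a hypothetical class that remembers the current state per resource and emits a (before, after) barrier only when the requested state differs.

```python
class StateTracker:
    """Tracks the current state of each resource on behalf of the user."""

    def __init__(self):
        self._current = {}

    def register(self, resource, initial_state):
        self._current[resource] = initial_state

    def request(self, resource, new_state):
        """Return (resource, before, after), or None if no barrier is needed."""
        old_state = self._current[resource]
        if old_state == new_state:
            return None
        self._current[resource] = new_state
        return (resource, old_state, new_state)


tracker = StateTracker()
tracker.register("shadow_map", "DEPTH_WRITE")
print(tracker.request("shadow_map", "PIXEL_SHADER_RESOURCE"))
print(tracker.request("shadow_map", "PIXEL_SHADER_RESOURCE"))  # None
```

The user never states the source state, so a mismatched `before` state cannot be expressed; the cost is the per-resource bookkeeping on the CPU.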

Overall, in terms of flexibility/configuration, Vulkan barriers >> D3D12 barriers >> Metal. Developers seem to prefer D3D12 style (TODO: confirm with more developers!).

### Translation between APIs

#### Metal API running on D3D12/Vulkan

We'd have to replicate the analysis already done by D3D11 and Metal drivers, but without low-level access to the command buffer structure.

#### D3D12/Vulkan API running on Metal

All barriers become no-ops.

#### D3D12 API running on Vulkan

Given that D3D12 appears to have a smaller API surface and a stricter set of allowed resource states (e.g. read-write states cannot be combined), it seems possible to conservatively emulate D3D12 states on top of Vulkan. Prototyping would probably help to narrow down the fine details.
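
A first prototype of such a conservative emulation might start from a lookup table like the one below. The mapping itself is an assumption of this sketch and has not been validated against either specification; names are shortened (the `D3D12_RESOURCE_STATE_`, `VK_IMAGE_LAYOUT_` and `VK_ACCESS_` prefixes and the `_BIT` suffix are dropped), and any combination of read-only states that disagrees on the layout falls back to `GENERAL`.

```python
# Assumed mapping: D3D12 state -> (Vulkan image layout, Vulkan access flags).
D3D12_TO_VULKAN = {
    "RENDER_TARGET": ("COLOR_ATTACHMENT_OPTIMAL",
                      {"COLOR_ATTACHMENT_READ", "COLOR_ATTACHMENT_WRITE"}),
    "UNORDERED_ACCESS": ("GENERAL", {"SHADER_READ", "SHADER_WRITE"}),
    "COPY_DEST": ("TRANSFER_DST_OPTIMAL", {"TRANSFER_WRITE"}),
    "COPY_SOURCE": ("TRANSFER_SRC_OPTIMAL", {"TRANSFER_READ"}),
    "PIXEL_SHADER_RESOURCE": ("SHADER_READ_ONLY_OPTIMAL", {"SHADER_READ"}),
    "NON_PIXEL_SHADER_RESOURCE": ("SHADER_READ_ONLY_OPTIMAL", {"SHADER_READ"}),
}


def to_vulkan(states):
    """Map a set of D3D12 state names to one (layout, access flags) pair."""
    layouts = set()
    access = set()
    for state in states:
        layout, flags = D3D12_TO_VULKAN[state]
        layouts.add(layout)
        access |= flags
    # Conservative choice: if the states disagree on a layout, use GENERAL.
    layout = layouts.pop() if len(layouts) == 1 else "GENERAL"
    return layout, frozenset(access)


print(to_vulkan({"COPY_DEST"}))
print(to_vulkan({"COPY_SOURCE", "PIXEL_SHADER_RESOURCE"})[0])  # GENERAL
```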

#### Vulkan API on D3D12

Largely comes down to the following aspects:
  - ignoring the given pipeline stages
  - translating (image layout, access mask) -> D3D12 resource state
  - `vkCmdWaitEvents` should be possible to translate to a D3D12 split barrier, but more experiments are needed to confirm
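
The second aspect, translating (image layout, access mask) into a D3D12 resource state, could look roughly like this. The table is an assumption made for illustration, names are shortened as plain strings, and pipeline stages are ignored as described above. For the `GENERAL` layout the access mask decides, and a write access takes priority because D3D12 does not allow a read-write state to be combined with others.

```python
# Assumed mapping for layouts that imply a single D3D12 state set.
LAYOUT_TO_STATES = {
    "COLOR_ATTACHMENT_OPTIMAL": frozenset({"RENDER_TARGET"}),
    "TRANSFER_SRC_OPTIMAL": frozenset({"COPY_SOURCE"}),
    "TRANSFER_DST_OPTIMAL": frozenset({"COPY_DEST"}),
    "SHADER_READ_ONLY_OPTIMAL": frozenset(
        {"PIXEL_SHADER_RESOURCE", "NON_PIXEL_SHADER_RESOURCE"}),
    "PRESENT_SRC_KHR": frozenset({"PRESENT"}),
}

# Assumed mapping for read accesses under the GENERAL layout.
READ_ACCESS_TO_STATES = {
    "SHADER_READ": {"PIXEL_SHADER_RESOURCE", "NON_PIXEL_SHADER_RESOURCE"},
    "TRANSFER_READ": {"COPY_SOURCE"},
}


def to_d3d12(layout, access):
    """Translate (layout, access flag names) into a set of D3D12 states."""
    if layout in LAYOUT_TO_STATES:
        return LAYOUT_TO_STATES[layout]
    if layout != "GENERAL":
        raise ValueError(f"layout not covered by this sketch: {layout}")
    # Read-write states must stand alone, so writes win over reads.
    if "SHADER_WRITE" in access:
        return frozenset({"UNORDERED_ACCESS"})
    if "TRANSFER_WRITE" in access:
        return frozenset({"COPY_DEST"})
    states = set()
    for flag in access:
        states |= READ_ACCESS_TO_STATES.get(flag, set())
    return frozenset(states)


print(sorted(to_d3d12("GENERAL", {"SHADER_READ", "SHADER_WRITE"})))  # ['UNORDERED_ACCESS']
```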


### Security/corruption issues

We've done some research with IHVs on how the hardware behaves when resources are used with a mismatched resource layout/state, e.g. an operation expects an image to be in a shader-readable state, while the image is not.

The conclusion we got is that in most situations this workload will end up in either a GPU page fault (crash), or visual corruption with user data. It's relatively straightforward for Vulkan to add an extension, and for IHVs to implement it, that would guarantee security of such mismatched layout access. The extension would be defined similarly to `robustBufferAccess` and specify the exact behavior of the hardware and the lack of access to non-initialized memory not owned by the current instance.

### Automation versus Validation

Inserting optimal Vulkan/D3D12 barriers at the right times appears to be a complex task, especially when taking multiple independent queues into consideration. It requires knowing ahead of time how a resource is going to be used in the future, and would thus require us to defer actual command buffer recording until we get more data on how resources are used. This would add more CPU overhead to command recording.

Simply validating that current transitions are sufficient appears to be more feasible, since it doesn't require patching command buffers and that logic can be moved completely into the validation layer.
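
A minimal sketch of the validation approach, assuming a single queue and whole-resource states: the validator replays the recorded commands, tracks the state each barrier leaves behind, and reports mismatches instead of patching anything. The command tuples and `validate` are hypothetical.

```python
def validate(initial_states, commands):
    """Return a list of error strings; an empty list means the barriers suffice.

    Commands are ("barrier", resource, before, after) or
    ("use", resource, required_state).
    """
    current = dict(initial_states)
    errors = []
    for index, cmd in enumerate(commands):
        if cmd[0] == "barrier":
            _, resource, before, after = cmd
            if current[resource] != before:
                errors.append(
                    f"#{index}: barrier on {resource} claims {before}, "
                    f"but it is in {current[resource]}")
            current[resource] = after
        else:
            _, resource, required = cmd
            if current[resource] != required:
                errors.append(
                    f"#{index}: {resource} used as {required}, "
                    f"but it is in {current[resource]}")
    return errors


commands = [
    ("use", "color", "RENDER_TARGET"),
    ("use", "color", "PIXEL_SHADER_RESOURCE"),  # missing barrier before this
    ("barrier", "color", "RENDER_TARGET", "PRESENT"),
]
for error in validate({"color": "RENDER_TARGET"}, commands):
    print(error)
```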


## Concrete proposals

TODO

