Description
A memory barrier is an abstraction provided to the graphics API user that allows controlling the internal mutable state of otherwise immutable objects. Such states are device/driver dependent and may include:
- cache flushes
- memory layout/access changes
- compression states
Two failure cases (from AMD GDC 2016 presentation):
- too many or too broad: bad performance
- missing barriers: corruptions (*)
General information
Metal
Memory barriers are inserted automatically by the runtime/driver.
Direct3D 12
Quote from MSDN:
In Direct3D 11, drivers were required to track this state in the background. This is expensive from a CPU perspective and significantly complicates any sort of multi-threaded design.
Direct3D has 3 kinds of barriers:
- State barrier: declares that a resource needs to transition into a different state.
- Alias barrier: declares that one alias of a resource is going to be used instead of another.
- UAV barrier: waits for all operations on a UAV to finish before another operation on that UAV begins.
Resource states
A (sub-)resource can be either in a single read-write state, or in a combination of read-only states. Read-write states are:
D3D12_RESOURCE_STATE_RENDER_TARGET
D3D12_RESOURCE_STATE_STREAM_OUT
D3D12_RESOURCE_STATE_COPY_DEST
D3D12_RESOURCE_STATE_UNORDERED_ACCESS
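The "one read-write state or a combination of read-only states" rule can be expressed as a small bit-mask check. The sketch below is illustrative only: the enum is a hypothetical stand-in for the D3D12 state flags (the numeric values are not copied from d3d12.h), the three read-only states are just examples, and the set of read-write states is limited to the four listed above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit flags mirroring the D3D12 state names used above;
// the numeric values are illustrative, not the ones from d3d12.h.
enum ResourceState : std::uint32_t {
    kCommon          = 0,
    kRenderTarget    = 1u << 0,  // read-write
    kStreamOut       = 1u << 1,  // read-write
    kCopyDest        = 1u << 2,  // read-write
    kUnorderedAccess = 1u << 3,  // read-write
    kShaderResource  = 1u << 4,  // read-only (example)
    kCopySource      = 1u << 5,  // read-only (example)
    kIndexBuffer     = 1u << 6,  // read-only (example)
};

const std::uint32_t kWriteStates =
    kRenderTarget | kStreamOut | kCopyDest | kUnorderedAccess;

// A mask is valid if it holds exactly one read-write state and nothing else,
// or any combination of read-only states (including none, i.e. COMMON).
inline bool IsValidStateMask(std::uint32_t mask) {
    const std::uint32_t writes = mask & kWriteStates;
    if (writes == 0) return true;                      // read-only combination
    const bool single_bit = (writes & (writes - 1)) == 0;
    return single_bit && mask == writes;               // one write state, no reads
}

// Merges two read-only masks; write states must not be passed in.
inline std::uint32_t CombineReadStates(std::uint32_t a, std::uint32_t b) {
    assert((a & kWriteStates) == 0 && (b & kWriteStates) == 0);
    return a | b;
}
```

With these definitions, IsValidStateMask(kShaderResource | kCopySource) holds, while IsValidStateMask(kRenderTarget | kShaderResource) does not.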
For presentation, a resource must be in the D3D12_RESOURCE_STATE_PRESENT state, which is equal to D3D12_RESOURCE_STATE_COMMON.
There are special rules for resource state promotion from the COMMON state and decay into COMMON. These transitions are implicit and specified to incur no GPU cost.
A barrier can be split so that it spans multiple draw calls:
Split barriers provide hints to the GPU that a resource in state A will next be used in state B sometime later. This gives the GPU the option to optimize the transition workload, possibly reducing or eliminating execution stalls.
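In D3D12 the two halves of a split barrier are marked with D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY and D3D12_RESOURCE_BARRIER_FLAG_END_ONLY. The following is a minimal model of that pairing, not D3D12 code: CommandListModel, its methods, and the State enum are hypothetical names used only to show how unrelated work can be recorded between the two halves.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified model of a command list; not the D3D12 API.
enum class State { RenderTarget, ShaderResource, CopyDest };

class CommandListModel {
public:
    // First half: announce that `res` will go from `before` to `after`.
    void BeginTransition(const std::string& res, State before, State after) {
        assert(pending_.count(res) == 0 && "transition already in flight");
        pending_[res] = std::make_pair(before, after);
    }

    // Unrelated work recorded while a transition may be in flight.
    void Draw() {
        if (!pending_.empty()) ++draws_overlapped_;
    }

    // Second half: must repeat the same before/after pair as the first half.
    bool EndTransition(const std::string& res, State before, State after) {
        auto it = pending_.find(res);
        if (it == pending_.end()) return false;
        if (it->second != std::make_pair(before, after)) return false;
        pending_.erase(it);
        return true;
    }

    int draws_overlapped() const { return draws_overlapped_; }
    bool HasPending() const { return !pending_.empty(); }

private:
    std::map<std::string, std::pair<State, State>> pending_;
    int draws_overlapped_ = 0;
};
```

The counter of overlapped draws stands in for the time window the driver gets to perform the transition without stalling.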
Vulkan
Typical synchronization use-cases
Pipeline barriers
Vulkan has a lot of knobs for configuring barriers in fine detail. For example, the user provides separate masks for the source and destination pipeline stages. By spreading out the source and destination stages, we can give the GPU/driver more time to do the actual transition and minimize stalls.
There are 3 types of barriers:
- Global memory barrier: specifies access flags for all memory objects that exist at the time of its execution.
- Buffer memory barrier: similar to a global barrier, but limited to a specified sub-range of buffer memory.
- Image memory barrier: similar to a global barrier, but limited to a sub-range of image memory. In addition to changing the access flags, an image barrier also includes the transition between image layouts.
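The difference in scope between the three barrier types can be illustrated with plain structs. These are hypothetical, heavily simplified models loosely based on VkMemoryBarrier, VkBufferMemoryBarrier and VkImageMemoryBarrier; the integer ids, the reduced field set and the helper functions are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified models of the three Vulkan barrier structs.
struct GlobalBarrier {
    std::uint32_t src_access, dst_access;
};
struct BufferBarrier {
    std::uint32_t src_access, dst_access;
    int buffer;                  // hypothetical buffer id
    std::uint64_t offset, size;  // affected byte range
};
enum class Layout { Undefined, ColorAttachment, ShaderReadOnly, TransferDst };
struct ImageBarrier {
    std::uint32_t src_access, dst_access;
    int image;                          // hypothetical image id
    std::uint32_t base_mip, mip_count;  // affected sub-range
    Layout old_layout, new_layout;      // only image barriers change layout
};

// Does the barrier apply to bytes [offset, offset + size) of `buffer`?
inline bool AppliesToBuffer(const GlobalBarrier&, int, std::uint64_t,
                            std::uint64_t) {
    return true;  // a global barrier covers all memory objects
}
inline bool AppliesToBuffer(const BufferBarrier& b, int buffer,
                            std::uint64_t offset, std::uint64_t size) {
    assert(size > 0 && b.size > 0);  // empty ranges are not modeled
    return b.buffer == buffer && offset < b.offset + b.size &&
           b.offset < offset + size;
}

// Does the barrier apply to mip level `mip` of `image`?
inline bool AppliesToMip(const ImageBarrier& b, int image, std::uint32_t mip) {
    return b.image == image && mip >= b.base_mip &&
           mip < b.base_mip + b.mip_count;
}
```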
Similarities with D3D12:
- explicit barriers
- both source and destination layout/states are requested, i.e. the driver doesn't track the current layout and expects/trusts the user to insert optimal barriers/transitions
- image sub-resources carry independent layouts that can be changed individually or in bulk
Vulkan can transition to any layout if the current contents are discarded.
Note: barriers also allow resource transitions between queue families.
Implicit barriers
Barriers are inserted automatically between sub-passes of a render pass, based on the following information:
- initial and final layouts provided for each attachment
- a layout provided for each attachment for each sub-pass
- set of sub-pass dependencies, each specifying what parts of what destination sub-pass stages depend on some results of some stages of a source sub-pass
A Vulkan implementation also automatically inserts layout transitions for read-only layouts of a resource used in multiple sub-passes.
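As a rough illustration of how the layout information above determines the implicit transitions, the sketch below derives them for a single attachment. It is a simplified model, not what any driver does: it assumes the attachment is used in every sub-pass, it ignores sub-pass dependencies entirely, and Layout, Transition and DeriveTransitions are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { Undefined, ColorAttachment, ShaderReadOnly, PresentSrc };

struct Transition {
    int at_subpass;  // transition happens before this sub-pass index;
                     // an index equal to the sub-pass count means "at the end"
    Layout from, to;
};

// Hypothetical helper: derive the transitions for one attachment from its
// initial layout, one layout per sub-pass, and its final layout.
inline std::vector<Transition> DeriveTransitions(
    Layout initial, const std::vector<Layout>& per_subpass,
    Layout final_layout) {
    assert(!per_subpass.empty());
    std::vector<Transition> out;
    Layout current = initial;
    for (std::size_t i = 0; i < per_subpass.size(); ++i) {
        if (per_subpass[i] != current) {
            out.push_back({static_cast<int>(i), current, per_subpass[i]});
            current = per_subpass[i];
        }
    }
    if (final_layout != current) {
        out.push_back(
            {static_cast<int>(per_subpass.size()), current, final_layout});
    }
    return out;
}
```

For an attachment that starts Undefined, is rendered to in sub-pass 0, sampled in sub-passes 1 and 2, and ends in PresentSrc, this yields three transitions and no transition between sub-passes 1 and 2.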
Events
A Vulkan event is a synchronization primitive that can be used to define memory dependencies within a command queue. The arguments of vkCmdWaitEvents are almost identical to those of vkCmdPipelineBarrier. The difference is the ability to move the start of the transition earlier in the queue, similar in concept to D3D12 split barriers.
Analysis
Tips for best performance (for AMD):
- combine transitions
- use the most specific state, but also - combine states
- give the driver time to handle the transition
  - D3D12: split barriers
  - Vulkan: vkCmdSetEvent + vkCmdWaitEvents
Nitrous engine (Oxide Games, GDC 2017 presentation slide 36) approach:
- engine is auto-tracking the current state, the user requests new state only
- extended (from D3D12) resource state range that maps to Vulkan barriers
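The auto-tracking idea can be sketched generically as follows. This is not the Nitrous implementation, whose details are not given here; StateTracker, Barrier and the State enum are hypothetical. The caller names only the state it needs, the tracker fills in the previous state, and accumulated barriers are returned in one batch, in line with the "combine transitions" tip above.

```cpp
#include <cassert>
#include <map>
#include <vector>

enum class State { Common, RenderTarget, ShaderResource, CopyDest };

struct Barrier {
    int resource;  // hypothetical resource id
    State before, after;
};

class StateTracker {
public:
    void Register(int resource, State initial) { current_[resource] = initial; }

    // The caller names only the state it needs; the tracker supplies `before`.
    void Require(int resource, State needed) {
        auto it = current_.find(resource);
        assert(it != current_.end() && "resource must be registered first");
        if (it->second == needed) return;  // already in that state: no barrier
        pending_.push_back({resource, it->second, needed});
        it->second = needed;
    }

    // Hands back all accumulated barriers so they can be submitted together.
    std::vector<Barrier> Flush() {
        std::vector<Barrier> out;
        out.swap(pending_);
        return out;
    }

private:
    std::map<int, State> current_;
    std::vector<Barrier> pending_;
};
```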
Overall, in terms of flexibility/configuration, Vulkan barriers >> D3D12 barriers >> Metal. Developers seem to prefer D3D12 style (TODO: confirm with more developers!).
Translation between APIs
Metal API running on D3D12/Vulkan
We'd have to replicate the analysis already done by D3D11 and Metal drivers, but without low-level access to the command buffer structure.
D3D12/Vulkan API running on Metal
All barriers become no-ops.
D3D12 API running on Vulkan
Given that D3D12 appears to have a smaller API surface and a stricter set of allowed resource states (e.g. no multiple read/write states allowed), it seems possible to emulate (conservatively) D3D12 states on top of Vulkan. Prototyping would probably help here to narrow down the fine details.
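A conservative emulation could start from a lookup like the one below. This is only a sketch for images under stated assumptions: all enums are hypothetical stand-ins rather than the real D3D12 or Vulkan definitions, only five states are covered, and the choice to fall back to a general layout for combined read-only states is an assumption that a prototype would need to confirm.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for D3D12 resource states (values illustrative).
enum D3DStateModel : std::uint32_t {
    kRenderTarget        = 1u << 0,
    kUnorderedAccess     = 1u << 1,
    kCopyDest            = 1u << 2,
    kPixelShaderResource = 1u << 3,
    kCopySource          = 1u << 4,
};

// Hypothetical stand-ins for Vulkan image layouts and access flags.
enum class ImageLayoutModel {
    General, ColorAttachment, ShaderReadOnly, TransferSrc, TransferDst
};
enum AccessModel : std::uint32_t {
    kColorRead     = 1u << 0,
    kColorWrite    = 1u << 1,
    kShaderRead    = 1u << 2,
    kShaderWrite   = 1u << 3,
    kTransferRead  = 1u << 4,
    kTransferWrite = 1u << 5,
};

struct ImageStateModel {
    ImageLayoutModel layout;
    std::uint32_t access;
};

inline ImageStateModel TranslateState(std::uint32_t d3d_state) {
    switch (d3d_state) {
        case kRenderTarget:
            return {ImageLayoutModel::ColorAttachment, kColorRead | kColorWrite};
        case kUnorderedAccess:
            return {ImageLayoutModel::General, kShaderRead | kShaderWrite};
        case kCopyDest:
            return {ImageLayoutModel::TransferDst, kTransferWrite};
        case kPixelShaderResource:
            return {ImageLayoutModel::ShaderReadOnly, kShaderRead};
        case kCopySource:
            return {ImageLayoutModel::TransferSrc, kTransferRead};
        default:
            break;
    }
    // Remaining case: a combination of read-only states. Be conservative and
    // use the general layout with the union of the read accesses.
    const std::uint32_t writes = kRenderTarget | kUnorderedAccess | kCopyDest;
    assert((d3d_state & writes) == 0 && d3d_state != 0);
    std::uint32_t access = 0;
    if (d3d_state & kPixelShaderResource) access |= kShaderRead;
    if (d3d_state & kCopySource) access |= kTransferRead;
    return {ImageLayoutModel::General, access};
}
```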
Vulkan API on D3D12
Largely comes down to the following aspects:
- ignoring the given pipeline stages
- translating (image layout, access mask) -> D3D12 resource state
vkCmdWaitEvents should be possible to translate to a D3D12 split barrier, but more experiments are needed to confirm this.
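The two aspects listed above (dropping the pipeline stages and mapping a layout plus access mask to a resource state) might look roughly like this. The enums are hypothetical stand-ins for the real Vulkan and D3D12 definitions, and the rule used for the general layout (shader writes imply unordered access, anything else maps to the common state) is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for Vulkan image layouts and access flags.
enum class ImageLayoutModel {
    General, ColorAttachment, ShaderReadOnly, TransferSrc, TransferDst, PresentSrc
};
enum AccessModel : std::uint32_t {
    kShaderRead    = 1u << 0,
    kShaderWrite   = 1u << 1,
    kTransferRead  = 1u << 2,
    kTransferWrite = 1u << 3,
};

// Hypothetical stand-in for the D3D12 resource states.
enum class D3DStateModel {
    Common, RenderTarget, UnorderedAccess, PixelShaderResource,
    CopySource, CopyDest, Present
};

// The stage masks are accepted but deliberately ignored, since a D3D12
// state transition has no equivalent parameter.
inline D3DStateModel TranslateToD3D(ImageLayoutModel layout,
                                    std::uint32_t access,
                                    std::uint32_t /*src_stages*/,
                                    std::uint32_t /*dst_stages*/) {
    // In this model only the general layout is expected to carry shader writes.
    assert(layout == ImageLayoutModel::General || (access & kShaderWrite) == 0);
    switch (layout) {
        case ImageLayoutModel::ColorAttachment: return D3DStateModel::RenderTarget;
        case ImageLayoutModel::ShaderReadOnly:  return D3DStateModel::PixelShaderResource;
        case ImageLayoutModel::TransferSrc:     return D3DStateModel::CopySource;
        case ImageLayoutModel::TransferDst:     return D3DStateModel::CopyDest;
        case ImageLayoutModel::PresentSrc:      return D3DStateModel::Present;
        case ImageLayoutModel::General:         break;
    }
    // The general layout admits any access, so the access mask decides.
    return (access & kShaderWrite) ? D3DStateModel::UnorderedAccess
                                   : D3DStateModel::Common;
}
```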
Security/corruption issues
We've done some research with IHVs on how the hardware behaves when resources are used with a mismatched resource layout/state, e.g. when an operation expects an image to be in a shader-readable state but the image is not.
The conclusion we reached is that in most situations this workload will end up in either a GPU page fault (crash) or visual corruption with user data. It's relatively straightforward for Vulkan to add an extension, and for IHVs to implement it, that would guarantee security of such mismatched layout access. The extension would be defined similarly to robustBufferAccess and specify the exact behavior of the hardware and the lack of access to non-initialized memory not owned by the current instance.
Automation versus Validation
Inserting optimal Vulkan/D3D12 barriers at the right times appears to be a complex task, especially when taking multiple independent queues into consideration. It requires knowledge ahead of time on how a resource is going to be used in the future, and thus would need us to defer actual command buffer recording until we get more data on how resources are used. This would add more CPU overhead to command recording.
Simply validating that current transitions are sufficient appears to be more feasible, since it doesn't require patching command buffers and that logic can be moved completely into the validation layer.
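A validation pass of this kind can be sketched as a single walk over the recorded commands. The model below covers only one queue and whole-resource states; Command, Validate and the error strings are hypothetical and only illustrate that checking existing transitions needs no look-ahead and no patching.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class State { Common, RenderTarget, ShaderResource, CopyDest };

struct Command {
    enum Kind { kTransition, kUse } kind;
    int resource;  // hypothetical resource id
    State before;  // kTransition: state the user claims; kUse: ignored
    State after;   // kTransition: new state; kUse: state the operation needs
};

// Walks the commands once, tracking the state of each resource, and reports
// every transition with a wrong `before` state and every use in a wrong state.
inline std::vector<std::string> Validate(std::map<int, State> states,
                                         const std::vector<Command>& cmds) {
    std::vector<std::string> errors;
    for (std::size_t i = 0; i < cmds.size(); ++i) {
        const Command& c = cmds[i];
        auto it = states.find(c.resource);
        assert(it != states.end() && "unknown resource");
        if (c.kind == Command::kTransition) {
            if (it->second != c.before) {
                errors.push_back("command " + std::to_string(i) +
                                 ": transition names wrong 'before' state");
            }
            it->second = c.after;
        } else if (it->second != c.after) {
            errors.push_back("command " + std::to_string(i) +
                             ": missing barrier before use");
        }
    }
    return errors;
}
```

The command list itself is never modified, which is what makes this cheaper than automatic barrier insertion.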
Concrete proposals
TODO