Motivation
There have been quite a few requests for pieces of custom blending, but the analysis hasn’t been linked together into a coherent investigation.
Existing issues:
- Advanced blend equations implementable on top of custom blending
- Dual source blending implementable on top of custom blending
- Fragment shader interlock/ordering half of the custom blending use case
- Fragment shader framebuffer fetch the other half of the custom blending use case
The use case is being able to have materials in a rendered scene that are not simply blended (or min’ed or max’ed) with the rest of the scene. This isn’t really relevant to physically-based renderers, but is more relevant for things like cartoon shaders.
This is used in Lumberyard, Just Cause 3, Grid 2, and The Forge.
Similarly, achieving effects like the “vibrancy” effect that’s used all over macOS and iOS would use custom blending. This uses a custom formula to make sure that the foreground is always visible and readable on top of any possible background.
Programmable blending functionality can’t be emulated by either API-level texture barriers or by adding additional render passes, because there’s nowhere to save the intermediate results of overlapping geometry. This investigation is about additional capabilities, rather than additional performance.
Difficulty
There are two distinct pieces here:
- Being able to read from (and write to) the rendering destination
- Because the order of fragment shader execution is undefined, overlapping geometry needs some synchronization for the read/modify/write cycle to be race-free at each pixel.
Unfortunately, support in the various APIs is different for each of these pieces.
Direct3D
Direct3D has no facility for reading from the framebuffer (that I could find). You can, however, bind a texture as a RWTexture, but then your reads and writes are unordered.
In Shader Model 5.1, there’s another object which is a drop-in replacement for RWTextures: RasterizerOrderedTextures. These have the guarantee that all operations on the resource, between any two fragment shader invocations which target the same framebuffer location (and level and sample), are strictly ordered. Beyond that, the ordering is guaranteed to match API submission order.
This means that, if you bind the destination texture as a UAV, rather than binding it as a framebuffer, you can do programmable blending on that resource.
It looks like it’s a requirement that all D3D12 devices support Shader Model 5.1. However, support for Rasterizer Ordered Views is optional; to detect support, check the ROVsSupported field in the result of ID3D12Device::CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS).
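As a rough illustration, a fragment shader doing custom blending against a rasterizer-ordered UAV might look like the following sketch. The names and the blend formula are illustrative, and it assumes the destination texture was created with the unordered-access flag and bound as a UAV rather than as a render target:

```hlsl
// Destination bound as a rasterizer-ordered UAV instead of a render target.
RasterizerOrderedTexture2D<float4> destination : register(u0);

void main(float4 position : SV_Position, float4 color : COLOR0)
{
    uint2 coord = uint2(position.xy);
    // Because the texture is rasterizer-ordered, this read/modify/write is
    // ordered in submission order between overlapping fragments.
    float4 dst = destination[coord];
    destination[coord] = color + dst * (1.0 - color.a); // custom blend formula
}
```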
macOS Metal
Similarly to Direct3D, macOS Metal doesn’t have any facility for reading from the framebuffer. However, you can use the same trick of binding the texture as a texture2d<float, access::read_write> instead of binding it to the framebuffer.
Then, you can mark the texture as belonging to a “Raster Order Group”, which has the same guarantees that RasterizerOrdered resources have in HLSL. You do this by simply annotating the image with [[raster_order_group(0)]].
Unfortunately, not all hardware supports raster order groups, and support isn’t aligned with any of the existing GPU family demarcations; instead, authors have to check device.areRasterOrderGroupsSupported. Also, not all hardware supports access::read_write textures; authors have to query support by checking MTLDevice.readWriteTextureSupport.
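Putting the two annotations together, a macOS Metal fragment shader doing custom blending might look like this sketch (names and blend formula are illustrative; it assumes the device reports raster order group and read-write texture support):

```metal
#include <metal_stdlib>
using namespace metal;

struct FragmentIn {
    float4 position [[position]];
    float4 color;
};

// Destination bound as a read_write texture in raster order group 0,
// rather than attached as a color attachment.
fragment void customBlend(FragmentIn in [[stage_in]],
                          texture2d<float, access::read_write> dst
                              [[texture(0), raster_order_group(0)]])
{
    uint2 coord = uint2(in.position.xy);
    // The raster order group serializes this read/modify/write per pixel
    // across overlapping fragments.
    float4 background = dst.read(coord);
    dst.write(in.color + background * (1.0 - in.color.a), coord);
}
```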
iOS Metal
iOS Metal has the same concept of Raster Order Groups, but extends it to work with the framebuffer. The fragment shader can mark a value as both a framebuffer color and a member of a raster order group by annotating it with [[color(0), raster_order_group(0)]]. It can read this value from the framebuffer by simply repeating the same object as a parameter to the shader:
```metal
struct PixelShaderOutput {
    uint result [[color(0), raster_order_group(0)]];
};

fragment PixelShaderOutput fragmentShader(PixelShaderOutput pixelShaderOutput) {
    ...
}
```
This means that programmable blending works naturally.
Vulkan
The story on Vulkan is much more complicated: KhronosGroup/Vulkan-Ecosystem#27. Nothing is present in pure Vulkan, but there are some extensions:
- VK_EXT_fragment_shader_interlock (GPUInfo says 8% on Windows, 4% on Linux, and 0% on Android): Adds explicit functions for locking and unlocking an implicit mutex. There’s one mutex per pixel/level/sample in the framebuffer. Given that none of the other APIs support explicit locking and unlocking, and the fact that the other APIs’ designs are easier to get right than this kind of explicit API, I’d recommend against adding this design to WebGPU.
- GL_EXT_shader_framebuffer_fetch: Lets you read from the framebuffer, but this is a GL extension, not a Vulkan extension.
- VK_EXT_blend_operation_advanced: Doesn’t allow true programmable blending, but does allow some pre-canned blend equations. Also, the presence of this extension doesn’t mean that the blend equations actually work on overlapping geometry; there’s an extra bit exposed by this extension which represents whether the blend operations are thread-safe with respect to overlapping fragments.
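For reference, the explicit lock/unlock style of VK_EXT_fragment_shader_interlock looks roughly like the following GLSL sketch (the GLSL side is exposed via the GL_ARB_fragment_shader_interlock extension and compiled to SPIR-V; names and the blend formula are illustrative):

```glsl
#version 450
#extension GL_ARB_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba8) uniform image2D destination;
layout(location = 0) in vec4 color;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    beginInvocationInterlockARB();
    // Critical section: ordered per pixel across overlapping fragments.
    vec4 dst = imageLoad(destination, coord);
    imageStore(destination, coord, color + dst * (1.0 - color.a));
    endInvocationInterlockARB();
}
```

Note how easy it would be to misplace the begin/end calls here, which is why the implicit designs in the other APIs are easier to get right.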
OpenGL (just for fun)
ARB_shader_image_load_store includes a memoryBarrier() GLSL function which can be used to order reads and writes to resources. INTEL_fragment_shader_ordering includes a modal API where you can toggle between “all reads/writes are unordered” and “all reads/writes are ordered” by calling beginFragmentShaderOrderingINTEL() at the boundary.