Proposal: Support DP4a as WGSL built-in functions

# Motivation

**Int8 quantization** has become a popular way to reduce the computation and memory bandwidth of real-time deep learning inference on client devices, starting from `float32` or `float16` models. The term “**DP4a**” (8-bit integer Dot Product of 4 Elements and Accumulate) refers to a set of GPU instructions that are widely used to accelerate such int8-quantized models.

DP4a instructions take either two `uint32` values and one `int32` value, or three `uint32` values, as inputs and return one `int32` or `uint32` value. Each of the first two `uint32` values is interpreted as a packed vector of four 8-bit signed or unsigned integers. The instruction computes the dot product of these two vectors and returns the sum of that dot product and the third input value.

For example,
```javascript
var a = 0x01020304u;
var b = 0x02040608u;
var c = 1u;
var output = dot4AddU8Packed(a, b, c); // output == 61u;
```
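To make the semantics concrete, the unsigned variant can be modeled on the CPU. The following is a minimal JavaScript sketch, not driver or compiler code: the name `dot4AddU8PackedRef` is hypothetical, and the final addition is assumed to wrap modulo 2^32.

```javascript
// Hypothetical CPU reference model of an unsigned DP4a with accumulation.
// Assumption: the final addition wraps modulo 2^32.
function dot4AddU8PackedRef(a, b, c) {
  let sum = c >>> 0;
  for (let i = 0; i < 4; i++) {
    const ai = (a >>> (8 * i)) & 0xff; // i-th unsigned 8-bit element of a
    const bi = (b >>> (8 * i)) & 0xff; // i-th unsigned 8-bit element of b
    sum += ai * bi;
  }
  return sum >>> 0; // reduce the result to the u32 range
}

console.log(dot4AddU8PackedRef(0x01020304, 0x02040608, 1)); // 61
```

The products are 4*8, 3*6, 2*4 and 1*2, which sum to 60; adding the accumulator `1` gives 61.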

Because DP4a instructions execute very quickly on GPUs whose ISAs (Instruction Set Architectures) support them, they are already widely used in the industry to accelerate popular AI frameworks (e.g. [Intel](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Accelerate-Deep-Learning-Performance-with-Intel-Xe-Graphics-and/post/1335669), [Nvidia](https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/)).

It should be easy to support DP4a in WGSL as we just need to add two more WGSL built-in functions, whose inputs and outputs are all 32-bit signed or unsigned integers. 

# Requirements to be Standardized in the WebGPU CG / WG

DP4a meets all [the requirements to be standardized in the WebGPU CG / WG](https://github.com/gpuweb/gpuweb/blob/main/process/RequirementsForAdditionalFunctionality.md).

## A proposal for new functionality must be implementable on at least 2 different browser engines.

It is straightforward to implement DP4a in all WGSL compilers. Even on native APIs without direct support for DP4a, we can easily polyfill it (e.g. [DP4a is polyfilled in Mesa](https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/compiler/nir/nir_opcodes.py#L1367) on platforms without hardware-accelerated DP4a instructions).

## A proposal for new functionality must be implementable on at least 2 different native APIs.
It is straightforward to implement DP4a in HLSL, Metal Shading Language and SPIR-V with the method used in Mesa.

The DP4a intrinsics are supported in D3D12 and Vulkan.
- D3D12 [shader model 6.4](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-4-features-for-direct3d-12) supports DP4a (DXIL: `Dot4AddI8Packed` and `Dot4AddU8Packed`).
- Vulkan 1.3 core or [VK_KHR_shader_integer_dot_product](https://github.com/KhronosGroup/Vulkan-Docs/blob/main/proposals/VK_KHR_shader_integer_dot_product.asciidoc) with SPIR-V extension [SPV_KHR_integer_dot_product](https://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/master/extensions/KHR/SPV_KHR_integer_dot_product.html)

Note that two issues arise when DP4a is implemented directly with the native DP4a intrinsics.
1. **Querying whether DP4a is hardware accelerated**

    In Vulkan we can query whether DP4a is hardware accelerated (with [`VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR`](https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR.html)), while on D3D12 we cannot query this information (according to [the Microsoft document](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/hlsl-shader-model-6-4-features-for-direct3d-12), "**consequently, no separate capability bit check is required, beyond assuring the use of Shader Model 6.4**").

2. **Behaviors when there is overflow or underflow in the final addition**

    The behavior on overflow or underflow in the final addition of DP4a instructions differs between D3D12 and Vulkan.
    - Without the accumulation, the result cannot overflow or underflow. See the descriptions of the operations `OpUDotKHR` and `OpSDotKHR` for more details.
    - In `SPV_KHR_integer_dot_product`, the final addition in the operations `OpSDotAccSatKHR` and `OpUDotAccSatKHR` is **saturated**. According to the SPIR-V specification, `SaturatedConversion` indicates that a conversion to an integer type which is outside the representable range of Result Type is **clamped to the nearest representable value of Result Type**. [In Mesa the result is always clamped to the range of `INT` or `UINT`.](https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/compiler/nir/nir_opcodes.py#L1417)
    - On D3D12 the behavior of a potential overflow or underflow in the final accumulation is not documented. For example, on an Nvidia 2080Ti (Windows version 10.0.19044.1586, 30.0.15.1179) and an Intel UHD Graphics 770 (Windows version 10.0.22000.556, 30.0.101.1273), the result of DP4a wraps around on overflow:

```HLSL
RWByteAddressBuffer dst : register(u0);
[numthreads(1, 1, 1)]
void main()
{
    uint a = 0x02;
    uint b = 0x02;
    uint c = 0xFFFFFFFF;
    uint output = dot4add_u8packed(a, b, c);
    dst.Store(0, output);  // output == 3u
}
```
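The difference between the two behaviors can be illustrated with a small CPU-side model. This JavaScript sketch uses hypothetical helper names and models only the final unsigned addition; it is not taken from either API's specification.

```javascript
// Hypothetical helpers modeling the final u32 addition of an unsigned DP4a.
// Both expect inputs that are already in the u32 range [0, 2^32 - 1].
const U32_MAX = 0xffffffff;

// Wrapping add: the result is reduced modulo 2^32,
// matching what the HLSL test above observed.
function addU32Wrapping(dot, acc) {
  return (dot + acc) >>> 0;
}

// Saturating add: the result is clamped to U32_MAX,
// matching the saturated behavior described for OpUDotAccSatKHR.
function addU32Saturating(dot, acc) {
  return Math.min(dot + acc, U32_MAX);
}

const dot = 2 * 2; // dot product of the packed inputs 0x02 and 0x02
console.log(addU32Wrapping(dot, U32_MAX));   // 3
console.log(addU32Saturating(dot, U32_MAX)); // 4294967295
```

With the same inputs, a wrapping implementation returns `3` while a saturating one returns `0xFFFFFFFF`, so the two behaviors cannot be treated as interchangeable.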

## A proposal for new functionality must be implementable on devices created (designed? manufactured?) by at least 2 different device vendors.

The non-hardware-accelerated DP4a built-in functions can be implemented on all Vulkan, D3D12 and Metal platforms.

The following GPUs support DP4a in their ISAs:
- [Intel Gen12+ GPUs](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Accelerate-Deep-Learning-Performance-with-Intel-Xe-Graphics-and/post/1335669)
- [Nvidia GPUs in Pascal architecture and later](https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/)
- AMD GPUs in ["Vega" 7nm Instruction Set Architecture](https://developer.amd.com/wp-content/resources/Vega_7nm_Shader_ISA.pdf) (`V_DOT4_I32_I8` and `V_DOT4_U32_U8`)
- [ARM Valhall](https://people.collabora.com/~alyssa/Valhall-Documentation.pdf) (`IDP.v4s8`, `IDP.v4u8`)

# Proposal

We propose supporting DP4a **without accumulation** as **new WGSL built-in functions in an extension**:

```javascript
dot4U8Packed(e1: u32, e2: u32) -> u32    // OpUDot with 32-bit unsigned integers as inputs
dot4I8Packed(e1: u32, e2: u32) -> i32    // OpSDot with 32-bit unsigned integers as inputs
```
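For reference, the intended semantics of the two functions can be modeled on the CPU. The sketch below is JavaScript with hypothetical names (`byteU8`, `byteI8`, `dot4U8PackedRef`, `dot4I8PackedRef`). It assumes that element `i` of each packed vector occupies bits `8i` to `8i+7`, and it is not an implementation from any WGSL compiler.

```javascript
// Hypothetical CPU reference models of the two proposed built-ins.

// Extracts element i of a packed 32-bit word as an unsigned value in [0, 255].
function byteU8(word, i) {
  return (word >>> (8 * i)) & 0xff;
}

// Extracts element i of a packed 32-bit word as a signed value in [-128, 127].
function byteI8(word, i) {
  return (byteU8(word, i) << 24) >> 24; // sign-extend from 8 bits
}

function dot4U8PackedRef(e1, e2) {
  let sum = 0;
  for (let i = 0; i < 4; i++) sum += byteU8(e1, i) * byteU8(e2, i);
  return sum; // at most 4 * 255 * 255, so it always fits in a u32
}

function dot4I8PackedRef(e1, e2) {
  let sum = 0;
  for (let i = 0; i < 4; i++) sum += byteI8(e1, i) * byteI8(e2, i);
  return sum; // within [-4 * 128 * 127, 4 * 128 * 128], so it always fits in an i32
}

console.log(dot4U8PackedRef(0x01020304, 0x02040608)); // 60
console.log(dot4I8PackedRef(0xffffffff, 0x01010101)); // -4
```

The range comments show why the variants without accumulation cannot overflow: four products of 8-bit values always fit comfortably in 32 bits.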

- **Why “without accumulation”**
The behavior of DP4a with accumulation when the final addition overflows or underflows differs between Vulkan and D3D12. Note that:
    - Intel GPUs support both saturated and non-saturated versions of the DP4a instructions, and it is easy for compilers to optimize `(OpSDot + OpIAdd)` into a single non-saturated DP4a instruction in the Vulkan driver.
    - While it is also useful to support the saturated version of DP4a (`OpSDotAccSatKHR` and `OpUDotAccSatKHR`) for AI applications, on D3D12 we would have to implement it with the code below, which is much more difficult to optimize inside the driver.
```HLSL
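// Assumes INT_MIN and INT_MAX are defined elsewhere as the 32-bit signed limits.
int int32add_saturated(int a, int b) {
    int sum = a + b;
    if (a < 0 && b < 0 && a < sum) {
        // Two negative operands wrapped around: clamp to the minimum.
        sum = INT_MIN;
    } else if (a >= 0 && b >= 0 && sum < a) {
        // Two non-negative operands wrapped around: clamp to the maximum.
        sum = INT_MAX;
    }
    return sum;
}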

int dot4I8PackedAccumulationSaturated(uint a, uint b, int acc)
{
    return int32add_saturated(dot4add_i8packed(a, b, 0), acc);
}
```
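As a sanity check of the overflow tests in `int32add_saturated`, the same logic can be mirrored on the CPU and compared against a plain clamp. This is a JavaScript sketch with hypothetical names; it assumes the 32-bit signed addition wraps around, which `| 0` emulates here.

```javascript
// Hypothetical JavaScript mirror of the saturating i32 add above.
const INT_MIN = -0x80000000;
const INT_MAX = 0x7fffffff;

function int32AddSaturated(a, b) {
  let sum = (a + b) | 0; // wrap to 32 bits, emulating a wrapping i32 add
  if (a < 0 && b < 0 && a < sum) {
    sum = INT_MIN; // negative overflow
  } else if (a >= 0 && b >= 0 && sum < a) {
    sum = INT_MAX; // positive overflow
  }
  return sum;
}

// Reference: add exactly in double precision, then clamp to the i32 range.
function clampToI32(x) {
  return Math.min(Math.max(x, INT_MIN), INT_MAX);
}

console.log(int32AddSaturated(INT_MAX, 1));  // 2147483647
console.log(int32AddSaturated(INT_MIN, -1)); // -2147483648
```

Operands with opposite signs can never overflow, which is why only the two same-sign cases need a check.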

- **Why “as new WGSL built-in functions in an extension”**
We would like to support DP4a to accelerate machine learning workloads, but an emulated version of DP4a does not provide that performance gain. On platforms without hardware acceleration, an implementation can choose not to expose DP4a, which signals that `float32` models may be a better choice.
  - On Vulkan we can rely on the query of `VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR`.
  - On D3D12 we can make the decision based on the Windows version and device ID.


