
Proposal: Support DP4a as WGSL built-in functions #2677

@Jiawei-Shao


Motivation

Int8 quantization has become a popular way to reduce the computation and memory bandwidth of real-time deep learning inference on client devices, starting from float32 or float16 models. The term “DP4a” (8-bit integer Dot Product of 4 elements and Accumulate) refers to a set of GPU instructions that are widely used to accelerate such int8-quantized models.

DP4a instructions take either two uint32 values and one int32 value, or three uint32 values, as inputs, and return one int32 or uint32 value. Each of the first two uint32 values is treated as a packed vector of four 8-bit signed or unsigned integers. The instruction computes the dot product of these two vectors and returns the sum of that dot product and the third input value.

For example,

var a = 0x01020304u;
var b = 0x02040608u;
var c = 1u;
var output = dot4AddU8Packed(a, b, c); // output == 61u;
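The arithmetic behind this example can be checked with a small reference emulation. The sketch below is written in Python for illustration only; the helper names `unpack_u8x4` and `dot4_add_u8_packed` are hypothetical, the lane order (lowest byte first) is an arbitrary choice that does not affect the result, and wrapping the final addition to 32 bits is an assumption rather than specified behavior.

```python
def unpack_u8x4(word: int) -> list[int]:
    """Split a 32-bit word into four unsigned 8-bit lanes, lowest byte first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def dot4_add_u8_packed(a: int, b: int, acc: int) -> int:
    """Emulate unsigned DP4a: dot(a, b) + acc, kept to 32 bits."""
    dot = sum(x * y for x, y in zip(unpack_u8x4(a), unpack_u8x4(b)))
    # Assumption: the final addition wraps modulo 2**32.
    return (dot + acc) & 0xFFFFFFFF


# 1*2 + 2*4 + 3*6 + 4*8 = 60, plus the accumulator 1.
result = dot4_add_u8_packed(0x01020304, 0x02040608, 1)
print(result)  # 61
```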

Because a DP4a instruction executes very quickly on GPUs whose ISAs (Instruction Set Architectures) support it, DP4a is already widely used in the industry to accelerate computation in popular AI frameworks (e.g. Intel, Nvidia).

DP4a should be easy to support in WGSL: we only need to add two WGSL built-in functions whose inputs and outputs are all 32-bit signed or unsigned integers.

Requirements to be Standardized in the WebGPU CG / WG

DP4a meets all the requirements to be standardized in the WebGPU CG / WG.

A proposal for new functionality must be implementable on at least 2 different browser engines.

It is straightforward to implement DP4a in all WGSL compilers. Even on native APIs without direct support for DP4a, we can easily polyfill it (e.g. Mesa polyfills DP4a on platforms without hardware-accelerated DP4a instructions).

A proposal for new functionality must be implementable on at least 2 different native APIs.

It is straightforward to implement DP4a in HLSL, Metal Shading Language and SPIR-V with the method used in Mesa.
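To make the polyfill idea concrete, here is a minimal sketch of how the signed variant could be emulated with only shifts, masks, sign extension and integer multiply-add. It is written in Python for illustration and is not taken from Mesa; the names `sign_extend_8` and `dot4_i8_packed` are hypothetical, and a shader-language version would use the equivalent bit operations of that language.

```python
def sign_extend_8(byte: int) -> int:
    """Interpret an 8-bit value (0..255) as a two's-complement signed integer."""
    return byte - 0x100 if byte & 0x80 else byte


def dot4_i8_packed(a: int, b: int) -> int:
    """Emulate a signed 4x8-bit dot product of two packed 32-bit words."""
    total = 0
    for shift in (0, 8, 16, 24):
        lane_a = sign_extend_8((a >> shift) & 0xFF)
        lane_b = sign_extend_8((b >> shift) & 0xFF)
        total += lane_a * lane_b
    return total


# The top byte of the first operand is 0xFF, i.e. -1 as a signed lane:
# 4*8 + 3*6 + 2*4 + (-1)*2 = 56
print(dot4_i8_packed(0xFF020304, 0x02040608))  # 56
```

The largest possible magnitude is 4 * 128 * 128 = 65536, so the non-accumulating result always fits in a 32-bit signed integer.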

The DP4a intrinsics are supported in D3D12 and Vulkan.

Note that we will run into two issues if we implement DP4a directly with the native DP4a intrinsics.

  1. Query whether DP4a is hardware accelerated

    In Vulkan we can query whether DP4a is hardware accelerated (with VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR), while on D3D12 we can't query such information (according to the Microsoft document, “consequently, no separate capability bit check is required, beyond assuring the use of Shader Model 6.4”).

  2. Behavior on overflow or underflow in the final addition

    The behavior on overflow or underflow in the final addition of DP4a instructions differs between D3D12 and Vulkan.

    • Without the accumulation, the result cannot overflow or underflow. See the descriptions of the operations OpUDotKHR and OpSDotKHR for more details.
    • In SPV_KHR_integer_dot_product, the final addition in the operations OpSDotAccSatKHR and OpUDotAccSatKHR is saturating. According to the SPIR-V specification, SaturatedConversion indicates that a conversion to an integer type which is outside the representable range of Result Type is clamped to the nearest representable value of Result Type. In Mesa the result is always clamped to the range of INT or UINT.
    • On D3D12 the behavior on overflow or underflow in the final accumulation is not documented. For example, on an Nvidia 2080Ti (Windows version 10.0.19044.1586, 30.0.15.1179) and an Intel UHD Graphics 770 (Windows version 10.0.22000.556, 30.0.101.1273), the DP4a result wraps around on overflow:
RWByteAddressBuffer dst : register(u0);
[numthreads(1, 1, 1)]
void main()
{
    uint a = 0x02;
    uint b = 0x02;
    uint c = 0xFFFFFFFF;
    uint output = dot4add_u8packed(a, b, c);
    dst.Store(0, output);  // output == 3u
}
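The difference between the two behaviors can be illustrated with the same inputs as the shader above. The following Python sketch is an emulation only, with hypothetical function names; it models the wrapping result observed on the D3D12 devices listed above next to a saturating final addition of the kind described for OpUDotAccSatKHR.

```python
U32_MAX = 0xFFFFFFFF


def dot4_u8(a: int, b: int) -> int:
    """Unsigned 4x8-bit dot product of two packed 32-bit words."""
    return sum(((a >> s) & 0xFF) * ((b >> s) & 0xFF) for s in (0, 8, 16, 24))


def dot4_add_u8_wrapping(a: int, b: int, acc: int) -> int:
    """Final addition wraps modulo 2**32."""
    return (dot4_u8(a, b) + acc) & U32_MAX


def dot4_add_u8_saturating(a: int, b: int, acc: int) -> int:
    """Final addition clamps to the largest 32-bit unsigned value."""
    return min(dot4_u8(a, b) + acc, U32_MAX)


# Same inputs as the HLSL shader: 2*2 + 0xFFFFFFFF exceeds 32 bits.
print(dot4_add_u8_wrapping(0x02, 0x02, 0xFFFFFFFF))    # 3
print(dot4_add_u8_saturating(0x02, 0x02, 0xFFFFFFFF))  # 4294967295
```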

A proposal for new functionality must be implementable on devices created (designed? manufactured?) by at least 2 different device vendors.

The non-hardware-accelerated DP4a built-in functions can be implemented on all Vulkan, D3D12 and Metal platforms.

The following GPUs support DP4a in their ISAs.

Proposal

We propose supporting DP4a without accumulation as new WGSL built-in functions in an extension:

dot4U8Packed(e1: u32, e2: u32) -> u32    // OpUDot with 32-bit unsigned integers as inputs
dot4I8Packed(e1: u32, e2: u32) -> i32    // OpSDot with 32-bit unsigned integers as inputs
  • Why “without accumulation”
    The behavior of DP4a with accumulation on overflow or underflow in the final addition differs between Vulkan and D3D12. Note that:
    • Intel GPUs support both saturating and non-saturating versions of the DP4a instruction, and it is easy for the compiler in the Vulkan driver to fuse (OpSDot + OpIAdd) into one non-saturating DP4a instruction.
    • While the saturating version of DP4a (OpSDotAccSatKHR and OpUDotAccSatKHR) is also useful for AI applications, on D3D12 we would have to implement it with the code below, which is much harder for the driver to optimize.
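int int32add_saturated(int a, int b) {
    // INT_MIN and INT_MAX are assumed to be defined elsewhere in the shader.
    int sum = a + b;
    if (a < 0 && b < 0 && a < sum) {
        // Two negative operands wrapped around: clamp to the minimum.
        sum = INT_MIN;
    } else if (a >= 0 && b >= 0 && sum < a) {
        // Two non-negative operands wrapped around: clamp to the maximum.
        sum = INT_MAX;
    }
    return sum;
}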

int dot4I8PackedAccumulationSaturated(uint a, uint b, int acc)
{
    return int32add_saturated(dot4add_i8packed(a, b, 0), acc);
}
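As a sanity check on the overflow tests used in this saturating addition, the logic can be mirrored in Python, where integers are unbounded and the 32-bit wrap has to be modeled explicitly. This is only a sketch: `wrap_i32` and `clamp_reference` are hypothetical helpers, and the code assumes the 32-bit addition in the shader wraps in two's complement.

```python
INT_MIN = -(2**31)
INT_MAX = 2**31 - 1


def wrap_i32(value: int) -> int:
    """Reduce an unbounded Python int to a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value & 0x80000000 else value


def int32_add_saturated(a: int, b: int) -> int:
    """Saturating add built from a wrapping add plus sign-based checks."""
    total = wrap_i32(a + b)  # what a wrapping 32-bit add would produce
    if a < 0 and b < 0 and a < total:
        return INT_MIN  # negative overflow
    if a >= 0 and b >= 0 and total < a:
        return INT_MAX  # positive overflow
    return total


def clamp_reference(a: int, b: int) -> int:
    """Exact sum clamped to the 32-bit signed range."""
    return max(INT_MIN, min(INT_MAX, a + b))


samples = [INT_MIN, INT_MIN + 1, -65536, -1, 0, 1, 65536, INT_MAX - 1, INT_MAX]
mismatches = [(a, b) for a in samples for b in samples
              if int32_add_saturated(a, b) != clamp_reference(a, b)]
print(mismatches)  # []
```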
  • Why “as new WGSL built-in functions in an extension”
    We’d like to support DP4a to accelerate machine learning computation, but an emulated version of DP4a won’t deliver that performance gain. On platforms without hardware acceleration we can choose not to expose DP4a, signaling that float32 models may be a better choice.
    • On Vulkan we can rely on querying VkPhysicalDeviceShaderIntegerDotProductPropertiesKHR.
    • On D3D12 we can make the decision based on the Windows version and device ID.
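Because the proposed built-ins omit the accumulation, a caller adds the partial dot products itself. The sketch below, in Python for illustration, shows one way a longer int8 dot product could be built from a 4-element primitive; `pack_i8x4`, `dot4_i8_packed` and `quantized_dot` are hypothetical names, and the emulated `dot4_i8_packed` only stands in for the proposed dot4I8Packed.

```python
def pack_i8x4(values: list[int]) -> int:
    """Pack four signed 8-bit integers into one 32-bit word, lowest byte first."""
    word = 0
    for index, value in enumerate(values):
        word |= (value & 0xFF) << (8 * index)
    return word


def dot4_i8_packed(a: int, b: int) -> int:
    """Emulated stand-in for the proposed signed built-in."""
    total = 0
    for shift in (0, 8, 16, 24):
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        x = x - 0x100 if x & 0x80 else x  # sign-extend each lane
        y = y - 0x100 if y & 0x80 else y
        total += x * y
    return total


def quantized_dot(xs: list[int], ys: list[int]) -> int:
    """Dot product of two int8 vectors whose length is a multiple of 4."""
    assert len(xs) == len(ys) and len(xs) % 4 == 0
    acc = 0
    for i in range(0, len(xs), 4):
        # The caller performs the accumulation explicitly.
        acc += dot4_i8_packed(pack_i8x4(xs[i:i + 4]), pack_i8x4(ys[i:i + 4]))
    return acc


xs = [1, -2, 3, -4, 5, -6, 7, -8]
ys = [8, 7, -6, -5, 4, 3, -2, -1]
print(quantized_dot(xs, ys))  # -8
```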
