
Out-of-bounds clamping brings big perf regression on TFJS ML demos. #1202

@qjia7


Recently, the TFJS ML demos on the WebGPU backend have shown a large performance drop when robust buffer access (RBA) is enabled in Chrome. We extracted a very simple WebGPU matrix multiplication example to show the performance impact. Below is the data we collected on different GPUs.

Benchmark: MatMul (1024x1024)

| Platform | GPU | RBA enabled (ms) | RBA disabled (ms), --disable-dawn-robustness | Iterations |
| --- | --- | --- | --- | --- |
| macOS | Radeon Pro 555X | 6.594 | 5.454 | 400 |
| macOS | Intel Iris Plus (ICL) | 8.022 | 5.137 | 400 |
| Windows | UHD630 (CFL) | 36.98 | 21.807 | 400 |
| Windows | NV GTX1080 Ti | 0.626 | 0.468 | 3000 |

From the table above, RBA accounts for roughly 17% to 41% of the RBA-enabled time on these GPUs (equivalently, a 21% to 70% slowdown relative to running with RBA disabled). We see similar regressions on the ML models.
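As a quick check of those percentages, the sketch below recomputes them from the timings in the table. The `Sample` type and the helper names `overheadShare` and `slowdown` are introduced here for illustration only; the numbers are copied from the table.

```typescript
// Timings (ms) copied from the benchmark table above.
interface Sample {
  gpu: string;
  rbaOn: number;  // RBA enabled
  rbaOff: number; // RBA disabled (--disable-dawn-robustness)
}

const samples: Sample[] = [
  { gpu: "Radeon Pro 555X", rbaOn: 6.594, rbaOff: 5.454 },
  { gpu: "Intel Iris Plus (ICL)", rbaOn: 8.022, rbaOff: 5.137 },
  { gpu: "UHD630 (CFL)", rbaOn: 36.98, rbaOff: 21.807 },
  { gpu: "NV GTX1080 Ti", rbaOn: 0.626, rbaOff: 0.468 },
];

// Share of the RBA-enabled time that disappears when RBA is turned off.
function overheadShare(s: Sample): number {
  return (s.rbaOn - s.rbaOff) / s.rbaOn;
}

// Slowdown measured against the RBA-disabled baseline.
function slowdown(s: Sample): number {
  return (s.rbaOn - s.rbaOff) / s.rbaOff;
}

for (const s of samples) {
  const share = (overheadShare(s) * 100).toFixed(1);
  const slow = (slowdown(s) * 100).toFixed(1);
  console.log(`${s.gpu}: ${share}% of enabled time, ${slow}% slower than disabled`);
}
```

Both ratios are independent of the iteration count, so rows measured with 400 and 3000 iterations can be compared directly.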

Currently, RBA is implemented by inserting a clamp on every array index access, including storage buffer accesses, shared memory array accesses, and local variable array accesses. For example:
acc[innerRow][innerCol] = 0.0f; -> acc[clamp(innerRow, 0, 3)][clamp(innerCol, 0, 3)] = 0.0f;
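To make the effect of that rewrite concrete, here is a minimal CPU-side sketch in TypeScript (not shader code). It assumes a 4x4 accumulator stored as a flat array; `clamp` and `robustStore` are hypothetical helpers that mimic what the rewritten store does, not Dawn's or Tint's actual implementation.

```typescript
// Integer clamp with the same meaning as the shader built-in.
function clamp(i: number, lo: number, hi: number): number {
  return Math.min(Math.max(i, lo), hi);
}

// 4x4 accumulator stored row-major, mirroring `acc` in the example above.
// Filled with 1.0 so that stores of 0.0 are visible.
const acc = new Float32Array(16).fill(1.0);

// Hypothetical helper: the store as it behaves after the RBA rewrite.
// Every index is clamped, even when it is already known to be in range.
function robustStore(row: number, col: number, value: number): void {
  acc[clamp(row, 0, 3) * 4 + clamp(col, 0, 3)] = value;
}

robustStore(1, 2, 0.0);  // in bounds: writes acc[1][2]
robustStore(7, -5, 0.0); // out of bounds: redirected to acc[3][0]
```

The in-bounds store pays for two clamps it does not need, which is the per-access cost the measurements above reflect.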

Out-of-bounds clamping clearly adds significant overhead on various GPUs, especially integrated GPUs.
(Some related topics: gfx-rs/naga#35, gfx-rs/naga#33, gfx-rs/naga#311, gfx-rs/naga#955)

Here are some of our findings/issues that we want to discuss:

  • For D3D12, D3D appears to guarantee that any out-of-bounds resource access returns zero, so we may not need extra checks for D3D resources. For Vulkan, we may be able to use the VK_EXT_robustness2 extension. For Metal, I don't know; we may need to do the checks manually if no suitable extension exists.
  • There has been some discussion that using min instead of clamp may mitigate this issue. We ran some experiments to see how fxc compiles clamp. In the generated assembly, clamp was translated into imax and imin; with optimization enabled, clamp was translated into imin only. So we doubt that switching to min at the shading-language level would yield a meaningful improvement.
  • Is it a problem to clamp all out-of-bounds accesses to the last index for shared memory and buffers? Multiple threads could then write different values to the same memory location, which results in undefined behavior. Maybe we should discard out-of-bounds writes instead.
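The difference between clamping and discarding out-of-bounds writes can be sketched on the CPU as follows. This is only an illustration under stated assumptions: `store`, `loadOrZero`, and `run` are hypothetical helpers, the six "invocations" run sequentially here, and `loadOrZero` models the zero-on-out-of-bounds read behavior that the first point attributes to D3D.

```typescript
type WritePolicy = "clamp" | "discard";

// Hypothetical helper: store `value` at `index` under the chosen policy.
function store(buf: Float32Array, index: number, value: number, policy: WritePolicy): void {
  const last = buf.length - 1;
  const inBounds = index >= 0 && index <= last;
  if (inBounds) {
    buf[index] = value;
  } else if (policy === "clamp") {
    // Out-of-bounds stores are redirected to the nearest valid element.
    buf[Math.min(Math.max(index, 0), last)] = value;
  }
  // Under "discard", an out-of-bounds store is simply dropped.
}

// Hypothetical helper: out-of-bounds loads return zero.
function loadOrZero(buf: Float32Array, index: number): number {
  return index >= 0 && index < buf.length ? buf[index] : 0;
}

// Six simulated invocations each write (id + 1) * 10 to element `id`
// of a 4-element buffer, so ids 4 and 5 are out of bounds.
function run(policy: WritePolicy): Float32Array {
  const buf = new Float32Array(4);
  for (let id = 0; id < 6; id++) {
    store(buf, id, (id + 1) * 10, policy);
  }
  return buf;
}

const clamped = run("clamp");     // ids 3, 4 and 5 all write the last element
const discarded = run("discard"); // only id 3 writes the last element
console.log(Array.from(clamped), Array.from(discarded), loadOrZero(discarded, 9));
```

In this sequential simulation the last writer wins, so the clamped run leaves the value from id 5 in the last element. On a GPU the order of those three writes is not defined, which is the concern raised in the last point; discarding keeps the in-bounds value from id 3.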


Labels: wgsl (WebGPU Shading Language Issues)
