Out-of-bounds clamping brings big perf regression on TFJS ML demos.

Recently, the TFJS ML demos on the WebGPU backend have shown a large performance drop with robust buffer access (RBA) enabled in Chrome. We extracted a very simple WebGPU matrix multiplication [example](https://github.com/gpuweb/gpuweb/files/5486355/matmul.zip) to show the performance impact. The table below shows the data we collected on different GPUs.

Benchmark: MatMul (1024x1024)

OS | GPU | RBA enabled (ms) | RBA disabled (ms), `--disable-dawn-robustness` | Iterations
-- | -- | -- | -- | --
macOS | Radeon Pro 555X | 6.594 | 5.454 | 400
macOS | Intel Iris Plus (ICL) | 8.022 | 5.137 | 400
Windows | UHD630 (CFL) | 36.98 | 21.807 | 400
Windows | NV GTX1080 Ti | 0.626 | 0.468 | 3000

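From the table above, we can see that RBA accounts for roughly 17% to 41% of the run time measured with RBA enabled (equivalently, a 21% to 70% slowdown relative to running with RBA disabled), depending on the GPU. We see similar regressions on the ML models.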
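The quoted percentages can be rechecked directly from the timings in the table. The following is a minimal sketch of that arithmetic; the names `Timing`, `shareOfEnabledTime` and `slowdownVsDisabled` are made up for illustration and are not part of the demo. It shows that the lower figures correspond to the share of the RBA-enabled time, while measuring against the RBA-disabled baseline gives larger numbers.

```typescript
// Timings in milliseconds, copied from the table above.
interface Timing {
  gpu: string;
  rbaOn: number;  // RBA enabled
  rbaOff: number; // RBA disabled (--disable-dawn-robustness)
}

const timings: Timing[] = [
  { gpu: "Radeon Pro 555X", rbaOn: 6.594, rbaOff: 5.454 },
  { gpu: "Intel Iris Plus (ICL)", rbaOn: 8.022, rbaOff: 5.137 },
  { gpu: "UHD630 (CFL)", rbaOn: 36.98, rbaOff: 21.807 },
  { gpu: "NV GTX1080 Ti", rbaOn: 0.626, rbaOff: 0.468 },
];

// Percentage of the RBA-enabled time that disappears when RBA is disabled.
function shareOfEnabledTime(t: Timing): number {
  return ((t.rbaOn - t.rbaOff) / t.rbaOn) * 100;
}

// Percentage slowdown of RBA-enabled relative to the RBA-disabled baseline.
function slowdownVsDisabled(t: Timing): number {
  return ((t.rbaOn - t.rbaOff) / t.rbaOff) * 100;
}

for (const t of timings) {
  console.log(
    `${t.gpu}: ${shareOfEnabledTime(t).toFixed(1)}% of enabled time, ` +
      `${slowdownVsDisabled(t).toFixed(1)}% slower than disabled`
  );
}
```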

Currently, RBA is implemented by inserting `clamp` on every array index access, which includes storage buffer accesses, shared memory array accesses, and local variable array accesses. For example:
`acc[innerRow][innerCol] = 0.0f;` -> `acc[clamp(innerRow, 0, 3)][clamp(innerCol, 0, 3)] = 0.0f;`
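To make the effect of this rewrite concrete, here is a small CPU-side sketch in TypeScript that emulates the clamped store on a 4x4 `acc` array. `clampIndex` and `storeClamped` are hypothetical helpers written for this illustration; they are not what Dawn or the shader compiler actually emits.

```typescript
// Emulates clamp(i, 0, size - 1) as inserted by the robustness pass.
function clampIndex(i: number, size: number): number {
  return Math.min(Math.max(i, 0), size - 1);
}

// 4x4 accumulator, initialised to 1.0 so that stores of 0.0 are visible.
const acc: number[][] = Array.from({ length: 4 }, () =>
  new Array<number>(4).fill(1.0)
);

// Every index goes through clampIndex, even when it is already in bounds.
function storeClamped(
  arr: number[][],
  row: number,
  col: number,
  value: number
): void {
  arr[clampIndex(row, arr.length)][clampIndex(col, arr[0].length)] = value;
}

storeClamped(acc, 2, 1, 0.0);  // in bounds: writes acc[2][1]
storeClamped(acc, 7, -3, 0.0); // out of bounds: redirected to acc[3][0]
```

The in-bounds store pays for two clamps it never needed, which is the per-access cost being measured in this issue.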

Clearly, out-of-bounds clamping adds significant overhead on a range of GPUs, especially on integrated GPUs.
(Some related topics: gfx-rs/naga#35 gfx-rs/naga#33 gfx-rs/naga#311 gfx-rs/naga#955)

Here are some of our findings/issues that we want to discuss:
* For D3D12, it seems that D3D guarantees to return zero for any resource that is accessed out of bounds, so we may not need extra checks for D3D resources. For Vulkan, we may be able to use the `VK_EXT_robustness2` extension. For Metal, I don't know. We may need to do the checks manually if there is no suitable extension.
* There has been some discussion that using `min` instead of `clamp` may mitigate this issue. We ran some experiments to see how fxc compiles `clamp`. In the generated assembly, `clamp` was translated into `imax` and `imin`. With optimization enabled, `clamp` was translated into `imin` only. So we doubt that switching to `min` in the high-level shading language would improve performance much.
* Is it a problem to clamp all out-of-bounds accesses to the last index for shared memory and buffers? Multiple threads could then write different values to the same memory location, which results in undefined behavior. Maybe we should discard out-of-bounds writes instead.
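The difference between the policies discussed above can be sketched on the CPU. This is only an illustration under stated assumptions: the helper names (`writeClamped`, `writeDiscard`, `readOrZero`, `runThreads`) are hypothetical, the "threads" run sequentially here, and on a real GPU the order of colliding writes is not defined.

```typescript
// Policy A: clamp the index, so out-of-bounds writes land on the edge element.
function writeClamped(buf: Float32Array, i: number, v: number): void {
  buf[Math.min(Math.max(i, 0), buf.length - 1)] = v;
}

// Policy B: drop out-of-bounds writes entirely.
function writeDiscard(buf: Float32Array, i: number, v: number): void {
  if (i >= 0 && i < buf.length) {
    buf[i] = v;
  }
}

// Out-of-bounds reads return zero (the behaviour the text attributes to D3D).
function readOrZero(buf: Float32Array, i: number): number {
  return i >= 0 && i < buf.length ? buf[i] : 0;
}

// Simulate `threads` invocations, each writing (id + 1) at index id.
function runThreads(
  size: number,
  threads: number,
  write: (buf: Float32Array, i: number, v: number) => void
): Float32Array {
  const buf = new Float32Array(size);
  for (let id = 0; id < threads; id++) {
    write(buf, id, id + 1);
  }
  return buf;
}

// Six invocations over a four-element buffer: ids 4 and 5 are out of bounds.
const clamped = runThreads(4, 6, writeClamped);   // ids 3, 4, 5 all hit index 3
const discarded = runThreads(4, 6, writeDiscard); // ids 4, 5 are dropped
console.log(Array.from(clamped), Array.from(discarded));
```

With clamping, three invocations target the last element, so its final value depends on execution order. With discarding, each in-bounds element keeps the value written by its own invocation.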
