Recently, the TFJS ML demos on the WebGPU backend have shown a large performance drop when robust buffer access (RBA) is enabled in Chrome. We extracted a very simple WebGPU matrix multiplication example to show the performance impact. Below is the data we collected on different GPUs.
Benchmark: MatMul (1024x1024)

| OS | GPU | Enable RBA (ms) | Disable RBA (ms), `--disable-dawn-robustness` | Iterations |
|---|---|---|---|---|
| MacOS | Radeon Pro 555X | 6.594 | 5.454 | 400 |
| MacOS | Intel Iris Plus (ICL) | 8.022 | 5.137 | 400 |
| Windows | UHD630 (CFL) | 36.98 | 21.807 | 400 |
| Windows | NV GTX1080 Ti | 0.626 | 0.468 | 3000 |
From the table above, we can see perf regressions of roughly 17% to 40% across these GPUs. We see similar regressions on the ML models.
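As a quick check of the arithmetic, the sketch below recomputes the regression from the table in two ways: as the fraction of the RBA-enabled time that goes away when RBA is disabled, and as the slowdown relative to the run without RBA. The type and function names (`BenchmarkRow`, `timeSavedFraction`, `slowdownVsNoRba`) are made up for this illustration; only the timings come from the table.

```typescript
// Timings copied from the table above (ms per 1024x1024 MatMul).
interface BenchmarkRow {
  gpu: string;
  rbaMs: number;   // robust buffer access enabled
  noRbaMs: number; // run with --disable-dawn-robustness
}

const timings: BenchmarkRow[] = [
  { gpu: "Radeon Pro 555X", rbaMs: 6.594, noRbaMs: 5.454 },
  { gpu: "Intel Iris Plus (ICL)", rbaMs: 8.022, noRbaMs: 5.137 },
  { gpu: "UHD630 (CFL)", rbaMs: 36.98, noRbaMs: 21.807 },
  { gpu: "NV GTX1080 Ti", rbaMs: 0.626, noRbaMs: 0.468 },
];

// Fraction of the RBA-enabled time that disappears when RBA is turned off.
function timeSavedFraction(row: BenchmarkRow): number {
  return (row.rbaMs - row.noRbaMs) / row.rbaMs;
}

// Extra time RBA adds, relative to the run without RBA.
function slowdownVsNoRba(row: BenchmarkRow): number {
  return row.rbaMs / row.noRbaMs - 1;
}

for (const row of timings) {
  const saved = (100 * timeSavedFraction(row)).toFixed(1);
  const slower = (100 * slowdownVsNoRba(row)).toFixed(1);
  console.log(`${row.gpu}: ${saved}% of RBA time saved, ${slower}% slower with RBA`);
}
```

With these numbers the first measure spans about 17% to 41%, which matches the range quoted above; measured against the run without RBA, the same data is a slowdown of about 21% to 70%.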
Currently, RBA is implemented by inserting a clamp on every array index access, which includes storage buffer accesses, shared memory array accesses, and local variable array accesses. For example:
`acc[innerRow][innerCol] = 0.0f;` -> `acc[clamp(innerRow, 0, 3)][clamp(innerCol, 0, 3)] = 0.0f;`
Clearly, this out-of-bounds clamping adds significant perf overhead on various GPUs, especially on integrated GPUs.
(Some related topics: gfx-rs/naga#35, gfx-rs/naga#33, gfx-rs/naga#311, gfx-rs/naga#955)
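To make the rewrite concrete, here is a minimal CPU-side emulation of the clamped store from the `acc` example above. It is only a sketch of the semantics, not the code Dawn actually emits; `clampIndex` and `robustStore` are hypothetical helper names, and the 4x4 accumulator is stored flat (row-major) to keep the example short.

```typescript
// Emulates the shader's clamp(i, lo, hi): one max and one min per index.
function clampIndex(i: number, lo: number, hi: number): number {
  return Math.min(Math.max(i, lo), hi);
}

// 4x4 accumulator stored row-major, initialised to 1.0 so stores are visible.
const acc = new Float32Array(16).fill(1.0);

// Robust form of `acc[innerRow][innerCol] = value`: both indices are
// clamped to [0, 3] before the store, whether or not they were in range.
function robustStore(innerRow: number, innerCol: number, value: number): void {
  const row = clampIndex(innerRow, 0, 3);
  const col = clampIndex(innerCol, 0, 3);
  acc[row * 4 + col] = value;
}

robustStore(2, 1, 0.0);  // in bounds: writes element (2, 1)
robustStore(7, -5, 0.0); // out of bounds: redirected to element (3, 0)
```

Note that the in-bounds store pays for the same two clamps as the out-of-bounds one, which is where the per-access overhead comes from.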
Here are some of our findings/issues that we want to discuss:
- For D3D12, it seems that D3D guarantees to return zero for any resource that is accessed out of bounds, so maybe we don't need extra checks for D3D resources. For Vulkan, maybe we can use the `VK_EXT_robustness2` extension. For Metal, I don't know; we may need to do it manually if no suitable extension is available.
- There have been some discussions suggesting that using `min` instead of `clamp` may mitigate this issue. We ran some experiments to see how fxc actually compiles `clamp`, and found that `clamp` is translated into `imax` and `imin` in the generated assembly. With optimization enabled, `clamp` is translated into `imin` only. So we are not sure that using `min` in the high-level shading language would improve performance.
- Is it a problem to clamp all out-of-bounds accesses to the last index for shared memory and buffers? That makes it possible for multiple threads to write different values to the same memory location, which results in undefined behavior. Maybe we should discard out-of-bounds writes instead.
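The last two points can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions: "threads" are modelled as a sequential list of invocation ids, each invocation stores its own id at index `id`, and all names (`OobWritePolicy`, `runStores`, `clampU32`, `minU32`) are hypothetical. It says nothing about what any real driver or compiler does.

```typescript
type OobWritePolicy = "clamp" | "discard";

// Each simulated invocation `id` stores `id` at index `id`.
// `order` is the order in which the invocations happen to execute.
function runStores(bufferLength: number, order: number[], policy: OobWritePolicy): number[] {
  const buffer = new Array<number>(bufferLength).fill(-1);
  for (const id of order) {
    if (id < bufferLength) {
      buffer[id] = id;               // in bounds
    } else if (policy === "clamp") {
      buffer[bufferLength - 1] = id; // redirected onto the last element
    }
    // policy === "discard": the out-of-bounds store is dropped
  }
  return buffer;
}

const forwardOrder = [0, 1, 2, 3, 4, 5];
const reverseOrder = [5, 4, 3, 2, 1, 0];

// Clamping: the last element depends on execution order.
const clampForward = runStores(4, forwardOrder, "clamp"); // [0, 1, 2, 5]
const clampReverse = runStores(4, reverseOrder, "clamp"); // [0, 1, 2, 3]

// Discarding: the result is the same in either order.
const discardForward = runStores(4, forwardOrder, "discard"); // [0, 1, 2, 3]
const discardReverse = runStores(4, reverseOrder, "discard"); // [0, 1, 2, 3]

// For an unsigned index the lower bound of clamp can never apply,
// so clamp(i, 0, hi) and min(i, hi) give the same result.
function clampU32(i: number, hi: number): number {
  return Math.min(Math.max(i >>> 0, 0), hi);
}
function minU32(i: number, hi: number): number {
  return Math.min(i >>> 0, hi);
}

console.log(clampForward, clampReverse, discardForward, discardReverse);
console.log(clampU32(-1, 3), minU32(-1, 3)); // -1 wraps to 4294967295, both yield 3
```

With clamping, the out-of-bounds invocations 4 and 5 collide with invocation 3 on the last element, so the final contents depend on scheduling. With discarding, in-bounds data is never overwritten by an out-of-bounds store.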