GPU Web 2025 04 09
Corentin Wallez edited this page Apr 22, 2025
Chair: CW
Scribe: KR
Location: Google Meet
- Administrivia
- CTS Update
- Add "core-features-and-limits" to spec #5147
- Deliberately standardize the order of features in adapter.features and device.features #5148
- How to deal with limited nr of samplers allowed to be created on native APIs? #5142
- Compat: checking for compat is not backward compatible. #5127
- Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388
- Triage the rest milestone 1 issues
- Agenda for next meeting
- Apple
- Mike Wyrzykowski
- Google
- Corentin Wallez
- Geoff Lang
- Gregg Tavares
- Kai Ninomiya
- Ken Russell
- Stephen White
- Mozilla
- Jim Blandy
- Kelsey Gilbert
- Teodor Tanasoaia
- Albin Bernhardsson
Administrivia
- KN: How to add Compat to the spec? Thinking ~one commit for every Compat restriction, so it's easier to make sure we've hit all the expected parts of the spec for that restriction, and either:
- WG changes the compat branch only via PR and approvals from every browser engine
- Google develops on the compat branch, we organize it into ~one commit per restriction, and then open one big PR that everyone reviews
- JB: Review just before we merge it? So we aren't involved in each individual commit. But if committee doesn't like the way something's phrased, you might have copied that undesirable text in multiple places
- KN: Compat spec isn't exhaustive in its phrasing, either.
- JB: my instinct is that it'll go more smoothly if the committee's involved in individual changes. But I know I have a hard time getting to reviews promptly.
- KN: we can just get the branch in good shape, here are a bunch of commits that are ready. Think it will be fine. Most changes are not that big, spec-wise.
- CW: merging individual changes - given that each issue was discussed individually, I expect landing individual commits to be quick. Then merging the branch in requires WG approval: yes, we feel this is ready.
- KG: that's fine.
- KN: ok, so we'll start opening PRs and sending to the WG for approval?
- KR: what about sending to the editors?
- CW: probably OK to go with the editors. Members of the group should check it, make sure there's nothing that I didn't think was going in.
- KG: we need WG approval when the branch is going in.
- CW: Compat's big, touches a lot, landing as small PRs sounds good.
- KG: I think we should just ask the WG to review each individual part, should be faster than a mega-review.
- KR: what is sufficient review?
- KG: same as always. We'll do our best to get them done.
- CW: Ken's saying, we don't have a meeting per PR.
- KG: we have implementer agreement already, that still counts.
- CW: then the process shouldn't be too onerous to land the individual chunks. 1 week latency per PR would be too much.
- SW: so do you want 1 person from each implementation for each PR?
- JB: yes.
- KG: happy to hear we'll get 1 commit per restriction - much easier to review.
- KN: great, will start doing that soon.
CTS Update
- GT: nothing specific to mention
- CW: our team's been looking at all the CTS failures. Found a bunch of issues / improvements in the CTS. Fixing that.
Add "core-features-and-limits" to spec #5147
- CW: to minimize risk. Basically agreed upon previously (?)
- KN: decided I'd propose putting this in first. We can put this in now, and want to make sure that browsers ship with this.
- JB: Firefox landed this change.
- CW: MW also had a change in flight for WebKit.
- Approved
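A hedged sketch of how a page might use this feature at runtime (the helper name and the illustrative feature sets are ours; `"core-features-and-limits"` is the feature string from #5147):

```typescript
// Returns true when a feature set advertises full Core WebGPU, false for
// a Compat adapter/device, which omits this feature. `features` stands in
// for the set-like exposed by adapter.features or device.features.
function isCoreWebGPU(features: ReadonlySet<string>): boolean {
  return features.has("core-features-and-limits");
}

// Usage sketch with hypothetical feature sets:
const coreAdapter = new Set(["core-features-and-limits", "timestamp-query"]);
const compatAdapter = new Set(["timestamp-query"]);
console.log(isCoreWebGPU(coreAdapter));   // true
console.log(isCoreWebGPU(compatAdapter)); // false
```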
Deliberately standardize the order of features in adapter.features and device.features #5148
- JB: every single deviation eventually causes people to get upset.
- KR: sorted order?
- JB: sure.
- GT: does this include limits? Do those need to be sorted?
- KN: the limits are already an interface that has an order defined.
- GT: should have a test for that, then.
- CW: OK, we can spec the order per standard JS sort.
- KG: there are a bunch of cases where, for example, people will assume something is supported when it's not.
- Approved
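The "standard JS sort" agreed on above is the default `Array.prototype.sort` comparator, i.e. lexicographic by UTF-16 code unit. A minimal sketch (feature names are illustrative):

```typescript
// Enumerate features in the order #5148 standardizes: default JS string
// sort, which compares strings by UTF-16 code unit.
function sortedFeatures(features: ReadonlySet<string>): string[] {
  return [...features].sort(); // default comparator: code-unit order
}

const features = new Set(["timestamp-query", "depth-clip-control", "bgra8unorm-storage"]);
console.log(sortedFeatures(features));
// ["bgra8unorm-storage", "depth-clip-control", "timestamp-query"]
```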
How to deal with limited nr of samplers allowed to be created on native APIs? #5142
- CW: probably low on some iOS devices. Do we do anything? Need it because of LOD min/max clamp.
- KG: we should do something. At least, mandate de-duplication.
- JB: if we do de-duplicate, it's convenient for the developer. Let them rely on it.
- KG: still better if the developer doesn't ask us to make new objects.
- MW: in Metal this is a per-process limit. Any browser using 1 rendering process for multiple tabs, limit is across all of those. One misbehaving site using min/max LOD clamp can exhaust this pretty quickly.
- MW: if you're using argument buffers it's a per-process limit. Argument buffers will probably be needed for bindless. Chrome not currently using argument buffers, so limit wouldn't apply until you start using those.
- CW: can assume Chrome will start using argument buffers soon.
- KG: thought you could only have as few as 96 per argument buffer.
- MW: it's per Metal device. iOS, same value - 96 limit is "global", part of the driver, when you exceed it it'll overwrite another bucket in the table. In WebKit we have a static lock and LRU cache, annoying. Have to track when sampler's released by command buffer, before evicting it, otherwise result of command buffer is incorrect.
- TT: you mentioned blocking on submit. Even with cache, can run into that limit.
- MW: we don't block on submit today but we should. One cmdbuf that uses 96 samplers, takes a long time to run, you try to submit another one, you should be calling waitUntilCompleted. Someone doesn't do that, they'll end up using samplers from the other cmdbuf and you'll get incorrect results. A current bug in WebKit.
- TT: weird that the impl would block on submit for users.
- GT: even worse than that, a given draw could use 96 samplers. Need to split command buffers, too.
- KG: falls into "weird" combination - impl could just say "no". Too weird a combo of things isn't that weird.
- CW: device loss can happen at any time. Impl could just lose the device.
- KG: would really like to not lose the device. We did this for WebGL upon out-of-memory, caused lots of app fragility.
- KG: de-duplication plus LRU cache plus bailing if you do something crazy (that we can test easily).
- TT: is bailing == losing the device?
- KG: no, like an out-of-memory when encoding your draw commands. Finishing the command buffer.
- CW: spec doesn't have a provision for that.
- KG: we need it then. We don't want to lose the device. The other danger - a cross-content cache like this that's observable is a security vulnerability. Our perf characteristics shouldn't be knowable, esp. for a different content process.
- CW: there's this one iOS device that has 96 samplers. Are there a lot of iOS devices with this limitation? How old's the hardware? It's not great, but if it's e.g. only the two oldest supported devices right now - maybe we don't support them.
- MW: the 96 limit applies to Apple4 and Apple5 devices. Apple5 devices are still supported with current iOS updates. Some devices from a few years ago are limited to 512 samplers. Macs limited to 2048 samplers. Limits are higher but still not incredibly high. Even an Apple8 device (iPhone A16 - iPhone 15) has a limit of 1024 samplers across threads in a single process.
- CW: at least it's the same order of magnitude as other limits on desktop. Thanks for info.
- KN: any place we can look this up?
- MW: Metal Feature Set table page 7, max # argument buffers (per stage from argument buffer) - let me double check.
- CW: we could OOM upon command buffer finish?
- KN: aren't OOM and slow performance equal security problems?
- KG: yes. But advice to do that wasn't based on the security problem.
- TT: even if we generate OOM - we must add a destroy() method on sampler or people can't do anything about it.
- KG: they can GC. :) But yes, should add destroy.
- CW: it's not how many you've created, but how many you use in the command buffer.
- GT: as a developer I'd like to know how many I can use without problem.
- CW: impl can warn you when you reach a certain number of unique samplers.
- GT: you could say, max samplers you can use in a command buffer.
- KG / KN: these are global limits though, something else can affect you.
- KG: not a dire security concern, but potentially an impl leak vector. Similar reason why we can't do HTTP caches across origins.
- TT: could we say we don't need to use arg buffers? Need them for bindless, but does the proposal allow putting samplers in the binding array?
- KG:
- CW: there are a number of things that behave differently in Metal for samplers in argument buffers vs. not. But don't know if declaring they can be used in arg buffers will help.
- TT: just unfortunate that other APIs have per-device limits, but we can't do better because Metal's limits are per-process.
- GT: need GPU process per tab :)
- TT: great idea.
- CW: do we agree we'd rather not add a global sampler limit and destroy() on sampler if we can avoid it?
- (agreement)
- CW: how do we let impls bail out in the worst case?
- KG: can generate OOM any time. End of command buffer encoding for example.
- CW: right now we can lose the device - can't generate OOM everywhere. Can change that.
- KG: need a bailout.
- KN: not sure if command buffer finish is the right place. Maybe submit? As painful as it is, I'd prefer we just make this work with command buffer splitting, waits, etc. Think it's possible. Also - talking about not putting things in arg buffers - regardless of bindless, each bind group is both argument buffer and not. Samplers go in non-arg-buffer pile. Is that possible?
- CW: yes.
- TT: AFAIK.
- KN: I'd prefer to do that. As much as we want to use arg buffer samplers because they'll fix problems - we can look at the magnitude of those problems.
- MW: if you use indirect command buffers like WebKit does then you can't use samplers outside arg buffers. But, it's not so difficult to, in the rare case that any legit app wants to do this, make this work. As Kelsey said, could maybe do some timing attacks. That's the primary concern.
- CW: at some point impls have to stall or fail the operation.
- KN: is the stall actually observable? All we do is delay submitting the cmdbuf that wouldn't run anyway because the GPU was already busy. Maybe fine?
- KG: not worried about that.
- KN: either way I think it's a security problem and we need to request a fix from Metal. Need a way to not have this limit. Assume it works this way because devices are singletons. If we could get multiple devices, could do 1 Metal device per GPU device. Wasn't there another thing that was per-process, too? Can't remember. Max buffer size?
- GT: checked three.js, babylon, bevy, playcanvas. No use of min/max LOD clamp, texture min LOD, texture max LOD.
- KG: They're pretty rare.
- CW: want to find a way forward.
- KG: what do we need?
- KR: an error that can be generated at the right place.
- KG: found it surprising that we don't have an error we can generate.
- CW: in WebGPU we wanted to tightly scope error messages. Like push/popErrorScope around texture allocation.
- KG: what if we have a resource allocation failure during command buffer submit?
- CW: we sometimes allocate textures during command buffer allocation, in Chromium, for example. OOMs there are converted to device loss.
- KG: would be better for your impl to go out of spec there.
- KN: no, I think we should wait for work to finish, free some resources, and try again.
- KG: sometimes you try to allocate a buffer for some reason and it fails.
- KN: don't think we should have an operation fail for random reasons.
- CW: what if you have an app that works, but one writeBuffer didn't happen. Device lost == nothing works. Or, this specific point doesn't work.
- KG: maybe it mostly works but some draw calls don't. In Firefox, if an allocation fails, the buffer is truncated. Warnings in the console. Things are a little broken, but not that broken. Understand what you mean about subtle breakage. Snowball effect like NaN contagion.
- KR: think it's better for the program to halt if something internal goes wrong. We've had so many bugs reported where one thing failed and the app broke in subtle ways.
- KG: well, it's an impl choice.
- SW: native code in Chrome's philosophy - it's better to crash than fail because crashes get fixed.
- KG: OOM -> device loss became intolerable because we don't tell you how much memory is available. Testing for that broke the app.
- SW: yes. Canvas 2D put in a fallback in this area. Allocating a texture that fails - what's the effect on the app?
- KG: not ideal, a gray area. I'm trying to advise to generate an OOM even if it isn't in spec, because it's my strong intuition it's the best behavior for users and devs.
- TT: The issue is that if we don't add any limit, then even for DX12 and Vulkan you'd have to put off all sampler creation and bind group creation. It's a model we don't have right now (in our implementation).
- CW: let's timebox this.
- KG: if nobody wants to implement this…I'm just suggesting, don't implement it this way.
- KN: think it's good for us to have these discussions about places where this sort of thing can happen. I'm happy to add more OOMs, internal errors, etc.
- KG: given that's the case - focus on what we agree on rather than what we don't.
- CW: we'll figure this thing out.
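The de-duplication plus LRU scheme discussed above can be sketched as follows, assuming a backend with a hard cap on live native samplers (e.g. 96 on some Metal devices). All names, the descriptor shape, and the `NativeSampler` stand-in are ours, not from any real API:

```typescript
// Stand-in for a backend sampler object with a driver-side slot.
class NativeSampler {
  constructor(public readonly id: number) {}
}

type SamplerDesc = { minFilter: string; magFilter: string; lodMinClamp: number; lodMaxClamp: number };

class SamplerCache {
  // Map preserves insertion order, which we use as LRU order.
  private cache = new Map<string, NativeSampler>();
  private nextId = 0;
  constructor(private readonly capacity: number) {}

  // Canonical key: identical descriptors map to the same native sampler.
  private key(d: SamplerDesc): string {
    return `${d.minFilter}|${d.magFilter}|${d.lodMinClamp}|${d.lodMaxClamp}`;
  }

  get(d: SamplerDesc): NativeSampler {
    const k = this.key(d);
    const hit = this.cache.get(k);
    if (hit) {
      // Refresh LRU position: delete + re-insert moves it to the back.
      this.cache.delete(k);
      this.cache.set(k, hit);
      return hit;
    }
    if (this.cache.size >= this.capacity) {
      // Evict the least-recently-used entry. As MW noted, a real impl
      // must first ensure no in-flight command buffer still references
      // the victim, or rendering results become incorrect.
      const oldest = this.cache.keys().next().value!;
      this.cache.delete(oldest);
    }
    const s = new NativeSampler(this.nextId++);
    this.cache.set(k, s);
    return s;
  }

  get size(): number { return this.cache.size; }
}
```

A single command buffer needing more unique samplers than the cap still has to be handled separately (splitting, waiting, or generating an error), which is the part left open above.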
Compat: checking for compat is not backward compatible. #5127
- KN: obsolete, now that it's landed in browsers.
- GT: until it ships in core in Chrome, people will be putting in bad code that'll be there forever. If that'll happen only for 6 weeks more, I guess that's OK.
- JB: we should talk in the spec about the right way to do this.
- GT: OK.
Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388
- CW: close to a solution. This issue is difficult. Problem - no way to get the optimal perf/# copies for both impls that can triply map buffers, and impls that can't, for when you do map_write. We don't want to copy data from GPU when we can't triply map, so we zero out - or in triply mapped, we give you the GPU's data because doing so is free and everything else is work. My hunch - we are going to be able to triply map buffers in all impls eventually, so we should optimize for that, but don't want to predict that. Think only 70% of the cases in Chromium can do this.
- GT: can you make those two separate cases? You ask for triply mapped, but if we can't give that to you we don't give you that option?
- CW: might not be possible to know in the JavaScript process whether triple mapping will work.
- GT: more that, it's a device/adapter feature. Triply mapped is something you can ask for at adapter/device.
- CW: not sure this works consistently even on a single device. Could gather that data point. If works, makes things much easier.
- CW: shouldn't expose ReBAR to JS because of caching issues.
- JB: so you expect 2 classes of GPUs. When they can share memory efficiently, can also get that memory into the content process efficiently.
- CW: yes.
- CW: if we're willing to make that guess, then the UMA buffer mapping spec gets easier. When you map-write, you always get the previous contents of the buffer because it's triply mapped.
- KN: is there a perf penalty for getting that data? Cache flushes, etc.?
- CW: don't think so. On Vulkan, the one that has the most exposed knobs and which runs on GPUs with the most constraints - might need to call invalidate(). Invalidates parts of the CPU cache. It's fine. (Might be wrong.) Just need to do this when GPU execution's finished - not that often.
- KN: would probably just impact the time it takes to map.
- CW: I'm kind of guessing at this point.
- CW: everyone OK if we try going in that direction? Try asking for impl experience before landing in the spec?
- KN: this extension would be gated on triple mapping?
- CW: yes.
- CW: ok if we move forward? Say this'll be the way this works, but we'll get more impl confidence before landing it?
- JB: so people that have a separate shmem, they'll have an overhead of copying, until they implement triple mapping? Or, don't expose the extension unless triple mapping is available.
- CW: yes. I would advocate we not expose the extension unless you have triple mapping.
- KG: earlier we were thinking UMA shouldn't be an extension (?)
- CW: first meeting we discussed this, thought we were comfortable with this being an extension.
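The direction sketched above - map-write exposes the buffer's previous contents, and the extension simply isn't exposed without triple mapping - can be modeled roughly as below. This is a toy model of the proposed semantics, not the spec's API; the class and method names are ours:

```typescript
// Toy model: on UMA devices with triple mapping, mapping for write exposes
// the buffer's current contents for free; implementations without triple
// mapping don't expose the extension rather than handing back zeroed
// staging memory.
class UmaBuffer {
  private contents: Uint8Array;
  constructor(size: number, private readonly tripleMapped: boolean) {
    this.contents = new Uint8Array(size);
  }
  // Stand-in for GPU-side writes to the buffer.
  gpuWrite(offset: number, bytes: number[]): void {
    this.contents.set(bytes, offset);
  }
  // Map-write: previous contents are visible, no copy needed.
  mapForWrite(): Uint8Array {
    if (!this.tripleMapped) {
      throw new Error("extension not exposed: no triple mapping");
    }
    return this.contents;
  }
}
```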