GPU Web 2025 09 F2F
Find the original doc for the minutes with the images and links here
Chair: CW + JB
Scribe: Many folks, thank you!
Location: Google Toronto, 65 King Street, Toronto, ON M5C 1G3, Canada
**Room:** "Depanneur" on the 2nd floor; note that a Google employee will have to accompany you to take the elevator to that floor.
Date / Time: September 16th and 17th from 9AM to 5PM Toronto-time (EDT).
Remote attendance: Will be possible via this link meet.google.com/fwc-ziyu-wrv
Check-in is 8:30am - 9am
- Non-Googlers: please bring government-issued ID to be checked by reception.
- Googlers: Bring your badge.
Food:
- Catered breakfast and lunch, both days
9:00 - 10:00
- Intro to the F2F (Corentin)
- Roundtable of attendees
- Spec statuses (David and Kai?)
- CTS status (Kai/Gregg?)
10:00 - 11:00
- webgpu.h status (Kai)
- Firefox and wgpu status (JimB/Connor) slides
- Browser statuses (Jim, Mike, Corentin)
- Interop 202X (Corentin)
- Other statuses?
11:00 - 12:00
- WESL slides + QA (Lee and Stefan)
13:30 - 14:30
- Cooperative matrix multiply
14:30 - 15:30
15:30-16:30 Overflow next-tech, can be pushed to the next day if cooperative matrix and bindless take too much time.
- Primitive index and experience needing an explainer (Brandon)
- Finalize Immediates (Corentin, subbing for Shaobo)
- Only the discussion about render bundles remains.
- Finalize Texture swizzle.
- 1. What should the descriptor format be? https://github.com/gpuweb/gpuweb/issues/5296
- 2. Where should the validation happen? https://github.com/gpuweb/gpuweb/issues/5298
- 3. Are there new validation rules needed if the view is multisampled? https://github.com/gpuweb/cts/pull/4427#issuecomment-3248725731 and https://github.com/gpuweb/gpuweb/blob/main/proposals/texture-component-swizzle.md#open-questions
- Support for YUV textures #5281
9:00 - 10:00
- Decide on schedule for the day (Corentin)
- WebGPUReconstruct (Albin)
- W2GPU (Oguz)
- Toucan (Stephen)
- Other presentations?
10:00 - 11:00
- Immediates
- Discuss final topics for compat and next steps
11-12
- WGSL swizzle assign (David)
- Popular demand for mapSync (Kai)
- copyExternalImageToTexture should allow copying to the new 16unorm formats #5289
- Memoryless textures
- Other next tech / overflow from Day 1
ADD YOUR TOPICS
- Support for YUV textures #5281
- Kai: Popular demand for mapSync (Kai)
- Kai: copyExternalImageToTexture should allow copying to the new 16unorm formats #5289
- Peter: Memoryless render targets
- Connor: aliasing of types and buffers in MSL for bindless
- // Kai/Gregg: Texture swizzle (needs Metal investigation)
- Corentin: Get some direction on both open design bindless questions.
- Selected WESL issues: reflection, philosophy re: string templating. (user defined overloading if time available)
- Atomic 64 (unsigned integer) max/min (only) Think nanite viz buffer #5314
- The big question here is can we read the current value (apple). It would appear no?
- Apple
- Mike Wyrzykowski
- Autodesk
- Daniel Crookston
- Google
- Alan Baker
- Antonio Maiorano
- Brandon Jones
- Corentin Wallez
- dan sinclair
- David Neto
- Francois Beaufort
- Geoff Lang
- Gregg Tavares
- James Price
- Kai Ninomiya
- Ken Russell
- Loko Kung
- Stephen White
- Ryan Harrison
- Peter McNeeley
- Intel
- Jiawei Shao
- Jie Chen
- Microsoft
- Chris Bieneman
- Jesse Natalie
- Rafael Cintron
- Mozilla
- Andy Leiserson
- Erich Gubler
- Jim Blandy
- Teodor Tanasoaia
- Nvidia
- Markus Tavenrath
- Unity
- Brendan Duncan
- WESL
- Stefan Brandmair
- Lee Mighdoll
- Albin Bernhardsson (ARM)
- Benjamin Brienen (wgsl-analyzer)
- Charlotte McElwain (Bevy)
- Connor Fitzgerald (wgpu)
- Francois Daoust (W3C)
- Iwo Plaza (TypeGPU)
- Jimmy Moir
- Mehmet Oguz Derin
Intro to the F2F (slides: WebGPU 2025 F2F intro)
- FD: Main difference in having a candidate recommendation is IP protection, call for exclusions, at which point patent policy applies in full to the snapshot
- FD: Rest of web community – horizontal groups – also review the snapshots, in particular privacy and security
- Reviews of just the substantive changes since last review
- CW: So we have to take a snapshot. Question is how often. There is overhead in the reviews, including privacy/fingerprinting discussions.
- JB: Think we should do every year; we don’t want to bump against the limit. And if there’s not much change, then it’s hard to make a big discussion about the fingerprinting issue when there’s no substantive change in that space.
- CW: Agreed, we'll start the process after the face-to-face
- Discussion of how much the WebGPU CTS is testing compared to the Vulkan CTS or the WebGL CTS.
- Testing some things too much, but better than the Vulkan CTS
- Fewer regression tests than the WebGL CTS
- JB: Hoping that eventually GPU manufacturers will start using the WebGPU CTS to validate drivers. It's a bunch of free tests!
- AB: Looked at it at some point, but it's difficult to automate because it's not a native API.
- DN: Tried to push IHVs to adopt the CTS but it needs to have no false positives. Need to improve the quality.
- MT: If they are driver bugs do you report them?
- DS: Yes, when we can
- CW: We report to people, but if there are better channels please let us know where is better to report them
- CF: Knowing the correct channels to report to would be great. Found bug, don't know where to go to get issue reviewed. Not a problem with a game, problem with SPIR-V implementation. List of where to go to report issues would be really really helpful.
- MT: If list of bugs could be made available to workgroup members, IHVs could check those.
- CW: Good thing to keep discussing to figure out how to make the ecosystem better. Let's figure out how to do that.
- KN: The header is now stable thanks to Loko and Connor. The core part is defined in the upstream repo, but extensions are defined in each implementation's header.
- KN: The header is stable but behavior might evolve slightly as we sort it out.
- KN: Status of implementations
- Dawn does the core API, has prebuilt binaries that are showing up. Currently in CI artifacts.
- EmDawnWebGPU is the Emscripten bindings that were forked into Dawn. Maintained separately from Emscripten.
- Core API fixed known show-stopping bugs. The USE_WEBGPU flag in Emscripten is deprecated and will be removed in ~a week. Use the one in Dawn instead.
- wgpu-native (the C bindings) isn't really maintained. wgpu itself is maintained and in great shape, but the C bindings interface does not match the stable header. Needs contributions.
- CF: 1 maintainer who reviews but doesn't do significant work
- PRs needed to get conformant with upstream API.
- KN: The async changes may be large, but we know how to do it now.
- New header stuff to deal with async, difficult but doable.
- JB: In FF 142 shipped WebGPU to Windows users. Not a lot of fuss.
- Fixing bugs we've discovered. Overall went out smoothly.
- Came out August 19th. FF 143 shipped yesterday. (whattrainisitnow.com)
- FF WebGPU implementation is based on WGPU rust crate.
- JS -> C++ generated bindings -> Rust crate wgpu-bindings -> wgpu-core -> wgpu-hal -> {VK, D3D, Metal}
- Next, WebGPU in release on MacOS. (Available in Nightly now). Working out final kinks. Getting security story straight.
- Linux after that (possibly Android)
- JB: WGPU is guts of firefox implementation
- From Rust POV, the goto library to write code against all gpu platforms
- Lots of update {bevy, deno, servo, etc}
- CW: Ruffle had interesting feature requests (pathtracing, pixel local maybe) will they be forwarded to the group?
- CF: Not aware of them. They haven't fully adopted compat and need to integrate, and that causes some issues. And a wait issue: we had a bug in how wait worked; we fixed it as it was UB on Vulkan. We pretend that all submits instantly finish in that case.
- JB: New features:
- ray tracing (to some extent)
- CF: Ray query and some acceleration structures which are kinda validated
- mesh-shaders in review
- bindless
- subgroups (roughly conformant to spec)
- Varying levels of readiness of the features for going to committee, not all features at same levels for validation, design, etc.
- CF: Bindless has no bounds checks, can index off end of array. Was done before we had higher standards for these features
- CF: Mesh and ray tracing are behind unsafe experimental flags
- WGPU has value as a test bed, but more as a what folks want vs a finished design.
- BJ: When new features come in to WGPU, fantastic community test bed, but when something comes through committee it won't always mesh, what's process?
- JB: We break them
- CF: We break every 3 months, on schedule. We don't deal with long term support
- JB: By break, we do a major version bump
- CF: Often, we added a field to a struct. For subgroup, we'll have a name change, we'll just do that and folks code will break and they can fix it.
- BJ: That sounds fantastic.
- CF: We do get complaints. The whole ecosystem has to go at once; the various bindings need the same versions. It costs the ecosystem to have a breaking change, but it would be so difficult, and we lack the tools, to do this without breaking/major changes. Not at the stability level to say we do a major release every 6 vs 3 months. Won't be stable for a long time, if ever.
- BJ: If using WGPU is there some safe subset as a developer that doesn't care about fancy stuff, is there a solid subset?
- JB: Nothing documented, but if you're using the basic WebGPU features they don't change. You just have to know; it's not documented.
- CW: In Dawn we use the chained struct mechanism to make extensions but Rust doesn't seem to have that (could have ::default but annoying). Could tag what's in the WebGPU spec as stable?
- JB: We could …
- CW: The struct members are easy-ish. Will talk about bindless this afternoon. But bindless in the standard will look different from wgpu bindless. Will be a hard transition.
- CF: Don't have things that are that difficult. What we discuss for bindless will be a superset of what we have now. We're extremely limited right now. Not as big a concern. If we added a major restriction, we hoist that onto the users. I added a restriction in v24/25 and made life hard for Bevy and they had to refactor. But now it works right on Mac when it didn't before. It's about talking and weighing the options if we do it: how much will it hurt and how do we avoid that. But we need to evolve and that's our stated policy. We don't promise anything, and that's too much.
- CW: For Dawn, wgpu is much more in the open source world. For Dawn we welcome but don't get as many contributions, so we end up more like a vendor, so we don't have the same community.
- CF: Credit to the Moz folks for us community members breaking and changing things. Making re-vendoring happen despite us changing stuff.
- LM: Are there enable flags in WGSL for native-only features?
- JB: Not WGSL but WGPU. Have lots of enable flags in WGPU that go way beyond what webgpu specifies.
- LM: In shader?
- JB: Not yet
- CF: Do want to do that
- LM: Working on ide support and want users to say "i want portable subset" or special style. Need something to look at.
- CF: We're talking about that.
- MOD: Can use Dawn through webgpu.h; on the Rust side can we switch between wgpu and Dawn?
- CF: Would have to do it yourself. Talk of doing a c backend for WGPU, but there is no use case so no-one has done it. You'd have to bind WGPU yourself.
- CW: From dawn side, haven't heard request. Someone could make a rust frontend to Dawn but we haven't seen it.
- CF: Advantage would be running wgpu test suite against dawn, but also to see who's faster.
- JB: Internally, many large projects: Arcanization to change how memory is managed. Faster compile through dynamic hal dispatch; better compile times and the code is faster (somehow ….). Nice cleanups for keeping resources alive, simpler and 15% faster. Resurrected the deno crate cts_runner which runs the WebGPU CTS on wgpu (in a shell). Huge win. Now it’s normal for a feature contribution to add one additional line to say “now run the CTS test as well”. Shoutout to Arm; this is the kind of thing you want.
- MW: Shipped yesterday on Apple platforms
- Texture tier 1 in webkit main as well
- WebXR in main
- Want 100% pass rate but not a realistic goal
- Current challenges:
- Memory usage on iOS
- 2GB memory limit before the tab is killed if the phone has 8GB of memory, 1.5GB otherwise
- websites don't necessarily pay attention
- webkit main allows 1gb buffer sizes.
- Increases chance of exceeding limits
- Metal compiler compiles both MSL -> IR and PSO -> assembly; we see bugs in both
- What we'd like to do (no timelines/commitments)
- subgroups
- missing features (clip-distance, dual source, primitive-index, texture tier2)
- Improved HDR
- bindless
- Mesh
- Not ready to propose anything
- 11 chrome releases with updates (see blog posts)
- subgroups
- origin trial for compat
- Developed CTS with impl, so some "breaking changes"
- We missed some things; other browsers find them and we fix our conformance issues
- Moved compiler stack from AST to IR based
- We aren't the bottleneck anymore, now it's backend compilers
- Much simpler to write transforms
- LM: Are there design docs?
- JM: In Naga, we’re internally discussing the pain of this kind of work. Might move to something like this. (haha might lose our speed advantage.)
- DS: We have backend-specific dialects. That helps us write backends that are very simple, basically just a format change.
- JB: you don’t get a union-IR.
- DS: This lets us constrain it more. You know the only place you get a SPIR-V image is in the spirv-backend. Makes the core IR simpler.
- JB: But maintaining multiple IRs isn’t a burden?
- DS: Mostly the same; we have backend specific intrinsics files. E.g. methods on textures in HLSL; not in the core.
- CW: There’s a core IR, and then a per-backend IR. Super heavily code generated.
- DN: LLVM uses a union IR and it’s a mess. Hard to validate when you don’t necessarily know what the backend supports. This pattern is much better.
- CW: Uses in Chromium
- Chrome 2D rendering is moving from the first Skia backend, Ganesh, which targets GL/Vulkan, to Graphite, which targets Dawn. So WebGPU will be used for all internal rendering.
- Shipped on Mac ARM and being A/B tested on Windows. A/B testing on the rest of Mac and on Android. Will take years to get shipped everywhere.
- Rendering speed / motionmark / etc
- JB: Why is swapping the backend so great?
- CW: Not just the backend, the algorithms are different and targeted to modern GPUs. Just moving to like 2010 GPU tech with depth buffer, etc
- JB: So skia architecture change
- CW: Yes, Skia rewrite and has a Dawn/WebGPU backend
- CW: Struggling with pipeline compilation for startup. Bunch of efforts on async pipeline compile, etc
- CW: Prototyping lots of things (some for graphite)
- memoryless render targets
- uma buffer mapping
- pixel local storage
- static samplers
- bindless
- subgroup matrices
- texel buffers
- CW: Prototyping before the spec to understand what's possible
- CW: Graphite, webgpu, builtin machine learning in Chrome and Edge
- Chrome uses LiteRT on Dawn
- Edge uses onnx-runtime on Dawn
- CW: Prototyping webgl on webgpu to move Angle and GLES
- Prototyping capture/replay
- Figuring out where it should go
- dawn.node published on npm
- TT: For ONNX, is there a benefit to running on WebGPU instead of native?
- CW: Understanding is that there is a single provider for execution. The goal is that various AI PCs (CPU/NPU) would have their own providers. Would still need a fallback, and that's WebGPU.
- TT: For broader support?
- CF: Less work, that's why we use WGPU instead of all the backends. Easier to target one instead of each individual one
- JB: Then we need to make sure it does magically work everywhere.
WESL update (slides: WESL for the WGSL team, Fall 2025)
- LM: Intends to be a practical language to support community use of WebGPU. Not tied to Rust or TypeScript; neutral on host language, try to support both. Think of every WESL feature as a future WebGPU proposal: something users can try in advance, to help guide these features into something useful.
- LM: Target audience: not aiming to be a niche language. We aspire to be experts in GPU programming and programming languages, but recognize users are spread across the spectrum. Our job is to make fancy stuff easier / more accessible.
- LM: What we've released in M1
- module system
- libraries of shader functions
- conditional compilation
- LM: WESL is a superset. Starts as WGSL and we add
- import from shader code in local files
- imports from libraries on npm and cargo
- conditional compilation with @if
- folks quickly wanted @elseif
- module system
- devs want to split shader code into files
- each file is a module
- worked through most of the semantics on how that works in corner cases
- Enable libraries
- Enable shader functions to be packaged, just shader functions. no overrides/entry points, etc
- Will talk about extending to support more complex libraries
- conditional transpilation
- Both at compile time and at runtime
- to avoid exponential shader explosion (see the sketch right after this list)
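- As an illustration of how the released M1 features fit together, here is a minimal WESL sketch; the file name, import paths, and the `shadows` flag are invented for the example, and the authoritative syntax is the WESL spec.

```
// main.wesl (root module) -- illustrative only; names and paths are invented
import package::lighting::apply_brdf;   // import from a local file in this package
import random_wgsl::pcg_hash;           // import from a published library (npm/cargo)

@if(shadows)                            // conditional compilation, resolved at build time
const SHADOW_SAMPLES: u32 = 16u;

@fragment
fn fs_main() -> @location(0) vec4f {
    let seed = pcg_hash(42u);           // hypothetical library function
    return apply_brdf(seed);            // hypothetical local function
}
```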
- LM: Several tools
- Rust and JS transpilers for WESL -> WGSL
- Plugins for bundlers in TS community
- Treesitter grammar
- New tools coming for
- documentation generator
- language server
- LM: Language server is number 1 requested feature
- Extending tool to allow both WGSL and WESL
- SB: Working on language server
- Started with what does it need to do for parsing
- Needs to be error resilient
- lossless parsing tree
- working on a code formatter, which needs whitespace in the AST
- maintainable
- Generated from grammar file
- performant
- JB: Where does grammar come from?
- SB: From spec, hand written
- LM: Point is it comes from a description, not in code
- SB: With import statements shaders depend on shaders
- Bevy has 200 shader files, doesn't scale for language server unless smart enough to do incremental parsing
- Adopted what rust-analyzer does
- Working on type checking, want all proper type checking and validation rules
- JB: So bevy is using WESL?
- SB: Migrating in next release
- JB: So naga-oil becomes independent
- LM: Think we now support what naga-oil supports; enough that Bevy can move
- JB: Really exciting. Want a kernel vs c-compiler where each side feeds the other
- LM: All bevy examples run in WESL. Compilation flag that users can set to write user code in Bevy. On a development branch where all examples should run.
- BB: We are porting some concepts from rust-analyzer, like the project model. Workspace with dependencies is being implemented now. Instead of putting external deps in an extension settings file, it's in a wesl.toml file. Bevy is strongly considering moving to WESL but the final decision is coming. WESL has a goal of satisfying all Bevy requirements.
- LM: WESL contributions to language server work for both WESL and WGSL
- LM: New stuff:
- Want WESL to stay close to WGSL. Match syntax. Test with CTS.
- Want to follow what the community needs. Trying to pick projects to inspire new features. Q, what other projects should we look at for things where ergonomics could help clean up the code.
- Several libraries
- random_wgsl
- bevy
- lygia
- gridwise
- Done a lot of study of this codebase to say how could we make more features to support it well
- LM: Coming in M2
- @else @elseif
- @param const
- Generates a WGSL `const`
- Done in the community today by string templating or #define. Proposing a unified feature: where you want a `const` to show up, how do we define a way to make it parameterizable (but not an `override`)? Want a const whose value can be set from other modules and from host code (i.e. Rust, TypeScript, etc.), and it would be nice to unify how conditions (@if, etc.) work to use the same facility. Conditions have no way to set a default value in shader code. If we can unify, we get both. (Rough sketch below.)
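- A rough sketch of the idea as discussed; `@param` placement and the exact rules are a proposal under discussion, not final WESL syntax, and the names below are invented.

```
// Proposal sketch only. Unlike `override`, these values are substituted before
// createShaderModule (at link time), so they can also drive @if conditions.
@param const MAX_LIGHTS: u32 = 4u;        // default lives in the shader; host code
@param const USE_SHADOWS: bool = false;   // or an importing module can replace it

@if(USE_SHADOWS)
fn shadow_factor(p: vec3f) -> f32 {
    return 1.0;                           // placeholder body
}
```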
- AB: What's wrong with Overrides? Don't fit purpose?
- LM: Good question, we'll come back to that.
- Our first proposal was to call this override-const.
- Regular overrides don't work because of conditional compile in an @if. We want the WGSL to just have one side of the if, which is submitted. So, this substitution happens before createShaderModule. Override happens at pipeline creation.
- From users point of view, if want to say I have a value to set in host code and inject into shader, conceptually can use either. Don't like that we have 2 ways
- AB: Only thing here that's an issue is the array type (in the example on the slide).
- The type specifier seems to be the real incompatibility with the example.
- LM: There are 2 uses here, suspect conditional compilation will be needed.
- BJ: My expectation is most conditionals won't look like example on slide, Would be multiline blocks of code
- JB: and might refer to conditional params
- AB: Conditional compilation is different from @const
- LM: We conflated them here because we don't have a way to declare those conditions or set them from shader code; we think we need that, and param-const gives us a way to do it.
- AB: Goal is to boil it down to what's actually needed. Cool idea, just want to make sure we understand.
- CW: Difficulty with override, is to make them implementable as backend overridable constants. If we decided to give that up it's a one way choice. Can do it with underlying system overrides?
- AB: Don't think you can
- TT: Constant operations
- AB: Could do a subset but not all of it
- CW: Override is specialized at pipeline creation and tries to map to a backend feature, which is limited in what it can do. This seems to be shader-module-creation overrides. Similar, but we don't care about the underlying feature, so we can have ultimate flexibility.
- DN: Still have pressure for compile-time speed, so don't want to give up the dream of pushing work to createShaderModule. Want to defend that. Even if it's a subset, it can be done ahead of time.
- LM: If we do get a feature like this, should be related for the user.
- JB: Nice that the platform apis give us a maxima. Maybe definition in WGSL should be as much as we can do at module creation time, that's overrides job. If other stuff then we can push up to the earlier phase.
- AB: Just to be clear, override is pipeline creation
- JB: You write it at module creation and provide the value at pipeline creation.
- LM: Want to create specializations to set bools/numerics when importing a module, so you can control from the shader side or set from host code and inject. A `link` generates the WGSL for the shader module.
- KN: Can you `link` the same thing multiple times with different constants?
- LM: Not in the runtime case, but in the shader code case?
- KN: Yes.
- LM: All imports must agree on new value or compile error
- Talked about it, tempted, but initially make it once and only once and try to understand what it would mean to fork
- KN: Then it becomes templates
- SB: Templates or ML style mode
- BJ: Similar question, if you have both forms, both import and override again at import, does it work?
- LM: Would argue that it would. There is also a `publish` that you write in the shader code which says this bit of parameterizable code is visible to the host. If you don't want it visible you don't `publish`.
- JB: So the person importing doesn't have to care that the outside will smash the value
- KN: Difference between library and main code? Can main code publish?
- LM: Can't do that, privilege the root shader file, the file that contains the entry point and that file is in charge.
- CW: So, publish is a way to control what API is exposed to host code
- LM: Yes slides coming
- LM: Interaction with overrides is bothersome, maybe no solution
- Have you discussed things like this to have pre-set constants?
- AB: Conceptually easy to have a new stage, a pre/post stage to shader creation which is link creation (for lack of a better name). Could be new stages added that give different lifetimes of the shader
- CW: So far, anything that isn't in the shader source you do string pasting yourself. How much do we want to build into create shader module.
- AB: If we want a linker there will be a link lifetime. If we don't, we wouldn't.
- LM: Ideally we prototype and you keep it.
- DN: A lot of the fuss with createShaderModule is that numeric overflow creates a shader creation error, and specifying the error cases and the machinery to verify them. There are details.
- LM: Reasons to keep it outside at first: don't want to deal with the filesystem.
- LM:
- @inline return params
- Problem from the Bevy folks. Asked for help trying to clean up one monster file which was difficult to refactor. Root problem: there is a bunch of textures plus conditional compilation, and there are like 49 of these blocks. Want to refactor into functions, but non-constructible types can't be returned. So we want to make a feature where they can be, from a restricted class of functions. You tag the return with `@inline` and the compiler rewrites the return value to the underlying global. Needs to be statically analyzable; compiler error if there is runtime control flow. Now you can refactor a large function into smaller functions. (Rough sketch below.)
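- A sketch of the discussed shape, using a bevy_pbr-style texture selection; the `@inline` attribute name comes from the discussion, but its placement and the statement-level @if/@else are illustrative, and the binding names are invented.

```
@group(2) @binding(0) var base_color_texture: texture_2d<f32>;
@group(2) @binding(1) var detail_texture: texture_2d<f32>;

// Sketch: the compiler statically rewrites callers so the returned handle resolves
// to one of the globals above; runtime control flow here would be a compile error.
@inline
fn pick_base_color() -> texture_2d<f32> {
    @if(use_detail_texture) { return detail_texture; }
    @else                   { return base_color_texture; }
}
```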
- JB: This reminds me of unrestricted pointer args. If the underlying platform can't accept the type in WGSL, we specialize the callee for each pointer we're passing. If we're already doing that work (we haven't done it in Naga yet), it doesn't seem like too much more work to extend it to other things as well. We're already paying the inlining cost.
- JB: Functions are one of our most important abstractions so removing restrictions is a very nice way to solve problems.
- CF: Want to emphasize how important returning textures from functions is. Using WGSL at work, this was a motivation for using Slang. Broke Slang by using control flow; realized Slang uses the optimizer and inlining and tries to do its best. All abstractions like textures just need the ability to be returned from a function. Hard to make a modular codebase without it.
- DN: We did (stadia) the HLSL compiler for DXC for Linux. Lots of HLSL code that does this kind of generality. Wrote recipe that does this kind of optimization and wrote an HLSL cookbook for the patterns that worked. Wasn't possible to have spec that does it statically. Good to make control flow only static conditions. https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/SPIRV-Cookbook.rst
- LM: Imagine it doesn't come up for WGSL as much but for conditional compilation
- JB: WGSL already has the separation between override and const expressions and runtime expressions. So we have this stratification and can use that to build more things. You can use WGSL for this, but these ifs have to be in the right phase.
- LM: The original version was to inline the entire function, but we since narrowed it to just the minimum, the return part. Then if the function has side effects, that stays a function call. The presented code would just go away, but if you imagine writing to the texture, that still has to happen.
- JB: So a mix of static and dynamic.
- LM: Ideally, Could inline the whole function if there is no side effect. Q, is inlining a useful user feature. Inlining a function.
- DN: In the programming model it should not matter to solve the issue. All backend compilers will do whatever.
- LM: So it doesn't need to be a user feature
- DN: Except to work around these constraints
- CW: Yes, and also important for bindless, which returns a texture for a material and you don't want to copy/paste everywhere. Right now in WGSL, without bindless, with global bindings you don't need the feature because you can just get the global. But with bindless / fixed-size binding arrays this changes. Need to assign/return textures from functions.
- AB: Still feels like conditional compilation is the key. Otherwise just having a function that returns without conditionality doesn't gain anything. It's the `@if` and else that make it do something useful.
- CW: I want the base colour of this slot, and there is conditional compilation on the index.
- AB: Not that interesting without conditional compilation. You just return the value for the slot instead of the texture.
- LM: Does make it easier. If you can't pass a texture you can only do it inline. (In bevy_pbr) there are 49 of these blocks in the code. My linter complains if a function is > 39 lines, so having 200-line functions is …
- CF: 2 related things. The thing textures lack is referential transparency (I think that's the name), and it feels bizarre that you have this thing you can't use like another variable. Can't give it a new name, can't do other things. To further what DN said: games are typically HLSL, and the vibe you get is, as long as the compiler can do its optimizations and make DXIL, then who cares. They will do as much as possible and rely on the optimizer to do everything. As long as it can make it work, it's fine. Gives you referential transparency. In Naga people just assign textures to variables because it's easier; let DXC figure it out. By not having that, games folks are like: why can't I use this like a value, it's just a value.
- JB: Term is first-class values.
- CF: You can just treat it like any other value. Another reason why we went to Slang. Just needed to treat every bit of code like every other bit, with functions that take/return values. Barring bugs and bad validation it just works.
- AB: For more context, the cookbook was an entire optimizing compiler. Multiple levels. We asked at the beginning: how much of a compiler is the WGSL translator? A former colleague said every compiler starts simple and becomes an optimizing compiler. It's a difficult question to answer, with technical implementation work to do those things well on all backends. Obviously there is usage, but it will take time to get there. Keep having these discussions on how much we should be optimizing.
- LM: Our favoured scenario is we prototype and you build it. The other option long term is you decide the browser only takes JS and WGSL, and there is a higher-level language on top, like TS; and that's reasonable.
- LM:
- function overloads
- publish
- Comes from the idea that we want to enable libraries to have consts or overrides, but libraries are not in the right position to decide. A library publishes a lot of stuff the app may not use, so the app selects what is used and what's available to be overridden. The library can't necessarily pick names. Solution: the app controls this through the root module. For the root module of the app, by default everything is published, so the simple case just works. Anything in the root file for a demo app flows through and is published. Anything in a library or other module needs to be tagged with `publish`. The app is in charge, and for anything not published we do something reasonable: an override turns into a const, entry points are removed. Then there is no name mangling for libraries exposed to host code. (Rough sketch below.)
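- A tiny sketch of the `publish` idea for a library module; the keyword placement is speculative, the file and function names are invented, and the described behavior (unpublished items stay internal, unpublished overrides become consts) is taken from the discussion above.

```
// blur_lib.wesl -- library module (sketch; not final syntax)
publish override RADIUS: f32 = 4.0;     // host-visible; stays overridable when published
publish fn blur(uv: vec2f) -> vec4f {   // callable / reflectable from the app
    return vec4f(uv, 0.0, 1.0);
}

fn weight(i: u32) -> f32 {              // not published: internal, free to be renamed
    return 1.0 / f32(i + 1u);
}
```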
- wesl.toml
- LM: What we'd like from you:
- More sample code
- More early adopters
- Source maps
- Really want to see first-class support for languages that transpile to WGSL. Doesn't need to be the browser format for source maps, but the idea is that when you call createShaderModule we want users to see error messages refer to the original source language. There might be multiple stages there, e.g. TypeGPU -> WESL -> WGSL. Need to make that mapping, and maybe a mix of TypeGPU and WESL coming together. Eventually we'd like downstream tools like RenderDoc and the Metal debugger to receive our source as well.
- JB: Added to agenda doc to link to naga source map support.
- Naga should support source maps #7463
- KN: Adding link to spec issue for source map support
- Reintroduce source maps #4844
- LM: Lots of discussions about syntax and things.
- AB: Good impetus to have some WGSL meetings.
- CW: For source maps the question is where it lives. You can always do that as a user-side library
- LM: We did do that and it's pretty hairy. How do you get the right errors in the browser.
- CW: In devtools?
- SB: You have to intercept where the errors get reported
- CF: We did that at my old work and it was hairy.
- LM: Never great to do that, hard to work with. How to make work with rust, etc. All possible.
- CF: +1 for source maps in general. In native if debugging against renderdoc need debug information or it's hard to figure out the code the SPIR-V came from. Not fun.
- LM: Not just WESL, for others making specialized languages like rust-gpu or type-gpu. Will have same problem.
- LM: We'd love to collect more ideas
- CW: Tomorrow morning a few more discussions, but open afternoon, So, over today/tomorrow think of different things to discuss and make a list of interesting in-person topics and we can do them tomorrow afternoon.
Cooperative matrix multiply (slides: 2025-09-16 WebGPU F2F - Subgroup Matrix)
- AB: Why do we care? ML
- DN: Matmul is using thousands of multiplies
- AB: Get memory bandwidth limited fast. Use tiling to take advantage of efficient memory. Subgroup operations..
- AB: What are the speedups?
- ~35x from good scalar to coop matrix
- AB: What are the features for WebGPU? High or low level? Depends on the target and what you want to expose. May want multiple features. Experimental features are much more low level targeting middleware instead of end users. Difficult to get best performance out of it.
- AB: Chrome’s experiment:
- Intersection of what’s available on Vulkan and Metal last year. Not available on D3D yet.
- Adapters expose many feature levels, and not many features are commonly available across all of them.
- JB: Hard to write really cross platform? Yes
- AB: Also difficult with uniformity and derivatives.
- AB: 3 new types. End up as a new shader scalar type in WGSL. Avoided writing multiple extensions to cover the native data types and abstracts some hardware. No conversion between these types.
- JB: Result is different type from operands? Yes, different types in SPIRV depending on left/right.
- This gets hidden from developers
- CF: Does this affect the layout in memory or internals of the compiler?
- AB: We’ll get to that. Native APIs make this fairly opaque.
- DN: Sometimes component type is different than the accumulation type.
- AB: For integer stuff, it’s usually small integer into a larger integer accumulation, floating point is usually just f32 or f64
- AB: subgroup_id needed to use this API in a reasonable way. Microsoft documentation is insufficient currently to know if it exists
- Jesse Natalie: there isn’t this in D3D right now, but we should.
- ChrisB: Please file a bug on this on hlsl-specs. We’ll get to it.
- CW: Is this just missing docs or will this require future hardware?
- Nothing in the spec that allows this mapping currently. Works for 1D but 2D is not specified. Possible to polyfill with group shared counter.
- AB: Yeah, that’s how we wrote the CTS tests for subgroups. Will file the issue.
- AB: Load and store builtins. Take an array of the scalar type.
- JB: Are the colMajor params const?
- AB: Will talk about that later.
- Load and store between storage or workgroup address spaces
- DS: There is an offset parameter which is where you start reading, and then the stride from there
- AB: Take parameters for num rows/cols so that block loads are possible.
- CW: Do these APIs take advantage of hardware that does block loads? AB: Yes, all maps to intrinsics
- AB: Matrix multiply and matrix multiply-accumulate functions (see the sketch after this discussion)
- Templated on result type. Often accumulate to different sized result. L, R, RT types.
- AB: Scalar arithmetic builtins. Add, sub, mult. Didn’t do divide because it was unclear if Metal would add them.
- AB: Scalar value here is the shader scalar type. Validated to be in range.
- CW: Shouldn’t divide be possible with reciprocal polyfills?
- AB: Not always possible for integers. In Metal for example need to have full matrices of predefined values to polyfill.
- CF: Must the value S be workgroup uniform?
- AB: Yes, ‘value’ must be workgroup-uniform. Unfortunately not just subgroup-uniform.
- JB: it only really needs to be subgroup uniform, but we don’t have the machinery yet.
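- A compute-shader sketch of the experimental surface described above; the type and builtin names follow Chrome's experimental proposal as presented here, but the exact names, signatures, and enable directive are not final.

```
enable chromium_experimental_subgroup_matrix;   // experimental; naming not final

@group(0) @binding(0) var<storage, read>       lhs : array<f32>;
@group(0) @binding(1) var<storage, read>       rhs : array<f32>;
@group(0) @binding(2) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(64)
fn main() {
    // Load one 8x8 tile of each operand: (pointer, element offset, column-major?, stride).
    let a = subgroupMatrixLoad<subgroup_matrix_left<f32, 8, 8>>(&lhs, 0u, false, 8u);
    let b = subgroupMatrixLoad<subgroup_matrix_right<f32, 8, 8>>(&rhs, 0u, false, 8u);

    // Zero accumulator, then multiply-accumulate. Note the scalar-arithmetic builtins
    // (not shown) take a value that must be workgroup-uniform, per the discussion above.
    var acc = subgroup_matrix_result<f32, 8, 8>();
    acc = subgroupMatrixMultiplyAccumulate(a, b, acc);

    subgroupMatrixStore(&out, 0u, acc, false, 8u);
}
```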
- AB: Experimental results
- AB: Credit to James. Wrote a matrix multiplication benchmark. Very finicky
- Compared on nvidia hardware which has tensor units and against well written native vulkan.
- The experiment does way better than not using the native tensor unit, slightly worse than native Vulkan.
- On an M1 without dedicated tensor units, ~80% perf
- The tiled scalar impl gets very good result; no tensor units in the HW.
- Don’t lose too much perf if we only have to do clamping (don’t do predication).
- Lose a ton of performance if we have to use predication (~24% perf)
- Could potentially use clamping only based on heuristics but may need to fall back to slow path that does full robustness checks.
- Not a simple feature to plug in and get great performance on M1.
- AB: Experiment results: ONNX.
- Got about 3x perf improvement by using Dawn with subgroups under ONNX.
- AB: Future SPIR-V:
- More types: bfloat16 and float8.
- Vendors have agreed that these are reasonable formats.
- e4m3 and e5m2 seem to be accepted by industry as reasonable types.
- Vendor extensions:
- Cooperative matrix 2.
- Release by NVIDIA
- Biggest feature is workgroup scope instead of subgroup. Much easier to write code at this level. Looks like a regular matrix multiply. Headaches are pushed down into the compiler which knows the best memory configurations to use. Unclear how much mem is used but this is the tradeoff.
- Many new configurations for sizes.
- Reductions across rows or columns, reduce to 1 element. Pass a function call and it just gets done.
- No guarantees on L/R ordering.
- Per element operations. Can iterate over values held in each thread but no info on row/col that was being referenced so operations had to ignore. Now the row/col is provided and much more complicated algorithms are possible (layer fusions?). Very helpful to know the coordinates.
- Tensor addressing
- Block loading and sharing parts of the values.
- Everything except per-element operations can be polyfilled. Not clear that we would want to do this but it is possible.
- Per-element operations are only theoretically possible to polyfill; needs hard-coded per-hardware info.
- QC cooperative matrix conversion
- Array bitcasting
- Array slicing
- Composition and decomposition, create matrix from vectors stored in different threads.
- In base extensions, usually loading out of memory or scalars
- Not sure if these will be widely adopted in SPIRV
- HLSL Linalg Matrix
- In development, possible release in SM6.10. Available for testing now.
- Core functionality is subgroup matrix functionality plus:
- Threadgroup scope
- Comp/decomp
- More data types. 16 bit integers, … types not available on other platforms
- Interactions with their coop vector feature. “Neural shading”
- Arbitrary length vectors, better syntax than QC version.
- No API side yet, not sure what versions targeted.
- Will be queryable from driver.
- CB: composition/decomposition has been pulled out because it’s expensive for some HW vendors.
- CB: will finalize in about 6 weeks. Would love feedback to make sure we’re on the right track.
- Metal
- Metal 4 added tensors in two types.
- Cooperative tensors shared across threads
- ..
- Built-in robustness in Metal. Wouldn’t need to validate that all loads are valid.
- Harken back to the earlier perf slide: shows this would be a huge win.
- Scalable in terms of the simd size and can do subgroup to workgroup sizes.
- Windows and strides can be specified. Must only be done on threadgroup-level tensors.
- Standardization
- All platforms should be able to implement the experimental extension in WGSL/Dawn. D3D is only unknown right now.
- Don’t think we’ll get any new common ground for a long time.
- Argue what we’ve implemented is enough for us to make a feature we should ship. Focus for now on the intersection functionality of what’s currently available.
- The future is still unfolding.
- Future features that may be added:
- Workgroup matrices. Not supported in SPIRV yet.
- Tensor addressing is useful to make everything easier to use. Polyfillable to interop well with other higher level ML frameworks.
- Use conversion
- Reductions
- Per element operations. Requires platform support.
- All these come with caveats for polyfilling. (example slide for polyfilling)
- Some are particularly complex. Lambdas.
- Per element operations, could polyfill nvidia or qc way.
- CB: the new HLSL proposal is more like the Metal4 feature, where you iterate and then get the coordinates from it. (??)
- CB: Question we still have for HW vendors is when you iterate like that, are you guaranteed to visit each element exactly once.
- AB: Hope so.
- CB: Often it’s ok to repeat. But anything with side effects would be bad.
- AB: A reduction could be implemented with this kind of thing, for example.
- AB: Wanted to give a flavour of what polyfilling looks like. The amount of work.
- AB: Every platform will have a way of doing per-element operations eventually. Fingers crossed for SPIRV. Very useful for some ML and fusions.
- Example slides give a taste of how much work is needed
- CW: Back to what do we do for standardization. What do you want us to talk about in this F2F, or direction after the F2F. Your suggestion is coopmat1-like is available everywhere and useful, and we should standardize that. Then after that have targeted polyfils because we know they’re coming in the future?
- AB: We should investigate if those polyfills are worth doing. Should check and see how bad they are. Able to polyfill vs performant is unknown. Want to see per element operations in an extension but may want to wait and see.
- CW: It's not all coalesced in the native APIs yet, and here we're leading the native APIs a bit more than for other WebGPU features. Coopmat1 seems gelled. A bit worried about standardizing too far ahead.
- AB: Don’t expect it to coalesce for another year at least. This gives us an idea of where we want it to go. The workgroup-scope feature puts it into the hands of more users “webby”. Can decide what the line is, may want to wait until we have to polyfill on only one platform.
- CB: We had a discussion with a HW vendor. The gist was, if you used the feature set of coopmat1 to implement workgroup-scope matrices as purely a software feature, theoretically possible, we asked how much performance do you leave on the table. The answer was: a lot. If the driver knows it’s doing wg scope matrix, then driver can use a lot more tricks that are not exposed in the shader programming model. It was almost twice as much throughput.
- AB: That’s interesting. Triton language did this kind of thing on their own, and got good speedup. May depend on HW vendors.
- CB: Triton has advantage that they can target PTX directly, so they can exploit those non-general features, i.e. the ones that aren’t accessible to SPIR-V.
- JB: All this stuff is based on changes within a workgroup but there are no changes that add workgroup communications? No.
- JB: Now, about the way people want to present more AI and compute to people: isn't there a risk that, because of power constraints etc., some non-GPU kind of device would be used, and so this puts a limited lifetime on this stuff?
- AB: Your phone probably has some version of that already. The trickiness is interop. E.g. compositing to the screen may keep the advantage of keeping the data on the GPU. The question is not answered. Not everything will solidify within 5 years for a single type of chip. GPUs for ML are not going away. Think ML on GPU will live longer than 5 years.
- Flexibility of GPU is the long term benefit
- CW: There’s WebNN, for NPUs. They added a set of operators, e.g. convolution was the thing, but then stable diffusion was a thing, then transformers. The landscape kept changing. GPUs are always not great but not terrible; they are very programmable.
- MarkusT: What you’re doing is the correct thing. There are multiple layers. Coopmat1. There is also coopvectors, used by neural shading. Higher layers are wg-size matrix, and giant whole-GPU matrix multiply. Everything which is done here will not be deprecated by WebNN.
- Could eventually have library of ML operations in libraries like WESL which knows how to dispatch different operations.
- CW: To answer that: yes please, we’d love to have libraries to expose powerful algorithms for any developer. That’s why Google invested in R&D around decoupled fallback in John Owens research. We wanted to explore high perf algorithms and exposed in a reusable way. That overlapped with WESL work. Ultimately we want something like CUB but for WebGPU.
- MT: There’s WGML you should look at: WGML, everything you need for llama.cpp. Also SLAI, using the Slang language, by Sébastien Crozet, at the GOSIM conference. Shaders are written in Slang and compiled to whatever targets are available (WGSL, …).
- https://paris2025.gosim.org/schedule/wgml-the-story-of-building-a-new-high-performance-cross-platform-on-device-inference-framework/
- Single-source cross-platform GPU LLM inference with Slang and Rust https://hangzhou2025.gosim.org/schedule/single-source-cross-platform-gpu-llm-inference-with-slang-and-rust/
- https://github.com/dimforge/slai (demo link will go up soon)
- MT: Important to think about programmable decoding of data. E.g. GGML has custom block-based compression and scaling scheme. Need a way to have programmable decode of the values.
- AB: We rely on what’s exposed in platforms, so expose something and we’ll use it.
- CB: We can always make a cooperative matrix 2 if the industry pivots after we ship and it turns into a big mess.
- AB: I don’t think this will be a big mess with the current design.
- JB: Do you have spec language for this?
- AB: Almost.
- DN: Builtins and types are stable. Subgroup uniformity is another thing, That will take work because it makes uniformity analysis more complicated.
Bindless (slides: WebGPU Bindless 2025 F2F)
- CW: Bindless was the biggest thing we planned at the last F2F
- CW: With current WebGPU, limited set of resources for each shader invocation. As soon as you want to do scene global rendering in each shader invocation, like in nanite, you want to store all potential resources in the whole scene in a full screen draw. Need to get all the materials, texture information etc and 16 is not enough.
- CW: Many other future features like ray tracing want scene global resources.
- CW: Bindless lets you address all resources in the shader and do scene global resources.
- CW: Bind groups are perf heavy to manipulate. This can avoid those costs.
- CW: Core WebGPU API was already 10 years behind state of the art. This helps catch up a bit.
- CW: MJP, well respected graphics engineer. Did a retrospective. Went all in on bindless and questions why you would do anything else. Don’t have to wrestle with bindgroups anymore.
- CW: Current status of bindless and wgpu.
- Last F2F: small investigation; Connor and Jasper iterated on it a bit afterwards.
- In wgpu: some features to enable bindless, used by Bevy. No validation; crashes if used wrong. Homogeneous: all bindless resources are of the same "kind", e.g. sampled textures. Bindless bindgroups are immutable, so new bindgroups need to be created to change anything. wgpu usage scope tracking is used and is slow.
- CF: Hitting limits of how many resources can be bound at once
- CW: Dawn/tint prototype does validation in the shader. Currently homogeneous, can be heterogeneous. Bind groups are currently mutable, have a proposal to make them immutable. Fast usage tracking.
- CW: F2F goals of agreeing on general design. After F2F start to work on companion features + prototypes. Ship next year?!
- Bindless in underlying APIs.
- D3D12
- Designed with bindless in mind unlike VK and Metal. Descriptor heaps contain many kinds of descriptors. Ranges of descriptor heap are bound to shaders “root descriptor table”. Can reference multiple tables to get full bindless.
- Stride of descriptors is constant for a device so homogeneous bindless is easy.
- Descriptor heaps can only be updated on the CPU, copies from a staging buffer to the GPU heap. No Queue operations to do this and can’t race with what the GPU is reading.
- HLSL uses an unbounded array of resources; you can index it. Can declare a layout. Multiple registers can point to the same descriptor table with different types. This allows heterogeneity.
- SM6.6 adds dynamic resource. Allows direct indexing into the heaps
- Metal
- Bind groups are Metal argument buffers. Filled with an argument encoder which knows the size of resource bindings.
- Need to tell driver which resources need to be resident so they are visible to the GPU.
- Argument buffers are just buffers and can be copied on the GPU.
- Don’t know how to support heterogeneous bindless on Metal yet. Can you just alias different types?
- MSL declares a structure or array and you index it.
- Vulkan
- Extension that became core in 1.2 but still an optional feature.
- Declare bind group entries as dynamic sized. Must be the last element and takes up the rest of the space. Similar to current prototype.
- When you create a descriptor set, then the size is set.
- VK has 1239478 features that need to be checked, and bindless is enabled by enabling multiple features.
- No heterogeneous bindless. Need to declare the resource type.
- No residency management.
- Descriptor updates must be done on the CPU. cannot race with the GPU.
- Many additional VK extensions that make heterogeneous bindless possible, GPU updates of descriptors, etc. These are not very prevalent.
- SPIRV: Non-uniform accesses of resources in SPIRV are an optional feature. Probably not possible to scalarize.
- SPIRV: Multiple bindings can alias a bind point for heterogeneous.
- GL
- NO PLANS. Probably possible.
- Bindgroup mutability
- Dependent on how implementations record commands.
- Webkit records commands to Metal immediately.
- WGPU does the same. May change in the future.
- Dawn does everything at queue submit time. Avoids some difficult cases.
- When SetBindGroup is called with bindless, typically a GPU pointer is recorded. Ex: in D3D12 a pointer to the descriptor table, ..
- No easy way to “patch” an existing command recording because it’s a GPU pointer.
- Device timeline updates must not race with the GPU.
- Means that bindgroups that are set in SetBindGroup cannot be “replaced” with a different object just before submit. Incompatible with native APIs and cannot be patched. Cannot shadow bind groups because of WebKit and WGPU’s implementations
- Cannot set a binding over an existing binding until you know that the GPU is finished using it. Must ensure the GPU cannot use a descriptor from a bindless bindgroup before a new one is written.
- Therefore: Bindings cannot be overwritten before onSubmittedWorkDone, since the last time they were visible on the queue timeline.
- Dawn’s prototype
- Either landed, or patches in review.
- Naming is not final.
- C++ API for now, showing JS as 1-1 correspondence.
- Not exactly happy with additional state, but didn’t find a good alternative. Welcome ideas.
- New optional features and limits..
- Bindless part of bind groups is called “dynamic binding array”
- Added limits for the max number of resources in dynamic binding arrays
- Do we expose homogeneous or put it in a limit?
- CF: We need a limit for samplers; they’re often limited to 1K samplers.
- CW: Can virtualize, so let’s put that to the side.
- Creating a bindless bindgroup requires creating a bindless bindgroup layout
- Currently have to specify the resource type, while it's homogeneous-only.
- Can reflect things from the shader and get the bind group layout.
- BJ: Always has to be the last element
- CW: Yes, don’t know how many elements yet.
- BJ: So if you have two heaps then you need two bindgroups, ok.
- CW: When creating the bind group, specify the sizes of the dynamic arrays. Entries in layout can be sparse.
- CW: Heterogenous needs to know some texture usage information up front.
- CW: Added a .Destroy to reclaim memory on *only* bindless bind groups. They may have thousands of entries and need reclamation.
- CW: Updating: Not prototyped yet. Big open questions: how to allow updates or copying? Needs to be possible for wgpu. We have to validate that when updating a slot, the GPU is not using that slot right now. Proposal: track, for each slot, when it was last used on the GPU. Allows updates but throws if it would be a race. Lets the browser return an available slot.
- CW: Companion feature to clone bind groups. Useful for changing small parts.
- CW: Need to be able to optimize validation and bind group tracking of resources. WGPU struggled with this. Can’t iterate over 10k+ resources. Need these resources to disappear from validation cost.
- Proposal: Add Pin/Unpin. Before resource is visible on the GPU, needs to be pinned to a certain usage that cannot be changed.
- Ex: create texture that is pinned to a certain usage and cannot be used for anything else until it’s unpinned.
- Resources know their bind groups and can validate at pin/unpin time, removing validation cost at draw.
- RC: If you do pinning wrong, what happens?
- CW: In the shader you get zeros. Maybe not great for debugging, but safe.
- CW: Shader side. Declare a binding "resource_binding". In the prototype it's fully heterogeneous, even though that's not supported yet, to prototype the shading-language feel and validation (see the sketch below).
- Bunch of builtins for querying how many slots and types of resource at each slot. Getting typed resources at each slot. Have to figure out uniformity constraints.
- Companion features:
- need to know if textures are filterable or not based on type.
- Would be nice to store textures in variables.
- Getting storage buffer from a resource_binding. Need more flexibility getting data out of storage buffers with offsets and layouts. Need a “data view” that is gettable from a storage buffer. Basically reinterpret casting storage buffers to structured types. Will need restrictions on scope.
- Immediates. Contain offsets to useful information in storage buffers.
- Shader side validation. When you access a resource that is out of bounds/wrong type, you are given a resource that is filled with all zeros. Done through adding a set of “default” resources that are returned when doing invalid accesses. Inject a storage buffer which contains per-slot information like resource type. When trying to access a resource in the bindless bind group and you get it wrong, change the index to point at the default resource so something is always returned.
- Need to replicate a bunch of resources that are accessible in different ways.
- Would be nice to add shader printf debugability here to tell user when they get it wrong.
- Currently 26 different resource types in the default resources.
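- To make the shader-side shape concrete, a hypothetical WGSL fragment; `resource_binding` comes from the slides, but the enable name and the query/getter builtins below are placeholders, since the minutes note naming is not final.

```
enable chromium_experimental_bindless;                    // placeholder enable name

@group(0) @binding(0) var materials : resource_binding;   // the dynamic binding array

@fragment
fn fs(@location(0) @interpolate(flat) slot : u32) -> @location(0) vec4f {
    // bindingArrayLength / getBinding2D are hypothetical stand-ins for the "query the
    // slots" and "get a typed resource at a slot" builtins described in the minutes.
    // A wrong-type or out-of-range access yields the all-zero default resource.
    if (slot < bindingArrayLength(materials)) {
        return textureLoad(getBinding2D(materials, slot), vec2i(0, 0), 0);
    }
    return vec4f(0.0);
}
```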
- Big open questions:
- How to update contents of dynamic binding arrays
- How to reduce overhead of state tracking. Pinning is just a proposal
- BJ: You mentioned Metal backend has to do explicit residency management. How does that interact with the proposal. Ensure every resource inserted into a bindless bindgroup is resident with every operation?
- CW: Yes. In Metal, to manage residency you have a ResidencySet. You tell the command encoder to make resource X, Y, Z resident, or make this ResidencySet resident. On submit, we have to walk the bindless bindgroups and ensure resources still exist, and at that time we tell Metal “make this ResidencySet resident”. On D3D, there’s only one way: make-resident or make-unresident. When we ‘pin’, that’s the time to tell D3D to make it resident.
- DS: On Metal we have to do residency management anyway, with Argumentbuffers.
- KR: Vulkan ecosystem looks most uneven in support. How are native applications navigating this?
- CF: Applications assume they have nonuniform indexing. Going back to Maxwell, that has the nonuniform indexing that we need. There’s a million limits and combinations, we just assume we have them all. Our proposal doesn’t have uniform-buffers. And so we get broad support when looking at the world that way.
- BJ: Idea about prevalence on Mobile?
- CF: Not really looked at Android. Think Apple silicon supports it. AMD has been bindless forever.
- CW: For uniformity, that’s going to be in the discussion. Looking at vulkan gpuinfo.org, the support for nonuniform indexing is not the whole universe.
- CF: I’ve queried to investigate that.
- MW: Heterogeneous bindless. For each slot, resources can be of any type?
- CW: At an offset in an argument buffer, could be a pointer to a buffer, texture, etc.
- MW: Natively on metal you can’t cast between these types. Maybe Metal can be redesigned to support this.
- CF: Had concerns about storage buffers. Can cast the same resource to different storage buffer types at different times. Effectively aliasing u32 vs. f32 in storage. Considered UB in c++/Metal, maybe really hard to do runtime validation for. May need to change type based aliasing rules.
- LK: Resource pinning. Is it queue local? In the sample it looks global.
- CW: Something we’re discussing with CF. Most applications create a resource and stuff it in the bindless bind group. At submission, you use the resource for rendering + sampling. One idea is to be able to pin small sections of bind groups, “local pin/unpinning”. A way to do this is to have a small compute shader which updates the metadata storage buffer. If you have a small number of separate pins, it can also be done with push constants. This is the idea of local pins, done at the encoder level.
- BJ: E.g. while A is unpinned, I can still use the bind group but cannot access the resource? Yes
- Tracking pinning/unpinning per subresource?
- Potentially much more expensive and implementation work.
- Resources are effectively invisible to bindless when unpinned.
- BJ: A typical use case is writing to a texture once and then reading many times. You would upload data and then pin? Yes
-
- JB: You have to pin something before you record the command buffer, if you’re going to use it…
- CW: Pinning is independent state from command recording. You have to pin it before submit.
- JB: And it has to stay pinned until…
- CW: you pin, submit, then unpin. In a bindless bindgroup there are two pieces of info: there’s the descriptor, which we can’t update on the queue timeline. But the metadata with type IDs, we can change that whenever we want. When we say bindgroup.update(this resource), that stays and never changes. When we pin, we update the metadata to record the type ID. When we unpin, we reset that type ID (a word in the storage buffer holding IDs).
- CF: And that pin/unpin of type id is on the queue timeline.
- RC: Is it better to put it in the command encoder time?
- CW: doing it that way, it’s queue stuff mutating global state. That’s ugly entanglement.
- What happens if you want to do a single submission with copies and bindless which requires pinning between?
- CW: Validation error. We would replace the check for the resource being valid/non-destroyed with a check that its usage is valid. Small overhead replacing the valid set with a map of resource→available usages.
- CF: Our trackers do that already, but just in a weird way.
- LK: Why not simply pin all resources at creation + have local unpinning?
- CW: You can do this manually as is. It may be more ergonomic for it to be the default though.
- BJ: Render to texture is an example of needing to unpin. Assuming that pinning state would affect non bindless bindgroups too. What happens if my resource is pinned but not a valid usage for that regular bind group.
- CW: Wouldn’t catch that at bind group creation but this ends up being validated at queue submission.
- RC: Back to encoder question. Why not make this a queue operation to mix usages.
- CF: May be possible. Current design allows validation to happen earlier. Considering allowing pinning/unpinning as a recorded command, i.e. in a command encoder. Might make the patterns easier to use.
- CW: A bit similar to Loko’s suggestion of a resource as pinned, then local unpin/pin. My design was an initial step. Let’s iterate.
- KN: Can’t have all the pinning in command buffers. Needs to exist in the device timeline. Local unpinning/pinning can be in command encoders though
- CF: Only used on device timeline at submit. All validation happens in submit. End up with device timeline set of transitions to validate.
- KN: You record the pin/unpin the command encoder, but they get applied at submit time.
- Usage scope validation needs to know the state of pinning and must be on the device timeline. Because we don’t need it at usage scope validation time, we can put it in the command buffers.
- KN: Right now we just have a set of resource and the usage they are used as.
- CW: If you can pin/unpin inside the command buffer, then for the queue submit validation that resources are used with the correct usage, you need to start doing the validation at every command buffer instead of for the whole submission. Validation becomes more scoped.
- CF: Just added wgpu utilities to do mapAsync on submit; could potentially have the same thing to pin a resource to a given usage.
- BJ: Concern about pinning happening between submits is that individual submits can have a lot of fixed overhead. Will create multiple submissions when they were not needed.
- CW: CF suggested adding a list of transitions to submit which may help this.
- CF: Submits are very expensive, want to reduce them.
- JB: It’s just the resource validation that’s expensive though?
- CF: Some of it, garbage tracking too.
- CW: Want to come up with a plan for updating bindings.
- CW: CloneWithUpdates vs an InsertBinding which requires some memory management that may be exposed.
- BJ: Is it problematic to have a RemoveBinding?
- CW: No, could be Update(null)
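For illustration, a sketch of the two shapes being discussed; the method names and signatures are hypothetical, and `bindlessGroup`, `slot`, `view`, and `otherView` are assumed to exist:

```ts
// Option A: the app manages slot indices itself.
bindlessGroup.update(slot, { textureView: view }); // hypothetical; errors if the GPU still uses the slot
bindlessGroup.update(slot, null);                  // "RemoveBinding" expressed as Update(null)

// Option B: the implementation picks a free slot and returns it.
const newSlot = bindlessGroup.insertBinding({ textureView: otherView }); // hypothetical
// newSlot can then be passed into shaders (e.g. via immediates) to index the array.
```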
- JB: You have to pin something before you record the command buffer, if you’re going to use it…
- Dependent on how implementations record commands.
- BJ: primitive_index landed in chrome. Shipping in 142. Pretty small feature, approval in WebGPU group went very smoothly. Intent to ship process got stalled by request for explainer.
- BJ: Have explainer on how it maps to backend features but the request was for a traditional web explainer about user experience.
- BJ: If this is going to be common, we should have a place to put and reference explainer documents. Current one ended up in the proposals.
- KN: Doesn’t feel like a good place, subgroups is also there.
- BJ: Where do we want to put these artifacts going forward? Public facing documents to support shipping things with the web process
- JB: What should the discoverability be?
- JB: We aren’t doing these for each PR though?
- BJ: Majority of discussion happens in this group but before finally exposing it to the web, we may need these. At this point the features are largely done and approved.
- BJ: As for discoverability: We want these to be visible publicly and searchable. Something better than looking at a folder in a git repo.
- KN: For things that currently exist in proposals, should move them to a sub folder. In the blink process, it says don’t just point at a github issue. I think we should just put whole proposal in the issue though.
- DS: Usage examples tend not to go in the spec though.
- BJ: Also contains some extra info about backend implementation. Stuff that is useful beyond shipment process.
- KN: Really want all info to be in the same place instead of split between issue, proposal, correspondence document. Ideally github issue.
- DN: Don’t see the need for detailed/comprehensive explainers. Would prefer to have a good summary in one known place instead of having to read through a whole issue, or a tree of issues.
- KN: Issues eventually get out of date once they are merged into the spec.
- BJ: It’s acceptable to have some drift, it represents the feature at the time of proposal.
- JB: Thought we moved away from doing proposals for anything except large spec changes.
- BJ: Thought this too. Tried to engage with this process as lightly as possible but it was requested.
- DS: There is a process that requires killswitches on features and this requires the intent to ship process. (circular)
- JB: Have webgpu.github.io which has many of these documents, could add an explainer section.
- JB: If we’re forced to do this work, lets try to make sure it’s actually useful information. Make them available and visible. In Rust, RFC documents served this purpose, can look back to see justification for features.
- CF: From a user standpoint, the correspondence documents are not findable. Spec is even hard to find. Hosting these online with lists of links is important.
- JB: gpuweb.github.io is pointed at by many things but doesn’t actually contain anything useful.
- CW: It would be nice to have a landing page. Searching for WebGPU leads to webgpu fundamentals. Could have some useful links to specs, tutorials, examples.
- CW: Going back to proposals, if we want something like RFCs, proposals folder would be best.
- JB: It fits, this is a folder for proposals and discussions that have happened.
- Summary: Use proposals folder, try to make them up to date and useful. Update landing page.
- CF: Issues and PRs are hard to find after they land.
- KN: Are we going to write a proposal for each feature? Seems like too much.
- Every feature for which you may want a killswitch.
[Immediate Data] Support SetImmediateData() in RenderBundle #5118
- Only discussion about render bundles remaining.
- CF: are render bundles used in the field?
- JB: Avoids JS overheads.
- RC: Babylon team loves render bundles for that reason.
- CF: does dawn use secondary command buffers?
- CW: No, we just replay the commands.
- KN: We concluded we didn’t want to do inheritance.
- CW: For WebKit’s implementation it would be terrible. For Metal’s secondary-like command buffers the push constants are like set-bytes. It’s difficult to modify partially.
- KN: Question: do we want to have immediates inside a render bundle?
- CW: Definitely yes.
- KN: so no inherit in a render bundle, and reset state afterward. (?)
- CW: Immediates at the start of a pass are filled with zeros; when you call ‘execute bundle’ the data is reset to zero. I think this is implementable in WebKit, with no inheritance, right, Mike W?
- MW: Yes. Is there a benefit to immediates in a render bundle vs. binding a buffer with data.
- CW: Immediates are good at passing IDs of stuff cheaply. So even in a bundle you want to do that. E.g. to record where part of a scene is. You want to easily pass in object IDs.
- MW: So the proposal is the bytes would change at bundle creation?
- CW: They are fixed when you create the bundle. At the start of the bundle, they are zeros. You update them and make use of them in various draw calls, and at the end of the bundle they are reset to zero. The next bundle that executes will see them (as zeros) again, and at the end of executeBundles the immediates are reset to zero for the enclosing pass.
- KN: It’s useful in render bundles, so you can use immediates on pipelines used in a render bundle. Would be nice to have inheritance, but let’s defer that/ treat it separately.
- MW: I don’t object to it, but seems to be the same as data in a buffer.
- KN: Lets you use a single pipeline both ways: outside a bundle, and inside a bundle.
- CW: And immediates are supposed to be very cheap.
- CF: Also helps with granularity. If you want to send a single bool value, it’s wasteful to allocate a buffer aligned to 256 bytes.
- CW: bind groups also reset?
- KN: yes. IMO immediates should be same as bindgroups.
- CW: at start of an encoder, bindgroups are empty/null; and immediates are zero. Validation error if you use an unset bind group. But it’s ok to use zero-initialized immediates.
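A sketch of the reset semantics just described; setImmediateData() is the name from issue #5118 but its signature here is illustrative, and `pipeline`, `pass`, and `objectId` are assumed to exist:

```ts
const bundleEncoder = device.createRenderBundleEncoder({ colorFormats: ["bgra8unorm"] });
bundleEncoder.setPipeline(pipeline);
// Immediates start as all zeros inside the bundle (no inheritance from the pass).
bundleEncoder.setImmediateData(0, new Uint32Array([objectId])); // hypothetical signature
bundleEncoder.draw(3);
const bundle = bundleEncoder.finish();

pass.executeBundles([bundle]);
// After executeBundles, the enclosing pass's immediates are reset to zero again.
```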
- KN: as written in current proposal https://github.com/gpuweb/gpuweb/blob/main/proposals/push-constants.md
- JB: It seems to be a mistake to make an immediate, and then not set it.
- CW: It’s more validation. If you could validate it, you can set them to zero.
- BJ: You can set individual words. So now it’s problematic for validation, because partially initialized is what… bad or good?
- JB: How indicative is it of an error. Partial result is..?
- JB: Defer to a later meeting?
- CF: Our current implementation copied vulkan and it’s a disaster: validation is hard.
- CW: Please review the design and file issues. My goal is to be able to write spec and landing stuff. Maybe we need another round of discussion.
- CF: Where do I look? The proposals doc is up to date?
- CW: Hey Rafael please pass that on to Shaobo?
- RC: He’s been working on it. I’ll ask to make sure it’s up to date.
- JimB filed: [Immediate data] Should failing to set an immediate value used by the shader be a validation error? #5318
-
1. What should the descriptor format be? https://github.com/gpuweb/gpuweb/issues/5296
- KN: Issue opened where I'd like the syntax to be similar to WGSL: "rgba" instead of a dictionary with 8 letters in it. Do people prefer that?
- GM: I do
- CF: LGTM
- BJ: Yes
- CW: How would webgpu.h do that?
- KN: Char array?
- CF: Enum with all permutations or a struct.
- KN: Not worried about that.
- CW: More translation to c
- KN: Can make it look similar, char array would be ok
- CF: Concern is you'd make it a string each time, but we do that for other things
- CF: Wgpu on WASM needs to generate a string each call instead of having a literal string
- KN: A string is cheaper than a dict with 4 strings.
- CF: True
- BJ: If it was an enum it's still a string in the DOM. Plus you get IDL pre-validation for invalid combinations. Just have a long enum.
- KN: Would just have a string and validate in the steps.
- CW: Yes, looks more "fun" with the string version, but ultimately this isn't a feature most will use.
- CF: GLTF does weird things with swizzles. Wants normals in G. This came up at work recently
- KN: Think it's easier to understand due to WGSL matching. Could write dict in weird order and I don't like that (not a big deal)
- GM: When writing tests I started with the original and it was impossible to read when debugging; I had to put the 4-letter one beside it in order to read it.
- CW: Is WGSL adding the 0/1 swizzle?
- KN: If we do this, and we make it look like WGSL, are we adding the 01 to WGSL.
- CF: Would expect WGSL to have this. Tried .r001 and it complained
- KN: Looking back at decision, what if you do .0?
- CW: Why is that an issue?
- DN: Looks like a float.
- KN: I think it would be fine if, when we do this in WGSL, we didn't allow .0. Could technically do .0000. Doesn't have to be exactly the same grammar.
- DN: Could do _0 and then it's fine
- KN: It's matching enough.
- CW: What if there are fewer characters?
- KN: In the texture swizzle, you have to have exactly 4. Always sampling as vec4.
- CW: Yes. Are there strong objections?
- CF: Strong anti-objection
- CW: Then consensus to do it. Will need spec text
- CW: It's a proposal
- KN: Yea i think so.
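For illustration, a sketch of the agreed string form; the `swizzle` field name follows the current proposal draft and is not final:

```ts
const view = texture.createView({
  swizzle: "bgra", // WGSL-like, always exactly 4 components; "0"/"1" components also exist per the proposal
});
// versus the earlier dictionary shape, roughly: { r: "b", g: "g", b: "r", a: "a" }
```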
-
2. Where should the validation happen? https://github.com/gpuweb/gpuweb/issues/5298
- GM: When you create a view, if you have usage of a render attachment or storage binding you get a validation error. When writing tests, that was annoying. You always have to do it because you have them set to make mipmaps and other reasons; you just always have to do the extra step. If we move to "you can't use a swizzled view as a render attachment when you attach it to the render pass, and you can't use it when attached to a bind group as a storage texture", then that issue mostly goes away and works most of the time.
- CW: There are ways to optimize in the impls by subsetting the set of usages by what the swizzle is. Would be nice to not have to look at the swizzle and see if the swizzle for r is r or the equivalent, and g and b… It's slow.
- GM: When you create the view, you can remember internally whether it's the identity, so you just check one bit. Already a bunch of other things are checked in a bunch of places, so it's not much more than checking a bit.
- KN: and it's 4 bytes
- CW: Always worried about bind group overhead
- GM: Already checking not double aliased and rgb32 not filtered; there are already like 6 checks, so adding 1 more bit isn't the difference between perf and not perf.
- CF: This is create bind group?
- KN: Yea, but we can pre-compute it
- CF: It's not as hot anyway
- CW: For storage textures, maybe.
- GM: 3rd suggestion, auto-remove
- KN: It was in the spec to auto-remove the usages you can't use if you set a swizzle. We just take those usages out of the view. Then an impl would have to give the error as to why the usage isn't there instead of just saying it isn't there, which isn't a huge deal.
- BJ: Had an issue of usages and view not matching, can't remember the context.
- TT: Was the bgra storage
- BJ: What was solution
- TT: Required to set on the view to be storage binding, so must be explicit
- BJ: Was my recollection as well; think this is the same issue and we should follow the same pattern.
- CW: So explicit
- BJ: Yes
- GL: Worried about setting it on a render target and then rendering into a swizzle? Can we name it sampling swizzle?
- CW: Also storage
- KN: No it isn't. Doesn't apply to writing as it doesn't make sense
- GL: Only sampling, so the naming could imply that specificity; then it's fine to set on a view that's also the render target, since it writes unswizzled.
- KN: Yeah, kinda related to something we tried to figure out when implementing. In Metal you cannot create a view with both renderable and swizzle. Was it fine to have render and sample-with-swizzle on Metal?
- GM: Tried making texture and render to it and if it had swizzle it was an error
- KN: Need to make sure to not trigger that error through proposal validation.
- CW: Ultimately all proposals are workable. Consistency is nice. Can be strict now and lift the restriction later. Just being consistent with bgra8unorm storage would be good.
- GM: To not annoy everyone using this feature would be good. They will get the error over and over again.
- KN: Changing it later arguably needs feature detection, but just do it the conservative way and everyone does that for a long time. Better to have a solution now, but we need to answer the Metal question.
- CW: Impl can detect
- KN: And do 2 texture views. What was the context: bgra8unorm storage, or was it rgba-srgb?
- CW: I think so.
- KN: That makes sense.
-
3. Are there new validation rules needed if the view is multisampled? https://github.com/gpuweb/cts/pull/4427#issuecomment-3248725731 and https://github.com/gpuweb/gpuweb/blob/main/proposals/texture-component-swizzle.md#open-questions
- CW: Answer is no it works on all the things we've tested.
-
4. [stretch] Should we disallow creating views that only include COPY_* usages? https://github.com/gpuweb/gpuweb/issues/5317 (skip for today, not unique to texture swizzle and needs more thought first)
Support for YUV textures #5281
7pm at Hothouse, 35 Church St. Reservation under “David Neto”
https://maps.app.goo.gl/ZBmaCCZpdMn6Aejr8
Put the items in the Day2 afternoon agenda above!
- Trace and replay of WebGPU, with native Dawn and Wgpu players.
- That also supports ANDROID.
- Browser extension for recording the file. (Chrome and Firefox)
- Released 1.0
- Caveats
- Injects a content script in the main thread, not webworkers or iframes
- Presentation
- Captures support requestAnimationFrame. But not requestVideoFrameCallback.
- Object lifetimes
- Only knows to delete when .destroy() is called. Not sure what to do with the garbage collector.
- CW: JS has weak maps, and has a finalizer callback (FinalizationRegistry). So you can observe the action of the garbage collector.
- JB: Different browsers behave differently; impossible to make completely portable.
- Labels: captured on initialization, but not on update.
github.com/Chainsawkitten/WebGPUReconstruct
SW: Does it capture content generated by other web APIs, e.g. Canvas2D?
BJ: Limitations; I have no advice for workers. You mentioned dependency on requestAnimationFrame; what's the dependency?
AB: it’s about knowing when to present in native.
(discussion between BJ and KN about how to do this.)
KN: Register your own RAF and present… when the app calls getCurrentTexture, at that point call RAF in the replay.
GL: What happens if the app used an optional feature and the replay doesn't support it?
AB: Nothing special. I expect you’d get a validation error on replay. For any feature that isn’t supported, I’ll scrub those from the adapter, to force the issue.
GL: Suggest having a mode that scrubs all optional features from the recording adapter. Also min the limits.
-
Oguz: Thank you very much for this opportunity to publicly present W2GPU for the first time as a preview.
Session URL at LMPL 2025 workshop at ACM SIGPLAN ICFP SPLASH 2025 Joint Conference (Singapore): https://conf.researchr.org/details/icfp-splash-2025/lmpl-2025-papers/22/W2GPU-Toward-WebAssembly-to-WebGPU-Program-Translation-via-Small-Language-Models
Presentation URL: https://github.com/w2gpu/w2gpu-lmpl2025/blob/main/docs/presentation-202509.pdf
-
Edited for this audience; full slides later in 2025 October.
-
“Glass ceiling” in compiler optimization: limitation to local information; missing opportunity of long-range changes.
- Limited by human bounds on abstraction/naming. Only so many names fit in human cognition, and we write compiler optimization passes that way.
-
Insight: machines don’t have the same memory limits. Exposes opportunity for long-range optimization.
-
Generate sets of program corpora: each set is semantically equivalent and ABI-equivalent programs, with different representation. Then train a neural net to choose and recognize valid transforms that optimize.
-
Map WebAssemblyText to WGSL.
-
W2GPU: use Small Language Model to learn robust mappings from WAT idioms to WGSL kernels.
-
Partial, preliminary success.
- Used shaders21k data set (fragment shaders): https://github.com/mbaradad/shaders21k
- 14K parallel pairs (WAT, WGSL)
- 7-stage validated compilation and execution
- Sub-2B small language model with QLoRA fine-tuning.
- Converted frag shaders to compute shaders.
- End-to-end compile success
- Image parity via PSNR on shader outputs
- Paper will be available open-access later with full details.
CW: you motivated as finding optimization opportunities. Reminds me of AMD’s FSR. They started with ML to find the initial direction, and then noticed which filters were effective, then hand tune starting from there. This is mostly the ML pipeline. How do you expect extraction of features; and how do you formally verify semantic equivalence.
MOD: Use Coq or Lean to verify. Might be hard, come with their own caveats. In the FSR case researcher experts looked at it. I think in this case it would be auto-proved.
LM: What if you want to focus learning on a subproblem; fusion rather than whole program. E.g. generate a dataset focused on optimization fusion (kernel fusion? Loop fusion?)
MOD: Using existing ML frameworks’ capability to output for both CPU and GPU could help, and before feeding to language model one can use recent ML-based optimizers for GPU kernels to further refine the dataset.
Toucan (Stephen): a language for graphics (SIGGRAPH 2025)
LM: By making your own language, what do you win over writing everything in Rust or JS.
SW: See backup slide: table of advantages over various other environments.
- Main thing is unified declaration of objects, unified syntax and datatypes.
- LM: Everybody has the problem of embedding in various places. WESL tries to help people write code more easier, and this solves many problems at once.
LM: If we can’t use a new language, what could we learn from this. E.g. if we did really well with reflection, what do we leverage.
SW: Limitations of reflection: you only reflect a single shader. No way to reflect common data across a set of shaders.
- LM: I guess you can create them on the host side and then inject them into the shaders.
- SW: Maybe a clever C++ library can do something; use new reflection facility in C++.
- LM: Share the types to make bindings and location numbers, etc.
- LM: Your “things eliminated by Toucan” slide; that should be a target list.
SB: How to take advantage of the SIMD.
- SW: Can use software graphics API like SwiftShader. With SSE you get up to 4x improvement over scalar code. I removed vector-length limits, and got to machine limits with AVX512. Would like CPU-side multithreading but haven’t done it yet.
CF: Have you thought about FFI support (foreign function interface). Like, you wouldn’t write a whole app in this language. But part of an app, and then embed it.
- SW: Yes, good idea. Consider exposing a Toucan package as a class object in some other language.
Should failing to set an immediate value used by the shader be a validation error? #5318
- Current proposal https://github.com/gpuweb/gpuweb/blob/main/proposals/push-constants.md
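For context, a minimal sketch of the shapes under discussion; the WGSL var<immediate> address space follows the proposal, while the setImmediateData() call and its signature are illustrative, and `pass` and `pipeline` are assumed to exist:

```ts
const module = device.createShaderModule({
  code: /* wgsl */ `
    struct Params { tint : vec4f }
    var<immediate> params : Params;
    @fragment fn fs() -> @location(0) vec4f { return params.tint; }
  `,
});
// ...
pass.setPipeline(pipeline);
// The question below: if this call is omitted, should the draw be a validation error,
// or should the shader just see zero-initialized immediate data?
pass.setImmediateData(0, new Float32Array([1, 0, 0, 1])); // hypothetical signature
pass.draw(3);
```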
- JB: The proposal says: in the shader, var<immediate> and some smallish type. The minimum maximum is small (e.g. 32 bytes). You declare a struct type in WGSL or an integer value. Within a render pass, we assume there’s a buffer somewhere; with a command in the render pass, you write data into a sub-block of that data. The draw sees the data provided in the render pass as the data in this immediate variable. The question I raise is: should it be a WebGPU validation error to invoke a shader that uses an immediate when, in that render pass, no value was provided? Is this more likely to be an error on the part of the content author, or are they simply relying on the zero-init of that buffer? What’s more useful to people? E.g. in JS, properties are undefined by default; it seems the language is papering over a bug. Question is: what is most valuable to people using this feature?
- CF: On the native APIs, if you don’t set a push constant, is that fine?
- CW: It’s a programming error that is caught by validation layers.
- JB: So there’s a VUID?
- CW: Can check.
- CF: Question is what do graphics programmers expect.
- JB: An additional dimension: do we have a single bit to flag the whole immediate buffer as initialized or not, or do we track initialization status on a word by word basis. I hadn’t thought about that aspect when I filed this issue. Seems both are feasible. Feels performance of such a check is not significant.
- CW: Vulkan says the values are undefined, but it’s valid to use.
- KN: Found another rule: it’s UB to read an uninitialized value.
- BJ: Re Jim’s specific question, I’d say it feels like a programming error. Can’t imagine cases where you both want push constants but not want to put data in there always. Ok with that. Bigger question is the validation. From the pipeline we can get the bytes of the immediate it demands.
- CW: We have 3 things.
- Amount of data declared in the pipeline
- Amount of immediates actually used in the pipeline.
- Slots that are statically used by the pipeline.
- CW: all of them are pretty cheap. Bitset checks.
- KN: Because padding exists, the only reasonable thing is the last one; otherwise we will require you to set the bytes in the padding. E.g. if the immediates are a float and a vec4, then we don’t want to force you to set padding, which is weird.
- CF: We force you to set it for buffers.
- BJ: You might set each scalar in a separate call.
- KN: Not really; buffers are initialized to 0, then it’s your business.
- JB: Seems mean to force them to set padding between items you don’t have names for even.
- BJ: I want to reuse the same immediate buffer. But you can change data types in between; e.g. matrix on first use, then float+vec4 on the second use. The type punning gets to a situation where now the byte-by-byte validation is the wrong thing.
- JB: We don’t try to enforce type consistency in buffer contents.
- BJ: Starts to feel like vertex buffers. Wildly different formats; we validate you set a buffer of an appropriate size, but don’t check the contents. At best check the range of bytes.
- CF: How do you tell what’s padding and what isn’t. Because push constants are structs, you’re always using all of it.
- JB: No. You statically analyze the module. Difference between p.x and p.y and reading all of P, when there is padding between the two.
- CW: we don’t need to be constrained by Vulkan because we can initialize the padding. Question is when in WGSL do you touch the padding bytes.
- DN: WGSL is spec’d to not touch the padding bytes.
- KN: We don’t do that analysis yet.
- JB: the size is tiny. Don’t worry about the bits.
- CF: when people port Vulkan to WebGPU, are we adding an extra constraint.
- JB: If it’s validation-clean, then we should not harass them. If it’s not validation-clean then I have an answer for them.
- CW: Currently Dawn tracks just the size of the struct. We validate against the pipeline layout, and also what data to upload. We handle dirty ranges; it’s at the range level, much easier than asking the compiler to do the analysis.
- JB: We’re going to throw a bitvec into each function, and then OR the result.
- JB: Seems should support reuse of the immediate buffer.
- GL: How is this different from any other resource accessed in a shader…
- BJ: Validating only the start and end seems like “validation theater”. Only makes sense to do every byte.
- AL: Force the application to set the whole range, to simplify things.
- CW: That’s not super-composable. And it asks them to do work that we could have done for them.
- CF: Also a very limited budget, so don’t skew people to use more than they actually need.
- CW: Converging to word-precise validation. (You can only put 32-bit words at a time). For me it’s user ergonomics.
- SY: I also prefer to validate word by word in the immediate block.
- KN: Think we should validate all the words in the struct declaration.
- CF: what does WGSL “statically accessed” mean; member-only?
- DN: Whole variable or constants are statically accessed; that’s the only level of the concept.
- DN: Can put @align on a member to make extra space.
- KN: Prefer to make it a warning.
- CW: Warnings are non-normative.
- CF: Warning means it won’t be implemented.
- KN: A case where it doesn’t help us on the backend. It’s only about telling the user something.
- CF: Vulkan cares, and tells people not to do that. Users then expect to do it. If we make it a warning, it may as well not exist.
- CW: both are fine and implementable, just debating which flavour. Not worrying about CPU overhead, and implementability.
- CW: Ok with validation that is slot-precise. Potentially more work, more useful.
- JB: Conservative is to validate it out. If it makes people unhappy or block porting, then we can remove it.
- KN: I have a concern; in ours you have one immediate buffer; in vulkan you can have multiple. A converter from SPIR-V to WGSL will stuff them in one struct. Then you get padding that now in WebGPU you have to set those slots that you never use.
- (KN: EDIT: I was wrong about this, you cannot have multiple immediate buffers in SPIR-V. Only one per entry point.)
- JB: If our goal is to faithfully represent SPIR-V then a single immediate data is not faithful.
- CW: getting lost in details.
- Every 32-bit word covered by members in the shader struct for immediate data must have been set in the render pass.
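An example of the rule just stated, assuming the proposal’s layout rules; the struct and byte offsets are illustrative:

```ts
const code = /* wgsl */ `
  struct Imm {
    scale : f32,   // bytes 0..3
    // bytes 4..15 are padding (vec4f is 16-byte aligned)
    color : vec4f, // bytes 16..31
  }
  var<immediate> imm : Imm;
`;
// Before a draw that statically uses imm, the 32-bit words covering bytes 0..3 and
// 16..31 must have been set; the padding words at bytes 4..15 do not need to be.
```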
SY: Question about importExternalTexture for the capture-replay discussion.
CW: When you see importExternalTexture, you copy the contents of the external texture into a texture. Because we support binding a regular texture where an external texture is expected, you can do that in the replay.
SY: That answers it. 🙂
KN: Immediates: min-max limit. Proposal says 64; our implementation is 16 bytes right now. Need to resolve. Filed https://github.com/gpuweb/gpuweb/issues/5320
CW: Think we land spec text in a branch; update proposal docs. Then discuss final details at the PR when landing in main.
Stephen White
Of 23 issues (restrictions) in Compat proposal:
- 20 are landed on featurebranch-compat
- 1 is in progress: PR is written, in review by Kai and Gregg
- 2 are in review by everyone (in a single PR)
- New per stage limits #5295 (maxStorage[Buffers|Textures]in[Fragment|Vertex]Stage)
- Apple has concerns about adding new limits to Core; let’s discuss
SW: Want temperature of whether we’re close.
TT: Yes. From our side. About limits, we talked about this topic before and thought it was ok to add it to core.
MW: One of the objections we had with the limits: since they’re not used by the core spec, compat is not a strict subset API. Consider some kind of way to ensure user agents don’t have to keep these forever. New features may add more limits, and so we get into a confusing state for clients and frameworks where some limits are needed here but not there, etc.
SW: So confusion is limits not used in core. Cognitive load of limits not applied to core. And also inability to unship them in the future.
MW: Consider a featurelevel-aligned limits instead. Limits independent of feature levels, and (?) values that must live forever.
KN: On the bug I mentioned there is no way for us to remove these later that isn’t going to break apps. We want people to write apps against compat; and write compat apps against Chrome then run in Safari without testing in Safari.
KN: There’s basically no overhead to keep keys to ignore in the limit dict. What do we do on the JS API; make them undefined? Make them throw an exception? Don’t think there is a way to have compat without breaking them when removing. Think it’s cheap to have them forever.
SW: We have constraints to have compat apps be core apps, and to be able to unship compat in the future. We’ve leaned toward the first constraint, but now it shifts to the second. Seems to be incompatible constraints. Compat apps need and will use the limits. But an implementation that doesn’t have compat will not be able to run that compat app.
SW: There’s another pattern, which is “if this property exists then return it, otherwise return this value”. Fragile.
KN: We could make Chrome behave that way easily. But devs would still have to actually test that (comment out featureLevel: "compatibility").
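For reference, a sketch of how a compat app requests these per-stage limits; the limit names come from the compat proposal and may still be renamed (see later in this discussion):

```ts
const adapter = await navigator.gpu.requestAdapter({ featureLevel: "compatibility" });
if (!adapter) throw new Error("WebGPU not available");
const device = await adapter.requestDevice({
  requiredLimits: {
    // Request whatever the adapter supports for the per-fragment-stage storage limits.
    maxStorageBuffersInFragmentStage: adapter.limits.maxStorageBuffersInFragmentStage,
    maxStorageTexturesInFragmentStage: adapter.limits.maxStorageTexturesInFragmentStage,
  },
});
```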
JB: Mozilla has not talked about it. My personal opinion is adding it to core is not costly.
CF: wgpu has compat. We’ve rigged it so that in a core context, we’ve made the limit defaulted so it didn’t matter. Seems harmful to break a compat app on a browser that doesn’t support compat. Or to break forward compatibility. We can make the limits go high as the sky.
JB: The web doesn’t break old content because that’s considered a browser problem, and people switch browsers.
CF: If an app expects a validation error when it goes over a limit, is that a problem if the validation error is not triggered?
JB: We are always free to expand functionality and increase limits. That app doesn’t need to be supported in that sense.
MW: Have we considered …. automatic upgrading.
KN: Proposal already has automatic upgrading. But still a problem to not have it in limits.
CF: We have limits to ensure portability without having to test on all devices.
MW: Our concern is GPU supported limits becoming unwieldy. 40 limits is impractical.
JB: My understanding of compat is, once it’s merged that we won’t add features / wrinkles to it. So I don’t see the situation getting worse in the future. The work of designing compat was anticipating everything we wanted to rescind. There’s not an ongoing project of chopping it up any more. It’s not the nose of the camel in the tent; it’s the tail.
RC: Can we simplify the story; what are valid values for these in the field? Causes confusion and bewilderment.
SW: Bucket them?
RC: Have the limits, but document them for real devices.
KN: We do enforce the limits that are requested.
AB: We’ll add more limits for new additional features over time. The struct becomes larger over time. I don’t think adding these is the breaking point.
CF: This feels like a documentation problem. Vulkan already has tons of limits. We can do better than Vulkan. We can document “these are for core” “these are for compat” “these are for feature level 17”. Having that kind of hint helps.
SW: So document that these limits are scoped to a particular feature.
CF: These limits are only relevant to compat. If you don’t have them you get defaults that are XYZ.
MW: I don’t have a formal proposal. Rafael’s idea of grouping them may be helpful. Don’t want to get into the situation of OpenGL or Vulkan. Think documentation is not read by developers.
KN: Think that problem is limits enabled by default, and we solved that already. We need a proposal that maintains compatibility across browsers.
MW: The specific proposal is to remove the per-fragment stage limit higher bound. Have per-vertex limit and compute stage limits.
KN: So we’d add 2 to core instead of adding 4 to core.
CW: Why did we add fragment ones. Specific devices?
SW: We had specific devices, but may have excluded them for other reasons.
KN: Pretty sure we still needed them. Think it’s documented in an issue.
SW: Issue of confusing limits is helped by WebGPU’s required validation. It tells you what is violated. Don’t get into the situation of OpenGL.
BJ: Regarding grouping specific limits. Have concerns about that. Sometimes they don’t group nicely. With primitive_index, it didn’t cluster cleanly; it belongs to mesh shading and … and by….. On the other hand, documentation can spell out what limits apply to what features.
CW: Part of the concern is that we eventually want to delete things at some point. Can’t delete because of backward compatibility. How about signalling intent by renaming the fields: maxCompatBLAHBLAHBLAH. It’s a wart that stays forever but it’s clear.
CF: When I meant grouping, I meant in documentation, not in the API surface.
KN: I’m fine with renaming them. Would be clearer.
MW: Think with renaming, while verifying if we still need the fragment limits.
SW: bikeshed “compat” is not used; everywhere it’s “compatibility mode”
SW: Put compat string at the end?
DN: I prefer at the start. Being opinionated.
CW: At the start.
CW: We’re in Origin Trial. We can rename after stopping the OT, and landing the real feature.
CW: Procedurally, we’d like to get to a point where we get approval to land the last changes in the branch, then one more in the merge to main.
- Swizzle assignment 2025-07-30 (reprise)
- DN: presentation reprise (presentation mostly hasn't changed since it was presented before)
- DN: [Changed the way compound assignment works so it reads the old values before evaluating the rhs and writing back all components at the end]
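A sketch of the evaluation order just described, under the assumption that the proposal lowers swizzle compound assignment roughly this way (old component values read before the RHS is evaluated, all components written back at the end):

```ts
const wgsl = /* wgsl */ `
  var v = vec3f(1.0, 2.0, 3.0);
  v.xz += vec2f(10.0, 20.0);
  // roughly lowers to:
  //   let old = v;                 // read the old value first
  //   let rhs = vec2f(10.0, 20.0); // then evaluate the RHS
  //   v = vec3f(old.x + rhs.x, old.y, old.z + rhs.y); // single write of all components
`;
```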
- RC: how does this avoid the bug farm?
- DN: Implementation would implement this themselves and not rely on the backend compiler.
- AM: unless they really want to try to recombine this into the backend language for some reason.
- DN: SPIR-V doesn't have swizzle assignment.
- CF: As long as there's a good tight spec, should do it, everyone wants it.
- MW: Noted 2x performance regression. Just prototype or architectural limitations?
- GL: No evidence of architectural limitations, don't think there will be except maybe a few weird GL semantics like shadow buffers.
- CW: Might have more happening on the JS main thread.
- GL: Lot of low hanging fruit. In 6mo can have an actual list of issues. Functionality not far enough along yet.
- KN: Take temperature of room on mapSync again — figure out whether we as a group will want to try again to convince the TAG
- Wasm JSPI still causes problems the larger the app gets, because of reentrancy.
- Async JS also causes problems the larger the app gets; you have to make more and more code async, plus reentrancy.
- The fact that these problems become gradually worse in larger apps makes it hard to justify that a solution is necessary. There's no minimal repro that demonstrates the problem.
- People are using Brandon's hack.
- Other browsers getting feedback asking for this yet?
- JB: Not yet.
- [a little bit of discussion happened here but nobody took notes]
- BJ: Figma used my terrible very bad no good hack (copy to WebGL and use readPixels)
- DC: We almost used that for GPU picking but decided to implement CPU picking instead, for now. But would really like this.
- CF: Ruffle needed sync readback in the web. Think they can put their event loop on a worker. But it comes up repeatedly. Happens when you use the GPU as a coprocessor, independent of presentation.
- JB: The TAG has a legit interest in discouraging people from doing things that degrade the quality of the web. Apps that block the main thread are un-web-like. Affects browser responsiveness. But it’s hard for us to provide any kind of feature here without immediately making it very easy to do the discouraged thing.
- BJ: I’d argue making it available only in Worker is exactly that. There’s precedent. There’s a github issue in 2022. The filesystem API has a filesystem Sync Access Handle, which fetches in a blocking way. I think somewhere in that they made reference to our discussion. It seems to me that strikes the right balance.
- MW: Like Jim we haven’t received specific feedback needing this. Think the worker solution might work. Feeling is this is due to older sites being forward ported. Also prefer to avoid people being prevented from migrating to WebGPU. Complex.
- SW: If you move to worker, does whole app have to move to the worker?
- CW: you have to move all your WebGPU to the worker, effectively. Daniel provided info about their workaround; they do a copy of the whole frame, per frame, and then do CPU-based hit testing. That’s a lot of memory traffic; bad for perf and power. We’ve heard this kind of pain point from multiple viewpoints: machine learning, and other graphics. We’ve provided feedback: mapSync on a worker thread, and not on the main thread.
- KN: Pretty much mapsync on worker thread should be fine for those use cases.
- KN: The filesystem access case is like ours: you know it’s going to finish. The filesystem action, and also wait-until-queue; because you can’t wait until the thing is enqueued. But some things aren’t like that (atomic wait) because the atomic wait is not yet signaled. The sync on worker is strictly better than doing stuff on the main thread.
- CW: Don’t think we’re going to converge now. Next steps? The Chrome team has >5 projects that want this. Suggest the Chrome team gathers that feedback, and encourages public voicing; and initiates the question to the TAG for mapSync on workers; see if it’s something they would not veto at least.
- KN: Good to know there isn’t immediate feeling to kill the idea.
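A hypothetical sketch of the worker-only shape being discussed; mapSync() does not exist in the spec today, and this assumes it runs in a dedicated worker after the readback copy targeting `pickBuffer` (MAP_READ | COPY_DST) has been submitted:

```ts
pickBuffer.mapSync(GPUMapMode.READ); // hypothetical: blocks the worker until the copy completes
const pickedId = new Uint32Array(pickBuffer.getMappedRange())[0];
pickBuffer.unmap();
```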
copyExternalImageToTexture should allow copying to the new 16unorm formats #5289
- CW: We were waiting for feedback from Mozilla. For all renderable formats.
- JB: The primitives we’re using don’t support all those formats. Could implement it with a render pipeline. Then it should work?
- KN: Spec says these should all be renderable.
- JB: We’re not opposed in principle, it’s schedule delay.
- KN: These are part of texture-format-tier1. So if you don’t implement that then you don’t have to worry/wait.
- TT: We don’t implement tier1.
- JB: For the good of the spec, seems we should say “yes”
- KN: There are three parts. 1. Add 16unorm formats. 2. Add snorm formats. 3. All renderable formats.
- JB: Isn’t there some state where we can stop negotiating on a format-by-format basis? Can we simplify to just all the renderable formats?
- KN: Some question on whether snorm is not useful. You can’t write to the slightly negative values that might occur with a colour space conversion.
- MW: unorm formats definitely want to support. Snorm formats are interesting; normal maps are a case. You end up with data loss with some conversions. No opinion whether snorm should be supported, but it does have data loss. Maybe allow with no validation of the result; the data loss is unexpected.
- CW: when copying to float texture, the values are preserved; converting to unorm you get clamping; to snorm you get quantization, and also clipping above 1.
- JB: These problems are for any snorm usage, yes?
- CW: Snorm always have these caveats. So yes, the copy is lossy but graphics programmers should understand the caveats.
- MW: Are there use cases? No objection, but the wrong values for normal maps for the snorm case is somewhat pointless.
- KN: Can't think of one. Best idea is that someone fills an snorm texture with a background image, and then renders on top of it (and for some reason needs snorm and negative values).
- CW: For the sake of symmetry, I feel we should allow it. We don’t go through extra care for other float-related
- MW: Agree with that.
- GT: Feels like that closes off a future path. If I want to load into snorm properly, I have to write my own loader then, because the default behaviour is broken.
- KN: I wouldn’t make remapping the image to -1..1 the default behaviour (would be super weird to write 0.0 and then sample the texture and get -1.0 by default), so would want to have an explicit option for that anyway. Not closing ourselves off from that.
- CW: Think consensus is adding all [float/unorm/snorm] renderable formats to tier1.
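An example of the consensus above, assuming the "texture-formats-tier1" feature is enabled (it makes the 16-bit unorm/snorm formats renderable); `source` is assumed to be an ImageBitmap or similar:

```ts
const dst = device.createTexture({
  size: [source.width, source.height],
  format: "rgba16unorm",
  usage: GPUTextureUsage.COPY_DST |
         GPUTextureUsage.RENDER_ATTACHMENT |
         GPUTextureUsage.TEXTURE_BINDING,
});
device.queue.copyExternalImageToTexture(
  { source },
  { texture: dst },
  [source.width, source.height]
);
// For an snorm destination the copy is lossy: the image's [0, 1] values are quantized
// and nothing in the negative half of the range is produced.
```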
-
PM: I hear online that this limits perf vs. native Vulkan because it uses more memory. Likely use is in MSAA.
-
MW: Memoryless render targets means, e.g. if I have 800x600 target then it uses zero memory in tilers, like in mobile. Good for MSAA buffers. Use case depth buffers in tile memory. And can ultimately resolve to single-sample render target, and so you get the memory savings for the MSAA case in the intermediate.
-
CW: In Dawn we implemented this to ship Chromium’s (Skia Graphite) Dawn-based rendering on Mac ARM; it was essential for us. In Dawn it’s a new texture usage “transient attachment” that is combined with “render attachment”, and “transient” is not usable elsewhere, only as a render attachment. It can only have loadOp clear and storeOp discard. And that’s it. That maps to Vulkan and Metal; D3D12 doesn’t have this feature.
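A sketch of the Dawn-style shape just described; TRANSIENT_ATTACHMENT is a Dawn extension today, not a WebGPU usage flag, and `encoder`, `swapChainView`, `width`, and `height` are assumed to exist:

```ts
const msaaColor = device.createTexture({
  size: [width, height],
  sampleCount: 4,
  format: "rgba8unorm",
  usage: GPUTextureUsage.RENDER_ATTACHMENT | GPUTextureUsage.TRANSIENT_ATTACHMENT, // hypothetical flag
});
const pass = encoder.beginRenderPass({
  colorAttachments: [{
    view: msaaColor.createView(),
    resolveTarget: swapChainView,
    loadOp: "clear",    // must be clear: there is no backing memory to load from
    storeOp: "discard", // must be discard: only the resolved single-sample result is written out
    clearValue: [0, 0, 0, 0],
  }],
});
```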
-
JB: How do you read from this in the render pass.
-
CW: in the render pass it’s used as a depth texture, or an MSAA buffer, and it’s read to resolve. In practice in a tiler it’s resolved without writing out the intermediate.
-
JB: Only fixed function stuff that reads it.
-
AB: For now yes. There is also framebuffer fetch which can read it.
- CW: Have a possible future proposal for framebuffer fetch (pixel local storage).
-
CF: It saves memory bandwidth, not just memory. Can be significant.
-
AB: Depends; if you have load-clear, store-discard then you can already get those bandwidth/power savings.
-
CW: The Dawn way is straightforward, don’t want to expose Vulkan’s full thing.
-
CW: This is a tiler-only optimization. On desktop GPUs you still pay the full cost of memory. Even when pseudo-tiling, core may spill to main memory. (Spilling happens even on Qualcomm.)
-
JB: And the user can’t predict when that cost is incurred? If you don’t know then (don’t?) expose as optional feature?
-
SW: What about compat. There’s an extension (EXT_multisampled_render_to_texture) … Would be a big win for the MSAA case if we can exploit it.
-
GL: Render buffers are always supposed to be discarded with this extension, so it’s covered.
-
CW: We want people to always use it when they can. If we made it optional then they won’t see it, and then miss out on a massive win. So it’s nice to make it a required feature.
-
JB: worry it obscures cost model. If it sometimes costs you 4k x 4k that is too surprising.
-
KN: it never costs more than the alternative. Right now you have to allocate the full intermediate buffer.
-
JB: I don’t want to expose it if it’s not actually implemented in the HW.
-
CF: Feel it should be available always. Desktop folks won’t use it, and mobile folks are worse off. If you require the feature, add adapterinfo hint saying whether the feature is actually an optimization.
-
CW: My position: expose it everywhere ….
-
KN: I’d go as far as watching the usages, and warn if you see a missed opportunity.
- MW: +1
-
CF: How do you know it’s not going to be used differently in the future?
-
KN: time-bound it, and therefore it’s only a warning.
-
CW: Hearing yes, we should expose it always, and potentially lint app behaviour.
-
CF: Needs to always be there, or people would never use it. Otherwise you pessimize mobile.
-
CW: I’ll sign up Chrome to make a proposal.
- Beliefs:
- string templating is a “cry for help”
- Conditions overuse, also a cry for help
- Reflection / injection of shader code
- JB: reduceBuffer is really a whole compute pass [multiple dispatches], not something you call from other shader code
- LM: reduceBuffer on this slide is called from JS. The compute shaders used need to be specialized for the reduction you want to do, so we have to generate specialized WGSL.
- LM: “injection” or “late-linking” with the specialization parameters. Here, sumF32.
- DS: previously reflection demand was to write shader-structs on the host side, to deal with layouts. Make it easy to push data to your shaders.
- LM: Is that in high demand?
- CW: we get those calls. When we tell them getting shader reflection info would have to be asynchronous, because it has to come back from the device timeline, then they don’t like it as much. Most folks want it to be synchronous to be useful.
- LM: the late linking idea is synchronous, but a different use model.
- JB: They want something like a WASM version of the Naga frontend. It costs.
- LM: they want it tiny. :-)
- JB: at one point there was a JS parser as a JS API. Returned a parse tree to do that. That was an internal thing in Firefox. Hm. never made it to the web.
- Overloads
- Examples from Bevy, and lygia
- Bevy: the if-else is a smell that you want a language feature. Also current solution is not typechecked on both sides of the @if.
- Lygia case: exposes the deduped name to user level.
- Issues:
- Conflict name between user-defined functions and builtins, e.g. max.
- Generics: how do they fit into the conversion rank?
- JB: Generics should behave as a bunch of overloads. If you had bounds on T, e.g. ‘numeric type’ or some other condition on types, e.g. numeric based on floats. Etc.
- LM: Consider the specialization case; a generic and a tuned version for one case. E.g. some languages have both, and allow you to customize.
- JB: The problem with people doing specializations is that the specializations end up with slightly different semantics, and they should have used a different name in the first place. If you have a type variable with constrained bounds, then you’ve specified semantics for all those cases. Avoid allowing people to write confusing code.
- KN: Don’t go through the effort of a type bound system; treat it as a template. Substitute the actually used type, and then compile it. Since we have a solid foundation in WGSL, it’s ok for a template system on top to be loose.
- Stefan: Language server with separate compilation means I can’t type check if I don’t have the full uses, nor the type constraint.
- KN: The thrust is keep it simple.
- DN: User-specified overloads can get you into trouble where there is no single best candidate. When we defined the builtins, we carefully avoided running into trouble here. Once users can do this, we would have to add more rules.
- JB: Thought the rules we have now cover everything (produce an error in that case), just that it's never invokable.
- DN: True.
- CW: If we have overloads let's be careful not to have implicit conversions.
- DN: operators like + allow vector + scalar; that’s treated in the spec as a special behaviour of the + operator; not as an automatic conversion of the scalar to the vector.
-
CW: Picking up from yesterday. Confident about a few things. On the API side there are two open questions:
- Pinning design. Is it the right direction
- How to update bind groups.
-
CW: I’d like to hear if all this is a palatable direction and can go on to prototyping in Dawn and Wgpu. Or do we need more discussion on direction.
-
LK: For the updating bindings. For index allocation. I think making up a binding number for them is too complicated.
-
CF: It’s no more complicated, just a slightly more convenient version of the existing things. On the content side, we have to validate if you are updating in the wrong place (update in a slot that’s not used). Allowing insert without that explicit index just has us pick it for you.
-
CW: If we force the harder way, we’re forcing the user side to do the bookkeeping.
-
LK: That means we have to reflect all this stuff back to the (...?) main thread(?). We don’t do this now.
-
CW: New API so it’s not implemented. As an implementation, internally you can set a callback to indicate which slots have been freed. We get that before onSubmittedWorkDone. It’s way more traffic back and forth.
-
LK: If you need that before the update function?
-
CF: Yes, because you have to reject it.
-
CW: But you can make it async; validate later.
-
KN: You’d have to hook .destroy() on the client side. Or the server can tell you, but makes it async.
-
CW: Two ways to implement this: 1. Server pushes free slots back to client (content process). 2. A lot of client-side tracking, which is hard. When you do texture.destroy(), you …. Lots of bookkeeping. If we do the onSubmittedWorkDone you get browser differences with how deep your pipeline is. We could spec when objects get destroyed, and could spec that you always get the first free slot; that improves the consistency.
-
CF: Users already don’t know when the work is finished. They rely on onSubmittedWorkDone. As long as the feedback occurs at that time, everything is fine.
-
CW: Two approaches: Track in the JS side; vs. track in the GPU side. But you can have a race between them. That’s why the client-side validation of the update is important.
-
KN: Are the list of free slots on the client side out of date?
-
CW: It may see delayed updates, and that’s ok.
-
KN: It has to happen in a JS call. (..?)
-
BJ: This is only a problem if you’re bumping up against the limits of the allocation you made of the bindless bind group.
-
KN: That wasn’t my concern. One browser on two different runs may give you a different slot, on insertBinding. It would be nicer to be consistent.
-
CF: You can get timing issues with bubbles on the queue timeline. You’re working on frame 5, but frame 3 had or hadn’t finished.
-
KN: An update is a message …
-
CW: An update is 1. Client-side validation that the slot is free; if the slot is free then a message is sent to do the update on the server side. On the server side you assert there’s space (because you assume the client is correct). If we specify you always get the first available, then variability is low. We inevitably have a timing-based variability.
-
KN: Fundamentally the issue is where update may or may not work … I don’t understand the synchronization model. This is unlike other parts of WebGPU. You have to wait for something before you can update.
-
CW: Yes. In other cases you enqueue copies on the device timeline. D3D and Vulkan without extensions disallow pipelining of these things. We have to make updates on the CPU side, and can’t update only on the GPU side. So either you do it on the CPU side or you get shadowing, which is a mess.
-
CW: If there’s no slot, it’s an out-of-memory error. In practice a difference of one or two frames in recycling should not be a problem. You should ensure some slack to absorb that.
-
BJ: Now I’m more concerned. You say the validation must wait until the slot becomes free. How does it become free?
-
CW: It’s valid to update-to-null, which releases it.
-
CW: A slot can become pending-for-free, meaning it will be free when the GPU finishes the in-flight work. Happens when you .delete it, or you say “i no longer need this”. What always works is bindgroup.setThisSlot; waitforsubmitted; … (?)
-
CW: In wgpu and Metal, we already passed the pointer to the allocation in a command buffer. So we ended up with the design in the Connor/Jasper elaboration: Do a copy of the bindgroup info.
-
CF: Right, but these can get quite large. Memcopying MB of data. Should be a simple update. Scary as these numbers get larger.
-
CF: Morally equivalent to mapAsync with buffers, except you’re mapAsync a slot in a bindGroup. Prepare to be updated; once the submission that last used it is done, then I can update it to the new value.
-
CW: It’s a per-slot ownership transfer.
-
GT: Feels like no JS programmer is going to get this correct.
-
CW: which is why insertBindings.
-
GT: Picturing what this looks like in a JS RAF loop. What does it look like?
-
CW: In a game engine, you’d allocate a 50K-entry bind group (huge).
-
CF: If they run out of slots, they hard-block: flush the GPU, then start again (like GC). Consider it an exceptional case that causes a blip.
-
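A small sketch of that exceptional path, assuming free-slot bookkeeping like the earlier sketches: if no slot is available, drain the queue once and retry, accepting the GC-like pause. The function name and list shape are illustrative only.

```ts
// If allocation fails, flush the queue once so slots that were pending-free can
// be returned to the free list (by a release path like the one above), then retry.
async function allocateOrFlush(device: GPUDevice, freeSlots: number[]): Promise<number> {
  let slot = freeSlots.pop();
  if (slot === undefined) {
    await device.queue.onSubmittedWorkDone(); // the GC-like "blip"
    slot = freeSlots.pop();                   // release path should have refilled the list
    if (slot === undefined) throw new Error("bindless arena is genuinely full");
  }
  return slot;
}
```
-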
GT: Inconsistent behaviour between fast and slow machines; and my customer might have a slow one.
-
CW: That’s reality.
-
CF: You can make it excessively big.
-
RC: Are we ever going to do multi-queue? I think about this global pin and unpin. You could get out-of-order effects.
-
CW: If we do multi-queue, the way we’d evolve current WebGPU is to say objects get a home queue, and start on the “main” queue. If a bindgroup is used on 2 queues, then you have to wait for both queues to flush. But we’d have to audit the whole API.
-
CF: We should make a pass over bindless to make sure this doesn’t paint us into a bad corner.
-
RC: Concern about this global state / global functions.
-
CW: Pinning is the larger concern; multi-queue will be complex.
-
KN: Why is ownership per-slot instead of the whole thing.
-
CW: Resources are added to the bindgroup in a granular single-slot way.
-
KN: How do we know which resources are used in the queue.
-
CF: There’s a way to make it efficient.
-
CW: You track, per slot, the last time it became free.
-
KN: We can’t use the mapAsync model because it has to be on a per-slot basis.
-
CW: yes.
-
CF: Most things don’t care about the actual index they end up in. There are some cases where they have special textures in specific slots. But most are ok to be driven by the implementation.
-
CW: Sometimes engines want contiguous batches. E.g. reserve 1000 slots for their own use.
-
RC: when would you have non-hardcoded indices.
-
…
-
KN: Why assign the slot on the client side.
-
CW: To avoid a race.
-
KN: No, to avoid blocking.
-
CW: That’s an option; but can get a race with the validation.
-
CW: We need content-time tracking to avoid the race.
-
BJ: Should we put promises on these?
-
CW: It’s the same promise as if remove;onSubmittedWorkDone.
-
BJ: Vastly more ergonomic to use promises. Folks are going to write polyfills that wrap the remove;onSubmittedWorkDone.
-
CW: The polyfill is insertBindings. Alleviates the work of bookkeeping themselves.
-
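A sketch of the promise-based ergonomics being asked for: one hypothetical helper that wraps “remove the binding, then onSubmittedWorkDone” so callers can simply await slot reuse. The group’s `update` method is assumed, not specced.

```ts
// One awaitable call that removes a binding and resolves when the slot can be
// reused, wrapping the "update to null, then onSubmittedWorkDone" pattern.
function removeBinding(
  device: GPUDevice,
  group: { update(slot: number, resource: null): void }, // hypothetical bindless group
  slot: number,
): Promise<number> {
  group.update(slot, null);
  return device.queue.onSubmittedWorkDone().then(() => slot);
}

// Usage: const reusableSlot = await removeBinding(device, group, 17);
```
-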
CF: Question about what the Metal looks like in the homogeneous bindless case, i.e. a group with buffers only, a group with textures only, etc. I’ve done this for D3D and Vulkan. The question is how it works for Metal, and whether it works at all.
-
Showing WGSL and pseudo-MSL.
-
![][image5]
-
Have a heap of textures. Get a 2D texture from one slot, and another 2D texture from a different slot. That’s the WGSL.
-
Then look at the Metal: it uses casts and dereferences.
-
The question is: is there a type we can put in the ArgBuffer member so that we can cast to the disparate resulting types? (i.e. replace the void)
-
MW: Understand the question. The Metal front end does not support this cast. The unsized C array should work. You need to be specific about the type, all the way to texture2D; but you can’t have both texture2D and texture3D in the same array because of that.
-
MW: Can do float and int conversions, but not these ones.
-
CF: OK, how about binding the same underlying heap to two different bindings, typed differently in the shader?
-
MW: That might work; you should try that.
-
![][image6]
-
MW: Think you’d get an error that you can’t bind them both to the same slot.
-
CF: Different slots, but same argument buffer bound to both.
-
MW: Might work.
-
CW: What if you use indirection: in the arg buffer, different members are typed pointers to the same virtual address, but as different types.
-
![][image7]
-
MW: You should try it. The Metal team says they don’t support these kinds of casts; it would be the wrong type when you access it. If you read via the wrongly-typed pointer, then it’s undefined behaviour.
-
CF: To understand what goes wrong: it’s not that the hardware does the wrong thing; it’s that the compiler frontend will block us. MoltenVK does this, and it seems to work fine.
-
CW: The difficulty is we don’t want to rely on undefined behaviour of the compiler. The trick where we have a bunch of indirections: I think we should try it; the compiler won’t understand what’s going on, and it should work.
-
CW: What about storage buffers.
-
![][image8]
-
CF: Same kind of problem, but now it’s aliasing: viewing the same buffer as both T and U. This is invalid under C++14 type-based alias analysis (the same locations are interpreted through two different pointer types). WebGPU can’t statically detect this case. (There are many cases where Clang would detect that the index value 0 is static, but we wouldn’t be able to.)
-
CF: Is this a thing that, as written, could be supported.
-
MW: To clarify, are you asking about a program that accesses both resource1 and resource2.
-
MW: The LLVM version of the Metal compiler may not support the flags that normal Clang has.
-
MW: C++17?
-
CF: Unlikely, would need a memcpy which we can't do
-
CW: Work around as a bunch of uint4s in the buffers?
-
CF: We could pretend all the buffers are uint4 and then reconstruct after the loads (in value space).
-
CW: It was good enough for D3D ByteAddressBuffer.
-
CF: Maybe with slowness.
-
CW: Problem 3?
-
![][image9]
-
CF: In the case of failure (in problem 1), resource1_valid is false and resource2_valid is false, and we have to return some placeholder resource. Now we have two variables that the MSL compiler knows point to the same place, but they are read-write textures. Think this is OK in MSL?
-
MW: This is perfectly well defined.
-
CW: Context. Bindless in Metal is super-powerful, e.g. because of argument buffers and pointer-chasing. So the Metal compiler can be more strict about aliasing, because the base facility can express different kinds of objects. [So apps wouldn’t need this.] But in WebGPU we have less expressibility and hence have type-aliasing.
-
CF: We have to pick a design that must be well-defined behaviour in the Metal stack.
-
MW: The Metal compiler is good at generating nonsense code when there is undefined behaviour. It exploits the opportunity.
- SW: There are no interesting devices that have zero Storage* in the fragment stage, but Adreno has fewer in fragment than compute
- SW: The maxStorage*InFragmentStage limits were added largely because of Adreno 5XX, which has 8 storage buffers in compute, but 4 each in vertex and fragment. If we don’t have separate limits for fragment, we’d have to lower the existing maxStorage*PerShaderStage in Compat to 4 (meaning compute would be artificially limited to 4)
- SW: Most of this legwork was done by Teo in this comment
- JB: do we need all four limits?
- SW: yes, because of ARM Mali which has zero for vertex-related limits
- CW: In the interests of forward progress, could we agree to artificially limit compute in Compat to 4 and remove the two fragment-related limits?
- MW: we support adding all four limits in Core and Compat
- CW: propose still renaming the new limits to include “Compatibility”
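For concreteness, a hedged sketch of how an app might request these per-stage limits, assuming the `@webgpu/types` definitions; the limit name follows this discussion and was still subject to renaming at the time, so treat it as provisional.

```ts
// Requesting a device that relies on the per-stage storage limits discussed above.
async function initCompatDevice(): Promise<GPUDevice> {
  const adapter = await navigator.gpu.requestAdapter(); // assume a compatibility-mode adapter
  if (!adapter) throw new Error("no WebGPU adapter");
  return adapter.requestDevice({
    requiredLimits: {
      // e.g. Adreno 5XX in Compat: 8 storage buffers in compute, 4 in fragment
      maxStorageBuffersInFragmentStage: 4, // provisional name per the discussion
    },
  });
}
```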