这是indexloc提供的服务,不要输入任何密码
Skip to content

Push constant proposal #4612

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 13, 2024
Merged

Push constant proposal #4612

merged 10 commits into from
Jul 13, 2024

Conversation

shaoboyan
Copy link
Contributor

Push constants proposal. Related issue #75

Copy link
Contributor

github-actions bot commented Apr 30, 2024

Previews, as seen when this build job started (6d92094):
WebGPU webgpu.idl | Explainer | Correspondence Reference
WGSL grammar.js | wgsl.lalr.txt

uint32_t pushConstantsSize = 0;
};
```
two pipeline layouts are defined to be “compatible for push constants” if they were created with identical push constant size. It means push constants values can share between pipeline layouts that are compatible for push constants.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's a compile error for a module to use push constants outside of its defined size range?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts is that we follow out-of-bounds access here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can validate that the shader module uses no more than the push constants allowed by the module so we should be fine. Also dynamic indexing of push constants is not possible so in the shader we know that all accesses are in bounds.

| --- | --- | --- | --- | --- | --- | --- |
| `var<push_constant>` | Immutable | [Module](https://www.w3.org/TR/WGSL/#module-scope) | [Concrete](https://www.w3.org/TR/WGSL/#type-concrete) [constructible](https://www.w3.org/TR/WGSL/#constructible) [host-shareable](https://www.w3.org/TR/WGSL/#host-shareable) excludes [array types](https://www.w3.org/TR/WGSL/#array-types) and [structure types](https://www.w3.org/TR/WGSL/#struct-types) contains array members | Disallowed | | Yes. [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) |

NOTE: Each [entry point](https://www.w3.org/TR/WGSL/#entry-point) can statically use at most one push constant variable.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, exactly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make implementation a simpler. So we don't need to introduce @offset in wgsl to describe multiple push constant variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would prefer push constants to be like bindings, and not tied to a specific entry point. I tend to keep my bindings/push constants in a separate file I can reuse across passes like so.

HLSL seems to have requirements around having push constants be structs anyways gfx-rs/wgpu#5571

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was my strawman in #75 (comment) considered at all? If we assume 4-byte elements and alignment, I think this could become a lot easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magcius in that proposal, where are @parameter variable declared? At the global scope of as function arguments? It seems very similar to specifying an offset, except that it stays in "words" instead of bytes. It'd be nice to support smaller data types in push constants eventually so IMHO it's still be better to act on bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?

I agree it is an optional design. Personally, I'm slightly prefer current address space mode proposal because:

  • It defines a global scope variable (like uniform), so that the helper functions could use it directly. ( instead of accepting attributes as parameter, if we pass push constant variable as entry point function parameter)
  • From API aspect, pushConstantBytes is related with PipelineLayout rather than ShaderModule. Take it as a special binding resource is natural.


```javascript
interface GPUCommandEncoder {
void setPushConstantU32(uint32_t offset, uint32_t value);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What units is offset in? If bytes, are there any restrictions? If I do setPushConstantU32(1, 0x12345678);, will it overwrite bytes 1, 2, 3, 4 and leave byte 0 alone? Should there be alignment restrictions?

Copy link
Contributor Author

@shaoboyan shaoboyan Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same as bufferOffset )
It is in bytes, and it should be multiple of 4 bytes. And the alignment should follow this(https://gpuweb.github.io/gpuweb/wgsl/#alignment-and-size). So it means a the pushconstantSize = variable sizes + padding sizes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about f16? Should it just be set via a packed shared buffer?

Copy link
Contributor Author

@shaoboyan shaoboyan May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F16 should be set via packed shared buffer, or setPushConstantU32 so that we don't change the bits.


| Address space | Sharing among invocations | Default access mode | Notes |
| --- | --- | --- | --- |
| `push_constant` | Invocations in [shader stage](https://www.w3.org/TR/WGSL/#shader-stages) | [read](https://www.w3.org/TR/WGSL/#access-read) | For [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) variables exclude [array types](https://www.w3.org/TR/WGSL/#array-types) variable or [structure types](https://www.w3.org/TR/WGSL/#struct-types) variable contains [array types](https://www.w3.org/TR/WGSL/#array-types) attributes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the uniform buffer reference a copy paste error?

Separately, are arrays excluded to avoid Vulkan's default uniform access requirements? It will seem arbitrary to users unfamiliar with Vulkan. Is this a better resolution than requiring non-uniform indexing (probably)? Is it better than having a uniformity requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the uniform buffer reference a copy paste error?

No, I think uniform buffer variable defined in wgsl aligns with what I want with push constant. I think push constant variables should be host-shareable constructible type, and must satisfy the address space layout constraints.

Separately, are arrays excluded to avoid Vulkan's default uniform access requirements?

Not only Vulkan, but also DX12. DX12 says Arrays are not permitted in constant buffers that get mapped onto root constants since dynamic indexing in the root signature space is not supported (https://learn.microsoft.com/en-us/windows/win32/direct3d12/using-constants-directly-in-the-root-signature)

| --- | --- | --- | --- | --- | --- | --- |
| `var<push_constant>` | Immutable | [Module](https://www.w3.org/TR/WGSL/#module-scope) | [Concrete](https://www.w3.org/TR/WGSL/#type-concrete) [constructible](https://www.w3.org/TR/WGSL/#constructible) [host-shareable](https://www.w3.org/TR/WGSL/#host-shareable) excludes [array types](https://www.w3.org/TR/WGSL/#array-types) and [structure types](https://www.w3.org/TR/WGSL/#struct-types) contains array members | Disallowed | | Yes. [uniform buffer](https://www.w3.org/TR/WGSL/#uniform-buffer) |

NOTE: Each [entry point](https://www.w3.org/TR/WGSL/#entry-point) can statically use at most one push constant variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enforce a single push constant interface we should consider an alternate design of making it a parameter to the entry point (identified via an attribute). If you did consider it, what did you see as the draw backs?


```javascript
interface GPUCommandEncoder {
void setPushConstantU32(uint32_t offset, uint32_t value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about f16? Should it just be set via a packed shared buffer?

Four new functions in `GPUCommandEncoder`.

```javascript
interface GPUCommandEncoder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like these commands should either take a pipeline or be on render/compute encoders.

Copy link
Contributor Author

@shaoboyan shaoboyan May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On render/compute encoders. And that's why I tend to add these APIs in GPUCommandEncoder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands wouldn't all pipelines need the same push constants. That's more what I was getting at of needing to associate the command with a particular pipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPUCommandEncoder is the top-level encoder. If you want to add these methods to render/compute encoders you need to add them to GPUBindingCommandsMixin (assuming they be used in render bundles).

We could associate push constant values with a particular pipeline by passing them as part of setPipeline() (kind of parallel to overrides in createPipeline()), but since IIUC setPushConstant() is often going to be called multiple times per setPipeline(), I think it makes more sense to make it parallel setBindGroup(): there are separate states for the current {pipeline, bind groups, push constants} and we capture all the current states each time we hit a draw/dispatch call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands wouldn't all pipelines need the same push constants. That's more what I was getting at of needing to associate the command with a particular pipeline.

Yes. My expected use case is:

passEncoder.setPipeline().
passEncoder.setBindGroup();
passEncoder.setPushConstant();
...
// setBindGroup();
// setPushConstant();

If you want to add these methods to render/compute encoders you need to add them to GPUBindingCommandsMixin (assuming they be used in render bundles).

Thanks! That's the place I expected.

@kdashg kdashg added the wgsl WebGPU Shading Language Issues label Apr 30, 2024
@kainino0x kainino0x added api WebGPU API proposal labels Apr 30, 2024
@shaoboyan shaoboyan requested a review from toji May 7, 2024 08:09
@mwyrzykowski mwyrzykowski self-requested a review May 8, 2024 23:29
@kdashg kdashg added this to the Milestone 2 milestone May 8, 2024
void setPushConstantU32(uint32_t offset, uint32_t value);
void setPushConstantI32(uint32_t offset, int32_t value);
void setPushConstantF32(uin32_t offset, float32_t value);
void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kelsey points out we would want to add a dataOffset and dataCount (like other entry points that take BufferSources) so we can point directly into a larger array buffer instead of copying it into a new ArrayBuffer).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the API shape to void setImmediateDataRange(uint32_t rangeOffset, AllowSharedBufferSource data, optional dataOffset, optional size);

void setPushConstantU32(uint32_t offset, uint32_t value);
void setPushConstantI32(uint32_t offset, int32_t value);
void setPushConstantF32(uin32_t offset, float32_t value);
void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good, but I think we only need the last signature taking an AllowSharedBufferSource and not the other overloads (U32, I32, F32, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But tend to keep these U32, I32, F32 for convenience , especially we tend to support small amount of data uploading.

@mwyrzykowski
Copy link

I think this will have impact on the already complex matrix of GPUSupportedLimits.

Have we considered making setVertexBuffer take an ArrayBufferSource? Then we don't need to change the limits.

I didn't take a look at the D3D or Vulkan backends yet so maybe that is the motivation? On Metal, this is as simple as calling setVertexBytes: / setFragmentBytes:

@litherum
Copy link
Contributor

litherum commented May 9, 2024

Assuming that push constants would be implemented using D3D12's root constants...

If I'm not mistaken, D3D12 has a fixed (per-device?) amount of space in the root signature. In shipping WebGPU, the WebGPU implementation gets to decide how to use that space.

For example, if the WebGPU Pipeline Layout includes 2 textures, the WebGPU implementation can place those 2 texture descriptors directly inside the D3D12 root signature. Or, alternatively, the WebGPU implementation can choose to put both texture descriptors into a descriptor table, and make the root signature point to the descriptor table. The WebGPU implementation has freedom here to use this fixed root signature space as it desires.

Okay, now let's add in support for push constants. Push constants would also live directly within the fixed root signature space. But what if all the root signature space is already occupied (e.g. by those texture descriptors from the above paragraph)?

Even if WebGPU only allows a single push constant, that doesn't actually solve the problem. The root signature space may be 100% occupied by descriptors, leaving no space at all for push constants.

There are a variety of potential solutions to this problem:

  • Make push constants space allocation be best-effort. If there isn't space within the root signature, fail to set the push constant. Different devices will fail in different situations.
  • Tell every implementation that it must leave X bytes of extra space in the root signature for push constants, where "X" is a literal value spelled out in the spec. This is unfortunate because it pessimizes - the app might not need all X bytes, and the remaining bytes might be better spent on descriptors
  • Have the app request how much push constant space it needs, within the WebGPU Pipeline Layout, and add platform-agnostic validation rules to make sure the app doesn't ask for too much space. Inventing such validation rules will be tricky, because the rules would have to balance the freedom of WebGPU implementations to place descriptors freely, with the desire to allow applications to request as much push constant memory as they want/need. Another reason I don't like this option is that, if the application requests too much push constant space and pipeline layout creation fails, that doesn't mean they can't access data of the requested size, but it instead just means they can't access data of the requested size as push constants.
  • (My favorite option): Let push constants be an implementation detail, rather than an explicit part of the WebGPU API, and treat the root signature space as just another part of the pipeline layout.
    • As @mwyrzykowski suggests, the group could add a new method to GPUBindingCommandsMixin (as a sibling to setBindGroup()) that accepts immediate data (in the form of an ArrayBuffer) rather than a BindGroup. (Or, maybe, alternatively, this could be an overload of createBindGroup() that takes immediate data.)
    • The group could even add a maxBindingSize field to GPUBufferBindingLayout, which would allow the the WebGPU pipeline creation routine can know whether the data is small enough that it can guarantee that the data fits into push constant space.
    • Limits accounting doesn't change at all, which is appealing

@litherum
Copy link
Contributor

litherum commented May 9, 2024

@Kangz
Copy link
Contributor

Kangz commented May 14, 2024

GPU Web WG 2024-05-08 Pacific-time
  • SY: Added a new address space called “push_constant”. Similar to uniform but:
    • Each entry point can only use one push_constant binding, for easier layout.
    • Cannot index arrays in push_constant space.
  • SY: Question of how large the space can be. We need to reserve some space for the implementation. Limit is small on backends (e.g. root descriptor in d3d12)
  • SY: Also question of how to upload push constants. Proposed to add where setBindGroup is available (render pass, compute pass, and render bundle encoders).
  • SY: Improved performance over uniform buffer
  • KG: Mozilla needs some time to review this
  • SY: Alan is asking about lifting the restriction for one push_constant per entry point.
  • MM: Isn’t the amount of available space dependent on the number of bind groups (which also take space in the root descriptor)?
    • SY: Yes, d3d12 64 DWORDs. Even a small limit is useful, though prefer to make it larger
    • MM: If you add accounting that bind groups + push constants is under some limit (think this is a good idea) then the mapping from bind groups to root constants needs to be described in our spec. Restricts implementations ability to hoist push constants in/out of tables.
    • SY: Would you mind leaving a comment on the PR?
    • MM: Sure
    • MW: Is there a reason we are designing this proposal instead of setVertexBuffer with dynamic data or similar?
    • MM: 3 or 4 years ago we discussed this; we ended up on the trail of whether push constants was actually faster than putting the data in a buffer. Why can’t it be polyfilled by the application. Not sure if performance was proven to be better. Then the discussion veered off into whether, if we’re going to specify push constants, should we say there are no limits on push constants, and the implementation would only use backend push constants if it was small enough. Otherwise silently promote to a buffer.
    • MW: Do think the performance is better (at least in metal backend for small sizes). Question is why introduce new comment instead of allowing attaching an ArrayBuffer as a vertex buffer.
      • Other thoughts:
        • I mean we could have either:
        • void setPushConstants(uint32_t offset, uint32_t count, AllowSharedBufferSource data);
        • or
        • setVertexBuffer(GPUIndex32 slot, AllowSharedBufferSource data, optional GPUSize64 offset, optional GPUSize64 size);
        • The latter seems simpler so we don't have to rework the GPUSupportedLimits as this will impact those. The backend internally can decide whether or not to create a buffer or use push constants.
        • There are also implications where if this is supported by GPURenderBundle, then we would likely create a buffer in any case
    • MM: In SPIR-V, shader needs to know whether it’s reading from a push constant. Would need to know at compile time where it’s pulling the data from.
    • KG: For why we have setPushConstantU32/I32/F32 instead of just the AllowSharedBufferSource overload, still can be debated.
  • KG: (This issue here is M2, so making this M2 also)
  • KG: We will take a closer look.

@shaoboyan
Copy link
Contributor Author

Let push constants be an implementation detail, rather than an explicit part of the WebGPU API, and treat the root signature space as just another part of the pipeline layout.

Just point out possible implementation issues I observed:

  • On Metal this works fine since it is super align with metal API SetVertexBytes() and so on. But it might have some problem for Vulkan and D3D12. Both Vulkan and D3D12 defines push constants or root constants in pipelineLayout. When calling SetPushConstants(), we need to ensure the pipeline has such push constants range. I think this implementation hints that we need to allocate remain root space as root constants and we need to always keep a internal binding point for each bind group to handle the case that "User setPushConstants() but we don't have enough space, so fallback to a buffer".

@mwyrzykowski
Copy link

I think this implementation hints that we need to allocate remain root space as root constants and we need to always keep a internal binding point for each bind group to handle the case that "User setPushConstants() but we don't have enough space, so fallback to a buffer".

It sounds like D3D and Vulkan can be optimized to not use a buffer when the push constants are small enough and there is space. With Metal, we always need to reserve a buffer slot for the push constants. Though it would be possible to map them all into one buffer slot.

Abstracting this concept seems to further justify that the term 'push constants' does not necessarily need to exist in WebGPU. Rather, the D3D and Vulkan backends can implement small, dynamic setBindGroup calls into push constants and on Metal, we would use set[Vertex/Fragment]Bytes unless the data is large, then we would make a buffer

@JMS55
Copy link

JMS55 commented May 15, 2024

If you tie push constants to specifically render pipelines, then you can't use them for compute shaders (and later, other types of pipelines), which would be unfortunate.

@litherum
Copy link
Contributor

@JMS55

Yeah, people in this thread are using a variety of different terminology. Above, when people say "setVertexBytes()" or "setFragmentBytes()," I believe they are referring to the Metal functions of the same name, as an example of a function that takes immediate data (rather than a method taking resources or bind groups). Even in Metal, there's a way to supply immediate data to compute pipelines.

@litherum
Copy link
Contributor

tl;dr: Thinking deeper, I think the only way it can work in Vulkan is either to just not add the feature at all, or add an explicit API for it.

I think the biggest unanswered question here regarding "let push constants be an implementation detail" is this: Consider such a program:

let renderPassEncoder = commandEncoder.beginRenderPass(...);
renderPassEncoder.setPipeline(...);
renderPassEncoder.setImmediateData(new Int32Array([42])); // Unclear how this will interact with bind groups, but let's assume that problem is surmountable...
renderPassEncoder.draw(...);
renderPassEncoder.setImmediateData(new Int32Array(/* a very long array */));
renderPassEncoder.draw(...);

The question is: What should the code in the shader be to handle both of these cases? In the first case, we'd want to promote the 42 to a push constant (probably? AFAICT it hasn't been demonstrated whether push constants are actually consistently better to use cross-platform), but in the second case, the data is too big to fit in push constants. Both situations would (probably) need to be supported by a single shader - setPipeline() is only called once in the above code sample.

In SPIR-V, the text of the shader is actually different depending on whether or not you're accessing a push constant or not. For example, in GLSL (which I'm using here just as a readable form of SPIR-V), you'd either have this:

layout(push_constant) uniform Constants {
	int pushConstantData;
};

Or this:

layout(set = 0, binding = 0) uniform Constants {
	int uboData;
};

If we operate under the constraint that no amount of recompilation is allowed to happen while commands are being recorded into the render pass - not even supplying specialization constants - then the only way this can work is if both blocks are present in the shader, and each access to the data is guarded by a flag - in software - to determine whether the data was bound to pushConstantData or uboData. Adding this selection in software probably defeats the whole purpose of trying to use push constants in the first place. It's probably a dealbreaker. (But its performance should probably be measured nevertheless.)

Assuming it is in fact a dealbreaker, it seems like the only way this could ever work (again, assuming that the committee does want to make it work, which I'm still not convinced has been sufficiently justified) would be to come up with a system where the implementation can guarantee that the immediate data being set by the app will fit into push constant space. This is why I suggested a maxBindingSize field within the pipeline layout. Such a field could allow the implementation to know that the data provided by the application will fit into push constants.

But there's even an additional problem - what about the following program:

let renderPassEncoder = commandEncoder.beginRenderPass(...);
renderPassEncoder.setPipeline(...);
renderPassEncoder.setImmediateData(new Int32Array([42]));
renderPassEncoder.draw(...);
renderPassEncoder.setBindGroup(/* clobbering the immediate data above */);
renderPassEncoder.draw(...);

How should this work, assuming no amount of recompilation is acceptable? Imagine the bind group contains a buffer whose contents is populated by a previous compute shader - the CPU has never seen this data, and therefore such data cannot be hoisted to push constants. There would have to be some way, at the API level, to distinguish the immediate data from the bind group data, so that the shader wouldn't have to handle both cases in software at runtime.

I think the only way this could work is to add a new type of descriptor in the bind group layout. That way, an entry in the bind group layout could either refer to a buffer, or refer to immediate data, but not both. Any data set by immediates for one draw call would be incapable of being clobbered by a GPU-populated buffer for a subsequent draw call - thereby allowing the shader compiler to know that it's safe to only emit the layout(push_constant) block above.

So then that begs the question: What should this new type of descriptor in the bind group layout be named? Perhaps it should be named ... "push constant"...

@magcius
Copy link

magcius commented May 15, 2024

This is turning into a bit of a bikeshed opportunity, but I'll add my two cents to the jar:

  • I do not want shader recompilation at command buffer runtime, after pipelines have been created. This would be a very unfortunate thing to require implementations to do, after we spent the effort to put PSOs into the spec. Expensive synchronous recompilations cause stalls that we are trying to avoid in WebGPU.
  • The big value I see in push constants is boot-strapping indices into other buffers or bind groups; this is where I would expect to see them come in the most handy.
  • Just because D3D12 offers an ability to embed some maximum number of root constants, does not mean that going all the way up to that number is necessarily free. My proposal was limited to 4 u32's for a reason.
  • Making push constants transparent to the application would be extra difficult to support uses that mix both bind groups and push constants. See below for more details.
  • Push constants might be especially relevant when talking about newer tools like "bindless", where shaders index arrays of resources like textures or buffers.

To give a bit of an unprompted history lesson and awareness of how these features are implemented in practice...

The simplified AMD interpretation of push constants (not applicable to all drivers and hardware, just giving one mental model): The hardware has some general-purpose 32-bit scalar registers (SGPRs); these registers are active storage for threads, but also can be initialized with user constants set up by the hardware that launches threads. Some of these registers are used by the driver itself for internal purposes (e.g. the "base draw offset" feature from Vulkan would get passed by an SGPR), some of them are used by the driver to implement binding models (D3D12 CBVs might get mapped directly to an SGPR), and some of them are used to store memory addresses to binding tables (e.g. if you have a lot of textures, the driver will allocate a block of memory which has the texture descriptors, and store a pointer to the table in an SGPR; these are D3D12's "descriptor tables").

The D3D12 idea was that all root signature parameters (root constants, CBVs, descriptor tables) would map to SGPRs; however, as the driver sometimes needs some for its own purposes, this isn't a 1:1 mapping, and sometimes the driver has to jettison some of the parameters to a memory block and do some finagling to rearrange all of the binding updates. Note that since the usage of an SGPR is arbitrary (it can contain a memory address or a value, just depending on how it's used), the usage needs to be determined at root signature time; in practice, the AMD compiler effectively reserves a few driver SGPRs that work across all platforms.

So, in order to make "transparent push constants" work, we can't simply replace the first N bytes of a uniform buffer of a bind group with immediate data; using it with a bind group will require a pointer in the SGPRs, while using it with push constants will require the values in SGPRs. Instead, we'll need syntax that lets us optimistically compile the first N push constants to SGPRs, and then the rest get jettisoned to a memory buffer. This is enabled a bit more by the extra pipeline layout field so we know up-front what will fit, but it still seems more complicated than just having a small fixed-size limit in the specification. We can always expand to support bigger sizes through emulation later.

@litherum
Copy link
Contributor

litherum commented May 15, 2024

The big value I see in push constants is boot-strapping indices into other buffers or bind groups

shaders index arrays of resources like textures or buffers.

I think we have to be careful here - if the biggest use case is simply "make some indices available to a shader" then a simple API convenience method would be sufficient, without changing the binding model or adding push constants at all.

@litherum
Copy link
Contributor

(I just realized that we already have that simple API convenience method: GPUQueue.writeBuffer(). I think the burden of proof needs to be "why push constants in addition to writeBuffer()?"

@Kangz
Copy link
Contributor

Kangz commented Jul 9, 2024

GPU Web WG 2024-06-26 Atlantic-time
  • CW: Can we discuss this without MikeW? Probably. He seemed to agree.
  • KG: We didn’t get a chance to talk about this on our team. We agree that it’s at least M2, and we wanted to prioritize the other issues, and there are a lot of other issues. Let’s not talk about this.
  • CW: Sounds fine. WGPU has it, MikeW agreed… it’s clear it will happen at some point. People can start experimenting, even if it’s not 100%. There will be output on the agenda request for a Pacific time meeting, then.

@mwyrzykowski
Copy link

  • CW: Sounds fine. WGPU has it, MikeW agreed… it’s clear it will happen at some point. People can start experimenting, even if it’s not 100%. There will be output on the agenda request for a Pacific time meeting, then.

It would still be nice to get rid of the typed method names which seem out of place with the rest of the specification:

        void setImmeidateDataU32(uint32_t offset, uint32_t value);
        void setImmediateDataI32(uint32_t offset, int32_t value);
        void setImmediateDataF32(uin32_t offset, float32_t value);

and maybe allow something more flexible like:

void setImmediateData(uint32_t offset, GPUPipelineConstantValue value, USVString type);

Not sure what @shaoboyan thinks

@toji
Copy link
Member

toji commented Jul 9, 2024

Unless we expect that we'll be adding a lot more data types I don't see any particular advantage to passing the type as an enum, and it probably adds some additional work for implementors to do the value cast in the implementation rather than allowing the WebIDL layer to handle it for you. Plus I think the variant with separate methods for each type is more WASM-friendly as well?

@magcius
Copy link

magcius commented Jul 9, 2024

I proposed the original wrapper functions as optional sugar, and I'm fine with them being removed in favor of an ArrayBuffer interface if we can't agree on them. We already require that users construct ArrayBuffers for writeBuffer so it's not like a user should have any difficulty figuring it out.

@shaoboyan
Copy link
Contributor Author

I'm a fan of using IDL to handle type conversion. I think immediate data is a perf focus API so I prefer avoiding any non-necessary steps.

Agreed. Here's a benchmark of a f32-to-u32 reinterpretation, a common math operation one would do with TypedArray. https://jsbench.me/2vlxcmotny/1

According to this benchmark, I think the perf gap between typed data and constructed array buffer is not trivial. So I support specific typed APIs

adds some additional work for implementors to do the value cast in the implementation

I'm slightly preferred to put type conversion in IDL layer so that we might benefit from future IDL opts. And again, this might help the uploading perf compared with extra enum variable.

@magcius
Copy link

magcius commented Jul 10, 2024

The question to me is comparing:

setImmediateDataF32(0, 123.0);
setImmediateDataF32(4, 456.0);
setImmediateDataF32(8, 789.0);
setImmediateDataF32(12, 101112.0);

and

setImmediateData(0, new Float32Array([123.0, 456.0, 789.0, 101112.0]));

There are already multiple tricks to cut down on array buffer allocation cost (e.g. allocate one large buffer which you reuse). I also wish that JS implementations would see this as an opportunity to improve TypedArray performance, but I probably can't have my wish here.

@kdashg
Copy link
Contributor

kdashg commented Jul 10, 2024

Internally in Firefox's WebGL implementation, we convert all calls to e.g. uniform1i into something called UniformData:
Content process:
https://searchfox.org/mozilla-central/rev/0b1d02b2cb5736511139cf0e40b318273e825899/dom/canvas/ClientWebGLContext.h#1910
Gpu process:
https://searchfox.org/mozilla-central/rev/0b1d02b2cb5736511139cf0e40b318273e825899/dom/canvas/WebGLContextGL.cpp#1281

This very nicely predisposes me to setImmediateData being more generic, rather than having typed entrypoints. :)

Copy link
Contributor

@kainino0x kainino0x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per API meeting today - let's get this proposal doc landed! I've added comments about what we should write down before landing it.

@kainino0x kainino0x enabled auto-merge (squash) July 13, 2024 00:34
@kainino0x kainino0x merged commit 97f7abe into gpuweb:main Jul 13, 2024
4 checks passed
@Kangz
Copy link
Contributor

Kangz commented Jul 16, 2024

GPU Web WG 2024-07-10 Atlantic-time
  • CW: We have a general agreement on the general shape: set immediate data. Is there a variant that… How is it spelled? setImmediateData(float)? Offset with array buffer?
  • CW: Everyone is interested in array buffer variant, but single scalar variant are not needed and can be polyfilled.
  • CW: Given this, I’d like to suggest we just do typed array for now. If we get feedback, we can add the scalar versions.
  • MW, KN, KG: thumbs up / nods
  • JB: In WGPU we have something similar, but it lets you supply different values for vertex and fragment stages. Either we haven’t implemented this correctly on every platform, or we could consider this for the spec. If it’s possible and we don’t offer it, then we’re halving the number of push constants that are available, because people will have to provide a combined set of push constants for both stages.
  • CW: I might be misremembering, but in Vulkan, it’s a single space. You can segment it in ranges, but when you set constant s/GPU for the Web/GPU on the Web/ #4, it sets it at that index for all stages
  • TT: I remember seeing this functionality but it wasn’t in the proposal, so I wanted to compare them and have more feedback next week.
  • CW: That’s fine. This feature won’t be done today. There’s no rush.
  • MM: What are the pipeline layout implications?
  • CW: In the pipeline layout, there’s a new “immediateDataRangeByteSize” and you say how many immediates there pipeline will use. And maybe there’s a default layout. It also says that when you bind a pipeline that doesn’t have the same size as a previous pipeline, things are not invalidated. The same way bind groups are not invalidated.
  • MM: What if I request to create a pipeline with a lot of immediates?
  • CW+KN: It is a small limit that defaults to 64bytes in the proposals but devices could expose more.
  • MM: SO the device could expose more if the application asked for it?
  • CW: Yes.
  • BJ: If we’re limiting the API shape to only accept data arrays, let’s make sure we have data offset and size as optional parameters. But that’s already there. So, yeah.
  • KN: Since this is a proposal document and not a change to the spec, we should get it landed. We should take out the u32 i32 f32 versions, and maybe make some notes about open questions, but then we can get this landed?
  • CW: Sounds good.
  • CW: We can let editors add it now.

@Kangz
Copy link
Contributor

Kangz commented Jul 23, 2024

GPU Web WG 2024-07-17 Pacific-time
  • https://github.com/gpuweb/gpuweb/blob/main/proposals/push-constants.md
  • This has been merged into the proposals/ directory. 3 open questions written in that doc.
  • SY: discussion of per-stage push constants. All 3 APIs define immediate data with visibility per-stage. In our proposal, to simplify implementation, the immediate data is visible to all stages. Question is whether we should align with the native APIs - or keep implementation simpler.
    • JB: concern Teo brought up was having a combined set of push constants. All stages have to share the same capacity then, and the capacity's pretty low. If we can design the API to be upward-compatible, and insert an alternative system later with per-stage push constants, would be ideal.
    • KN: it differs between backends whether you actually share the space or not. In D3D these'd go into the root signature, so I assume you wouldn't lose space to have visibility at all stages. Not sure about all backends.
    • JB: if the push constants come out of shared space anyway then yes, visibility would only affect visibility and not the actual amount of immediate memory consumed.
    • KG: no conclusive answer yet - that's fine.
    • MM: DirectX doesn't split the parameter space (for root constants) - just checked.
    • KG: would an investigation be helpful?
    • SY: for D3D, defining different visibility for vertex/fragment - not sure it'd work
    • JB: I can check for Vulkan. If all platforms unify the namespace, that answers the question.
  • SY: should push constants affect bindgroup compatibility? If you have comfortable immediate range - user can have immediate data values. No need to make extra calls like "setImmediateData". Concern - have some impl details also using immediate data internally. In impl, might need to do some copy / internal setting. Hesitant to expose these concepts to WebGPU. Always call setImmediateData when calling setPipeline?
    • KG: initially - would say they aren't compatible. In the future, could make them more compatible.
    • SY: agree.
    • KN: we have the immediate byte range for push constants in the pipeline layout (?). Two pipeline layouts the same, both have 8 bytes immediate data, then they're incompatible with each other?
    • SY: not incompatible with each other, but incompatible in immediate data range.
    • KG: think we should try to maintain - have a pipeline layout descriptor that's teh same between two pipelines - they should be compatible. Shouldn't make them incompatible because you changed the immediate data byte size.
    • KN: ok. That's what the proposal says. Compatible for immediate data if they're the same size.
  • SY: the third issue is a question. Each entry point should only use a single immediate data. Should this restriction be extended?
  • KG: sounds difficult to expand this and know the order they happen in.
  • KN: if we expanded it we'd put an ID on it and you'd specify the ID when setting immediate bytes.
  • KG: sounds nicer to say 1 for now.
  • KN: does make the proposal simpler.
  • KN: Jasper was asking, why this restriction. Seemed artificial, compared to what's available in the backends.
  • MM: any precedent to allowing only 1 of anything in WebGPU?
  • KN: sort of. Only 2 attachments if using dual-source blending. Generally, don't think so.
  • MM: seems artificial to me too. From author's perspective: a shame if we have a great feature but you only get one.
  • KN: it's still limited to 64 bytes by default.
  • MM: could imagine someone wanting 2 integers.
  • KG: do it with structs. Right now, have 1 contiguous range of bytes. This is the time while this is in the proposals stage to explore different ways of doing this.

Comment on lines +121 to +123
- Should pipelineLayout defines immediate range compatible?
- Implementation internal immediate data usage could easily break compatibility. Implementation needs extra
effort to ensure such compatibility.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still an open question? There is this note on line 101:

NOTE: two pipeline layouts are defined to be “compatible for immediate data” if they were created with identical immediate data byte size. It means immediate data values can share between pipeline layouts that are compatible for immediate data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api WebGPU API proposal wgsl WebGPU Shading Language Issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.