Background
Traditional non-tessellated 3D rendering works well when the ratio of mesh density to screen-space size is roughly constant. However, in 3D graphics, it’s common for objects to move closer and farther from the camera, meaning that their screen-space size can change dramatically. When the object appears larger on the screen, it demands a higher density of triangles in order to maintain visual fidelity.
Tessellation is a way of combating this problem. Rather than representing a mesh as a collection of triangles, tessellation represents it as a collection of “patches,” where each patch is a smooth, curved, mathematical surface (e.g. a Bezier patch). Rather than the artist baking the mesh into a collection of triangles at authoring time, the GPU has facilities to convert the patches into triangles at draw-call time. Each draw can have entirely independent parameters for this conversion, which means the density of triangles can change fluidly from one frame to the next.
Motivation
There are a few different pieces of motivation here, grouped into three categories: 1) performance, 2) memory usage, and 3) rendering possibilities that weren’t available before.
- The pipeline reads patch data from memory, rather than triangle data. Because the number of patches is almost always smaller than the number of generated triangles, this decreases the amount of memory bandwidth needed to render the mesh (and therefore increases performance)
- Skinning and morphing can be done on the patch control points, rather than the vertex data itself. Because the number of control points is smaller than the number of vertices, this decreases the total amount of work the GPU must perform (and therefore increases performance)
- Memory usage is decreased because the high-resolution model is never stored in memory
- Rather than having a fixed number of LODs for your mesh, and snapping between them, tessellation allows for fluidly changing the density of the mesh per-frame. This kind of flexibility is impossible without tessellation
- If the tessellation factors are all 1, the control points and the vertices are identical, which means that the domain shader effectively acts as a poor man’s geometry shader. It can consult the vertices for the entire triangle, rather than each vertex being independent as in a vertex shader (but it can’t generate additional geometry). Given WebGPU’s general direction of not including a real geometry shader, this can get us halfway there.
Performance
I wanted to understand the performance claims, so I wrote a benchmark (Tessellation.zip) to measure them; it’s the “PerformanceTest” target in the linked project. The goal is to compare the performance of a tessellated mesh against a non-tessellated but otherwise identical mesh. I’m not trying to compare the performance of any computation performed in any shader, so the shaders involved do essentially nothing.
Unfortunately, Metal doesn’t seem to allow reading back the tessellated mesh, so I wrote this benchmark using Direct3D 12. One target (the “Tessellation” target in the linked project) draws a triangle at the maximum tessellation factor (64), and uses a geometry shader to write the locations of all the interpolated vertices to a UAV, which gets saved to disk. Then, the “PerformanceTest” target draws this same tessellated mesh (but without the geometry shader), and compares that to drawing the pre-tessellated mesh, which it has read back from disk. The test runs on Windows, so I can’t test a TBDR renderer, but I can at least get some data.
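To make the capture pass concrete, here’s a minimal sketch of the kind of geometry shader the “Tessellation” target uses, assuming a simple pass-through pipeline; the struct, function, and buffer names are illustrative, not taken from the benchmark source:

```hlsl
// Pass-through geometry shader that also appends every tessellated vertex
// position to a UAV, so the CPU can read the mesh back and save it to disk.
struct DomainOut {
    float4 position : SV_Position;
};

// Vertices shared between triangles get appended once per triangle;
// the capture tool can deduplicate them afterwards.
AppendStructuredBuffer<float3> gCapturedVertices : register(u1);

[maxvertexcount(3)]
void CaptureGS(triangle DomainOut input[3],
               inout TriangleStream<DomainOut> stream) {
    [unroll]
    for (uint i = 0; i < 3; ++i) {
        gCapturedVertices.Append(input[i].position.xyz); // record the vertex
        stream.Append(input[i]); // pass the triangle through unchanged
    }
    stream.RestartStrip();
}
```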
The Intel GPU shows no performance change. (Rather, the performance delta is within the noise).
The Nvidia GPU shows that tessellation is 8% faster.
This is reassuring; it shows that, even on a non-TBDR renderer, tessellation is no slower than, and sometimes faster than, pre-tessellated models. This, coupled with the other benefits of tessellation (memory savings and frame-by-frame flexibility), shows that tessellation is worth pursuing.
D3D12
D3D12 models tessellation as two additional stages inside the existing graphics pipeline. The tessellated pipeline looks like: Vertex Shader > Hull Shader > Tessellation > Domain Shader > Geometry Shader > Rasterization > Fragment Shader.
The vertex data and vertex shader act the same as they do without tessellation, except they operate on the control points of the mesh rather than on vertices. The hull shader is executed once per control point, like a vertex shader, but it has read access to all the control points in the patch, like a geometry shader. Its job is to do two things: 1) transform the control points of the mesh (e.g. a basis transformation) and 2) output per-patch data, including the tessellation factors. In D3D12, these two jobs are split into two separate functions, associated with each other via the patchconstantfunc() attribute. Presumably, because the output of the per-patch function is identical for each control point, it only needs to be run once per patch (as distinct from the hull shader proper, which runs once per control point).
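As a concrete illustration, here’s a minimal HLSL sketch of the hull-shader pair described above, assuming a triangle patch, pass-through control points, and hard-coded tessellation factors (all names are illustrative):

```hlsl
struct ControlPoint {
    float3 position : POSITION;
};

struct PatchConstants {
    float edges[3] : SV_TessFactor;       // one factor per triangle edge
    float inside   : SV_InsideTessFactor; // interior density
};

// Job 2: output per-patch data. Runs once per patch, not per control point.
PatchConstants PatchConstantFunc(InputPatch<ControlPoint, 3> patch) {
    PatchConstants pc;
    pc.edges[0] = pc.edges[1] = pc.edges[2] = 20.0; // uniform factor for the sketch
    pc.inside = 20.0;
    return pc;
}

// Job 1: transform the control points. Runs once per control point,
// but can read the whole patch.
[domain("tri")]
[partitioning("fractional_odd")]
[outputtopology("triangle_cw")]
[outputcontrolpoints(3)]
[patchconstantfunc("PatchConstantFunc")]
ControlPoint MainHS(InputPatch<ControlPoint, 3> patch,
                    uint i : SV_OutputControlPointID) {
    return patch[i]; // a real shader might change basis here
}
```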
The next stage is the tessellator. It consumes the tessellation factors (but not the control points of the mesh) and outputs normalized vertices with coordinates in the 0–1 range.
The domain shader runs once per tessellated vertex. Its job is to combine the normalized tessellation coordinates with the control point information output by the hull shader. Similarly to the hull shader, the domain shader is allowed to read all the control points for the patch, even though it operates on just a single vertex. These control points are passed unmodified between the hull shader and the domain shader; the tessellator doesn’t touch the control point information. If you’re doing something like mapping tessellated vertices onto a Bezier patch, this is where the actual mapping equations would go.
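Continuing the sketch above (and reusing its ControlPoint and PatchConstants structs), a domain shader for a triangle patch might look like the following; a real Bezier evaluation would replace the barycentric interpolation with the patch’s basis functions:

```hlsl
[domain("tri")]
float4 MainDS(PatchConstants pc,
              float3 bary : SV_DomainLocation, // normalized tessellator output
              const OutputPatch<ControlPoint, 3> patch) : SV_Position {
    // Combine the normalized coordinate with the patch's control points
    // to produce the final vertex position.
    float3 p = bary.x * patch[0].position
             + bary.y * patch[1].position
             + bary.z * patch[2].position;
    return float4(p, 1.0);
}
```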
Vulkan
Vulkan’s model is almost identical to D3D12’s, except that the “hull shader” is called a “tessellation control shader” and the “domain shader” is called a “tessellation evaluation shader.”
The only real difference I could find is that, rather than being separated into two distinct functions like the hull shader, the tessellation control shader has just a single function. The per-patch outputs (e.g. tessellation factors) are accessible from any invocation in the patch; however, much like the “provoking vertex” concept in flat shading, if two invocations write conflicting data to these patch outputs, one of them wins.
Metal
Metal’s model is quite different from the other two. It’s much simpler, and was designed with compute shaders in mind. The tessellated graphics pipeline is: Tessellator > Post-Tessellation Vertex Shader > Rasterizer > Fragment Shader.
You’ll notice that this pipeline is shorter than the other two APIs’ pipelines. This is intentional, to keep the model simple and understandable without loss of expressivity: the missing stages are designed to be implemented with compute shaders instead, if necessary.
The tessellator is the first stage in the tessellated graphics pipeline. It reads the tessellation factors from a buffer, which is set up from the API very similarly to a vertex buffer. It generates normalized vertices which are fed to the post-tessellation vertex shader.
The post-tessellation vertex shader plays the same role as the domain shader. It runs once per tessellated vertex, and is responsible for transforming that vertex for rasterization. Like the domain shader, it has access to all the control point information for its patch.
Control point information is passed into the post-tessellation vertex shader using the same stage-in facilities that the non-tessellated graphics pipeline uses. There are two new vertex buffer step modes, perPatch and perPatchControlPoint, which allow for streaming data into the post-tessellation vertex shader.
Bonus: Mesh Shading
Another way tessellation can be achieved is with mesh shading. This is an entirely new pipeline, rather than an addition to the existing graphics pipeline. The new pipeline is: Task Shader (also known as “Amplification Shader”) > Mesh Shader > Rasterization > Fragment Shader. The two new shader stages are based on compute shaders: they execute with almost no built-in inputs and are organized into local workgroups.
The mesh shader’s job is, for each workgroup, to produce a tiny vertex buffer and index buffer, though these “buffers” are kept on-chip and never hit memory. It can also output a collection of vertex attributes.
The task / amplification shader is optional; its job is simply to spawn zero or more mesh shader workgroups per task workgroup, which it does by writing the desired count into a threadgroup output variable.
There’s no vertex fetch stage; the mesh shader is responsible for reading from memory (or not), depending on what it’s trying to do. It can’t access the fixed-function tessellator hardware, but the tessellation algorithm can be implemented in software in the mesh shader.
Unfortunately, VK_NV_mesh_shader is supported on only 4% of devices on Windows, 2% on Linux, and 0% on Android. Mesh shading isn’t present (yet?) in D3D, and isn’t present in Metal. We probably can’t use it in WebGPU.
Analysis
I wanted to understand the differences between the approaches to see which one would fit best for WebGPU. Vulkan’s approach is almost identical to D3D’s, so I’ll consider them together as a unit when comparing approaches.
First, I wanted to determine whether there was any performance difference between the Metal approach and the D3D approach. The biggest difference between them is that, in the Metal model, control point transformations are performed in a compute shader, whereas in D3D they are performed inside the graphics pipeline.
In order to determine whether there was a performance difference, I wrote another benchmark, Tessellation.zip (it’s the “ModelTest” target in the linked project), that performs an artificially complex operation for each control point in a tessellated triangle. (The expensive operation is controlPoint = tanh(sinh(controlPoint)) in a loop 10,000 times.) The benchmark compares the runtime of moving this expensive operation to different places in the pipeline: a prepass compute shader, the vertex shader, the hull shader, and the domain shader. Rather than maxing out the tessellation factor, I picked a medium value (20) to try to be more representative of average content. For the compute shader, I picked a group size of (1, 1, 1) to be conservative and maximally flexible.
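For reference, the expensive operation as it might appear in HLSL (a sketch; the function name is mine, not the benchmark’s):

```hlsl
// The artificial per-control-point workload: chained hyperbolic functions
// in a long loop, so the compiler can't fold the work away.
float3 Expensive(float3 controlPoint) {
    [loop]
    for (uint i = 0; i < 10000; ++i) {
        controlPoint = tanh(sinh(controlPoint));
    }
    return controlPoint;
}
```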
D3D can execute the Metal model by using a compute shader for the control point transformations and using the domain shader in place of the post-tessellation vertex shader. However, Metal can’t execute the D3D model, because its tessellated graphics pipeline has no vertex shader or hull shader. Therefore, in order to compare both models on the same hardware, I implemented the benchmark in D3D.
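For the prepass-compute-shader variant (i.e. the Metal model expressed in D3D), the transformation pass might look something like this sketch, reusing the Expensive function above and the (1, 1, 1) group size mentioned earlier; the buffer names are illustrative:

```hlsl
// Control point transformation as a compute prepass: read the raw control
// points, apply the expensive transform, and write the results out for the
// tessellated draw to consume.
StructuredBuffer<float3>   gControlPointsIn  : register(t0);
RWStructuredBuffer<float3> gControlPointsOut : register(u0);

[numthreads(1, 1, 1)]
void TransformCS(uint3 id : SV_DispatchThreadID) {
    gControlPointsOut[id.x] = Expensive(gControlPointsIn[id.x]);
}
```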
“Baseline” is the execution time without performing the expensive computation. Each of the other bars on the chart represents the runtime when the expensive computation is performed in that particular shader stage.
The results seem to indicate that the performance between the two models is roughly identical.
The fact that the domain shader is so much slower than the other stages makes intuitive sense. The domain shader is executed once per tessellated vertex, rather than once per control point, so doing the expensive computation there increases the total amount of work performed. Also, the expensive computation needs to be performed on each control point, and the domain shader has access to all the control points in its patch, which means the domain shader must perform the expensive computation multiple times: once per control point.
Recommendation
Given that:
- The performance between the two models is roughly equivalent
- The Metal model is simpler and easier to understand
- The D3D model can naturally express the Metal model, but the Metal model can’t naturally express the D3D model. In order to represent the D3D model on Metal, WebGPU would have to interrupt the currently-executing render pass and insert a compute pass in the middle. Alternatively, it could inject the compute pass before the render pass and add additional restrictions on what an author can do in their render pass. However, both of these options place an undue burden on either browser developers or website authors.
Therefore, the Metal model is a better fit for WebGPU than the D3D model.