[Impeller][iOS] Remove nextDrawable latency by deferring drawable acquisition.

# Background

See also: https://github.com/flutter/flutter/issues/134959

iOS only allows applications to request two or three swapchain images ("drawables"). When running at the device frame rate, Flutter applications frequently block for a handful of milliseconds on the raster thread while it waits for the next available drawable. Apple's documentation [recommends requesting the drawable "as late as possible"](https://developer.apple.com/library/archive/documentation/3DDrawing/Conceptual/MTLBestPracticesGuide/Drawables.html) to avoid having this wait stall rendering workloads:

> Always acquire a drawable as late as possible; preferably, immediately before encoding an on-screen render pass. A frame’s CPU work may include dynamic data updates and off-screen render passes that you can perform before acquiring a drawable.

While the UI thread workload is performed before drawable acquisition, the engine workload is not. This can reduce the available rendering time from 8ms down to less than 4ms on high frame rate (120 Hz) iOS devices.

![image](https://github.com/flutter/flutter/assets/8975114/8d53c3bd-25f4-41f3-be61-a919217ff28f)

Pictured above: On a 120 Hz iPhone we have 8.33 ms to complete rendering (CPU and GPU workloads) of a frame and submit the drawable. According to Metal system traces from [#134959](https://github.com/flutter/flutter/issues/134959), this can take 20-25ms. Assuming constantly produced frames and in-order cycling of drawables, this implies that frame 4 will need to wait ~4ms for drawable acquisition before it can do any work. Oftentimes this doesn't result in dropped frames, as we can usually finish the frame workload in under a millisecond for simple apps. But that isn't always true, and the persistently high raster times also worry our user base.
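To make the budget arithmetic concrete, here is a minimal sketch. The function names are hypothetical and the 4.5 ms wait used in the note below is an illustrative value, not a measurement.

```cpp
#include <cassert>

// Total time available per frame, in milliseconds, at a given refresh rate.
constexpr double FrameBudgetMs(double refresh_hz) {
  return 1000.0 / refresh_hz;
}

// Time left for raster work after the raster thread has blocked on drawable
// acquisition for `drawable_wait_ms` at the start of the frame.
constexpr double RemainingBudgetMs(double refresh_hz, double drawable_wait_ms) {
  return FrameBudgetMs(refresh_hz) - drawable_wait_ms;
}
```

At 120 Hz the budget is 1000 / 120, or about 8.33 ms. A hypothetical 4.5 ms wait would leave about 3.83 ms for the engine workload.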

![image](https://github.com/flutter/flutter/assets/8975114/89b6f674-446d-4f91-80c8-b84e27aa9301)

Pictured above: Swiping through the Wonderous application. GPU times are < 2ms and CPU times are under a millisecond. But frames take 8.33 ms because the drawable isn't available until 1ms before the end of the frame.

## Onscreen Texture Usage

Flutter prefers to use the swapchain image as the surface for onscreen drawing. A simple Flutter app with limited compositing will often render everything to the swapchain image without any offscreen textures. Compared to offscreen rendering, this reduces memory usage. That matters because the size of a fullscreen texture on an iPhone 13 Pro is (2556 * 1179 * 32 bits), or 12-24 MB with Wide Gamut, and maybe back to ~(6-12 MB) if the device supports texture compression. The memory used by drawable textures is NOT attributed to the Flutter application.
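The texture size figures can be checked with a few lines of arithmetic. This is a sketch: `TextureBytes` is a hypothetical helper, and 64 bits per pixel is my assumption for the upper end of the Wide Gamut range.

```cpp
#include <cassert>
#include <cstdint>

// Uncompressed size in bytes of a width x height texture.
constexpr std::uint64_t TextureBytes(std::uint64_t width,
                                     std::uint64_t height,
                                     std::uint64_t bits_per_pixel) {
  return width * height * bits_per_pixel / 8;
}
```

`TextureBytes(2556, 1179, 32)` is 12,054,096 bytes (about 12 MB), and `TextureBytes(2556, 1179, 64)` is 24,108,192 bytes (about 24 MB).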

## Partial Repaint

"Partial repaint/dirty region management" allows the Flutter engine to reduce the engine/GPU workload when rendering frames that are similar to previous frames. The typical example would be something like a blinking cursor: the engine computes a damage rect and then is able to avoid rendering everything outside of this damage rect, which dramatically reduces CPU/GPU usage. Partial repaint requires the engine to know the id of the drawable resource that will be used before the frame starts in order to compute the damage rect. While there are only three drawables, they are not necessarily cycled in order. This means that partial repaint actually requires that we acquire the drawable as early as possible (from the perspective of the raster thread). 

Partial repaint/dirty region management was originally filed in https://github.com/flutter/flutter/issues/33939 . I'm not sure if there is a separate design document, but all of the discussion is in that issue.

### Why do we need the drawable ID?

Partial repaint is frequently thought of as a difference between the current frame and the previous frame. But this is wrong: it's the difference between the current frame and the contents of the swapchain image we are given. On iOS, this may be any of the previous drawables.

![image](https://github.com/flutter/flutter/assets/8975114/a9d2798d-11d0-4a86-9299-e492dc08f813)

Pictured Above: an application renders a red triangle, then a blue circle, then a green square. Finally, on frame 4 we will render a red triangle again. If the system gives us Drawable A, then we will compute the diff and realize we don't have to do any work! But if we are given B or C, then we still have to re-render. If we don't know the diff before we start rendering, then we can't actually save any work.
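One way to see why the drawable ID is needed up front is to track, per drawable, how stale its contents are. The sketch below is not the engine's implementation. `Rect`, `Union`, and `DamageTracker` are hypothetical names, and damage is simplified to a single bounding box.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

struct Rect {
  int left, top, right, bottom;
  bool IsEmpty() const { return right <= left || bottom <= top; }
  int Area() const { return IsEmpty() ? 0 : (right - left) * (bottom - top); }
};

// Bounding-box union; an empty rect contributes nothing.
Rect Union(const Rect& a, const Rect& b) {
  if (a.IsEmpty()) return b;
  if (b.IsEmpty()) return a;
  return Rect{std::min(a.left, b.left), std::min(a.top, b.top),
              std::max(a.right, b.right), std::max(a.bottom, b.bottom)};
}

class DamageTracker {
 public:
  // `frame_damage` is what changed between the previous frame and this one.
  // Returns the region that must be repainted in `drawable_id` so that it
  // ends up showing the current frame.
  Rect BeginFrame(std::uintptr_t drawable_id,
                  const Rect& frame_damage,
                  const Rect& full_screen) {
    // Every drawable we have seen falls further behind by this frame's damage.
    for (auto& entry : stale_) {
      entry.second = Union(entry.second, frame_damage);
    }
    // A drawable we have never rendered into needs a full repaint.
    auto it = stale_.find(drawable_id);
    Rect repaint = (it == stale_.end()) ? full_screen : it->second;
    // After this frame, the chosen drawable is fully up to date.
    stale_[drawable_id] = Rect{};
    return repaint;
  }

 private:
  std::map<std::uintptr_t, Rect> stale_;  // Accumulated damage per drawable.
};
```

The repaint region depends on which drawable is passed in, so it cannot be computed until the drawable has been acquired. That is the tension with deferring acquisition.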

### How Important is Partial Repaint?

For the canonical example of the blinking cursor, partial repaint drops CPU usage from 40% down to 15% on iOS. The substantial cost is due to the fade in/out that ticks every single frame instead of once every 100-something ms. See also: https://github.com/flutter/flutter/issues/124526#issuecomment-1512229903 . For other kinds of interaction, like scrolling and page navigation, partial repaint does almost nothing.

It's worth noting that we do not have any other ideas in the pipeline for improving blinking cursor performance without partial repaint enabled. So this change would, in the near term, regress blinking cursor across the board with no amelioration.

# Overview

The structural changes required to defer drawable acquisition are fairly straightforward:

  * impeller::SurfaceMTL should be constructed from a CAMetalLayer and not a Drawable/Texture. The layer already contains the pixel format and size information that is necessary to initialize the render pass descriptors.
  * A special purpose TextureMTL object is created that wraps the CAMetalLayer without acquiring the drawable, exposing a bit that tells the backend it is drawable backed.
  * When the metal texture is requested, the special texture object requests the drawable.
  * RenderPassMTL and BlitPassMTL check for the special drawable bit when configuring the MTLRenderPass/BlitPass descriptors.
  * Rather than set the descriptor eagerly, the engine waits until it is encoding the command buffer (the last possible moment).
  * SubmitCommandsAsync checks if the render target texture is the special drawable texture, and only waits on the drawable in the worker thread task.
  * The worker thread task pool is reduced to a single thread to enforce ordering. 
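
The core of the list above is a texture wrapper that does not touch the drawable until someone asks for it. Here is a minimal sketch in plain C++ with stand-in types. `Drawable`, `DeferredDrawableTexture`, and `AcquireFn` are hypothetical and are not the Impeller classes. In the real engine the callback would presumably wrap `-[CAMetalLayer nextDrawable]`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Stand-in for a swapchain image; not a real Metal type.
struct Drawable {
  int id;
};

class DeferredDrawableTexture {
 public:
  using AcquireFn = std::function<std::shared_ptr<Drawable>()>;

  explicit DeferredDrawableTexture(AcquireFn acquire)
      : acquire_(std::move(acquire)) {}

  // The "bit" that tells the backend this texture is drawable backed.
  bool IsDrawableBacked() const { return true; }

  bool HasDrawable() const { return drawable_ != nullptr; }

  // Acquires the drawable on first use and reuses it afterwards. Returns
  // nullptr if acquisition fails; a failure is not cached, so a later call
  // tries again.
  std::shared_ptr<Drawable> GetDrawable() {
    if (!drawable_ && acquire_) {
      drawable_ = acquire_();
    }
    return drawable_;
  }

 private:
  AcquireFn acquire_;
  std::shared_ptr<Drawable> drawable_;
};
```

Constructing the wrapper costs nothing. The potentially blocking call happens only when `GetDrawable()` is first invoked, which in this design would be while encoding the command buffer or inside the worker thread task.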

I've built a prototype implementation here: https://github.com/flutter/engine/pull/47976 

![image](https://github.com/flutter/flutter/assets/8975114/f2017656-98b2-4c58-9e52-cb464956a22a)


# Open questions

  * How do we handle error states like failure to acquire a drawable?
  * Can we afford to lose partial repaint?  Is there another path to blinking cursor improvements?
  * Does this improve things for 16.6? Are we vulnerable to another change like 16.6 again?
  * With the removal of partial repaint from Impeller on iOS, the only production backend shipping with it is the legacy iOS Skia renderer. Should we remove it altogether?
    * It was disabled on Skia/Android due to rendering bugs with certain vendors.
    * https://github.com/flutter/flutter/issues/113314 
    * https://github.com/flutter/flutter/issues/105093 
    * Removing partial repaint on Android saved ~1ms per frame on some benchmarks.

  * Do we need to also do this for macOS?
    * macOS drawable acquisition is abstracted away from the embedder, meaning that we can't actually defer it. So we could leave partial repaint running on macOS (though I don't think anyone has wired it up yet).

# Testing plan

Changing how we acquire and present drawables is risky, and unit tests won't be sufficient to cover the change in behavior. Aside from devicelab integration tests and manual tests of some common scenarios, including fast and slow apps, we have a few other tools for reducing risk.

We could land the new drawable acquisition path behind a Plist flag, and then ask certain members of the community to test it. This could be done by opting in during one of the next betas, and then leaving an opt out we can easily cherry pick if there are problems.

I would not ask g3 customers to opt in or opt out. Instead, I would coordinate with the g3 rollers to change the Plist configuration like we've done for Impeller/Wide Gamut already. Then, after a release with no known issues, I would remove the configuration. It's unclear if this should be opt in or opt out, but if it's opt in then we'll need to adjust the devicelab to get coverage.

# Alternatives Considered

  * We can keep partial repaint and remove drawable blocking by unconditionally rendering to an offscreen texture, then blitting this texture to the onscreen texture in a worker task. This has several costs and one mitigating factor:
    * Small additional overhead of a per-frame blit (hundreds of microseconds).
    * Large additional memory overhead of a fullscreen offscreen texture (12-24 MB).
    * No additional memory overhead if there is a backdrop filter present, as it means that we can re-enable the "Read from resolve" optimization.
  * Could we do some sort of dynamic switching?


