
mapAsync(WRITE) and zero-filling #2926


Closed
wants to merge 1 commit into from

Conversation

litherum
Contributor

@litherum litherum commented May 22, 2022

This is required to make UMA and non-UMA devices behave portably.

It would be unfortunate if we had to run memset() on the CPU for triply-mapped buffers on UMA (GPU Process, web process, and GPU). The only way to offload the zero-filling work to the GPU is to do it in mapAsync(), and if we do it in mapAsync() then it has to clear the whole mapped region (not just the regions getMappedRange() is called on).

(I also didn't see anything in the definition of getMappedRange() about zero-filling so I'm assuming this is just an oversight.)
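
For concreteness, a minimal TypeScript sketch of the two ranges involved (purely illustrative, assuming a GPUDevice named device is already in scope):

// The mapped region is the whole 1024 bytes requested in mapAsync()...
const buffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});
await buffer.mapAsync(GPUMapMode.WRITE, 0, 1024);

// ...while getMappedRange() may be called on just a sub-range. Under this
// proposal, the entire 1024-byte mapped region would be cleared during
// mapAsync(), not just this 256-byte window.
const view = new Uint8Array(buffer.getMappedRange(256, 256));
view.fill(0xff);
buffer.unmap();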


Preview | Diff

This is required to make UMA and non-UMA devices behave portably.
@litherum litherum requested review from Kangz, kainino0x and toji May 22, 2022 03:02
@litherum litherum changed the title mapAsync(WRITE) needs to zero-fill the relevant region of the buffer mapAsync(WRITE) zero-fills the relevant region of the buffer May 22, 2022
@github-actions
Contributor

Previews, as seen when this build job started (ae7350e):
WebGPU | IDL
WGSL
Explainer

@litherum
Contributor Author

I just realized that https://github.com/gpuweb/cts/blob/main/src/webgpu/api/operation/buffers/map.spec.ts#L318 explicitly populates a buffer, maps the buffer for WRITE, then reads the data from the WRITE mapping.

This is pretty unfortunate because it means that, on non-UMA systems, mapping for WRITE has to actually download the data from the device, just for it to be clobbered by JavaScript.
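
For reference, a condensed sketch of the pattern that test exercises (this is not the test's actual code; device is assumed to be in scope):

// Populate a MAP_WRITE buffer via mappedAtCreation, then remap it for WRITE.
const buffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Uint32Array(buffer.getMappedRange()).set([1, 2, 3, 4]);
buffer.unmap();

await buffer.mapAsync(GPUMapMode.WRITE);
// Per the current spec, this read must observe [1, 2, 3, 4] -- which is what
// forces a download (or a shadow copy) on non-UMA implementations.
const contents = new Uint32Array(buffer.getMappedRange()).slice();
buffer.unmap();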

@greggman
Contributor

Can you just map any memory as long as that memory came from the same page?

In other words, for mapAsync(WRITE, offset, size): when mode === WRITE, it doesn't matter what's in the ArrayBuffer that will eventually be returned by getMappedRange, as long as the data in that ArrayBuffer comes from the page itself.

So, you could keep a cache of ArrayBuffers behind the scenes, and when mapAsync is called with mode === WRITE, return any unused ArrayBuffer from that cache that is large enough. It only contains data the user put into it during previous mapAsync calls, so it's their own data.

Is that a solution, or is it scary because it could mean different behavior across implementations if they implement their caches differently?
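
A rough implementation-side sketch of such a cache, purely for illustration (the class and method names here are invented; no browser is claimed to work this way):

class MappingCache {
  private free: ArrayBuffer[] = [];

  // Return a cached ArrayBuffer that is large enough, or allocate a new one.
  // Its contents are only whatever the page itself wrote during earlier
  // mappings, so nothing cross-origin or process-internal is exposed.
  acquire(size: number): ArrayBuffer {
    const i = this.free.findIndex((ab) => ab.byteLength >= size);
    if (i !== -1) {
      return this.free.splice(i, 1)[0];
    }
    return new ArrayBuffer(size);
  }

  // Called on unmap(), after the contents have been flushed to the GPU.
  release(ab: ArrayBuffer): void {
    this.free.push(ab);
  }
}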

@litherum
Contributor Author

litherum commented May 23, 2022

Oh, maybe this is why this validation rule exists during buffer creation:

If descriptor.usage contains MAP_WRITE:
descriptor.usage contains no other flags except COPY_SRC.

I guess the idea is that, if you can guarantee that the only way data gets into the buffer is via buffer mapping (or writeBuffer()), then the web process can just keep a shadow ArrayBuffer around, holding the contents that were last uploaded into the buffer.

That seems like a pretty high penalty, though, just to enforce portability between UMA and non-UMA devices. Zero-filling would also achieve that portability without requiring that MAP_WRITE-able buffers can't be used for anything else.
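
For concreteness, here is how that rule looks from application code (a minimal sketch, assuming device is in scope):

// Valid: MAP_WRITE may only be combined with COPY_SRC.
const staging = device.createBuffer({
  size: 4096,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});

// Generates a validation error: MAP_WRITE cannot be combined with STORAGE.
const invalid = device.createBuffer({
  size: 4096,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});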

@Kangz
Contributor

Kangz commented May 23, 2022

Oh, maybe this is why this validation rule exists during buffer creation:

Exactly, and the intent is that a shadow copy is kept, as you guessed. There is a single way for data to get into a MAP_WRITE buffer, which is filling it from JavaScript (either after mappedAtCreation: true or after mapAsync). writeBuffer requires COPY_DST, so it is not possible to use it on MAP_WRITE buffers.

If we were to fill the buffers with zeroes on mapping, then we would essentially perform an extra half copy for uploads (a memset to 0 costs roughly half a memcpy in memory traffic). That copy would happen on the CPU in the web process when buffers aren't triply-mapped (which should be the common case; triple mapping will stress the OS and is not available everywhere anyway), so it would be quite expensive.

Note that in the short proposal for UMA mapping, MAP_WRITE buffers can be used with any readonly usages, which is somewhat generous. We could also add support for MAP_READ|MAP_WRITE buffers that can support all usages on UMA, but when buffers aren't triply-mapped, then you'd have one extra copy (GPU->Web for mapping for writing, Web->GPU when mapping for reading).

@greggman

This comment was marked as outdated.

@greggman

This comment was marked as outdated.

@kdashg
Contributor

kdashg commented Sep 28, 2022

Presumably, when people map, they want to overwrite data. Zeroing that data would let us avoid keeping a shadow copy around, but it would be kinda pessimal for large uploads like images or other large buffers: zero them first, and then, only after zeroing, overwrite (almost?) all of the buffer again with the data to upload.

If an author is fine with zero-filled data, I believe that's what mappedAtCreation:true does during createBuffer, so they might have the opportunity to choose that tradeoff, if using a new buffer object as a one-shot upload is an option.
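
For reference, a minimal sketch of that mappedAtCreation path (assuming device and a Uint8Array named data are in scope, with data.byteLength a multiple of 4 as mappedAtCreation requires):

// A fresh buffer whose mapping starts zero-filled, used as a one-shot upload.
const upload = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Uint8Array(upload.getMappedRange()).set(data); // overwrite the zeroes
upload.unmap();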

We can also talk about behavior of getMappedRange if doing zero-filling there would be useful instead. Having to zero-fill the whole mapAsync range feels heavy.

@kainino0x kainino0x added this to the V1.0 milestone Sep 28, 2022
@kainino0x
Contributor

kainino0x commented Sep 28, 2022

For background for tomorrow's discussion: the original premise of this PR was that getMappedRange was zeroing instead of mapAsync - and that it should be moved to mapAsync. However, this was incorrect, as no zeroing occurs in either place. Mapping for WRITE gives you the current data in the buffer, which is usually expected to be kept in a shadow copy on non-triple-mapping systems.

So the tradeoff to discuss here, I think, is to add the cost of zeroing on all systems, and remove the shadow copy memory cost on non-triple-mapping systems.

@Kangz
Contributor

Kangz commented Sep 28, 2022

Note that browsers could remove shadow copies in GPU-renderer process shmem if they have memory pressure, and recreate the shmem the next time the buffer is used. When that happens a memcpy is needed instead of memset 0, so it's more expensive.

We could also have a GPUMapMode.ZERO that says the mapping for writing should be filled with zeroes; using it consistently would mean the shadow copy never needs to be created. (But that's kind of a new feature, so polish-post-V1?)
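
A rough sketch of how such a flag might look to applications; GPUMapMode.ZERO does not exist in the spec, so the constant below is a stand-in, and buffer/newData are assumed to be in scope:

declare const GPUMapModeZero: number; // hypothetical stand-in for GPUMapMode.ZERO

// The returned mapping would be all zeroes, so the implementation never needs
// to preserve (or shadow-copy) the previous contents of the buffer.
await buffer.mapAsync(GPUMapMode.WRITE | GPUMapModeZero);
new Uint8Array(buffer.getMappedRange()).set(newData);
buffer.unmap();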

@litherum
Contributor Author

litherum commented Sep 28, 2022

Sorry, I meant to write this last week but got sidetracked. This issue is incorrectly titled, and muddled in what it’s asking for.

Current state

Right now, 2 things are required of a conformant implementation:

  1. If descriptor.usage contains MAP_WRITE: descriptor.usage contains no other flags except COPY_SRC.
  2. The ArrayBuffer the JS receives for MAP_WRITE calls comes prepopulated with the contents of the buffer. So, if an application maps for writing, it can then (erroneously?) read the buffer it receives. This requirement is (ostensibly) for portability - the contents of the ArrayBuffer have to be well-defined.

On a discrete GPU, implementations would want to avoid downloading the data in the buffer to service MAP_WRITE calls, because the common case, when an application is using MAP_WRITE correctly, is to never read the contents of the ArrayBuffer. So, the intention here is for the implementation to keep a CPU-side buffer alive that mirrors the contents of the GPU buffer. Because of (1) above, this ‘mirror’ is possible to maintain, because the only way changes can appear in the buffer is via MAP_WRITE calls. It’s a one-directional data flow.

So, the implementation of MAP_WRITE is intended to be:
a. Every MAP_WRITE-able buffer has a CPU-side shadow allocation, with (at least) the same lifetime as the buffer
b. MAP_WRITE delivers pointers to this shadow allocation to the JS
c. JS can read or write the shadow allocation
d. When unmapping, WebGPU performs a copy from the shadow allocation to the backing GPU buffer

Analysis

Both of the requirements listed above are unfortunate:
A. The fact that descriptor.usage must contain no other flags except COPY_SRC means that these mappable buffers are useless by themselves for any non-trivial use-case. Any application that wants to upload data from the CPU and then use it in a shader must do a double allocation - one for the MAP_WRITE-able buffer, and one for the buffer that will actually be used. The fact that the application is forced to do a double allocation means that they must pessimize; the WebGPU implementation can’t elide one of the buffers, because the source program explicitly tells us to create them.
B. But it’s worse than that, because the ArrayBuffer comes prepopulated with the contents of the buffer ahead of time, thereby requiring either i) an expensive download just in case the application might look at the contents of the buffer, or ii) a third buffer-sized memory allocation.

On mobile devices, it’s incredibly wasteful to allocate 3x as much memory as is necessary.
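
For concreteness, a minimal sketch of the double allocation described in (A), assuming device, a size that is a multiple of 4, and a Uint8Array named data are in scope:

// Buffer 1: the MAP_WRITE-able staging buffer.
const staging = device.createBuffer({
  size,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});
// Buffer 2: the buffer the shaders actually use.
const storage = device.createBuffer({
  size,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.STORAGE,
});

await staging.mapAsync(GPUMapMode.WRITE);
new Uint8Array(staging.getMappedRange()).set(data);
staging.unmap();

// The application must also schedule the copy between the two allocations.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, storage, 0, size);
device.queue.submit([encoder.finish()]);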

Proposal

It doesn't have to be this way, though! The contents of the ArrayBuffer delivered to JS just have to be well-defined; they don't have to be the current contents of the buffer. So, an alternative is for the contents of these JS-exposed buffers to be zeroed out.

The downside to this is that, yes, the mapping operation would have to zero-fill some memory (or make a system call to get pre-zeroed memory). This can be somewhat mitigated by having the GPU zero-fill the memory, rather than the CPU literally calling memset(). This requires that the zeroing happens as part of the task scheduled by mapAsync(), rather than inside getMappedRange().

The upside, however, is a 33% (or possibly 66%) memory reduction for common usage patterns. By zeroing out the data, we don't need to have a shadow allocation; we just need to have either the GPU or the CPU zero-fill the mappable destination buffer directly. For a triply-mapped buffer (which we expect to be every MAP_WRITE-able buffer, at least on Apple Silicon, probably Intel devices, and maybe even on all devices), the ArrayBuffers exposed to JS can point directly into the real destination buffer. And for non-triply-mapped buffers, it's still better than what we have today, because the shadow buffers could be temporary and transient, rather than being required to live for at least the lifetime of the GPUBuffer they're shadowing.

But it gets better, because, if we don’t need to maintain shadow buffers, we can open up write access to MAP_WRITE-able buffers to shaders. We can remove the restriction that MAP_WRITE-able buffers cannot be used as storage buffers, etc. Concurrent accesses would be naturally forbidden using our existing buffer state tracking infrastructure - ownership of the contents of a buffer would be transferred from GPU to CPU via map()/unmap() calls, and access by the non-owning side can be statically detected and averted.

On a discrete GPU, an application might not want to use the MAP_WRITE-able buffer directly in their shaders. There are a few possible ways of allowing this to happen:

  • Just adding a note in the spec saying “Performance of GPU operations on MAP_WRITE-able buffers may be suboptimal”
  • By exposing a “you might want your shaders to operate on non-MAP_WRITE-able buffers for performance” bit
  • By adding a bake() call, which wouldn't need to do anything on UMA but on discrete cards would move the data out of the CPU-visible memory region
  • Maybe other options

If we did something like this, then instead of allocating 3x as much memory as is necessary, we’d only be allocating 1x as much memory as is necessary. And, not only that, but it becomes way easier for authors to write WebGPU applications - they don’t have to write code to make twice as many buffers as they need and shuffle data between them.

@kainino0x
Contributor

Very useful post, thanks. I'm confused about one thing. You say zeroing would benefit triple-mapping implementations because the clearing could be done on the GPU. But wouldn't keeping the original contents be even better? Then triple-mapping implementations don't have to clear at all.

That said, I think in theory, zeroing could still be better, if there is memory compression for zeroed pages. Then you could avoid flushing all those zeroes out to main memory (and loading them back in if they get read, regardless of whether by the GPU or the CPU). I have no idea whether anything like this is possible though.

@litherum
Contributor Author

litherum commented Sep 29, 2022

Right, that’s a good point. On a UMA system, the triply-mapped buffer gets the data for free.

I’m imagining a world like this:

  1. On a UMA device, zero-filling occurs to try to maintain portability with non-UMA systems. This is a pessimization.
  2. Un-pessimizing would be part of a larger extension for the future, focused solely on UMA.
  3. Even without the extension, discrete cards wouldn’t have to have the shadow copy
  4. Authors wouldn’t be required to use staging resources, on any card

Pros:

  1. Memory use on a discrete card could go down by 66% ({shadow buffer, staging buffer, useful buffer} => {useful buffer})
  2. Memory use on a UMA card could go down by 50% ({staging buffer, useful buffer} => {useful buffer})
  3. Authors don’t have to schedule copies from the staging buffer to the useful buffer

Cons:

  1. On a triply mapped buffer, every MAP_WRITE requires a GPU blit (which isn’t necessary today)
  2. On a non-triply mapped buffer, every MAP_WRITE requires either an mmap(MAP_ANONYMOUS) (which returns zero-filled memory), or a memset().

I think the pros outweigh the cons.

@kainino0x
Contributor

Ah, got it, what I got confused about was triple-mapping on non-UMA.

@kdashg
Contributor

kdashg commented Oct 6, 2022

GPU Web 2022-10-05
  • CW: discussed offline, think we shouldn't change anything, but we'll discuss with Myles in the room.

@austinEng
Contributor

There are two aspects I see to the proposal:

  1. allow usages other than COPY_SRC with MAP_WRITE
  2. zero-fill the contents of the buffer in a task scheduled by mapAsync. Then, when you call getMappedRange(), you see zeros instead of the buffer contents

We don’t think we should pursue either of these changes and should keep the current spec as-is. We could improve memory usage and reduce copies in a UMA extension.

Part 1:

Allowing usages other than COPY_SRC with MAP_WRITE is just like issue #2388. While it would be great to reduce the number of copies on unified memory architectures, lifting this restriction should be done in a fuller proposal for a UMA extension. If done poorly, lifting the usage rules would make it easy for developers to use an extremely inefficient path on discrete GPUs.

Myles had some good ideas here - maybe a bake() step so that flushing updates is explicit. Regardless though - we need a fully fleshed out proposal for how this (or something else) will work that will allow developers to get good behavior on both UMA and discrete systems without performance pitfalls.

Part 2:

I'm not convinced by the purported memory savings of returning a zero-filled mapping. As discussed in today's meeting, at least Safari/Chrome expect to be able to use “triple mapping” everywhere. So, there should never be a need for a CPU-side shadow allocation. On all platforms, a MAP_WRITE|COPY_SRC buffer is backed by a single buffer allocation that is both visible to the web page and accessible to the GPU.
Yes, without a UMA extension, you still need to copy from this staging buffer into the “useful buffer” but there is no third “shadow buffer”.

The remaining memory savings argued in Myles’ proposal come from the fact that since the contents of the mapping are zero-filled, the implementation is free to make their storage temporary and transient. While this is true, this is exactly the same memory savings the application would get if they explicitly destroyed their staging buffer and recreated it again when they needed to. It is better to give control of allocations to the application so they can manage it themselves.

The current design of buffer mapping is clear that device.createBuffer handles allocation, and buffer.mapAsync mediates access to the memory. If the implementation is freeing and allocating buffers under the hood, then mapAsync will need to handle out-of-memory errors as well. I don’t think it’s good to automatically manage this memory when there are existing mechanisms for the application to robustly do so.

So, given we don't see memory savings from zero-filling, we don't think it's worth doing. It also has a few downsides.

  • It complicates the error handling mechanisms without sufficient benefit
  • Will still need to pay the cost of the memcpy/GPU-clear
  • It makes it more difficult for an application to do sparse writes to the buffer. Mapping would clear out the entire range A-B-C, so the application needs to write the entire range again even if it only wants to modify parts A and C (see the sketch below)
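
A minimal sketch of that sparse-write pattern under the current spec (buffer is assumed to be mappable for writing, and the offsets, sizes, and partA/partC arrays are placeholders that satisfy getMappedRange's alignment rules):

// Map once, touch only regions A and C; region B keeps its previous contents.
// Zero-filling in mapAsync would wipe B, forcing the application to rewrite it.
await buffer.mapAsync(GPUMapMode.WRITE);
new Uint8Array(buffer.getMappedRange(offsetA, sizeA)).set(partA);
new Uint8Array(buffer.getMappedRange(offsetC, sizeC)).set(partC);
buffer.unmap();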

I do think zero-filling could be part of the future UMA extension - perhaps adding a GPUMapMode.ZERO. Zero-filling will be valuable in some situations on a discrete GPU. It would allow the implementation to avoid reading back data into the mappable staging buffer. This could happen in situations where the buffer has MAP_WRITE|STORAGE usage, and the GPU modifies the contents in a shader. In the current spec, mapping the buffer after storage writes means those writes should be visible in the mapping. There are options here, but I think they should be left to the UMA extension. Perhaps:

  • Just pay the cost, with a note in the spec.
  • Validate you use GPUMapMode.ZERO
  • Have explicit methods to enqueue the readback if the mapping is GPU-writable. The readback does nothing on UMA. Validate that you have requested to read back all the ranges you try to map.
  • Something else

@greggman
Contributor

I just want to clarify my understanding based on @austinEng 's comments above.

It sounds like a buffer created with

buf = device.createBuffer({usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC, size});

is not intended to be a GPU buffer at all. It's entirely a single allocation, ideally one that both the CPU and the GPU process can see. Those two flags make it a CPU->GPU staging buffer, and that's all it's useful for.

If an implementation can do some optimization where it's actually allocated in memory the GPU can see, that's great, but without that optimization it's a single CPU RAM buffer, ideally in shared memory between the CPU and GPU process. It's possibly not a Metal/Vulkan/D3D buffer at all.

Is that correct, or am I misunderstanding?

@greggman
Contributor

greggman commented Oct 13, 2022

Sorry if that last post didn't make any sense.

In my mind, given the restrictions of MAP_WRITE | COPY_SRC, I thought one valid implementation would be something like:

class GPUBufferImpl {
 public:
  virtual ~GPUBufferImpl() = default;
};

// Backed by a real native (Metal/Vulkan/D3D) buffer.
class RealGPUBuffer : public GPUBufferImpl {
  APISpecificBuffer mNativeBuffer;
};

// MAP_WRITE | COPY_SRC: backed only by CPU-visible shared memory.
class CPUOnlyBuffer : public GPUBufferImpl {
  void* mSharedMemoryBytes;
};

class GPUBuffer {
 public:
  explicit GPUBuffer(GPUBufferImpl* impl) : mImpl(impl) {}
  GPUBufferImpl* mImpl;
};

class GPUDevice {
 public:
  GPUBuffer* createBuffer(const GPUBufferDescriptor& desc) {
    if (desc.usage == (MAP_WRITE | COPY_SRC)) {
      // No native GPU buffer at all; just a chunk of shared memory.
      return new GPUBuffer(new CPUOnlyBuffer());
    } else {
      return new GPUBuffer(new RealGPUBuffer());
    }
  }
};

And that a MAP_WRITE | COPY_SRC buffer is just an obfuscated way of accessing a chunk of shared memory between the page and the GPU process. Whether that chunk of memory also happens to represent a native GPU buffer is an implementation detail. Behind the scenes, device.copyBufferToBuffer can do whatever it needs to copy the data into some other buffer.

I guess this is wrong though.

@austinEng
Contributor

I believe your understanding is correct. And, you're right that the implementation may implement this buffer as a CPU-only shared memory buffer. Whether or not it is GPU-accessible is an implementation detail. However, we expect to be able to share the shared memory directly with the GPU everywhere. This is what we've been referring to in this issue as "triple mapping".

@kdashg
Contributor

kdashg commented Oct 14, 2022

GPU Web Meeting 2022-10-12/13 APAC-timed
  • MM: socialized this. Not much discussion. Socialize it again?
  • JB: I could use a recap
  • MM: talking about map-write. Spec says: when you map-write, (considering it an atomic process), you get ArrayBuffer, write into it, unmap. Then map it again. When you see the ArrayBuffer you see the contents you saw before.
  • One impl strategy: second map-write downloads the data from the GPU.
  • Other way, more realistic: elsewhere in spec, any buffer that's map-writeable can't be used for any other usage, other than transfer-src. So, only situation that could write into the buffer is with a map-write. So all the impl does is keep the arraybuffer around in the web process. Second map-write returns the same AB. Nothing else could have changed that data.
  • MM: unfortunate for 2 reasons. First, this involves double- or triple-allocations. Obvious extra alloc: map-writeable buffer can't use for any other usage, it's useless for anything else. They may want to bind it as a storage buffer. In order to use map-writeable buffer, you have to have a second buffer, and copy back and forth. Double allocation
  • MM: then also a shadow buffer. First mapAsync, you keep the AB around on the CPU. That's another copy of the buffer. Lifetime of the shadow copy is at least as long as the buffer. Not transitory.
  • MM: on mobile, this is a thumbs-down.
  • MM: doesn't have to be this way. I'm proposing - relax the constraint that the second map-write sees the results of the first one. App doesn't intend to actually read the data. If we zero-fill the buffer, that should be fine from POV of app. Don't need the shadow copy - can just clear it. Can remove the other spec section where map-writeable buffers can't have other usages. Map-writeable buffers become useful. Removes need for second copy.
  • MM: downside - need to zero-fill upon map-write. If GPU does the blit, can use blitting hardware. Can arrange it so GPU can do the clearing, should be fast. It is a tradeoff.
  • MM: it is a tradeoff.
  • KG: do you expect to be able to use the GPU for the clear?
  • MM: yes. Considering the map call separately - map does the zero-fill on the content timeline - that'll work. Only way it won't work is that the second call does it. That would have to be on the CPU.
  • JB: that'd be a big memset before the GPU communication will be faster. Does anyone do GPU fills to get zero-filled CPU buffers?
  • KG: sounds unusual. Memory bandwidth is the highest cost. Depending on shmem caching properties - might not have a choice.
  • MM: on our impl - we expect triply-mapped buffers to be the norm, from the GPU process to the web process.
  • KG: interesting, we don't expect to do that ourselves.
  • AE: two parts we see to this. One, as myles said - allowing other usages and copy-src with map-write. Second is zero-filling bit.
  • AE: for first part - i don't think we support just lifting this usage restriction. Similar to #2388. Would be great to reduce # copies in UMA - just lifting this won't perform well for discrete GPUs. Have a CPU-writeable thing readable from GPU - will have lot of memory traffic every time CPU writes. Myles had ideas about flushing of mapped writes - but think we need a more full proposal for UMA to support lifting the usage restriction.
  • AE: for second part, about zero-filling - we also expect to do this triple-mapping concept. Have shmem, shared with GPU process, also visible to GPU hardware. In this scenario, not convinced by memory savings / saved allocations. In this world - there's no CPU-only shadow buffer. You have one buffer. No double-allocation there.
  • AE: returning 0 in the mapping does let the impl delete the buffer whenever it wants - can be transient, don't need to preserve contents. But - this is pretty much the same as the web app destroying buffers themselves. I think this is strictly better - makes the app handle allocation explicitly . Handle OOM explicitly themselves. If we're creating buffers for you during mapAsync - will complicate the error handling mechanisms. There are already ways to do this in the current API that are working pretty well.
  • KG: other Q - how different is this from impl of writeBuffer? Once you eventually say we'll discard the contents & replace them - we already have that. Theoretically you could write it into your triply mapped buffer, just memcpy it in.
  • MM: what's the story we tell to web developers? Start using writeBuffer, and if it doesn't work for you because of perf, you can go to lower-level map primitive yourself. Round-robin your buffers.
  • MM: confusing - for someone who starts using writeBuffer, and mapping has a huge memory cost, so we go back to writeBuffer?
  • KG: what are you asking for here that's not satisfied by writeBuffer? Just that we have an indication that these buffers are triply-mapped?
  • MM: right now, there's a subsystem in the WebGPU with pretty atrocious memory characteristics, and I want to avoid this.
  • KG: I think the opposition is, the shadow copies not that expensive … and can be managed by app
  • MM: my proposal - for triply mapped, you'd get 50% reduction. Discrete GPU case where not triply mapped, 66% memory reduction.
  • AE: we plan to always triply map. 50% reduction, and can not get that if we assume UMA. We don't think we should simply remove the restriction, but have a full UMA proposal.
  • KR: The characterization of atrocious memory consumption is an implementation problem and not mandated by the spec. If we put in the zero-filling behavior then we’re going to make things worse for triple mapping. No mandate to keep a CPU shadow copy right now. Think any reasonable implementation should [...]. If we adopt zero-filling now we’ll constrain what the API can do in the future backward-compatibly.
  • MM: can you describe the future?
  • KR: Through clever passing of GPU handles between processes we can directly, safely access memory of an actual buffer from JavaScript. Maybe allow some racing but fundamentally seeing the real mapping in JS. Maybe only works in UMA. Maybe works with discrete on some systems. May require more OS primitives.
  • AE: Exactly right. Safari planning to do this and we’re planning as well. Windows APIs exist for it.
  • KR: If we mandate zero-filling it eliminates the possibility of doing this. Similar constraints in the WebGL API - can’t transform feedback to an index buffer. Here, in this one area, we don’t allow GPU writes to these buffers now, want to in the future. Internally we discussed options like “either zero or old data” but think it’s not good for developers.
  • MM: OK. Like to urge Google rep to write a position with the technical underpinnings to the issue, and I'd like to study them.
  • AE: sounds good.

@litherum litherum changed the title mapAsync(WRITE) zero-fills the relevant region of the buffer mapAsync(WRITE) and zero-filling Nov 2, 2022
@kdashg
Contributor

kdashg commented Nov 9, 2022

GPU Web meeting 2022-11-02/03 APAC-timed
  • MM: did a lot of research the past week about this.
  • MM: somewhat in the same place I was previously. Not sure how to characterize the problem I think exists and what I'm asking for.
  • MM: aiming for: a way to enable some mapping code path that works on both UMA and non-UMA. Ideally, sooner rather than later, but - as long as we can implement something in the future, that's OK. Without an opt-in, though. Trying to ensure there's a path forward that we can add a feature in the future, works great on UMA, don't have to opt-in. That's the concern about not zero-filling mappable buffers. Where we want mapping to work for non-UMA and UMA, having the non-UMA use case have to download the contents of the buffer is unfortunate.
  • (...more…)
  • If mapping such a buffer requires JS to have to be able to see contents of the buffer - for discrete devices, they'd have to download the contents.
  • MM: not 100% sure what I want to get to. Worried that if the contents have to exist when mapping the buffer now, that'll make our job harder in the future.
  • JB: to clarify - right now, we can't create buffers that are both mappable, and visible to shaders. In the future we hope to have buffers that are mappable and visible to shaders. And you want that to be not opt-in, you don't have to specify anything to get that behavior.
  • MM: yes. One way to achieve that goal is relaxing that restriction, but that option's top of mind right now.
  • JB: with the goal you're aiming for - the ArrayBuffer's contents are the bytes the GPU is reading from - it's possible to have 0 copies in the entire path. That's what you're aiming for. You're willing to have that path be required to 0 the buffers in some cases.
  • MM: think that's right.
  • JB: on the non-UMA architectures, it's always cheaper to leave the contents there. Zeroing it will always be a cost that's not necessary. We don't have any option here which is optimal for both situations. Make one side ideal by imposing a cost on the other side.
  • KN: I don't quite understand the concern about future-proofing. Assuming we add a way to make mappable buffers usable as other things, surely we can do what we collectively want at that point. Maybe a flag saying zero the memory. Or, we don't do that until JS adds a write-only ArrayBuffer type. Seems to me that we have flexibility. If we have to add a new map function to do this, we can. Don't see how we're designing ourselves into a corner.
  • JB: the fact that that combination of flags is forbidden now, is the opt-in.
  • MM: I don't want it to be an opt-in. Adding something with new semantics is opt-in.
  • JB: already, today, people can't write programs that treat things as mappable and shader-accessible. We can specify that the combination of those flags means we zero it, for example. We'll always know when we need to apply new rules.
  • MM: that's a good point. A little spooky but maybe that's OK.
  • KN: I wouldn't necessarily design it that way, but we can design it differently later on. Maybe you have to specify a MAP_ZERO flag during mapAsync call. But Jim's point stands.
  • JB: not advocating any particular design.
  • KN: wouldn't be as spooky.
  • MM: understand that's a tradeoff, pros/cons. Can discuss later. Most important part of this that I think is worth discussing now - don't want the thing we've described to be behind an extension. Want to just add it. Can be feature detectable - but not that to make your code work on UMA, you need to enable this. It'll just come to all browsers.
  • KN: sounds fine, as long as things are feature detectable. Don't need people to enable features unless they won't be universal.
  • KG: this isn't mentally actionable for me right now - don't understand the user story that we're solving here. The way I see it - there's no way to have a solution that's good for both UMA and non-UMA. You have to pick one.
  • MM: one example - referencing Metal - in Metal there is a storage mode on buffers. One in particular is "managed". On UMA devices, you can map it and the thing you map is the buffer. On non-UMA, you can map it, and the thing you map is not the buffer. On unmap, the data's copied to the right place. API works great for both UMA and non-UMA. You can make a buffer that's not this way. Reason it's useful to not do the thing I just described - when you want to populate a texture. On discrete GPU, using managed buffers - you map, populate, unmap - causes copy to show up in dest buffer on GPU. Then to get it into texture, have to schedule another copy. 2 copies to fill a texture. Metal therefore has 2 methods. One is that (managed); the other's shared. Even on discrete cards, can have a buffer that lives in CPU-accessible memory. That's for staging buffers, populating texture rather than buffer. There's room for an API that works great everywhere. Such an API isn't the only one that has to exist. We could have different APIs that have different strengths/weaknesses.
  • KR: Our team has gained some experience with that Metal API recently for both upload and readback cases. While it's certainly possible to use these APIs optimally on various GPU types, practically they perform very differently. There are many factors in the performance difference, which means we have to heuristically choose an approach, basically by looking at the GPU architecture. … Think there will have to at least be flags to tell applications which of several available approaches they should use to be optimal. Hope we won't try to design something that is optimal everywhere because it's really not possible.
    • KR: I can provide links to ANGLE bugs where both the upload and readback paths were tuned.
  • KG: fwiw that experience lives up to my trepidations. Not convinced that it unconditionally works great on both APIs (GPU types?). We can explore the space of compromises.
  • JB: Would like to hear more information about what you had to do on Metal
  • KR: Need to go back to ANGLE notes. Gregg just did optimizations in ANGLE’s readPixels. Had to write three codepaths, one for Intel, one for M1, one for discrete. Choosing the technique by architecture gained substantial performance.
  • DG: Ken, do we need to expose these directly to the application, or are these things we would need to do inside WebGPU implementations?
  • KR: … Will always need to know whether to write to a staging buffer or write to the resource directly.
  • KG: concerned we can't make forward progress. Focus on user stories, and prioritize them. Working in the abstract, we can't make progress.
  • MM: was about to say what you did. Think it's an interesting discussion. Want to have it, but don't need to have it right now. Think Jim/Kai were convincing that we can create something in the future, after V1, without an extension, without opt-in, that'll slot in and be great. Think we're in agreement.
  • KG: philosophical disagreement about how things like this are exposed. In WebGL, needing to know how various things interact - people don't understand them. When we have things that aren't opt-ins - causes problems. Had to write more clarifying docs recently, and we have warnings - for instance - if you try to use Float32 blending and you didn't enable it explicitly, it's not guaranteed. Not enough to enable 32-bit render targets. For historical reasons we said, we'll just enable it because we'll break too much content - but we wish we could have made people conscious about it. It just works, until you find a GPU where it doesn't.
  • JB: don't think Kai/I advocated for a particular API. Maybe it should be exposed more clearly describing the tradeoffs. Using this feature will require new behavior on the part of content, so we have the chance to introduce new semantics too.
  • KR: my main question was whether we can close out this issue.
  • MM: had the zero-filling question, think we can close this, but would like to keep UMA issue open. Convinced by Kai/Jim, we can consider UMA later.
  • KN: responding to Kelsey - fair. Wanting to enable this in the future without a feature flag. We can make that decision in the future. We could also put it behind a feature flag and take it out from behind it once all browsers implement it. Can decide later.
  • KG: we're going to close the zeroing issue, leave the UMA issue open.

@kainino0x kainino0x closed this Nov 10, 2022