Modern SIMT hardware handles divergent kernels and localized data sharing better than before, which makes massively parallel algorithms viable for more fields. Both strengths come from a grouping of threads that hardware architecture designers call waves, wavefronts, or warps; this article refers to the structure as subgroups. Use of this level of thread grouping was mostly implicit, but new iterations of GPGPU APIs have introduced explicit control. The grouping mostly takes place at the hardware level, which makes it possible to coordinate data and execution among a specific number of threads faster than shared and global coordination allows. Subgroups form the fourth grouping level (system, device global, device local, device subgroup). Handling this level explicitly lets algorithm designers apply a deeper understanding and up-front consideration of a given problem to fit their algorithms to tighter runtime budgets than implicit handling allows, which fits the aims of modern GPGPU APIs.
Subgroup implementations differ in capability. These capabilities can nevertheless be categorized so that they overlap across implementations, and a query interface for subgroup capabilities lets the developer decide the best way to proceed on given hardware. This article presents the query interfaces of different APIs, a proposed query pseudointerface, and how each API maps to it.
Existing Query Interfaces
DirectX 12 Feature Level 12.0
ID3D12Device has a member function CheckFeatureSupport, which provides a D3D12_FEATURE_DATA_D3D12_OPTIONS1 structure when the D3D12_FEATURE Feature argument is set to D3D12_FEATURE_D3D12_OPTIONS1 [0][1]. The D3D12_FEATURE_DATA_D3D12_OPTIONS1 structure has a BOOL WaveOps member that is set to true if the device supports all subgroup operations (this statement needs clarification and/or confirmation). The subgroup size range is provided by the UINT WaveLaneCountMin and UINT WaveLaneCountMax members, which are the inclusive minimum and maximum of the range [2]. Subgroup operations, including quad operations, are supported only at the fragment and compute stages [3].
HRESULT CheckFeatureSupport(
  D3D12_FEATURE Feature,
  void          *pFeatureSupportData,
  UINT          FeatureSupportDataSize
);

typedef struct D3D12_FEATURE_DATA_D3D12_OPTIONS1 {
  BOOL WaveOps;
  UINT WaveLaneCountMin;
  UINT WaveLaneCountMax;
  UINT TotalLaneCount;
  BOOL ExpandedComputeResourceStates;
  BOOL Int64ShaderOps;
} D3D12_FEATURE_DATA_D3D12_OPTIONS1;
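As a rough illustration of how the three wave-related members could feed an inclusive subgroup size range, the following C++ sketch uses a hypothetical stand-in struct (WaveOptions) instead of the real D3D12 header so that it builds anywhere. The names WaveOptions, SubgroupSizeRange, toSubgroupSizeRange and acceptsSubgroupSize are assumptions made for this example, as is the choice to report an empty range when WaveOps is false.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the three subgroup-related members of
// D3D12_FEATURE_DATA_D3D12_OPTIONS1, so the sketch builds without d3d12.h.
struct WaveOptions {
    bool waveOps;
    std::uint32_t waveLaneCountMin;
    std::uint32_t waveLaneCountMax;
};

// Hypothetical result type: a support flag plus an inclusive size range.
struct SubgroupSizeRange {
    bool supported;
    std::uint32_t inclusiveMinimum;
    std::uint32_t inclusiveMaximum;
};

// Assumption: when waveOps is false the lane counts carry no useful
// information, so the range is reported as empty (0..0).
inline SubgroupSizeRange toSubgroupSizeRange(const WaveOptions& options) {
    if (!options.waveOps) {
        return {false, 0, 0};
    }
    assert(options.waveLaneCountMin <= options.waveLaneCountMax);
    return {true, options.waveLaneCountMin, options.waveLaneCountMax};
}

// True if `size` falls inside the reported inclusive range.
inline bool acceptsSubgroupSize(const SubgroupSizeRange& range, std::uint32_t size) {
    return range.supported &&
           size >= range.inclusiveMinimum &&
           size <= range.inclusiveMaximum;
}
```

In a real backend the WaveOptions values would be copied out of the structure filled in by CheckFeatureSupport; that call is deliberately left out of the sketch.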
Metal 2.1
MTLDevice has a supportsFeatureSet member function that takes a feature set and returns true if it is supported [4][5]. Which subgroup operations a feature set supports can be derived directly from the feature set tables [6], or indirectly from the documentation by extracting OS information from the feature set [7]. A temporary MTLComputePipelineState must be created to get the subgroup size range; its threadExecutionWidth property supplies a single value that serves as both the inclusive minimum and maximum of the range [8][9]. Subgroup operations, including quad operations, are supported only at the fragment and compute stages.
func supportsFeatureSet(_ featureSet: MTLFeatureSet) -> Bool
var threadExecutionWidth: Int { get }
Vulkan 1.1
vkGetPhysicalDeviceProperties2 takes a VkPhysicalDevice and fills a VkPhysicalDeviceProperties2 structure, whose void* pNext pointer chain can contain a VkPhysicalDeviceSubgroupProperties structure if implementation details are provided [10][11][12]. If implementation details are not provided, subgroup operations are not supported (this statement needs clarification and/or confirmation) [13]. Supported subgroup operations are reported in the VkSubgroupFeatureFlags supportedOperations member of VkPhysicalDeviceSubgroupProperties. The subgroup size is reported in uint32_t subgroupSize, a single value that serves as both the inclusive minimum and maximum of the range. The stages that support subgroup operations are reported in VkShaderStageFlags supportedStages; an extra VkBool32 quadOperationsInAllStages member, when set to false, limits quad operations to the fragment and compute stages.
void vkGetPhysicalDeviceProperties2(
    VkPhysicalDevice             physicalDevice,
    VkPhysicalDeviceProperties2* pProperties);

typedef struct VkPhysicalDeviceProperties2 {
    VkStructureType            sType;
    void*                      pNext;
    VkPhysicalDeviceProperties properties;
} VkPhysicalDeviceProperties2;

typedef struct VkPhysicalDeviceSubgroupProperties {
    VkStructureType        sType;
    void*                  pNext;
    uint32_t               subgroupSize;
    VkShaderStageFlags     supportedStages;
    VkSubgroupFeatureFlags supportedOperations;
    VkBool32               quadOperationsInAllStages;
} VkPhysicalDeviceSubgroupProperties;
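The interplay of supportedStages, supportedOperations and quadOperationsInAllStages can be sketched as a single predicate. The C++ below is only one reading of the paragraph above, not an official algorithm: it uses a hypothetical stand-in struct and local copies of a few Vulkan 1.1 bit values so that it builds without vulkan.h, and the function name operationSupportedInStage is an assumption made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Local copies of Vulkan 1.1 bit values, so the sketch builds without vulkan.h.
constexpr std::uint32_t kStageVertex   = 0x00000001; // VK_SHADER_STAGE_VERTEX_BIT
constexpr std::uint32_t kStageFragment = 0x00000010; // VK_SHADER_STAGE_FRAGMENT_BIT
constexpr std::uint32_t kStageCompute  = 0x00000020; // VK_SHADER_STAGE_COMPUTE_BIT
constexpr std::uint32_t kFeatureBasic  = 0x00000001; // VK_SUBGROUP_FEATURE_BASIC_BIT
constexpr std::uint32_t kFeatureQuad   = 0x00000080; // VK_SUBGROUP_FEATURE_QUAD_BIT

// Hypothetical stand-in for the data members of VkPhysicalDeviceSubgroupProperties.
struct SubgroupProperties {
    std::uint32_t subgroupSize;
    std::uint32_t supportedStages;
    std::uint32_t supportedOperations;
    bool quadOperationsInAllStages;
};

// Returns true if one operation bit may be used in one stage bit.
// Quad operations get the extra fragment/compute restriction when
// quadOperationsInAllStages is false.
inline bool operationSupportedInStage(const SubgroupProperties& properties,
                                      std::uint32_t operation,
                                      std::uint32_t stage) {
    assert(operation != 0 && stage != 0);
    if ((properties.supportedOperations & operation) == 0) {
        return false;
    }
    if ((properties.supportedStages & stage) == 0) {
        return false;
    }
    if (operation == kFeatureQuad && !properties.quadOperationsInAllStages) {
        return stage == kStageFragment || stage == kStageCompute;
    }
    return true;
}
```

With a real device, the struct would be filled by chaining VkPhysicalDeviceSubgroupProperties into pNext before calling vkGetPhysicalDeviceProperties2; that step is omitted here.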
Proposed Pseudointerface
The pseudointerface proposed in this article follows the Vulkan API most closely, but adds a member so that subgroupSize becomes an inclusive range (minimum and maximum). It would be better to bake this structure into WebGPUDevice at WebGPU API initialization, since some backends require temporary object creation and deletion to get the required information, which can be detrimental to overall runtime if repeated at every query [13]. All APIs listed above can be mapped to this interface using only documentation and tables. API versions that have no query or implementation support for subgroups can be detected, and their support flags can be set to false.
typedef [EnforceRange] unsigned long GPUSubgroupFeatureFlags;
interface GPUSubgroupFeatureFlags {
  // Each flag occupies its own bit so that flags can be combined with bitwise OR.
  const GPUSubgroupFeatureFlags BASIC = 0x1;
  const GPUSubgroupFeatureFlags VOTE = 0x2;
  const GPUSubgroupFeatureFlags BALLOT = 0x4;
  const GPUSubgroupFeatureFlags ARITHMETIC = 0x8;
  const GPUSubgroupFeatureFlags QUAD = 0x10;
};
interface GPUSubgroupProperties {
  readonly attribute GPUSize32 subgroupSizeInclusiveMinimum;
  readonly attribute GPUSize32 subgroupSizeInclusiveMaximum;
  readonly attribute GPUShaderStageFlags supportedStages;
  readonly attribute GPUSubgroupFeatureFlags supportedOperations;
  readonly attribute boolean quadOperationsInAllStages;
};
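To make the mapping concrete, here is a minimal C++ mirror of the pseudointerface, written as a sketch rather than as a proposed implementation. It assumes that QUAD is given its own bit (0x10) so that it cannot be confused with the combination of the other four flags, and it shows how a backend that reports a single subgroup size (as described for Metal and Vulkan) could fill the inclusive range. The names kBasic through kQuad, SubgroupProperties, fromSingleSize and supports are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using SubgroupFeatureFlags = std::uint32_t;

constexpr SubgroupFeatureFlags kBasic      = 0x1;
constexpr SubgroupFeatureFlags kVote       = 0x2;
constexpr SubgroupFeatureFlags kBallot     = 0x4;
constexpr SubgroupFeatureFlags kArithmetic = 0x8;
constexpr SubgroupFeatureFlags kQuad       = 0x10; // assumption: own bit, not 0xF

// A value of 0xF would be indistinguishable from all other flags OR-ed together.
static_assert((kQuad & (kBasic | kVote | kBallot | kArithmetic)) == 0,
              "QUAD must not overlap the other feature bits");

// Mirror of the proposed GPUSubgroupProperties members.
struct SubgroupProperties {
    std::uint32_t subgroupSizeInclusiveMinimum;
    std::uint32_t subgroupSizeInclusiveMaximum;
    std::uint32_t supportedStages;
    SubgroupFeatureFlags supportedOperations;
    bool quadOperationsInAllStages;
};

// For backends that expose one subgroup size: use it as both range bounds.
inline SubgroupProperties fromSingleSize(std::uint32_t size,
                                         std::uint32_t stages,
                                         SubgroupFeatureFlags operations,
                                         bool quadInAllStages) {
    assert(size > 0);
    return {size, size, stages, operations, quadInAllStages};
}

// True only if every bit in `required` is present in supportedOperations.
inline bool supports(const SubgroupProperties& properties,
                     SubgroupFeatureFlags required) {
    return (properties.supportedOperations & required) == required;
}
```

Computing such a structure once at device creation and caching it would match the suggestion above to avoid repeating temporary object creation on every query.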
Coarse Analysis of Device Support
DirectX 12 Feature Level 12.0 Devices
No database with subgroup capability information for devices could be found.
Metal 2.1 Devices
macOS hardware already has extensive support for subgroup features. iOS hardware has only recently gained limited support for subgroup features, but it is enough to improve the performance of applications [7].
Vulkan 1.1 Devices
Desktop Vulkan hardware already has extensive support for subgroup features. Mobile Vulkan hardware does not have mainstream Vulkan 1.1 support at the moment, but beta information indicates that some subgroup features will be exposed [14].
WHLSL Support
See [15].