1// Copyright 2021-2023 The Khronos Group Inc.
2//
3// SPDX-License-Identifier: CC-BY-4.0
4
5= VK_EXT_mesh_shader
6:toc: left
7:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/
8:spec: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html
9:sectnums:
10
11This extension provides a new mechanism allowing applications to generate collections of geometric primitives via programmable mesh shading.
12
13
14== Problem Statement
15
16The rasterization pipeline for Vulkan is fairly fixed - a fixed input stage assembles vertex data for the vertex shader, data optionally passes through tessellation and geometry stages, then fixed vertex processing passes the resulting primitives to the rasterizer.
17As rendering engines get increasingly complex, the fixed nature of this pipeline has become a bottleneck; many developers are augmenting their rasterization pipelines with compute shaders to make them more flexible.
18Using compute shaders comes at a cost though - data has to be piped via global memory, and once there it still has to be optimised for an implementation's vertex caches or face performance penalties.
19With compute shaders processing geometry, the role of the vertex shader is also somewhat redundant - transformations could be performed just as easily in compute shaders, and the vertex shader serves merely as a way to get at the fixed rasterization interface.
20
21This proposal aims to find a way to make the geometry pipeline more flexible, removing the unnecessary cost of an extra shader.
22
23
24== Solution Space
25
Making the geometry pipeline more flexible requires rethinking how data is transmitted from buffers to the rasterizer; ideally a solution should allow applications to flexibly modify the set of geometry as a whole in the way developers are currently using compute shaders, and how they may use them in future (e.g. culling, grouping, decompression), without compromising on efficiency. Some considered ways to do this include:
27
28  . Giving vertex shaders defined grouping, and enabling communication via subgroup operations
29  . Skipping the vertex shader and using geometry or tessellation shaders at the start of the pipeline
  . Using compute shaders in place of pre-rasterization shader stages
31
32While a defined grouping in the vertex shaders would give applications the ability to manipulate sets of vertices, there is very limited ability to cull or remove vertices or primitives within a group, and the groups would be limited by the size of subgroups on the GPU without significant modifications.
33Similarly, applications would be largely constrained by fixed input assembly requiring a 1:1 ratio of indices to input vertices, meaning things like decompressing meshes or acting at any granularity other than vertices would be infeasible.
34Changing the vertex stage to accommodate this extra flexibility would require significant changes to how the stage works, which could likely better be accommodated by other existing stages.
35
36Skipping vertex shading and jumping straight to geometry or tessellation stages would provide a level of flexibility that could be interesting - both have access to geometric data at a granularity other than vertices, and are able to remove or add geometric detail before rasterization.
The main issue with these stages is that they are still rather fixed in terms of inputs, output topology, and the granularity they operate at - they are also historically not very efficient across platforms, and removing the vertex shader from the front of the pipeline would not do much to help that.
38
The last clear option is to effectively use compute shaders in place of existing pre-rasterization stages, enabling applications to more easily port existing compute shaders to this new extension with all the flexibility intact.
40The main difficulty with this is simply defining the interface between the shader and the rasterizer, as existing compute shaders simply write out to buffers, whereas rasterization hardware is highly specialized, and may need to be fed in a particular manner.
41
42This extension opts for something close to option 3, by replacing all pre-rasterization shaders in the graphics pipeline with two new stages that operate like compute shaders but with output that can be consumed by the rasterizer.
43
44
45== Proposal
46
47=== New Shaders
48
49This proposal adds two new shader stages, which can be used in place of the existing pre-rasterization shader stages; <<Mesh Shaders>> and <<Task Shaders>>.
50
51
52==== Mesh Shaders
53
Mesh shaders are a new compute-like shader stage whose primary aim is to generate a set of primitives for the rasterizer, which are passed via its new output interface.
55For the most part, these map well to compute shaders - they are dispatched in workgroups and have access to shared memory, workgroup barriers, and local IDs, allowing for a lot of flexibility in how they are executed.
56
57Unlike vertex shaders, they do not use the vertex input interface, and geometry and indices must: be generated by the shader or read from buffers with no requirement for applications to provide the data in a particular way.
58As such, this allows applications to read or generate data however they need to, removing the need to prepare data before launching the graphics pipeline.
59This allows items such as decompression or decryption of data to be performed within the graphics pipeline directly, avoiding the bandwidth cost typically associated with compute shaders.
60
Another key part of mesh shaders is that the number of primitives that a given workgroup can emit is dynamic - there are limits to how much can be emitted, which are advertised by the implementation, but applications can freely emit fewer primitives than the maximum.
62This allows things like modifying the level of detail at a fine granularity without the use of tessellation.
63Vertex outputs are written via the `Output` storage class and using standard built-ins like `Position` and `PointSize`, but are written as arrays in the same way as a tessellation control shader's outputs.
64Additionally, the mesh shader outputs indices per primitive according to the output primitive type (points, lines, or triangles). Indices are emitted as a separate array in a similar fashion to vertex outputs.
65Mesh shader output topologies are lists only - there is no support for triangle or line strips or triangle fans; data in these formats must be unpacked by the shader.
66Other user data can be emitted at per-vertex or per-primitive rates alongside these built-ins.
67Mesh shaders must specify the actual number of primitives and vertices being emitted before writing them, via the `OpSetMeshOutputsEXT` instruction.
68Subsequent fragment shaders can retrieve input data at both rates, tied to the vertices and primitive being rasterized.
69
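As an illustration of the strip unpacking mentioned above, the following C sketch expands a triangle strip into the list form that mesh shader outputs require.
It runs on the host purely for clarity; a mesh shader would perform the same index arithmetic per invocation.
The function name `unpack_triangle_strip` is hypothetical, and the ordering used for odd-numbered triangles is an assumption chosen to follow the triangle strip ordering in the Vulkan specification.

```c
#include <stddef.h>
#include <stdint.h>

/* Expands a triangle strip of `strip_len` indices into a triangle list.
 * Writes 3 * (strip_len - 2) indices to `list` and returns the number of
 * triangles written. Odd-numbered triangles swap their last two indices so
 * that every triangle keeps the same winding. */
static size_t unpack_triangle_strip(const uint32_t *strip, size_t strip_len,
                                    uint32_t *list)
{
    if (strip_len < 3) {
        return 0; /* not enough indices for a single triangle */
    }
    size_t tri_count = strip_len - 2;
    for (size_t i = 0; i < tri_count; ++i) {
        size_t odd = i % 2;
        list[3 * i + 0] = strip[i];
        list[3 * i + 1] = strip[i + 1 + odd];
        list[3 * i + 2] = strip[i + 2 - odd];
    }
    return tri_count;
}
```
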
70Mesh shaders can be dispatched from the API like a compute shader would be, or launched via <<Task Shaders>>.
71
72
73==== Task Shaders
74
75Task shaders are an optional shader stage that executes ahead of mesh shaders. A task shader is dispatched like a compute shader, and indirectly dispatches mesh shader workgroups.
76This shader is another compute-like stage that is executed in workgroups and has all the other features of compute shaders as well, with the addition of access to a dedicated instruction to launch mesh shaders.
77
78The primary function of task shaders is to dispatch mesh shaders via code:OpEmitMeshTasksEXT, which takes as input a number of mesh shader groups to emit, and a payload variable that will be visible to all mesh shader invocations launched by this instruction.
79This instruction is executed once per workgroup rather than per-invocation, and the payload itself is in a workgroup-wide storage class, similar to shared memory.
80Once this instruction is called, the workgroup is terminated immediately, and the mesh shaders are launched.
81
82Task shaders can be used for functionality like coarse culling of entire meshlets or dynamically generating more or less geometry depending on the level of detail required.
This is notionally similar to the purpose of tessellation shaders, but without the constraint of fixed functionality, and with greater flexibility in how primitives are generated.
84Applications can use task shaders to determine the number of launched mesh shader workgroups at whatever input granularity they want, and however they see fit.
85
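The coarse culling use case can be sketched on the host as follows.
This is a minimal C emulation of one task workgroup, not shader code: each loop iteration stands in for one invocation, survivors are compacted into a payload, and the return value is the number of mesh workgroups that would be requested.
`TASK_GROUP_SIZE`, `TaskPayload`, and `cull_and_compact` are hypothetical names, and the one-meshlet-per-invocation mapping is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_GROUP_SIZE 32u /* hypothetical task workgroup size */

/* Hypothetical payload layout: indices of the meshlets that survived culling. */
typedef struct TaskPayload {
    uint32_t meshlet_indices[TASK_GROUP_SIZE];
} TaskPayload;

/* Emulates one task workgroup that starts at `first_meshlet`.
 * `visible[m]` is the result of whatever culling test the application uses.
 * Returns how many mesh workgroups to launch for this task workgroup. */
static uint32_t cull_and_compact(uint32_t first_meshlet, const bool *visible,
                                 uint32_t meshlet_count, TaskPayload *payload)
{
    uint32_t emitted = 0;
    for (uint32_t i = 0; i < TASK_GROUP_SIZE; ++i) {
        uint32_t meshlet = first_meshlet + i;
        if (meshlet < meshlet_count && visible[meshlet]) {
            payload->meshlet_indices[emitted++] = meshlet;
        }
    }
    return emitted;
}
```

In a task shader, the returned count would be passed as the X group count when emitting mesh tasks, with each mesh workgroup reading its meshlet index from the payload.
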
86
87==== Rasterization Order
88
As task and mesh shaders change how primitives are dispatched, a corresponding modification of rasterization order is made.
90Within a mesh shader workgroup, primitives are rasterized in the order in which they are defined in the output.
91A group of mesh shaders either launched directly by the API, indirectly by the API,
92or indirectly from a single task shader workgroup will rasterize their outputs in sequential order based on their flattened global invocation index,
93equal to asciimath:[x + y * width + z * width * height], where `x`, `y`, and `z` refer to the components of the code:GlobalInvocationId built-in.
94`width` and `height` are equal to code:NumWorkgroups times code:WorkgroupSize for their respective dimensions.
95When using task shaders, there is no rasterization order guarantee between mesh shaders launched by separate task shader workgroups, even within the same draw command.
96
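The flattened index used for ordering can be written out directly.
The sketch below is plain C that mirrors the formula above; the function and parameter names are hypothetical, and 64-bit arithmetic is used only to avoid overflow in the example.

```c
#include <stdint.h>

/* Computes x + y * width + z * width * height, where width and height are
 * NumWorkgroups * WorkgroupSize in the X and Y dimensions respectively. */
static uint64_t flattened_invocation_index(const uint32_t global_id[3],
                                           const uint32_t num_workgroups[3],
                                           const uint32_t workgroup_size[3])
{
    uint64_t width  = (uint64_t)num_workgroups[0] * workgroup_size[0];
    uint64_t height = (uint64_t)num_workgroups[1] * workgroup_size[1];
    return global_id[0] + global_id[1] * width + global_id[2] * width * height;
}
```

For example, with 2x2x2 workgroups of size 4x2x1, `width` is 8 and `height` is 4, so the invocation with global ID (1, 2, 1) has index 1 + 2 * 8 + 1 * 32 = 49.
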
97
98=== API Changes
99
100==== Graphics Pipeline Creation
101
102Graphics pipelines can now be created using mesh and task shaders in place of vertex, tessellation, and geometry shaders.
103This can be achieved by omitting existing pre-rasterization shaders and including a mesh shader stage, and optionally a task shader stage.
When a mesh shader stage is present, a graphics pipeline is complete without the inclusion of the link:{spec}#pipeline-graphics-subsets-vertex-input[vertex input state subset], as this state does not participate in mesh pipelines.
105No other modifications to graphics pipelines are necessary.
Two new shader stage bits are added to the API to describe the new shader stages:
107
108```c
VK_SHADER_STAGE_TASK_BIT_EXT = 0x40,
VK_SHADER_STAGE_MESH_BIT_EXT = 0x80,
111```
Note that `VK_SHADER_STAGE_ALL_GRAPHICS` was defined as a mask of existing bits during Vulkan 1.0 development, and thus cannot include these new bits; modifying it would break compatibility.
113
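The stage combinations described above can be summarized with a small check.
This is a sketch using stand-in constants rather than the Vulkan headers: the `EX_` names are hypothetical, their values mirror the corresponding Vulkan shader stage bits, and the rule that a non-mesh pipeline needs a vertex shader comes from core Vulkan rather than from this proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the Vulkan shader stage bits (hypothetical names). */
enum {
    EX_STAGE_VERTEX       = 0x01,
    EX_STAGE_TESS_CONTROL = 0x02,
    EX_STAGE_TESS_EVAL    = 0x04,
    EX_STAGE_GEOMETRY     = 0x08,
    EX_STAGE_FRAGMENT     = 0x10,
    EX_STAGE_ALL_GRAPHICS = 0x1F, /* fixed mask from Vulkan 1.0 */
    EX_STAGE_TASK         = 0x40,
    EX_STAGE_MESH         = 0x80
};

/* Returns true if the pre-rasterization stages in `stages` form either a
 * classic vertex pipeline or a mesh pipeline, but not a mixture of both. */
static bool is_valid_prerasterization_stages(uint32_t stages)
{
    uint32_t classic = EX_STAGE_VERTEX | EX_STAGE_TESS_CONTROL |
                       EX_STAGE_TESS_EVAL | EX_STAGE_GEOMETRY;
    bool has_classic = (stages & classic) != 0;
    bool has_task = (stages & EX_STAGE_TASK) != 0;
    bool has_mesh = (stages & EX_STAGE_MESH) != 0;

    if (has_mesh) {
        return !has_classic; /* task stage is optional here */
    }
    /* Without a mesh shader, a task shader has nothing to launch. */
    return !has_task && (stages & EX_STAGE_VERTEX) != 0;
}
```
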
114==== Synchronization
115
116New pipeline stages are added for synchronization of these new stages:
117
118```c
119VK_PIPELINE_STAGE_TASK_SHADER_BIT_EXT = 0x80000,
120VK_PIPELINE_STAGE_MESH_SHADER_BIT_EXT = 0x100000,
121```
122
123```c
static const VkPipelineStageFlagBits2KHR VK_PIPELINE_STAGE_2_TASK_SHADER_BIT_EXT = 0x00080000ULL;
static const VkPipelineStageFlagBits2KHR VK_PIPELINE_STAGE_2_MESH_SHADER_BIT_EXT = 0x00100000ULL;
126```
127
128These new pipeline stages interact similarly to compute shaders, with all the same access types and operations.
129They are also logically ordered before fragment shading, but have no logical ordering compared to existing pre-rasterization shader stages.
130The `VK_PIPELINE_STAGE_2_PRE_RASTERIZATION_SHADERS_BIT` stage added by link:{refpage}VK_KHR_synchronization2.html[VK_KHR_synchronization2] includes these new shader stages, and can be used identically.
131
132
133==== Queries
134
135Pipeline statistics queries are updated with new bits to count mesh and task shader invocations, in a similar manner to how other shader invocations are counted:
136
137```c
138VK_QUERY_PIPELINE_STATISTIC_TASK_SHADER_INVOCATIONS_BIT_EXT = 0x800,
139VK_QUERY_PIPELINE_STATISTIC_MESH_SHADER_INVOCATIONS_BIT_EXT = 0x1000,
140```
141
142An additional standalone query counting the number of mesh primitives generated is added:
143
144```c
145VK_QUERY_TYPE_MESH_PRIMITIVES_GENERATED_EXT = 1000328000,
146```
147
148An active query of this type will generate a count of every individual primitive emitted from any mesh shader workgroup that is not culled by fixed function culling.
149
150
151==== Draw Calls
152
153Three new draw calls are added to the API to dispatch mesh pipelines:
154
155```c
156VKAPI_ATTR void VKAPI_CALL vkCmdDrawMeshTasksEXT(
157    VkCommandBuffer                             commandBuffer,
158    uint32_t                                    groupCountX,
159    uint32_t                                    groupCountY,
160    uint32_t                                    groupCountZ);
161
162VKAPI_ATTR void VKAPI_CALL vkCmdDrawMeshTasksIndirectEXT(
163    VkCommandBuffer                             commandBuffer,
164    VkBuffer                                    buffer,
165    VkDeviceSize                                offset,
166    uint32_t                                    drawCount,
167    uint32_t                                    stride);
168
169VKAPI_ATTR void VKAPI_CALL vkCmdDrawMeshTasksIndirectCountEXT(
170    VkCommandBuffer                             commandBuffer,
171    VkBuffer                                    buffer,
172    VkDeviceSize                                offset,
173    VkBuffer                                    countBuffer,
174    VkDeviceSize                                countBufferOffset,
175    uint32_t                                    maxDrawCount,
176    uint32_t                                    stride);
177
178typedef struct VkDrawMeshTasksIndirectCommandEXT {
    uint32_t    groupCountX;
    uint32_t    groupCountY;
    uint32_t    groupCountZ;
182} VkDrawMeshTasksIndirectCommandEXT;
183```
184
185`vkCmdDrawMeshTasksEXT` is the simplest as it functions the same as link:{refpage}vkCmdDispatch.html[vkCmdDispatch], but dispatches the mesh or task shader in a graphics pipeline with the specified workgroup counts, rather than a compute shader.
186
187`vkCmdDrawMeshTasksIndirectEXT` functions similarly to link:{refpage}vkCmdDispatchIndirect.html[vkCmdDispatchIndirect], but with the draw count functionality from other draw commands.
Multiple draws are dispatched according to the `drawCount` parameter, with data in `buffer` being consumed as a strided array of `VkDrawMeshTasksIndirectCommandEXT` structures, with stride equal to `stride`.
189Each element of this array defines a separate draw call's workgroup counts in each dimension, and dispatches mesh or task shaders for the current pipeline accordingly.
190
`vkCmdDrawMeshTasksIndirectCountEXT` functions as `vkCmdDrawMeshTasksIndirectEXT`, but also reads its draw count from device memory.
The draw count is read from `countBuffer` at an offset of `countBufferOffset`; the number of draws executed is the minimum of that value and `maxDrawCount`.
193
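The strided layout consumed by the indirect commands can be sketched with a few helper functions.
This is illustrative host-side C using a stand-in structure rather than the Vulkan headers; all `Ex`/`ex_` names are hypothetical.
The clamping in `ex_effective_draw_count` follows the behavior the Vulkan specification defines for indirect-count draw commands.

```c
#include <stdint.h>

/* Stand-in for VkDrawMeshTasksIndirectCommandEXT: three workgroup counts. */
typedef struct ExDrawMeshTasksIndirectCommand {
    uint32_t groupCountX;
    uint32_t groupCountY;
    uint32_t groupCountZ;
} ExDrawMeshTasksIndirectCommand;

/* Byte offset within the buffer of the command for draw `draw_index`. */
static uint64_t ex_command_offset(uint64_t offset, uint32_t stride,
                                  uint32_t draw_index)
{
    return offset + (uint64_t)stride * draw_index;
}

/* Number of bytes the buffer must span to hold `draw_count` commands. */
static uint64_t ex_required_buffer_size(uint64_t offset, uint32_t stride,
                                        uint32_t draw_count)
{
    if (draw_count == 0) {
        return 0; /* nothing is read */
    }
    return ex_command_offset(offset, stride, draw_count - 1) +
           sizeof(ExDrawMeshTasksIndirectCommand);
}

/* Draws executed by the count variant: the buffer value clamped to the maximum. */
static uint32_t ex_effective_draw_count(uint32_t count_in_buffer,
                                        uint32_t max_draw_count)
{
    return count_in_buffer < max_draw_count ? count_in_buffer : max_draw_count;
}
```

A `stride` larger than the structure size lets applications interleave their own per-draw data between commands.
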
194
195==== Properties
196
197Several new properties are added to the API - some dictating hard limits, and others indicating performance considerations:
198
199```c
200typedef struct VkPhysicalDeviceMeshShaderPropertiesEXT {
201    VkStructureType    sType;
202    void*              pNext;
203    uint32_t           maxTaskWorkGroupTotalCount;
204    uint32_t           maxTaskWorkGroupCount[3];
205    uint32_t           maxTaskWorkGroupInvocations;
206    uint32_t           maxTaskWorkGroupSize[3];
207    uint32_t           maxTaskPayloadSize;
208    uint32_t           maxTaskSharedMemorySize;
209    uint32_t           maxTaskPayloadAndSharedMemorySize;
210    uint32_t           maxMeshWorkGroupTotalCount;
211    uint32_t           maxMeshWorkGroupCount[3];
212    uint32_t           maxMeshWorkGroupInvocations;
213    uint32_t           maxMeshWorkGroupSize[3];
214    uint32_t           maxMeshSharedMemorySize;
215    uint32_t           maxMeshPayloadAndSharedMemorySize;
216    uint32_t           maxMeshOutputMemorySize;
217    uint32_t           maxMeshPayloadAndOutputMemorySize;
218    uint32_t           maxMeshOutputComponents;
219    uint32_t           maxMeshOutputVertices;
220    uint32_t           maxMeshOutputPrimitives;
221    uint32_t           maxMeshOutputLayers;
222    uint32_t           maxMeshMultiviewViewCount;
223    uint32_t           meshOutputPerVertexGranularity;
224    uint32_t           meshOutputPerPrimitiveGranularity;
225    uint32_t           maxPreferredTaskWorkGroupInvocations;
226    uint32_t           maxPreferredMeshWorkGroupInvocations;
227    VkBool32           prefersLocalInvocationVertexOutput;
228    VkBool32           prefersLocalInvocationPrimitiveOutput;
229    VkBool32           prefersCompactVertexOutput;
230    VkBool32           prefersCompactPrimitiveOutput;
231} VkPhysicalDeviceMeshShaderPropertiesEXT;
232```
233
234The following limits affect task shader execution:
235
236 * `maxTaskWorkGroupTotalCount` indicates the total number of workgroups that can be launched for a task shader.
237 * `maxTaskWorkGroupCount` indicates the number of workgroups that can be launched for a task shader in each given dimension.
238 * `maxTaskWorkGroupInvocations` indicates the total number of invocations that can be launched for a task shader in a single workgroup.
239 * `maxTaskWorkGroupSize` indicates the maximum number of invocations for a task shader in each dimension for a single workgroup.
240 * `maxTaskPayloadSize` indicates the maximum total size of task shader output payloads.
241 * `maxTaskSharedMemorySize` indicates the maximum total size of task shader shared memory variables.
242 * `maxTaskPayloadAndSharedMemorySize` indicates the maximum total combined size of task shader output payloads and shared memory variables.
243
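As a sketch of how the task limits above combine, the following C function checks a task shader's workgroup size and memory usage against them.
`ExTaskLimits` is a hypothetical stand-in for the task-related members of `VkPhysicalDeviceMeshShaderPropertiesEXT`, and `task_shader_fits` is an illustrative helper, not part of the API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the task-related limits (hypothetical type). */
typedef struct ExTaskLimits {
    uint32_t maxTaskWorkGroupInvocations;
    uint32_t maxTaskWorkGroupSize[3];
    uint32_t maxTaskPayloadSize;
    uint32_t maxTaskSharedMemorySize;
    uint32_t maxTaskPayloadAndSharedMemorySize;
} ExTaskLimits;

/* Returns true if a task shader with the given local size, payload size, and
 * shared memory size (both in bytes) stays within every limit. */
static bool task_shader_fits(const ExTaskLimits *limits,
                             const uint32_t local_size[3],
                             uint32_t payload_bytes, uint32_t shared_bytes)
{
    uint64_t invocations = 1;
    for (int i = 0; i < 3; ++i) {
        if (local_size[i] > limits->maxTaskWorkGroupSize[i]) {
            return false;
        }
        invocations *= local_size[i];
        if (invocations > limits->maxTaskWorkGroupInvocations) {
            return false; /* checked per step, so the product cannot overflow */
        }
    }
    if (payload_bytes > limits->maxTaskPayloadSize ||
        shared_bytes > limits->maxTaskSharedMemorySize) {
        return false;
    }
    /* The combined limit can be tighter than the sum of the individual ones. */
    return (uint64_t)payload_bytes + shared_bytes <=
           limits->maxTaskPayloadAndSharedMemorySize;
}
```
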
Similar limits affect mesh shader execution:
245
246 * `maxMeshWorkGroupTotalCount` indicates the total number of workgroups that can be launched for a mesh shader.
247 * `maxMeshWorkGroupCount` indicates the number of workgroups that can be launched for a mesh shader in each given dimension.
248 * `maxMeshWorkGroupInvocations` indicates the total number of invocations that can be launched for a mesh shader in a single workgroup.
249 * `maxMeshWorkGroupSize` indicates the maximum number of invocations for a mesh shader in each dimension for a single workgroup.
250 * `maxMeshSharedMemorySize` indicates the maximum total size of mesh shader shared memory variables.
251 * `maxMeshPayloadAndSharedMemorySize` indicates the maximum total combined size of mesh shader input payloads and shared memory variables.
 * `maxMeshOutputMemorySize` indicates the maximum total size of mesh shader output variables.
253 * `maxMeshPayloadAndOutputMemorySize` indicates the maximum total combined size of mesh shader input payloads and output variables.
254 * `maxMeshOutputComponents` is the maximum number of components of mesh shader output variables.
255 * `maxMeshOutputVertices` is the maximum number of vertices a mesh shader can emit.
256 * `maxMeshOutputPrimitives` is the maximum number of primitives a mesh shader can emit.
257 * `maxMeshOutputLayers` is the maximum number of layers that a mesh shader can render to.
258 * `maxMeshMultiviewViewCount` is the maximum number of views that a mesh shader can render to.
259
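The per-dimension and total workgroup count limits apply together, which a short check makes concrete.
This is a minimal C sketch; `mesh_group_counts_fit` and its parameters are hypothetical names, with `max_count` and `max_total` standing in for `maxMeshWorkGroupCount` and `maxMeshWorkGroupTotalCount`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if launching x * y * z mesh workgroups respects both the
 * per-dimension limit and the limit on the total number of workgroups. */
static bool mesh_group_counts_fit(const uint32_t max_count[3],
                                  uint32_t max_total,
                                  uint32_t x, uint32_t y, uint32_t z)
{
    if (x > max_count[0] || y > max_count[1] || z > max_count[2]) {
        return false;
    }
    uint64_t total = (uint64_t)x * y; /* cannot overflow 64 bits */
    if (total > max_total) {
        return false;
    }
    total *= z; /* total fits in 32 bits here, so this is safe too */
    return total <= max_total;
}
```
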
When considering the above properties, the number of mesh shader outputs a shader uses is rounded up according to the implementation-defined granularities given by the following properties:
261
262 * `meshOutputPerVertexGranularity` is the alignment of each per-vertex mesh shader output.
263 * `meshOutputPerPrimitiveGranularity` is the alignment of each per-primitive mesh shader output.
264
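One way to account for these granularities when budgeting outputs is sketched below.
It assumes that counts are padded up to the next multiple of the reported granularity, which is an interpretation of the rounding described above; `round_up_to_granularity` is a hypothetical helper name.

```c
#include <stdint.h>

/* Rounds `count` up to the next multiple of `granularity`.
 * A granularity of 0 is treated as "no rounding" for robustness. */
static uint64_t round_up_to_granularity(uint32_t count, uint32_t granularity)
{
    if (granularity == 0) {
        return count;
    }
    /* 64-bit arithmetic avoids overflow; division is used because the
     * granularity is not assumed to be a power of two. */
    return (((uint64_t)count + granularity - 1) / granularity) * granularity;
}
```
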
265The following properties are implementation preferences.
266Violating these limits will not result in validation errors, but it is strongly recommended that applications adhere to them in order to maximize performance on each implementation.
267
268 * `maxPreferredTaskWorkGroupInvocations` indicates the maximum preferred number of task shader invocations in a single workgroup.
269 * `maxPreferredMeshWorkGroupInvocations` indicates the maximum preferred number of mesh shader invocations in a single workgroup.
270 * If `prefersLocalInvocationVertexOutput` is `VK_TRUE`, the implementation will perform best when each invocation writes to an array index in the per-vertex output matching code:LocalInvocationIndex.
271 * If `prefersLocalInvocationPrimitiveOutput` is `VK_TRUE`, the implementation will perform best when each invocation writes to an array index in the per-primitive output matching code:LocalInvocationIndex.
272 * If `prefersCompactVertexOutput` is `VK_TRUE`, the implementation will perform best if there are no unused vertices in the output array.
273 * If `prefersCompactPrimitiveOutput` is `VK_TRUE`, the implementation will perform best if there are no unused primitives in the output array.
274
Note that where one of the above values is `VK_FALSE`, the implementation can still perform just as well whether or not the corresponding pattern is followed. It is recommended to follow the reported preferences unless the performance cost of doing so outweighs the gains of hitting the optimal paths in the implementation.
276
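A simple way to respect the preferred invocation counts is to clamp the application's desired workgroup size against both the hard limit and the preference.
The helper below is a sketch with hypothetical names; how an application trades off workgroup size against other factors is left open.

```c
#include <stdint.h>

/* Picks a workgroup invocation count no larger than either the hard limit
 * (for example maxMeshWorkGroupInvocations) or the implementation's preference
 * (for example maxPreferredMeshWorkGroupInvocations). */
static uint32_t choose_workgroup_invocations(uint32_t desired,
                                             uint32_t max_supported,
                                             uint32_t max_preferred)
{
    uint32_t cap = max_supported < max_preferred ? max_supported : max_preferred;
    return desired < cap ? desired : cap;
}
```

For instance, an application that would like 128 invocations on an implementation reporting a hard limit of 256 and a preference of 64 would end up using 64.
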
277
278==== Features
279
280A few new features are introduced by this extension:
281
282```c
283typedef struct VkPhysicalDeviceMeshShaderFeaturesEXT {
284    VkStructureType    sType;
285    void*              pNext;
286    VkBool32           taskShader;
287    VkBool32           meshShader;
288    VkBool32           multiviewMeshShader;
289    VkBool32           primitiveFragmentShadingRateMeshShader;
290    VkBool32           meshShaderQueries;
291} VkPhysicalDeviceMeshShaderFeaturesEXT;
292```
293
294 * `taskShader` indicates support for task shaders and associated features - if not supported, only mesh shaders can be used.
295 * `meshShader` indicates support for mesh shaders and associated features - if not supported, none of the features in this extension can be used.
296 * `multiviewMeshShader` indicates support for the use of multi-view with mesh shaders.
297 * `primitiveFragmentShadingRateMeshShader` indicates whether the per-primitive fragment shading rate can be written by mesh shaders when fragment shading rates are supported.
298 * `meshShaderQueries` indicates support for the new queries added by this extension.
299
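The dependency between the `meshShader` and `taskShader` features can be expressed as a small selection function.
This is a sketch only: `ExMeshFeatures`, `ExGeometryPath`, and `select_geometry_path` are hypothetical names, and falling back to the vertex pipeline is one possible application policy rather than something this extension mandates.

```c
#include <stdbool.h>

/* Stand-in for the two core feature bits (hypothetical type). */
typedef struct ExMeshFeatures {
    bool taskShader;
    bool meshShader;
} ExMeshFeatures;

typedef enum ExGeometryPath {
    EX_PATH_VERTEX,       /* classic vertex pipeline */
    EX_PATH_MESH_ONLY,    /* mesh shaders launched directly by draw calls */
    EX_PATH_TASK_AND_MESH /* task shaders launching mesh shaders */
} ExGeometryPath;

/* Chooses a rendering path from the reported features. */
static ExGeometryPath select_geometry_path(ExMeshFeatures features)
{
    if (!features.meshShader) {
        /* Without meshShader, nothing in this extension can be used,
         * regardless of what taskShader reports. */
        return EX_PATH_VERTEX;
    }
    return features.taskShader ? EX_PATH_TASK_AND_MESH : EX_PATH_MESH_ONLY;
}
```
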
300
301=== SPIR-V Changes
302
303One new capability is added gating all of the new functionality:
304
305```
306MeshShadingEXT
307```
308
309Two new execution models are added, corresponding to the two <<New Shaders>> added by this extension:
310
311```
312TaskEXT
313MeshEXT
314```
315
316Task shader output/mesh shader input payloads are declared in a new storage class:
317
318```
319TaskPayloadWorkgroupEXT
320```
321
Variables in this storage class are accessible by all invocations in a workgroup in a task shader, and are broadcast to all invocations in the mesh shader workgroups dispatched by the same task shader workgroup, where they are read-only.
323
324In task shaders, code:TaskPayloadWorkgroupEXT is a hybrid of code:Output and code:Workgroup storage classes. It supports all usual operations code:Workgroup supports, with the caveats of:
325
326  . No explicit memory layout support with `VK_KHR_workgroup_memory_explicit_layout`
327  . Can be declared independently of code:Workgroup, meaning local scratch workgroup memory can still be used with `VK_KHR_workgroup_memory_explicit_layout`
328  . Has two separate limits for size, `maxTaskPayloadSize` for its size in isolation, and `maxTaskPayloadAndSharedMemorySize` for the combined size
329
330Mesh shaders declare the type of primitive being output by way of three execution modes, two of which are introduced by this extension:
331
332```
333OutputPoints
334OutputLinesEXT
335OutputTrianglesEXT
336```
337
Mesh shaders declare the maximum number of vertices and primitives the shader will ever emit for the workgroup by way of two execution modes, one of which is introduced by this extension:
339
340```
341OutputVertices
342OutputPrimitivesEXT
343```
344
A new decoration is added for mesh shader outputs/fragment shader inputs to indicate per-primitive data rather than per-vertex data:
346
347```
348PerPrimitiveEXT
349```
350
351New per-primitive built-ins are added:
352
353```
354PrimitivePointIndicesEXT
355PrimitiveLineIndicesEXT
356PrimitiveTriangleIndicesEXT
357CullPrimitiveEXT
358```
359
360Each of the `Primitive*IndicesEXT` built-ins is used when the corresponding execution mode is specified, declared as scalars or vectors with a number of components equal to the number of vertices in the primitive type.
361`CullPrimitiveEXT` is a per-primitive boolean value indicating to the implementation that its corresponding primitive must not be rasterized and is instead discarded with no further processing once emitted.
362
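The relationship between the output primitive type and the shape of its index built-in can be tabulated in a few lines.
The C sketch below uses a hypothetical `ExOutputTopology` enumeration; it only restates that points, lines, and triangles use one, two, and three vertex indices per primitive respectively.

```c
#include <stdint.h>

/* Hypothetical enumeration of the three mesh shader output primitive types. */
typedef enum ExOutputTopology {
    EX_OUTPUT_POINTS,
    EX_OUTPUT_LINES,
    EX_OUTPUT_TRIANGLES
} ExOutputTopology;

/* Number of components in one element of the matching index built-in. */
static uint32_t indices_per_primitive(ExOutputTopology topology)
{
    switch (topology) {
    case EX_OUTPUT_POINTS:    return 1; /* scalar */
    case EX_OUTPUT_LINES:     return 2; /* two-component vector */
    case EX_OUTPUT_TRIANGLES: return 3; /* three-component vector */
    }
    return 0; /* unreachable for valid enumerants */
}

/* Total number of vertex indices written for `primitive_count` primitives. */
static uint64_t total_index_count(ExOutputTopology topology,
                                  uint32_t primitive_count)
{
    return (uint64_t)indices_per_primitive(topology) * primitive_count;
}
```
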
363A new instruction is added to task shaders to launch mesh shader workgroups:
364
365
366[cols="1,1,2,2,2*2",width="100%"]
367|=====
3685+|[[OpEmitMeshTasksEXT]]*OpEmitMeshTasksEXT* +
369 +
370Defines the grid size of subsequent mesh shader workgroups to generate
371upon completion of the task shader workgroup. +
372 +
373'Group Count X Y Z' must each be a 32-bit unsigned integer value.
They configure the number of local workgroups in each respective dimension
375for the launch of child mesh tasks. See Vulkan API specification for more detail. +
376 +
377'Payload' is an optional pointer to the payload structure to pass to the generated mesh shader invocations.
378'Payload' must be the result of an *OpVariable* with a storage class of *TaskPayloadWorkgroupEXT*. +
379 +
380The arguments are taken from the first invocation in each workgroup.
381Any invocation must execute this instruction exactly once and under uniform
382control flow.
383This instruction also serves as an *OpControlBarrier* instruction, and also
384performs and adheres to the description and semantics of an *OpControlBarrier*
385instruction with the 'Execution' and 'Memory' operands set to *Workgroup* and
386the 'Semantics' operand set to a combination of *WorkgroupMemory* and
387*AcquireRelease*.
388Ceases all further processing: Only instructions executed before
389*OpEmitMeshTasksEXT* have observable side effects. +
390 +
391This instruction must be the last instruction in a block. +
392 +
393This instruction is only valid in the *TaskEXT* Execution Model.
394|Capability: +
395*MeshShadingEXT*
396| 4 + variable | 5294 | '<id>' +
397'Group Count X' | '<id>' +
398'Group Count Y' | '<id>' +
399'Group Count Z' | Optional +
400'<id>' +
401'Payload'
402|=====
403
404A new mesh shader instruction is added to set the number of actual primitives and vertices that a mesh shader writes, avoiding unnecessary allocations or processing by the implementation:
405
406[cols="1,1,2*3",width="100%"]
407|=====
4083+|[[OpSetMeshOutputsEXT]]*OpSetMeshOutputsEXT* +
409 +
410Sets the actual output size of the primitives and vertices that the mesh shader
411workgroup will emit upon completion. +
412 +
413'Vertex Count' must be a 32-bit unsigned integer value.
414It defines the array size of per-vertex outputs. +
415 +
'Primitive Count' must be a 32-bit unsigned integer value.
417It defines the array size of per-primitive outputs. +
418 +
419The arguments are taken from the first invocation in each workgroup.
420Any invocation must execute this instruction no more than once and under
421uniform control flow.
422There must not be any control flow path to an output write that is not preceded
423by this instruction. +
424 +
425This instruction is only valid in the *MeshEXT* Execution Model.
426|Capability: +
427*MeshShadingEXT*
428| 3 | 5295 | '<id>' +
429'Vertex Count' | '<id>' +
430'Primitive Count'
431|=====
432
433This instruction must be called before writing to mesh shader outputs.
434
435
436=== GLSL Changes
437
Mesh shaders are defined in GLSL in the same way as compute shaders, with the addition of access to shader outputs normally available in vertex shaders and the following new features:
439
440```glsl
441out uint  gl_PrimitivePointIndicesEXT[];
442out uvec2 gl_PrimitiveLineIndicesEXT[];
443out uvec3 gl_PrimitiveTriangleIndicesEXT[];
444```
445
446These built-ins correspond to the identically named SPIR-V constructs, and are written in the same way.
447Applications should access only the index output corresponding to the primitive type declared by the following layout qualifiers:
448
449```glsl
450points
451lines
452triangles
453```
454
455Each layout qualifier is declared as `layout(<qualifier>) out;`.
456
457A new auxiliary storage qualifier can be added to interface variables to indicate that they are per-primitive rate:
458
459```glsl
460perprimitiveEXT
461```
462
463New write-only output blocks are defined for built-in output values from mesh shaders:
464
465```glsl
466out gl_MeshPerVertexEXT {
467  vec4  gl_Position;
468  float gl_PointSize;
469  float gl_ClipDistance[];
470  float gl_CullDistance[];
471} gl_MeshVerticesEXT[];
472
473perprimitiveEXT out gl_MeshPerPrimitiveEXT {
474  int  gl_PrimitiveID;
475  int  gl_Layer;
476  int  gl_ViewportIndex;
477  bool gl_CullPrimitiveEXT;
478  int  gl_PrimitiveShadingRateEXT;
479} gl_MeshPrimitivesEXT[];
480```
481
Note that some existing outputs that were previously associated with provoking vertices are now directly declared as per-primitive variables.
483
A new mesh shader function is also added:
485
486```glsl
487    void SetMeshOutputsEXT(uint vertexCount,
488                           uint primitiveCount)
489```
490
491This function maps exactly to the `OpSetMeshOutputsEXT` instruction - setting the number of valid vertices and primitives that are output by the mesh shader workgroup.
492
493Task shader payloads can be declared in task and mesh shaders using the new `taskPayloadSharedEXT` storage qualifier as follows:
494
495```glsl
496taskPayloadSharedEXT MyPayloadStruct {
497    ...
498} payload;
499```
500
501Finally a new function corresponding to `OpEmitMeshTasksEXT` is added to launch mesh workgroups:
502
503```glsl
504    void EmitMeshTasksEXT(uint groupCountX,
505                          uint groupCountY,
506                          uint groupCountZ)
507```
508
509
510=== HLSL Changes
511
512The HLSL specification for mesh shaders is defined by Microsoft® here: https://microsoft.github.io/DirectX-Specs/d3d/MeshShader.html.
513
514Everything in that specification should work directly as described, with the exception of linking per-primitive interface variables between pixel and mesh shaders.
515Microsoft defined the fragment/mesh interface to effectively be fixed up at link time - making no distinction between per-vertex and per-primitive variables in the pixel shader.
516This works okay with monolithic pipeline construction, but with the addition of things like link:{refpage}VK_EXT_graphics_pipeline_library.html[VK_EXT_graphics_pipeline_library], modifying this at link time would cause undesirable slowdown.
517As a result, the Vulkan version of this feature requires the `\[[vk::perprimitive]]` attribute on pixel shader inputs in order to generate a match with mesh shader outputs denoted with the `primitives` qualifier.
518
519Mapping to SPIR-V is largely performed identically to any other shader for both mesh and task shaders, with most new functionality mapping 1:1.
520One outlier is in index generation - the primitive index outputs are denoted by a variable in the function signature preceded by `out indices ...`.
521The HLSL compiler should map this variable to the appropriate built-in value based on the selected `outputtopology` qualifier.
522
523Another outlier is the groupshared task payload. In HLSL this is declared as groupshared, but must be declared in the code:TaskPayloadWorkgroupEXT storage class in SPIR-V.
524The call to `DispatchMesh()` can inform the compiler which groupshared variable to promote to code:TaskPayloadWorkgroupEXT.
525
526
527== Issues
528
529=== What are the differences to VK_NV_mesh_shader?
530
531The following changes have been made to the API:
532
533  * Drawing mesh tasks can now be done with a three-dimensional number of workgroups, rather than just one-dimensional.
534  * There are new device queries for the number of mesh primitives generated, and the number of shader invocations for the new shader stages.
535  * A new command token is added when interacting with VK_NV_device_generated_commands, as mesh shaders from each extension are incompatible.
536  * New optional features have been added for interactions with multiview, primitive fragment shading rate specification, and the new queries.
537  * Several more device properties are expressed to enable app developers to use mesh shaders optimally across vendors (see <<Properties>> for details of how these are expressed and used).
538
Note that the SPIR-V and GLSL expressions of these extensions have also changed, details of which are outlined in the corresponding SPIR-V and GLSL extensions.
540These changes aim to make the extension more portable across multiple vendors, and increase compatibility with the similar feature in Microsoft® DirectX®.
541
542=== What are the differences to DirectX® 12's Mesh shaders?
543
544From the shader side, declaring mesh or amplification shaders in HLSL will have no meaningful differences - HLSL code written for DirectX should also work fine in Vulkan, with all the expected limits and features available.
One difference is present in pixel shaders though - any user-declared attributes with the `primitives` keyword in the mesh shader will need to be declared in the pixel shader with the `\[[vk::perprimitive]]` attribute to facilitate linking.
546This makes it so that the shader can be compiled without modifying the input interface, which is particularly important for interactions with extensions like link:{refpage}VK_EXT_graphics_pipeline_library.html[VK_EXT_graphics_pipeline_library].
547
548Some amount of massaging by the HLSL compiler will be required to the shader interfaces as DirectX does linking by name rather than location between mesh and pixel shaders, but the requirement to use `\[[vk::perprimitive]]` allows the different attributes to continue using locations in Vulkan.
549
550The only notable difference on the API side is that Vulkan provides additional device properties that allow developers to tune their shaders to different vendors' fast paths, should they wish to.
How these are expressed is described <<Properties, here>>.
552
553=== Can there be more than one output payload in a task shader?
554
555There can only be one output payload per task shader; one declaration in HLSL or GLSL, and only one in the interface declaration for SPIR-V.
556More would have no effect anyway, as only one payload can be emitted for mesh shader consumption.
557
558=== Should developers port everything to mesh shading?
559
560Mesh shaders are not necessarily a performance win compared to the existing pipeline - their purpose is to offer greater flexibility at decent performance, but this flexibility may come at a cost, and that cost is likely platform dependent.
561What task and mesh shading offer is a way to perform novel techniques efficiently compared to the hoops developers would previously have to jump through.
562Task and mesh shaders are a tool that should be used when it makes sense to do so - if a developer has a novel technique that would be easier to implement using task and mesh shaders, then they are likely the appropriate tool.
563Moving from an existing optimized pipeline without this consideration may lead to decreased performance.
564
565=== Does vertex input interface state interact with task/mesh shaders?
566
No, topology information is specified within the mesh shader, and data must be read or generated by these shader stages programmatically.
568