I recently added basic support for mesh shaders to GeeXLab, and here is an overview of mesh shaders for OpenGL and Vulkan developers, based on the following articles published on the GeeXLab blog:

- RGB Triangle with Mesh Shaders in OpenGL
- RGB Triangle with Mesh Shaders in Vulkan
- Textured Quad with Mesh Shaders in OpenGL and Vulkan
- Meshlets and Mesh Shaders (Vulkan)

**Mesh shaders** are a feature introduced with NVIDIA Turing GPUs. The **mesh shading pipeline** replaces the regular VTG pipeline (VTG = Vertex / Tessellation / Geometry).

This illustration from NVIDIA shows the mesh shading pipeline versus regular VTG pipeline:

In a few words, a mesh shader program can be seen as the association of a compute-like shader and a fragment shader. Why compute-like? Because, as with compute shaders, you can set the number of threads per work group and use synchronization functions like `barrier()`. Mesh shaders are compute-like shaders specialized for graphics tasks.

The mesh shading pipeline adds two new shader stages: the **task shader** and the **mesh shader**. I haven’t played with the task shader yet, only with the mesh shader. But basically, a task shader generates work for mesh shaders, while a mesh shader generates primitives (points, lines or triangles). A mesh shading pipeline can consist of a task shader, a mesh shader and a pixel shader, or of a mesh shader and a pixel shader only (the task shader is an optional stage).

Mesh shaders are available in **OpenGL**, **Vulkan** and DirectX 12 Ultimate (DirectX 12 is not covered in this article).

The nice thing is that the mesh shader has no predefined inputs. You can, for example, generate primitives *ex nihilo*: in that case there is no input and the primitives (points, lines or triangles) are generated procedurally. If you want the mesh shader to work on existing data, GPU buffers (uniform, storage) and textures are the way to pass that data to the mesh shader.

Adding basic mesh shader support to an existing engine is simple. All you need is to add the new shader stages (OpenGL: **GL_MESH_SHADER_NV** and **GL_TASK_SHADER_NV**, used with glCreateShader; Vulkan: **VK_SHADER_STAGE_MESH_BIT_NV** and **VK_SHADER_STAGE_TASK_BIT_NV**, used in VkPipelineShaderStageCreateInfo) and a function to launch the execution of mesh shaders (**glDrawMeshTasksNV** in OpenGL, **vkCmdDrawMeshTasksNV** in Vulkan).

Currently, mesh shaders are only supported by **NVIDIA Turing** GPUs (GeForce RTX 20 Series, GeForce GTX 16 Series). According to some reports, **AMD RDNA2** GPUs will support mesh shaders too.

You can check for mesh shader support by looking for the **GL_NV_mesh_shader** extension in OpenGL or the **VK_NV_mesh_shader** device extension in Vulkan.
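As a rough illustration of that check, here is a minimal C++ sketch that searches a list of extension names. The helper names `has_extension` and `supports_nv_mesh_shader` are made up for this example. The list itself is assumed to have been filled by the application, for instance with `glGetStringi(GL_EXTENSIONS, i)` in OpenGL or `vkEnumerateDeviceExtensionProperties` in Vulkan.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true when `wanted` appears in the list of extension names.
// The list is assumed to have been queried from the API beforehand.
bool has_extension(const std::vector<std::string>& extensions,
                   const std::string& wanted)
{
    return std::find(extensions.begin(), extensions.end(), wanted) != extensions.end();
}

// Hypothetical helper: true when either NVIDIA mesh shader extension is listed
// (GL_NV_mesh_shader for OpenGL, VK_NV_mesh_shader for Vulkan).
bool supports_nv_mesh_shader(const std::vector<std::string>& extensions)
{
    return has_extension(extensions, "GL_NV_mesh_shader") ||
           has_extension(extensions, "VK_NV_mesh_shader");
}
```

The comparison is done on whole names, so an extension whose name merely starts with `GL_NV_mesh_shader` does not count as a match.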

Let’s see a very simple GPU program, made up of a mesh shader and a pixel shader, that takes no input and generates an RGB triangle. The following mesh shader comes from the RGB Triangle sample available here:

– Triangle Mesh Shader in OpenGL

– Triangle Mesh Shader in Vulkan

The main objective of the mesh shader is to fill the following built-in output variables:

– **gl_MeshVerticesNV**: vertices array. A triangle has 3 vertices.

– **gl_PrimitiveIndicesNV**: indices array. A triangle has 3 indices, one per vertex.

– **gl_PrimitiveCountNV**: number of primitives. A triangle is one primitive made up of three vertices.

```
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

// Custom vertex output block
layout (location = 0) out PerVertexData
{
    vec4 color;
} v_out[]; // [max_vertices]

const vec3 vertices[3] = {vec3(-1,-1,0), vec3(0,1,0), vec3(1,-1,0)};
const vec3 colors[3] = {vec3(1.0,0.0,0.0), vec3(0.0,1.0,0.0), vec3(0.0,0.0,1.0)};

void main()
{
    // Vertices position
    gl_MeshVerticesNV[0].gl_Position = vec4(vertices[0], 1.0);
    gl_MeshVerticesNV[1].gl_Position = vec4(vertices[1], 1.0);
    gl_MeshVerticesNV[2].gl_Position = vec4(vertices[2], 1.0);

    // Vertices color
    v_out[0].color = vec4(colors[0], 1.0);
    v_out[1].color = vec4(colors[1], 1.0);
    v_out[2].color = vec4(colors[2], 1.0);

    // Triangle indices
    gl_PrimitiveIndicesNV[0] = 0;
    gl_PrimitiveIndicesNV[1] = 1;
    gl_PrimitiveIndicesNV[2] = 2;

    // Number of triangles
    gl_PrimitiveCountNV = 1;
}
```

The pixel shader:

```
#version 450

layout(location = 0) out vec4 FragColor;

// Input block: same name and location as the mesh shader output block
layout(location = 0) in PerVertexData
{
    vec4 color;
} fragIn;

void main()
{
    FragColor = fragIn.color;
}
```

A mesh shader is limited in the number of vertices and primitives it can generate. In OpenGL, you can read these two hardware limits with **GL_MAX_MESH_OUTPUT_VERTICES_NV** and **GL_MAX_MESH_OUTPUT_PRIMITIVES_NV**:

```
GLint max_vertices = 0, max_primitives = 0;
glGetIntegerv(GL_MAX_MESH_OUTPUT_VERTICES_NV, &max_vertices);
glGetIntegerv(GL_MAX_MESH_OUTPUT_PRIMITIVES_NV, &max_primitives);
```

In Vulkan, you have to read the following members of the **VkPhysicalDeviceMeshShaderPropertiesNV** structure:

– maxMeshOutputVertices

– maxMeshOutputPrimitives

Here is a dump of the entire VkPhysicalDeviceMeshShaderPropertiesNV structure for my GeForce RTX 2070 with the latest driver (R445.98):

– maxDrawMeshTasksCount => 65535

– maxTaskWorkGroupInvocations => 32

– maxTaskWorkGroupSize => [32;1;1]

– maxTaskTotalMemorySize => 16384

– maxTaskOutputCount => 65535

– maxMeshWorkGroupInvocations => 32

– maxMeshWorkGroupSize => [32;1;1]

– maxMeshTotalMemorySize => 16384

– maxMeshOutputVertices => 256

– maxMeshOutputPrimitives => 512

– maxMeshMultiviewViewCount => 4

– meshOutputPerVertexGranularity => 32

– meshOutputPerPrimitiveGranularity => 32

To launch the previous GPU program, just call **glDrawMeshTasksNV** in OpenGL or **vkCmdDrawMeshTasksNV** in Vulkan. In OpenGL, the GPU program must be bound first; in Vulkan, a pipeline built with the GPU program must be bound first.

OpenGL:

```
glUseProgram(mesh_prog);
unsigned int num_workgroups = 1;
glDrawMeshTasksNV(0, num_workgroups);
```

Vulkan:

```
vkCmdBindPipeline(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);
uint32_t num_workgroups = 1;
vkCmdDrawMeshTasksNV(cmdbuf, num_workgroups, 0);
```

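Too bad that the parameters are not in the same order in OpenGL and Vulkan: OpenGL takes the first task and then the task count, while Vulkan takes the task count and then the first task!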

In the previous mesh shader, one thread per work group has been set:

```
layout(local_size_x = 1) in;
```

We can set more threads per work group, up to a maximum of **32** (32 is the size of a **warp** on NVIDIA GPUs; more about warps can be found in this article). This value is the first component of maxMeshWorkGroupSize (Vulkan) or GL_MAX_MESH_WORK_GROUP_SIZE_NV (OpenGL).

For the triangle, we can set the number of threads to 3 (one thread per vertex). In that case the mesh shader can be re-written in a more compact way:

```
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x=3) in;
layout(max_vertices=3, max_primitives=1) out;
layout(triangles) out;

// Same output block (and location) as in the first version
layout (location = 0) out PerVertexData
{
    vec4 color;
} v_out[];

const vec3 vertices[3] = {vec3(-1,-1,0), vec3(0,1,0), vec3(1,-1,0)};
const vec3 colors[3] = {vec3(1.0,0.0,0.0), vec3(0.0,1.0,0.0), vec3(0.0,0.0,1.0)};

void main()
{
    // Each of the 3 threads writes one vertex and one index
    uint thread_id = gl_LocalInvocationID.x;
    gl_MeshVerticesNV[thread_id].gl_Position = vec4(vertices[thread_id], 1.0);
    gl_PrimitiveIndicesNV[thread_id] = thread_id;
    v_out[thread_id].color = vec4(colors[thread_id], 1.0);
    gl_PrimitiveCountNV = 1;
}
```

Let’s quickly talk about meshlets.

As said previously, a mesh shader can output (send to the rasterizer) only a limited number of primitives (points, lines or triangles). For example, on a GeForce RTX 2070, the mesh shader can output a maximum of 256 vertices and 512 primitives.

When the primitive mode is set to triangles, the output of a mesh shader is always a small mesh, called a **meshlet**. A single triangle is the smallest meshlet.

With these limitations (max output vertices and max output primitives), how can we process/render an existing big mesh with a mesh shader?

A way to render an existing mesh with a mesh shader is to decompose the mesh into multiple meshlets, each meshlet being processed by a work group.

Here is a more detailed definition of a meshlet (source):

What exactly is a Meshlet?

A meshlet is a subset of a mesh created through an intentional partition of the geometry. Meshlets should be somewhere in the range of 32 to around 200 vertices, depending on the number of attributes, and will have as many shared vertices as possible to allow for vertex re-use during rendering. This partitioning will be pre-computed and stored with the geometry to avoid computation at runtime, unlike the current Input Assembler which must attempt to dynamically identify vertex reuse every time a mesh is drawn. Titles can convert meshlets into regular index buffers for vertex shader fallback if a device does not support Mesh Shaders.

Here is a simple technique for creating meshlets from a single mesh (source):

So, a quite viable strategy for creating meshlets is: just scan the index buffer linearly, accumulating the set of vertices used, until you hit either 64 vertices or 126 triangles; reset and repeat until you’ve gone through the whole mesh. This could be done at art build time, or it’s simple enough that you could even do it in the engine at level load time.

A possible structure for a meshlet is:

```
struct Meshlet
{
    uint32_t vertices[64];
    uint32_t indices[378]; // 126 triangles => 378 indices
    uint32_t vertex_count;
    uint32_t index_count;
};
```
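The linear-scan strategy quoted above can be combined with this structure roughly as follows. This is only a sketch under a few assumptions: the mesh is a triangle list, `vertices[]` stores indices into the mesh vertex buffer, and `indices[]` stores local indices into `vertices[]` (which is how the meshlet shader in this article reads them). The names `build_meshlets` and `find_vertex` are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxVertices = 64;   // vertex limit per meshlet
constexpr uint32_t kMaxIndices  = 378;  // 126 triangles * 3 indices

struct Meshlet
{
    uint32_t vertices[kMaxVertices]; // indices into the mesh vertex buffer
    uint32_t indices[kMaxIndices];   // local indices into vertices[]
    uint32_t vertex_count;
    uint32_t index_count;
};

// Tightly packed 32-bit values, like a std430 struct made of uint arrays.
static_assert(sizeof(Meshlet) == (64 + 378 + 2) * 4, "unexpected padding");

// Returns the local slot of mesh vertex v in m, or m.vertex_count if absent.
static uint32_t find_vertex(const Meshlet& m, uint32_t v)
{
    for (uint32_t i = 0; i < m.vertex_count; ++i)
        if (m.vertices[i] == v) return i;
    return m.vertex_count;
}

// Scans the index buffer linearly and starts a new meshlet whenever the next
// triangle would exceed 64 unique vertices or 126 triangles.
std::vector<Meshlet> build_meshlets(const std::vector<uint32_t>& index_buffer)
{
    std::vector<Meshlet> meshlets;
    Meshlet current{}; // zero-initialized

    for (std::size_t t = 0; t + 2 < index_buffer.size(); t += 3)
    {
        const uint32_t* tri = &index_buffer[t];

        // Count the vertices this triangle would add to the current meshlet.
        uint32_t needed = 0;
        for (int c = 0; c < 3; ++c)
        {
            bool seen = find_vertex(current, tri[c]) < current.vertex_count;
            for (int p = 0; p < c; ++p)
                seen = seen || (tri[p] == tri[c]); // repeated corner
            if (!seen) ++needed;
        }

        // Flush the current meshlet when the triangle does not fit.
        if (current.vertex_count + needed > kMaxVertices ||
            current.index_count + 3 > kMaxIndices)
        {
            meshlets.push_back(current);
            current = Meshlet{};
        }

        // Append the triangle, reusing vertices already in the meshlet.
        for (int c = 0; c < 3; ++c)
        {
            const uint32_t slot = find_vertex(current, tri[c]);
            if (slot == current.vertex_count)
                current.vertices[current.vertex_count++] = tri[c];
            current.indices[current.index_count++] = slot;
        }
    }

    if (current.index_count > 0)
        meshlets.push_back(current);
    return meshlets;
}
```

For a quad made of two triangles, `build_meshlets({0, 1, 2, 2, 1, 3})` gives a single meshlet with 4 vertices and 6 indices. The linear search keeps the sketch short; a real tool would probably use a lookup table and a smarter partitioning.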

In this article, NVIDIA recommends a maximum of 64 vertices and 126 primitives (or 3*126 = 378 indices):

We recommend using up to 64 vertices and 126 primitives. The ‘6’ in 126 is not a typo. The first generation hardware allocates primitive indices in 128 byte granularity and needs to reserve 4 bytes for the primitive count. Therefore 3 * 126 + 4 maximizes the fit into a 3 * 128 = 384 bytes block. Going beyond 126 triangles would allocate the next 128 bytes. 84 and 40 are other maxima that work well for triangles.
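The arithmetic in this quote can be checked with a few lines of C++. The function below is a simplified model based only on what the quote states (one byte per index, 4 bytes for the primitive count, 128-byte allocation granularity); the name `primitive_index_bytes` is made up, and this is not an official formula.

```cpp
#include <cstdint>

// Bytes allocated for the primitive indices of a triangle meshlet, assuming
// 1 byte per index, 4 reserved bytes and 128-byte allocation granularity.
constexpr uint32_t primitive_index_bytes(uint32_t triangle_count)
{
    return (3u * triangle_count + 4u + 127u) / 128u * 128u;
}

// 126 triangles: 3 * 126 + 4 = 382 bytes, which fits in 3 * 128 = 384 bytes.
static_assert(primitive_index_bytes(126) == 384, "126 triangles fit in 384 bytes");
// 127 triangles: 385 bytes, which spills into a fourth 128-byte block.
static_assert(primitive_index_bytes(127) == 512, "127 triangles need 512 bytes");
// 84 triangles: 3 * 84 + 4 = 256 bytes, exactly two blocks.
static_assert(primitive_index_bytes(84) == 256, "84 triangles fill 256 bytes");
// 40 triangles: 124 bytes, a single block.
static_assert(primitive_index_bytes(40) == 128, "40 triangles fit in 128 bytes");
```

Note that this simple model would also let 41 triangles fit in one block (127 bytes), so it illustrates the reasoning behind 126 and 84 rather than describing the hardware exactly.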

Here is a mesh shader that handles meshlets:

```
#version 450
#extension GL_NV_mesh_shader : require

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(triangles, max_vertices = 64, max_primitives = 126) out;

//-------------------------------------
// transform_ub: Uniform buffer for transformations
//
layout (std140, binding = 0) uniform uniforms_t
{
    mat4 ViewProjectionMatrix;
    mat4 ModelMatrix;
} transform_ub;

//-------------------------------------
// vb: storage buffer for vertices.
//
struct s_vertex
{
    vec4 position;
    vec4 color;
};

layout (std430, binding = 1) buffer _vertices
{
    s_vertex vertices[];
} vb;

//-------------------------------------
// mbuf: storage buffer for meshlets.
//
struct s_meshlet
{
    uint vertices[64];
    uint indices[378]; // up to 126 triangles
    uint vertex_count;
    uint index_count;
};

layout (std430, binding = 2) buffer _meshlets
{
    s_meshlet meshlets[];
} mbuf;

//-------------------------------------
// Mesh shader output block.
//
layout (location = 0) out PerVertexData
{
    vec4 color;
} v_out[]; // [max_vertices]

//-------------------------------------
// Color table for drawing each meshlet with a different color.
//
#define MAX_COLORS 10
vec3 meshletcolors[MAX_COLORS] = {
    vec3(1,0,0),
    vec3(0,1,0),
    vec3(0,0,1),
    vec3(1,1,0),
    vec3(1,0,1),
    vec3(0,1,1),
    vec3(1,0.5,0),
    vec3(0.5,1,0),
    vec3(0,0.5,1),
    vec3(1,1,1)
};

void main()
{
    uint mi = gl_WorkGroupID.x;
    uint thread_id = gl_LocalInvocationID.x;

    // Transform and output the vertices of meshlet mi
    uint vertex_count = mbuf.meshlets[mi].vertex_count;
    for (uint i = 0; i < vertex_count; ++i)
    {
        uint vi = mbuf.meshlets[mi].vertices[i];
        vec4 Pw = transform_ub.ModelMatrix * vb.vertices[vi].position;
        vec4 P = transform_ub.ViewProjectionMatrix * Pw;

        // GL->VK conventions...
        P.y = -P.y; P.z = (P.z + P.w) / 2.0;

        gl_MeshVerticesNV[i].gl_Position = P;
        v_out[i].color = vb.vertices[vi].color * vec4(meshletcolors[mi%MAX_COLORS], 1.0);
    }

    // Output the triangle indices of meshlet mi
    uint index_count = mbuf.meshlets[mi].index_count;
    gl_PrimitiveCountNV = uint(index_count) / 3;
    for (uint i = 0; i < index_count; ++i)
    {
        gl_PrimitiveIndicesNV[i] = uint(mbuf.meshlets[mi].indices[i]);
    }
}
```
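The shader above uses a single thread per work group, so that thread loops over every vertex and index of the meshlet. With more threads per work group (up to 32, as discussed earlier), one common pattern is a strided loop in which thread `t` handles items `t`, `t + 32`, `t + 64`, and so on. The C++ sketch below only simulates that distribution on the CPU; `items_for_thread` is a hypothetical helper, not part of any API. In GLSL, the equivalent loop would start at `gl_LocalInvocationID.x` and step by the work group size.

```cpp
#include <cstdint>
#include <vector>

// CPU-side simulation of a strided loop: returns the item indices that
// thread `thread_id` would process when `group_size` threads share
// `item_count` items (vertices or indices of a meshlet, for example).
std::vector<uint32_t> items_for_thread(uint32_t thread_id,
                                       uint32_t group_size,
                                       uint32_t item_count)
{
    std::vector<uint32_t> items;
    if (group_size == 0)
        return items; // avoid an endless loop on invalid input
    for (uint32_t i = thread_id; i < item_count; i += group_size)
        items.push_back(i);
    return items;
}
```

With 32 threads and a 64-vertex meshlet, every thread gets exactly two vertices (thread 0 gets 0 and 32, thread 31 gets 31 and 63), so the per-thread loop runs twice instead of 64 times.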

Each meshlet is rendered by one work group. So if you have `num_meshlets` meshlets to render, the draw call could be:

```
vkCmdBindPipeline(cmdbuf, VK_PIPELINE_BIND_POINT_GRAPHICS, mesh_pipeline);
uint32_t num_workgroups = num_meshlets;
vkCmdDrawMeshTasksNV(cmdbuf, num_workgroups, 0);
```

The previous mesh shader comes from this Vulkan demo:

Turing GPUs are limited to 65535 mesh tasks per draw call (OpenGL: GL_MAX_DRAW_MESH_TASKS_COUNT_NV, Vulkan: maxDrawMeshTasksCount). In his Framework 4, Humus has a way to render more than 65535 meshlets:

```
// MaxDrawMeshTasksCount is currently set very low in NVIDIA drivers,
// only 65535, so we may have to issue multiple calls if count is larger than that.
const uint max_count = device->MaxDrawMeshTasksCount;
while (count > max_count)
{
    vkCmdDrawMeshTasksNV(commandBuffer, max_count, start);
    start += max_count;
    count -= max_count;
}
vkCmdDrawMeshTasksNV(commandBuffer, count, start);
```
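To see which draw calls this loop produces without a Vulkan device, the same splitting logic can be written as a pure function. This is a sketch, not Humus's code: `DrawRange` and `split_mesh_draws` are made-up names, and unlike the loop above it skips the final call when nothing is left to draw.

```cpp
#include <cstdint>
#include <vector>

// One draw call: the two values passed to vkCmdDrawMeshTasksNV after the
// command buffer (task count first, then first task).
struct DrawRange
{
    uint32_t task_count;
    uint32_t first_task;
};

// Splits the tasks [start, start + count) into ranges of at most
// max_count tasks each.
std::vector<DrawRange> split_mesh_draws(uint32_t start, uint32_t count,
                                        uint32_t max_count)
{
    std::vector<DrawRange> ranges;
    if (max_count == 0)
        return ranges; // nothing can be drawn with a zero limit
    while (count > 0)
    {
        const uint32_t n = count < max_count ? count : max_count;
        ranges.push_back({n, start});
        start += n;
        count -= n;
    }
    return ranges;
}
```

For 150000 meshlets and a limit of 65535, this gives three ranges: 65535 tasks starting at 0, 65535 tasks starting at 65535, and 18930 tasks starting at 131070. Each range would then be passed to one `vkCmdDrawMeshTasksNV` call.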

I will update this article with a simple example of a task shader as soon as possible...

### References

- NVIDIA - Introduction to Turing Mesh Shaders
- MESH SHADERS IN TURING (37-page PDF)
- Meshlet generation
- Metaballs 2 in Vulkan
- Mesh Shader Possibilities
- Turing Mesh Shaders (6-page PDF)
- GL_NV_mesh_shader extension
- GLSL_NV_mesh_shader extension
- GL_NV_mesh_shader Simple Mesh Shader Example
- Skeletal Animation Optimization Using Mesh Shaders (68-page PDF)
- DirectX 12— Mesh Shaders and Amplification Shaders: Reinventing the Geometry Pipeline

If you have other links on mesh shaders, post them in comments, I will update this list of references.