{"id":11828,"date":"2018-09-17T06:00:19","date_gmt":"2018-09-17T13:00:19","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/?p=11828"},"modified":"2023-08-02T08:21:12","modified_gmt":"2023-08-02T15:21:12","slug":"introduction-turing-mesh-shaders","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/introduction-turing-mesh-shaders\/","title":{"rendered":"Introduction to Turing Mesh Shaders"},"content":{"rendered":"<p>The Turing architecture introduces a new programmable geometric shading pipeline through the use of <strong class=\"asterisk\">mesh shaders<\/strong>. The new shaders bring the compute programming model to the graphics pipeline as threads are used cooperatively to generate compact meshes (<strong class=\"asterisk\">meshlets<\/strong>) directly on the chip for consumption by the rasterizer. Applications and games dealing with high-geometric complexity benefit from the flexibility of the two-stage approach, which allows efficient culling, level-of-detail techniques as well as procedural generation.<\/p>\n<p>This blog introduces the new pipeline and gives some concrete examples in GLSL for OpenGL or Vulkan rendering. 
The new capabilities are accessible through extensions in OpenGL and Vulkan, and using <a href=\"https:\/\/devblogs.microsoft.com\/directx\/announcing-directx-12-ultimate\/\">DirectX 12 Ultimate<\/a>.<\/p>\n<p>Most of the following content is taken from this <a href=\"http:\/\/on-demand.gputechconf.com\/siggraph\/2018\/video\/sig1811-3-christoph-kubisch-mesh-shaders.html\">recorded presentation<\/a>, for which the full slide deck will be available at a later date.<\/p>\n<div class=\"longTOC\">\n<p><a class=\"level1\" href=\"#toc2\"><span class=\"tocNumber\">1\u00a0 <\/span>Mesh Shading Pipeline<\/a><br \/>\n<a class=\"level1\" href=\"#toc3\"><span class=\"tocNumber\">2\u00a0 <\/span>Meshlets and Mesh Shading<\/a><br \/>\n<a class=\"level1\" href=\"#toc4\"><span class=\"tocNumber\">3\u00a0 <\/span>Pre-Computed Meshlets<\/a><br \/>\n<a class=\"level2\" href=\"#toc4.1\"><span class=\"tocNumber\">3.1\u00a0 <\/span>Data Structures<\/a><br \/>\n<a class=\"level2\" href=\"#toc4.2\"><span class=\"tocNumber\">3.2\u00a0 <\/span>Rendering Resources and Data Flow<\/a><br \/>\n<a class=\"level2\" href=\"#toc4.3\"><span class=\"tocNumber\">3.3\u00a0 <\/span>Cluster Culling with Task Shader<\/a><br \/>\n<a class=\"level1\" href=\"#toc5\"><span class=\"tocNumber\">4\u00a0 <\/span>Conclusion<\/a><br \/>\n<a class=\"level1\" href=\"#toc6\"><span class=\"tocNumber\">5\u00a0 <\/span>References<\/a><\/p>\n<\/div>\n<h1>Motivation<\/h1>\n<p>The real world is a visually rich, geometrically complex place. Outdoor scenes in particular can be composed of hundreds of thousands of elements (rocks, trees, small plants, etc.). CAD models present similar challenges with both complex shaped surfaces as well as machinery made of many small parts. In visual effects large structures, for example spaceships, are often detailed with &#8220;greebles&#8221;. 
Figure 1 shows several examples where today\u2019s graphics pipeline with vertex, tessellation, and geometry shaders, instancing and multi draw indirect, while very effective, can still be limited when the full resolution geometry reaches hundreds of millions of triangles and hundreds of thousands of objects.<\/p>\n<figure id=\"attachment_11837\" aria-describedby=\"caption-attachment-11837\" style=\"width: 625px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_motivation.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11837 size-large\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_motivation-625x402.png\" alt=\"Turing mesh shaders geometric complexity motivation\" width=\"625\" height=\"402\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-625x402.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-179x115.png 179w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-300x193.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-768x494.png 768w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-466x300.png 466w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-140x90.png 140w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-362x233.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation-171x110.png 171w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_motivation.png 892w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption id=\"caption-attachment-11837\" class=\"wp-caption-text\">Figure 1. 
The need for increasing realism drives massive increases in geometric complexity.<\/figcaption><\/figure>\n<p>Other use-cases not shown above include geometries found in scientific computing (particles, glyphs, proxy objects, point clouds) or procedural shapes (electrical engineering layouts, VFX particles, ribbons and trails, path rendering).<\/p>\n<p>In this post we look at <em class=\"underscore\">mesh shaders<\/em> to accelerate rendering of heavy triangle meshes. The original mesh is segmented into smaller <strong class=\"asterisk\">meshlets<\/strong> as figure 2 shows. Each meshlet ideally optimizes the vertex re-use within it. Using the new hardware stages and this segmentation scheme, we can render more geometry in parallel while fetching less overall data.<\/p>\n<table class=\"aligncenter\" style=\"width: 100%;border-collapse: collapse\" border=\"1\">\n<caption>Figure 2. Large meshes can be decomposed into meshlets, which are rendered by mesh shaders.<\/caption>\n<tbody>\n<tr>\n<td style=\"width: 50%\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_sample.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-11841\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_sample.png\" alt=\"\" width=\"500\" height=\"437\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample-300x262.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample-343x300.png 343w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample-103x90.png 103w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample-362x316.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_sample-126x110.png 126w\" sizes=\"auto, (max-width: 
500px) 100vw, 500px\" \/><\/a><\/td>\n<td style=\"width: 50%\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_bunny.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-11833\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_bunny.png\" alt=\"\" width=\"376\" height=\"435\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny.png 376w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny-259x300.png 259w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny-78x90.png 78w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny-362x419.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny-95x110.png 95w\" sizes=\"auto, (max-width: 376px) 100vw, 376px\" \/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For example, CAD data can reach tens to hundreds of millions of triangles. Even after <a href=\"https:\/\/github.com\/nvpro-samples\/gl_occlusion_culling\">occlusion culling<\/a>\u00a0a significant number of triangles can remain. Some fixed-function steps in the pipeline may do wasteful work and memory loads in this scenario:<\/p>\n<ul>\n<li class=\"minus\">Vertex batch creation by the hardware&#8217;s <em class=\"underscore\">primitive distributor<\/em> scanning the index buffer each time, even if the topology doesn&#8217;t change<\/li>\n<li class=\"minus\">Vertex and attribute fetch for data that is not visible (backface, frustum, or sub-pixel culling)<\/li>\n<\/ul>\n<p>The <em class=\"underscore\">mesh shader<\/em> gives developers new possibilities to avoid such bottlenecks. 
The new approach allows the memory to be read once and kept on-chip as opposed to previous approaches, such as <em class=\"underscore\">compute shader<\/em>-based primitive culling (see [3],[4],[5]), where index buffers of visible triangles are computed and drawn indirectly.<\/p>\n<p>The mesh shader stage produces triangles for the rasterizer, but uses a cooperative thread model internally instead of using a single-thread program model, similar to compute shaders. Ahead of the mesh shader in the pipeline is the task shader. The task shader operates similarly to the control stage of tessellation, in that it is able to dynamically generate work. However, like the mesh shader, it uses a cooperative thread model and instead of having to take a patch as input and tessellation decisions as output, its input and output are user defined.<\/p>\n<p>This simplifies on-chip geometry creation compared to the previous rigid and limited tessellation and geometry shaders, where threads had to be used for specific tasks only, as shown in figure 3.<\/p>\n<figure id=\"attachment_11834\" aria-describedby=\"caption-attachment-11834\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_comparison.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11834 size-full-page-width\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_comparison-1024x524.png\" alt=\"\" width=\"1024\" height=\"524\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-1024x524.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-300x154.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-768x393.png 768w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-625x320.png 625w, 
https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-500x256.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-160x82.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-362x185.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison-215x110.png 215w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_comparison.png 1066w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption id=\"caption-attachment-11834\" class=\"wp-caption-text\">Figure 3. Mesh shaders represent the next step in handling geometric complexity<\/figcaption><\/figure>\n<p><a class=\"target\" name=\"meshshadingpipeline\"><\/a>\u00a0<a class=\"target\" name=\"toc2\"><\/a><\/p>\n<h1>Mesh Shading Pipeline<\/h1>\n<p>A new, two-stage pipeline alternative supplements the classic attribute fetch, <em class=\"underscore\">vertex, tessellation, geometry shader<\/em> pipeline. This new pipeline consists of a <em>task shader<\/em> and\u00a0<em>mesh shader:<\/em><\/p>\n<ul>\n<li class=\"minus\"><strong class=\"asterisk\">Task shader<\/strong>: a programmable unit that operates in workgroups and allows each to emit (or not) mesh shader workgroups<\/li>\n<li class=\"minus\"><strong class=\"asterisk\">Mesh shader<\/strong>: a programmable unit that operates in workgroups and allows each to generate primitives<\/li>\n<\/ul>\n<p>The mesh shader stage produces triangles for the rasterizer using the above-mentioned cooperative thread model internally. The task shader operates similarly to the hull shader stage of tessellation, in that it is able to dynamically generate work. However, like the mesh shader, the task shader also uses a cooperative thread model. 
Its input and output are user defined instead of having to take a patch as input and tessellation decisions as output.<\/p>\n<p>The interfacing with the <em class=\"underscore\">pixel\/fragment shader<\/em> is unaffected. The traditional pipeline is still available and can provide very good results depending on the use-case. Figure 4 highlights the differences in the pipeline styles.<\/p>\n<figure id=\"attachment_11838\" aria-describedby=\"caption-attachment-11838\" style=\"width: 677px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11838 size-full\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline.png\" alt=\"\" width=\"677\" height=\"348\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline.png 677w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-300x154.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-625x321.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-500x257.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-160x82.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-362x186.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline-214x110.png 214w\" sizes=\"auto, (max-width: 677px) 100vw, 677px\" \/><\/a><figcaption id=\"caption-attachment-11838\" class=\"wp-caption-text\">Figure 4. 
Differences in the traditional versus task\/mesh geometry pipeline<\/figcaption><\/figure>\n<p>The new mesh shader pipeline provides a number of benefits for developers:<\/p>\n<ul>\n<li class=\"minus\"><strong class=\"asterisk\">Higher scalability<\/strong> through shader units by reducing fixed-function impact in primitive processing. The general-purpose use of modern GPUs across a growing variety of applications drives the addition of more cores and improvements to the shaders&#8217; generic memory and arithmetic performance.<\/li>\n<li class=\"minus\"><strong class=\"asterisk\">Bandwidth reduction<\/strong>, as de-duplication of vertices (vertex re-use) can be done upfront and reused over many frames. The current API model means the index buffers have to be scanned by the hardware every time. Larger meshlets mean higher vertex re-use, also lowering bandwidth requirements. Furthermore, developers can come up with their own compression or procedural generation schemes.<br \/>\nThe optional expansion\/filtering via <em class=\"underscore\">task shaders<\/em> allows skipping the fetch of more data entirely.<\/li>\n<li class=\"minus\"><strong class=\"asterisk\">Flexibility<\/strong> in defining the mesh topology and creating graphics work. The previous <em class=\"underscore\">tessellation shaders<\/em> were limited to fixed tessellation patterns while\u00a0<em class=\"underscore\">geometry shaders<\/em> suffered from an inefficient, threading-unfriendly programming model that created triangle strips per thread.<\/li>\n<\/ul>\n<p>Mesh shading follows the programming model of <em class=\"underscore\">compute shaders<\/em>, giving developers the freedom to use threads for different purposes and share data among them. 
When rasterization is disabled, the two stages can also be used to do generic compute work with one level of expansion.<\/p>\n<figure id=\"attachment_11840\" aria-describedby=\"caption-attachment-11840\" style=\"width: 675px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11840 size-full\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3.png\" alt=\"Turing GPU mesh shader execution architecture\" width=\"675\" height=\"357\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3.png 675w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-300x159.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-625x331.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-500x264.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-160x85.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-362x191.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline3-208x110.png 208w\" sizes=\"auto, (max-width: 675px) 100vw, 675px\" \/><\/a><figcaption id=\"caption-attachment-11840\" class=\"wp-caption-text\">Figure 5. Mesh shaders behave similarly to compute shaders in using a cooperative thread model.<\/figcaption><\/figure>\n<p>Both <em class=\"underscore\">mesh and task shaders<\/em> follow the programming model of <em class=\"underscore\">compute shaders<\/em>, using cooperative thread groups to compute their results and having <strong class=\"asterisk\">no inputs other than a workgroup index<\/strong>. 
These execute on the graphics pipeline, so the hardware directly manages the memory passed between stages and keeps it on-chip.<\/p>\n<p>Later, we will show an example of how this can be used to do primitive culling, as the threads can access all vertices within a workgroup. Figure 6 illustrates the ability of task shaders to take care of early culling.<\/p>\n<figure id=\"attachment_11839\" aria-describedby=\"caption-attachment-11839\" style=\"width: 676px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11839 size-full\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2.png\" alt=\"\" width=\"676\" height=\"347\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2.png 676w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-300x154.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-625x321.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-500x257.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-160x82.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-362x186.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_pipeline2-214x110.png 214w\" sizes=\"auto, (max-width: 676px) 100vw, 676px\" \/><\/a><figcaption id=\"caption-attachment-11839\" class=\"wp-caption-text\">Figure 6. While optional, task shaders enable early culling to improve throughput.<\/figcaption><\/figure>\n<p>The optional expansion via <em class=\"underscore\">task shaders<\/em> allows early culling of a group of primitives or making LOD decisions upfront. 
The mechanism scales across the GPU and is therefore superseding instancing or multi draw indirect for small meshes. This configuration is similar to the <em class=\"underscore\">tessellation control shader<\/em> setting up how much a patch (~task workgroup) is tessellated and then influencing how many <em class=\"underscore\">tessellation evaluation<\/em> invocations (~mesh workgroups) are created.<\/p>\n<p>There is a limitation on how many <em class=\"underscore\">mesh workgroups<\/em> a single <em class=\"underscore\">task workgroup<\/em> can emit. The first generation hardware supports a maximum of 64K children that can be generated <em class=\"underscore\">per task<\/em>. There is no limit on the total number of mesh children across all tasks within the same draw call. Likewise if no <em class=\"underscore\">task shader<\/em> is used, no limits exist on the amount of mesh workgroups generated by the draw call. Figure 7 illustrates how this works.<\/p>\n<figure id=\"attachment_11842\" aria-describedby=\"caption-attachment-11842\" style=\"width: 635px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_tree.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11842 size-full\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_tree.png\" alt=\"NVIDIA Turing GPU mesh shaders workgroup flow\" width=\"635\" height=\"336\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree.png 635w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-300x159.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-625x331.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-500x265.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-160x85.png 160w, 
https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-362x192.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_tree-208x110.png 208w\" sizes=\"auto, (max-width: 635px) 100vw, 635px\" \/><\/a><figcaption id=\"caption-attachment-11842\" class=\"wp-caption-text\">Figure 7. Mesh shader workgroup flow<\/figcaption><\/figure>\n<p>Children of task T are guaranteed to be launched after children of task T-1. However, task and mesh workgroups are fully pipelined, so that there is no waiting for the completion of previous children or tasks.<\/p>\n<p>The <em class=\"underscore\">task shader<\/em> should be used for dynamic work generation or filtering. Static setups benefit from using the <em class=\"underscore\">mesh shaders<\/em> alone.<\/p>\n<p>The rasterization output ordering of the meshes and the primitives within them is preserved. With rasterization disabled, both task and mesh shaders can be used to implement basic compute-trees.<\/p>\n<p><a class=\"target\" name=\"meshletsandmeshshading\"><\/a>\u00a0<a class=\"target\" name=\"toc3\"><\/a><\/p>\n<h1>Meshlets and Mesh Shading<\/h1>\n<p>Each meshlet represents a variable number of vertices and primitives. There are no restrictions regarding the connectivity of these primitives. However, they must stay below a maximum amount, specified within the shader code.<\/p>\n<div class=\"admonition tip\">We recommend using up to 64 vertices and 126 primitives. The &#8216;6&#8217; in 126 is not a typo. The first generation hardware allocates primitive indices in 128-byte granularity and needs to reserve 4 bytes for the primitive count. Therefore <code>3 * 126 + 4<\/code> maximizes the fit into a <code>3 * 128 = 384<\/code> byte block. Going beyond 126 triangles would allocate the next 128 bytes. 
84 and 40 are other maxima that work well for triangles.<\/div>\n<p>For each GLSL <em class=\"underscore\">mesh shader<\/em>, the graphics pipeline allocates a fixed amount of mesh memory per workgroup.<\/p>\n<p>The allocation size of each meshlet depends on the compile-time sizing information as well as which output attributes are referenced by the shader. The smaller the allocation, the more workgroups can be executed in parallel on the hardware. As with\u00a0<em class=\"underscore\">compute<\/em>, workgroups share a common section of on-chip memory they can access. Therefore we recommend you be as efficient as possible in the way all outputs or shared memory are used. This is already true for current shaders. However, the memory footprint can be higher since we allow a greater number of vertices and primitives than in the current programming model.<\/p>\n<p>Maximums, sizes, and primitive output are defined as follows:<\/p>\n<pre style=\"color: #000000;background: #ffffff\"><span style=\"color: #3f7f59\">\/\/ Set the number of threads per workgroup (always one-dimensional).<\/span>\n  <span style=\"color: #3f7f59\">\/\/ The limitations may be different than in actual compute shaders.<\/span>\n  layout(local_size_x=32) in;\n\n  <span style=\"color: #3f7f59\">\/\/ the primitive type (points,lines or triangles)<\/span>\n  layout(triangles) out;\n  <span style=\"color: #3f7f59\">\/\/ maximum allocation size for each meshlet<\/span>\n  layout(max_vertices=64, max_primitives=126) out;\n\n  <span style=\"color: #3f7f59\">\/\/ the actual amount of primitives the workgroup outputs ( &lt;= max_primitives)<\/span>\n  out uint gl_PrimitiveCountNV;\n  <span style=\"color: #3f7f59\">\/\/ an index buffer, using list type indices (strips are not supported here)<\/span>\n  out uint gl_PrimitiveIndicesNV[]; <span style=\"color: #3f7f59\">\/\/ [max_primitives * 3 for triangles]<\/span>\n<\/pre>\n<p>Turing supports another new GLSL extension, 
<code>NV_fragment_shader_barycentric<\/code>, which enables the fragment shader to fetch the raw data of the three vertices that make a primitive and interpolate it manually. This raw access means we can output \u201cuint\u201d vertex attributes, but use the various pack\/unpack functions to store floats as fp16, unorm8 or snorm8. This can greatly reduce the per-vertex footprint again for normals, texture coordinates, and basic color values and benefits both standard as well as the mesh shading pipeline.<\/p>\n<p>Additional attributes for vertices and primitives are defined as follows:<\/p>\n<pre style=\"color: #000000;background: #ffffff\">out gl_MeshPerVertexNV {\n     vec4  gl_Position;\n     <span style=\"color: #7f0055;font-weight: bold\">float<\/span> gl_PointSize;\n     <span style=\"color: #7f0055;font-weight: bold\">float<\/span> gl_ClipDistance[];\n     <span style=\"color: #7f0055;font-weight: bold\">float<\/span> gl_CullDistance[];\n  } gl_MeshVerticesNV[];            <span style=\"color: #3f7f59\">\/\/ [max_vertices]<\/span>\n\n  <span style=\"color: #3f7f59\">\/\/ define your own vertex output blocks as usual<\/span>\n  out Interpolant {\n    vec2 uv;\n  } OUT[];                          <span style=\"color: #3f7f59\">\/\/ [max_vertices]<\/span>\n\n  <span style=\"color: #3f7f59\">\/\/ special purpose per-primitive outputs<\/span>\n  perprimitiveNV out gl_MeshPerPrimitiveNV {\n    <span style=\"color: #7f0055;font-weight: bold\">int<\/span> gl_PrimitiveID;\n    <span style=\"color: #7f0055;font-weight: bold\">int<\/span> gl_Layer;\n    <span style=\"color: #7f0055;font-weight: bold\">int<\/span> gl_ViewportIndex;\n    <span style=\"color: #7f0055;font-weight: bold\">int<\/span> gl_ViewportMask[];          <span style=\"color: #3f7f59\">\/\/ [1]<\/span>\n  } gl_MeshPrimitivesNV[];          <span style=\"color: #3f7f59\">\/\/ [max_primitives]<\/span>\n<\/pre>\n<p>One goal is to have the smallest number of meshlets, therefore maximizing vertex re-use 
within the meshlets, and hence wasting fewer allocations. It can be beneficial to apply a vertex cache optimizer on the index buffer prior to the generation of the meshlet data. For example, <a href=\"https:\/\/tomforsyth1000.github.io\/papers\/fast_vert_cache_opt.html\">Tom Forsyth&#8217;s linear-speed optimizer<\/a> can be used for this. Optimizing the vertex locations along with the index buffer is also beneficial, as the ordering of original triangles will be preserved when using the <em class=\"underscore\">mesh shaders<\/em>. CAD models are often \u201cnaturally\u201d generated with strips and therefore can already have good data locality. Changing the index buffers can have negative side effects on the cluster culling properties of a meshlet for such data (see task-level culling).<\/p>\n<p><a class=\"target\" name=\"pre-computedmeshlets\"><\/a>\u00a0<a class=\"target\" name=\"toc4\"><\/a><\/p>\n<h1>Pre-Computed Meshlets<\/h1>\n<p>As an example, we render static content where the <em class=\"underscore\">index buffers<\/em> do not change for many frames. Therefore the cost of generating the meshlet data can be hidden during upload of vertices\/indices to device memory. 
Additional benefits can be achieved when the <em class=\"underscore\">vertex<\/em> data is also static (no per-vertex animation; no changes in vertex positions), which allows precomputing data that is useful for quickly culling entire meshlets.<br \/>\n<a class=\"target\" name=\"datastructures\"><\/a>\u00a0<a class=\"target\" name=\"toc4.1\"><\/a><\/p>\n<h2 id=\"data_structures\" >Data Structures<a href=\"#data_structures\" aria-label=\"Scroll to Data Structures section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>In future samples we will provide a meshlet builder that contains a basic implementation that scans the provided indices and creates a new meshlet every time either of the size limitations (vertex or primitive count) is hit.<\/p>\n<p>For an input triangle mesh it generates the following data:<\/p>\n<pre style=\"color: #000000;background: #ffffff\"><span style=\"color: #7f0055;font-weight: bold\">struct<\/span> MeshletDesc {\n    uint32_t vertexCount; <span style=\"color: #3f7f59\">\/\/ number of vertices used<\/span>\n    uint32_t primCount;   <span style=\"color: #3f7f59\">\/\/ number of primitives (triangles) used<\/span>\n    uint32_t vertexBegin; <span style=\"color: #3f7f59\">\/\/ offset into vertexIndices<\/span>\n    uint32_t primBegin;   <span style=\"color: #3f7f59\">\/\/ offset into primitiveIndices<\/span>\n  };\n\n  std::<span style=\"color: #7f0055;font-weight: bold\">vector<\/span>&lt;MeshletDesc&gt;  meshletInfos;\n  std::<span style=\"color: #7f0055;font-weight: bold\">vector<\/span>&lt;uint8_t&gt;      primitiveIndices;\n\n  <span style=\"color: #3f7f59\">\/\/ use uint16_t when shorts are sufficient<\/span>\n  std::<span style=\"color: #7f0055;font-weight: bold\">vector<\/span>&lt;uint32_t&gt;     vertexIndices;\n<\/pre>\n<p><strong class=\"asterisk\">Why are there two index buffers?<\/strong><\/p>\n<p>The following original triangle index buffer sequence<\/p>\n<pre style=\"color: #000000;background: 
#ffffff\"><span style=\"color: #3f7f59\">\/\/ let's look at the first two triangles of a batch of many more<\/span>\ntriangleIndices = { 4,5,6,  8,4,6, ...}\n<\/pre>\n<p>is split into two new indexbuffers.<\/p>\n<p>We build a set of unique vertex indices as we iterate through the triangle indices. This process is also known as <strong class=\"asterisk\">vertex de-duplication<\/strong>.<\/p>\n<pre style=\"color: #000000;background: #ffffff\">vertexIndices = { 4,5,6,  8, ...}\n<span style=\"color: #3f7f59\">\/\/ For the second triangle only vertex 8 must be added<\/span>\n<span style=\"color: #3f7f59\">\/\/ and the other vertices are re-used.<\/span>\n<\/pre>\n<p>The primitive indices are adjusted relative to the <code>vertexIndices<\/code> entries.<\/p>\n<pre style=\"color: #000000;background: #ffffff\"><span style=\"color: #3f7f59\">\/\/ original data<\/span>\ntriangleIndices  = { 4,5,6,  8,4,6, ...}\n<span style=\"color: #3f7f59\">\/\/ new data<\/span>\nprimitiveIndices = { 0,1,2,  3,0,2, ...}\n<span style=\"color: #3f7f59\">\/\/ the primitive indices are local per meshlet<\/span>\n<\/pre>\n<p>Once the appropriate size limitation is hit (either too many unique vertices, or too many primitives), a new meshlet is started. Subsequent meshlets will then create their own set of unique vertices.<\/p>\n<p><a class=\"target\" name=\"renderingresourcesanddataflow\"><\/a>\u00a0<a class=\"target\" name=\"toc4.2\"><\/a><\/p>\n<h2 id=\"rendering_resources_and_data_flow\" >Rendering Resources and Data Flow<a href=\"#rendering_resources_and_data_flow\" aria-label=\"Scroll to Rendering Resources and Data Flow section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>During rendering we use the original vertex buffers. However, instead of the original triangle indexbuffer we use three new buffers, shown in figure 8 below:<\/p>\n<ul>\n<li class=\"asterisk\"><strong class=\"asterisk\">Vertex Index Buffer<\/strong> as explained above. 
Each meshlet references a set of unique vertices. The indices for those vertices are stored in a buffer for all meshlets sequentially.<\/li>\n<li class=\"asterisk\"><strong class=\"asterisk\">Primitive Index Buffer<\/strong> as explained above. Each meshlet represents a varying number of primitives. Every triangle requires three primitive indices which are stored in a single buffer. <em class=\"underscore\">Note<\/em>: Extra indices may be added to get four-byte alignment after each meshlet.<\/li>\n<li class=\"asterisk\"><strong class=\"asterisk\">Meshlet Desc Buffer.<\/strong>\u00a0Stores the workload information and buffer offsets for each meshlet, as well as cluster culling information.<\/li>\n<\/ul>\n<p>These three buffers are actually smaller than the original index-buffers due to the higher vertex re-use that mesh shading allows. We typically observed a reduction to around 75% of the original index-buffer size.<\/p>\n<figure id=\"attachment_11835\" aria-describedby=\"caption-attachment-11835\" style=\"width: 632px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_data.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11835 size-full\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_data.png\" alt=\"NVIDIA Turing GPU mesh shader buffer structure\" width=\"632\" height=\"161\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data.png 632w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-300x76.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-625x159.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-500x127.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-160x41.png 160w, 
https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-362x92.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_data-432x110.png 432w\" sizes=\"auto, (max-width: 632px) 100vw, 632px\" \/><\/a><figcaption id=\"caption-attachment-11835\" class=\"wp-caption-text\">Figure 8. Meshlet buffer structure<\/figcaption><\/figure>\n<ul>\n<li><strong class=\"asterisk\">Meshlet Vertices:<\/strong> <code>vertexBegin<\/code> stores the location from which we start fetching vertex indices. <code>vertexCount<\/code> stores the number of contiguous vertices involved. The vertices are unique within a meshlet; there are no duplicate index values.<\/li>\n<li class=\"asterisk\"><strong class=\"asterisk\">Meshlet Primitives:<\/strong> <code>primBegin<\/code> stores the location from which we start fetching primitive indices. <code>primCount<\/code> stores the number of primitives involved in the meshlet. Note that the number of indices depends on the primitive type (here: 3 for triangles). It is important to note that the indices reference vertices relative to <code>vertexBegin<\/code>, meaning that index &#8216;0&#8217; would refer to the vertex index located at <code>vertexBegin<\/code>.<\/li>\n<\/ul>\n<p>The following pseudocode describes what each <em class=\"underscore\">mesh shader<\/em> workgroup performs in principle. 
It is serial only for illustration purposes.<\/p>\n<pre style=\"color: #000000;background: #ffffff\"><span style=\"color: #3f7f59\">\/\/ This code is just a serial pseudo code,<\/span>\n  <span style=\"color: #3f7f59\">\/\/ and doesn't reflect actual GLSL code that would<\/span>\n  <span style=\"color: #3f7f59\">\/\/ leverage the workgroup's local thread invocations.<\/span>\n\n  <span style=\"color: #7f0055;font-weight: bold\">for<\/span> (<span style=\"color: #7f0055;font-weight: bold\">int<\/span> v = 0; v &lt; meshlet.vertexCount; v++){\n    <span style=\"color: #7f0055;font-weight: bold\">int<\/span> vertexIndex = texelFetch(vertexIndexBuffer, meshlet.vertexBegin + v).x;\n    vec4 vertex = texelFetch(vertexBuffer, vertexIndex);\n    gl_MeshVerticesNV[v].gl_Position = <span style=\"color: #7f0055;font-weight: bold\">transform<\/span> * vertex;\n  }\n\n  <span style=\"color: #7f0055;font-weight: bold\">for<\/span> (<span style=\"color: #7f0055;font-weight: bold\">int<\/span> p = 0; p &lt; meshlet.primCount; p++){\n    uvec3 triangle = getTriIndices(primitiveIndexBuffer, meshlet.primBegin + p);\n    gl_PrimitiveIndicesNV[p * 3 + 0] = triangle.x;\n    gl_PrimitiveIndicesNV[p * 3 + 1] = triangle.y;\n    gl_PrimitiveIndicesNV[p * 3 + 2] = triangle.z;\n  }\n\n  <span style=\"color: #3f7f59\">\/\/ one thread writes the output primitive count<\/span>\n  gl_PrimitiveCountNV = meshlet.primCount;\n<\/pre>\n<p>The mesh shader could look something like this when written in parallel fashion:<\/p>\n<pre style=\"color: #000000;background: #ffffff\"><span style=\"color: #7f0055;font-weight: bold\">void<\/span> <span style=\"color: #7f0055;font-weight: bold\">main<\/span>() {\n  ...\n\n  <span style=\"color: #3f7f59\">\/\/ As the workgroupSize may be less than the max_vertices\/max_primitives<\/span>\n  <span style=\"color: #3f7f59\">\/\/ we still require an outer loop. 
Given their static nature<\/span>\n  <span style=\"color: #3f7f59\">\/\/ they should be unrolled by the compiler in the end.<\/span>\n\n  <span style=\"color: #3f7f59\">\/\/ Resolved at compile time<\/span>\n  <span style=\"color: #7f0055;font-weight: bold\">const<\/span> uint vertexLoops =\n    (MAX_VERTEX_COUNT + GROUP_SIZE - 1) \/ GROUP_SIZE;\n\n  <span style=\"color: #7f0055;font-weight: bold\">for<\/span> (uint loop = 0; loop &lt; vertexLoops; loop++){\n    <span style=\"color: #3f7f59\">\/\/ distribute execution across threads<\/span>\n    uint v = gl_LocalInvocationID.x + loop * GROUP_SIZE;\n\n    <span style=\"color: #3f7f59\">\/\/ Avoid branching to get pipelined memory loads.<\/span>\n    <span style=\"color: #3f7f59\">\/\/ Downside is we may redundantly compute the last<\/span>\n    <span style=\"color: #3f7f59\">\/\/ vertex several times<\/span>\n    v = <span style=\"color: #7f0055;font-weight: bold\">min<\/span>(v, meshlet.vertexCount-1);\n    {\n      <span style=\"color: #7f0055;font-weight: bold\">int<\/span> vertexIndex = texelFetch( vertexIndexBuffer, \n                                    <span style=\"color: #7f0055;font-weight: bold\">int<\/span>(meshlet.vertexBegin + v)).x;\n      vec4 vertex = texelFetch(vertexBuffer, vertexIndex);\n      gl_MeshVerticesNV[v].gl_Position = <span style=\"color: #7f0055;font-weight: bold\">transform<\/span> * vertex;\n    }\n  }\n\n  <span style=\"color: #3f7f59\">\/\/ Let's pack 8 indices into RG32 bit texture<\/span>\n  uint primreadBegin = meshlet.primBegin \/ 8;\n  uint primreadIndex = meshlet.primCount * 3 - 1;\n  uint primreadMax   = primreadIndex \/ 8;\n\n  <span style=\"color: #3f7f59\">\/\/ resolved at compile time and typically just 1<\/span>\n  <span style=\"color: #7f0055;font-weight: bold\">const<\/span> uint primreadLoops =\n    (MAX_PRIMITIVE_COUNT * 3 + GROUP_SIZE * 8 - 1) \n      \/ (GROUP_SIZE * 8);\n\n  <span style=\"color: #7f0055;font-weight: bold\">for<\/span> (uint loop = 0; loop &lt; 
primreadLoops; loop++){\n    uint p = gl_LocalInvocationID.x + loop * GROUP_SIZE;\n    p = <span style=\"color: #7f0055;font-weight: bold\">min<\/span>(p, primreadMax);\n\n    uvec2 topology = texelFetch(primitiveIndexBuffer, \n                                <span style=\"color: #7f0055;font-weight: bold\">int<\/span>(primreadBegin + p)).rg;\n\n    <span style=\"color: #3f7f59\">\/\/ use a built-in function, we took special care before when <\/span>\n    <span style=\"color: #3f7f59\">\/\/ sizing the meshlets to ensure we don't exceed the <\/span>\n    <span style=\"color: #3f7f59\">\/\/ gl_PrimitiveIndicesNV array here<\/span>\n\n    writePackedPrimitiveIndices4x8NV(p * 8 + 0, topology.x);\n    writePackedPrimitiveIndices4x8NV(p * 8 + 4, topology.y);\n  }\n\n  <span style=\"color: #7f0055;font-weight: bold\">if<\/span> (gl_LocalInvocationID.x == 0) {\n    gl_PrimitiveCountNV = meshlet.primCount;\n  }\n}\n<\/pre>\n<p>This example is just a straightforward implementation. Because all data fetching is done by the developer, custom encodings, decompression via subgroup intrinsics or shared memory, or temporarily using the vertex outputs are all possible ways to save additional bandwidth.<\/p>\n<p><a class=\"target\" name=\"clustercullingwithtaskshader\"><\/a>\u00a0<a class=\"target\" name=\"toc4.3\"><\/a><\/p>\n<h2 id=\"cluster_culling_with_task_shader\" >Cluster Culling with Task Shader<a href=\"#cluster_culling_with_task_shader\" aria-label=\"Scroll to Cluster Culling with Task Shader section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>We try to squeeze more information into a meshlet descriptor to perform early culling. We have experimented with using 128-bit descriptors that encode the previously mentioned values, as well as a relative bounding box and a cone for backface-cluster culling as presented by <a href=\"https:\/\/frostbite-wp-prd.s3.amazonaws.com\/wp-content\/uploads\/2016\/03\/29204330\/GDC_2016_Compute.pdf\">G. Wihlidal<\/a>. 
When generating meshlets, one needs to balance good cluster-culling properties with improved vertex re-use. One may influence the other negatively.<\/p>\n<p>The task shader below culls up to 32 meshlets.<\/p>\n<pre style=\"color: #000000;background: #ffffff\">layout(local_size_x=32) in;\n\ntaskNV out Task {\n  uint      baseID;\n  uint8_t   subIDs[GROUP_SIZE];\n} OUT;\n\n<span style=\"color: #7f0055;font-weight: bold\">void<\/span> <span style=\"color: #7f0055;font-weight: bold\">main<\/span>() {\n  <span style=\"color: #3f7f59\">\/\/ we padded the buffer to ensure we don't access it out of bounds<\/span>\n  uvec4 desc = meshletDescs[gl_GlobalInvocationID.x];\n\n  <span style=\"color: #3f7f59\">\/\/ implement some early culling function<\/span>\n  <span style=\"color: #7f0055;font-weight: bold\">bool<\/span> render = gl_GlobalInvocationID.x &lt; meshletCount &amp;&amp; !earlyCull(desc);\n\n  uvec4 vote  = subgroupBallot(render);\n  uint  tasks = subgroupBallotBitCount(vote);\n\n  <span style=\"color: #7f0055;font-weight: bold\">if<\/span> (gl_LocalInvocationID.x == 0) {\n    <span style=\"color: #3f7f59\">\/\/ write the number of surviving meshlets, i.e. 
<\/span>\n    <span style=\"color: #3f7f59\">\/\/ mesh workgroups to spawn<\/span>\n    gl_TaskCountNV = tasks;\n\n    <span style=\"color: #3f7f59\">\/\/ where the meshletIDs started from for this task workgroup<\/span>\n    OUT.baseID = gl_WorkGroupID.x * GROUP_SIZE;\n  }\n\n  {\n    <span style=\"color: #3f7f59\">\/\/ write which children survived into a compact array<\/span>\n    uint idxOffset = subgroupBallotExclusiveBitCount(vote);\n    <span style=\"color: #7f0055;font-weight: bold\">if<\/span> (render) {\n      OUT.subIDs[idxOffset] = uint8_t(gl_LocalInvocationID.x);\n    }\n  }\n}\n<\/pre>\n<p>The corresponding mesh shader now uses the information from the task shader to identify which meshlet to generate.<\/p>\n<pre style=\"color: #000000;background: #ffffff\">taskNV in Task {\n  uint      baseID;\n  uint8_t   subIDs[GROUP_SIZE];\n} IN;\n\n<span style=\"color: #7f0055;font-weight: bold\">void<\/span> <span style=\"color: #7f0055;font-weight: bold\">main<\/span>() {\n  <span style=\"color: #3f7f59\">\/\/ We can no longer use gl_WorkGroupID.x directly<\/span>\n  <span style=\"color: #3f7f59\">\/\/ as it now encodes which child this workgroup is.<\/span>\n  uint meshletID = IN.baseID + IN.subIDs[gl_WorkGroupID.x];\n  uvec4 desc = meshletDescs[meshletID];\n  ...\n}\n<\/pre>\n<p>We only culled the meshlets in the task shader in the context of rendering large triangle models. Other scenarios may involve picking different meshlet data later on depending on level-of-detail decision making, or completely generating the geometry (particles, ribbons etc.). 
Figure 9 below is from a demo that uses task shaders for level-of-detail computation.<\/p>\n<figure id=\"attachment_11836\" aria-describedby=\"caption-attachment-11836\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_lod.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-11836 size-full-page-width\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2018\/09\/meshlets_lod-1024x592.png\" alt=\"NVIDIA Turing GPU mesh shader demo\" width=\"1024\" height=\"592\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-1024x592.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-300x174.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-768x444.png 768w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-625x362.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-500x289.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-156x90.png 156w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-362x209.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod-190x110.png 190w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_lod.png 1922w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption id=\"caption-attachment-11836\" class=\"wp-caption-text\">Figure 9. NVIDIA Asteroids demo uses mesh shading<\/figcaption><\/figure>\n<p><a class=\"target\" name=\"conclusion\"><\/a>\u00a0<a class=\"target\" name=\"toc5\"><\/a><\/p>\n<h1>Conclusion<\/h1>\n<p>Some of the key takeaways:<\/p>\n<ul>\n<li class=\"asterisk\">A triangle mesh can be converted into meshlets by scanning the index buffer once. 
Vertex cache optimizers that help classic rendering also help improve meshlet packing efficiency. More sophisticated clustering allows improved early rejection in the task shader stage (tighter bounding boxes, coherent triangle normals etc.).<\/li>\n<li class=\"asterisk\">The <em class=\"underscore\">task shader<\/em> allows skipping a group of primitives early, before the hardware needs to allocate vertex\/primitive memory for a <em class=\"underscore\">mesh shader<\/em> invocation on-chip. It also enables generating more than one child invocation if necessary.<\/li>\n<li class=\"asterisk\">Vertices are processed in parallel across the workgroup&#8217;s threads, just like the original <em class=\"underscore\">vertex shaders<\/em>.<\/li>\n<li class=\"asterisk\"><em class=\"underscore\">Vertex shaders<\/em> can be made mostly compatible with <em class=\"underscore\">mesh shaders<\/em> with a few preprocessor insertions.<\/li>\n<li class=\"asterisk\">Less data needs to be fetched due to greater vertex re-use (classic vertex shaders operate with a limit of max_vertices = 32 and max_primitives = 32). Average triangle mesh valences suggest that using twice as many triangles as vertices is beneficial.<\/li>\n<li class=\"asterisk\">All data loads are handled via shader instructions instead of the classic fixed-function primitive fetch and therefore scale better with more <em class=\"asterisk\">Streaming Multiprocessors<\/em>. It also allows easier use of custom vertex encodings to further reduce bandwidth.<\/li>\n<li class=\"asterisk\">For heavy use of vertex attributes, a primitive culling phase that also operates in parallel may be beneficial. This allows us to skip loading vertex data for primitives that would be culled away. 
However, the best gains are made by efficient culling at the task-level.<\/li>\n<\/ul>\n<p>You can find more information on the Turing architecture <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-turing-architecture-in-depth\/\">here<\/a>. Please add your thoughts in the comments section below. Sample code and driver support will soon be available.\u00a0 If you&#8217;re an NVIDIA developer working with Turing advanced shaders, check out the <a href=\"https:\/\/devtalk.nvidia.com\/default\/board\/60\/visual-and-game-development\/\">game developer forums<\/a>, where you can interact with a community of NVIDIA developers.<\/p>\n<p><a class=\"target\" name=\"references\"><\/a>\u00a0<a class=\"target\" name=\"toc6\"><\/a><\/p>\n<h2 id=\"references\" >References<a href=\"#references\" aria-label=\"Scroll to References section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<ul>\n<li class=\"asterisk\">[1]: <a href=\"https:\/\/www.facebook.com\/artbyrens\">Art by Rens<\/a><\/li>\n<li class=\"asterisk\">[2]: <a href=\"https:\/\/www.flickr.com\/photos\/14136614@N03\/6209344182\">photo by Chris Christian \u2013 model by Russell Berkoff<\/a><\/li>\n<li class=\"asterisk\">[3]: <a href=\"https:\/\/frostbite-wp-prd.s3.amazonaws.com\/wp-content\/uploads\/2016\/03\/29204330\/GDC_2016_Compute.pdf\">Optimizing Graphics Pipeline with Compute \u2013 Graham Wihlidal<\/a><\/li>\n<li class=\"asterisk\">[4]: <a href=\"http:\/\/advances.realtimerendering.com\/s2015\/aaltonenhaar_siggraph2015_combined_final_footer_220dpi.pdf\">GPU-Driven Rendering Pipelines \u2013 Ulrich Haar &amp; Sebastian Aaltonen<\/a><\/li>\n<li class=\"asterisk\">[5]: <a href=\"http:\/\/www.conffx.com\/Visibility_Buffer_GDCE.pdf\">The filtered and culled Visibility Buffer \u2013 Wolfgang Engel<\/a><\/li>\n<\/ul>\n<h2 id=\"appendix_siggraph_presentation\" >Appendix: SIGGRAPH Presentation<a href=\"#appendix_siggraph_presentation\" aria-label=\"Scroll to Appendix: SIGGRAPH 
Presentation section\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>Here is the full SIGGRAPH presentation upon which this blog post builds, for your viewing pleasure.<\/p>\n<p><span class=\"embed-youtube\" style=\"text-align:center; display: block;\"><iframe loading=\"lazy\" class=\"youtube-player\" width=\"640\" height=\"360\" src=\"https:\/\/www.youtube.com\/embed\/Ge427_2VORo?version=3&#038;rel=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;fs=1&#038;hl=en-US&#038;autohide=2&#038;wmode=transparent\" allowfullscreen=\"true\" style=\"border:0;\" sandbox=\"allow-scripts allow-same-origin allow-popups allow-presentation allow-popups-to-escape-sandbox\"><\/iframe><\/span><\/p>\n<div class=\"markdeepFooter\">\n<div style=\"font-size: 13px;font-family: 'Times New Roman',serif;vertical-align: middle\"><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The Turing architecture introduces a new programmable geometric shading pipeline through the use of mesh shaders. The new shaders bring the compute programming model to the graphics pipeline as threads are used cooperatively to generate compact meshes (meshlets) directly on the chip for consumption by the rasterizer. 
Applications and games dealing with high-geometric complexity benefit &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/introduction-turing-mesh-shaders\/\">Continued<\/a><\/p>\n","protected":false},"author":507,"featured_media":11833,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"video","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"602203","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/introduction-to-turing-mesh-shaders\/148633","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>The Turing architecture&#039;s mesh shading pipeline improves rendering efficiency by allowing cooperative threads to generate compact meshes directly on the chip.<\/li><li>The task shader stage enables early culling of meshlets, reducing unnecessary vertex and primitive processing, and can dynamically generate work for the mesh shader stage.<\/li><li>Meshlets are used to segment large meshes into smaller, more manageable pieces, with a recommended maximum of 64 vertices and 126 primitives per meshlet to optimize vertex re-use and reduce memory bandwidth.<\/li><li>The mesh shading pipeline provides several benefits, including higher scalability, reduced bandwidth requirements, and increased flexibility in defining mesh topology and creating graphics work.<\/li><li>The new pipeline is accessible through extensions in OpenGL and Vulkan, and using DirectX 12 Ultimate, allowing developers to leverage the capabilities of the Turing 
architecture.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[503],"tags":[1932,3459,635,643],"coauthors":[640],"class_list":["post-11828","post","type-post","status-publish","format-video","has-post-thumbnail","hentry","category-simulation-modeling-design","tag-development-tools-and-libraries","tag-meshes","tag-turing","tag-turing-advanced-shaders","post_format-video","tagify_workload-generative-ai","tagify_workload-graphics","tagify_workload-simulation-modeling-design"],"acf":{"post_industry":[],"post_products":[],"post_learning_levels":[],"post_content_types":[],"post_collections":[]},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2018\/09\/meshlets_bunny.png","primary_category":{"category":"Simulation \/ Modeling \/ Design","link":"https:\/\/developer.nvidia.com\/blog\/category\/simulation-modeling-design\/","id":503,"data_source":""},"nv_translations":[{"language":"zh_CN","title":"Turing \u7f51\u683c\u7740\u8272\u5668\u7b80\u4ecb","post_id":349}],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-34M","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/11828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/507"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=11828"}],"version-history":[{"count":30,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/11828\/revisions"}],"predecessor-version":[{"id":42441,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/11828\/revisions\/42441"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blo
gs.nvidia.com\/wp-json\/wp\/v2\/media\/11833"}],"wp:attachment":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=11828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=11828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=11828"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=11828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}