GPU Mesh Optimization Pipeline with meshoptimizer

meshoptimizer is a battle-tested C/C++ library that reindexes, cache-optimizes, quantizes, and clusterizes meshes, slashing GPU vertex processing and overdraw for real-time rendering; run the steps in the order below for maximum gains.

Why Optimize Meshes? GPU Bottlenecks Exposed

GPUs process triangle meshes through vertex fetch, shader execution, cache reuse, rasterization, and overdraw-prone pixel shading. Unoptimized data wastes bandwidth and cycles: redundant vertices bloat buffers, poor index order kills cache hits (historically 16-32 cache slots, now thread-group batching), scattered fetches hammer memory, and naive draw order forfeits front-to-back overdraw reduction. meshoptimizer targets these with a pipeline proven across Vulkan/D3D12, reducing memory, bandwidth, and shader invocations. It's not hype; it's algorithms tuned for NVIDIA Turing+, AMD RDNA2, and mobile tiled renderers, with C-compatible headers for FFI integration.

Tradeoff upfront: Optimizations are sequential and destructive (in-place rewrites), so bake them into asset pipelines. Quantization trades precision for bandwidth (e.g., normals packed to 10-10-10 SNORM typically introduce error below ~1e-3). Overdraw optimization sacrifices up to ~5% cache efficiency (at the 1.05 threshold) for pixel savings; test on your hardware, and skip it on PowerVR/Apple tiled GPUs.

Core Pipeline: Indexing to Shadow Indexing

Start with de-duplication via meshopt_generateVertexRemap: it builds a remap table matching vertices by bit-exact equivalence (so zero-initialize any struct padding), collapsing unindexed or redundant buffers. For float drift (normals/tangents), either quantize first or use meshopt_generateVertexRemapCustom with a tolerance-based comparison on the affected attributes:

size_t vertex_count = meshopt_generateVertexRemapCustom(&remap[0], NULL, index_count, &unindexed_vertices[0].px, unindexed_vertex_count, sizeof(Vertex),
  [&](unsigned int lhs, unsigned int rhs) -> bool {
    const Vertex &lv = unindexed_vertices[lhs], &rv = unindexed_vertices[rhs];
    return fabsf(lv.tx - rv.tx) < 1e-3f && fabsf(lv.ty - rv.ty) < 1e-3f;
  });
meshopt_remapIndexBuffer(indices, NULL, index_count, &remap[0]);
meshopt_remapVertexBuffer(vertices, &unindexed_vertices[0], unindexed_vertex_count, sizeof(Vertex), &remap[0]);

This yields a unique vertex buffer plus an index buffer. Next, meshopt_optimizeVertexCache reorders triangles for vertex reuse locality; its adaptive model holds up across architectures, while meshopt_optimizeVertexCacheFifo with a 16-entry cache is a faster option for content iteration (roughly 2x the optimization speed, slightly worse runtime results).
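
A minimal sketch of this step, assuming indices and vertex_count come from the remapping above (the library's examples run these in place, with the same buffer as source and destination):

meshopt_optimizeVertexCache(&indices[0], &indices[0], index_count, vertex_count);

// Faster alternative for iteration builds: FIFO model with a 16-entry cache.
// meshopt_optimizeVertexCacheFifo(&indices[0], &indices[0], index_count, vertex_count, 16);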

Optionally run meshopt_optimizeOverdraw(indices, indices, index_count, &vertices[0].x, vertex_count, sizeof(Vertex), 1.05f); it reorders triangles toward front-to-back order across view directions, trading against cache efficiency via the threshold (1.05f allows at most a 5% drop in cache hit ratio).

Then meshopt_optimizeVertexFetch reorders the vertex buffer (and rewrites indices) to follow index access order, approximating rather than exactly modeling the fetch cache. Finish with quantization: positions to half-floats (meshopt_quantizeHalf), normals to packed SNORM (meshopt_quantizeSnorm(nx, 10) into a 10_10_10 layout). Dequantize in shaders via normalized vertex inputs, or on the CPU with meshopt_dequantizeHalf.
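
A sketch of fetch ordering plus quantization, assuming a Vertex with float position (px/py/pz) and normal (nx/ny/nz) members and a hypothetical PackedVertex output layout:

meshopt_optimizeVertexFetch(vertices, indices, index_count, vertices, vertex_count, sizeof(Vertex));

struct PackedVertex
{
  unsigned short px, py, pz, pw; // half-float position, pw is padding
  unsigned int n;                // 10_10_10_2 SNORM normal
};

// Quantize one vertex v into pv (loop over the whole buffer in practice).
PackedVertex pv;
pv.px = meshopt_quantizeHalf(v.px);
pv.py = meshopt_quantizeHalf(v.py);
pv.pz = meshopt_quantizeHalf(v.pz);
pv.pw = 0;
pv.n = (meshopt_quantizeSnorm(v.nx, 10) & 0x3ff)
     | ((meshopt_quantizeSnorm(v.ny, 10) & 0x3ff) << 10)
     | ((meshopt_quantizeSnorm(v.nz, 10) & 0x3ff) << 20);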

For shadow maps and depth prepasses, meshopt_generateShadowIndexBuffer produces a leaner index buffer that ignores attribute seams (e.g., UV or lightmap splits) by comparing positions only (meshopt_generateShadowIndexBufferMulti handles multiple streams). Cache-optimize it separately.
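
A sketch, assuming position occupies the first three floats of Vertex and shadow_indices holds index_count entries:

// Compare positions only (sizeof(float) * 3) while stepping by the full Vertex stride.
meshopt_generateShadowIndexBuffer(&shadow_indices[0], &indices[0], index_count, &vertices[0].x, vertex_count, sizeof(float) * 3, sizeof(Vertex));
meshopt_optimizeVertexCache(&shadow_indices[0], &shadow_indices[0], index_count, vertex_count);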

"The algorithm tries to maintain a balance between vertex cache efficiency and overdraw; the threshold determines how much the algorithm can compromise the vertex cache hit ratio, with 1.05 meaning that the resulting ratio should be at most 5% worse than before the optimization." — Docs on overdraw threshold, highlighting explicit perf tuning.

Clusterization for Mesh Shaders and Raytracing

Mesh shaders (NVIDIA Turing+, AMD RDNA2) replace the fixed index/vertex-shader pipeline with programmable batches. Convert meshes to meshlets (NVIDIA recommends at most 64 vertices / 126 triangles): meshopt_buildMeshlets balances vertex reuse within each meshlet against culling efficiency. cone_weight=0.25 trades some topological efficiency for tighter culling cones; trim the overallocated arrays after building.

const size_t max_vertices = 64, max_triangles = 126;
size_t max_meshlets = meshopt_buildMeshletsBound(indices.size(), max_vertices, max_triangles);
// ... allocate meshlets, vertices, triangles
size_t meshlet_count = meshopt_buildMeshlets(meshlets.data(), meshlet_vertices.data(), meshlet_triangles.data(), indices.data(), indices.size(), &vertices[0].x, vertices.size(), sizeof(Vertex), max_vertices, max_triangles, 0.25f);
// Trim the overallocated arrays using the last meshlet's offsets
const meshopt_Meshlet& last = meshlets[meshlet_count - 1];
meshlet_vertices.resize(last.vertex_offset + last.vertex_count);
meshlet_triangles.resize(last.triangle_offset + ((last.triangle_count * 3 + 3) & ~3));
meshlets.resize(meshlet_count);
// Per-meshlet micro-optimization (for each meshlet m):
// meshopt_optimizeMeshlet(&meshlet_vertices[m.vertex_offset], &meshlet_triangles[m.triangle_offset], m.triangle_count, m.vertex_count);

Feed the results to mesh shaders (the library docs include a GLSL example for VK_EXT_mesh_shader). Older AMD GPUs favor square limits (64/64). Meshlets enable per-cluster culling (frustum, occlusion, cone) and in-memory compression. Alternative: meshopt_buildMeshletsScan builds meshlets at load time by scanning already cache-optimized indices.

"Note that for earlier AMD GPUs, the best configurations tend to use the same limits for max_vertices and max_triangles, such as 64 and 64, or 128 and 128." — Hardware-specific tuning, avoiding one-size-fits-all.

Compression, Simplification, and Analyzers

Beyond the core pipeline: vertex/index compression shrinks buffers (meshopt_encodeVertexBuffer / meshopt_encodeIndexBuffer, plus meshlet and point-cloud variants), and vertex filters re-encode attributes (octahedral normals/tangents, quaternions, exponential floats) so they compress better. Simplification: meshopt_simplify removes triangles under an error bound, the attribute-aware variant preserves UVs/normals, and the sloppy variant permits topology breaks for aggressive reduction. Advanced uses cover dynamic vertex updates and point clouds.
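
A sketch of simplification and vertex encoding, assuming the optimized vertices/indices from the pipeline above; the meshopt_simplify signature shown (with options and result_error) matches recent releases:

// Halve the index count, allowing up to ~1% geometric error.
std::vector<unsigned int> lod(index_count);
float lod_error = 0.f;
lod.resize(meshopt_simplify(&lod[0], &indices[0], index_count, &vertices[0].x, vertex_count,
  sizeof(Vertex), index_count / 2, 1e-2f, /* options= */ 0, &lod_error));

// Compress the vertex buffer for storage; decode at load with meshopt_decodeVertexBuffer.
std::vector<unsigned char> vbuf(meshopt_encodeVertexBufferBound(vertex_count, sizeof(Vertex)));
vbuf.resize(meshopt_encodeVertexBuffer(&vbuf[0], vbuf.size(), &vertices[0], vertex_count, sizeof(Vertex)));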

Efficiency analyzers (meshopt_analyzeVertexCache, meshopt_analyzeOverdraw, meshopt_analyzeVertexFetch) score buffers before and after optimization (ACMR, transformed-vertex ratio, overdraw ratio) and are essential for iterating on settings.
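
A quick analyzer sketch; the 16/0/0 arguments model a 16-entry cache with no warp or primitive-group batching, so pick values that match your target hardware:

meshopt_VertexCacheStatistics vcs = meshopt_analyzeVertexCache(&indices[0], index_count, vertex_count, 16, 0, 0);
meshopt_OverdrawStatistics os = meshopt_analyzeOverdraw(&indices[0], index_count, &vertices[0].x, vertex_count, sizeof(Vertex));
// Lower is better for both: acmr is transformed vertices per triangle,
// overdraw is shaded pixels divided by covered pixels.
printf("ACMR %f ATVR %f overdraw %f\n", vcs.acmr, vcs.atvr, os.overdraw);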

Deinterleave attributes for multi-stream layouts. Specialized utilities cover triangle strips, adjacency, tessellation, visibility buffers, and opacity micromaps.

Integration Realities and Tradeoffs

Single-header C/C++ API with implementation files (src/*.cpp); build via CMake or add the sources directly. Platforms: vcpkg/Conan, distro packages. Companions: gltfpack (glTF optimization), clusterlod.h (cluster LOD). Rust/JS bindings exist. Memory: temporary remap allocations on the order of vertex_count.

Why this order? Indexing enables reuse; the cache, overdraw, and fetch passes all operate on indices; quantize last because earlier passes need full-precision floats. Failure modes: float drift needs the custom remap; overdraw optimization is typically skipped on tiled GPUs. To replicate results, measure with the analyzers, e.g., FIFO vs. adaptive cache optimization (adaptive wins across GPUs).

"While it generally produces less performant results on most GPUs, it FIFO runs ~2x faster, which may benefit rapid content iteration." — Tradeoff callout for dev workflows.

Key Takeaways

  • Pipeline strictly: Index → Cache opt → Overdraw (optional, 1.05f threshold) → Fetch opt → Quantize → Shadow index.
  • Custom remap for drift: Tolerance floats like 1e-3f on tangents.
  • Meshlets: 64v/126t NVIDIA, square for AMD; cone_weight=0.25f if culling.
  • Quantize domain-specific: Half-float pos, 10-10-10 SNORM normals.
  • Always analyze: meshopt_analyzeVertexCache for hits, overdraw ratio.
  • Test hardware: Skip overdraw on tiled mobile; separate shadow IB for seams.
  • Trim meshlet arrays; optimize each in-place for locality.
  • gltfpack for glTF assets; prefer native binaries over npm builds for speed and texture compression support.
