Scaling MoE Inference with NVIDIA Dynamo on A4X Rack-Scale
🚀 This post describes two validated deployment recipes for serving large Mixture-of-Experts (MoE) models on Google Cloud's A4X machines with NVIDIA Dynamo: one optimized for throughput and one for latency. Both exploit the 72‑GPU GB200 NVL72 rack fabric, WideEP/DeepEP parallelism, a global KV cache, and GKE-aware rack-level scheduling. In performance validation, the throughput recipe exceeded 6K tokens/sec/GPU and the latency-optimized recipe reached a 10 ms median inter-token latency.
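
To put the two headline numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that all 72 GPUs in a rack sustain the reported per-GPU rate and that each stream decodes at the median inter-token latency. The two figures come from different recipes, so they should not be read as achievable at the same time. The function names are hypothetical and not part of Dynamo.

```python
# Back-of-the-envelope arithmetic on the reported figures.
# Assumptions (not from the validation itself): every GPU in the rack
# sustains the per-GPU rate, and a stream decodes at the median ITL.

GPUS_PER_RACK = 72               # GB200 NVL72 rack
TOKENS_PER_SEC_PER_GPU = 6_000   # lower bound, reported as ">6K" (throughput recipe)
MEDIAN_ITL_MS = 10               # latency-optimized recipe


def rack_throughput(tokens_per_sec_per_gpu, gpus=GPUS_PER_RACK):
    """Aggregate tokens/sec if every GPU sustains the per-GPU rate."""
    return tokens_per_sec_per_gpu * gpus


def per_stream_decode_rate(itl_ms):
    """Tokens/sec seen by one stream at a given inter-token latency."""
    return 1000 / itl_ms


rack_tps = rack_throughput(TOKENS_PER_SEC_PER_GPU)
stream_tps = per_stream_decode_rate(MEDIAN_ITL_MS)

print(f"Throughput recipe: at least {rack_tps:,} tokens/sec per rack")
print(f"Latency recipe: about {stream_tps:.0f} tokens/sec per stream")
```

Under these assumptions the throughput recipe works out to at least 432,000 tokens/sec per rack, while a 10 ms inter-token latency corresponds to roughly 100 tokens/sec for a single stream.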
