Disaggregated AI Inference with NVIDIA Dynamo on GKE
⚡ This post announces a reproducible recipe to deploy NVIDIA Dynamo for disaggregated LLM inference on Google Cloud’s AI Hypercomputer using Google Kubernetes Engine, vLLM, and A3 Ultra (H200) GPUs. The recipe separates prefill and decode phases across dedicated GPU pools to reduce contention and lower latency. It includes single-node and multi-node examples and step-by-step deployment actions. The repository provides configuration guidance and future plans for broader GPU and engine support.
