Multi‑cluster GKE inference with TPUs and DRANET
🧭 This blog documents an experiment using Google Cloud to deploy a Gemma 3 inference workload across two regional GKE clusters, leveraging TPU v6e instances, managed DRANET for accelerator networking, and a multi-cluster Inference Gateway for cross‑region routing and failover. It describes building VPCs, reserving internal IPs, configuring Cloud Storage FUSE for model storage, creating TPU node pools with managed DRANET, registering clusters into a GKE Fleet, and deploying the inference server and gateway with health checks and autoscaling metrics. The objective is resilient, low‑latency routing to the nearest region with automatic failover to the other region if one fails.