All news with the #gke inference gateway tag

Mon, October 20, 2025

G4 VMs: High-performance P2P Fabric for Multi‑GPU Workloads

🚀 Google Cloud's G4 VMs, now generally available, pair NVIDIA RTX PRO 6000 Blackwell GPUs with a custom, software-defined PCIe fabric that enables high-performance peer-to-peer (P2P) GPU communication. The platform accelerates collective operations such as All-Gather and All-Reduce with no code changes, delivering up to 2.2x faster collectives. For tensor-parallel inference, customers can see up to 168% higher throughput and up to 41% lower inter-token latency. G4 also integrates with GKE Inference Gateway for horizontal scaling in production deployments.
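
Since the P2P fabric speeds up standard NCCL collectives transparently, the code path is just ordinary distributed calls. Below is a minimal sketch, assuming a multi-GPU G4 VM and a `torchrun` launch; the tensor sizes and script name are illustrative, not from the announcement.

```python
# Minimal sketch: NCCL All-Reduce / All-Gather across the GPUs of one VM.
# Launch with: torchrun --nproc_per_node=<num_gpus> collectives_demo.py
# On G4, the P2P fabric accelerates these collectives with no code changes.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL picks P2P paths when available
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # All-Reduce: sum a tensor in place across every GPU in the group.
    x = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # All-Gather: collect each rank's shard onto every GPU.
    shards = [torch.empty_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, x)

    if rank == 0:
        print(f"all_reduce sum of ranks 0..{dist.get_world_size() - 1}: {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```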

read more →

Wed, September 10, 2025

GKE Inference Gateway and Quickstart Achieve GA Status

🚀 GKE Inference Gateway and GKE Inference Quickstart are now generally available, bringing production-ready inference features built on AI Hypercomputer. New capabilities include prefix-aware load balancing, disaggregated serving, vLLM support on TPUs including Ironwood, and model streaming with Anywhere Cache to cut model load times. These features target lower time-to-first-token (TTFT) and time-per-output-token (TPOT), higher throughput, and lower inference costs, while Quickstart offers data-driven accelerator and configuration recommendations.
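
To make the TTFT/TPOT targets concrete, here is a minimal measurement sketch against an OpenAI-compatible streaming endpoint, such as vLLM served behind GKE Inference Gateway. The endpoint URL and model name are placeholders, and each streamed chunk is counted as roughly one token; none of these values come from the announcement.

```python
# Minimal sketch: measure time-to-first-token (TTFT) and time-per-output-token
# (TPOT) from an OpenAI-compatible streaming /v1/completions endpoint.
import json
import time

import requests

ENDPOINT = "http://<gateway-ip>/v1/completions"  # hypothetical gateway address
payload = {
    "model": "my-model",  # placeholder model name
    "prompt": "Explain KV caching in one sentence.",
    "max_tokens": 64,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
token_count = 0

with requests.post(ENDPOINT, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            token_count += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()

if first_token_at is None:
    raise SystemExit("no tokens received")

elapsed = time.perf_counter() - start
ttft = first_token_at - start
tpot = (elapsed - ttft) / max(token_count - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms "
      f"over {token_count} streamed chunks")
```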

read more →