Experimenting with GPUs, GKE DRANET and Inference Gateway
π§ This post walks through deploying and serving a large model on Google Kubernetes Engine using managed DRANET and NVIDIA B200 GPUs. It explains how RDMA networking is provisioned as an isolated regional VPC for low-latency GPU-to-GPU communication and how to provision A4 nodes and reservations for RoCEv2-capable accelerators. The author provides example gcloud and kubectl commands to create the cluster, a GPU node pool with DRA labels, a ResourceClaimTemplate for mrdma workloads, and steps to serve a DeepSeek model privately via GKE Inference Gateway and a regional internal Application Load Balancer.
