
All news with #gke tag

Tue, November 18, 2025

Using Private NAT for Overlapping Private IP Spaces

🔒 Google Cloud's Private NAT enables secure private-to-private translation to connect networks with overlapping or non-routable IPv4 ranges without running NAT appliances. As a managed Cloud NAT feature, it delivers high availability, automatic scalability, and centralized control for hybrid and multi‑VPC topologies. The post includes practical gcloud examples and Network Connectivity Center use cases to guide implementation.
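
The post's examples use gcloud; as a rough equivalent, here is a minimal sketch using the google-cloud-compute Python client to add a Private NAT configuration to an existing Cloud Router. Project, region, and resource names are placeholders, and the Network Connectivity Center rule setup the post covers is omitted.

```python
# Minimal sketch (not the post's exact commands): add a Private NAT
# configuration to an existing Cloud Router with the google-cloud-compute
# client. Project, region, and resource names are placeholders.
from google.cloud import compute_v1

project, region = "my-project", "us-central1"
routers = compute_v1.RoutersClient()

router = routers.get(project=project, region=region, router="my-router")
router.nats.append(
    compute_v1.RouterNat(
        name="private-nat",
        type_="PRIVATE",  # private-to-private translation, not internet egress
        # Translate traffic from the listed subnets; the NAT ranges come from
        # a dedicated PRIVATE_NAT-purpose subnet, configured separately.
        source_subnetwork_ip_ranges_to_nat="LIST_OF_SUBNETWORKS",
        subnetworks=[
            compute_v1.RouterNatSubnetworkToNat(
                name=f"projects/{project}/regions/{region}/subnetworks/app-subnet",
                source_ip_ranges_to_nat=["ALL_IP_RANGES"],
            )
        ],
    )
)
routers.patch(
    project=project, region=region, router="my-router", router_resource=router
).result()  # wait for the operation to complete
```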

read more →

Mon, November 17, 2025

Hands-on with Gemma 3: Deploying Open Models on GCP

🚀 Google Cloud introduces hands-on labs for Gemma 3, a family of lightweight open models offering multimodal (text and image) capabilities and efficient performance on smaller hardware footprints. The labs present two deployment paths: a serverless approach using Cloud Run with GPU support, and a platform approach using GKE for scalable production environments. Choose Cloud Run for simplicity and cost efficiency, or GKE Autopilot for control and robust orchestration, when moving models from local testing to production.
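
For a sense of the developer experience once a model is serving, here is a minimal sketch that queries a Gemma 3 deployment on Cloud Run, assuming the service exposes an OpenAI-compatible endpoint (as vLLM- or Ollama-based deployments typically do); the URL, auth handling, and model name are placeholders.

```python
# Minimal sketch: query a Gemma 3 service on Cloud Run through an
# OpenAI-compatible API. The URL, auth handling, and model ID are
# placeholders; match them to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://gemma-service-xyz-uc.a.run.app/v1",  # your Cloud Run URL
    api_key="placeholder",  # private Cloud Run services expect an ID token
)

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder; use the model you deployed
    messages=[{"role": "user", "content": "Summarize Gemma 3 in one sentence."}],
)
print(response.choices[0].message.content)
```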

read more →

Tue, November 11, 2025

Agent Sandbox: Kubernetes Enhancements for AI Agents

🛡️ Agent Sandbox is a new Kubernetes primitive designed to run AI agents with strong, kernel-level isolation. Built on gVisor with optional Kata Containers and developed in the Kubernetes community as a CNCF project, it reduces the risks of agent-executed code. On GKE, managed gVisor, container-optimized compute, and pre-warmed sandbox pools deliver sub-second startup latency and up to a 90% improvement in cold-start time. A Python SDK and a simple API abstract away the YAML so AI engineers can manage sandbox lifecycles without deep infrastructure expertise; Agent Sandbox is open source and deployable on GKE today.
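
The sketch below illustrates the underlying idea of a Sandbox as a Kubernetes resource, using the standard Kubernetes Python client; the API group, version, and spec fields are assumptions for illustration rather than the project's confirmed schema, and the actual Python SDK wraps this lifecycle more directly.

```python
# Illustrative only: a Sandbox expressed as a custom resource via the
# standard Kubernetes Python client. The API group/version and spec fields
# below are assumptions, not the project's confirmed schema.
from kubernetes import client, config

config.load_kube_config()

sandbox = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",  # assumed group/version
    "kind": "Sandbox",
    "metadata": {"name": "demo-sandbox"},
    "spec": {  # assumed shape: a pod template that requests gVisor isolation
        "podTemplate": {
            "spec": {
                "runtimeClassName": "gvisor",
                "containers": [{"name": "agent", "image": "python:3.12-slim"}],
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="agents.x-k8s.io", version="v1alpha1",
    namespace="default", plural="sandboxes", body=sandbox,
)
```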

read more →

Tue, November 11, 2025

GKE: Unified Platform for Agents, Scale, and Inference

🚀 Google details a broad set of GKE and Kubernetes enhancements announced at KubeCon to address agentic AI, large-scale training, and latency-sensitive inference. GKE introduces the gVisor-based Agent Sandbox for isolated agent execution, plus a managed GKE Agent Sandbox that adds snapshots and optimized compute. The platform also delivers faster autoscaling through Autopilot compute classes, the Buffers API, and container image streaming, while inference is accelerated by GKE Inference Gateway, Pod Snapshots, and Inference Quickstart.

read more →

Mon, November 10, 2025

Full-Stack Approach to Scaling RL for LLMs on GKE

🚀 Google Cloud describes a full-stack solution for running high-scale Reinforcement Learning (RL) with LLMs, combining custom TPU hardware, NVIDIA GPUs, and optimized software libraries. The approach addresses RL's hybrid demands—reducing sampler latency, easing memory contention across actor/critic/reward models, and accelerating weight copying—by co-designing hardware, storage (Managed Lustre, Cloud Storage), and orchestration on GKE. The blog emphasizes open-source contributions (vLLM, llm-d, MaxText, Tunix) and integrations with Ray and NeMo RL recipes to improve portability and developer productivity. It also highlights mega-scale orchestration and multi-cluster strategies for running production RL jobs across tens of thousands of nodes.

read more →

Mon, November 10, 2025

Google Cloud N4D VMs with AMD EPYC Turin Generally Available

🚀 Google Cloud announces general availability of the N4D machine series built on 5th Gen AMD EPYC 'Turin' processors and Google's Titanium infrastructure. N4D targets cost-optimized, general-purpose workloads — web and app servers, data analytics, and containerized microservices — with up to 96 vCPUs, 768 GB DDR5, 50 Gbps networking, and Hyperdisk storage. Google cites up to 3.5x web-serving throughput versus N2D and material price-performance gains for general compute and Java workloads.

read more →

Fri, November 7, 2025

Tiered KV Cache Boosts LLM Performance on GKE with HBM

🚀 LMCache implements a node-local, tiered KV cache on GKE that extends the GPU's HBM-backed key-value store into CPU RAM and local SSD, increasing effective cache capacity and hit ratio. In benchmarks using Llama-3.3-70B-Instruct on an A3 Mega instance (8×nvidia-h100-mega-80gb), configurations that added RAM and SSD tiers reduced Time-to-First-Token and materially increased token throughput for long system prompts. The results demonstrate a practical approach to scaling context windows while balancing cost and latency on GKE.
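
A minimal sketch of the tiering setup, assuming the LMCache connector for vLLM and LMCache's environment-variable configuration; tier sizes, paths, and option names are placeholders and should be checked against the LMCache documentation.

```python
# Minimal sketch of the tiering setup: vLLM with the LMCache connector, with
# the KV cache spilling from GPU HBM into CPU RAM and local SSD. Sizes and
# paths are placeholders; verify option names against the LMCache docs.
import os

os.environ["LMCACHE_LOCAL_CPU"] = "True"                      # CPU RAM tier
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "80"               # GB of CPU RAM
os.environ["LMCACHE_LOCAL_DISK"] = "file:///mnt/ssd/lmcache"  # local SSD tier
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"             # GB on SSD

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,  # e.g., the 8x H100 A3 Mega shape from the benchmark
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1", kv_role="kv_both"
    ),
)
outputs = llm.generate(["<long shared system prompt>"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```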

read more →

Tue, November 4, 2025

Kubernetes Introduces Control-Plane Minor-Version Rollback

🔁 Google and the Kubernetes community introduced control-plane minor-version rollback in Kubernetes 1.33, giving operators a safe, observable path to revert control-plane upgrades. The emulated-version model from KEP-4330 separates binary upgrades from API and storage-version transitions, turning an upgrade into a two-step process that can be validated before changes are committed. The capability is available in open-source Kubernetes and will soon be generally available in GKE 1.33, reducing upgrade risk and shortening recovery time from unexpected regressions.

read more →

Tue, November 4, 2025

How Google Cloud Networking Supports AI Workloads at Scale

🔗 Networking is a critical enabler for AI on Google Cloud, connecting models, storage, and inference endpoints while preserving security and performance. The post outlines seven capabilities—from private API access and RDMA-backed GPU interconnects to hybrid Cross-Cloud links—that reduce latency, prevent data exfiltration, and simplify model serving. It also highlights options for exposing inference (managed services, GKE, load balancing) and previews AI-driven network operations using Gemini.

read more →

Mon, November 3, 2025

Ray on GKE: New AI Scheduling and Scaling Features

🚀 Google Cloud and Anyscale describe tighter integration between Ray and Kubernetes to improve distributed AI scheduling and autoscaling on GKE. The release introduces a Ray Label Selector API (Ray v2.49) that aligns task, actor, and placement-group placement with Kubernetes labels and GKE custom compute classes, enabling targeted placement and fallback strategies across accelerator types and capacity markets such as Spot. It also adds Dynamic Resource Allocation for A4X/GB200 racks, writable cgroups for Ray resource isolation on GKE v1.34+, TPU/JAX training support via a JAXTrainer in Ray v2.49, and in-place pod resizing (Kubernetes v1.33) for vertical autoscaling and higher efficiency.
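
A minimal sketch of the label-selector idea, assuming the Ray v2.49 label_selector argument on ray.remote; the label key and value here are illustrative defaults rather than a GKE-specific mapping.

```python
# Minimal sketch of the label-selector API (assumed to be the Ray v2.49
# label_selector argument): pin a task to nodes whose labels match. The key
# and value are illustrative; on GKE they can mirror node-pool labels and
# custom compute classes.
import ray

ray.init()

@ray.remote(num_gpus=1, label_selector={"ray.io/accelerator-type": "H100"})
def run_inference(prompt: str) -> str:
    # Scheduled only on nodes labeled as H100 machines.
    return f"processed: {prompt}"

print(ray.get(run_inference.remote("hello")))
```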

read more →

Mon, November 3, 2025

Ray on TPUs with GKE: Native, Lower-Friction Integration

🚀 Google Cloud and Anyscale have enhanced the Ray experience on Cloud TPUs with GKE to reduce setup complexity and improve performance. The new ray.util.tpu library and a SlicePlacementGroup with a label_selector API automatically reserve co-located TPU slices and preserve SPMD topology to avoid resource fragmentation. Ray Train and Ray Serve gain expanded TPU support including alpha JAX training, while TPU metrics and libtpu logs appear in the Ray Dashboard for faster troubleshooting and migration between GPUs and TPUs.
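
The sketch below only illustrates how slice-level reservation is meant to read; the ray.util.tpu entry points, arguments, and accessors are assumptions inferred from the announcement, so consult the Ray documentation for the real API.

```python
# Illustrative sketch only: every name and argument below (SlicePlacementGroup,
# its constructor, the placement_group accessor) is an assumption inferred
# from the announcement, not a verified API; check the Ray docs.
import ray
from ray.util.tpu import SlicePlacementGroup  # assumed import path

ray.init()

# Reserve a co-located TPU slice as one unit so SPMD topology is preserved
# and the slice is not fragmented across unrelated jobs (assumed arguments).
spg = SlicePlacementGroup(topology="4x4", accelerator_type="TPU-V6E")

@ray.remote(resources={"TPU": 4})
def train_shard(shard_id: int) -> int:
    return shard_id  # each worker lands on a host inside the reserved slice

refs = [
    train_shard.options(placement_group=spg.placement_group).remote(i)  # assumed accessor
    for i in range(4)
]
print(ray.get(refs))
```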

read more →

Fri, October 31, 2025

GKE and Gemini CLI Integration Enhances Developer Workflows

🚀 Google has open-sourced the GKE Gemini CLI extension, bringing Google Kubernetes Engine directly into the Gemini CLI ecosystem while also functioning as an MCP server for other MCP clients. The extension injects GKE-specific context, tools, and tailored prompts so developers can use shorter, more natural language interactions and integrated slash commands to complete complex workflows. It simplifies common operations—like selecting models and accelerators or generating Kubernetes manifests for inference—while improving compatibility with Cloud Observability. The project is actively maintained with regular releases and community contributions.

read more →

Thu, October 30, 2025

Global Payments: Resilient Scale Architecture with Cloud SQL

☁️ Global Payments partnered with Google Cloud to design a multi-region, highly available database architecture using Cloud SQL Enterprise Plus. The deployment spans three regions with zonal replication, read replicas, cascading replication, and Cloud SQL Auth Proxy integration to support low-latency reads and rapid failover. This configuration yields near-zero planned downtime, sub-minute RTO and zero RPO for Tier 1 workloads, while meeting PCI DSS, GDPR, and NIST requirements.

read more →

Tue, October 28, 2025

Giles AI on Google Cloud: Transforming Medical Research

🚀 Giles AI migrated its healthcare-focused platform to Google Cloud to reduce latency, improve scalability, and accelerate developer velocity. Using Google Kubernetes Engine, Cloud Run, and Compute Engine, the company orchestrates complex clinical data flows and routes prompts through Vertex AI and Model Garden to remain model-agnostic. Data storage and extraction are handled with Cloud SQL, Cloud Storage, and Document AI, while Cloud Armor and Security Command Center bolster security and compliance. Early customer results include dramatic reductions in research time and improvements in response accuracy.

read more →

Tue, October 28, 2025

A4X Max, GKE Networking, and Vertex AI Training Now Shipping

🚀 Google Cloud is expanding its NVIDIA collaboration with the new A4X Max instances powered by NVIDIA GB300 NVL72, delivering 72 GPUs with high‑bandwidth NVLink and shared memory for demanding multimodal reasoning. GKE now supports DRANET for topology‑aware RDMA scheduling and integrates NVIDIA NeMo Guardrails into GKE Inference Gateway, while Vertex AI Model Garden will host NVIDIA Nemotron models. Vertex AI Training adds NeMo and NeMo‑RL recipes and a managed Slurm environment to accelerate large‑scale training and deployment.

read more →

Tue, October 28, 2025

Google Cloud launches managed DRANET for GKE with A4X Max

🚀 Google Cloud is previewing managed DRANET on GKE, enabling Kubernetes to treat high-performance RDMA network interfaces as schedulable resources. The integration aligns NICs and GPUs by NUMA topology to reduce latency and increase throughput, while abstracting away operational complexity. It launches with the new A4X Max instances to deliver topology-aware networking for large multi-GPU AI workloads. Developers can request specific network interfaces in pod specs and rely on GKE to co-schedule NICs and accelerators, improving utilization and simplifying operations.
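
A minimal sketch of what requesting a NIC through Dynamic Resource Allocation looks like; the ResourceClaimTemplate and Pod structure follow upstream Kubernetes DRA, while the device class name and image are placeholders rather than DRANET's confirmed values.

```python
# Minimal sketch of a DRA-style NIC request. The ResourceClaimTemplate/Pod
# structure follows upstream Kubernetes DRA (resource.k8s.io); the device
# class name and image are placeholders, not DRANET's confirmed values.
import yaml

claim_template = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "rdma-nic"},
    "spec": {"spec": {"devices": {"requests": [
        {"name": "nic", "deviceClassName": "rdma.dranet.dev"}  # assumed class
    ]}}},
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "training-worker"},
    "spec": {
        "resourceClaims": [{"name": "nic", "resourceClaimTemplateName": "rdma-nic"}],
        "containers": [{
            "name": "worker",
            "image": "us-docker.pkg.dev/my-project/train:latest",  # placeholder
            "resources": {
                "limits": {"nvidia.com/gpu": 1},
                "claims": [{"name": "nic"}],  # NIC co-scheduled with the GPU
            },
        }],
    },
}

print(yaml.safe_dump_all([claim_template, pod], sort_keys=False))
```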

read more →

Mon, October 20, 2025

Design Patterns for Scalable AI Agents on Google Cloud

🤖 This post explains how System Integrator partners can build, scale, and manage enterprise-grade AI agents using Google Cloud technologies like Agent Engine, the Agent Development Kit (ADK), and Gemini Enterprise. It summarizes architecture patterns including runtime, memory, the Model Context Protocol (MCP), and the Agent-to-Agent (A2A) protocol, and contrasts managed Agent Engine with self-hosted options such as Cloud Run or GKE. Customer examples from Deloitte and Quantiphi illustrate supply chain and sales automation benefits. The guidance highlights security, observability, persistent memory, and model tuning for enterprise readiness.
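
As a concrete starting point for the Agent Engine or self-hosted paths, here is a minimal ADK agent sketch; the tool, instruction, and model ID are illustrative.

```python
# Minimal ADK sketch for the managed/self-hosted discussion above: a single
# agent with one tool. The tool, instruction, and model ID are illustrative.
from google.adk.agents import Agent

def check_inventory(sku: str) -> dict:
    """Toy tool: report stock for a SKU (placeholder implementation)."""
    return {"sku": sku, "in_stock": 42}

root_agent = Agent(
    name="supply_chain_agent",
    model="gemini-2.0-flash",  # placeholder model ID
    instruction="Answer supply-chain questions; call tools for stock lookups.",
    tools=[check_inventory],
)
# Run locally with the ADK CLI (`adk run`), then host on Agent Engine or a
# self-managed runtime such as Cloud Run or GKE.
```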

read more →

Mon, October 20, 2025

AI Hypercomputer Update: vLLM on TPUs and Tooling Advances

🔧 Google Cloud’s Q3 AI Hypercomputer update highlights inference improvements and expanded tooling to accelerate model serving and diagnostics. The release integrates vLLM with Cloud TPUs via the new tpu-inference plugin, unifying JAX and PyTorch runtimes and boosting TPU inference for models such as Gemma, Llama, and Qwen. Additional launches include improved XProf profiling and Cloud Diagnostics XProf, an AI inference recipe for NVIDIA Dynamo, NVIDIA NeMo RL recipes, and GA of the GKE Inference Gateway and Quickstart to help optimize latency and cost.
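
A minimal sketch of the serving path, assuming the tpu-inference plugin is installed alongside vLLM on a TPU host so the standard vLLM API picks up the TPU backend; the model and sampling settings are placeholders.

```python
# Minimal sketch: the unchanged vLLM API serving on Cloud TPU, assuming the
# tpu-inference plugin is installed on the TPU host so vLLM picks the TPU
# backend. Model and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-27b-it")
outputs = llm.generate(
    ["Explain KV caching in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```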

read more →

Mon, October 20, 2025

Google Cloud G4 VMs: NVIDIA RTX PRO 6000 Blackwell GA

🚀 The G4 VM is now generally available on Google Cloud, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs and offering up to 768 GB of GDDR7 memory per instance. It targets latency-sensitive and regulated workloads for generative AI, real-time rendering, simulation, and virtual workstations. Features include FP4 precision support, Multi-Instance GPU (MIG) partitioning, an enhanced PCIe P2P interconnect for faster multi‑GPU All-Reduce, and an NVIDIA Omniverse VMI on Marketplace for industrial digital twins.

read more →

Fri, October 17, 2025

Use Gemini CLI to Deploy Cost-Effective LLM Workloads on GKE

🛠️ Google Cloud demonstrates how the Gemini CLI and GKE Inference Quickstart integrate via the Model Context Protocol (MCP) to streamline selecting, benchmarking, and deploying LLMs on GKE. The post outlines installation steps, example prompts to discover cost and performance trade-offs, and how manifests can be generated for target accelerators. This approach reduces manual tuning and provides data-driven recommendations to optimize cost-per-token while preserving performance.

read more →