<ciso brief />

All news tagged #google kubernetes engine

34 articles

GKE Agent Sandbox GA and Agent Substrate Launch on GKE

🚀 Google Cloud announced general availability of GKE Agent Sandbox and introduced the open-source Agent Substrate. Agent Sandbox is a cloud-native execution environment designed for AI agents, offering pod snapshots to suspend idle workloads, an integrated warm pool for sub-second provisioning, gVisor and pluggable kernel isolation, and standby suspended VMs to reduce warm-pool cost. Agent Substrate aims to provide a minimal control plane and scheduler optimizations to support ultra-dense, low-latency agent workloads at scale.

GKE Node Startup Up to 4x Faster for Autopilot Workloads

🚀 Google Cloud has reworked GKE node provisioning to deliver up to 4× faster node startup for qualifying nodes, reducing cold-start latency out of the box. This architectural upgrade combines intelligent compute buffers, fast-starting virtual machines, and a redesigned control plane so clusters scale more quickly without any customer configuration. The improvement is live for GKE Autopilot on select NVIDIA and general-purpose instance types, lowering the need to over-provision and speeding AI inference.

GKE Cloud Storage FUSE Profiles for AI/ML Workload I/O

⚡ GKE’s Cloud Storage FUSE Profiles automate performance tuning for AI/ML workloads by providing pre-defined, dynamically managed StorageClasses optimized for training, serving, and checkpointing. Instead of manually adjusting many mount and CSI options, users select a profile and GKE scans the bucket and node resources to calculate cache sizes and backing media. The CSI driver mounts the volume with those calculated options and dynamically adjusts cache behavior using real-time signals to maximize throughput while protecting node stability.

Experimenting with GPUs, GKE DRANET and Inference Gateway

🔧 This post walks through deploying and serving a large model on Google Kubernetes Engine using managed DRANET and NVIDIA B200 GPUs. It explains how RDMA networking is provisioned as an isolated regional VPC for low-latency GPU-to-GPU communication and how to provision A4 nodes and reservations for RoCEv2-capable accelerators. The author provides example gcloud and kubectl commands to create the cluster, a GPU node pool with DRA labels, a ResourceClaimTemplate for mrdma workloads, and steps to serve a DeepSeek model privately via GKE Inference Gateway and a regional internal Application Load Balancer.

Top Infrastructure and GKE Sessions at Cloud Next '26

📣 This guide highlights the Infrastructure and GKE sessions at Cloud Next '26, offering a curated set of technical breakouts across Compute, AI infrastructure, migration, modernization, and scale. Attend spotlights and deep dives to hear from Google leaders and engineering teams about Gemini, Google Distributed Cloud, and the AI Hypercomputer. Sessions cover TPU/GPU roadmaps, high‑performance compute, agentic AI pipelines, and practical migration and FinOps strategies designed to help organizations build resilient, AI‑ready infrastructure.

Unifying Real-Time and Async Inference with GKE Platform

🚀 GKE Inference Gateway enables teams to run both real-time and asynchronous AI inference on a single shared pool of accelerators (GPUs/TPUs). It applies latency-aware scheduling using runtime signals such as KV cache utilization to prioritize deterministic, low-latency requests while treating queued batch work as 'filler' via an Async Processor Agent integrated with Cloud Pub/Sub. The open-source stack reduces idle capacity, consolidates software stacks, and preserves strict priority and retry controls for reliable delivery.
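The strict-priority idea in this summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the gateway's implementation: the class name `SharedPoolScheduler`, the `kv_cache_ceiling` threshold, and its 0.8 default are hypothetical and do not come from the article.

```python
from collections import deque


class SharedPoolScheduler:
    """Toy scheduler: real-time requests always go first, batch work is filler."""

    def __init__(self, kv_cache_ceiling=0.8):
        # Hypothetical threshold: above this KV cache utilization, no filler runs.
        self.kv_cache_ceiling = kv_cache_ceiling
        self.realtime = deque()
        self.batch = deque()

    def submit(self, request, realtime):
        # Route the request to the queue that matches its class.
        (self.realtime if realtime else self.batch).append(request)

    def next_request(self, kv_cache_utilization):
        # Strict priority: any waiting real-time request is served first.
        if self.realtime:
            return self.realtime.popleft()
        # Batch work only fills capacity while the accelerator has headroom.
        if self.batch and kv_cache_utilization < self.kv_cache_ceiling:
            return self.batch.popleft()
        return None
```

In this sketch a queued batch request waits whenever `kv_cache_utilization` is at or above the ceiling, which is one simple way to keep filler work from competing with latency-sensitive traffic.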

GKE Active Buffer reduces Kubernetes scale-out latency

⚡ Active Buffer is a GKE preview that implements the Kubernetes CapacityBuffer API to remove scale-out latency by keeping spare node capacity warm. It replaces manual 'balloon' pod hacks and costly over-provisioning with a declarative resource the Cluster Autoscaler treats as pending demand, so critical pods can land instantly. Buffers can be sized by fixed replicas, percentage of deployments, or resource limits.
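The three sizing modes can be illustrated with a small Python sketch. It does not reproduce the CapacityBuffer schema: the `BufferSpec` fields, the CPU-only resource limit, and the rule that the smallest applicable value wins when several modes are set are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class BufferSpec:
    # Hypothetical fields, one per sizing mode named in the summary.
    replicas: Optional[int] = None              # fixed number of buffer pods
    percentage: Optional[int] = None            # percent of a deployment's replicas
    cpu_limit_millicores: Optional[int] = None  # cap on total buffered CPU


def buffer_pod_count(spec, deployment_replicas, pod_cpu_millicores):
    """Return how many pod-sized slots of spare capacity to keep warm."""
    candidates = []
    if spec.replicas is not None:
        candidates.append(spec.replicas)
    if spec.percentage is not None:
        candidates.append(math.ceil(deployment_replicas * spec.percentage / 100))
    if spec.cpu_limit_millicores is not None:
        candidates.append(spec.cpu_limit_millicores // pod_cpu_millicores)
    if not candidates:
        raise ValueError("at least one sizing mode must be set")
    # Assumption: when several modes are set, the most restrictive one applies.
    return min(candidates)
```

For example, a 10 percent buffer on a 25-replica deployment rounds up to 3 pod-sized slots in this sketch.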

Multi-Cluster GKE Inference Gateway for Scalable AI

🚀 Google Cloud announced the preview of the multi-cluster GKE Inference Gateway, an extension of the GKE Gateway API that provides model-aware, intelligent load balancing across multiple GKE clusters and regions. It centralizes ingress configuration in a dedicated "config cluster" while exporting model-serving backends from distributed "target clusters." The gateway pools GPUs/TPUs, supports routing based on custom metrics, and offers in-flight request limits to optimize latency, utilization, and fault tolerance.
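A rough Python sketch of how an in-flight request limit and a custom metric might combine when choosing a backend. The `Backend` fields, the "lower metric is better" convention, and the cluster names in the usage note are hypothetical; the article does not describe the gateway's selection logic at this level.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    cluster: str        # name of a target cluster (hypothetical)
    in_flight: int      # requests currently being served
    max_in_flight: int  # in-flight request limit for this backend
    metric: float       # custom load signal; lower means less loaded (assumption)


def pick_backend(backends):
    """Pick the least-loaded backend that is still under its in-flight limit."""
    eligible = [b for b in backends if b.in_flight < b.max_in_flight]
    if not eligible:
        return None  # every backend is saturated; the caller must queue or shed load
    return min(eligible, key=lambda b: b.metric)
```

With two backends in clusters `us-east` and `eu-west`, a backend that has reached its limit is skipped even if its metric looks better, so traffic spills over to the other cluster.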

GKE Adds Native Custom Metrics for Horizontal Scaling

🚀 Google Cloud now provides native custom metrics for GKE Horizontal Pod Autoscaler (HPA), eliminating the need for external adapters, agents, and complex Workload Identity bindings. The agentless design sources pod metrics directly and exposes them via a new AutoscalingMetric controller, reducing latency, cost, and operational fragility. Users declare an AutoscalingMetric that points to a pod metric and reference it in an HPA, allowing HPAs to scale on custom workload signals just like CPU or memory. Google frames this as an initial step toward intent-based autoscaling for AI, gaming, batch, and other demanding workloads.
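Whatever the metric source, the Horizontal Pod Autoscaler applies the same documented rule: desired replicas = ceil(current replicas × current metric value / target value), skipping changes when the ratio is within a tolerance (0.1 by default). The Python sketch below restates that rule for a custom signal; the function name, the replica bounds, and the example values are illustrative assumptions, and it does not model the AutoscalingMetric resource itself.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Apply the standard HPA scaling formula to a single metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds that an HPA's minReplicas/maxReplicas would impose.
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 pods averaging 150 queued items each against a target of 100.
print(desired_replicas(4, 150, 100))  # prints 6
```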

GKE for Telco: Building a Resilient AI-Native Core

🚀 Google Cloud demonstrates how Google Kubernetes Engine (GKE) can form a high-performance foundation for telco modernization via two complementary paths: cloud-centric evolution for full cloud migration and strategic hybrid modernization to retain local control over latency-sensitive functions. The post highlights carrier-grade enhancements—multi-networking API, simulated L2, a telco CNI, persistent IP, and GKE IP route—with sub-second convergence and HA Policy to minimize downtime. It frames modernization as a means to enable predictive AIOps, intent-driven automation, faster time-to-market, and new monetization opportunities through AI and data platforms.

Starfish Space Uses Google Cloud for Satellite Servicing

🚀 Starfish Space is using Google Cloud to accelerate development and validation of its autonomous satellite-servicing vehicle, Otter. The company runs millions of Monte Carlo simulations on Google Compute Engine and Google Kubernetes Engine to train and harden docking software in virtual orbital environments. Managed Kubernetes lets engineers scale high-performance compute for complex simulations and control costs by scaling down resources when not required. This software-first model supports contracts with NASA, the U.S. Space Force, SES, and the Space Development Agency.

Faster GKE Node Pool Auto-Creation with Concurrency

🚀 Google Cloud announced concurrency for GKE node pool auto-creation, significantly reducing provisioning latency and improving autoscaling responsiveness. Internal benchmarks report up to an 85% improvement in provisioning speed, especially for heterogeneous, multi-tenant, and AI workloads that require multiple distinct node types. The improvement is available in version 1.34.1-gke.1829001 and requires only upgrading GKE; no additional configuration is necessary.

Supercharging Agentic Workloads on GKE with Sandboxing

🔒 The post summarizes a recent Agent Factory episode where Google product leaders discuss running agentic workloads on GKE. It highlights the Agent Development Kit (ADK), containerized deployments to Artifact Registry, and why Kubernetes provides governance and fine-grained control for large-scale agents. Google demonstrated an Agent Sandbox using gVisor and strict network policies, and introduced Pod Snapshots to cut sandbox startup from minutes to seconds, enabling lower-latency, secure agent workflows.

Designing for GKE's Flat Network: Practical Recommendations

🔍 This post previews Google's new design recommendation for leveraging GKE's flat network, explaining how it differs from island-mode networking and how teams can adapt existing architectures. It highlights recommended patterns and a reference design that emulates island-mode behavior within the flat model. The guidance focuses on IP address management, scalability, and integration points to ease migration for critical workloads such as generative AI.

Building the Largest Known GKE Cluster: 130,000 Nodes

🚀 Google Cloud engineers demonstrated an experimental GKE cluster running 130,000 nodes to validate extreme scalability for AI/ML workloads. The test sustained control-plane throughput near 1,000 operations per second, supported over one million datastore objects, and achieved a baseline of 130,000 Pods launching in 3 minutes 40 seconds. The project combined API-server caching KEPs, a Spanner-backed key-value storage backend, and job-level orchestration via Kueue to enable predictable admission, rapid preemption, and efficient utilization at massive scale.
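For a sense of scale, the Pod launch figure can be turned into an average rate. The short Python calculation below uses only the two numbers quoted in the summary and assumes nothing about how evenly the launches were spread over the interval.

```python
pods = 130_000
elapsed_seconds = 3 * 60 + 40  # 3 minutes 40 seconds = 220 seconds

# Average launch rate over the whole interval.
pods_per_second = pods / elapsed_seconds
print(f"{pods_per_second:.0f} Pods per second on average")  # prints 591 Pods per second on average
```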

Hands-on with Gemma 3: Deploying Open Models on GCP

🚀 Google Cloud introduces hands-on labs for Gemma 3, a family of lightweight open models offering multimodal (text and image) capabilities and efficient performance on smaller hardware footprints. The labs present two deployment paths: a serverless approach using Cloud Run with GPU support, and a platform approach using GKE for scalable production environments. Choose Cloud Run for simplicity and cost-efficiency or GKE Autopilot for control and robust orchestration to move models from local testing to production.

GKE: Unified Platform for Agents, Scale, and Inference

🚀 Google details a broad set of GKE and Kubernetes enhancements announced at KubeCon to address agentic AI, large-scale training, and latency-sensitive inference. GKE introduces Agent Sandbox (gVisor-based) for isolated agent execution and a managed GKE Agent Sandbox with snapshots and optimized compute. The platform also delivers faster autoscaling through Autopilot compute classes, Buffers API, and container image streaming, while inference is accelerated by GKE Inference Gateway, Pod Snapshots, and Inference Quickstart.

Tiered KV Cache Boosts LLM Performance on GKE with HBM

🚀 LMCache implements a node-local, tiered KV cache on GKE that extends the GPU HBM-backed key-value store into CPU RAM and local SSD, increasing effective cache capacity and hit ratio. In benchmarks using Llama-3.3-70B-Instruct on an A3 Mega instance (8×nvidia-h100-mega-80gb), configurations that added RAM and SSD reduced Time-to-First-Token and materially increased token throughput for long system prompts. The results demonstrate a practical approach to scaling context windows while balancing cost and latency on GKE.
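The tiering idea can be sketched as a small Python class: look in the fastest tier first, fall through to slower ones, promote on a hit, and demote the least recently used entry when a tier overflows. This is a generic illustration, not LMCache's API; the `TieredCache` class, the LRU policy, the tier names, and the tiny capacities in the usage note are all assumptions.

```python
from collections import OrderedDict


class TieredCache:
    """Toy tiered cache with LRU demotion from fast tiers to slow tiers."""

    def __init__(self, capacities):
        # capacities maps tier name to entry count, ordered fastest to slowest,
        # for example {"hbm": 2, "ram": 4, "ssd": 8}.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def _insert(self, key, value, level):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: the entry is dropped
        _, capacity, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # mark as most recently used
        if len(store) > capacity:
            # Demote the least recently used entry to the next tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(old_key, old_value, level + 1)

    def put(self, key, value):
        # Remove any stale copy so a key lives in exactly one tier.
        for _, _, store in self.tiers:
            store.pop(key, None)
        self._insert(key, value, 0)

    def get(self, key):
        """Return (value, tier_name) for a hit, or (None, None) for a miss."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier
                return value, name
        return None, None
```

With `TieredCache({"hbm": 1, "ram": 1, "ssd": 1})`, inserting three prefixes in a row leaves the oldest one in the `ssd` tier, and reading it back reports an `ssd` hit before promoting it to `hbm`.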

GKE and Gemini CLI Integration Enhances Developer Workflows

🚀 Google has open-sourced the GKE Gemini CLI extension, bringing Google Kubernetes Engine directly into the Gemini CLI ecosystem while also functioning as an MCP server for other MCP clients. The extension injects GKE-specific context, tools, and tailored prompts so developers can use shorter, more natural language interactions and integrated slash commands to complete complex workflows. It simplifies common operations—like selecting models and accelerators or generating Kubernetes manifests for inference—while improving compatibility with Cloud Observability. The project is actively maintained with regular releases and community contributions.

Giles AI on Google Cloud: Transforming Medical Research

🚀 Giles AI migrated its healthcare-focused platform to Google Cloud to reduce latency, improve scalability, and accelerate developer velocity. Using Google Kubernetes Engine, Cloud Run, and Compute Engine, the company orchestrates complex clinical data flows and routes prompts through Vertex AI and Model Garden to remain model-agnostic. Data storage and extraction are handled with Cloud SQL, Cloud Storage, and Document AI, while Cloud Armor and Security Command Center bolster security and compliance. Early customer results include dramatic reductions in research time and improvements in response accuracy.