
All news with the #tpu tag

Thu, November 13, 2025

Google Cloud expands Hugging Face support for AI developers

🤝 Google Cloud and Hugging Face are deepening their partnership to speed developer workflows and strengthen enterprise model deployments. A new gateway will cache Hugging Face models and datasets on Google Cloud so downloads take minutes, not hours, across Vertex AI and Google Kubernetes Engine. The collaboration adds native TPU support for open models and integrates Google Cloud’s threat intelligence and Mandiant scanning for models served through Vertex AI.
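As a point of reference for the download step the gateway accelerates, here is a minimal sketch using the standard huggingface_hub client; the model name is illustrative, and transparent caching on Vertex AI and GKE is assumed rather than confirmed here:

```python
# Minimal sketch: a standard Hugging Face Hub download. On Vertex AI or GKE
# the new gateway is described as caching these artifacts on Google Cloud,
# so the call itself is assumed to need no changes. Repo id is illustrative.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="google/gemma-2-2b")  # illustrative model
print(f"Model files available at: {local_dir}")
```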

read more →

Tue, November 11, 2025

Lightricks Scales Video Diffusion Training with JAX

🚀 Lightricks rewrote its training stack in JAX to scale high-performance video diffusion models on TPUs after hitting limits with PyTorch/XLA. The migration enabled reliable sharding, fixed FlashAttention and data-loading issues, and delivered linear scaling across small and large TPU pods. These improvements translated to ~40% more training steps per day, faster iteration, and doubled team productivity. Their stack leverages Flax, Optax, Orbax, and the MaxText blueprint for robust, testable, and efficient large-scale training.
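As a flavor of the sharding work described above (not Lightricks' actual code), here is a minimal JAX sketch that shards a batch across a named device mesh, the same primitives Flax and MaxText build on:

```python
# Minimal JAX sharding sketch (illustrative only): shard the batch axis of
# an input across all local devices on a 1-D mesh and run a jitted matmul.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One mesh axis ("data") spanning every available device (TPU chips on a slice).
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("data",))

# Shard activations along the batch dimension; replicate the weights.
x = jax.device_put(jnp.ones((8 * jax.device_count(), 512)),
                   NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 256)), NamedSharding(mesh, P(None, None)))

@jax.jit
def forward(x, w):
    return x @ w

y = forward(x, w)
print(y.shape, y.sharding)  # output stays sharded along the batch axis
```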

read more →

Mon, November 10, 2025

A Full-Stack Approach to Scaling RL for LLMs on GKE

🚀 Google Cloud describes a full-stack solution for running high-scale Reinforcement Learning (RL) with LLMs, combining custom TPU hardware, NVIDIA GPUs, and optimized software libraries. The approach addresses RL's hybrid demands—reducing sampler latency, easing memory contention across actor/critic/reward models, and accelerating weight copying—by co-designing hardware, storage (Managed Lustre, Cloud Storage), and orchestration on GKE. The blog emphasizes open-source contributions (vLLM, llm-d, MaxText, Tunix) and integrations with Ray and NeMo RL recipes to improve portability and developer productivity. It also highlights mega-scale orchestration and multi-cluster strategies to run production RL jobs at tens of thousands of nodes.
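To make those hybrid demands concrete, here is a schematic, framework-agnostic sketch of one RL step for an LLM; the sampler, reward, and update functions below are stand-ins, not the vLLM/MaxText/Tunix integration the post describes:

```python
# Schematic sketch of one RL step for an LLM (stand-in functions only;
# not the actual Google Cloud stack). It shows where sampler latency,
# reward scoring, and weight copying enter the loop.
from dataclasses import dataclass, field

@dataclass
class PolicyWeights:
    version: int = 0
    params: dict = field(default_factory=dict)

def sample_rollouts(prompts, weights):
    # In production this is a low-latency inference server (e.g. vLLM).
    return [p + " <completion>" for p in prompts]

def score(rollouts):
    # Reward-model stand-in: a trivial length-based score.
    return [float(len(r)) for r in rollouts]

def update_policy(weights, rollouts, rewards):
    # Trainer stand-in: in practice a sharded policy update on TPUs or GPUs.
    return PolicyWeights(version=weights.version + 1, params=weights.params)

weights = PolicyWeights()
for step in range(3):
    rollouts = sample_rollouts(["Explain TPUs"], weights)  # sampler latency
    rewards = score(rollouts)                              # reward pass
    weights = update_policy(weights, rollouts, rewards)    # actor update
    # Copying updated weights back to the sampler is the transfer the post optimizes.
    print(f"step {step}: policy v{weights.version}, reward {rewards[0]:.0f}")
```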

read more →

Thu, November 6, 2025

Inside Ironwood: Google's Co‑Designed TPU AI Stack

🚀 The Ironwood TPU stack is a co‑designed hardware and software platform that scales from massive pre‑training to low‑latency inference. It combines dense MXU compute, ample HBM3E memory, and a high‑bandwidth ICI/OCS interconnect with compiler-driven optimizations in XLA and native support for JAX and PyTorch. Pallas and Mosaic enable hand‑tuned kernels for peak performance, while observability and orchestration tools address resilience and efficiency across pods and superpods.
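As a taste of the kernel-authoring layer, here is a minimal Pallas sketch (an element-wise add, far simpler than the hand-tuned kernels the post refers to):

```python
# Minimal Pallas kernel sketch (illustrative). The kernel body reads the
# input refs and writes the output ref; pallas_call lowers it via Mosaic/XLA.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(8.0)
y = jnp.full((8,), 2.0)
print(add(x, y))  # [2. 3. 4. 5. 6. 7. 8. 9.]
```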

read more →

Wed, November 5, 2025

Building Software Sustainably with AI and Efficiency

🌱 Google presents a Sustainable by Design approach to reduce the environmental footprint of AI and software. The post highlights projects like Green Light and Project Contrails, improvements in hardware efficiency such as Ironwood TPUs, and a fleet-wide Power Usage Effectiveness of 1.09. It introduces the 4Ms—Machine, Model, Mechanisation, Map—to guide infrastructure and development choices. The emphasis is on embedding efficiency across the software lifecycle to cut energy use, costs, and water consumption.
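For reference, PUE is total facility energy divided by the energy delivered to IT equipment, so a fleet-wide 1.09 means roughly 0.09 units of overhead energy (cooling, power conversion) for every unit consumed by compute.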

read more →

Mon, November 3, 2025

Ray on TPUs with GKE: Native, Lower-Friction Integration

🚀 Google Cloud and Anyscale have enhanced the Ray experience on Cloud TPUs with GKE to reduce setup complexity and improve performance. The new ray.util.tpu library, with a SlicePlacementGroup and a label_selector API, automatically reserves co-located TPU slices and preserves SPMD topology to avoid resource fragmentation. Ray Train and Ray Serve gain expanded TPU support, including alpha JAX training, while TPU metrics and libtpu logs now appear in the Ray Dashboard for faster troubleshooting and easier migration between GPUs and TPUs.
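For orientation, here is a minimal sketch of the generic Ray placement-group pattern that the new slice-aware API builds on; the exact ray.util.tpu and SlicePlacementGroup signatures are not reproduced here:

```python
# Generic Ray placement-group sketch (illustrative; not the new
# ray.util.tpu / SlicePlacementGroup API, whose signatures differ).
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve four co-located bundles, each requesting one TPU chip plus CPUs.
pg = placement_group([{"TPU": 1, "CPU": 8}] * 4, strategy="PACK")
ray.get(pg.ready())

@ray.remote(resources={"TPU": 1})
def worker(rank: int) -> int:
    # Each task is scheduled onto a host inside the reserved group.
    return rank

results = ray.get([
    worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(i)
    for i in range(4)
])
print(results)
```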

read more →

Wed, September 10, 2025

GKE Inference Gateway and Quickstart Achieve GA Status

🚀 GKE Inference Gateway and GKE Inference Quickstart are now generally available, bringing production-ready inference features built on AI Hypercomputer. New capabilities include prefix-aware load balancing, disaggregated serving, vLLM support on TPUs including Ironwood, and model streaming with Anywhere Cache to cut model load times. These features target lower time-to-first-token and time-per-output-token, higher throughput, and lower inference costs, while Quickstart offers data-driven accelerator and configuration recommendations.
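As a point of reference for the serving engine named above, here is a minimal offline vLLM sketch; the model name is illustrative, and none of the gateway, TPU, or Anywhere Cache configuration from the post is shown:

```python
# Minimal generic vLLM sketch (offline generate API; illustrative model,
# no gateway/TPU-specific configuration).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)
for out in llm.generate(["What does an inference gateway do?"], params):
    print(out.outputs[0].text)
```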

read more →