
Wed, December 3, 2025

Amazon SageMaker HyperPod Adds Elastic Training at Scale

⚡ Amazon SageMaker HyperPod now supports elastic training, automatically scaling distributed training jobs up to absorb idle accelerators and scaling them down when higher‑priority workloads need the resources. This eliminates the manual cycle of halting a job, reconfiguring its parameters, and restarting distributed training, which previously demanded specialized engineering time. Organizations can start training with minimal resources and grow opportunistically, improving cluster utilization and reducing costs. Elastic training can be enabled with zero code changes for public models such as Llama and GPT OSS, and requires only lightweight configuration updates for custom architectures.