All news with #sagemaker hyperpod tag
Mon, September 15, 2025
Amazon SageMaker HyperPod: Slurm Health Agent Now GA
🩺 Amazon has announced general availability of the SageMaker HyperPod health monitoring agent for Slurm clusters. The agent runs continuously on GPU- and Trainium-based nodes, performing passive background checks to detect hardware faults (for example, unresponsive GPUs and NVLink errors) and automatically marking and replacing unhealthy nodes. It supports automatic reboots and coordinates with Slurm job auto-resume so that training continues from the last checkpoint, reducing manual intervention and downtime.
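To make the auto-resume behavior concrete, here is a minimal simulation of a training loop that rolls back to its last checkpoint whenever a node fault is detected. It is only a sketch: `run_with_auto_resume`, its parameters, and the fault model are hypothetical and do not represent the HyperPod agent or any Slurm API.

```python
def run_with_auto_resume(total_steps, checkpoint_every, fail_at_steps):
    """Simulate training that restarts from the last checkpoint after a fault.

    Hypothetical model: each entry in fail_at_steps is a step at which a
    hardware fault is detected once, the node is replaced, and the job resumes.
    Returns (steps_executed, node_replacements).
    """
    pending_faults = set(fail_at_steps)
    last_checkpoint = 0
    step = 0
    executed = 0
    replacements = 0
    while step < total_steps:
        step += 1
        executed += 1
        if step in pending_faults:
            # Fault detected: replace the node, then roll back to the checkpoint.
            pending_faults.discard(step)
            replacements += 1
            step = last_checkpoint
            continue
        if step % checkpoint_every == 0:
            last_checkpoint = step
    return executed, replacements


executed, replacements = run_with_auto_resume(
    total_steps=100, checkpoint_every=10, fail_at_steps=[37]
)
print(executed, replacements)  # 107 1
```

In this toy run, the fault at step 37 costs only the seven steps since the checkpoint at step 30, which is why the checkpoint interval bounds the work lost per failure.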
Mon, September 8, 2025
Managed Tiered Checkpointing for Amazon SageMaker HyperPod
⚡ Amazon Web Services has announced general availability of managed tiered checkpointing for Amazon SageMaker HyperPod, a hybrid checkpointing capability that caches frequent checkpoints in CPU memory and periodically persists them to Amazon S3 for durability. The approach reduces model recovery time and limits the training progress lost after a failure on large-scale clusters. It integrates with PyTorch Distributed Checkpoint (DCP) and is enabled via a CreateCluster/UpdateCluster API parameter; customers can use the sagemaker-checkpointing Python library to adopt it with minimal code changes. The feature is currently available for HyperPod clusters using the EKS orchestrator.
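The tiered idea can be illustrated with a small, self-contained sketch. It does not use the sagemaker-checkpointing library or PyTorch DCP: the `TieredCheckpointer` class, its methods, and the `persist_every` parameter are hypothetical, and a local temporary directory stands in for Amazon S3.

```python
import pickle
import tempfile
from pathlib import Path


class TieredCheckpointer:
    """Hypothetical two-tier checkpointer: in-memory tier plus durable tier."""

    def __init__(self, durable_dir, persist_every):
        self.durable_dir = Path(durable_dir)  # stand-in for an S3 prefix
        self.persist_every = persist_every
        self.memory = None  # latest (step, serialized state), fast tier

    def save(self, step, state):
        blob = pickle.dumps(state)
        self.memory = (step, blob)  # every checkpoint goes to memory
        if step % self.persist_every == 0:
            # Only every Nth checkpoint is written to the durable tier.
            path = self.durable_dir / f"step_{step:08d}.ckpt"
            path.write_bytes(blob)

    def drop_memory_tier(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self.memory = None

    def load_latest(self):
        """Prefer the fast tier; fall back to the newest durable checkpoint."""
        if self.memory is not None:
            step, blob = self.memory
            return step, pickle.loads(blob)
        files = sorted(self.durable_dir.glob("step_*.ckpt"))
        if not files:
            return None
        latest = files[-1]
        return int(latest.stem.split("_")[1]), pickle.loads(latest.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    ckpt = TieredCheckpointer(tmp, persist_every=10)
    for step in range(1, 26):
        ckpt.save(step, {"step": step})
    fast_step, _ = ckpt.load_latest()
    ckpt.drop_memory_tier()
    durable_step, _ = ckpt.load_latest()

print(fast_step, durable_step)  # 25 20
```

When the memory tier survives, recovery resumes from step 25; when it is lost, the fallback is the last durable checkpoint at step 20. That trade-off between frequent cheap checkpoints and less frequent durable ones is the one the announcement describes.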