Managed Tiered Checkpointing for Amazon SageMaker HyperPod
⚡ Amazon Web Services has announced general availability of managed tiered checkpointing for Amazon SageMaker HyperPod, a hybrid checkpointing capability that caches frequent checkpoints in CPU memory and periodically persists them to Amazon S3 for durability. The approach reduces model recovery time and minimizes training progress loss on large-scale clusters. It integrates with PyTorch Distributed Checkpoint (DCP) and is enabled via a CreateCluster/UpdateCluster API parameter; customers can use the sagemaker-checkpointing Python library to adopt it with minimal code changes. Currently available for HyperPod clusters using the EKS orchestrator.
