All news with #sagemaker hyperpod tag

Wed, December 3, 2025

Amazon SageMaker HyperPod Adds Checkpointless Training

🚀 Amazon SageMaker HyperPod now supports checkpointless training, a foundational capability that removes the need for checkpoint-based, job-level restarts in distributed model training. When a node fails, checkpointless training preserves the training state across the cluster, automatically swaps out the failed node, and resumes progress through peer-to-peer state transfer, reducing recovery time from hours to minutes. The feature can deliver up to 95% training goodput at very large scale, is available in all Regions where HyperPod runs, and can be enabled with zero code changes for popular recipes or with minimal PyTorch modifications for custom models.

Wed, December 3, 2025

Amazon SageMaker HyperPod Adds Elastic Training at Scale

⚡ Amazon SageMaker HyperPod now supports elastic training, which automatically scales distributed training jobs up to absorb idle accelerators and contracts them when higher-priority workloads need resources. This eliminates the manual cycle of halting jobs, reconfiguring parameters, and restarting distributed training, which previously demanded specialized engineering time. Organizations can start training with minimal resources and grow opportunistically, improving cluster utilization and reducing costs. Elastic training can be enabled with zero code changes for public models like Llama and GPT OSS, and requires only lightweight configuration updates for custom architectures.

Wed, November 26, 2025

SageMaker HyperPod: Managed Tiered KV Cache Launch

⚡ Amazon SageMaker HyperPod now offers Managed Tiered KV Cache and Intelligent Routing to optimize LLM inference for long-context prompts and multi-turn conversations. The two-tier cache combines local CPU memory (L1) with disaggregated cluster storage (L2) to reuse computed key-value pairs and reduce recomputation; for L2, AWS-native tiered storage is recommended and Redis is optional. Intelligent Routing directs requests using prefix-aware, KV-aware, or round-robin strategies. Built-in observability integrates with Amazon Managed Grafana, and deployment is enabled via InferenceEndpointConfig or SageMaker JumpStart.
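
As a rough illustration of the two-tier idea only, the sketch below models an L1 cache private to each node and an L2 store shared across nodes, with L2 hits promoted into L1. The class name `TwoTierCache`, the write-through policy, and the use of plain dictionaries are all assumptions made for this example; none of it reflects how the managed feature is actually implemented.

```python
from typing import Dict, Optional, Tuple

Prefix = Tuple[int, ...]  # a prompt prefix, represented here as token ids


class TwoTierCache:
    """Toy two-tier lookup: per-node L1 in front of a shared L2 (hypothetical)."""

    def __init__(self, shared_l2: Dict[Prefix, str]) -> None:
        self.l1: Dict[Prefix, str] = {}  # stand-in for local CPU memory
        self.l2 = shared_l2              # stand-in for disaggregated storage

    def put(self, prefix: Prefix, value: str) -> None:
        # Assumed write-through policy: store in both tiers.
        self.l1[prefix] = value
        self.l2[prefix] = value

    def get(self, prefix: Prefix) -> Tuple[Optional[str], str]:
        """Return (value, tier), where tier is 'l1', 'l2', or 'miss'."""
        if prefix in self.l1:
            return self.l1[prefix], "l1"
        if prefix in self.l2:
            value = self.l2[prefix]
            self.l1[prefix] = value  # promote so the next lookup is local
            return value, "l2"
        return None, "miss"  # caller would recompute the key-value pairs


shared: Dict[Prefix, str] = {}
node_a = TwoTierCache(shared)
node_b = TwoTierCache(shared)
node_a.put((1, 2, 3), "kv-for-prefix")
first = node_b.get((1, 2, 3))   # served from the shared tier
second = node_b.get((1, 2, 3))  # served locally after promotion
```

Here `node_b` never computed the entry itself, which is the reuse the two-tier design is meant to enable across a multi-turn conversation that lands on different nodes.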

Wed, November 26, 2025

Amazon SageMaker HyperPod: Programmatic Node Recovery

🚀 Amazon SageMaker HyperPod now offers generally available programmatic APIs that let administrators reboot or replace cluster nodes at scale. The BatchRebootClusterNodes and BatchReplaceClusterNodes APIs provide an orchestrator-agnostic way to recover unresponsive or degraded nodes for both Slurm and EKS clusters. Each API supports batch operations for up to 25 instances and complements existing orchestrator-specific workflows. The capabilities are currently available in US East (Ohio), Asia Pacific (Mumbai), and Asia Pacific (Tokyo) and are accessible via the AWS CLI, SDKs, or API calls.
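
If the 25-instance figure is read as a per-request limit (an assumption), recovering a larger set of nodes means splitting the node list into batches first. The sketch below shows only that splitting step. `recover_nodes` and its `submit_batch` callback are hypothetical names for this example; the callback stands in for whichever CLI or SDK call actually invokes BatchRebootClusterNodes or BatchReplaceClusterNodes, whose request shape is not shown here.

```python
from typing import Callable, Iterator, List, Sequence

MAX_BATCH = 25  # batch size stated in the announcement


def chunked(node_ids: Sequence[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of node_ids, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(node_ids), size):
        yield list(node_ids[start:start + size])


def recover_nodes(node_ids: Sequence[str],
                  submit_batch: Callable[[List[str]], None]) -> int:
    """Call submit_batch once per batch and return the number of calls made."""
    calls = 0
    for batch in chunked(node_ids):
        submit_batch(batch)  # hypothetical hook for the real API call
        calls += 1
    return calls


# Usage with a recording callback instead of a real API client.
sent: List[List[str]] = []
num_calls = recover_nodes([f"i-{n:04d}" for n in range(60)], sent.append)
```

Passing the API call in as a callback keeps the batching logic testable without AWS credentials and lets the same helper serve both the reboot and the replace operation.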

Mon, November 24, 2025

Amazon SageMaker HyperPod Adds Spot Instance Support

⚡ Amazon SageMaker HyperPod now supports Spot Instances, enabling customers to reduce GPU compute costs by up to 90% compared with On-Demand Instances. The integration is available on HyperPod EKS clusters and works with Karpenter for intelligent autoscaling, automatic Spot capacity discovery, and interruption handling. You can enable Spot when creating instance groups via the CreateCluster API or the AWS Console, and the feature supports all HyperPod instance types across available Regions.

Mon, September 15, 2025

Amazon SageMaker HyperPod: Slurm Health Agent Now GA

🩺 Amazon announces general availability of the SageMaker HyperPod health monitoring agent for Slurm clusters. The agent runs continuously on GPU- and Trainium-based nodes to perform passive background checks, detect hardware faults (for example, unresponsive GPUs and NVLink errors), and mark and replace unhealthy nodes automatically. It supports automatic reboots and coordinates with Slurm job auto-resume so training can continue from the last checkpoint, reducing manual intervention and downtime.

Mon, September 8, 2025

Managed Tiered Checkpointing for Amazon SageMaker HyperPod

⚡ Amazon Web Services has announced the general availability of managed tiered checkpointing for Amazon SageMaker HyperPod, a hybrid checkpointing capability that caches frequent checkpoints in CPU memory and periodically persists them to Amazon S3 for durability. The approach reduces model recovery time and minimizes training progress loss on large-scale clusters. It integrates with PyTorch Distributed Checkpoint (DCP) and is enabled via a CreateCluster/UpdateCluster API parameter; customers can use the sagemaker-checkpointing Python library to adopt it with minimal code changes. The capability is currently available for HyperPod clusters that use the EKS orchestrator.
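
The hybrid scheme can be pictured with a toy model: every checkpoint goes to a fast in-memory tier, and every Nth one is also copied to a durable tier. The sketch below is an illustration under stated assumptions only. `TieredCheckpointer`, `persist_every`, and the dictionary standing in for Amazon S3 are hypothetical, and the code does not use the sagemaker-checkpointing library or PyTorch DCP.

```python
from typing import Dict, Optional, Tuple

State = Dict[str, int]  # stand-in for model and optimizer state


class TieredCheckpointer:
    """Toy tiered checkpointing: frequent in-memory saves, periodic durable saves."""

    def __init__(self, persist_every: int) -> None:
        if persist_every < 1:
            raise ValueError("persist_every must be positive")
        self.persist_every = persist_every
        self.memory: Optional[Tuple[int, State]] = None  # latest fast checkpoint
        self.durable: Dict[int, State] = {}               # stand-in for S3

    def save(self, step: int, state: State) -> None:
        self.memory = (step, dict(state))
        if step % self.persist_every == 0:
            self.durable[step] = dict(state)

    def restore(self, memory_lost: bool) -> Optional[Tuple[int, State]]:
        """Prefer the in-memory copy; fall back to the newest durable one."""
        if not memory_lost and self.memory is not None:
            return self.memory
        if self.durable:
            step = max(self.durable)
            return step, self.durable[step]
        return None


ckpt = TieredCheckpointer(persist_every=10)
for step in range(1, 26):
    ckpt.save(step, {"weights_version": step})

fast = ckpt.restore(memory_lost=False)  # resumes from step 25
slow = ckpt.restore(memory_lost=True)   # falls back to step 20
```

In this toy run, losing the memory tier costs five steps of progress rather than the whole job, which is the trade-off between checkpoint frequency and durability that the tiered approach targets.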
