Amazon SageMaker HyperPod Adds RIG Observability for Training
🔍 Amazon SageMaker HyperPod now provides integrated observability for Restricted Instance Groups (RIG), giving teams that train foundation models with Nova Forge a unified view of compute resources and training workloads. A pre-configured Amazon Managed Grafana dashboard, backed by Amazon Managed Service for Prometheus, aggregates metrics from four exporters and covers GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, the network fabric, and Kubernetes state. It also surfaces curated logs, including epoch progress, step-level logs, pipeline errors, and Python tracebacks. Observability is enabled automatically for new RIG clusters and can be turned on for existing clusters from the HyperPod console. It is available in all Regions where SageMaker HyperPod RIG is supported.
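
Because the dashboard is backed by Amazon Managed Service for Prometheus, the same metrics can in principle be read through the standard Prometheus HTTP query API. The sketch below only builds the instant-query URL and makes no network call. Several things in it are assumptions rather than details from the announcement: the workspace ID `ws-EXAMPLE` is a placeholder, the helper `build_amp_query_url` is hypothetical, and the metric name `DCGM_FI_DEV_GPU_UTIL` assumes an NVIDIA DCGM-style exporter is among the four, which the announcement does not state. Whether a RIG cluster's workspace can be queried directly is also not confirmed here.

```python
from urllib.parse import urlencode

def build_amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for an AMP workspace.

    Hypothetical helper for illustration. It only formats the URL;
    a real request would also need AWS SigV4 signing (service "aps").
    """
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )
    # urlencode percent-encodes the PromQL expression for the query string.
    return f"{base}?{urlencode({'query': promql})}"

# Assumed metric name (NVIDIA DCGM exporter convention); the announcement
# does not list the exporters or their metric names.
GPU_UTIL_QUERY = "avg(DCGM_FI_DEV_GPU_UTIL)"

# "ws-EXAMPLE" is a placeholder workspace ID, not a real one.
url = build_amp_query_url("us-east-1", "ws-EXAMPLE", GPU_UTIL_QUERY)
print(url)
```

In practice most teams would read these values from the pre-configured Grafana dashboard. A direct query like this is mainly useful for scripted checks or custom alerting.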
