
All news with the #pytorch tag

Wed, December 3, 2025

Amazon SageMaker HyperPod Adds Checkpointless Training

🚀 Amazon SageMaker HyperPod now supports checkpointless training, a capability that eliminates checkpoint-based, job-level restarts for distributed model training. Checkpointless training preserves the forward training state across the cluster, automatically swaps out failed nodes, and uses peer-to-peer state transfer to resume progress, cutting recovery time from hours to minutes. The feature can deliver up to 95% training goodput at very large scale, is available in all AWS Regions where HyperPod is offered, and can be enabled with zero code changes for popular recipes or with minimal PyTorch modifications for custom models.


Wed, December 3, 2025

Picklescan Flaws Enable Malicious PyTorch Model Execution

⚠️ Picklescan, a Python pickle scanner, has three critical flaws that can be abused to execute arbitrary code when loading untrusted PyTorch models. Discovered by JFrog researchers, the issues — a file-extension bypass (CVE-2025-10155), a ZIP CRC bypass (CVE-2025-10156) and an unsafe-globals bypass (CVE-2025-10157) — let attackers present malicious models as safe. The vulnerabilities were responsibly disclosed on June 29, 2025, and fixed in Picklescan 0.0.31 on September 9; users should upgrade and review model-loading practices and downstream automation that accepts third-party models.
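The underlying risk these flaws expose is a property of the pickle format itself: deserializing a pickle can invoke arbitrary callables. A minimal standard-library sketch (the `EvilPayload` class is illustrative, not taken from the advisory) shows why a scanner bypass translates directly into code execution:

```python
import pickle

# A class whose __reduce__ tells pickle to call eval("40 + 2") when
# the payload is deserialized -- standing in for a real attacker
# payload that would invoke os.system or similar instead.
class EvilPayload:
    def __reduce__(self):
        return (eval, ("40 + 2",))

payload = pickle.dumps(EvilPayload())

# Unpickling executes the embedded call: no EvilPayload instance
# comes back, only the result of the attacker-chosen expression.
result = pickle.loads(payload)
print(result)  # 42 -- arbitrary code ran during load
```

This is why a PyTorch model file, which is a zip archive containing pickles, must never be loaded from an untrusted source on the strength of a scanner verdict alone.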


Tue, December 2, 2025

Critical PickleScan Zero-Days Threaten AI Model Supply

🔒 Three critical zero-day vulnerabilities in PickleScan, a widely used scanner for Python pickle files and PyTorch models, could enable attackers to bypass model-scanning safeguards and distribute malicious machine learning models undetected. The JFrog Security Research Team published an advisory on December 2 after confirming all three flaws carry a CVSS score of 9.3. JFrog has advised upgrading to PickleScan 0.0.31, adopting layered defenses, and shifting to safer formats such as safetensors.
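One layered defense can be built on pickle's documented extension point: subclassing `pickle.Unpickler` and overriding `find_class` to enforce an allow list of globals. The sketch below uses only the standard library; the allowed set and helper names are illustrative, not part of PickleScan or PyTorch:

```python
import io
import pickle

# Illustrative allow list: only these (module, name) globals may be
# resolved during unpickling; everything else is rejected.
_ALLOWED_GLOBALS = {
    ("builtins", "list"),
    ("builtins", "dict"),
    ("collections", "OrderedDict"),
}

class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) in _ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"blocked global during unpickling: {module}.{name}"
        )

def safe_loads(data: bytes):
    """Deserialize, resolving globals only from the allow list."""
    return AllowListUnpickler(io.BytesIO(data)).load()

# Plain containers deserialize normally...
print(safe_loads(pickle.dumps({"weights": [1, 2, 3]})))

# ...but a payload referencing any global outside the allow list
# raises instead of executing it.
try:
    safe_loads(pickle.dumps(print))  # 'builtins.print' is not allowed
except pickle.UnpicklingError as exc:
    print(exc)
```

A deny-by-default `find_class` blocks the class of payload an unsafe-globals bypass smuggles past a scanner; preferring a data-only format such as safetensors avoids pickle's code-execution surface entirely.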


Thu, November 6, 2025

Inside Ironwood: Google's Co‑Designed TPU AI Stack

🚀 The Ironwood TPU stack is a co‑designed hardware and software platform that scales from massive pre‑training to low‑latency inference. It combines dense MXU compute, ample HBM3E memory, and a high‑bandwidth ICI/OCS interconnect with compiler-driven optimizations in XLA and native support for JAX and PyTorch. Pallas and Mosaic enable hand‑tuned kernels for peak performance, while observability and orchestration tools address resilience and efficiency across pods and superpods.
