All news with the #kv cache tag
Wed, November 26, 2025
SageMaker HyperPod: Managed Tiered KV Cache Launch
⚡ Amazon SageMaker HyperPod now offers Managed Tiered KV Cache and Intelligent Routing to optimize LLM inference for long-context prompts and multi-turn conversations. The two-tier cache combines local CPU memory (L1) with disaggregated cluster storage (L2), with AWS-native tiered storage recommended and Redis available as an option, so previously computed key-value pairs can be reused instead of recomputed. Intelligent Routing directs requests using prefix-aware, KV-aware, or round-robin strategies. Built-in observability integrates with Amazon Managed Grafana, and the features are enabled via InferenceEndpointConfig or SageMaker JumpStart.
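To make the two-tier flow concrete, here is a minimal sketch of an L1/L2 lookup in Python. Everything in it is illustrative rather than the HyperPod implementation: the `TieredKVCache` class, the prefix-hash keying, and the `l2_store` client (anything with `get`/`set` methods, such as a redis-py connection) are assumptions.

```python
from collections import OrderedDict

class TieredKVCache:
    """Illustrative two-tier KV cache: hot entries in local CPU memory (L1),
    write-through to a disaggregated cluster store (L2)."""

    def __init__(self, l2_store, l1_max_entries=1024):
        self.l1 = OrderedDict()       # LRU over local CPU memory
        self.l2 = l2_store            # hypothetical client with get()/set()
        self.l1_max = l1_max_entries

    def get(self, prefix_hash):
        if prefix_hash in self.l1:            # L1 hit: cheapest reuse path
            self.l1.move_to_end(prefix_hash)
            return self.l1[prefix_hash]
        blob = self.l2.get(prefix_hash)       # L2 hit: fetch from cluster storage
        if blob is not None:
            self._put_l1(prefix_hash, blob)   # promote to L1 for future hits
        return blob                           # None: full prefill is required

    def put(self, prefix_hash, kv_blob):
        self._put_l1(prefix_hash, kv_blob)
        self.l2.set(prefix_hash, kv_blob)     # write-through to L2

    def _put_l1(self, key, blob):
        self.l1[key] = blob
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_max:
            # Evicting from L1 is safe: write-through keeps a copy in L2.
            self.l1.popitem(last=False)
```

Write-through is the simplest policy for a sketch like this: it keeps L1 eviction trivially safe at the cost of one L2 write per insertion.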
Fri, October 31, 2025
Choosing Google Cloud Managed Lustre for External KV Cache
🚀 This post explains how an external KV cache backed by Google Cloud Managed Lustre can accelerate transformer inference and lower costs by trading expensive prefill compute for I/O. In experiments with a 50K-token context and a ~75% cache-hit rate, Managed Lustre increased inference throughput by 75% and cut mean time-to-first-token by 44%. The analysis projects a 35% TCO reduction and up to ~43% fewer GPUs for the same workload, and the article closes with practical steps: provision Managed Lustre in the same zone as the inference GPUs, deploy an inference server that supports external KV caching (for example vLLM), enable O_DIRECT, and tune I/O parallelism.
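As a rough illustration of the O_DIRECT step, the sketch below reads a cached KV block from a Lustre mount while bypassing the OS page cache, so measured throughput reflects Lustre itself rather than local RAM. The mount path, hash-based file layout, and function names are assumptions for illustration; vLLM's own external-cache wiring is not shown.

```python
import hashlib
import mmap
import os

LUSTRE_MOUNT = "/mnt/lustre/kv-cache"  # hypothetical Managed Lustre mount point
BLOCK = 4096                           # O_DIRECT requires block-aligned offsets, sizes, buffers

def cache_path(prompt_prefix: str) -> str:
    # Illustrative layout: one file per cached prefix, keyed by its hash.
    return os.path.join(LUSTRE_MOUNT, hashlib.sha256(prompt_prefix.encode()).hexdigest())

def load_kv(prompt_prefix: str):
    """Return the cached KV bytes for a prefix, or None on a cache miss."""
    path = cache_path(prompt_prefix)
    try:
        size = os.path.getsize(path)
    except FileNotFoundError:
        return None                    # miss: caller falls back to full prefill
    if size == 0:
        return None
    aligned = -(-size // BLOCK) * BLOCK
    buf = mmap.mmap(-1, aligned)       # anonymous mmap is page-aligned, as O_DIRECT requires
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # Linux-only flag
    try:
        n = os.readv(fd, [buf])        # aligned offset (0) and aligned length
        return bytes(buf[:min(n, size)])
    finally:
        os.close(fd)
        buf.close()
```

Writes need the same alignment treatment, and I/O parallelism is typically tuned by issuing several such aligned reads concurrently rather than by enlarging a single request.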