
All news with the #kv cache tag

Fri, October 31, 2025

Choosing Google Cloud Managed Lustre for External KV Cache

🚀 This post explains how an external KV cache backed by Google Cloud Managed Lustre can accelerate transformer inference and lower costs by replacing expensive prefill recomputation with I/O reads of previously computed key/value tensors. In experiments with a 50K-token context and a ~75% cache-hit rate, Managed Lustre increased inference throughput by 75% and cut mean time-to-first-token by 44%. The analysis projects a 35% TCO reduction and up to ~43% fewer GPUs for the same workload. The article summarizes the practical steps: provision Managed Lustre in the same zone as the GPU cluster, deploy an inference server that supports external caching (for example vLLM), enable O_DIRECT, and tune I/O parallelism.
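To make the trade-off concrete, here is a minimal, hypothetical sketch of the external-KV-cache pattern the summary describes: content-address a prompt prefix, look it up on a shared Lustre mount, and only run prefill on a miss. The mount path, `.npy` cache layout, and the `run_prefill` stub are illustrative assumptions, not the article's or vLLM's actual implementation.

```python
# Hypothetical sketch of an external KV cache on a shared Lustre mount.
# All names (mount path, cache layout, prefill stub) are illustrative assumptions.
import hashlib
import os
import numpy as np

CACHE_ROOT = "/mnt/lustre/kv-cache"   # assumed Managed Lustre mount point


def prefix_key(token_ids: list[int]) -> str:
    """Content-address the prompt prefix so identical prefixes share one cache entry."""
    return hashlib.sha256(np.asarray(token_ids, dtype=np.int32).tobytes()).hexdigest()


def load_kv(token_ids: list[int]) -> np.ndarray | None:
    """Return cached KV tensors for this prefix, or None on a cache miss."""
    path = os.path.join(CACHE_ROOT, prefix_key(token_ids) + ".npy")
    if not os.path.exists(path):
        return None
    # A production setup would open the file with O_DIRECT and issue parallel,
    # block-aligned reads; np.load is used here only to keep the sketch short.
    return np.load(path)


def save_kv(token_ids: list[int], kv: np.ndarray) -> None:
    """Persist KV tensors after prefill so later requests can skip recomputation."""
    os.makedirs(CACHE_ROOT, exist_ok=True)
    np.save(os.path.join(CACHE_ROOT, prefix_key(token_ids) + ".npy"), kv)


def run_prefill(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the model's expensive forward pass over the full prompt."""
    return np.random.rand(2, len(token_ids), 128).astype(np.float16)


def prefill_or_fetch(token_ids: list[int]) -> np.ndarray:
    """The core trade in the post: pay an I/O read instead of prefill compute."""
    kv = load_kv(token_ids)
    if kv is None:
        kv = run_prefill(token_ids)
        save_kv(token_ids, kv)
    return kv
```

In a real deployment this lookup is handled inside the inference server's caching layer rather than in application code; the sketch only illustrates why a high cache-hit rate shifts cost from GPU prefill to storage throughput.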

read more →