High-Performance LLMs on Cloudflare Workers AI Platform
🚀 Cloudflare details the optimizations that let Workers AI serve extra-large open-source LLMs, notably making Kimi K2.5 three times faster and expanding the model catalog. The post explains hardware tuning, prefill–decode disaggregation, token-aware load balancing, and prompt caching via an x-session-affinity header to improve throughput and tail latency. It also covers KV-cache sharing with Mooncake, speculative decoding with NVIDIA EAGLE-3, and Infire, Cloudflare's Rust-based inference engine built for multi-GPU, low-memory, fast-cold-start inference.
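To make the session-affinity idea concrete, here is a minimal sketch of what client-side usage might look like: reusing one session ID across turns of a conversation so requests land on the same backend and hit a warm prompt cache. Only the `x-session-affinity` header name comes from the post; the endpoint URL, model name, and `build_request` helper are illustrative assumptions, not Cloudflare's documented API.

```python
import uuid

def build_request(prompt: str, session_id: str) -> dict:
    """Assemble an illustrative Workers AI chat request (URL/model are placeholders)."""
    return {
        "url": "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>"
               "/ai/run/@cf/meta/llama-3.1-8b-instruct",  # hypothetical model path
        "headers": {
            "Authorization": "Bearer <API_TOKEN>",
            # Same value on every turn -> routed to the same backend,
            # so previously computed prompt KV-cache entries can be reused.
            "x-session-affinity": session_id,
        },
        "json": {"messages": [{"role": "user", "content": prompt}]},
    }

# One ID per conversation; every turn reuses it.
session = str(uuid.uuid4())
first = build_request("Hello!", session)
second = build_request("Tell me more.", session)
assert first["headers"]["x-session-affinity"] == second["headers"]["x-session-affinity"]
```

The design point is that affinity trades a little load-balancing freedom for cache locality: a long shared prefix (system prompt plus conversation history) only needs to be prefetched into the KV cache once per backend rather than once per request.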
