Ray Serve LLM on GKE: Major performance gains
Developers using Ray Serve for LLM inference on Google Kubernetes Engine (GKE) now get significantly better performance thanks to a joint effort with Anyscale. Three architectural changes (HAProxy integration for internal routing, a direct token streaming path, and a v2 Ray executor backend for vLLM) reduce overhead and latency. Benchmarks on A4 VMs with NVIDIA HGX B200 hardware show up to 5x higher throughput and 8x lower latency, while preserving Ray's developer-friendly features.