
All news tagged #vllm

Fri, November 14, 2025

ShadowMQ Deserialization Flaws in Major AI Inference Engines

⚠️ Oligo Security researcher Avi Lumelsky disclosed a widespread insecure-deserialization pattern, dubbed ShadowMQ, that affects major AI inference engines including vLLM, NVIDIA TensorRT-LLM, Microsoft Sarathi-Serve, Modular Max Server, and SGLang. The root cause is using ZeroMQ's recv_pyobj() to deserialize network input with Python's pickle, permitting remote arbitrary code execution. Patch status varies: some projects have fixed the issue while others are only partially patched or remain unpatched; mitigations include applying updates, removing exposed ZMQ sockets, and auditing code for unsafe deserialization.
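
A minimal sketch of the vulnerable pattern and a safer alternative, assuming a plain pyzmq REP socket; the endpoint and message shape below are hypothetical and not taken from any particular engine:

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://0.0.0.0:5555")  # hypothetical network-exposed ZMQ socket

# UNSAFE: recv_pyobj() calls pickle.loads() on untrusted bytes, so a crafted
# payload can execute arbitrary code on the server.
# request = sock.recv_pyobj()

# SAFER: accept only a structured, non-executable encoding such as JSON and
# validate the fields you expect before acting on them.
request = sock.recv_json()
if not isinstance(request, dict) or "prompt" not in request:
    sock.send_json({"error": "malformed request"})
else:
    sock.send_json({"ok": True, "echo": request["prompt"]})
```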

read more →

Fri, November 7, 2025

Tiered KV Cache Extends GPU HBM to Boost LLM Performance on GKE

🚀 LMCache implements a node-local, tiered KV Cache on GKE to extend the GPU HBM-backed Key-Value store into CPU RAM and local SSD, increasing effective cache capacity and hit ratio. In benchmarks using Llama-3.3-70B-Instruct on an A3 Mega instance (8×nvidia-h100-mega-80gb), configurations that added RAM and SSD tiers reduced time-to-first-token and materially increased token throughput for long system prompts. The results demonstrate a practical approach to scaling context windows while balancing cost and latency on GKE.
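
The tiered lookup order is the key idea: check GPU HBM first, then CPU RAM, then local SSD, and only recompute prefill on a full miss. The sketch below illustrates that order with hypothetical class and method names; it is not LMCache's API.

```python
import os


class TieredKVCache:
    """Conceptual node-local tiered KV cache: HBM -> CPU RAM -> local SSD."""

    def __init__(self, disk_dir: str):
        self.hbm = {}             # hottest entries; smallest capacity (GPU HBM in practice)
        self.ram = {}             # warm entries spilled out of HBM
        self.disk_dir = disk_dir  # cold entries spilled to local SSD
        os.makedirs(disk_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.disk_dir, f"{key}.bin")

    def get(self, key: str) -> bytes | None:
        if key in self.hbm:                   # fastest tier
            return self.hbm[key]
        if key in self.ram:                   # promote a warm entry back to HBM
            self.hbm[key] = self.ram.pop(key)
            return self.hbm[key]
        path = self._path(key)
        if os.path.exists(path):              # slowest tier, still much cheaper
            with open(path, "rb") as f:       # than recomputing a long prefill
                value = f.read()
            self.ram[key] = value
            return value
        return None                           # full miss: prefill must be recomputed

    def put(self, key: str, value: bytes) -> None:
        self.hbm[key] = value                 # eviction/spill-down policy omitted
```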

read more →

Mon, October 20, 2025

AI Hypercomputer Update: vLLM on TPUs and Tooling Advances

🔧 Google Cloud’s Q3 AI Hypercomputer update highlights inference improvements and expanded tooling to accelerate model serving and diagnostics. The release integrates vLLM with Cloud TPUs via the new tpu-inference plugin, unifying JAX and PyTorch runtimes and boosting TPU inference for models such as Gemma, Llama, and Qwen. Additional launches include improved XProf profiling and Cloud Diagnostics XProf, an AI inference recipe for NVIDIA Dynamo, NVIDIA NeMo RL recipes, and GA of the GKE Inference Gateway and Quickstart to help optimize latency and cost.
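
For context, a minimal offline-inference sketch using vLLM's Python API; it assumes the tpu-inference plugin is installed on a Cloud TPU VM so vLLM selects the TPU backend automatically, and the model name and parallelism below are illustrative rather than prescribed by the announcement:

```python
from vllm import LLM, SamplingParams

# Illustrative model and parallelism; any supported Gemma/Llama/Qwen checkpoint works.
llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=4,  # match the number of TPU chips on the host
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```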

read more →

Wed, September 10, 2025

GKE Inference Gateway and Quickstart Achieve GA Status

🚀 GKE Inference Gateway and GKE Inference Quickstart are now generally available, bringing production-ready inference features built on AI Hypercomputer. New capabilities include prefix-aware load balancing, disaggregated serving, vLLM support on TPUs (including Ironwood TPUs), and model streaming with Anywhere Cache to cut model load times. These features target faster time-to-first-token and time-per-output-token, higher throughput, and lower inference costs, while Quickstart offers data-driven accelerator and configuration recommendations.
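
As a rough illustration of the prefix-aware idea (not the Gateway's actual algorithm), routing on a hash of the shared prompt prefix keeps requests with the same system prompt on the same replica so its KV cache can be reused; the backend names below are hypothetical:

```python
import hashlib

BACKENDS = ["vllm-replica-0", "vllm-replica-1", "vllm-replica-2"]  # hypothetical pool
PREFIX_CHARS = 512  # route on the first N characters, e.g. a shared system prompt


def pick_backend(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]


system_prompt = "You are a helpful support agent for ExampleCo.\n"
print(pick_backend(system_prompt + "How do I reset my password?"))
print(pick_backend(system_prompt + "What is your refund policy?"))  # same replica
```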

read more →

Mon, August 25, 2025

vLLM Performance Tuning for xPU Inference Configs Guide

⚙️ This guide from Google Cloud authors Eric Hanley and Brittany Rockwell explains how to tune vLLM deployments for xPU inference, covering accelerator selection, memory sizing, configuration, and benchmarking. It shows how to gather workload parameters, estimate HBM/VRAM needs (example: gemma-3-27b-it ≈57 GB), and run vLLM’s auto_tune to find the gpu_memory_utilization setting that delivers the best throughput. The post compares GPU and TPU options and includes practical troubleshooting tips, cost analyses, and resources for reproducing the benchmarks and HBM calculations.
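
The ≈57 GB figure is easy to sanity-check with a back-of-envelope estimate: bf16 weights plus a few GB for KV cache and activations. The sketch below is only that rough arithmetic, not the guide's exact calculator, and the 3 GB overhead margin is an assumption:

```python
def estimate_hbm_gb(num_params_billion: float, bytes_per_param: int = 2,
                    kv_cache_and_overhead_gb: float = 3.0) -> float:
    """Rough serving-memory estimate: bf16 weights plus a KV-cache/activation margin."""
    weights_gb = num_params_billion * 1e9 * bytes_per_param / 1e9
    return weights_gb + kv_cache_and_overhead_gb


# gemma-3-27b-it: ~27B params * 2 bytes ≈ 54 GB of weights; a few GB of
# KV cache/overhead (assumed) lands near the ≈57 GB quoted in the post.
print(f"{estimate_hbm_gb(27):.0f} GB")
```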

read more →