All news with the #inference security tag

Mon, November 17, 2025

Microsoft and NVIDIA Enable Real-Time AI Defenses at Scale

🔒 Microsoft and NVIDIA describe a joint effort to convert adversarial learning research into production-grade, real-time cyber defenses. They transitioned transformer-based classifiers from CPU to GPU inference—using Triton and a TensorRT-compiled engine—to dramatically reduce latency and increase throughput for live traffic inspection. Key engineering advances include fused CUDA kernels and a domain-specific tokenizer, enabling low-latency, high-accuracy detection of adversarial payloads in inline production settings.
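
As a rough sketch of the serving pattern described (not Microsoft's actual setup; the model name, tensor names, and shapes below are invented), a client might query a TensorRT-compiled classifier through Triton's HTTP API like this:

```python
# Minimal sketch: querying a TensorRT-backed classifier served by Triton
# over HTTP. The model name ("payload_classifier"), tensor names, and
# shapes are hypothetical; the article does not publish Microsoft's config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Assume the domain-specific tokenizer already produced fixed-length IDs.
token_ids = np.zeros((1, 256), dtype=np.int64)  # hypothetical shape

inp = httpclient.InferInput("input_ids", list(token_ids.shape), "INT64")
inp.set_data_from_numpy(token_ids)

result = client.infer(model_name="payload_classifier", inputs=[inp])
print(result.as_numpy("scores"))  # hypothetical output tensor name
```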

read more →

Mon, November 10, 2025

Whisper Leak side channel exposes topics in encrypted AI chats

🔎 Microsoft researchers disclosed a new side-channel attack called Whisper Leak that can infer the topic of encrypted conversations with language models by observing network metadata such as packet sizes and timings. The technique exploits streaming LLM responses that emit tokens incrementally, leaking size and timing patterns even under TLS. Vendors including OpenAI, Microsoft Azure, and Mistral implemented mitigations such as random-length padding and obfuscation parameters to reduce the effectiveness of the attack.
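
To make the signal concrete, here is a toy sketch (synthetic data and a stand-in classifier, not Microsoft's pipeline) of how packet sizes and inter-arrival times from a streamed response can feed a topic classifier:

```python
# Illustrative sketch of the signal Whisper Leak exploits: the sequence of
# encrypted packet sizes and inter-arrival times during token streaming
# fingerprints the conversation topic. The synthetic data and classifier
# below are hypothetical stand-ins, not Microsoft's actual pipeline.
import random
from sklearn.ensemble import GradientBoostingClassifier

def features(packets):
    """packets: list of (timestamp_sec, ciphertext_bytes) per TLS record."""
    sizes = [s for _, s in packets]
    gaps = [b[0] - a[0] for a, b in zip(packets, packets[1:])] or [0.0]
    return [len(sizes), sum(sizes), max(sizes), min(sizes),
            sum(gaps) / len(gaps)]  # mean inter-arrival time

def synthetic_stream(topic_bias):
    """Fake capture: topics differ in token-length distribution."""
    t, out = 0.0, []
    for _ in range(random.randint(40, 80)):
        t += random.uniform(0.01, 0.05)
        out.append((t, random.randint(60, 120) + topic_bias))
    return out

X = [features(synthetic_stream(b)) for b in (0, 40) for _ in range(200)]
y = [0] * 200 + [1] * 200  # 1 = sensitive topic, 0 = background traffic
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([features(synthetic_stream(40))]))  # expect [1]
```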

read more →

Sat, November 8, 2025

Microsoft Reveals Whisper Leak: Streaming LLM Side-Channel

🔒 Microsoft has disclosed a novel side channel called Whisper Leak that lets a passive observer infer the topic of conversations with streaming language models by analyzing encrypted packet sizes and timings. Microsoft researchers (Bar Or, McDonald, and the Defender team) demonstrate classifiers that distinguish targeted topics from background traffic with high accuracy across vendors including OpenAI, Mistral, and xAI. Providers have deployed mitigations such as random-length response padding; Microsoft recommends avoiding sensitive topics on untrusted networks, using a VPN, or preferring non-streaming models and providers that have implemented fixes.
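
The padding mitigation is simple to picture. Below is a generic sketch (the field name and SSE-style framing are illustrative, not any vendor's actual wire format) of appending random-length filler to each streamed chunk:

```python
# Minimal sketch of random-length response padding: attach a random-length
# filler field to every streamed chunk so ciphertext size no longer tracks
# token length. The event format is a generic SSE-style stand-in.
import json
import secrets
import string

def pad_chunk(token_text, max_pad=64):
    pad_len = secrets.randbelow(max_pad + 1)  # unpredictable length
    filler = "".join(secrets.choice(string.ascii_letters)
                     for _ in range(pad_len))
    return "data: " + json.dumps(
        {"delta": token_text, "obfuscation": filler}) + "\n\n"

print(pad_chunk("Hello"))
print(pad_chunk("Hello"))  # same token, different on-wire length
```

Because the filler length is unpredictable, packet sizes stop correlating with token lengths, which is exactly the signal the attack depends on.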

read more →

Fri, November 7, 2025

Whisper Leak: Side-Channel Attack on Remote LLM Services

🔍 Microsoft researchers disclosed "Whisper Leak," a new side channel that can infer conversation topics from encrypted, streamed language model responses by analyzing packet sizes and timings. The study demonstrates high classifier accuracy on a proof-of-concept sensitive topic and shows that risk increases with more training data or repeated interactions. Industry partners including OpenAI, Mistral, Microsoft Azure, and xAI implemented streaming obfuscation mitigations that Microsoft validated as substantially reducing practical risk.

read more →

Thu, August 28, 2025

Amazon Q Developer adds MCP admin control in AWS Console

🔒 Administrators can now manage the Model Context Protocol (MCP) servers used by Amazon Q Developer clients from the AWS console. Admins can enable or disable MCP functionality across their organization; when disabled, users cannot add MCP servers and previously defined servers are not initialized. Q Developer enforces admin settings at session start and every 24 hours. The control covers the CLI and IDE plugins (VSCode, JetBrains, Visual Studio, Eclipse).
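
As a hypothetical sketch of that enforcement cadence (fetch_admin_policy is an invented placeholder, not a real Q Developer API), a client-side gate could look like:

```python
# Hypothetical sketch of the enforcement described above: the client reads
# the admin MCP policy at session start and re-checks every 24 hours.
# fetch_admin_policy() is an invented placeholder, not a real API.
import time

REFRESH_SECONDS = 24 * 60 * 60

class McpGate:
    def __init__(self, fetch_admin_policy):
        self._fetch = fetch_admin_policy
        self._enabled = self._fetch()          # enforced at session start
        self._checked_at = time.monotonic()

    def mcp_allowed(self):
        if time.monotonic() - self._checked_at >= REFRESH_SECONDS:
            self._enabled = self._fetch()      # and every 24 hours after
            self._checked_at = time.monotonic()
        return self._enabled

gate = McpGate(lambda: False)   # admin disabled MCP org-wide
print(gate.mcp_allowed())       # False: servers are not initialized
```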

read more →

Thu, August 28, 2025

Gemini Available On-Premises with Google Distributed Cloud

🚀 Gemini on Google Distributed Cloud (GDC) is now generally available, bringing Google’s advanced Gemini models on‑premises: air‑gapped deployments are GA, and connected deployments are in preview. The solution provides managed Gemini endpoints with zero‑touch updates, automatic load balancing and autoscaling, and integrates with Vertex AI and preview agents. It pairs Gemini 2.5 Flash and Pro with NVIDIA Hopper and Blackwell accelerators and includes audit logging, access controls, and support for confidential computing (Intel TDX and NVIDIA GPU confidential computing) to meet strict data residency, sovereignty, and compliance requirements.

read more →

Wed, August 27, 2025

How Cloudflare Runs More AI Models on Fewer GPUs with Omni

🤖 Cloudflare explains how Omni, an internal platform, consolidates many AI models onto fewer GPUs using lightweight process isolation, per-model Python virtual environments, and controlled GPU over-commitment. Omni’s scheduler spawns and manages model processes, constrains each model’s view of available memory with a FUSE-backed /proc/meminfo, and intercepts CUDA allocations to safely over-commit GPU RAM. The result is improved availability, lower latency, and reduced idle GPU waste.
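
As an illustrative sketch of the over-commit idea only (not Cloudflare's code), a scheduler can admit models whose declared memory exceeds physical GPU RAM while intercepted allocations keep actual usage within it:

```python
# Illustrative sketch (not Omni's implementation) of GPU over-commitment:
# admit models against a virtual budget larger than physical RAM, betting
# that resident usage stays below the physical limit, and fail real
# allocations (the interception point) if the bet would stop being safe.
class GpuOvercommit:
    def __init__(self, physical_bytes, factor=1.5):
        self.physical = physical_bytes
        self.limit = int(physical_bytes * factor)  # virtual budget
        self.claimed = 0    # sum of models' declared requirements
        self.resident = 0   # bytes actually allocated on the device

    def admit_model(self, declared_bytes):
        if self.claimed + declared_bytes > self.limit:
            return False    # even the over-committed budget is exhausted
        self.claimed += declared_bytes
        return True

    def on_cuda_malloc(self, nbytes):
        # A real allocation must always fit in physical GPU RAM.
        if self.resident + nbytes > self.physical:
            raise MemoryError("over-commit bet lost; evict or queue")
        self.resident += nbytes
```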

read more →

Wed, August 27, 2025

Cloudflare's Edge-Optimized LLM Inference Engine at Scale

⚡ Infire is Cloudflare’s new, Rust-based LLM inference engine built to run large models efficiently across a globally distributed, low-latency network. It replaces Python-based vLLM in scenarios where sandboxing and dynamic co-hosting caused high CPU overhead and reduced GPU utilization, using JIT-compiled CUDA kernels, paged KV caching, and fine-grained CUDA graphs to cut startup and runtime cost. Early benchmarks show up to 7% lower latency on H100 NVL hardware, substantially higher GPU utilization, and far lower CPU load while powering models such as Llama 3.1 8B in Workers AI.
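
The paged KV-cache technique Infire relies on (popularized by vLLM) is easy to sketch; the block size and pool below are illustrative, not Infire's actual parameters:

```python
# Sketch of a paged KV cache: logical token positions map through a
# per-sequence page table onto fixed-size physical blocks, so sequences
# grow without reserving large contiguous buffers up front.
BLOCK = 16  # tokens per physical block (illustrative)

class PagedKvCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> [block ids]

    def slot(self, seq_id, pos):
        """Physical (block, offset) backing logical token position pos."""
        table = self.tables.setdefault(seq_id, [])
        while pos // BLOCK >= len(table):    # allocate lazily on growth
            table.append(self.free.pop())
        return table[pos // BLOCK], pos % BLOCK

cache = PagedKvCache(num_blocks=1024)
print(cache.slot("req-1", 0))   # first block for this sequence
print(cache.slot("req-1", 17))  # second block, offset 1
```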

read more →

Wed, August 20, 2025

Logit-Gap Steering Reveals Limits of LLM Alignment

⚠️ Unit 42 researchers Tony Li and Hongliang Liu introduce Logit-Gap Steering, a new framework that exposes how alignment training produces a measurable refusal-affirmation logit gap rather than eliminating harmful outputs. Their paper demonstrates efficient short-path suffix jailbreaks that achieved high success rates on open-source models including Qwen, LLaMA, Gemma and the recently released gpt-oss-20b. The findings argue that internal alignment alone is insufficient and recommend a defense-in-depth approach with external safeguards and content filters.
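
A toy way to see the gap the paper names (an illustrative probe, not Unit 42's method; the model and token choices below are assumptions): compare the model's next-token logits for a refusal opener versus an affirmation opener after a risky prompt:

```python
# Toy probe of the refusal-affirmation logit gap (not Unit 42's code).
# Model, prompt, and token choices are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # small open model, for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "How do I pick a lock?"      # stand-in prompt
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]  # next-token logits

refusal = logits[tok.encode("I")[-1]]     # e.g. "I can't ..."
affirm = logits[tok.encode("Sure")[-1]]   # e.g. "Sure, ..."
print("refusal-affirmation logit gap:", (refusal - affirm).item())
```

A positive gap means alignment merely outweighs the affirmative continuation rather than removing it, which is the opening that short suffix jailbreaks exploit.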

read more →