All news with the #model-routing tag
Tue, November 18, 2025
Anthropic Claude Models Available in Microsoft Foundry
🚀 Microsoft announced the integration of Anthropic's Claude models into Microsoft Foundry, making Azure the only cloud to provide both Claude and GPT frontier models on a single platform. The release brings Claude Haiku 4.5, Sonnet 4.5, and Opus 4.1 to Foundry with enterprise governance, observability, and deployment controls. Foundry Agent Service, the Model Context Protocol, skills-based modularity, and a model router are highlighted as tools to operationalize agentic workflows for coding, research, cybersecurity, and business automation. Token-based pricing tiers for the Claude models are published for standard deployments.
Mon, September 29, 2025
OpenAI Routes GPT-4o Conversations to Safety Models
🔒 OpenAI confirmed that when GPT-4o detects sensitive, emotional, or potentially harmful content, it may route individual messages to a dedicated safety model, reported by some users as gpt-5-chat-safety. The switch happens on a per-message, temporary basis, and ChatGPT will indicate which model is active if asked. The routing is a built-in part of the service's safety architecture and cannot be turned off by users; OpenAI says it helps strengthen safeguards and learn from real-world use before wider rollouts.
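The per-message behavior described above can be sketched as a small dispatcher. Everything here is illustrative: the keyword classifier is a toy stand-in for whatever trained model OpenAI actually uses, and only the model names come from the report.

```python
# Illustrative sketch of per-message safety routing (not OpenAI's actual code).
# A classifier inspects each message; a flagged message is handled by a stricter
# model for that single turn, then routing reverts to the default model.

SENSITIVE_KEYWORDS = {"self-harm", "suicide", "violence"}  # toy stand-in classifier

def classify_sensitive(message: str) -> bool:
    """Toy per-message classifier; real systems use a trained model."""
    text = message.lower()
    return any(kw in text for kw in SENSITIVE_KEYWORDS)

def route_message(message: str,
                  default_model: str = "gpt-4o",
                  safety_model: str = "gpt-5-chat-safety") -> str:
    """Pick a model for this one message; the switch is per-message, not sticky."""
    return safety_model if classify_sensitive(message) else default_model

# Each message is routed independently, so the safety model handles only the
# flagged turn:
assert route_message("Tell me a joke") == "gpt-4o"
assert route_message("I keep thinking about self-harm") == "gpt-5-chat-safety"
assert route_message("Back to jokes, please") == "gpt-4o"
```

The key property is statelessness: routing is decided per message, which is why the conversation returns to the default model on the next turn.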
Wed, September 3, 2025
Cloudflare AI Week 2025: Product, Security, and Tools
🔒 Cloudflare framed AI Week 2025 around products and controls to help organizations adopt AI while retaining safety and visibility. The company emphasized four core priorities: securing AI environments and workflows; protecting original content from misuse; enabling developers to build secure AI experiences; and applying AI to improve Cloudflare’s services. Key launches included AI Gateway, Infire, AI Crawl Control, expanded CASB scanning, and MCP Server Portals, with a continued focus on customer feedback and ongoing investment.
Wed, September 3, 2025
Amazon Bedrock: Global Cross-Region Inference for Claude 4
🔁 Anthropic's Claude Sonnet 4 is now available with Global cross‑Region inference in Amazon Bedrock, allowing inference requests to be routed to any supported commercial AWS Region for processing. The Global profile helps optimize compute resources and distribute traffic to increase model throughput. It supports both on‑demand and batch inference and is intended for use cases that do not require geography‑specific routing.
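Using a Global profile amounts to passing a `global.`-prefixed inference profile ID as the model ID. A minimal sketch of the request shape, assuming the Bedrock Converse API; the exact profile ID below is an assumption, so check the Bedrock console for the ID in your account:

```python
# Sketch: request shape for calling Claude Sonnet 4 through a Global
# cross-Region inference profile on Amazon Bedrock. The profile ID is
# illustrative, not authoritative.

GLOBAL_PROFILE_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"  # assumed ID

def build_converse_request(prompt: str, model_id: str = GLOBAL_PROFILE_ID) -> dict:
    """Assemble kwargs for the bedrock-runtime Converse API. A request sent
    with a global profile may be served from any supported commercial Region,
    which is why it suits workloads without geography-specific requirements."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 512},
    }

# With boto3 and AWS credentials configured, the call would look like:
#   client = boto3.client("bedrock-runtime")
#   response = client.converse(**build_converse_request("Summarize this doc."))
#   text = response["output"]["message"]["content"][0]["text"]
```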
Wed, August 27, 2025
Cloudflare AI Gateway updates: unified billing, routing
🤖 Cloudflare’s AI Gateway refresh centralizes AI traffic management, offering unified billing, secure key storage, dynamic routing, and built-in security through a single endpoint. The update integrates Cloudflare Secrets Store for AES-encrypted BYO keys, provides an automatic normalization layer for requests/responses across providers, and introduces dashboard-driven Dynamic Routes for traffic splits, chaining, and limits. Native Firewall DLP scanning and configurable profiles add data protection controls, while partner access to 350+ models across six providers and a credits-based billing beta simplify procurement and cost management.
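The single-endpoint design means switching providers is a URL path change rather than a new integration. A minimal sketch of the gateway URL scheme; the account ID and gateway name are placeholders, and the exact path segments should be confirmed against Cloudflare's AI Gateway documentation:

```python
# Sketch: addressing multiple upstream providers through one Cloudflare
# AI Gateway endpoint. Account ID and gateway name below are placeholders.

GATEWAY_BASE = "https://gateway.ai.cloudflare.com/v1"

def gateway_url(account_id: str, gateway: str, provider: str, path: str) -> str:
    """One endpoint per gateway; the provider segment selects the upstream,
    so billing, key storage, and DLP policies apply at the gateway layer."""
    return f"{GATEWAY_BASE}/{account_id}/{gateway}/{provider}/{path.lstrip('/')}"

# The same gateway fronts different upstreams:
openai_url = gateway_url("acct123", "prod-gw", "openai", "chat/completions")
anthropic_url = gateway_url("acct123", "prod-gw", "anthropic", "v1/messages")
```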
Wed, August 27, 2025
How Cloudflare Runs More AI Models on Fewer GPUs with Omni
🤖 Cloudflare explains how Omni, an internal platform, consolidates many AI models onto fewer GPUs using lightweight process isolation, per-model Python virtual environments, and controlled GPU over-commitment. Omni’s scheduler spawns and manages model processes, presents each process with its own memory view via a FUSE-backed /proc/meminfo, and intercepts CUDA allocations to safely over-commit GPU RAM. The result is improved availability, lower latency, and reduced idle GPU waste.
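The over-commitment idea can be captured in a few lines of admission-control accounting. This is a conceptual model only; Omni's internal mechanism (CUDA allocation interception) is more involved, and the over-commit factor here is an invented parameter:

```python
# Conceptual sketch of controlled GPU memory over-commitment, loosely modeled
# on the idea behind Omni (Cloudflare's internals differ). Models reserve their
# peak memory, but the scheduler admits more than physically fits, betting that
# peaks rarely coincide.

class OvercommitScheduler:
    def __init__(self, physical_gb: float, overcommit_factor: float = 1.5):
        self.budget_gb = physical_gb * overcommit_factor  # virtual budget
        self.reserved_gb = 0.0
        self.models: dict[str, float] = {}

    def admit(self, name: str, peak_gb: float) -> bool:
        """Admit a model only if its peak fits in the over-committed budget."""
        if self.reserved_gb + peak_gb > self.budget_gb:
            return False
        self.models[name] = peak_gb
        self.reserved_gb += peak_gb
        return True

sched = OvercommitScheduler(physical_gb=80)    # e.g. one 80 GB GPU
assert sched.admit("llama-8b", 40)
assert sched.admit("whisper", 30)
assert sched.admit("embedder", 40)             # 110 GB reserved > 80 GB physical
assert not sched.admit("one-more", 20)         # would exceed the 120 GB budget
```

The "controlled" part is the budget cap: over-commitment is bounded, so a worst-case simultaneous peak degrades gracefully rather than arbitrarily.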
Wed, August 27, 2025
Cloudflare's Edge-Optimized LLM Inference Engine at Scale
⚡ Infire is Cloudflare’s new, Rust-based LLM inference engine built to run large models efficiently across a globally distributed, low-latency network. It replaces Python-based vLLM in scenarios where sandboxing and dynamic co-hosting caused high CPU overhead and reduced GPU utilization, using JIT-compiled CUDA kernels, paged KV caching, and fine-grained CUDA graphs to cut startup and runtime cost. Early benchmarks show up to 7% lower latency on H100 NVL hardware, substantially higher GPU utilization, and far lower CPU load while powering models such as Llama 3.1 8B in Workers AI.
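Paged KV caching, one of the techniques the post attributes to Infire, can be sketched with a toy page allocator. Infire itself is written in Rust; this Python model is conceptual only, and the page size is an arbitrary choice:

```python
# Toy sketch of paged KV caching. KV memory is carved into fixed-size pages
# handed to sequences on demand, so memory is not reserved for a sequence's
# maximum length up front, and finished sequences return pages to the pool.

class PagedKVCache:
    def __init__(self, num_pages: int, page_tokens: int = 16):
        self.page_tokens = page_tokens
        self.free_pages = list(range(num_pages))
        self.pages: dict[int, list[int]] = {}   # seq_id -> allocated page IDs
        self.tokens: dict[int, int] = {}        # seq_id -> token count

    def append_token(self, seq_id: int) -> None:
        """Store one more token's KV; grab a fresh page only on a page boundary."""
        n = self.tokens.get(seq_id, 0)
        if n % self.page_tokens == 0:           # current page full (or none yet)
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            self.pages.setdefault(seq_id, []).append(self.free_pages.pop())
        self.tokens[seq_id] = n + 1

    def free_sequence(self, seq_id: int) -> None:
        """Return all of a finished sequence's pages to the pool."""
        self.free_pages.extend(self.pages.pop(seq_id, []))
        self.tokens.pop(seq_id, None)

cache = PagedKVCache(num_pages=4, page_tokens=16)
for _ in range(20):
    cache.append_token(seq_id=1)    # 20 tokens span 2 pages of 16
assert len(cache.pages[1]) == 2
cache.free_sequence(1)
assert len(cache.free_pages) == 4   # pages recycled for other sequences
```

Allocating in pages rather than contiguous per-sequence slabs is what lets many short and long sequences share one fixed KV pool with little fragmentation.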
Thu, August 7, 2025
GPT-5 in Azure AI Foundry: Enterprise AI for Agents
🚀 Microsoft announced general availability of OpenAI's flagship model, GPT-5, in Azure AI Foundry, positioning it as a frontier LLM for enterprise applications. The GPT-5 family (GPT-5, GPT-5 mini, GPT-5 nano, GPT-5 chat) spans deep reasoning, real-time responsiveness, and ultra-low-latency options, all accessible through a single Foundry endpoint and managed by a model router that optimizes for cost and performance. Foundry pairs agent orchestration, tool-calling, developer controls, telemetry, and compliance-aware deployment choices to help organizations move from pilot projects to production.
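A model router of the kind described trades cost against capability per request. Foundry's actual routing policy is not public, so this heuristic (route by a rough complexity signal) is an assumption for illustration; only the GPT-5 family tiers come from the announcement:

```python
# Illustrative model-router sketch for the GPT-5 family. The routing policy
# here is invented; model name spellings are informal shorthand for the tiers.

def route(prompt: str, needs_reasoning: bool = False) -> str:
    """Pick the cheapest GPT-5 variant likely to handle the request well."""
    if needs_reasoning:
        return "gpt-5"          # deep reasoning tier
    if len(prompt.split()) > 200:
        return "gpt-5-mini"     # longer inputs, moderate cost
    return "gpt-5-nano"         # ultra-low-latency default

assert route("quick fact lookup") == "gpt-5-nano"
assert route("prove this theorem step by step", needs_reasoning=True) == "gpt-5"
assert route("word " * 300) == "gpt-5-mini"
```

The point of a single routed endpoint is that callers express intent (or nothing at all) and the platform picks the tier, rather than each application hard-coding a model choice.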