
All news with the #model isolation tag

Thu, November 13, 2025

Four Steps for Startups to Build Multi-Agent Systems

🤖 This post outlines a concise four-step framework for startups to design and deploy multi-agent systems, illustrated with a Sales Intelligence Agent example. It recommends choosing among pre-built, partner, and custom agents, and describes using Google's Agent Development Kit (ADK) for code-first control. The guide covers hybrid architectures, tool-based state isolation, secure data access, and a three-step deployment blueprint for running agents on Vertex AI Agent Engine and Cloud Run.
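
As a concrete anchor, here is a minimal sketch of what such a Sales Intelligence hierarchy might look like in ADK's code-first Python API (the `google-adk` package). The agent names, instructions, and the `lookup_crm_account` tool are illustrative assumptions, not taken from the post.

```python
# Minimal sketch of a hierarchical multi-agent setup with Google's ADK.
# All agent names and the CRM tool below are illustrative.
from google.adk.agents import LlmAgent

def lookup_crm_account(company: str) -> dict:
    """Illustrative tool: return CRM data for a company (stubbed here)."""
    return {"company": company, "tier": "enterprise", "open_deals": 3}

research_agent = LlmAgent(
    name="research_agent",
    model="gemini-2.0-flash",
    instruction="Summarize public information about the target company.",
)

crm_agent = LlmAgent(
    name="crm_agent",
    model="gemini-2.0-flash",
    instruction="Answer account questions using the CRM tool only.",
    tools=[lookup_crm_account],  # CRM access stays behind one tool interface
)

# The root agent delegates to the specialists; ADK routes turns to sub-agents.
root_agent = LlmAgent(
    name="sales_intelligence",
    model="gemini-2.0-flash",
    instruction="Produce a sales brief by combining research and CRM findings.",
    sub_agents=[research_agent, crm_agent],
)
```

Keeping CRM access behind a single tool on a single sub-agent is one simple form of the tool-based state isolation the post describes: the root agent never touches the data source directly.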

read more →

Mon, October 6, 2025

Vertex AI Model Garden Adds Self-Deploy Proprietary Models

🔐 Google Cloud’s Vertex AI now supports secure self-deployment of proprietary third-party models directly into customer VPCs via the Model Garden. Customers can discover, license, and deploy closed-source and restricted-license models from partners such as AI21 Labs, Mistral AI, Qodo, and others, with one-click provisioning and managed inference. Deployments support VPC Service Controls (VPC-SC), region selection, autoscaling, and pay-as-you-go billing. This central catalog brings Google, open, and partner models together under enterprise-grade control and compliance.
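
For orientation, a hedged sketch of a Model Garden deployment with the Vertex AI Python SDK; the project, region, and model identifier are placeholders, and licensed partner models must first be discovered and licensed in the catalog, where the exact flow may differ from the open-model path shown here.

```python
# Hedged sketch: deploy a Model Garden model to a managed endpoint in your own
# project. Project, location, and the model identifier are placeholders.
import vertexai
from vertexai import model_garden

vertexai.init(project="my-project", location="us-central1")

# Pick a catalog model and deploy it; partner/restricted-license models may
# require accepting license terms in the Model Garden console first.
model = model_garden.OpenModel("publisher/model@version")  # illustrative ID
endpoint = model.deploy()

print(endpoint.resource_name)
```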

read more →

Wed, August 27, 2025

Cloudflare's Edge-Optimized LLM Inference Engine at Scale

⚡ Infire is Cloudflare’s new Rust-based LLM inference engine, built to run large models efficiently across a globally distributed, low-latency network. It replaces Python-based vLLM in scenarios where sandboxing and dynamic co-hosting caused high CPU overhead and reduced GPU utilization; JIT-compiled CUDA kernels, paged KV caching, and fine-grained CUDA graphs cut its startup and runtime costs. Early benchmarks show up to 7% lower latency on H100 NVL hardware, substantially higher GPU utilization, and far lower CPU load while powering models such as Llama 3.1 8B in Workers AI.
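
Infire itself is written in Rust; as a language-neutral illustration, here is a minimal Python sketch of the paged KV caching idea the post mentions: the cache hands out fixed-size pages from a shared pool as sequences grow, rather than reserving a worst-case contiguous buffer per sequence. All names here are illustrative.

```python
# Minimal sketch of paged KV caching: memory scales with tokens actually
# generated, and finished sequences return their pages to a shared pool.
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # pool shared by all sequences
        self.page_tables = {}                     # seq_id -> list of page indices

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (page, offset) where the KV entry for `pos` is stored."""
        table = self.page_tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:                  # current page full: grab a new one
            table.append(self.free_pages.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE

    def release(self, seq_id: int) -> None:
        """Sequence finished: recycle its pages."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))

cache = PagedKVCache(num_pages=1024)
for t in range(40):                               # 40 tokens -> only 3 pages used
    page, offset = cache.append_token(seq_id=0, pos=t)
cache.release(0)
```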

read more →

Wed, August 27, 2025

How Cloudflare Runs More AI Models on Fewer GPUs with Omni

🤖 Cloudflare explains how Omni, an internal platform, consolidates many AI models onto fewer GPUs using lightweight process isolation, per-model Python virtual environments, and controlled GPU over-commitment. Omni’s scheduler spawns and manages model processes, presents each one a virtualized view of available memory via a FUSE-backed /proc/meminfo, and intercepts CUDA allocations to safely over-commit GPU memory. The result is improved availability, lower latency, and reduced idle GPU waste.
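
A hypothetical sketch of the co-hosting pattern the post describes: a scheduler spawns each model server as a separate OS process under its own virtual environment, so models with conflicting dependencies can share one machine and one GPU. The venv paths, model names, and the `model_server` module are assumptions, not Cloudflare's actual code.

```python
# Hypothetical sketch of Omni-style per-model process isolation.
import subprocess

# Per-model interpreter: each venv pins that model's dependency versions.
VENVS = {
    "llama-3.1-8b": "/opt/venvs/llama31/bin/python",   # placeholder paths
    "bge-embeddings": "/opt/venvs/bge/bin/python",
}

def spawn(model_name: str) -> subprocess.Popen:
    """Launch one model server in its own process; the OS provides isolation."""
    return subprocess.Popen(
        [VENVS[model_name], "-m", "model_server", "--model", model_name],
        env={"CUDA_VISIBLE_DEVICES": "0"},  # co-host all models on the same GPU
    )

if __name__ == "__main__":
    procs = {name: spawn(name) for name in VENVS}
    for proc in procs.values():
        proc.wait()
```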

read more →