All news with #nvidia tag
Thu, September 4, 2025
Baseten: improved cost-performance for AI inference
🚀 Baseten reports major cost-performance gains for AI inference by combining Google Cloud A4 VMs powered by NVIDIA Blackwell GPUs with Google Cloud’s Dynamic Workload Scheduler. The company cites 225% better cost-performance for high-throughput inference and 25% improvement for latency-sensitive workloads. Baseten pairs cutting-edge hardware with an open, optimized software stack — including TensorRT-LLM, NVIDIA Dynamo, and vLLM — and multi-cloud resilience to deliver scalable, production-ready inference.
Tue, September 2, 2025
AWS Split Cost Allocation Adds GPU and Accelerator Cost Tracking
🔍 Split Cost Allocation Data now supports accelerator-based workloads running in Amazon Elastic Kubernetes Service (EKS), allowing customers to track costs for Trainium, Inferentia, NVIDIA and AMD GPUs alongside CPU and memory. Cost details are included in the AWS Cost and Usage Report (including CUR 2.0) and can be visualized using the Containers Cost Allocation dashboard in Amazon QuickSight or queried with Amazon Athena. New customers can enable the feature in the Billing and Cost Management console; it is automatically enabled for existing Split Cost Allocation Data customers.
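A hypothetical sketch of pulling the new accelerator cost lines out of a CUR 2.0 export with Athena via boto3; the database, table, and column names here are assumptions you would adapt to your own Data Exports schema.

```python
# Sketch only: query split cost allocation data for EKS from a CUR 2.0 export.
# Database, table, bucket, and column names are placeholders/assumptions.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT line_item_resource_id,
       SUM(split_line_item_split_cost) AS split_cost
FROM cur2_export                      -- placeholder table for your CUR 2.0 export
WHERE line_item_product_code = 'AmazonEKS'
GROUP BY line_item_resource_id
ORDER BY split_cost DESC
LIMIT 20
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "cost_exports"},               # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```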
Fri, August 29, 2025
Google Cloud Expands Confidential Computing with Intel TDX
🔒 Google Cloud has expanded its Intel TDX-based Confidential Computing portfolio, now offering Confidential GKE Nodes, Confidential Space, and Confidential GPUs alongside broader regional availability. Intel TDX Confidential VMs can be created directly from the GCE “Create an instance” flow under the Security tab, with no code changes required. The C3 machine series supports Intel TDX across additional regions and zones, and NVIDIA H100 GPUs on the A3 series enable confidential AI by combining Intel CPU protection with NVIDIA Confidential Computing on the GPU.
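Outside the console flow, a minimal sketch of requesting a C3 VM with Intel TDX through the google-cloud-compute Python client might look like the following; the project, zone, image, and machine type are placeholders, and the enum string is an assumption based on the Compute Engine API.

```python
# Hypothetical sketch: create a C3 Confidential VM with Intel TDX.
from google.cloud import compute_v1

def create_tdx_vm(project: str, zone: str, name: str) -> None:
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/c3-standard-4",
        # Request Intel TDX as the confidential computing technology.
        confidential_instance_config=compute_v1.ConfidentialInstanceConfig(
            confidential_instance_type="TDX",
        ),
        # Confidential VMs cannot live-migrate, so maintenance must terminate the VM.
        scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
        disks=[
            compute_v1.AttachedDisk(
                boot=True,
                auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    source_image="projects/debian-cloud/global/images/family/debian-12",
                ),
            )
        ],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )
    operation = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    operation.result()  # Block until the insert operation completes.
```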
Thu, August 28, 2025
Gemini Available On-Premises with Google Distributed Cloud
🚀 Gemini on Google Distributed Cloud (GDC) is now generally available, bringing Google’s advanced Gemini models on‑premises: air‑gapped deployments are GA, while connected deployments are in preview. The solution provides managed Gemini endpoints with zero‑touch updates, automatic load balancing and autoscaling, and integrates with Vertex AI and agents currently in preview. It pairs Gemini 2.5 Flash and Pro with NVIDIA Hopper and Blackwell accelerators and includes audit logging, access controls, and support for Confidential Computing (Intel TDX on CPUs and NVIDIA Confidential Computing on GPUs) to meet strict data residency, sovereignty, and compliance requirements.
Wed, August 27, 2025
Cloudflare's Edge-Optimized LLM Inference Engine at Scale
⚡ Infire is Cloudflare’s new, Rust-based LLM inference engine built to run large models efficiently across a globally distributed, low-latency network. It replaces Python-based vLLM in scenarios where sandboxing and dynamic co-hosting caused high CPU overhead and reduced GPU utilization, using JIT-compiled CUDA kernels, paged KV caching, and fine-grained CUDA graphs to cut startup and runtime cost. Early benchmarks show up to 7% lower latency on H100 NVL hardware, substantially higher GPU utilization, and far lower CPU load while powering models such as Llama 3.1 8B in Workers AI.
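For readers unfamiliar with paged KV caching, here is a generic, minimal Python sketch of the bookkeeping idea (the same technique vLLM popularized), not Cloudflare’s Rust implementation: KV memory is carved into fixed-size pages, and each sequence keeps a small page table instead of one contiguous buffer, so memory is allocated on demand and returned page by page.

```python
# Generic illustration of paged KV-cache bookkeeping; names and sizes are invented.
PAGE_SIZE = 16  # tokens per KV page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.page_tables: dict[int, list[int]] = {}  # sequence id -> physical page ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_page, slot) where this token's K/V vectors are stored."""
        table = self.page_tables.setdefault(seq_id, [])
        if position // PAGE_SIZE >= len(table):   # current page is full: grab a new one
            table.append(self.free_pages.pop())   # raises if the pool is exhausted
        return table[position // PAGE_SIZE], position % PAGE_SIZE

    def release(self, seq_id: int) -> None:
        """Return all of a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
```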
Wed, August 27, 2025
How Cloudflare Runs More AI Models on Fewer GPUs with Omni
🤖 Cloudflare explains how Omni, an internal platform, consolidates many AI models onto fewer GPUs using lightweight process isolation, per-model Python virtual environments, and controlled GPU over-commitment. Omni’s scheduler spawns and manages model processes, gives each one an isolated view of available memory via a FUSE-backed /proc/meminfo, and intercepts CUDA allocations so GPU memory can be safely over-committed. The result is improved availability, lower latency, and reduced idle GPU waste.
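As an illustration of the over-commit idea only (not Cloudflare’s code), a scheduler can admit models onto a GPU based on their typical memory use rather than their worst case; every name and threshold below is invented.

```python
# Hypothetical sketch of admission control with GPU memory over-commitment.
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    peak_gpu_mb: int      # worst-case memory the model may request
    typical_gpu_mb: int   # what it actually uses most of the time

class GpuSlot:
    def __init__(self, capacity_mb: int, overcommit_ratio: float = 1.5):
        self.capacity_mb = capacity_mb
        self.overcommit_ratio = overcommit_ratio
        self.resident: list[ModelSpec] = []

    def can_admit(self, model: ModelSpec) -> bool:
        typical = sum(m.typical_gpu_mb for m in self.resident) + model.typical_gpu_mb
        peak = sum(m.peak_gpu_mb for m in self.resident) + model.peak_gpu_mb
        # Admit if typical usage fits, even when summed peaks exceed capacity
        # (up to the over-commit ratio); rare peak collisions are handled by
        # queueing or evicting a model rather than reserving for the worst case.
        return typical <= self.capacity_mb and peak <= self.capacity_mb * self.overcommit_ratio

    def admit(self, model: ModelSpec) -> bool:
        if self.can_admit(model):
            self.resident.append(model)
            return True
        return False
```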
Tue, August 26, 2025
Microsoft Azure and NVIDIA Accelerate Scientific AI
🔬 This blog highlights how Microsoft Azure and NVIDIA combine cloud infrastructure and GPU-accelerated AI tooling to speed scientific discovery and commercial deployment. It profiles three startups—Pangaea Data, Basecamp Research, and Global Objects—demonstrating applications from clinical decision support to large-scale protein databases and photorealistic digital twins. The piece emphasizes measurable outcomes, compliance, and the importance of scalable compute and optimized AI frameworks for real-world impact.
Mon, August 25, 2025
Amazon EC2 G6 Instances with NVIDIA L4 Now in UAE Region
🚀 Amazon has launched EC2 G6 instances powered by NVIDIA L4 GPUs in the Middle East (UAE) Region, expanding cloud GPU capacity for graphics and ML workloads. G6 instances offer up to 8 L4 GPUs with 24 GB per GPU, third-generation AMD EPYC processors, up to 192 vCPUs, 100 Gbps networking, and up to 7.52 TB local NVMe storage. They are available via On-Demand, Reserved, Spot, and Savings Plans and can be managed through the AWS Console, CLI, and SDKs.
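A minimal sketch of launching a G6 instance in the UAE Region with boto3; the AMI, key pair, and subnet IDs are placeholders you would replace with your own.

```python
# Sketch: launch a single g6.xlarge (1x NVIDIA L4, 24 GB GPU memory) in me-central-1.
import boto3

ec2 = boto3.client("ec2", region_name="me-central-1")  # Middle East (UAE) Region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder: a GPU-ready AMI
    InstanceType="g6.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                 # placeholder key pair
    SubnetId="subnet-0123456789abcdef0",   # placeholder subnet
)
print(response["Instances"][0]["InstanceId"])
```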
Mon, August 25, 2025
vLLM Performance Tuning for xPU Inference Configs Guide
⚙️ This guide from Google Cloud authors Eric Hanley and Brittany Rockwell explains how to tune vLLM deployments for xPU inference, covering accelerator selection, memory sizing, configuration, and benchmarking. It shows how to gather workload parameters, estimate HBM/VRAM needs (example: gemma-3-27b-it ≈57 GB), and run vLLM’s auto_tune to find optimal gpu_memory_utilization and throughput. The post compares GPU and TPU options and includes practical troubleshooting tips, cost analyses, and resources to reproduce benchmarks and HBM calculations.
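For context, a back-of-the-envelope estimate of the kind the guide walks through; the formula and overhead factor below are simplifications rather than the authors’ exact calculator, but for a ~27B-parameter model in bfloat16 they land near the ~57 GB figure quoted for gemma-3-27b-it before the KV cache is sized for a specific workload.

```python
# Rough VRAM estimate: weights at 2 bytes/param (bfloat16) plus a small runtime overhead.
def estimate_vram_gb(params_billion: float, bytes_per_param: int = 2,
                     overhead_fraction: float = 0.05) -> float:
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return weights_gb * (1 + overhead_fraction)

print(estimate_vram_gb(27))  # ~56.7 GB, before adding KV cache for your context lengths
```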