<ciso brief />

All news with #model evaluation tag

8 articles

Trustpilot’s real-time data enrichment with Gemma

🧩 Trustpilot built a high-volume streaming pipeline using fine-tuned Gemma models to process millions of user reviews in near real-time under tight latency and cost constraints. The team replaced variable per-token pricing with fixed infrastructure costs, fine-tuned lightweight models for tasks like NER, sentiment, and topic classification, and separated classifier and LLM endpoints. Performance tuning, vLLM optimizations, and load testing enabled scalable inference despite challenges with private networking, deployment observability, and GPU availability.

Google AI Edge Portal Adds On‑Device LLM Benchmarking

🚀 Google AI Edge Portal now enables developers to benchmark and debug on-device LLMs across a physical lab of over 120 representative Android devices. It profiles initialization time, prefill and decode speeds, and peak memory usage across CPU, GPU, and NPU backends to surface real user-impacting metrics. The integrated Model Explorer visualizes model graphs, tensor shapes, and traces to speed root-cause analysis and collaboration.

AI Hallucinations Introduce Critical Security Risks

⚠️ AI hallucinations (confident but incorrect outputs) are increasingly driving risky decisions in critical infrastructure and cybersecurity operations because people tend to trust authoritative-sounding responses. A 2025 AA-Omniscience benchmark of 40 models found that most systems were more likely to give a confident wrong answer than a correct one on difficult questions, underscoring that AI outputs must be treated as potential vulnerabilities until vetted. Effective controls include enforced human review before sensitive actions, treating training data as a security asset, strict least-privilege for AI systems, and prompt-engineering training to reduce ambiguous inputs.

Eighth-Generation TPUs: TPU 8t and TPU 8i Deep Dive

🚀 Google Cloud presents its eighth-generation TPUs as two specialized systems: TPU 8t for massive pre-training and embedding workloads, and TPU 8i for low-latency sampling, serving, and reasoning. TPU 8t emphasizes throughput with a SparseCore for embedding collectives, native FP4 precision, VPU/MXU overlap, and the scale-out Virgo network to reduce DCN bottlenecks. TPU 8i prioritizes on-chip SRAM, a Collectives Acceleration Engine (CAE), and the Boardfly topology to cut network diameter and tail latency. The release is paired with a performance-first AI stack — Pallas, Mosaic, native PyTorch preview, and compatibility with JAX and Pathways.

Amazon Bedrock AgentCore Evaluations Now Generally Available

🎯 Amazon Bedrock AgentCore Evaluations is now generally available to deliver automated, continuous and on-demand quality assessment for AI agents. The feature provides online evaluation to sample and score live production traces and on-demand evaluation for programmatic tests in CI/CD pipelines and interactive workflows. It includes 13 built-in evaluators covering response quality, safety, task completion and tool usage, plus Ground Truth and customizable LLM- or code-based evaluators.

Master Generative AI Evaluation: From Prompts to Agents

🔍 This article outlines a practical, metrics-driven approach to testing generative AI systems, moving teams from ad-hoc inspection to systematic evaluation. It introduces four hands-on labs that cover evaluating single LLM outputs, assessing RAG systems with Vertex AI Evaluation, tracing and grading agent behavior with the Agent Development Kit (ADK), and validating SQL-generating agents against BigQuery. Each lab emphasizes measurable metrics—safety, groundedness, faithfulness, and factual accuracy—to help productionize GenAI with confidence.

A Methodical Approach to Agent Evaluation: Quality Gate

🧭 Hugo Selbie presents a practical framework for evaluating modern multi-step AI agents, emphasizing that final-output metrics alone miss silent failures arising from incorrect reasoning or tool use. He recommends defining clear, measurable success criteria up front and assessing agents across three pillars: end-to-end quality, process/trajectory analysis, and trust & safety. The piece outlines mixed evaluation methods—human review, LLM-as-a-judge, programmatic checks, and adversarial testing—and prescribes operationalizing these checks in CI/CD with production monitoring and feedback loops.
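
As a rough illustration of why final-output metrics alone can miss silent failures, the sketch below combines an end-to-end check with two simple trajectory checks. It is not taken from the article: the `Step` record, `evaluate_run`, `passes_gate`, and the tool names are all hypothetical, and a real quality gate would likely add human review, LLM-as-a-judge scoring, and safety checks.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded tool call in an agent run (hypothetical trace format)."""
    tool: str
    ok: bool


def evaluate_run(final_answer, expected_answer, trajectory, required_tools):
    """Score a run on its final output and on the path it took to get there."""
    called = [step.tool for step in trajectory]
    return {
        "final_output_correct": final_answer == expected_answer,
        "required_tools_called": all(tool in called for tool in required_tools),
        "no_failed_steps": all(step.ok for step in trajectory),
    }


def passes_gate(report):
    """The gate passes only when every individual check passes."""
    return all(report.values())


# A run that reaches the right answer without ever calling the lookup tool.
# Illustrative data only; the tool names are assumptions.
lucky_run = [Step(tool="calculator", ok=True)]
report = evaluate_run(
    final_answer="42",
    expected_answer="42",
    trajectory=lucky_run,
    required_tools=["order_lookup", "calculator"],
)
print(report, passes_gate(report))
```

Here the final answer matches, yet the gate fails because a required tool was never called, which is the kind of silent failure an output-only metric would not surface.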

Preventing ML Data Leakage Through Strategic Splitting

🔐 CrowdStrike explains how inadvertent 'leakage' — when dependent or correlated observations are included in training — can inflate machine learning performance and undermine threat detection. The article shows that blocked or grouped data splits and blocked cross-validation produce more realistic performance estimates than random splits. It also highlights trade-offs, such as reduced predictor-space coverage and potential underfitting, and recommends careful partitioning and continuous evaluation to improve cybersecurity ML outcomes.
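
A minimal sketch of a grouped split, assuming each sample carries a group key (for example a malware family or source host) that marks correlated observations. This is not CrowdStrike's implementation: the `grouped_split` helper, the `family_*` labels, and the 25% test fraction are illustrative assumptions. The idea is that whole groups go to either train or test, so correlated samples never straddle the boundary.

```python
import random
from collections import defaultdict


def grouped_split(samples, group_of, test_fraction=0.25, seed=0):
    """Split samples so that every group lands wholly in train or wholly in test."""
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle the group keys, not the individual samples.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [s for k in keys if k not in test_keys for s in groups[k]]
    test = [s for k in keys if k in test_keys for s in groups[k]]
    return train, test


# Hypothetical data: 40 samples spread over 8 groups, as (group, sample_id).
samples = [(f"family_{i % 8}", i) for i in range(40)]
train, test = grouped_split(samples, group_of=lambda s: s[0])

train_groups = {s[0] for s in train}
test_groups = {s[0] for s in test}
print(len(train), len(test), train_groups & test_groups)
```

A purely random split of the same data would almost certainly place members of every group on both sides, which is the leakage pattern the article warns about. The trade-off it mentions also shows up here: the training set never sees the held-out groups at all.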