<ciso brief />

All news with the #model evaluation tag

4 articles

Amazon Bedrock AgentCore Evaluations Now Generally Available

🎯 Amazon Bedrock AgentCore Evaluations is now generally available, delivering automated, continuous, and on-demand quality assessment for AI agents. The feature provides online evaluation, which samples and scores live production traces, and on-demand evaluation for programmatic tests in CI/CD pipelines and interactive workflows. It includes 13 built-in evaluators covering response quality, safety, task completion, and tool usage, plus Ground Truth and customizable LLM- or code-based evaluators.

Master Generative AI Evaluation: From Prompts to Agents

🔍 This article outlines a practical, metrics-driven approach to testing generative AI systems, moving teams from ad-hoc inspection to systematic evaluation. It introduces four hands-on labs that cover evaluating single LLM outputs, assessing RAG systems with Vertex AI Evaluation, tracing and grading agent behavior with the Agent Development Kit (ADK), and validating SQL-generating agents against BigQuery. Each lab emphasizes measurable metrics—safety, groundedness, faithfulness, and factual accuracy—to help productionize GenAI with confidence.
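To make the idea of a code-based metric concrete, here is an intentionally naive groundedness sketch in plain Python. It assumes nothing about the Vertex AI Evaluation API: it simply scores the fraction of response sentences whose content words mostly appear in the retrieved context, with the sample texts and the 0.5 threshold being illustrative choices.

```python
import re

def groundedness(response: str, context: str, threshold: float = 0.5) -> float:
    """Fraction of response sentences whose words mostly appear in the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        # A sentence counts as grounded if enough of its words occur in the context.
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)

context = "The Eiffel Tower is in Paris. It was completed in 1889."
print(groundedness("The Eiffel Tower is in Paris.", context))  # 1.0
```

Production metrics like the labs' groundedness and faithfulness scores typically use an LLM judge rather than word overlap, but the interface — response plus context in, score out — is the same.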

A Methodical Approach to Agent Evaluation: Quality Gate

🧭 Hugo Selbie presents a practical framework for evaluating modern multi-step AI agents, emphasizing that final-output metrics alone miss silent failures arising from incorrect reasoning or tool use. He recommends defining clear, measurable success criteria up front and assessing agents across three pillars: end-to-end quality, process/trajectory analysis, and trust & safety. The piece outlines mixed evaluation methods—human review, LLM-as-a-judge, programmatic checks, and adversarial testing—and prescribes operationalizing these checks in CI/CD with production monitoring and feedback loops.
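A minimal sketch of what one of those programmatic trajectory checks might look like — the trace format, field names, and tool names here are hypothetical, not from Selbie's framework: verify that the agent's recorded trajectory invoked the expected tools in the expected order, which catches a "silent failure" even when the final answer happens to look right.

```python
def tools_called_in_order(trace: list[dict], expected: list[str]) -> bool:
    """True if the expected tool names appear in the trace as an ordered subsequence."""
    calls = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    it = iter(calls)
    # `name in it` consumes the iterator, so each match must come after the last.
    return all(name in it for name in expected)

trace = [
    {"type": "thought", "text": "Need flight prices first."},
    {"type": "tool_call", "tool": "search_flights"},
    {"type": "tool_call", "tool": "book_flight"},
]
print(tools_called_in_order(trace, ["search_flights", "book_flight"]))  # True
print(tools_called_in_order(trace, ["book_flight", "search_flights"]))  # False
```

Checks like this run cheaply in CI/CD alongside the heavier human-review and LLM-as-a-judge passes.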

Preventing ML Data Leakage Through Strategic Splitting

🔐 CrowdStrike explains how inadvertent leakage — when dependent or correlated observations end up on both sides of the train/test split — can inflate machine learning performance estimates and undermine threat detection. The article shows that blocked or grouped data splits and blocked cross-validation produce more realistic performance estimates than random splits. It also highlights trade-offs, such as reduced predictor-space coverage and potential underfitting, and recommends careful partitioning and continuous evaluation to improve cybersecurity ML outcomes.
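The core idea behind grouped ("blocked") splitting can be sketched in a few lines of plain Python — the `group` field and `host-*` ids are illustrative, not from the CrowdStrike article: rows sharing a group id never straddle the train/test boundary, so correlated observations cannot leak from test into training.

```python
def grouped_split(rows: list[dict], test_groups: set[str]):
    """Split rows so that each group lands entirely in train or entirely in test."""
    train = [r for r in rows if r["group"] not in test_groups]
    test = [r for r in rows if r["group"] in test_groups]
    return train, test

rows = [
    {"group": "host-a", "x": 1}, {"group": "host-a", "x": 2},
    {"group": "host-b", "x": 3}, {"group": "host-c", "x": 4},
]
train, test = grouped_split(rows, {"host-b"})
# Both host-a rows stay together in train; the host-b row is held out whole.
print(len(train), len(test))  # 3 1
```

In practice one would reach for something like scikit-learn's `GroupKFold` for the blocked cross-validation the article describes; the trade-off it notes — some regions of predictor space may appear only in held-out groups — is visible even in this toy split.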