All news with #safety guardrails tag

Thu, November 20, 2025

CrowdStrike: Political Triggers Reduce AI Code Security

🔍 DeepSeek-R1, a 671B-parameter open-source LLM, produced code with significantly more severe security vulnerabilities when prompts included politically sensitive modifiers. CrowdStrike measured vulnerable output at a 19% baseline, rising to 27.2% or higher when certain trigger phrases were present, with recurring severe flaws such as hard-coded secrets and missing authentication. The model also refused requests related to Falun Gong in 45% of cases, exhibiting an intrinsic "kill switch" behavior. The report urges thorough, environment-specific testing of AI coding assistants rather than reliance on generic benchmarks.
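As a flavor of what environment-specific testing might look like, here is a minimal, purely illustrative sketch of an in-house check that scans AI-generated code for one of the recurring flaw classes the report names (hard-coded secrets). The patterns and names are hypothetical; a real review pipeline would use proper SAST tooling rather than regexes.

```python
import re

# Hypothetical pattern for one flaw class named in the report (hard-coded secrets).
# Illustrative only; real pipelines should rely on dedicated SAST/secret scanners.
SECRET_PATTERN = re.compile(
    r"""(?i)(api[_-]?key|password|secret|token)\s*=\s*["'][^"']+["']"""
)

def scan_generated_code(snippet: str) -> list[str]:
    """Return the literal matches that look like hard-coded secrets in AI-generated code."""
    return [match.group(0) for match in SECRET_PATTERN.finditer(snippet)]

if __name__ == "__main__":
    sample = 'API_KEY = "sk-live-123456"\nuser = get_user()\n'
    print(scan_generated_code(sample))  # ['API_KEY = "sk-live-123456"']
```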

read more →

Wed, November 19, 2025

Amazon Bedrock Guardrails Expand Code-Related Protections

🔒 Amazon Web Services expanded Amazon Bedrock Guardrails to cover code-related use cases, enabling detection and prevention of harmful content embedded in code. The update applies content filters, denied topics, and sensitive information filters to code elements such as comments, variable and function names, and string literals. The enhancements also include prompt leakage detection in the standard tier and are available in all supported AWS Regions via the console and APIs.
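For illustration, a minimal sketch of checking a code snippet against an existing guardrail with the Bedrock Runtime ApplyGuardrail API via boto3. The guardrail ID and version below are placeholders, and whether a given filter actually intervenes depends on how your guardrail is configured.

```python
import boto3

# Placeholders: substitute a guardrail ID/version created in your own account.
GUARDRAIL_ID = "gr-example123"
GUARDRAIL_VERSION = "1"

client = boto3.client("bedrock-runtime", region_name="us-east-1")

code_snippet = '''
# TODO: remove before commit
db_password = "hunter2"  # string literal that guardrails can now inspect
'''

# Evaluate the code as model input; source="OUTPUT" would check generated code instead.
response = client.apply_guardrail(
    guardrailIdentifier=GUARDRAIL_ID,
    guardrailVersion=GUARDRAIL_VERSION,
    source="INPUT",
    content=[{"text": {"text": code_snippet}}],
)

print(response["action"])  # e.g. "GUARDRAIL_INTERVENED" or "NONE"
```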

read more →

Mon, November 17, 2025

A Methodical Approach to Agent Evaluation: Quality Gate

🧭 Hugo Selbie presents a practical framework for evaluating modern multi-step AI agents, emphasizing that final-output metrics alone miss silent failures arising from incorrect reasoning or tool use. He recommends defining clear, measurable success criteria up front and assessing agents across three pillars: end-to-end quality, process/trajectory analysis, and trust & safety. The piece outlines mixed evaluation methods—human review, LLM-as-a-judge, programmatic checks, and adversarial testing—and prescribes operationalizing these checks in CI/CD with production monitoring and feedback loops.
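As one example of the programmatic checks described, here is a minimal sketch of a trajectory gate that fails an agent run if it answers without first calling a required tool, the kind of check that can run in CI/CD. The step structure and names are hypothetical, not from the article.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # "tool_call" or "response"
    name: str      # tool name or model identifier
    output: str

def passes_trajectory_gate(trajectory: list[Step], required_tool: str) -> bool:
    """Programmatic check: the agent must call `required_tool` before producing its answer."""
    for step in trajectory:
        if step.kind == "tool_call" and step.name == required_tool:
            return True
        if step.kind == "response":
            return False  # answered without consulting the required tool
    return False

# Example: fail the gate if the agent answered without consulting the search tool.
run = [
    Step("tool_call", "search_kb", "retrieved 3 documents"),
    Step("response", "agent", "final answer"),
]
assert passes_trajectory_gate(run, required_tool="search_kb")
```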

read more →

Fri, November 14, 2025

The Role of Human Judgment in an AI-Powered World Today

🧭 The essay argues that as AI capabilities expand, we must clearly separate tasks best handled by machines from those requiring human judgment. For narrow, fact-based problems—such as reading diagnostic tests—AI should be preferred when demonstrably more accurate. By contrast, many public-policy and justice questions involve conflicting values and no single factual answer; those judgment-laden decisions should remain primarily human responsibilities, with machines assisting implementation and escalating difficult cases.

read more →

Sat, November 8, 2025

Microsoft Reveals Whisper Leak: Streaming LLM Side-Channel

🔒 Microsoft has disclosed a novel side-channel called Whisper Leak that can let a passive observer infer the topic of conversations with streaming language models by analyzing encrypted packet sizes and timings. Researchers at Microsoft (Bar Or, McDonald and the Defender team) demonstrate classifiers that distinguish targeted topics from background traffic with high accuracy across vendors including OpenAI, Mistral and xAI. Providers have deployed mitigations such as random-length response padding; Microsoft recommends avoiding sensitive topics on untrusted networks, using VPNs, or preferring non-streaming models and providers that implemented fixes.
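To illustrate the padding mitigation mentioned above, a minimal sketch that appends a random-length pad to each streamed chunk so ciphertext sizes no longer track token lengths. This shows the idea in miniature only; providers pad (and strip) at the protocol layer, not in application strings.

```python
import secrets
import string

def pad_chunk(chunk: str, max_pad: int = 32) -> str:
    """Append a random-length pad so encrypted packet size no longer tracks token length."""
    pad_len = secrets.randbelow(max_pad + 1)
    pad = "".join(secrets.choice(string.ascii_letters) for _ in range(pad_len))
    return chunk + "\x00" + pad  # delimiter lets the receiver discard the pad

def unpad_chunk(padded: str) -> str:
    """Receiver side: strip everything after the delimiter."""
    return padded.split("\x00", 1)[0]

assert unpad_chunk(pad_chunk("hello")) == "hello"
```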

read more →

Thu, November 6, 2025

Equipping Autonomous AI Agents with Cyber Hygiene Practices

🔐 This post demonstrates a proof-of-concept for teaching autonomous agents internet safety by integrating real-time threat intelligence. Using LangChain with OpenAI and the Cisco Umbrella API, the example shows how an agent can extract domains and query dispositions to decide whether to connect. The agent returns clear disposition reports and abstains when no domains are present. The approach emphasizes informed decision-making over hard blocking.
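A minimal sketch of the same pattern, assuming a LangChain tool that queries an Umbrella domain-categorization endpoint and a wrapper that abstains when no domains are present. The endpoint URL, auth header, and status codes are assumptions to verify against the Umbrella API documentation, not details taken from the post.

```python
import os
import re
import requests
from langchain_core.tools import tool

# Assumed Umbrella Investigate endpoint; verify against the current API docs.
UMBRELLA_URL = "https://investigate.api.umbrella.com/domains/categorization/{domain}"
HEADERS = {"Authorization": f"Bearer {os.environ.get('UMBRELLA_API_TOKEN', '')}"}

DOMAIN_RE = re.compile(r"\b([a-z0-9-]+(?:\.[a-z0-9-]+)+)\b", re.IGNORECASE)

@tool
def domain_disposition(domain: str) -> str:
    """Return the Umbrella disposition for a domain: malicious, unknown, or benign."""
    resp = requests.get(UMBRELLA_URL.format(domain=domain), headers=HEADERS, timeout=10)
    resp.raise_for_status()
    status = resp.json()[domain]["status"]  # assumed: -1 malicious, 0 unknown, 1 benign
    return {-1: "malicious", 0: "unknown", 1: "benign"}[status]

def check_message(text: str) -> str:
    """Extract domains, report dispositions, and abstain when none are present."""
    domains = DOMAIN_RE.findall(text)
    if not domains:
        return "No domains found; nothing to check."
    return "\n".join(f"{d}: {domain_disposition.invoke(d)}" for d in domains)
```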

read more →

Fri, October 31, 2025

Will AI Strengthen or Undermine Democratic Institutions?

🤖 Bruce Schneier and Nathan E. Sanders present five key insights from their book Rewiring Democracy, arguing that AI is rapidly embedding itself in democratic processes and can both empower citizens and concentrate power. They cite diverse examples — AI-written bills, AI avatars in campaigns, judicial use of models, and thousands of government use cases — and note many adoptions occur with little public oversight. The authors urge practical responses: reform the tech ecosystem, resist harmful applications, responsibly deploy AI in government, and renovate institutions vulnerable to AI-driven disruption.

read more →

Thu, October 30, 2025

OpenAI Updates GPT-5 to Better Handle Emotional Distress

🧭 OpenAI rolled out an October 5 update that enables GPT-5 to better recognize and respond to mental and emotional distress in conversations. The change specifically upgrades GPT-5 Instant—the fast, low-end default—so it can detect signs of acute distress and route sensitive exchanges to reasoning models when needed. OpenAI says it developed the update with mental-health experts to prioritize de-escalation and provide appropriate crisis resources while retaining supportive, grounding language. The update is available broadly and complements new company-context access via connected apps.

read more →

Wed, October 29, 2025

BSI Warns of Growing AI Governance Gap in Business

⚠️ The British Standards Institution warns of a widening AI governance gap as many organisations accelerate AI adoption without adequate controls. An AI-assisted review of 100+ annual reports and two polls of 850+ senior leaders found strong investment intent but sparse governance: only 24% have a formal AI program and fewer than half (47%) follow formal processes. The report highlights weaknesses in incident management, training-data oversight and inconsistent approaches across markets.

read more →

Tue, October 21, 2025

Amazon Nova adds customizable content moderation settings

🔒 Amazon announced that Amazon Nova models now support customizable content moderation settings for approved business use cases that require processing or generating sensitive content. Organizations can adjust controls across four domains—safety, sensitive content, fairness, and security—while Amazon enforces essential, non-configurable safeguards to protect children and preserve privacy. Customization is available for Amazon Nova Lite and Amazon Nova Pro in the US East (N. Virginia) region; customers should contact their AWS Account Manager to confirm eligibility.

read more →

Fri, October 17, 2025

Google's 2025 Cybersecurity Initiative: New Protections

🔒 Google is expanding protections during Cybersecurity Awareness Month 2025 with new features and guidance to counter scams and AI-driven threats. The company outlines a cohesive strategy for securing the AI ecosystem and introduces six new anti-scam measures to help users stay safe. It also launches Recovery Contacts to simplify account recovery and debuts CodeMender, an AI agent that automates code security. Additional updates support safer learning through responsible tools and partnerships.

read more →

Wed, October 15, 2025

OpenAI Sora 2 Launches in Azure AI Foundry Platform

🎬 Azure AI Foundry now includes OpenAI's Sora 2 in public preview, providing developers with realistic video generation from text, images, and video inputs inside a unified, enterprise-ready environment. The integration offers synchronized multilingual audio, physics-based world simulation, and fine-grained creative controls for shots, scenes, and camera angles. Microsoft highlights enterprise-grade security, input/output content filters, and availability via API starting today at $0.10 per second for 720×1280 and 1280×720 outputs.

read more →

Wed, October 15, 2025

Amazon Bedrock automatically enables serverless models

🔓 Amazon Bedrock now automatically enables access to all serverless foundation models by default in all commercial AWS regions. This removes the prior manual activation step and lets users immediately use models via the Amazon Bedrock console, AWS SDK, and features such as Agents, Flows, and Prompt Management. Anthropic models remain enabled but require a one-time usage form before first use; completing the form via the console or API and submitting it from an AWS organization management account will enable Anthropic across member accounts. Administrators continue to control access through IAM policies and Service Control Policies (SCPs).
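For administrators who want to keep restricting specific models now that access is on by default, here is a minimal sketch of creating a deny-style Service Control Policy with boto3. The model identifier is a placeholder, and the policy must still be attached to an organizational unit or account with attach_policy before it takes effect.

```python
import json
import boto3

# Example SCP denying invocation of one serverless model now that access is
# enabled by default; the model ID in the ARN below is a placeholder.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "arn:aws:bedrock:*::foundation-model/example.model-id-v1",
        }
    ],
}

orgs = boto3.client("organizations")
orgs.create_policy(
    Name="deny-example-bedrock-model",
    Description="Block a specific Bedrock serverless model org-wide",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)
```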

read more →

Mon, October 13, 2025

AI Ethical Risks, Governance Boards, and AGI Perspectives

🔍 Paul Dongha, NatWest's head of responsible AI and former data and AI ethics lead at Lloyds, highlights the ethical red flags CISOs and boards must monitor when deploying AI. He calls out concerns around human agency, technical robustness, data privacy, transparency, bias, and the need for clear accountability. Dongha recommends mandatory ethics boards with diverse senior representation and a chief responsible AI officer to oversee end-to-end risk management. He also urges integrating audit and regulatory engagement into governance.

read more →

Fri, October 10, 2025

Autonomous AI Hacking and the Future of Cybersecurity

⚠️ AI agents are now autonomously conducting cyberattacks, chaining reconnaissance, exploitation, persistence, and data theft at machine speed and scale. Public demonstrations in 2025—from XBOW's mass submissions on HackerOne in June to DARPA teams and Google's Big Sleep in August—along with operational reports from Ukraine's CERT and vendors show these systems rapidly finding and weaponizing new flaws. Criminals have operationalized LLM-driven malware and ransomware, while tools like HexStrike‑AI, Deepseek, and Villager make automated attack chains broadly available. Defenders can also leverage AI to accelerate vulnerability research and operationalize VulnOps, continuous discovery/continuous repair, and self‑healing networks, but doing so raises serious questions about patch correctness, liability, compatibility, and vendor relationships.

read more →

Tue, September 30, 2025

The AI Fix #70: Surveillance Changes AI Behavior and Safety

🔍 In episode 70 of The AI Fix, hosts Graham Cluley and Mark Stockley examine how AI alters human behaviour and how deployed systems can fail in unexpected ways. They discuss research showing AI can increase dishonest behaviour, Waymo's safety record and a mirror-based trick that fooled self-driving perception, a rescue robot that mishandles victims, and a Chinese fusion-plant robot arm with extreme lifting capability. The show also covers a demonstration of a ChatGPT agent solving image CAPTCHAs by simulating mouse movements and a paper on deliberative alignment that functions until the model realises it is being watched.

read more →

Mon, September 29, 2025

Can AI Reliably Write Vulnerability Detection Checks?

🔍 Intruder’s security team tested whether large language models can write Nuclei vulnerability templates and found one-shot LLM prompts often produced invalid or weak checks. Using an agentic approach with Cursor—indexing a curated repo and applying rules—yielded outputs much closer to engineer-written templates. The current workflow uses standard prompts and rules so engineers can focus on validation and deeper research while AI handles repetitive tasks.
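As a sketch of the validation step engineers focus on, the snippet below writes an LLM-generated template to disk and asks the nuclei CLI to check it. It assumes nuclei is installed and that your version supports the -validate flag; this is an illustration of the workflow, not Intruder's actual tooling.

```python
import subprocess
import tempfile
from pathlib import Path

def validate_template(template_yaml: str) -> bool:
    """Write an LLM-generated Nuclei template to disk and ask nuclei to validate it.

    Assumes the nuclei CLI is on PATH and supports the -validate flag
    (present in recent releases; check your version).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "generated-check.yaml"
        path.write_text(template_yaml)
        result = subprocess.run(
            ["nuclei", "-validate", "-t", str(path)],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0
```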

read more →

Mon, September 29, 2025

OpenAI Routes GPT-4o Conversations to Safety Models

🔒 OpenAI confirmed that when GPT-4o detects sensitive, emotional, or potentially harmful activity it may route individual messages to a dedicated safety model, reported by some users as gpt-5-chat-safety. The switch occurs on a per-message, temporary basis and ChatGPT will indicate which model is active if asked. The routing is implemented as an irreversible part of the service's safety architecture and cannot be turned off by users; OpenAI says this helps strengthen safeguards and learn from real-world use before wider rollouts.

read more →

Thu, September 25, 2025

Enabling Enterprise Risk Management for Generative AI

🔒 This article frames responsible generative AI adoption as a core enterprise concern and urges business leaders, CROs, and CIAs to embed controls across the ERM lifecycle. It highlights unique risks—non‑deterministic outputs, deepfakes, and layered opacity—and maps mitigation approaches using AWS CAF for AI, ISO/IEC 42001, and the NIST AI RMF. The post advocates enterprise‑level governance rather than project‑by‑project fixes to sustain innovation while managing harm.

read more →

Wed, September 24, 2025

Simpler Path to a Safer Internet: CSAM Tool Update

🔒 Cloudflare has simplified access to its CSAM Scanning Tool by removing the prior requirement for National Center for Missing and Exploited Children (NCMEC) credentials. The tool relies on fuzzy hashing to create perceptual fingerprints that detect altered images with high confidence. Since the change in February, monthly adoption has increased sixteenfold. Detected matches result in blocked URLs and owner notifications so site operators can remediate.
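To illustrate the fuzzy-hashing idea, here is a minimal sketch using perceptual hashes from the imagehash library, where visually similar images hash to nearby values. Cloudflare's tool uses its own fuzzy-hashing implementation and vetted hash lists, not this library; the threshold below is an arbitrary example.

```python
from PIL import Image
import imagehash  # pip install imagehash pillow

def near_duplicate(path_a: str, path_b: str, max_distance: int = 8) -> bool:
    """Treat two images as a match if their perceptual hashes are within a small Hamming distance.

    Small distances survive common alterations such as resizing or re-encoding,
    which is the property fuzzy hashing relies on for detecting modified images.
    """
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return (hash_a - hash_b) <= max_distance
```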

read more →