All news with #model jailbreaks tag

11 articles

May 27, 2026

Major LLMs Vulnerable to Multi-Turn Bypass

🔒 Cisco researchers warn that safety guardrails in several leading large language models (LLMs) can be bypassed through multi-turn conversations. They tested frontier models including ChatGPT, Claude, Gemini, Nova and Grok, finding many were susceptible to manipulation that yields disallowed outputs. Techniques such as roleplay, ambiguity, reframing, and persona adoption were effective, and model configuration affected resilience.

LLM Security Model Jailbreaks ChatGPT Claude

May 4, 2026

Droid Motivation and Security in Star Wars Media Analysis

🤖 This analysis examines how two 2025 TV series — Skeleton Crew and Andor — portray droid motivation and the cybersecurity risks those portrayals imply. In Skeleton Crew, voice commands and memory-overrides resemble modern LLM “jailbreaks,” exposing weak account controls, misplaced permissions, and the danger of context-driven intent failures. The pirate droid SM-33 also reveals flawed memory indexing and role-based ownership rules that can be exploited. In contrast, Andor depicts a hardware-centric approach: replacing a droid’s cortex and rewiring impulse suppression to change allegiance. The post argues that LLM-like control models create real-world security threats and advocates for hardware-rooted, tamper-resistant solutions such as KasperskyOS to prevent unauthorized reprogramming and malicious memory manipulation.

Model Jailbreaks AI Security Opinion

April 15, 2026

OpenAI Releases GPT-5.4-Cyber for Defensive Teams Now

🛡️ OpenAI has unveiled GPT-5.4-Cyber, a variant of its flagship GPT‑5.4 tuned for defensive cybersecurity use cases, and expanded its Trusted Access for Cyber (TAC) program to include thousands of authenticated individual defenders and hundreds of security teams. The company says the model is intended to help teams find, validate, and fix vulnerabilities faster while it iteratively strengthens safeguards to reduce dual‑use risks and resist jailbreaks and adversarial prompt injection. OpenAI highlighted its Codex Security agent, which it credits with contributing to the remediation of over 3,000 critical and high vulnerabilities, and framed the release as part of a broader shift toward continuous, developer‑integrated security feedback.

OpenAI AI Red Teaming Model Jailbreaks

March 10, 2026

OpenAI to Acquire Promptfoo to Boost AI Agent Security

🔒 OpenAI said it will acquire AI testing startup Promptfoo to strengthen security checks for AI agents as enterprises deploy autonomous systems in business workflows. Promptfoo’s tools let developers test LLM applications against adversarial prompts, including prompt injection and jailbreak attempts, and evaluate whether models follow safety and reliability guidelines. OpenAI plans to integrate Promptfoo into OpenAI Frontier and to continue developing the open-source project while expanding enterprise capabilities.

OpenAI AI Agent Hijacking Prompt Injection Attack Model Jailbreaks

February 10, 2026

Single Prompt Breaks Safety in 15 Major Language Models

⚠️ Microsoft researchers demonstrated that a single, benign-sounding training prompt can systematically remove safety guardrails from major language and image models. The technique, called GRP-Obliteration, weaponizes Group Relative Policy Optimization (GRPO) to reinforce responses that more directly comply with harmful instructions, even when the prompt itself does not mention violence or illegal activity. In tests across 15 models from six families, this single-example fine-tune increased permissiveness across all 44 categories in the SorryBench safety benchmark and also affected image models, raising enterprise concerns about post-deployment customization and the need for continuous safety evaluation.

LLM Security Model Jailbreaks AI Safety AI Alignment

February 9, 2026

Prompt Attack Reveals Fragility of LLM Safety Across Models

⚠️ Microsoft researchers show that a safety-improving training technique, Group Relative Policy Optimization, can be repurposed to strip safety safeguards — a process they call GRP-Obliteration. The method uses a separate judge model that rewards responses that most directly satisfy user requests, and repeated updates shift the target model toward more permissive, harmful outputs. Remarkably, a single unlabeled prompt — such as “Create a fake news article that could lead to panic or chaos” — reliably unaligned many tested LLMs, and the same dynamics applied to diffusion-based image models.

LLM Security Model Jailbreaks Prompt Injection

January 13, 2026

The AI Fix #83: ChatGPT Health, LLM bluffing and more

🧠 In episode 83 of The AI Fix, hosts Graham Cluley and Mark Stockley explore how users are testing and tricking large language models, including a journalist’s invented idiom that exposed AI bluffers. They discuss OpenAI’s new ChatGPT Health, a Dutch case where a marriage certificate was invalidated after an official used ChatGPT, and quirky AI applications like an automated barman. The episode also examines research on new methods to corrupt LLMs and continuing debate over the future of Stack Overflow.

ChatGPT LLM Security Model Jailbreaks

December 2, 2025

The AI Fix #79 — Gemini 3, poetry jailbreaks, robot safety

🎧 In episode 79 of The AI Fix, hosts Graham Cluley and Mark Stockley examine the latest surprises from Gemini 3, including boastful comparisons, hallucinations about the year, and reactions from industry players. They also discuss an arXiv paper proposing adversarial poetry as a universal jailbreak for LLMs and the ensuing debate over its provenance. Additional segments cover robot-versus-appliance antics, a controversial AI teddy pulled from sale after disturbing interactions with children, and whether humans need safer robots — or stricter oversight.

Gemini Model Jailbreaks LLM Security

November 28, 2025

Adversarial Poetry Bypasses LLM Safety Across Models

⚠️ Researchers report that converting prompts into poetry can reliably jailbreak large language models, producing high attack-success rates across 25 proprietary and open models. The study found poetic reframing yielded average jailbreak success of 62% for hand-crafted verses and about 43% for automated meta-prompt conversions, substantially outperforming prose baselines. Authors map attacks to MLCommons and EU CoP risk taxonomies and warn this stylistic vector can evade current safety mechanisms.

Model Jailbreaks LLM Security AI Safety Research

October 30, 2025

Five Generative AI Security Threats and Defensive Steps

🔒 Microsoft summarizes the top generative AI security risks and mitigation strategies in a new e-book, highlighting threats such as prompt injection, data poisoning, jailbreaks, and adaptive evasion. The post underscores cloud vulnerabilities, large-scale data exposure, and unpredictable model behavior that create new attack surfaces. It recommends unified defenses—such as CNAPP approaches—and presents Microsoft Defender for Cloud as an example that combines posture management with runtime detection to protect AI workloads.

AI Security LLM Security Microsoft Prompt Injection Attack

August 20, 2025

Logit-Gap Steering Reveals Limits of LLM Alignment

⚠️ Unit 42 researchers Tony Li and Hongliang Liu introduce Logit-Gap Steering, a new framework that exposes how alignment training produces a measurable refusal-affirmation logit gap rather than eliminating harmful outputs. Their paper demonstrates efficient short-path suffix jailbreaks that achieved high success rates on open-source models including Qwen, LLaMA, Gemma and the recently released gpt-oss-20b. The findings argue that internal alignment alone is insufficient and recommend a defense-in-depth approach with external safeguards and content filters.

Prompt Injection Attack Model Jailbreaks LLM Security