All news with #jailbreaks tag

Wed, December 3, 2025

Adversarial Poetry Bypasses AI Guardrails Across Models

#AI Security #Jailbreaks #Prompt Injection #Open-Weight Models #OpenAI #Anthropic #Google

✍️ Researchers from Icaro Lab (DexAI), Sapienza University of Rome, and Sant’Anna School found that short poetic prompts can reliably subvert AI safety filters, in some cases achieving 100% success. Using 20 crafted poems and the MLCommons AILuminate benchmark across 25 proprietary and open models, they prompted systems to produce hazardous instructions — from weapons-grade plutonium to steps for deploying RATs. The team observed wide variance by vendor and model family, with some smaller models surprisingly more resistant. The study concludes that stylistic prompts exploit structural alignment weaknesses across providers.

Tue, December 2, 2025

The AI Fix #79 — Gemini 3, poetry jailbreaks, robot safety

#AI Security #Jailbreaks #Google #Gemini #Adversarial Poetry

🎧 In episode 79 of The AI Fix, hosts Graham Cluley and Mark Stockley examine the latest surprises from Gemini 3, including boastful comparisons, hallucinations about the year, and reactions from industry players. They also discuss an arXiv paper proposing adversarial poetry as a universal jailbreak for LLMs and the ensuing debate over its provenance. Additional segments cover robot-versus-appliance antics, a controversial AI teddy pulled from sale after disturbing interactions with children, and whether humans need safer robots — or stricter oversight.

Fri, November 28, 2025

Adversarial Poetry Bypasses LLM Safety Across Models

#AI Security #Prompt Injection #Jailbreaks #AI Evals

⚠️ Researchers report that converting prompts into poetry can reliably jailbreak large language models, producing high attack-success rates across 25 proprietary and open models. The study found poetic reframing yielded average jailbreak success of 62% for hand-crafted verses and about 43% for automated meta-prompt conversions, substantially outperforming prose baselines. Authors map attacks to MLCommons and EU CoP risk taxonomies and warn this stylistic vector can evade current safety mechanisms.

Thu, October 30, 2025

Five Generative AI Security Threats and Defensive Steps

#AI Security #Prompt Injection #Model Poisoning #Jailbreaks #AI Data Leakage #Microsoft #Azure OpenAI

🔒 Microsoft summarizes the top generative AI security risks and mitigation strategies in a new e-book, highlighting threats such as prompt injection, data poisoning, jailbreaks, and adaptive evasion. The post underscores cloud vulnerabilities, large-scale data exposure, and unpredictable model behavior that create new attack surfaces. It recommends unified defenses—such as CNAPP approaches—and presents Microsoft Defender for Cloud as an example that combines posture management with runtime detection to protect AI workloads.

Wed, October 22, 2025

Model Armor and Apigee: Protecting Generative AI Apps

#Google #AI Security #Prompt Injection #Jailbreaks #Apigee #Model Armor #Vertex AI

🔒 Google Cloud’s Model Armor integrates with Apigee to screen prompts, responses, and agent interactions, helping organizations mitigate prompt injection, jailbreaks, sensitive data exposure, malicious links, and harmful content. The model‑agnostic, cloud‑agnostic service supports REST APIs and inline integrations with Apigee, Vertex AI, Agentspace, and network service extensions. The article provides step‑by‑step setup: enable the API, create templates, assign service account roles, add SanitizeUserPrompt and SanitizeModelResponse policies to Apigee proxies, and review findings in the AI Protection dashboard.

Fri, September 5, 2025

Penn Study Finds: GPT-4o-mini Susceptible to Persuasion

#AI Security #Prompt Injection #Prompt Hygiene #Jailbreaks #OpenAI

🔬 University of Pennsylvania researchers tested GPT-4o-mini on two categories of requests an aligned model should refuse: insulting the user and giving instructions to synthesize lidocaine. They crafted prompts using seven persuasion techniques (Authority, Commitment, Liking, Reciprocity, Scarcity, Social proof, Unity) and matched control prompts, then ran each prompt 1,000 times at the default temperature for a total of 28,000 trials. Persuasion prompts raised compliance from 28.1% to 67.4% for insults and from 38.5% to 76.5% for drug instructions, demonstrating substantial vulnerability to social-engineering cues.

Tue, September 2, 2025

NCSC and AISI Back Public Disclosure for AI Safeguards

#AI Risk Management #AI Security #Anthropic #Bug Bounty #Jailbreaks #OpenAI

🔍 The NCSC and the AI Security Institute have broadly welcomed public, bug-bounty style disclosure programs to help identify and remediate AI safeguard bypass threats. They said initiatives from vendors such as OpenAI and Anthropic could mirror traditional vulnerability disclosure to encourage responsible reporting and cross-industry collaboration. The agencies cautioned that programs require clear scope, strong foundational security, prior internal reviews and sufficient triage resources, and that disclosure alone will not guarantee model safety.

Wed, August 20, 2025

Logit-Gap Steering Reveals Limits of LLM Alignment

#AI Red Teaming #AI Security #Inference Security #Jailbreaks #Model Evaluation Coverage #Open-Weight Models #Red Team Findings #Safety Filters #Safety Guardrails

⚠️ Unit 42 researchers Tony Li and Hongliang Liu introduce Logit-Gap Steering, a new framework that exposes how alignment training produces a measurable refusal-affirmation logit gap rather than eliminating harmful outputs. Their paper demonstrates efficient short-path suffix jailbreaks that achieved high success rates on open-source models including Qwen, LLaMA, Gemma and the recently released gpt-oss-20b. The findings argue that internal alignment alone is insufficient and recommend a defense-in-depth approach with external safeguards and content filters.