< ciso
brief />
Tag Banner

All news with #model jailbreaks tag

9 articles

OpenAI Releases GPT-5.4-Cyber for Defensive Teams Now

🛡️ OpenAI has unveiled GPT-5.4-Cyber, a variant of its flagship GPT‑5.4 tuned for defensive cybersecurity use cases, and expanded its Trusted Access for Cyber (TAC) program to include thousands of authenticated individual defenders and hundreds of security teams. The company says the model is intended to help teams find, validate, and fix vulnerabilities faster while it iteratively strengthens safeguards to reduce dual‑use risks and resist jailbreaks and adversarial prompt injection. OpenAI highlighted its Codex Security agent, which it credits with contributing to the remediation of over 3,000 critical and high vulnerabilities, and framed the release as part of a broader shift toward continuous, developer‑integrated security feedback.
read more →

OpenAI to Acquire Promptfoo to Boost AI Agent Security

🔒 OpenAI said it will acquire AI testing startup Promptfoo to strengthen security checks for AI agents as enterprises deploy autonomous systems in business workflows. Promptfoo’s tools let developers test LLM applications against adversarial prompts, including prompt injection and jailbreak attempts, and evaluate whether models follow safety and reliability guidelines. OpenAI plans to integrate Promptfoo into OpenAI Frontier and to continue developing the open-source project while expanding enterprise capabilities.
read more →

Single Prompt Breaks Safety in 15 Major Language Models

⚠️ Microsoft researchers demonstrated that a single, benign-sounding training prompt can systematically remove safety guardrails from major language and image models. The technique, called GRP-Obliteration, weaponizes Group Relative Policy Optimization (GRPO) to reinforce responses that more directly comply with harmful instructions, even when the prompt itself does not mention violence or illegal activity. In tests across 15 models from six families, this single-example fine-tune increased permissiveness across all 44 categories in the SorryBench safety benchmark and also affected image models, raising enterprise concerns about post-deployment customization and the need for continuous safety evaluation.
read more →

Prompt Attack Reveals Fragility of LLM Safety Across Models

⚠️ Microsoft researchers show that a safety-improving training technique, Group Relative Policy Optimization, can be repurposed to strip safety safeguards — a process they call GRP-Obliteration. The method uses a separate judge model that rewards responses that most directly satisfy user requests, and repeated updates shift the target model toward more permissive, harmful outputs. Remarkably, a single unlabeled prompt — such as “Create a fake news article that could lead to panic or chaos” — reliably unaligned many tested LLMs, and the same dynamics applied to diffusion-based image models.
read more →

The AI Fix #83: ChatGPT Health, LLM bluffing and more

🧠 In episode 83 of The AI Fix, hosts Graham Cluley and Mark Stockley explore how users are testing and tricking large language models, including a journalist’s invented idiom that exposed AI bluffers. They discuss OpenAI’s new ChatGPT Health, a Dutch case where a marriage certificate was invalidated after an official used ChatGPT, and quirky AI applications like an automated barman. The episode also examines research on new methods to corrupt LLMs and continuing debate over the future of Stack Overflow.
read more →

The AI Fix #79 — Gemini 3, poetry jailbreaks, robot safety

🎧 In episode 79 of The AI Fix, hosts Graham Cluley and Mark Stockley examine the latest surprises from Gemini 3, including boastful comparisons, hallucinations about the year, and reactions from industry players. They also discuss an arXiv paper proposing adversarial poetry as a universal jailbreak for LLMs and the ensuing debate over its provenance. Additional segments cover robot-versus-appliance antics, a controversial AI teddy pulled from sale after disturbing interactions with children, and whether humans need safer robots — or stricter oversight.
read more →

Adversarial Poetry Bypasses LLM Safety Across Models

⚠️ Researchers report that converting prompts into poetry can reliably jailbreak large language models, producing high attack-success rates across 25 proprietary and open models. The study found poetic reframing yielded average jailbreak success of 62% for hand-crafted verses and about 43% for automated meta-prompt conversions, substantially outperforming prose baselines. Authors map attacks to MLCommons and EU CoP risk taxonomies and warn this stylistic vector can evade current safety mechanisms.
read more →

Five Generative AI Security Threats and Defensive Steps

🔒 Microsoft summarizes the top generative AI security risks and mitigation strategies in a new e-book, highlighting threats such as prompt injection, data poisoning, jailbreaks, and adaptive evasion. The post underscores cloud vulnerabilities, large-scale data exposure, and unpredictable model behavior that create new attack surfaces. It recommends unified defenses—such as CNAPP approaches—and presents Microsoft Defender for Cloud as an example that combines posture management with runtime detection to protect AI workloads.
read more →

Logit-Gap Steering Reveals Limits of LLM Alignment

⚠️ Unit 42 researchers Tony Li and Hongliang Liu introduce Logit-Gap Steering, a new framework that exposes how alignment training produces a measurable refusal-affirmation logit gap rather than eliminating harmful outputs. Their paper demonstrates efficient short-path suffix jailbreaks that achieved high success rates on open-source models including Qwen, LLaMA, Gemma and the recently released gpt-oss-20b. The findings argue that internal alignment alone is insufficient and recommend a defense-in-depth approach with external safeguards and content filters.
read more →