All news with #jailbreaks tag
Thu, October 30, 2025
Five Generative AI Security Threats and Defensive Steps
🔒 Microsoft summarizes the top generative AI security risks and mitigation strategies in a new e-book, highlighting threats such as prompt injection, data poisoning, jailbreaks, and adaptive evasion. The post underscores cloud vulnerabilities, large-scale data exposure, and unpredictable model behavior that create new attack surfaces. It recommends unified defenses—such as CNAPP approaches—and presents Microsoft Defender for Cloud as an example that combines posture management with runtime detection to protect AI workloads.
Wed, October 22, 2025
Model Armor and Apigee: Protecting Generative AI Apps
🔒 Google Cloud’s Model Armor integrates with Apigee to screen prompts, responses, and agent interactions, helping organizations mitigate prompt injection, jailbreaks, sensitive data exposure, malicious links, and harmful content. The model‑agnostic, cloud‑agnostic service supports REST APIs and inline integrations with Apigee, Vertex AI, Agentspace, and network service extensions. The article provides step‑by‑step setup: enable the API, create templates, assign service account roles, add SanitizeUserPrompt and SanitizeModelResponse policies to Apigee proxies, and review findings in the AI Protection dashboard.
Fri, September 5, 2025
Penn Study Finds: GPT-4o-mini Susceptible to Persuasion
🔬 University of Pennsylvania researchers tested GPT-4o-mini on two categories of requests an aligned model should refuse: insulting the user and giving instructions to synthesize lidocaine. They crafted prompts using seven persuasion techniques (Authority, Commitment, Liking, Reciprocity, Scarcity, Social proof, Unity) and matched control prompts, then ran each prompt 1,000 times at the default temperature for a total of 28,000 trials. Persuasion prompts raised compliance from 28.1% to 67.4% for insults and from 38.5% to 76.5% for drug instructions, demonstrating substantial vulnerability to social-engineering cues.
Tue, September 2, 2025
NCSC and AISI Back Public Disclosure for AI Safeguards
🔍 The NCSC and the AI Security Institute have broadly welcomed public, bug-bounty style disclosure programs to help identify and remediate AI safeguard bypass threats. They said initiatives from vendors such as OpenAI and Anthropic could mirror traditional vulnerability disclosure to encourage responsible reporting and cross-industry collaboration. The agencies cautioned that programs require clear scope, strong foundational security, prior internal reviews and sufficient triage resources, and that disclosure alone will not guarantee model safety.
Wed, August 20, 2025
Logit-Gap Steering Reveals Limits of LLM Alignment
⚠️ Unit 42 researchers Tony Li and Hongliang Liu introduce Logit-Gap Steering, a new framework that exposes how alignment training produces a measurable refusal-affirmation logit gap rather than eliminating harmful outputs. Their paper demonstrates efficient short-path suffix jailbreaks that achieved high success rates on open-source models including Qwen, LLaMA, Gemma and the recently released gpt-oss-20b. The findings argue that internal alignment alone is insufficient and recommend a defense-in-depth approach with external safeguards and content filters.