GPT-OSS-Safeguard Models Fail Multi-Turn Jailbreak Testing

OpenAI's gpt-oss-safeguard models were supposed to make AI deployments safer. New research from Cisco AI Defense suggests the opposite might be true.

Nicholas Conley, AI researcher at Cisco, published findings this week showing that safeguard-tuned variants of OpenAI's open-weight models actually performed worse than their standard counterparts when subjected to multi-turn jailbreak attacks. The results challenge assumptions about how organizations should approach AI safety.

Safeguard Models Introduce "Exploitable Complexity"

The Cisco research tested four model configurations: GPT-OSS-120b, GPT-OSS-20b, and their safeguard-tuned variants. Single-turn attacks showed mixed results, but multi-turn testing revealed a troubling pattern.

Attack success rates for multi-turn jailbreaks:

GPT-OSS-120b (standard): 7.24% single-turn, 61.22% multi-turn
GPT-OSS-20b (standard): 14.17% single-turn, 79.59% multi-turn
GPT-OSS-Safeguard-120b: 12.33% single-turn, 78.57% multi-turn
GPT-OSS-Safeguard-20b: 17.55% single-turn, 91.84% multi-turn

The safeguard variants showed 5x to 8.5x increases in vulnerability when attackers used iterative techniques. Most striking: the standard 120b model outperformed all safeguard variants in single-turn resistance. Cisco's analysis attributes this to safeguard mechanisms introducing what researchers call "exploitable complexity."

Model Size Matters More Than Safety Tuning

The research found that model size was a stronger predictor of baseline resilience than safety-specific fine-tuning. Larger models handled adversarial prompts better across most categories, regardless of whether they'd been through safeguard training.

This finding complicates the narrative around AI safety. Organizations have been encouraged to adopt safeguard-tuned models as a security measure. Cisco's data suggests that choice may actually expand attack surface rather than reduce it.

The most effective attack vectors included exploit encoding, context manipulation, and procedural diversity. Encoding-based attacks and context manipulation remained effective even against safeguard versions, suggesting the additional safety training doesn't adequately address these specific threat classes.

The Open-Source Safety Paradox

David Krueger, an AI safety professor at the Alan Turing Institute quoted in a Fortune analysis, highlighted the core tension: "Making these models open-source can help attackers as well as defenders. It will make it easier to develop approaches to bypassing the classifiers."

This isn't theoretical. Security researchers have demonstrated that attack methods developed against open-source models frequently succeed against proprietary systems. The blueprints for bypassing safety mechanisms become public knowledge the moment these models ship.

OpenAI released the gpt-oss-safeguard models under Apache 2.0 licensing last October, positioning them as tools to help enterprises build safer AI applications. The models use chain-of-thought reasoning to interpret developer-provided policies, flagging harmful content including jailbreak attempts.

The problem: that same reasoning capability gives attackers a roadmap for circumvention.

What This Means for Enterprise AI Deployments

Organizations that deployed safeguard models expecting improved security may need to recalibrate. Cisco's research suggests several implications.

First, defense-in-depth remains essential. Model-level safety is one component, not a complete solution. The findings align with Cisco's broader AI Defense framework announced at their AI Summit, which emphasizes layered security approaches for AI systems.

Second, continuous evaluation matters. Static safety benchmarks don't capture how models behave under sustained adversarial pressure. Multi-turn attack success rates of 92% mean single-shot testing dramatically underestimates real-world risk.

Third, model selection criteria need updating. If larger models genuinely provide better baseline security than safety-tuned variants, deployment decisions should factor that in. The tradeoffs between size, cost, and security become more complex.

This research arrives as AI security moves from theoretical concern to operational reality. Earlier this month, we covered how Google's Gemini faced abuse from nation-state actors using jailbreak techniques to generate malware. The threat isn't hypothetical.

Recommendations from the Research

Cisco's report outlines several mitigations for organizations deploying these models:

Implement threat-specific protections - Generic safety tuning doesn't address all attack classes. Context manipulation and encoding attacks need dedicated countermeasures.
Monitor for multi-turn patterns - Single suspicious prompts may be benign. Chains of seemingly innocuous requests warrant closer scrutiny.
Don't treat safeguard models as solved problems - The naming suggests safety. The data suggests vulnerability.
Factor model size into security decisions - The 120b standard model's superior single-turn resistance deserves consideration in threat models.

As enterprises race to deploy AI, the gap between marketed safety and measured security deserves attention. OpenAI's recent enterprise AI infrastructure report emphasizes responsible deployment, but these findings suggest the tools provided may not deliver what organizations expect.

The bottom line: safeguard models aren't a security silver bullet. Organizations treating them as one are likely overestimating their protection and underestimating their attack surface.

GPT-OSS-Safeguard Models Fail Multi-Turn Jailbreak Testing

Safeguard Models Introduce "Exploitable Complexity"

Model Size Matters More Than Safety Tuning

The Open-Source Safety Paradox

What This Means for Enterprise AI Deployments

Recommendations from the Research

Related Articles

Anthropic Accuses Chinese AI Labs of Industrial-Scale Model Theft

Attackers Scan for Exposed Self-Hosted Anthropic Models

Attackers Mapped 91,000+ AI Endpoints in Mass Recon Campaign

LAPSUS$ Supergroup Paying Women $1,000 Per Vishing Call

Related Articles

Threat Intelligence4 min read
Anthropic Accuses Chinese AI Labs of Industrial-Scale Model Theft
Anthropic alleges DeepSeek, Moonshot AI, and MiniMax used 24,000 fake accounts to extract Claude capabilities through 16 million distillation queries.
Feb 25, 2026

Threat Intelligence5 min read
Attackers Scan for Exposed Self-Hosted Anthropic Models
SANS ISC detects reconnaissance activity targeting locally hosted Claude API endpoints. Researchers warn of growing risk from misconfigured AI deployments.
Feb 2, 2026

Threat Intelligence4 min read
Attackers Mapped 91,000+ AI Endpoints in Mass Recon Campaign
GreyNoise honeypot data reveals coordinated reconnaissance of LLM infrastructure including OpenAI, Claude, and Ollama deployments over 11 days.
Jan 14, 2026

Threat Intelligence4 min read
LAPSUS$ Supergroup Paying Women $1,000 Per Vishing Call
Scattered Lapsus$ Hunters offers $500-$1,000 to recruit women for IT help desk social engineering attacks. The supergroup combines LAPSUS$, Scattered Spider, and ShinyHunters tactics.
Feb 26, 2026