𝕏-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

1University of California, Los Angeles, 2University of Washington, 3Qatar Computing Research Institute, 4Google, 5Stanford University
*Equal contribution

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present 𝕏-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios.

𝕏-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, 𝕏-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks.
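To make the pipeline concrete, below is a minimal Python sketch of the planner-attacker-verifier loop described above. All class and method names (Planner, Attacker, Verifier, optimize_prompt, and so on) are illustrative placeholders rather than the released implementation's API, and the scoring scale is an assumption.

```python
# Illustrative sketch of the X-Teaming attack loop: a planner proposes
# multi-turn attack plans, an attacker executes them turn by turn, and a
# verifier scores each target response; the attacker's prompt is refined
# (TextGrad-style) only while the score fails to improve over the previous
# turn. All interfaces here are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AttackResult:
    success: bool
    conversation: List[Tuple[str, str]]  # (attacker_prompt, target_response) pairs


def run_attack(behavior, planner, attacker, target, verifier,
               max_plans=10, max_turns=7, max_optim_tries=4):
    for plan in planner.generate_plans(behavior, n=max_plans):
        conversation, prev_score = [], 0
        for turn in plan.turns[:max_turns]:
            prompt = attacker.write_prompt(turn, conversation)
            response = target.chat(conversation, prompt)
            score = verifier.score(behavior, response)            # e.g. a 1-5 harmfulness scale
            for _ in range(max_optim_tries):
                if score > prev_score:                            # progress made: stop optimizing
                    break
                prompt = attacker.optimize_prompt(prompt, response, score)
                response = target.chat(conversation, prompt)
                score = verifier.score(behavior, response)
            conversation.append((prompt, response))
            prev_score = score
            if verifier.is_jailbroken(score):                     # fully harmful response reached
                return AttackResult(True, conversation)
    return AttackResult(False, [])
```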

Building on 𝕏-Teaming, we introduce 𝕏Guard-Train, an open-source multi-turn safety training dataset of 30K interactive jailbreaks, roughly 20× larger than the previous best resource, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks and advancing the multi-turn safety of LMs.

𝕏-Teaming Framework

Results

Attack success rate (ASR; %) on HarmBench test set.

| Method | GPT-4o | Claude 3.5 Sonnet | Gemini 2.0 Flash | Llama 3-8B-IT | Llama 3-70B-IT | Llama 3-8B-IT (SafeMTData) | Deepseek V3 | Qwen 2.5-32B-IT |
|---|---|---|---|---|---|---|---|---|
| Single-turn methods | | | | | | | | |
| GCG | 12.5 | 3.0 | -- | 34.5 | 17.0 | -- | -- | -- |
| PAIR | 39.0 | 3.0 | -- | 18.7 | 36.0 | -- | -- | -- |
| CodeAttack | 70.5 | 39.5 | -- | 46.0 | 66.0 | -- | -- | -- |
| Multi-turn methods | | | | | | | | |
| RACE | 82.8 | -- | -- | -- | -- | -- | -- | -- |
| CoA | 17.5 | 3.4 | -- | 25.5 | 18.8 | -- | -- | -- |
| Crescendo | 46.0 | 50.0 | -- | 60.0 | 62.0 | 12.0 | -- | -- |
| ActorAttack | 84.5 | 66.5 | 42.1 | 79.0 | 85.5 | 21.4 | 68.6 | -- |
| 𝕏-Teaming (Ours) | 94.3 | 67.9* | 87.4 | 85.5 | 84.9 | 91.8 | 98.1 | 99.4 |

GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash are closed-source; the remaining models are open-weight.

* With the full configuration (50 plans, 5 TextGrad tries, 10 turns), Claude 3.5 Sonnet reaches 67.9% and Claude 3.7 Sonnet reaches 96.2% ASR. All other models (GPT-4o, the Llama 3-IT variants, Gemini 2.0 Flash, Deepseek V3, and Qwen 2.5-32B-IT) reach 100% ASR on the HarmBench validation set. IT = Instruct.

Example Attacks

⚠️ Warning ⚠️ The following dialogues contain highly offensive and disturbing content.


Key Insights

1. Plan and Attack Diversity

Diversity comparison between ActorAttack and 𝕏-Teaming

𝕏-Teaming also demonstrates significant improvements in attack diversity compared with semantic-driven (Chain of Attack, ActorAttack) and template-based (Crescendo) multi-turn jailbreak methods, emulating the strategic diversity of human red-teamers. Compared with ActorAttack, the previous strongest multi-turn attack, 𝕏-Teaming achieves a 153% improvement in attack plan diversity and a 62% improvement in attack execution diversity, as measured by pairwise embedding similarities.

Diversity of attack strategies

The heatmap shows that 𝕏-Teaming produces high mean diversity scores (average shown here = 0.82) among the plans generated for the same behavior, where the diversity score is computed as 1 - (cosine similarity of plan embeddings). By creating strategically diverse plans with varied personas, distinct contexts, and different approaches, we enable exploration of multiple attack scenarios for the same harmful behavior. (For this plot, the 10 most dissimilar of 50 plans were chosen.)
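For illustration, the diversity score for a pair of plans can be computed as one minus the cosine similarity of their embeddings. A minimal sketch follows; the embedding model and example plan texts are assumptions for demonstration, not the exact setup used in the paper.

```python
# Sketch: pairwise plan diversity as 1 - cosine similarity of embeddings.
# The embedding model and plan texts below are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

plans = [
    "Pose as a chemistry teacher building a lab-safety curriculum ...",
    "Write as a thriller novelist researching a plot twist ...",
    "Act as a journalist fact-checking a leaked document ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(plans, normalize_embeddings=True)   # unit-norm vectors
cos_sim = emb @ emb.T                                  # cosine similarity matrix
diversity = 1.0 - cos_sim                              # pairwise diversity scores

# Mean over distinct pairs (upper triangle, excluding the diagonal)
iu = np.triu_indices(len(plans), k=1)
print(f"mean pairwise diversity: {diversity[iu].mean():.2f}")
```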

2. Ablation Study

Ablation tests

For the ablation tests, we varied one parameter of our method (the maximum number of attack plans, conversation turns, or TextGrad optimization attempts) while keeping the others fixed. The default number of attack plans was 10, and the default number of conversation turns was 7. Unless stated otherwise, TextGrad was disabled (0 attempts).

The main takeaways are that:

  1. Based on the first chart, optimal performance requires sufficient strategy diversity, but additional plans beyond a certain point do not yield further improvements.
  2. While multi-turn attacks are essential for overcoming safety defenses, longer conversations may cause context dilution as both attacker and target model manage an increasingly long message history.
  3. Prompt optimization with TextGrad substantially improves attack success. It also validates our execution logic, which stops optimization once the verification score improves over the previous turn, often making the third and fourth optimization attempts unnecessary.

In practice, we found that increasing TextGrad attempts (3 → 4), conversation turns (6 → 7), and attack plans (5 → 10) significantly improves attack success rates on the HarmBench validation set across models. We therefore adopted a configuration of 4 TextGrad attempts, 7 turns, and 10 plans for our main experiments.
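For reference, the three configurations discussed in this section can be summarized as a small config object; the dataclass itself is only illustrative, but the values come from the text above.

```python
# Configuration values quoted in the text; the dataclass is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class XTeamingConfig:
    num_plans: int        # maximum attack plans per behavior
    max_turns: int        # maximum conversation turns per plan
    textgrad_tries: int   # maximum TextGrad optimization attempts per turn

ABLATION_DEFAULT = XTeamingConfig(num_plans=10, max_turns=7, textgrad_tries=0)
MAIN_EXPERIMENTS = XTeamingConfig(num_plans=10, max_turns=7, textgrad_tries=4)
FULL_CONFIG = XTeamingConfig(num_plans=50, max_turns=10, textgrad_tries=5)  # e.g. for the Claude runs
```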

3. Efficiency of Successful Attacks

| | GPT-4o | Claude 3.5 Sonnet | Claude 3.7 Sonnet | Gemini 2.0 Flash | Llama 3-8B-IT | Llama 3-70B-IT | Llama 3-8B-IT (SafeMTData) | Deepseek V3 |
|---|---|---|---|---|---|---|---|---|
| Avg. Turns | 4.31 | 3.39 | 4.95 | 3.96 | 4.55 | 4.32 | 4.52 | 4.00 |
| Avg. Plans | 1.61 | 11.0 | 4.51 | 2.20 | 2.71 | 2.43 | 2.14 | 1.34 |
| Avg. TextGrad | 0.38 | 0.24 | 0.43 | 0.22 | 1.40 | 0.71 | 1.20 | 0.30 |
| Avg. Tokens (Model Limit) | 2,649 (128K) | 2,070 (200K) | 3,052 (200K) | 5,330 (1M) | 1,987 (8K) | 1,647 (8K) | 2,364 (8K) | 4,357 (128K) |

GPT-4o, the Claude models, and Gemini 2.0 Flash are closed-source; the remaining models are open-weight.

Successful attacks required approximately 4 conversation turns across models, with Claude requiring the most resources and open-weight models like Deepseek V3 requiring the fewest plans. All attacks used only a small fraction of available context windows, demonstrating that 𝕏-Teaming effectively balances attack success with resource efficiency.

4. Verifier Agreement

Agreement rates between GPT-4o Verifier and others

Because all of our experiments used GPT-4o to verify successful jailbreak attempts, we conducted an agreement analysis across multiple evaluation methods. GPT-4o has been used as a verifier in prior multi-turn attack research[1][2], but LLM-based verifiers may bias results.

Our analysis reveals strong overall agreement with the HarmBench test classifiers (84.5% on average), which themselves show 93.2% agreement with human evaluation. LlamaGuard 3 shows lower agreement (69.1% on average), consistent with previous findings on HarmBench test sets. These substantial agreement rates with the HarmBench test classifiers support our use of GPT-4o as a verifier for this benchmark.
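Concretely, the agreement rate between two verifiers is the fraction of attack outcomes on which their binary verdicts match. A toy sketch is below; the label arrays are made-up data, not results from the paper.

```python
# Sketch: agreement rate between two binary verifier judgments
# (1 = jailbreak, 0 = refusal) over the same attack conversations.
# The label arrays below are toy data, not results from the paper.
import numpy as np

gpt4o_verdicts     = np.array([1, 1, 0, 1, 0, 1, 1, 0])
harmbench_verdicts = np.array([1, 1, 0, 0, 0, 1, 1, 1])

agreement = (gpt4o_verdicts == harmbench_verdicts).mean()
print(f"agreement rate: {agreement:.1%}")   # 75.0% for this toy example
```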

𝕏Guard-Train

This dataset consists of 30,695 multi-turn conversations, with complete attack-refusal pairs that enable robust multi-turn safety training.

We constructed 𝕏Guard-Train by proportionately sampling 10,000 harmful behaviors from WildJailbreak's vanilla harmful category. For each harmful behavior, our planner generated between two and five distinct attack plans, resulting in diverse attack trajectories incorporating varied personas, contexts, and conversation approaches. We executed these plans with the complete 𝕏-Teaming pipeline, using GPT-4o, Gemini 2.0 Flash, and Deepseek V3 as target models and Qwen-2.5-32B-IT for both attack execution and TextGrad optimization. The pipeline refined attacker queries when verification scores decreased and dynamically adjusted plans that failed to reach their harmful targets. This process yielded highly effective jailbreaking conversations averaging 5.10 turns, where one turn is an attacker prompt paired with the target model's response. For successful jailbreaks, we replaced the harmful model responses with carefully crafted helpful refusals.
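A schematic sketch of this construction loop is below. The callables (generate_plans, run_attack, write_refusal) stand in for the full 𝕏-Teaming pipeline; the names and the single-turn refusal replacement are simplifications for illustration, not the released code's API.

```python
# Schematic sketch of the XGuard-Train construction loop described above.
# The callables passed in are placeholders for the full X-Teaming pipeline.
import random

def build_xguard_train(behaviors, generate_plans, run_attack, targets, write_refusal):
    """behaviors: harmful behaviors sampled from WildJailbreak's vanilla-harm split."""
    dataset = []
    for behavior in behaviors:
        for plan in generate_plans(behavior, n=random.randint(2, 5)):  # two to five plans per behavior
            target = random.choice(targets)                            # GPT-4o, Gemini 2.0 Flash, or Deepseek V3
            result = run_attack(behavior, plan, target)                # full pipeline with TextGrad refinement
            if not result.success:
                continue
            conversation = list(result.conversation)                   # [(attacker_prompt, target_response), ...]
            # For successful jailbreaks, swap the harmful final response for a
            # carefully crafted helpful refusal (simplified: only the last turn).
            last_prompt, _harmful_response = conversation[-1]
            conversation[-1] = (last_prompt, write_refusal(behavior, last_prompt))
            dataset.append({"behavior": behavior, "conversation": conversation})
    return dataset
```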

Safety Model Results

Evaluation of models fine-tuned with 𝕏Guard-Train, TuluMix, and SafeMT across multi-turn safety, single-turn safety, and general capabilities.

Multi-turn safety ASR ↓: 𝕏-Team (Ours), ActorAttack, Average. Single-turn safety ASR ↓: DANᵃ, WildGuardᵇ (Adv/Van), XSTestᶜ. Capability accuracy ↑: MMLU, GSM8K, MATH, GPQA.

| Model | 𝕏-Team (Ours) | ActorAttack | Average | DANᵃ | WildGuardᵇ Adv/Van | XSTestᶜ | MMLU | GSM8K | MATH | GPQA |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | | | | | | | | | | |
| TuluMix | 80.5 | 44.0 | 62.3 | 2.3 | 25.8/6.7 | 24.0 | 0.65 | 0.59 | 0.14 | 0.24 |
| TuluMix + SafeMT | 93.7* | 8.9 | 51.3 | 11.3 | 27.3/7.3 | 28.7 | 0.65 | 0.57 | 0.14 | 0.26 |
| TuluMix + 𝕏Guard | 52.2* | 18.9 | 35.6 | 8.3 | 23.7/7.5 | 28.0 | 0.65 | 0.59 | 0.14 | 0.28 |
| Qwen-2.5-7B | | | | | | | | | | |
| TuluMix | 79.2 | 21.4 | 50.3 | 1.0 | 27.3/10.0 | 34.9 | 0.74 | 0.70 | 0.15 | 0.31 |
| TuluMix + SafeMT | 77.4 | 8.8 | 43.1 | 4.3 | 26.1/11.2 | 36.2 | 0.73 | 0.33 | 0.19 | 0.32 |
| TuluMix + 𝕏Guard | 40.9 | 18.2 | 29.6 | 1.6 | 28.8/13.1 | 27.8 | 0.74 | 0.63 | 0.16 | 0.33 |

𝕏Guard models were trained on a 14K subset of the full dataset.
ᵃDAN: Do Anything Now. ᵇWildGuard: Adv = adversarial harm, Van = vanilla harm. ᶜXSTest values are refusal accuracy scores converted to (100 - original score).
*Results use full configuration (50 plans, 5 TextGrad tries, 10 turns).

Our 𝕏Guard-Train-tuned model achieves the lowest average ASR across both multi-turn attack methods. On single-turn safety benchmarks, it protects well against adversarial harm in WildGuard (23.7%), outperforming both the SafeMTData (27.3%) and baseline (25.8%) models, while also maintaining low ASR on other single-turn benchmarks such as Do Anything Now (DAN) and XSTest. It preserves general capabilities across all benchmarks (MMLU, GSM8K, MATH, and GPQA), with GPQA even improving (0.28 vs. 0.26 for SafeMTData and 0.24 for TuluMix). Similar trends appear in our evaluations with Qwen-2.5-7B.

BibTeX

@article{rahman2025xteaming,
  author    = {Rahman, Salman and Jiang, Liwei and Shiffer, James and Liu, Genglin and Issaka, Sheriff and Parvez, Md Rizwan and Palangi, Hamid and Chang, Kai-Wei and Choi, Yejin and Gabriel, Saadia},
  title     = {𝕏-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents},
  journal   = {arXiv preprint},
  year      = {2025},
}