DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue (Algorithm), Weinan Dai, Tiantian Fan, et al. (Infrastructure), Jiaze Chen, et al. (Dataset)
Affiliations: ByteDance Seed, Institute for AI Industry Research (AIR), Tsinghua University, The University of Hong Kong, SIA-Lab of Tsinghua AIR and ByteDance Seed
arXiv: 2503.14476
Project Page: dapo-sia.github.io
GitHub: BytedTsinghua-SIA/DAPO

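1. Motivation

This paper studies large-scale reinforcement learning (RL) training systems for LLMs. The authors point out that although test-time scaling (e.g., OpenAI o1, DeepSeek R1) uses RL to elicit long Chain-of-Thought reasoning, the actual training details have remained undisclosed:

The authors trained Qwen2.5-32B with naive GRPO and reached only 30 points on AIME 2024, far below the 47 points of DeepSeek-R1-Zero-Qwen-32B;
A closer analysis showed that naive GRPO suffers from entropy collapse (policy entropy drops rapidly), reward noise (truncated samples receive inaccurate rewards), and training instability;
The community has also widely reported difficulty reproducing R1's results, which suggests that key training details may have been omitted.

The goal of the paper is therefore to release a complete, reproducible large-scale LLM RL training system with industry-level performance, including the algorithm, code, and data.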

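2. Idea

The core idea of DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is to add four key techniques on top of the GRPO framework to address entropy collapse, zero gradients, loss imbalance, and reward noise in long-CoT RL training.

The four techniques are:

Clip-Higher: decouples the lower and upper clip thresholds and raises the upper one, giving low-probability tokens more room to grow and preventing entropy collapse;
Dynamic Sampling: filters out prompts with accuracy=0 or accuracy=1, so that every prompt in a batch contributes a non-zero gradient;
Token-Level Policy Gradient Loss: aggregates the loss over tokens rather than over samples, so that tokens in long sequences are not under-weighted;
Overlong Reward Shaping: replaces the hard penalty on overlong, truncated samples with a soft penalty to reduce reward noise.

The fundamental difference from DeepSeek-R1-Zero is that DAPO does not rely on a KL penalty term (in the long-CoT setting the model needs to move far from its initial distribution) and instead relies on the four techniques above to keep training stable.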

3. Method

3.1 The DAPO Objective

The full DAPO objective is:

$$J_{DAPO}(θ) = \mathbb{E}_{(q, a) \sim D,\ \{o_i\}_{i=1}^{G} \sim π_{θ_{old}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(θ) \hat{A}_{i,t},\ \mathrm{clip}\left( r_{i,t}(θ),\ 1 - ε_{low},\ 1 + ε_{high} \right) \hat{A}_{i,t} \right) \right]$$

$$\text{s.t.} \quad 0 < \left| \{ o_i \mid \mathrm{is\_equivalent}(a, o_i) \} \right| < G$$

where:

$$r_{i,t}(θ) = \frac{π_{θ}(o_{i,t} \mid q, o_{i,<t})}{π_{θ_{old}}(o_{i,t} \mid q, o_{i,<t})}, \qquad \hat{A}_{i,t} = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}$$

The constraint $0 < |\cdot| < G$ means that a prompt enters training only if its $G$ responses include both correct and incorrect ones (the core constraint of Dynamic Sampling).
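A minimal sketch of the group-normalized advantage and the Dynamic Sampling constraint defined above. The function names, the use of the population standard deviation, and the small `eps` added to the denominator for numerical safety are illustrative assumptions, not details taken from the paper or from verl.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of G responses.

    Every token of response i shares the same advantage value.
    eps is an assumed safeguard against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def satisfies_constraint(rewards):
    """True if the group has at least one correct and one incorrect response."""
    n_correct = sum(1 for r in rewards if r > 0)
    return 0 < n_correct < len(rewards)


mixed = [1.0, -1.0, -1.0, 1.0]        # two correct, two incorrect
all_correct = [1.0, 1.0, 1.0, 1.0]    # accuracy = 1

print(group_advantages(mixed), satisfies_constraint(mixed))
print(group_advantages(all_correct), satisfies_constraint(all_correct))
```

When all rewards in a group are identical, the numerator is zero for every response, so every advantage is zero. That is exactly the kind of group the constraint excludes.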

3.2 Clip-Higher

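Reading Figure 2: the left panel shows AIME accuracy and the right panel shows the generation entropy of the actor model. Without Clip-Higher (purple line), entropy drops rapidly to near 0 after about 500 steps, meaning the policy has collapsed to almost deterministic outputs; with Clip-Higher (blue line), entropy stays in a reasonable range and keeps rising, and accuracy is also higher.

Problem: standard PPO/GRPO uses a symmetric clip range $[1 - ε, 1 + ε]$. With $ε = 0.2$, a token with probability 0.01 can be raised to at most $0.01 \times 1.2 = 0.012$, whereas a token with probability 0.9 can be raised to $0.9 \times 1.2 = 1.08$ (capped at 1). The upper clip therefore severely limits how much the probability of low-probability “exploration” tokens can increase.

Solution: decouple the lower and upper clip thresholds, with $ε_{low} = 0.2$ and $ε_{high} = 0.28$. Raising the upper bound gives low-probability tokens more room, while the lower bound is left unchanged so that the probabilities of uncertain tokens are not pushed down to 0.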
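The arithmetic in this subsection can be checked with a short sketch. It assumes the simplified view used in the text: for a positive advantage, the clipped objective stops rewarding further increases once the ratio reaches $1 + ε_{high}$, so the largest probability a single update is encouraged to reach is `p_old * (1 + eps_high)`, capped at 1. The helper name `max_prob_after_update` is hypothetical.

```python
def max_prob_after_update(p_old, eps_high):
    """Largest token probability the clipped objective still rewards reaching."""
    return min(1.0, p_old * (1.0 + eps_high))


for p_old in (0.01, 0.9):
    symmetric = max_prob_after_update(p_old, eps_high=0.2)
    decoupled = max_prob_after_update(p_old, eps_high=0.28)
    print(f"p_old={p_old}: eps_high=0.2 -> {symmetric:.4f}, "
          f"eps_high=0.28 -> {decoupled:.4f}")
```

For the low-probability token the ceiling moves from 0.012 to 0.0128, while the high-probability token is already capped at 1 in both settings, so the wider upper bound mainly benefits rare tokens.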

3.3 Dynamic Sampling

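Reading Figure 3: the left panel shows the mean probability of up-clipped tokens (all below 0.2, which supports the intuition behind Clip-Higher); the right panel shows the fraction of samples with accuracy=1 growing to over 60% during training, meaning that for more and more prompts all G responses are correct, the advantage is zero, and the gradient is zero.

Problem: GRPO computes the advantage by normalizing rewards within each group. If all $G$ responses to a prompt are correct (identical rewards), every advantage is zero and the prompt produces no gradient; the same holds when all responses are wrong. The share of such prompts keeps growing during training, so the effective gradient signal becomes increasingly sparse.

Solution: over-sample and filter. Sampling continues until the batch is filled with prompts whose accuracy is neither 0 nor 1, which guarantees that every prompt in the batch yields a non-zero gradient.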
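To get a feel for how much over-sampling the filter implies, here is a rough sketch under a strong simplifying assumption that is not made in the paper: each of the $G$ responses to a prompt is correct independently with the same probability `p`. The prompt then survives the filter with probability $1 - p^G - (1 - p)^G$. Both helper names are hypothetical.

```python
def keep_probability(p, G):
    """P(0 < n_correct < G) when each response is correct independently w.p. p."""
    return 1.0 - p ** G - (1.0 - p) ** G


def expected_prompts_to_sample(batch_size, keep_prob):
    """Expected number of prompts to roll out before batch_size of them survive."""
    if keep_prob <= 0.0:
        raise ValueError("no prompt can pass the filter when keep_prob is 0")
    return batch_size / keep_prob


for p in (0.5, 0.95, 0.99):
    kp = keep_probability(p, G=16)
    print(f"p={p}: keep probability={kp:.3f}, "
          f"expected prompts for a batch of 512={expected_prompts_to_sample(512, kp):.0f}")
```

Under this toy model, prompts the policy solves almost every time are filtered out most often, which matches the observation that the accuracy=1 fraction grows as training progresses.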

3.4 Token-Level Policy Gradient Loss

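Reading Figure 4: the left panel shows generation entropy and the right panel shows mean response length. Without the token-level loss (purple line), entropy is unstable and response length fluctuates sharply between 2000 and 4000; with the token-level loss (blue line), entropy rises steadily and length grows more healthily.

Problem: the original GRPO first averages the loss over tokens within each sample and then averages over samples. Every sample therefore carries the same weight regardless of its length, so each token in a long sequence is diluted while each token in a short sequence is amplified.

Solution: average globally over the total number of tokens, i.e., the normalization changes from $\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|}$ to $\frac{1}{\sum_{i=1}^{G} |o_i|}$. Reasoning patterns in long sequences then have more influence, and gibberish or repetitive patterns in long sequences are also penalized more effectively.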
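The difference between the two aggregation schemes is easiest to see from the weight each individual token receives. The sketch below is illustrative only: the function name, the mode strings, and the two response lengths are assumptions chosen for the example.

```python
def per_token_weights(lengths, mode):
    """Return the weight each token of each sequence receives in the aggregated loss."""
    if mode == "sample-mean":
        # Mean over tokens inside each sample, then mean over the G samples
        return [1.0 / (len(lengths) * n) for n in lengths]
    if mode == "token-mean":
        # One global mean over every token in the group
        total_tokens = sum(lengths)
        return [1.0 / total_tokens for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")


lengths = [100, 4000]  # one short and one long response (hypothetical lengths)
for mode in ("sample-mean", "token-mean"):
    weights = per_token_weights(lengths, mode)
    shares = [w * n for w, n in zip(weights, lengths)]
    print(mode, "per-token weights:", weights, "per-sequence share:", shares)
```

With sample-level averaging, a token in the 100-token response weighs 40 times as much as a token in the 4000-token response. With token-level averaging every token weighs the same, so the long response accounts for 4000/4100 of the loss.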

3.5 Overlong Reward Shaping

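Reading Figure 5: the left panel shows AIME accuracy and the right panel shows entropy. Without overlong filtering (blue line), accuracy oscillates and entropy blows up to above 4; with it (purple line), training is much more stable.

Problem: RL training usually sets a maximum generation length $L_{max}$ and truncates samples that exceed it. Truncated samples are judged incorrect by default and given a reward of $-1$, even though their reasoning may be correct and merely too long. This introduces reward noise.

Solution: first apply Overlong Filtering (mask the loss of truncated samples), then add a Soft Overlong Punishment: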

$$R_{length}(y) = \begin{cases} 0, & |y| \leq L_{max} - L_{cache} \\ \dfrac{(L_{max} - L_{cache}) - |y|}{L_{cache}}, & L_{max} - L_{cache} < |y| \leq L_{max} \\ -1, & |y| > L_{max} \end{cases}$$

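Here $L_{cache}$ is the length of the soft-penalty buffer. As a response approaches the maximum length, the penalty grows gradually instead of being applied as an abrupt cutoff.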
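As a worked example, plug in the values listed in the training configuration in Section 4.2, $L_{max} = 20480$ and $L_{cache} = 4096$. The three response lengths below are hypothetical, chosen to hit the unpenalized region, the middle of the buffer, and the maximum length.

```latex
\begin{aligned}
L_{max} - L_{cache} &= 20480 - 4096 = 16384 \\
|y| = 12000 &: \quad R_{length}(y) = 0 \\
|y| = 18432 &: \quad R_{length}(y) = \frac{16384 - 18432}{4096} = -0.5 \\
|y| = 20480 &: \quad R_{length}(y) = \frac{16384 - 20480}{4096} = -1
\end{aligned}
```

The penalty is continuous at both ends of the buffer: it starts at 0 at 16384 tokens and reaches $-1$ at exactly $L_{max}$, matching the hard penalty applied beyond that point.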

3.6 Algorithm 1: The Full DAPO Algorithm

Algorithm: DAPO
Input: initial policy π_θ, reward model R, task prompts D, hyperparams ε_low, ε_high
1: for step = 1, ..., M do
2:     Sample batch D_b from D
3:     Update old policy π_θ_old ← π_θ
4:     Sample G outputs {o_i}_{i=1}^G ~ π_θ_old(·|q) for each q ∈ D_b
5:     Compute rewards {r_i}_{i=1}^G for each o_i via R
6:     Filter out o_i and add remaining to dynamic sampling buffer  (Eq.11)
7:     if buffer size n_b < N:
8:         continue  # keep sampling until batch is full
9:     For each o_i in buffer, compute Â_{i,t} for each token (Eq.9)
10:    for iteration = 1, ..., μ do
11:        Update π_θ by maximizing DAPO objective (Eq.8)
Output: π_θ

3.7 Pseudocode (based on the public verl implementation)

Component A: Clip-Higher (decoupled clip ratios)

# Source: verl/trainer/ppo/core_algos.py
# Corresponds to: actor.clip_ratio_low, actor.clip_ratio_high in config
import torch

def compute_policy_loss_with_clip_higher(log_prob, old_log_prob, advantages,
                                          clip_ratio_low, clip_ratio_high):
    # Per-token importance ratio pi_theta / pi_theta_old, computed in log space
    ratio = torch.exp(log_prob - old_log_prob)
    # Decoupled clipping: different lower and upper bounds
    clipped_ratio = torch.clamp(ratio, 1 - clip_ratio_low, 1 + clip_ratio_high)
    pg_loss1 = ratio * advantages
    pg_loss2 = clipped_ratio * advantages
    # Negate the clipped surrogate so that minimizing the loss maximizes the objective
    loss = -torch.min(pg_loss1, pg_loss2)
    return loss  # per-token loss; aggregation is handled separately (Component C)

Component B: Dynamic Sampling (filter accuracy=0 and accuracy=1 groups)

# Source: verl/trainer/ppo/ray_trainer.py (training loop)
# The dynamic sampling buffer continues sampling until batch is full
# Simplified sketch: sample_batch and reward_fn are placeholder callables, not verl APIs

def dynamic_sampling_step(policy, prompts, G, batch_size, sample_batch, reward_fn):
    buffer = []
    while len(buffer) < batch_size:
        batch = sample_batch(prompts)
        for q in batch:
            outputs = [policy.generate(q) for _ in range(G)]
            rewards = [reward_fn(q, o) for o in outputs]
            # Rule-based reward: positive means the answer was judged correct
            n_correct = sum(1 for r in rewards if r > 0)
            # Filter: keep only prompts with 0 < n_correct < G
            if 0 < n_correct < G:
                buffer.append((q, outputs, rewards))
    # The last sampled batch may overfill the buffer, so trim to batch_size
    return buffer[:batch_size]

Component C: Token-Level Policy Gradient Loss

# Source: verl/trainer/ppo/core_algos.py
# loss_agg_mode="token-mean" in config
 
def token_level_loss_aggregation(per_token_loss, response_mask):
    # Instead of: mean over tokens per sample, then mean over samples
    # Do: sum all token losses, divide by total token count
    total_loss = (per_token_loss * response_mask).sum()
    total_tokens = response_mask.sum()
    return total_loss / total_tokens

Component D: Overlong Reward Shaping (Soft Overlong Punishment)

# Source: verl reward manager (reward.reward_manager.name=dapo)

def compute_overlong_penalty(response_length, max_resp_len, buffer_len, penalty_factor):
    # Responses up to cache_start (L_max - L_cache) are not penalized
    cache_start = max_resp_len - buffer_len
    if response_length <= cache_start:
        return 0.0
    elif response_length <= max_resp_len:
        # Linear ramp from 0 down to -penalty_factor across the buffer
        return penalty_factor * (cache_start - response_length) / buffer_len
    else:
        # Beyond the maximum length: full penalty
        return -1.0 * penalty_factor

3.8 Code-to-paper mapping table

Paper Concept	Source File	Key Class/Function
DAPO training entry point	`verl/trainer/main_ppo.py`	Main entry point
GRPO advantage computation	`verl/trainer/ppo/core_algos.py`	`compute_grpo_outcome_advantage`
Clip-Higher (decoupled clip)	`verl/trainer/ppo/core_algos.py`	`clip_ratio_low`, `clip_ratio_high`; `torch.clamp(ratio, 1-cliprange_low, 1+cliprange_high)`
Token-Level Loss	`verl/trainer/ppo/core_algos.py`	`agg_loss()`, `loss_agg_mode="token-mean"`
Dynamic Sampling (Filter Groups)	`verl/trainer/config/algorithm.py`	`FilterGroupsConfig(enable=True, metric="seq_reward")`
Overlong Reward Shaping	`verl/workers/reward_manager/dapo.py`	`DAPORewardManager.__call__()`, `overlong_buffer_cfg`
Math answer verification	`verl/utils/reward_score/math_dapo.py`	Rule-based reward
DAPO training script	`examples/gmpo_trainer/test_dapo_7b_math.sh`	Full hyperparameter configuration
Ray distributed training	`verl/trainer/ppo/ray_trainer.py`	`RayPPOTrainer`
DAPO paper/data/models	`github.com/BytedTsinghua-SIA/DAPO`	Paper, eval scripts, DAPO-Math-17K dataset

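4. Experimental Setup

4.1 Model and Data

Base model: Qwen2.5-32B (pretrained model, not the instruct version)
Training data: DAPO-Math-17K, math problems scraped from the web and competition homepages and converted by an LLM into integer-answer format
Reward: rule-based, $R(\hat{y}, y) = 1$ if correct, $-1$ otherwise (no reward model; correctness is checked directly)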

4.2 Training Configuration

Optimizer: AdamW, lr = $1 \times 10^{-6}$, 20-step linear warmup
Rollout: prompt batch size = 512, 16 responses sampled per prompt ($G = 16$)
Mini-batch: 512, i.e., 16 gradient updates per rollout step
Max length: $L_{max} = 16384 + 4096 = 20480$ tokens (16K expected length + 4K soft-penalty buffer)
Clip-Higher: $ε_{low} = 0.2$, $ε_{high} = 0.28$
Evaluation: AIME 2024, temperature=1.0, top_p=0.7, repeated 32 times and reported as avg@32
Framework: verl (HybridFlow), distributed via Ray
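The batch arithmetic in this configuration can be reproduced with a few lines. The sketch assumes that the mini-batch size of 512 counts responses rather than prompts, since that is the reading under which one rollout step yields 16 gradient updates; the function name and dictionary keys are hypothetical.

```python
def rollout_budget(prompt_batch_size, responses_per_prompt, mini_batch_size, max_len):
    """Per-step counts implied by the rollout and mini-batch settings."""
    responses = prompt_batch_size * responses_per_prompt
    if responses % mini_batch_size != 0:
        raise ValueError("responses per step must be divisible by the mini-batch size")
    return {
        "responses_per_step": responses,
        "gradient_updates_per_step": responses // mini_batch_size,
        # Worst case: every response runs to the maximum generation length
        "max_generated_tokens_per_step": responses * max_len,
    }


budget = rollout_budget(
    prompt_batch_size=512,
    responses_per_prompt=16,
    mini_batch_size=512,
    max_len=16384 + 4096,
)
print(budget)
```

The token count is only an upper bound on the responses kept for training; it does not include the extra rollouts that Dynamic Sampling discards.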

4.3 Baselines

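DeepSeek-R1-Zero-Qwen-32B: 47 on AIME 2024 (previous SOTA)
Naive GRPO: 30 on AIME 2024 (the authors' baseline)

5. Experimental Results

5.1 Main Results: Ablation Adding the Techniques One by One

Reading Figure 1: AIME 2024 training curve of DAPO on Qwen2.5-32B. avg@32 (solid blue line) reaches 50 at around 6000 steps, surpassing the 47 of DeepSeek-R1-Zero-Qwen-32B (dashed line) with only about 50% of the training steps. pass@32 and cons@32 also keep improving.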
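For readers unfamiliar with the three metrics, the sketch below shows how they are commonly computed for a single problem from k sampled answers; the benchmark score is then the mean over problems. These are the usual definitions (mean accuracy, any-correct, and majority vote) and are assumed here rather than taken from the paper's evaluation code; the function names are hypothetical.

```python
from collections import Counter


def avg_at_k(answers, truth):
    """Fraction of the k sampled answers that are correct."""
    return sum(a == truth for a in answers) / len(answers)


def pass_at_k(answers, truth):
    """1.0 if at least one of the k sampled answers is correct, else 0.0."""
    return float(any(a == truth for a in answers))


def cons_at_k(answers, truth):
    """1.0 if the most frequent (majority-vote) answer is correct, else 0.0."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return float(majority_answer == truth)


# Toy example with k = 8 instead of 32: three correct answers, five identical wrong ones
answers = [42, 42, 42, 7, 7, 7, 7, 7]
print(avg_at_k(answers, 42), pass_at_k(answers, 42), cons_at_k(answers, 42))
```

In this toy case the problem counts as solved under pass@k but not under cons@k, because the majority vote settles on the wrong answer.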

Model	AIME24 avg@32
DeepSeek-R1-Zero-Qwen-32B	47
Naive GRPO	30
+ Overlong Filtering	36
+ Clip-Higher	38
+ Soft Overlong Punishment	41
+ Token-level Loss	42
+ Dynamic Sampling (DAPO)	50

Every technique contributes positively, taking the score step by step from 30 for naive GRPO to 50 for DAPO.
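The per-step gains implied by the table can be read off with a small script. The scores are copied from the table above; the helper name is hypothetical.

```python
ablation = [
    ("Naive GRPO", 30),
    ("+ Overlong Filtering", 36),
    ("+ Clip-Higher", 38),
    ("+ Soft Overlong Punishment", 41),
    ("+ Token-level Loss", 42),
    ("+ Dynamic Sampling (DAPO)", 50),
]


def incremental_gains(rows):
    """Score difference between each row and the row before it."""
    return [(name, score - prev_score)
            for (_, prev_score), (name, score) in zip(rows, rows[1:])]


gains = incremental_gains(ablation)
for name, delta in gains:
    print(f"{name}: +{delta}")
print("total gain:", sum(delta for _, delta in gains))
```

These differences depend on the order in which the techniques were stacked, so they should not be read as the isolated contribution of each technique.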

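5.2 Training Dynamics

Reading Figure 7: the left panel shows mean response length and the right panel shows the reward score. Response length fluctuates between 1000 and 4000 but trends upward, indicating that the model is learning longer reasoning chains. The reward rises fairly steadily, with no obvious reward hacking.

Reading Figure 7c-d: the left panel shows generation entropy and the right panel shows mean generation probability. Entropy stays within a healthy range of 0.3-0.7 and rises slowly; generation probability falls from 0.84 to 0.74, indicating that the model is exploring a more diverse set of token choices.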

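5.3 Efficiency of Dynamic Sampling

Reading Figure 6: comparison with and without Dynamic Sampling. Although Dynamic Sampling requires more sampling (because prompts with accuracy=0 or 1 are filtered out), it reaches the same performance in fewer training steps. Total training time is largely unaffected: the authors argue that the extra sampling adds little overhead because generation time is typically dominated by long-tail samples.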

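5.4 Emergent Behavior

The paper finds that reflection and backtracking behaviors gradually emerge during training, although they are almost absent early on. For example, the model produces self-correction patterns such as “However, wait a moment, let’s rethink…”, which suggests that RL training not only reinforces existing reasoning patterns but also gives rise to new reasoning abilities.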

5.5 Limitations

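The paper validates only on math tasks and shows no results for other domains such as code or scientific reasoning;
Although each of the four DAPO techniques is effective on its own, their interactions are not analyzed systematically;
Training cost remains high (512 prompts × 16 responses = 8192 generations per step), which is not friendly to small teams.

Overall, DAPO's biggest contribution is open-sourcing an industry-level LLM RL system end to end, from algorithm to code to data, which fills the reproducibility gap left by DeepSeek-R1 and OpenAI o1.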
