V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

1. Motivation（研究动机）

现有方法的问题： 把 policy-gradient online RL 直接用于 diffusion / flow matching 模型时，精确 $l o g π_{θ} (o ∣ c)$ 通常不可得。主流绕法是把采样轨迹建成 MDP，逐步优化 reverse transition kernel；这虽然稳定，但有三个具体瓶颈：收敛慢、被绑定到一阶随机 SDE 采样、优化目标与 rollout transition kernel 强耦合，导致 MixGRPO / BranchGRPO 等方法不得不用更复杂的采样树、滑窗或 ODE-SDE 混合设计来补效率。
本文要解决的问题： 重新启用与 diffusion ELBO / pretraining loss 相连的 likelihood surrogate，把它放进 GRPO 里直接优化最终样本的“原子动作”概率，同时修复先前 DDPO / FPO 式 surrogate 在视觉生成上不稳定、欠收敛的问题。
为什么重要： 如果 ELBO surrogate 可稳定工作，就能把 pretraining objective、高阶 ODE rollout 与 online RL 对齐到同一框架里，避免 MDP 化带来的 per-step policy coupling；在多 reward 图像对齐任务上，这意味着更少 function evaluations、更少梯度步、更简单的实现，同时保持或超过强基线质量。

2. Idea（核心思想）

核心洞察： ELBO surrogate 本身不是失败根因，失败主要来自 surrogate 的相对量级方差过大：不同时间步、不同噪声样本造成的 loss scale 变化会淹没 group-relative reward signal。只要让同组样本在可比较的 timestep-noise basis 上估计 likelihood，并控制更新步幅，ELBO surrogate 可以比 MDP-based GRPO 更简单、更快。
关键创新： V-GRPO 用 negative weighted pretraining loss 近似 $lo g π_{θ} (o_{i} ∣ c)$ ，再用 GRPO 的 importance ratio 做 online update；同时引入三类 surrogate variance reduction：group-shared timestep-noise pairs、stratified timestep sampling、adaptive loss weighting，并按场景使用 importance ratio clipping / KL penalty / advantage soft-clipping 控制步幅。
与代表性方法的本质区别： 相比 MixGRPO / BranchGRPO 把 generation 建成 MDP 并优化 trajectory transitions，V-GRPO 把完整生成结果当作一个 atomic action，用 ELBO-style surrogate 直接近似 final-output likelihood；相比 DiffusionNFT 的正负策略对比，它不需要维护两套模型权重。

3. Method（方法）

总体框架

直觉段落。 传统 MDP 化的视觉 RL 把每个 denoising step 都当作 action，这让 policy gradient 可计算，但也把训练强行锁进一阶 stochastic sampler：rollout 生成时的每个 transition kernel 必须和优化时的 log-prob 对齐。V-GRPO 反过来利用 diffusion / flow matching 预训练本来就在估计“给定数据样本在某个 noise level 下的重构误差”这一事实：对于固定生成样本 $o_{i}$ ，其 weighted denoising loss 可看成 negative likelihood surrogate。问题在于 raw surrogate 的 scale 很不均匀，因此本文的核心不是发明复杂策略梯度，而是把这个 surrogate 的估计方差压到 group-relative advantage 能稳定发挥作用的范围。

Figure 1 解读： 图中统计同一类 per-sample loss term $ℓ_{i}^{θ} (t_{j}, ϵ_{j})$ 在不同 timestep 上的均值与方差。它显示 raw ELBO surrogate 的量级随 timestep 明显变化；若每个 output 独立抽 timestep / noise，同组样本之间的 surrogate 差异会混入大量采样噪声，导致 advantage 排序不再可靠。

Figure 2 解读： 该图把 surrogate magnitude 与 gradient norm 关联起来，说明 surrogate 方差会进一步转化为梯度方差。应用三种 variance-reduction 技术后，组内 CV 从 0.170 降到 0.038，overall surrogate CV 从 0.230 降到 0.128，gradient norm 对 surrogate magnitude 的二次拟合 $R^{2}$ 从 0.406 降到 0.328。

ELBO-based likelihood surrogate

预训练 denoising / flow matching 模型通常最小化 weighted regression：

L_{w} (θ) = E_{t, x, ϵ} [w_{t} NN_{t}^{θ} (z_{t}) - r_{t} (x, ϵ)_{2}^{2}] .

V-GRPO 将最终输出 $o_{i}$ 和 prompt $c$ 条件化后，把 negative loss 替代 log likelihood：

lo g π^{θ} (o_{i} ∣ c) \leftarrow - L_{w} (θ ∣ o_{i}, c) .

对应 importance ratio 为：

ρ_{i}^{θ} = exp (- L_{w} (θ ∣ o_{i}, c) + L_{w} (θ_{old} ∣ o_{i}, c)) .

实际用 $N_{MC}$ 个 timestep-noise pair 估计：

\hat{L}_{w} (θ ∣ o_{i}, c) = \frac{1}{N _{MC}} j = 1 \sum N_{MC} ℓ_{i}^{θ} (t_{j}, ϵ_{j}),

ℓ_{i}^{θ} (t_{j}, ϵ_{j}) = w_{t_{j}} NN_{t_{j}}^{θ} (z_{t_{j}}, c) - r_{t_{j}} (o_{i}, ϵ_{j})_{2}^{2} .

三个 surrogate variance reduction 组件

Group-shared timestep-noise pairs： 对同一个 prompt $c$ 下的 $G$ 个输出，共享同一组 ${(t_{j}, ϵ_{j})}_{j = 1}^{N_{MC}}$ ，避免同组样本因随机抽样 basis 不同而不可比较。
Stratified timestep sampling： 把 schedule 分成 $N_{MC}$ 个等长区间，每个区间抽一个 timestep，保证每个样本覆盖完整噪声时间轴。
Adaptive loss weighting： 先把 model output 重参数化成 $x$ -prediction，如 rectified flow 中 $x^{θ} (z_{t}) = z_{t} - t NN_{t}^{θ} (z_{t})$ ，再用 self-normalized loss：

L^{adaptive} (θ ∣ o_{i}, c) = E_{t, ϵ} [\frac{x ^{θ} ( z _{t} ) - o _{i} _{2}^{2}}{sg ( \frac{1}{d} ∥ x ^{θ} ( z _{t} ) - o _{i} ∥ _{1} )}] .

三类 gradient-step regulation

Importance ratio clipping： 默认最常用，按 PPO/GRPO 形式限制 $ρ_{i}^{θ}$ ，在 FLUX.1-dev 主实验中足以防止 loss spikes。
KL penalty： 不额外维护 reference model，而对 behavior policy $θ_{old}$ 罚偏离，只需存旧 surrogate：

D_{i}^{simple} (π^{θ} ∥ π^{θ_{old}}) = E_{t, ϵ} [x^{θ} (z_{t}) - x^{θ_{old}} (z_{t})_{2}^{2}] .

Advantage soft-clipping： 对 fully on-policy 或采样步数很少的设置，用

A^{soft} = η \cdot tanh (\frac{A}{η})

平滑限制极端 advantage；在 SD 3.5 M Stage-1 与 FLUX 16-step 设置更有用。

论文算法伪代码（非源码实现）

代码搜索说明：论文脚注写明 https://github.com/tang-bd/v-grpo，但当前公开访问为 404；GitHub 用户 tang-bd 的公开 repo 列表中未出现 v-grpo；WebSearch / GitHub code search 也未找到可核对官方实现。因此下面是按 Algorithm 1 和公式整理的论文级 PyTorch-style 伪代码，不是源代码复刻。

import torch
import torch.nn.functional as F
 
 
def group_relative_advantages(rewards: torch.Tensor, eta: float | None = None):
    # rewards: [batch, group, num_rewards] or [batch, group]
    if rewards.dim() == 3:
        rewards = rewards.mean(dim=-1)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    adv = (rewards - mean) / std
    if eta is not None:
        adv = eta * torch.tanh(adv / eta)
    return adv

import torch
 
 
def sample_shared_stratified_pairs(batch_size, group_size, n_mc, latent_shape, device):
    # one shared basis per prompt group, then expanded to all outputs of that prompt
    bins = torch.linspace(0, 1, n_mc + 1, device=device)
    u = torch.rand(batch_size, n_mc, device=device)
    t = bins[:-1] + u * (bins[1:] - bins[:-1])
    eps = torch.randn(batch_size, n_mc, *latent_shape, device=device)
    t = t[:, None].expand(batch_size, group_size, n_mc)
    eps = eps[:, None].expand(batch_size, group_size, n_mc, *latent_shape)
    return t, eps

import torch
 
 
def adaptive_elbo_surrogate(model, sample_o, prompt_emb, t, eps, schedule):
    # sample_o: generated clean image/latent o_i; t, eps shared within prompt group
    z_t = schedule.noise(sample_o, eps, t)
    pred = model(z_t, t, prompt_emb)
    x_pred = schedule.to_x_prediction(z_t, pred, t)
    l2 = (x_pred - sample_o).pow(2).flatten(2).sum(dim=-1)
    l1 = (x_pred - sample_o).abs().flatten(2).mean(dim=-1).detach().clamp_min(1e-8)
    return (l2 / l1).mean(dim=-1)  # average over N_MC

import torch
 
 
def v_grpo_update(model, old_model, batch, rewards, optimizer, beta, eps_clip, eta, n_mc):
    prompts, old_outputs = batch["prompts"], batch["outputs"]
    t, noise = sample_shared_stratified_pairs(
        batch_size=old_outputs.size(0),
        group_size=old_outputs.size(1),
        n_mc=n_mc,
        latent_shape=old_outputs.shape[2:],
        device=old_outputs.device,
    )
    with torch.no_grad():
        old_loss = adaptive_elbo_surrogate(old_model, old_outputs, prompts, t, noise, batch["schedule"])
    new_loss = adaptive_elbo_surrogate(model, old_outputs, prompts, t, noise, batch["schedule"])
    ratio = torch.exp(-new_loss + old_loss)
    adv = group_relative_advantages(rewards, eta=eta)
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    objective = torch.minimum(ratio * adv, clipped)
    kl = (batch["schedule"].to_x_prediction_current(model, old_outputs, prompts, t, noise)
          - batch["schedule"].to_x_prediction_current(old_model, old_outputs, prompts, t, noise)).pow(2).mean()
    loss = -(objective.mean() - beta * kl)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

Code reference: no public code found (2026-04-28) — paper footnote URL https://github.com/tang-bd/v-grpo returned 404; pseudocode and mapping are paper-derived, not implementation-anchored.

Paper Concept	Source File	Key Class/Function
ELBO surrogate replacing $lo g π^{θ} (o_{i} ∣ c)$	论文 Section 4.1 / Eq. before Eq. (loss_term)	`adaptive_elbo_surrogate`（上方论文级伪代码）
Group-shared stratified timestep-noise pairs	Algorithm 1 + Section 4.3	`sample_shared_stratified_pairs`（上方论文级伪代码）
Advantage soft-clipping	Algorithm 1 + Section 4.4	`group_relative_advantages(..., eta)`（上方论文级伪代码）
Importance ratio + GRPO clipped objective + KL penalty	Algorithm 1 + Eq. for $ρ_{i}^{θ}$ / $D_{i}^{s im pl e}$	`v_grpo_update`（上方论文级伪代码）

关键图表解读

Figure 3 解读： FLUX.1-dev 主实验定性图展示 V-GRPO 在 alignment、coherence、style 上优于 baseline；第二个例子突出其未使用 OCR reward / task-specific dataset 也能改善 text rendering。

Figure 4 解读： FLUX 消融显示 naive ELBO-GRPO 明显不稳定；去掉 group-shared pair 或 stratified sampling 都会破坏收敛，去掉 adaptive weighting 则略降性能，说明稳定性主要来自 surrogate basis 的可比较性。

Figure 5 解读： 对 FLUX 来说，单独 KL penalty 无法抑制 loss spikes；importance ratio clipping 更适合作为常规稳定器。

Figure 6 解读： 在 SD 3.5 M fully on-policy Stage-1 中，soft-clipping advantage 可以稳定训练；这是因为只有 1 个 gradient step 时 $θ = θ_{o l d}$ ，importance ratio clipping 不再起作用。

Figure 7 解读： 采样步数降到 16 时，advantage soft-clipping 对 FLUX 训练也有稳定效果，说明它适合“每次 rollout 信息更稀疏 / 更新更激进”的场景。

Figure 8 解读： 在 SD 3.5 M Stage-2 的粗粒度 GenEval reward 任务中，importance ratio clipping + 2 gradient steps 优于 advantage soft-clipping，说明不同 reward granularity 需要不同 update regulator。

Figure 9 解读： $N_{MC} = 2$ 不足以收敛， $N_{MC} = 4$ 已较好，增到 8 只有边际收益；这与 MDP-based 方法中“optimized timesteps”饱和现象一致。

Figure 10 解读： SD 3.5 M 对 ELBO surrogate 更鲁棒；去掉单个组件影响不如 FLUX 剧烈，但三项技术整体仍有收益。

Figure 11 解读： adaptive loss weighting 中使用 $x$ -prediction 最优； $ϵ$ -prediction 会训练 collapse， $v$ -prediction 稳定但收敛略慢。

Figure 12 解读： 补充 FLUX 定性图表明 V-GRPO 不只提升局部美学，也能改善复杂世界知识和 prompt coherence。

Figure 13 解读： SD 3.5 M 多阶段训练的定性图展示 V-GRPO 在 alignment、coherence、style 上达到与 DiffusionNFT 接近或更好的结果，同时训练步数更少。

4. Experimental Setup（实验设置）

Base models： FLUX.1-dev（guidance-distilled，通常不显式使用 CFG）与 Stable Diffusion 3.5 Medium（训练与评估均关闭 CFG；作者观察 online RL 可起到 guidance distillation 作用）。
Datasets / prompts： FLUX.1-dev 实验使用 HPDv2 prompts；SD 3.5 M 按 DiffusionNFT，GenEval / OCR 用 Flow-GRPO prompt sets，其余训练用 Pick-a-Pic，评估用 DrawBench。论文未在正文列出每个 prompt set 的完整样本数；只明确训练设置与评估协议跟随对应 baseline。
Rewards / metrics： 规则类 GenEval（组合图文对齐）、OCR（文字渲染）；模型类 HPSv2.1、PickScore、CLIPScore、ImageReward、UnifiedReward、Aesthetics，评估 image quality、image-text alignment 与 human preference。
Training config： FLUX 主实验训练 300 iterations，每轮 4 gradient steps，global batch size 8/step，group size 12，AdamW lr=1e-5、weight decay 1e-4， $N_{MC} = 4$ ，720×720 rollout，16 或 25 sampling steps，评估 1024×1024 / 50 steps，并用 MixGRPO hybrid sampling（ $p_{mi x} = 0.8$ ）。SD 3.5 M 使用 LoRA r=32, alpha=64，AdamW lr=3e-4、weight decay 1e-4，global batch size 48，group size 24，五阶段总 580 gradient steps，512×512 / 40 steps；BF16 rollout，master weights、surrogate 与 backward 用 FP32。

5. Experimental Results（实验结果）

Teaser speed / average reward（Table 1）

Method	Steps	$NF E_{θ_{o l d}}$	$NF E_{θ}$	Reward
FLUX.1-dev	—	—	—	1.25
+ BranchGRPO	300	13.68	13.68	1.40
+ MixGRPO	300	25	4	1.41
+ V-GRPO	150	16 + 4	4	1.42
+ V-GRPO	300	16 + 4	4	1.45
SD 3.5 M (w/o CFG)	—	—	—	0.95
+ DiffusionNFT	1.7K	40 + 40	40	1.71
+ V-GRPO	580	40 + 6.9	6.9	1.71

FLUX.1-dev multi-reward（Table 2）

Method	$NF E_{o l d}$	$NF E_{n e w}$	HPS-v2.1	PickScore	ImageReward	UnifiedReward
FLUX.1-dev	—	—	0.313	0.227	1.088	3.370
DanceGRPO	25	4	0.334	0.225	1.335	3.374
DanceGRPO*	25	4	0.333	0.229	1.235	3.325
DanceGRPO	25	14	0.356	0.233	1.436	3.397
BranchGRPO-WidPru	13.68	8.625	0.364	0.231	1.609	3.383
BranchGRPO-DepPru	13.68	8.625	0.369	0.235	1.625	3.404
BranchGRPO-Mix	13.68	4.25	0.363	0.230	1.598	3.384
BranchGRPO	13.68	13.68	0.363	0.229	1.603	3.386
MixGRPO-Flash*	8	4	0.357	0.232	1.624	3.402
MixGRPO-Flash	16	4	0.358	0.236	1.528	3.407
MixGRPO	25	4	0.367	0.237	1.629	3.418
V-GRPO	16 + 4	4	0.372	0.241	1.731	3.437
V-GRPO	25 + 4	4	0.372	0.241	1.749	3.436

SD 3.5 M multi-stage multi-reward（Table 3）

Method	Steps	$NF E_{o l d}$	$NF E_{n e w}$	GenEval	OCR	PickScore	CLIPScore	HPSv2.1	Aesthetics	ImgRwd	UniRwd
SD XL†	—	—	—	0.55	0.14	0.2242	0.287	0.280	5.60	0.76	2.93
SD 3.5 L†	—	—	—	0.71	0.68	0.2291	0.289	0.288	5.50	0.96	3.25
SD 3.5 M (w/o CFG)	—	—	—	0.24	0.12	0.2051	0.237	0.204	5.13	-0.58	2.02
+ CFG	—	—	—	0.63	0.59	0.2234	0.285	0.279	5.36	0.85	3.03
+ FlowGRPO	>5K	40	40	0.95	0.66	0.2251	0.293	0.274	5.32	1.06	3.18
+ FlowGRPO	2K	40	40	0.66	0.92	0.2241	0.290	0.280	5.32	0.95	3.15
+ FlowGRPO	4K	40	40	0.54	0.68	0.2350	0.280	0.316	5.90	1.29	3.37
+ DiffusionNFT	1.7K	40 + 40	40	0.94	0.91	0.2380	0.293	0.331	6.01	1.49	3.49
+ V-GRPO	580	40 + 6.9	6.9	0.91	0.91	0.2350	0.298	0.341	6.02	1.52	3.43

KL 保留旧能力（Table 4）

Method	GenEval	OCR
Importance ratio clipping $ϵ = 3 \times 1 0^{- 2}$	0.87	0.93
KL penalty $β = 0.3$	0.91	0.91

训练效率（Appendix Table 5）

Stage	DiffusionNFT Steps	DiffusionNFT NFE	V-GRPO Steps	V-GRPO NFE
1 Human Preferences	800	120	150	48
2 GenEval	300	120	200	60
3 Human Preferences	200	120	150	48
4 GenEval	200	120	50	60
5 OCR	100	120	30	60
Total	1700	120	580	53.8

SD 3.5 M single-reward（Appendix Table 6）

Setting / Method	Steps	GenEval	OCR	PickScore	CLIPScore	HPSv2.1	Aesthetic	ImgRwd	UniRwd
SD 3.5 M (w/o CFG)	—	0.24	0.12	0.2051	0.237	0.204	5.13	-0.58	2.02
+ CFG	—	0.63	0.59	0.2234	0.285	0.279	5.36	0.85	3.03
GenEval / FlowGRPO	4K	0.97	0.30	0.2178	0.277	0.248	5.15	0.74	2.87
GenEval / DiffusionNFT	1K	0.98	0.36	0.2192	0.271	0.251	5.33	0.68	2.91
GenEval / V-GRPO	600	0.97	0.34	0.2150	0.280	0.225	5.20	0.35	2.80
OCR / FlowGRPO	1K	0.66	0.96	0.2194	0.280	0.257	5.18	0.31	2.86
OCR / DiffusionNFT	150	0.54	0.97	0.2163	0.281	0.246	5.19	0.37	2.81
OCR / V-GRPO	25	0.47	0.98	0.2170	0.277	0.243	5.21	0.28	2.83
PickScore / FlowGRPO	4K	0.54	0.60	0.2362	0.257	0.295	6.42	1.17	3.17
PickScore / DiffusionNFT	2K	0.53	0.64	0.2403	0.270	0.315	6.17	1.29	3.40
PickScore / V-GRPO	300	0.66	0.62	0.2403	0.267	0.308	6.42	1.30	3.26

关键发现与局限

主要结论： V-GRPO 在 FLUX 主实验四项 reward 指标全领先或并列最优；在 SD 3.5 M 上以 580 steps 匹配 DiffusionNFT 1.7K steps 的综合 reward，并以更低 $NFE$ 达到 3× 梯度步加速。
消融结论： surrogate variance reduction 是 FLUX 成败关键； $N_{MC}$ 存在饱和，4 是主要实验折中；KL penalty 适合保留旧能力，importance ratio clipping 适合抑制 spikes，advantage soft-clipping 适合 fully on-policy 或低采样步数场景。
作者明示边界： 结论指向“robustness and scalability”仍需进一步研究；官方源码当前未公开可访问，复现需要等待或自行实现 Algorithm 1。

Paper Notes

探索

V-GRPO - Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think