TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

Authors: TeleBoost Team
Affiliations: Institute of Artificial Intelligence, China Telecom (TeleAI)
arXiv: 2602.07595
Project Page: tele-ai.github.io/TeleBoost
GitHub: Tele-AI/TeleBoost

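1. Motivation

TeleBoost is not a single-algorithm paper; it is a systematic technical report on post-training for video generation. The authors start from the observation that a pretrained video generator is still far from deployment-ready. Even with a strong backbone, the model often fails in real-world use:

It is sensitive to prompt phrasing;
Long generations tend to become unstable;
Local artifacts are conspicuous (hands, text, fast-moving regions);
Instruction following and controllable editing are weak.

The paper further argues that video post-training is harder than post-training for language or images, mainly because:

High rollout cost: generating one video typically takes tens to hundreds of diffusion sampling steps;
Temporal error accumulation: local mistakes spread into long-range degradation;
Inherently ambiguous supervision: text-to-video is a one-to-many mapping, so alignment rewards are often weakly discriminative;
Fragile evaluators: VLM-based, CLIP-based, and hand-crafted temporal metrics are all easily distorted by fps, resolution, sampling strategy, and similar factors.

The core question for TeleBoost is therefore not "how do we design a stronger reward", but: how do we organize SFT, RL, DPO, the reward system, and the infrastructure into a video post-training pipeline that is stable, diagnosable, and scalable?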

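2. Idea

TeleBoost splits video post-training into three stages:

Stage I, SFT: shape the policy into a stable, controllable reference policy;
Stage II, GRPO / RL with automatic feedback: use automatic rewards to improve alignment, temporal consistency, and perceptual quality;
Stage III, DPO / preference alignment: use human preferences for the final step of holistic perceptual refinement.

Its key contribution is not a new standalone algorithm. It places several existing or extended components in one unified framework, forming a staged optimization pipeline: stabilize first, then refine, then correct toward human preference.

This sets TeleBoost apart from many piecemeal post-training efforts: it treats the failure modes of video generation as a systems-engineering problem rather than pushing every objective into a single RL loss.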

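3. Method

3.1 Overall staged pipeline

Reading Figure 1: Figure 1 is the overview of TeleBoost. Starting from a pretrained video generator, Stage I SFT first injects controllability, instruction following, structural stability, and physics priors into the policy. Stage II RL then uses automatic feedback to further optimize perceptual fidelity, temporal coherence, and alignment. Finally, Stage III human preference alignment captures holistic preferences that are hard to write down as an explicit reward. The figure also marks evaluation and diagnostics as cross-stage components, which signals that TeleBoost is about managing the whole process systematically rather than optimizing any single stage.

3.2 Stage I: Supervised Fine-Tuning (SFT)

Reading Figure 2: Figure 2 shows the role of the SFT stage. Starting from the pretrained backbone, a unified supervised objective introduces instruction / control supervision, spatial-structure-aware constraints, and physics-aware motion supervision at the same time. The authors stress that all structural priors are injected conservatively at the decoder level. The aim is to establish a stable, controllable, structurally consistent reference policy first, not to chase metric ceilings aggressively during SFT.

The paper frames SFT as the "policy shaping" stage of the whole system rather than an ordinary warmup. The reasoning is:

If the reference policy itself is unstable, the reward signal in the subsequent RL and DPO stages only becomes noisier;
The main goal of SFT is therefore not to maximize visual quality but to define what the model is and is not allowed to do.

Stage I is further divided into three kinds of supervision:

Instruction and Control-Oriented SFT: strengthens prompt following, camera behavior, and compositional constraints;
Spatial-Structure-Aware SFT: improves 3D / spatial stability under large viewpoint changes;
Physics-Aware SFT: uses an auxiliary motion branch with optical flow supervision to improve dynamic phenomena such as liquids and deformable materials.

TeleBoost's SFT is thus more than simple imitation: it actively lays the foundation for Stage II.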

3.3 Stage II: Reinforcement Learning via GRPO

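Reading Figure 3: Figure 3 shows the GRPO optimization stack of TeleBoost's second stage. The base layer is standard GRPO: sample a group of videos for the same prompt and obtain advantages through group-relative normalization. Three refinement modules sit on top of it. ViPO turns the scalar reward into a spatio-temporal advantage map, addressing "where to learn". BPGO adjusts the weight of noisy rewards according to how trustworthy they are relative to a prior, addressing "what to trust". Self-Paced GRPO adapts the reward curriculum to the model's current competence, addressing "when to learn". The figure also shows a Joint Reward layer that reconciles conflicts among multiple objectives.

(1) GRPO backbone

TeleBoost uses critic-free GRPO as the basis of Stage II. Given a group of videos $\{v_{1}, \dots, v_{G}\}$ sampled for the same prompt, each video receives a scalar score $r_{i}$, and the scores are standardized within the group: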

$$A_{i} = \frac{r_{i} - \mathrm{mean}(\{r_{1}, \dots, r_{G}\})}{\mathrm{std}(\{r_{1}, \dots, r_{G}\}) + \epsilon}.$$

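This step has three effects:

It removes the need to train a separate critic;
It recasts absolute scoring as relative ranking within the group;
It is more robust for video, where the reward scale itself tends to drift.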
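The robustness to reward drift can be checked numerically. The following is a minimal sketch using only the Python standard library: because rewards are standardized within each group, shifting every reward by a constant or rescaling them by a positive factor leaves the advantages essentially unchanged. The function name, the reward values, and the choice of population standard deviation are illustrative assumptions, not details from the paper.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt group (group-relative advantage)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards) + eps  # population std (assumption); eps guards sigma == 0
    return [(r - mu) / sigma for r in rewards]

# Illustrative rewards for G = 4 videos sampled from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
base = group_relative_advantages(rewards)

# A drifting reward model: every score scaled by 10 and shifted by +3.
drifted = group_relative_advantages([10 * r + 3 for r in rewards])

# The two advantage vectors agree up to the tiny effect of eps.
max_gap = max(abs(a - b) for a, b in zip(base, drifted))
```

Only the ordering and relative spacing of the scores inside a group matter, which is why an absolute offset in the reward model does not change the update direction.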

(2) ViPO

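ViPO targets the problem that a scalar reward cannot say where an error occurred. It introduces a Perceptual Structuring Module that extracts spatio-temporal features with a frozen backbone such as DINOv2 or VideoMAE and then builds a local advantage map: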

$$L_{\mathrm{ViPO}} = \mathbb{E}_{t, h, w}\Big[\sum_{i} M_{i}^{(t, h, w)} \cdot A_{i} \cdot \log \pi_{\theta}(v_{i})\Big].$$

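Here $M_{i}^{(t, h, w)}$ is the weight the map assigns to frame $t$ and spatial position $(h, w)$ of video $v_{i}$, and $A_{i}$ is the group-relative advantage. As a result, the policy gradient no longer acts uniformly on the whole video but concentrates updates on the failure regions.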
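As a rough illustration of "where to learn", the sketch below spreads one scalar advantage over a small T×H×W grid using a non-negative weight map. How the Perceptual Structuring Module actually builds $M_{i}$ is not detailed in these notes, so the map here is a hand-written stand-in, and rescaling it to mean 1 is an assumption made so that the average update magnitude stays comparable to plain GRPO.

```python
def dense_advantage(advantage, weight_map):
    """Turn a scalar advantage into a per-location map A * M[t][h][w].

    weight_map is a nested list indexed [t][h][w] with non-negative entries.
    It is rescaled to mean 1 (an assumption of this sketch) so that the
    average per-location advantage equals the original scalar.
    """
    flat = [w for frame in weight_map for row in frame for w in row]
    mean_w = sum(flat) / len(flat)
    if mean_w == 0:
        raise ValueError("weight map must contain at least one positive entry")
    return [[[advantage * w / mean_w for w in row] for row in frame]
            for frame in weight_map]

# Hypothetical 2-frame, 2x2 map: the failure is concentrated in frame 1.
M = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 4.0], [2.0, 2.0]],
]
adv_map = dense_advantage(-1.5, M)  # a below-average sample in its group
```

With this map, frame 0 receives no learning signal at all, while the location with weight 4.0 receives four times the scalar advantage.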

(3) BPGO

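BPGO addresses the reliability of the reward itself. The authors argue that video tasks contain substantial many-to-many ambiguity, so the reward model is confident on some samples and unsure on others. BPGO handles this uncertainty through prior-guided trust allocation:

RAS (Reliability-Adaptive Scaling): sets the optimization weight of a group according to how far its reward distribution deviates from the prior;
CRT (Contrastive Reward Transformation): within a group, pulls high-confidence positives further apart and compresses ambiguous samples, sharpening the contrast.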
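To make the two mechanisms concrete, here is a deliberately simplified sketch. The exact functional forms of RAS and CRT are not given in these notes, so both functions below are placeholders chosen for illustration: the exponential weight (and the choice to down-weight groups that deviate from the prior) stands in for RAS, and a signed power transform on standardized rewards stands in for CRT. All names, the `temperature` and `power` parameters, and the reward values are hypothetical.

```python
import math
from statistics import fmean, pstdev

def reliability_weight(group_rewards, prior_mean, temperature=1.0):
    """RAS stand-in: weight shrinks as the group mean drifts from the prior."""
    gap = abs(fmean(group_rewards) - prior_mean)
    return math.exp(-gap / temperature)  # in (0, 1], equal to 1 when gap == 0

def contrastive_transform(group_rewards, power=2.0, eps=1e-6):
    """CRT stand-in: signed power of standardized rewards.

    With power > 1, samples near the group mean (|z| < 1, ambiguous) are
    compressed toward 0 and clear outliers (|z| > 1) are stretched.
    """
    mu = fmean(group_rewards)
    sigma = pstdev(group_rewards) + eps
    z = [(r - mu) / sigma for r in group_rewards]
    return [math.copysign(abs(v) ** power, v) for v in z]

def bpgo_advantages(group_rewards, prior_mean):
    """Combine both stand-ins: one group-level weight times per-sample values."""
    weight = reliability_weight(group_rewards, prior_mean)
    return [weight * a for a in contrastive_transform(group_rewards)]

adv = bpgo_advantages([0.1, 0.5, 0.5, 0.9], prior_mean=0.5)
```

In this toy group the two middle samples get zero advantage, while the two extremes are pushed further apart than plain standardization would place them.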

(4) Self-Paced GRPO

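Self-Paced GRPO deals with reward saturation: as the generator improves, a static reward gradually loses discriminative power. The authors therefore introduce a competence-aware curriculum that schedules the reward dimensions dynamically over training, so the model learns stability and structure first and fine-grained semantic alignment later.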

3.4 Stage III: DPO

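Reading Figure 4: Figure 4 concerns the engineering of DPO. On the left is standard DPO: the chosen and rejected branches share one computation graph, so the lifetimes of their intermediate gradients overlap and peak memory is high. On the right is gradient-decoupled DPO: the two branches are back-propagated separately, so the intermediate gradients of each branch can be freed sooner, which significantly lowers peak memory. The figure makes clear that in the DPO stage TeleBoost cares about feasibility at training scale as well as the algorithmic objective.

For preference optimization, TeleBoost keeps the diffusion variant of DPO: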

$$L_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)}\right)\right].$$
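For a single preference pair, the loss can be evaluated directly from four log-probabilities. The sketch below does that and also returns the derivative of the loss with respect to each branch's log-probability. Because the loss depends on the chosen and rejected branches only through these two scalars, the two branches could in principle be back-propagated one after the other, which is the kind of decoupling Figure 4 describes. Whether TeleBoost implements it exactly this way is not stated in these notes, and `beta=0.1` as well as the numeric inputs are illustrative assumptions.

```python
import math

def dpo_pair(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair plus per-branch gradient factors.

    logp_* are log pi_theta(y | x); ref_logp_* are log pi_ref(y | x).
    Returns (loss, d loss / d logp_w, d loss / d logp_l).
    """
    z = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = math.log1p(math.exp(-z))      # equals -log(sigmoid(z))
    coeff = beta / (1.0 + math.exp(z))   # equals beta * sigmoid(-z)
    return loss, -coeff, coeff

# The policy already prefers the chosen sample slightly more than the reference does.
loss, grad_w, grad_l = dpo_pair(-10.0, -12.0, -11.0, -11.0)

# When the policy equals the reference, the loss is exactly log 2.
neutral_loss, _, _ = dpo_pair(-5.0, -7.0, -5.0, -7.0)
```

The gradient factors have equal magnitude and opposite sign: the update raises the chosen sample's log-probability and lowers the rejected one's by the same scalar amount.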

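Stage III also puts weight on preference data construction:

policy-on-policy hard negatives
synthetic temporal negatives (reversed, shuffled, or frozen playback)
holistic human ranking

TeleBoost therefore does not treat DPO as a formula to plug in; it stresses how preference data must be constructed to carry enough information.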
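The synthetic temporal negatives are simple to picture in code. The sketch below builds the three variants from a frame sequence, using integers as stand-ins for frames. The function name, the dictionary keys, and the choice to freeze on the first frame are assumptions of this sketch, not details from the paper.

```python
import random

def temporal_negatives(frames, seed=0):
    """Build reversed, shuffled, and frozen variants of a frame sequence."""
    frames = list(frames)
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    rng = random.Random(seed)
    shuffled = frames[:]
    for _ in range(10):  # retry in the rare case a shuffle leaves the order unchanged
        rng.shuffle(shuffled)
        if shuffled != frames:
            break
    return {
        "reverse": frames[::-1],              # played backwards
        "shuffle": shuffled,                  # temporal order destroyed
        "freeze": [frames[0]] * len(frames),  # first frame repeated (assumption)
    }

clip = list(range(6))  # placeholder for six decoded frames
negatives = temporal_negatives(clip)
```

Each variant keeps the per-frame appearance of the original clip and damages only its temporal structure, so pairing it with the original as the rejected sample isolates temporal coherence as the preference signal.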

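3.5 Pseudocode (reconstructed from the structure of the paper)

The official TeleBoost repository currently hosts mainly the project page, and the training code has not been released. The pseudocode below is reconstructed from the framework description in the paper, not reproduced from source code.

Component A: Stage-I policy shaping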

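# Algorithm: Stage-I SFT policy shaping
# Input: pretrained video generator G and three supervision datasets
# Output: stable and controllable reference policy pi_ref
# Note: mixed_loader, the three *_loss functions, and update are placeholders
# for components the paper describes but has not released.
 
def stage1_sft(generator, instruction_data, structure_data, physics_data):
    for batch in mixed_loader(instruction_data, structure_data, physics_data):
        loss_inst = instruction_control_loss(generator, batch)    # prompt / camera control
        loss_struct = structure_aware_loss(generator, batch)      # 3D / spatial stability
        loss_phys = physics_aware_motion_loss(generator, batch)   # motion branch + optical flow
        loss = loss_inst + loss_struct + loss_phys                # unweighted sum in this sketch
        update(generator, loss)                                   # one optimizer step
    return generator  # serves as pi_ref for Stages II and III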

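Component B: Stage-II GRPO backbone

# Algorithm: Group Relative Policy Optimization (advantage step)
# Input: scalar rewards r_1..r_G of the G videos sampled for one prompt c
# Output: group-relative advantages A_1..A_G
 
from statistics import fmean, pstdev

def compute_grpo_advantage(rewards, eps=1e-6):
    mu = fmean(rewards)
    sigma = pstdev(rewards) + eps  # eps avoids division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]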

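Component C: BPGO reliability-aware optimization

# Algorithm: BPGO with RAS and CRT
# Input: group rewards, prior rewards
# Output: reliability-aware transformed rewards (one per sample)
# Note: the two helper functions are placeholders; their exact form is not given here.
 
def bpgo_transform(group_rewards, prior_rewards):
    reliability = estimate_group_reliability(group_rewards, prior_rewards)   # RAS: scalar weight for the group
    transformed = contrastive_reward_transform(group_rewards, prior_rewards) # CRT: list of per-sample rewards
    return [reliability * r for r in transformed]  # scale every sample by the group weight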

Component D: Self-Paced GRPO

# Algorithm: Self-paced reward curriculum
# Input: current policy competence statistics
# Output: active reward schedule
 
def self_paced_schedule(model_state):
    if model_state.is_unstable():
        return ['structure', 'stability']
    elif model_state.is_stable_but_not_aligned():
        return ['motion', 'temporal_coherence']
    else:
        return ['semantic_alignment', 'aesthetics']
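To exercise the schedule above, the sketch below supplies a hypothetical `ModelState`. The two score fields and both thresholds are invented for illustration; the paper's actual competence statistics are not described in these notes. The schedule function is repeated so the block runs on its own.

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    """Hypothetical competence summary; fields and thresholds are made up."""
    stability_score: float   # e.g. running mean of a stability reward in [0, 1]
    alignment_score: float   # e.g. running mean of an alignment reward in [0, 1]
    stability_threshold: float = 0.6
    alignment_threshold: float = 0.7

    def is_unstable(self) -> bool:
        return self.stability_score < self.stability_threshold

    def is_stable_but_not_aligned(self) -> bool:
        return (not self.is_unstable()
                and self.alignment_score < self.alignment_threshold)

def self_paced_schedule(model_state):
    """Same three-phase logic as Component D."""
    if model_state.is_unstable():
        return ['structure', 'stability']
    elif model_state.is_stable_but_not_aligned():
        return ['motion', 'temporal_coherence']
    else:
        return ['semantic_alignment', 'aesthetics']

early = self_paced_schedule(ModelState(stability_score=0.3, alignment_score=0.2))
middle = self_paced_schedule(ModelState(stability_score=0.8, alignment_score=0.4))
late = self_paced_schedule(ModelState(stability_score=0.9, alignment_score=0.85))
```

As the two scores rise over training, the active rewards move from structure and stability, through motion and temporal coherence, to semantic alignment and aesthetics.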

3.6 Code-to-paper mapping table

Paper Concept	Source File	Key Class/Function
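Stage I SFT framework	Not released	Code not published
Stage II GRPO stack	Not released	Code not published
ViPO / BPGO / Self-Paced GRPO	Not released	Code not published
Gradient-decoupled DPO	Not released	Code not published
Project page and visualization assets	`README.md`, `index.html`, `main.js`	GitHub Pages assets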

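4. Experimental Setup

4.1 Overall evaluation philosophy

TeleBoost is not evaluated on a single benchmark; each module is validated on its own slice:

SFT slice: checks whether geometry / physics-aware initialization is more stable;
Stage II slice: checks whether BPGO, ViPO, and Self-Paced GRPO actually make reward optimization more stable;
Global human eval: compares the full post-training system directly against the strong baseline Wan2.2-14B I2V.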

4.2 Important concrete settings

BPGO evaluation:
- T2V: Wan2.1-1.3B
- I2V: Wan2.2-14B
- Group size: $G = 8$
- Reward model: VideoCLIP-XL
- Training: 200 steps
- Data: iStock dataset, 10K samples
- Test set: 1000 samples
ViPO evaluation:
- Cluster: 32 × NVIDIA H100 GPUs
- T2V comparison: Wan2.1 / DanceGRPO / ViPO
Self-Paced GRPO evaluation:
- Wan2.1-T2V-1.3B: 220 training steps
- Wan2.1-T2V-14B: 100 training steps
- Group size: 8 for the 1.3B model, 32 for the 14B model

4.3 Metrics

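The paper uses a mix of:

GSB human evaluation
Optical flow metrics (EPE, >1px, >3px, F1-all)
VideoCLIP-XL
VideoAlign-TA / VideoAlign-overall
Qwen3-VL-Embedding
VBench multi-dimensional metrics
LAION aesthetic score

This again marks TeleBoost as a systems report: rather than trying to cover everything with one metric, it evaluates each slice with the most relevant measure.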

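5. Experimental Results

5.1 Human evaluation against Wan2.2-14B I2V

Reading Figure 5: Figure 5 visualizes the GSB (Good / Same / Bad) human evaluation against Wan2.2-14B I2V. On motion quality, text alignment, and overall, the share of "Good" is clearly higher than the share of "Bad". The gain is therefore not a small bump on some narrow metric but a quality improvement that human raters perceive as a whole, most visibly in long-range motion consistency and prompt alignment.

The exact human-evaluation numbers (WinRate / Preference / Margin) are: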

Aspect	WinRate	Preference	Margin
Visual quality	58.71	52.17	4.35
Motion quality	70.72	62.47	24.90
Text alignment	77.39	62.15	24.13
Preservation	63.28	54.15	8.15
Overall	71.18	66.38	32.71

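The largest margins appear in:

Motion quality (24.90)
Text alignment (24.13)
Overall (32.71)

compared with 4.35 for visual quality and 8.15 for preservation. This matches the design goals of Stages II and III.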

5.2 Physics-aware SFT

On the fluid-effects test sets:

Test Set	EPE	>1px	>3px	F1-all
Real-world fluid effects	0.538	14.7%	0.0%	4.680
Simulation fluid effects	1.541	21.8%	10.0%	10.040

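These results suggest that the physics-aware SFT of Stage I provides a more stable motion prior on both real and simulated fluid scenes, which removes one source of reward noise in the later RL stage.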

5.3 BPGO

BPGO is the easiest Stage-II module in TeleBoost to quantify. Its main results are:

Task	Method	VideoCLIP-XL	VideoAlign-TA	VideoAlign-overall	Qwen3-VL-Embedding
T2V	Wan2.1	2.6563	1.0638	0.0939	0.6741
T2V	GRPO	2.6714	0.8984	-0.5411	0.6722
T2V	BPGO	2.6788	1.1193	-0.0478	0.6754
I2V	Wan2.2	2.6726	1.0633	-0.7623	0.6885
I2V	GRPO	2.0713	0.2307	-1.8932	0.4513
I2V	BPGO	2.6855	1.0589	-1.0491	0.6890

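Key takeaways:

On T2V, BPGO improves VideoAlign-TA by 24.6% over GRPO (0.8984 → 1.1193);
On I2V, plain GRPO collapses on every metric, while BPGO stays close to the Wan2.2 baseline;
In the human evaluation (100 pairs), BPGO clearly beats GRPO on both overall quality and text alignment:
- Overall Quality: 34 / 48 / 18
- Text Alignment: 27 / 56 / 17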
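The figures quoted above can be reproduced from the table with a few lines of arithmetic. This is only a worked check; the helper name is hypothetical, and reading the human-evaluation triples as Good / Same / Bad counts is an assumption based on the GSB protocol used elsewhere in the paper.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / abs(old)

# T2V, VideoAlign-TA: GRPO 0.8984 -> BPGO 1.1193
t2v_ta_gain = relative_change(1.1193, 0.8984)

# I2V, VideoCLIP-XL: Wan2.2 baseline 2.6726 -> plain GRPO 2.0713 (the collapse)
i2v_grpo_drop = relative_change(2.0713, 2.6726)

# Human evaluation over 100 pairs, assuming each triple is (Good, Same, Bad)
overall_net = 34 - 18  # net preference for BPGO, in pairs
text_net = 27 - 17

print(f"{t2v_ta_gain:.1f}%  {i2v_grpo_drop:.1f}%  {overall_net:+d}  {text_net:+d}")
```

The first value rounds to 24.6%, matching the figure stated in the list, and the second shows that the I2V collapse of plain GRPO costs more than a fifth of the baseline VideoCLIP-XL score.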

5.4 ViPO

Representative ViPO numbers:

Method	VQ	MQ	Quality	Semantic	Total
Wan2.1	2.6219	0.5896	83.36	71.20	80.92
DanceGRPO	3.0935	0.8639	83.63	69.68	80.84
ViPO	3.5501	1.1515	83.98	72.59	81.70

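ViPO's largest gain is in motion quality (MQ):

from 0.5896 for Wan2.1
to 1.1515 for ViPO (roughly a 95% relative increase; DanceGRPO reaches 0.8639)

This supports the claim that back-projecting a scalar reward onto spatio-temporal regions is more effective than uniform, video-level credit assignment.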

5.5 Self-Paced GRPO

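Overall results for Self-Paced GRPO:

Wan2.1-T2V-1.3B total score: 79.58 → 80.22
Wan2.1-T2V-14B total score: 81.46 → 82.09

In the ablation (OA is the total score quoted above; VQ, MQ, and TA presumably denote the VideoAlign visual-quality, motion-quality, and text-alignment scores, and QS and SC the VBench quality and semantic scores):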

Method	VQ	MQ	TA	LAION	QS	SC	OA
Wan2.1-1.3B-T2V	3.448	0.2911	-1.914	5.224	83.15	65.31	79.58
Joint train	3.366	0.3011	-1.077	5.222	82.69	66.38	79.80
Self-paced GRPO	3.501	0.3090	-0.7114	5.252	83.53	66.94	80.22

This indicates that the advantage of Self-Paced GRPO comes from the curriculum design itself, not from "swapping in a stronger reward model".

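5.6 Limitations / Summary

An important characteristic of TeleBoost is that it is not a model that can be deployed on its own; it is a post-training recipe / blueprint.

Seen this way, its limitations are clear:

It reads as a systems report: results are scattered across modules and slices, and there is no unified single-model benchmark;
The official repository currently contains only the project page and demo assets; the full code has not been released;
Stage III DPO is described methodologically, but the evaluation gives no complete quantitative table for a standalone DPO slice.

Its value is equally clear:

It lifts video post-training from "scattered tricks" to a pipeline-design problem that is staged, diagnostics-driven, and infrastructure-aware;
It gives systematic answers to three questions:
- where to learn (ViPO)
- what to trust (BPGO)
- when to learn (Self-Paced GRPO)

For practitioners of video post-training, TeleBoost is better read as a high-quality roadmap than as a single-point benchmark paper.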
