VideoAlign: Improving Video Generation with Human Feedback

Authors: Jie Liu*, Gongye Liu*, Jiajun Liang†, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang✉, Wanli Ouyang
Affiliations: MMLab CUHK, Tsinghua University, Kling Team (Kuaishou Technology), Shanghai Jiao Tong University, Shanghai AI Laboratory
arXiv: 2501.13918
Project Page: gongyeliu.github.io/videoalign
GitHub: KwaiVGI/VideoAlign
Venue: NeurIPS 2025

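1. Motivation

Current video generation models based on Rectified Flow (e.g., Kling, Gen3) can already produce realistic videos, but two core problems remain:

Unsmooth motion: generated videos show jitter and unnatural dynamics.
Text misalignment: the generated content drifts semantically from the text prompt, especially for complex prompts.

Bringing RLHF to video generation faces two bottlenecks:

Unreliable reward models: existing video preference datasets (e.g., VideoScore) were collected from early T2V models with low resolution and short duration, so reward models trained on them generalize poorly to modern high-quality T2V models. The design space of VLM-based reward models (annotation paradigm, token positions, tie handling) also remains underexplored.
Incompatible alignment algorithms: modern T2V models use Rectified Flow rather than conventional diffusion. Porting Diffusion-DPO directly introduces a timestep-dependent KL coefficient $\beta_{t} = \beta (1 - t)^{2}$, which lets the model over-optimize at high noise levels and leads to reward hacking.

The paper addresses both problems with a large-scale preference dataset, a multi-dimensional VideoReward model, and three flow-based alignment algorithms.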

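2. Idea

One-sentence summary: build a 182k multi-dimensional video preference dataset to train a VLM-based VideoReward model, and derive three Flow Matching alignment algorithms (Flow-DPO / Flow-RWR / Flow-NRG) from a unified RL perspective, where Flow-DPO replaces the timestep-dependent $\beta_{t}$ with a constant $\beta$ to avoid reward hacking.

Key findings:

Bradley-Terry with Ties (BTT) outperforms plain BT and regression losses, because tie annotations carry useful preference information.
Token positioning: the VQ/MQ tokens are placed after the video and before the prompt (context-agnostic), and the TA token is placed after the prompt (context-aware), which removes context leakage.
A constant $\beta$ outperforms $\beta_{t} = \beta (1 - t)^{2}$: the timestep-dependent KL coefficient goes to 0 as $t \to 1$, so the model is free to drift from the reference policy at high-noise steps, which leads to reward hacking. A fixed $\beta$ constrains the deviation at every timestep.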

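3. Method

3.1 Overall framework

Figure 1: Overview of the VideoAlign framework. (a) Human preference annotation: 16k high-quality prompts are collected, 108k videos are generated with 12 T2V models, and the videos are paired into 182k triplets. Each triplet is annotated independently on three dimensions, VQ (Visual Quality), MQ (Motion Quality), and TA (Text Alignment), with a label of A wins / Tied / B wins. (b) Reward model training: the video frame sequence is tokenized and fed to a VLM (Qwen2-VL-2B). The context-agnostic [VQ] and [MQ] tokens are inserted after the video, the context-aware [TA] token is inserted after the prompt, a linear projection outputs the three reward scores, and the model is trained with the BTT loss. (c) Text-to-video alignment: three algorithms, Flow-DPO (training-time direct preference optimization), Flow-RWR (training-time reward-weighted regression), and Flow-NRG (inference-time reward gradient guidance).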

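3.2 Preference dataset construction

Prompt collection: prompts are collected from the internet, sorted into 8 categories (animal, architecture, food, people, plants, scenes, vehicles, objects), expanded with GPT-4o, then deduplicated and filtered, yielding 16,000 high-quality prompts.

Video generation and annotation:

Model type	T2V models	Videos	Triplets	Resolution	Duration
Pre-Sora era	Gen2, SVD, Pika 1.0, Vega, PixVerse v1, HiDream	6k each (HiDream 0.3k)	13k each (HiDream 0.3k)	576-768 px	3-5 s
Modern models	Dreamina, Luma, Gen3, Kling 1.0, PixVerse v2, Kling 1.5	6k-16k	28k-68k	384-768 px	5-6 s

Total: 108k videos and 182k annotated triplets, with 13,000 triplets held out as the validation set.

Annotation protocol:

Each triplet receives a pairwise preference label (A wins / Tied / B wins) on each of the three dimensions VQ, MQ, and TA.
Each video also receives a 1-5 Likert score, which supports the pointwise vs. pairwise comparison experiments.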
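
To make the annotation schema concrete, the record below sketches one way to hold a single annotated triplet. The class name `PreferenceTriplet`, its field names, and the choice to store one Likert score per dimension are all assumptions made for illustration; they are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Dict

DIMENSIONS = ("VQ", "MQ", "TA")
LABELS = ("A wins", "Tied", "B wins")


@dataclass(frozen=True)
class PreferenceTriplet:
    """One (prompt, video A, video B) triplet with its annotations (hypothetical schema)."""
    prompt: str
    video_a: str                # identifier or path of video A
    video_b: str                # identifier or path of video B
    preference: Dict[str, str]  # dimension -> one of LABELS
    likert_a: Dict[str, int]    # dimension -> 1..5 score (per-dimension scoring is an assumption)
    likert_b: Dict[str, int]

    def __post_init__(self) -> None:
        # Validate that every dimension carries a legal label and legal scores.
        for dim in DIMENSIONS:
            if self.preference[dim] not in LABELS:
                raise ValueError(f"invalid label for {dim}")
            for scores in (self.likert_a, self.likert_b):
                if not 1 <= scores[dim] <= 5:
                    raise ValueError(f"Likert score for {dim} must be in 1..5")


example = PreferenceTriplet(
    prompt="a corgi surfing at sunset",
    video_a="clip_a.mp4",
    video_b="clip_b.mp4",
    preference={"VQ": "A wins", "MQ": "Tied", "TA": "B wins"},
    likert_a={"VQ": 4, "MQ": 3, "TA": 2},
    likert_b={"VQ": 3, "MQ": 3, "TA": 4},
)
```

Keeping the pairwise label separate from the Likert scores matters here: in the example, both videos score 3 on MQ, and only the pairwise field records that the annotator judged them tied rather than merely close.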

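3.3 VideoReward model

Backbone: Qwen2-VL-2B-Instruct

Three key design decisions:

(1) Bradley-Terry (BT) vs. Score Regression

Figure 2: Validation accuracy of the BT reward model and the regression reward model at different fractions of the training data. The x-axis is the training data fraction (log scale) and the y-axis is accuracy (%). The BT model beats the regression model at every data scale, and the gap is widest at larger data scales (~83% vs. ~75%), which suggests that pairwise comparison is better suited to capturing subtle preference differences.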

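Bradley-Terry (BT) loss:

$$L_{BT} = -\mathbb{E}\left[\log \sigma\left(r(x_{0}^{w}, y) - r(x_{0}^{l}, y)\right)\right]$$

Score regression loss:

$$L_{reg} = \mathbb{E}\left[\lVert r(x_{0}, y) - z \rVert^{2}\right]$$

where $z$ is the annotated pointwise score.

Conclusion: the BT model consistently outperforms the regression model, because pairwise annotation captures subtle preference differences that pointwise scores cannot distinguish.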
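
A toy calculation illustrates the argument. Suppose annotators prefer video A over video B, yet both videos received the same Likert score of 3 (all numbers here are invented for illustration). A reward model that outputs 3 for both videos already has zero regression loss, while the BT loss still rewards separating the two. The function names are hypothetical.

```python
import math


def bt_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry loss -log(sigmoid(r_w - r_l)) for one (winner, loser) pair."""
    return math.log1p(math.exp(-(r_w - r_l)))


def regression_loss(r: float, z: float) -> float:
    """Squared error between a predicted reward r and a pointwise score z."""
    return (r - z) ** 2


# The model currently assigns the same reward to both videos.
r_a, r_b = 3.0, 3.0
likert_a = likert_b = 3.0  # identical pointwise scores, but A is preferred pairwise

reg_total = regression_loss(r_a, likert_a) + regression_loss(r_b, likert_b)
bt_equal = bt_loss(r_a, r_b)            # loss when A and B are not separated
bt_separated = bt_loss(r_a + 1.0, r_b)  # loss after raising A's reward by 1

print(reg_total)     # 0.0: regression sees nothing left to learn
print(bt_equal)      # log(2), about 0.693
print(bt_separated)  # smaller than bt_equal: BT favors ranking A above B
```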

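(2) Bradley-Terry with Ties (BTT)

Figure 3: Box plots of the score difference $\Delta r = r(x_{0}^{A}, y) - r(x_{0}^{B}, y)$ on the validation set for the BT and BTT models. For the BT model (left), $\Delta r$ for tied pairs is widely scattered and overlaps heavily with the win/lose pairs, so the BT model cannot reliably identify ties. For the BTT model (right), $\Delta r$ for tied pairs clusters tightly around 0 while the win/lose pairs stay well separated, so BTT learns a clearer decision boundary.

BTT defines a distribution over three preference outcomes ($\theta > 1$ controls the tendency toward ties; the experiments use $\theta = 5.0$). Writing $r_{A} = r(x_{0}^{A}, y)$ and $r_{B} = r(x_{0}^{B}, y)$:

$$P_{\theta}(c \mid y, x_{0}^{A}, x_{0}^{B}) = \begin{cases} \dfrac{(\theta^{2} - 1) \exp(r_{A}) \exp(r_{B})}{(\exp(r_{A}) + \theta \exp(r_{B}))(\theta \exp(r_{A}) + \exp(r_{B}))} & \text{Tie} \\[2ex] \dfrac{\exp(r_{A})}{\exp(r_{A}) + \theta \exp(r_{B})} & \text{A preferred} \\[2ex] \dfrac{\exp(r_{B})}{\theta \exp(r_{A}) + \exp(r_{B})} & \text{B preferred} \end{cases}$$

BTT loss:

$$L_{BTT} = -\mathbb{E}_{(y, x_{0}^{A}, x_{0}^{B}) \sim D}\left[\sum_{c \in \{\succ, \prec, =\}} \mathbb{1}(c) \log P_{\theta}(c \mid y, x_{0}^{A}, x_{0}^{B})\right]$$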
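
The sketch below evaluates the three BTT probabilities in the standard form in which they sum to one, namely $P(A) = e^{r_A} / (e^{r_A} + \theta e^{r_B})$, $P(B) = e^{r_B} / (\theta e^{r_A} + e^{r_B})$, and $P(\text{tie}) = (\theta^{2} - 1) P(A) P(B)$. The function names `btt_probs` and `btt_loss` are hypothetical, and plain floats stand in for model outputs.

```python
import math


def btt_probs(r_a: float, r_b: float, theta: float = 5.0) -> dict:
    """Return the BTT probabilities of 'A', 'B', and 'tie' (requires theta > 1)."""
    if theta <= 1:
        raise ValueError("theta must be greater than 1")
    e_a, e_b = math.exp(r_a), math.exp(r_b)
    p_a = e_a / (e_a + theta * e_b)
    p_b = e_b / (theta * e_a + e_b)
    p_tie = (theta ** 2 - 1) * p_a * p_b  # same as the closed-form tie term
    return {"A": p_a, "B": p_b, "tie": p_tie}


def btt_loss(r_a: float, r_b: float, label: str, theta: float = 5.0) -> float:
    """Negative log-likelihood of one annotated outcome ('A', 'B', or 'tie')."""
    return -math.log(btt_probs(r_a, r_b, theta)[label])


equal = btt_probs(0.0, 0.0)  # identical rewards
apart = btt_probs(3.0, 0.0)  # A scored well above B

print(equal)  # tie probability is (theta - 1) / (theta + 1) = 2/3 for theta = 5
print(apart)  # A dominates (about 0.80), the tie mass shrinks, B is close to 0
```

With equal rewards the tie outcome is the most likely one, which is the behavior the tie-aware loss is meant to encode; as the rewards separate, probability mass moves from the tie to the higher-scored video.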

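(3) Token positioning

This design addresses context leakage: conventional approaches attach the score heads for all dimensions to the last token, so the VQ score ends up being influenced by the prompt.

The paper's solution:

The [VQ] and [MQ] tokens are inserted after the video tokens and before the prompt (context-agnostic: they attend only to the visual content).
The [TA] token is inserted after the full prompt (context-aware: it needs to attend to both the visual content and the text).
The final-layer embeddings of the three tokens pass through a shared linear projection that maps them to the three scores.

Input sequence: [Video Tokens] ... [VQ] [MQ] [Instructions] [Prompt] [TA]
                                    ↓    ↓                            ↓
                        visual quality   motion quality        text alignment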
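
The effect of the layout can be checked with a few lines of code, under the assumption that the backbone uses causal (left-to-right) attention, as decoder-only language models do. The helper `visible_context` and the placeholder token strings are hypothetical.

```python
def visible_context(sequence, token):
    """Under a causal mask, a token attends only to itself and earlier positions."""
    idx = sequence.index(token)
    return sequence[: idx + 1]


sequence = ["<video tokens>", "[VQ]", "[MQ]", "<instructions>", "<prompt>", "[TA]"]

for special in ("[VQ]", "[MQ]", "[TA]"):
    ctx = visible_context(sequence, special)
    print(special, "sees prompt:", "<prompt>" in ctx)
# [VQ] sees prompt: False
# [MQ] sees prompt: False
# [TA] sees prompt: True
```

Because [VQ] and [MQ] sit before the prompt, their embeddings cannot depend on it, whereas a score head on the last token would see the whole sequence.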

VideoReward training pseudocode

# VideoReward training pseudocode.
# sample_frames, instructions, and optimizer are placeholders assumed to be defined elsewhere.
import torch

VQ, MQ, TA = 0, 1, 2  # indices into the 3-dimensional reward vector and the label list

def train_video_reward(model, dataset, theta=5.0):
    """
    model: Qwen2-VL-2B + linear projection
    dataset: 182k triplets with VQ/MQ/TA pairwise labels
    theta: BTT tie parameter (theta > 1; larger values make ties more likely)
    """
    for (prompt, video_A, video_B, labels) in dataset:
        # 1. Video preprocessing: sample at 2 fps, ~448x448 resolution, aspect ratio preserved
        frames_A = sample_frames(video_A, fps=2, resolution=448)
        frames_B = sample_frames(video_B, fps=2, resolution=448)

        # 2. Build the input sequences (token positioning)
        input_A = [frames_A, "[VQ]", "[MQ]", instructions, prompt, "[TA]"]
        input_B = [frames_B, "[VQ]", "[MQ]", instructions, prompt, "[TA]"]

        # 3. Forward pass to get the three reward scores
        r_A = model(input_A)  # [r_vq_A, r_mq_A, r_ta_A]
        r_B = model(input_B)  # [r_vq_B, r_mq_B, r_ta_B]

        # 4. Accumulate the BTT loss over each dimension d in {VQ, MQ, TA}
        loss = 0
        for d in [VQ, MQ, TA]:
            e_A, e_B = torch.exp(r_A[d]), torch.exp(r_B[d])
            if labels[d] == "tie":
                p = (theta**2 - 1) * e_A * e_B / \
                    ((e_A + theta * e_B) * (theta * e_A + e_B))
            elif labels[d] == "A_wins":
                p = e_A / (e_A + theta * e_B)
            else:  # B_wins
                p = e_B / (theta * e_A + e_B)
            loss = loss - torch.log(p)

        # 5. Clear stale gradients, backpropagate, and update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
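
The explicit exponentials in the loop above can overflow when reward magnitudes grow. An equivalent log-space form is given below as an implementation suggestion; it is not stated in the notes and follows from the BTT definitions with $\operatorname{softplus}(z) = \log(1 + e^{z})$.

```latex
\begin{aligned}
\log P_{\theta}(A \succ B) &= -\operatorname{softplus}\bigl(r_{B} - r_{A} + \log\theta\bigr), \\
\log P_{\theta}(B \succ A) &= -\operatorname{softplus}\bigl(r_{A} - r_{B} + \log\theta\bigr), \\
\log P_{\theta}(\text{tie}) &= \log(\theta^{2} - 1) + \log P_{\theta}(A \succ B) + \log P_{\theta}(B \succ A).
\end{aligned}
```

Each term depends only on the reward difference, so the loss can be computed without ever forming $\exp(r_{A})$ or $\exp(r_{B})$.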

3.4 Flow-DPO

Rectified Flow basics

Rectified Flow defines the "noisy" data as $x_{t} = (1 - t) x_{0} + t x_{1}$, where $x_{0} \sim q(x_{0})$ is real data and $x_{1} \sim p(x_{1})$ is noise. The Flow Matching objective is:

$$L(\theta) = \mathbb{E}_{t,\, x_{0} \sim q,\, x_{1} \sim p}\left[\lVert v - v_{\theta}(x_{t}, t) \rVert^{2}\right]$$

where the target velocity field is $v = x_{1} - x_{0}$.
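
The interpolation and its target velocity can be checked on a toy vector. The sketch uses plain Python lists in place of latents, and the helper names are hypothetical. It verifies that moving from $x_{t}$ along $v$ reaches the noise endpoint and moving against $v$ reaches the data endpoint.

```python
import random


def interpolate(x0, x1, t):
    """Rectified Flow interpolation x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Flow Matching regression target v = x1 - x0 (constant along the path)."""
    return [b - a for a, b in zip(x0, x1)]


random.seed(0)
x0 = [0.5, -1.0, 2.0]                    # toy "data" sample
x1 = [random.gauss(0, 1) for _ in x0]    # Gaussian noise sample
t = 0.3

xt = interpolate(x0, x1, t)
v = target_velocity(x0, x1)

# x_t + (1 - t) * v recovers x1, and x_t - t * v recovers x0.
x1_rec = [a + (1 - t) * b for a, b in zip(xt, v)]
x0_rec = [a - t * b for a, b in zip(xt, v)]
```

The first identity, $x_{1} = x_{t} + (1 - t) v$, is what links a velocity prediction to a noise prediction in the derivation of Flow-DPO.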

Unified RLHF objective

$$\max_{p_{\theta}} \; \mathbb{E}_{y \sim D_{c},\, x_{0} \sim p_{\theta}(x_{0} \mid y)}\left[r(x_{0}, y)\right] - \beta\, D_{KL}\left[p_{\theta}(x_{0} \mid y) \,\Vert\, p_{ref}(x_{0} \mid y)\right]$$

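From Diffusion-DPO to Flow-DPO

The Diffusion-DPO objective (in the noise-prediction parameterization):

$$L_{DD}(\theta) = -\mathbb{E}\left[\log \sigma\left(-\frac{\beta}{2}\left(\lVert \epsilon^{w} - \epsilon_{\theta}(x_{t}^{w}, t) \rVert^{2} - \lVert \epsilon^{w} - \epsilon_{ref}(x_{t}^{w}, t) \rVert^{2} - \left(\lVert \epsilon^{l} - \epsilon_{\theta}(x_{t}^{l}, t) \rVert^{2} - \lVert \epsilon^{l} - \epsilon_{ref}(x_{t}^{l}, t) \rVert^{2}\right)\right)\right)\right]$$

Key lemma (Lemma B.1): in Rectified Flow, the noise-prediction error and the velocity-prediction error are related by:

$$\lVert \epsilon^{*} - \epsilon_{pred}(x_{t}^{*}, t) \rVert^{2} = (1 - t)^{2}\, \lVert v^{*} - v_{pred}(x_{t}^{*}, t) \rVert^{2}$$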
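
A short derivation makes the $(1 - t)^{2}$ factor explicit. It is a sketch that assumes the noise prediction implied by a velocity prediction is obtained by holding $x_{t}$ fixed and applying the same endpoint formula as for the ground truth, with $\epsilon = x_{1}$.

```latex
\begin{aligned}
x_{1} &= x_{t} + (1 - t)\,(x_{1} - x_{0})
  && \text{since } x_{t} = (1 - t)\,x_{0} + t\,x_{1} \\
\epsilon^{*} &= x_{t}^{*} + (1 - t)\, v^{*}, \qquad
\epsilon_{pred}(x_{t}^{*}, t) = x_{t}^{*} + (1 - t)\, v_{pred}(x_{t}^{*}, t) \\
\epsilon^{*} - \epsilon_{pred}(x_{t}^{*}, t) &= (1 - t)\,\bigl(v^{*} - v_{pred}(x_{t}^{*}, t)\bigr) \\
\lVert \epsilon^{*} - \epsilon_{pred}(x_{t}^{*}, t) \rVert^{2} &= (1 - t)^{2}\, \lVert v^{*} - v_{pred}(x_{t}^{*}, t) \rVert^{2}
\end{aligned}
```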

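Substituting this lemma into the Diffusion-DPO objective gives the Flow-DPO loss:

$$L_{FD}(\theta) = -\mathbb{E}\left[\log \sigma\left(-\frac{\beta_{t}}{2}\left(\lVert v^{w} - v_{\theta}(x_{t}^{w}, t) \rVert^{2} - \lVert v^{w} - v_{ref}(x_{t}^{w}, t) \rVert^{2} - \left(\lVert v^{l} - v_{\theta}(x_{t}^{l}, t) \rVert^{2} - \lVert v^{l} - v_{ref}(x_{t}^{l}, t) \rVert^{2}\right)\right)\right)\right]$$

where $\beta_{t} = \beta (1 - t)^{2}$.

Key finding: replace $\beta_{t}$ with a constant $\beta$

Problem analysis: $\beta_{t} = \beta (1 - t)^{2}$ goes to 0 as $t \to 1$ (high noise), so the KL constraint almost vanishes there. The model is then free to drift from the reference policy at high-noise steps, which causes reward hacking.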
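
A few numbers show how quickly the constraint weakens. The value $\beta = 500$ below is purely illustrative and is not a setting reported in the notes; `beta_t` is a hypothetical helper.

```python
def beta_t(beta: float, t: float) -> float:
    """Timestep-dependent KL coefficient beta_t = beta * (1 - t)^2."""
    return beta * (1.0 - t) ** 2


beta = 500.0  # illustrative value only
for t in (0.0, 0.5, 0.9, 0.99):
    print(f"t={t:<4}  beta_t={beta_t(beta, t):8.2f}  constant beta={beta:8.2f}")
# beta_t falls from 500 at t=0 to 125 at t=0.5, about 5 at t=0.9, and about 0.05 at t=0.99
```

At $t = 0.9$ the effective coefficient is already 100 times smaller than at $t = 0$, so the high-noise steps are regularized far less than the low-noise ones.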

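Solution: inspired by the DDPM observation that dropping the timestep weighting improves sample quality, the paper simply uses a constant $\beta$ in place of $\beta_{t}$:

$$L_{Flow\text{-}DPO}(\theta) = -\mathbb{E}\left[\log \sigma\left(-\frac{\beta}{2}\left(\Delta_{w} - \Delta_{l}\right)\right)\right]$$

where $\Delta_{w} = \lVert v^{w} - v_{\theta}(x_{t}^{w}, t) \rVert^{2} - \lVert v^{w} - v_{ref}(x_{t}^{w}, t) \rVert^{2}$, and $\Delta_{l}$ is defined analogously for the non-preferred sample.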

Flow-DPO pseudocode (adapted from Appendix C of the paper)

import torch
import torch.nn.functional as F

def flow_dpo_loss(model, ref_model, x_w, x_l, c, beta):
    """
    model: current flow model (trainable)
    ref_model: frozen reference flow model
    x_w: latents of the preferred videos, shape (B, ...)
    x_l: latents of the non-preferred videos, shape (B, ...)
    c: text condition (prompt)
    beta: constant KL regularization coefficient
    """
    # 1. Sample one timestep per pair and a noise tensor shared by both videos
    timestep = torch.rand(x_w.shape[0], device=x_w.device)
    t = timestep.view(-1, *([1] * (x_w.dim() - 1)))  # reshape so it broadcasts over latent dims
    noise = torch.randn_like(x_w)

    # 2. Build the noisy latents: x_t = (1 - t) * x_0 + t * noise
    noisy_x_w = (1 - t) * x_w + t * noise
    noisy_x_l = (1 - t) * x_l + t * noise

    # 3. Predict the velocity fields (the reference model needs no gradients)
    velocity_w_pred = model(noisy_x_w, c, timestep)
    velocity_l_pred = model(noisy_x_l, c, timestep)
    with torch.no_grad():
        velocity_ref_w_pred = ref_model(noisy_x_w, c, timestep)
        velocity_ref_l_pred = ref_model(noisy_x_l, c, timestep)

    # 4. Target velocity fields: v = noise - x_0
    velocity_w = noise - x_w
    velocity_l = noise - x_l

    # 5. Per-sample squared errors, summed over all non-batch dimensions
    model_w_err = (velocity_w_pred - velocity_w).pow(2).flatten(1).sum(dim=1)
    model_l_err = (velocity_l_pred - velocity_l).pow(2).flatten(1).sum(dim=1)
    ref_w_err = (velocity_ref_w_pred - velocity_w).pow(2).flatten(1).sum(dim=1)
    ref_l_err = (velocity_ref_l_pred - velocity_l).pow(2).flatten(1).sum(dim=1)

    # 6. DPO loss with a constant beta (not beta_t), averaged over the batch
    w_diff = model_w_err - ref_w_err
    l_diff = model_l_err - ref_l_err
    inside_term = -0.5 * beta * (w_diff - l_diff)
    loss = -F.logsigmoid(inside_term).mean()

    return loss

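3.5 Flow-RWR

Starting from the unified RLHF objective (Eq. 3), the optimal closed-form solution is:

$$p_{\theta}(x_{0} \mid y) = \frac{1}{Z(y)}\, p_{ref}(x_{0} \mid y) \exp\left(\frac{1}{\beta}\, r(x_{0}, y)\right)$$

The corresponding RWR loss for Rectified Flow models is:

$$L_{RWR}(\theta) = \mathbb{E}\left[\exp\left(r(x_{0}, y)\right) \lVert v - v_{\theta}(x_{t}, t, y) \rVert^{2}\right]$$

As with Flow-DPO, the $(1 - t)^{2}$ factor is omitted for better performance. In essence, the Flow Matching loss is weighted by the reward.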
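
The weighting can be illustrated on a toy batch. The sketch below uses plain Python lists instead of tensors, and the function name `flow_rwr_loss` and all numbers are invented for illustration; a real implementation would operate on batched latents.

```python
import math


def flow_rwr_loss(v_pred, v_target, rewards):
    """Batch mean of exp(r_i) * ||v_i - v_pred_i||^2 (reward-weighted Flow Matching loss)."""
    total = 0.0
    for pred, target, r in zip(v_pred, v_target, rewards):
        sq_err = sum((p - q) ** 2 for p, q in zip(pred, target))
        total += math.exp(r) * sq_err
    return total / len(rewards)


v_target = [[1.0, 0.0], [0.0, 1.0]]
v_pred = [[0.5, 0.0], [0.0, 0.5]]  # both samples have the same squared error, 0.25
rewards = [1.0, -1.0]              # sample 0 is high-reward, sample 1 is low-reward

loss = flow_rwr_loss(v_pred, v_target, rewards)
weight_ratio = math.exp(rewards[0]) / math.exp(rewards[1])
print(loss)
print(weight_ratio)  # e^2, about 7.39: the high-reward sample dominates the gradient
```

Both samples have identical regression error, yet the high-reward one contributes about 7.4 times more to the loss, so the model is pulled mainly toward reproducing high-reward videos.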

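3.6 Flow-NRG (Noisy Reward Guidance)

An inference-time alignment method that requires no training of the generation model. It starts from the closed-form solution:

$$p_{\theta}(x_{0} \mid y) \propto p_{ref}(x_{0} \mid y) \left[\exp\left(r(x_{0}, y)\right)\right]^{w}$$

For Rectified Flow, guidance is implemented by shifting the velocity field:

$$\tilde{v}_{t}(x_{t} \mid y) = v_{t}(x_{t} \mid y) - w\, \frac{t}{1 - t}\, \nabla r(x_{t}, y)$$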
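
One way to see where the factor $t / (1 - t)$ comes from is the link between velocity and score. The sketch below assumes a standard Gaussian $x_{1}$ and treats the reward evaluated on $x_{t}$ as a tilt of the noisy marginal, which is an approximation; it is offered as intuition rather than as the paper's own derivation.

```latex
\begin{aligned}
\nabla_{x_{t}} \log p_{t}(x_{t} \mid y) &= -\frac{1}{t}\, \mathbb{E}[x_{1} \mid x_{t}, y]
  && \text{(Gaussian } x_{1}\text{)} \\
v_{t}(x_{t} \mid y) &= \mathbb{E}[x_{1} - x_{0} \mid x_{t}, y]
  = \frac{\mathbb{E}[x_{1} \mid x_{t}, y] - x_{t}}{1 - t}
  = -\frac{x_{t} + t\, \nabla_{x_{t}} \log p_{t}(x_{t} \mid y)}{1 - t} \\
\log \tilde{p}_{t} &\approx \log p_{t} + w\, r(x_{t}, y) + \text{const}
  \;\Longrightarrow\;
\tilde{v}_{t}(x_{t} \mid y) = v_{t}(x_{t} \mid y) - w\, \frac{t}{1 - t}\, \nabla_{x_{t}} r(x_{t}, y)
\end{aligned}
```

The middle line uses $x_{1} - x_{0} = (x_{1} - x_{t}) / (1 - t)$, which follows from $x_{t} = (1 - t) x_{0} + t x_{1}$.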

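Noisy latent reward model: computing $\nabla r$ in pixel space requires backpropagating through the full VAE decoder, which is very expensive. The paper therefore trains a lightweight time-dependent reward model $r_{\phi}(\cdot, t)$ that operates directly in latent space:

The same noise is applied to both members of each preference pair $(x^{w}, x^{l})$.
The model is trained on noisy latents with the BTT loss.
The first few layers of the pretrained VDM are reused as the backbone.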

Flow-NRG pseudocode (adapted from Appendix C of the paper)

import torch

def reward_guidance(model, reward_model, prompt_embeds, latents,
                    timesteps, reward_weight, rg_scale, cfg_scale):
    """
    model: pretrained flow model
    reward_model: noisy-latent reward model r_phi(x_t, prompt, t)
    latents: initial noise
    timesteps: 1-D tensor of timesteps, decreasing from 1 toward 0
    reward_weight: tensor of weights over reward dimensions (e.g., [0.1, 0.1, 0.8] for VQ:MQ:TA)
    rg_scale: reward guidance strength w
    cfg_scale: classifier-free guidance strength
    """
    dts = timesteps[:-1] - timesteps[1:]  # step sizes
    for i, t in enumerate(timesteps[:-1]):  # the final timestep is the endpoint, not a step
        # 1. Velocity prediction (with CFG); no gradients are needed through the flow model
        with torch.no_grad():
            v_pred = model(latents, prompt_embeds, t)
            if cfg_scale != 1.0:
                v_pred_uncond = model(latents, None, t)
                v_pred = v_pred_uncond + cfg_scale * (v_pred - v_pred_uncond)

        # 2. Reward gradient with respect to the noisy latents
        latents = latents.detach().requires_grad_(True)
        reward = reward_model(latents, prompt_embeds, t)
        reward = (reward * reward_weight).sum()  # weighted sum over reward dimensions
        reward_grad = torch.autograd.grad(reward, latents)[0]  # grad() returns a tuple

        # 3. Guide the velocity field (skipped at t == 1, where t / (1 - t) is undefined)
        if t != 1:
            v_pred = v_pred - rg_scale * t / (1 - t) * reward_grad

        # 4. Euler step
        latents = latents.detach() - dts[i] * v_pred

    return latents

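3.7 Comparison of the three methods

Property	Flow-DPO	Flow-RWR	Flow-NRG
Stage	Training time	Training time	Inference time
Needs reference model	Yes	No	No
Needs reward score	No (implicit)	Yes (explicit weighting)	Yes (gradient)
Preference data	Pairwise pairs	Reward-labeled data	None at inference (the noisy reward model is trained on preference pairs)
Multi-dimensional control	Requires retraining	Requires retraining	Weights freely adjustable at inference
Performance	Best	Moderate	Flexible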

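4. Experimental Setup

4.1 Reward model training

Backbone: Qwen2-VL-2B-Instruct
Loss: BTT loss ($\theta = 5.0$)
Video sampling: 2 fps at fixed frame intervals, roughly $448 \times 448$ resolution (aspect ratio preserved)
Training data: 182k triplets (16k prompts × 12 T2V models)
Validation set: 13k triplets

4.2 Reward model evaluation

VideoGen-RewardBench (introduced in this paper): 26.5k annotated video pairs from modern T2V models, split into four categories (VQ / MQ / TA / Overall)
GenAI-Bench: short (2 s) videos from Pre-Sora-era models, used to evaluate cross-generation generalization
Metrics: pairwise accuracy, reported both with ties included and with ties excluded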
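
The two accuracy variants can be sketched as follows. The ties-excluded version drops pairs that humans labeled as tied and checks the sign of $\Delta r$. For the ties-included version, this sketch predicts a tie whenever $|\Delta r|$ falls within a fixed margin; that margin (`tie_margin`) is an assumption made here for illustration, since these notes do not describe how the benchmark chooses its tie threshold. All function names and data are hypothetical.

```python
def predict(delta_r: float, tie_margin: float) -> str:
    """Map a reward difference r_A - r_B to 'A', 'B', or 'tie'."""
    if abs(delta_r) <= tie_margin:
        return "tie"
    return "A" if delta_r > 0 else "B"


def accuracy_without_ties(pairs):
    """pairs: list of (delta_r, human_label); pairs labeled 'tie' by humans are dropped."""
    kept = [(d, y) for d, y in pairs if y != "tie"]
    correct = sum(predict(d, 0.0) == y for d, y in kept)
    return correct / len(kept)


def accuracy_with_ties(pairs, tie_margin: float):
    """Three-way accuracy over all pairs, using a fixed (assumed) tie margin."""
    correct = sum(predict(d, tie_margin) == y for d, y in pairs)
    return correct / len(pairs)


# Invented reward differences and human labels.
pairs = [(1.2, "A"), (-0.8, "B"), (0.1, "tie"), (0.4, "B"), (-0.05, "tie")]

print(accuracy_without_ties(pairs))    # 2 of the 3 non-tied pairs are ranked correctly
print(accuracy_with_ties(pairs, 0.2))  # 4 of the 5 pairs are classified correctly
```

The ties-included number depends on how well the score gap separates tied from non-tied pairs, which is where a BTT-trained model is expected to help.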

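4.3 Video alignment training

Base model $p_{ref}$: an internal Transformer-based Rectified Flow T2V model
Fine-tuning: LoRA (following SD3)
Reward source: VideoReward serves as the ground-truth reward and is used to relabel the training set
SFT baseline: trained only on the "chosen" data

4.4 Video alignment evaluation

Automatic metrics:
- Win rate (VideoReward score vs. the pretrained model)
- VBench score (Quality, Semantic, and other sub-scores)
Human evaluation: 2 annotators + 1 adjudicator
Prompt sources: VBench, VideoGen-Eval, TA-Hard (a hard TA test set built in this paper)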

5. Results

5.1 VideoReward performance

Method	GenAI-Bench (w/ Ties)	GenAI-Bench (w/o Ties)	VideoGen-RewardBench Overall (w/ Ties)	VideoGen-RewardBench Overall (w/o Ties)	VideoGen-RewardBench VQ (w/o Ties)	VideoGen-RewardBench MQ (w/o Ties)	VideoGen-RewardBench TA (w/o Ties)
Random	33.67	49.84	41.86	50.30	49.86	49.64	50.40
VideoScore	49.03	71.69	41.80	50.22	51.09	50.34	50.34
LiFT	37.06	58.39	39.08	57.26	54.91	55.43	55.43
VisionReward	51.56	72.41	56.77	67.59	59.03	60.98	61.15
Ours (VideoReward)	49.41	72.89	61.26	73.59	75.66	74.70	72.20

Key findings:

VideoReward leads across the board on VideoGen-RewardBench: overall accuracy (w/o Ties) reaches 73.59%, 6 percentage points above VisionReward (67.59%).
On GenAI-Bench (Pre-Sora-era models) it stays competitive (72.89% w/o Ties), which indicates good cross-generation generalization.
The lead is largest on VQ (75.66% vs. 59.03% for VisionReward), which suggests that existing reward models struggle to assess the visual quality of modern T2V models.

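5.2 Video alignment performance

Multi-dimensional alignment (VQ:MQ:TA = 1:1:1)

The VQ / MQ / TA columns are win rates (%) against the pretrained model, as judged by VideoReward.

Method	VBench Total	VBench Quality	VBench Semantic	VBench VQ	VBench MQ	VBench TA	VideoGen-Eval VQ	VideoGen-Eval MQ	VideoGen-Eval TA	TA-Hard VQ	TA-Hard MQ	TA-Hard TA
Pretrained	83.19	84.37	78.46	50.0	50.0	50.0	50.0	50.0	50.0	50.0	50.0	50.0
SFT	82.31	83.13	79.04	51.28	65.21	52.84	61.27	76.13	46.35	57.75	76.06	57.75
Flow-RWR	82.27	83.19	78.59	51.55	63.9	53.43	59.05	69.7	48.35	61.97	78.87	55.71
Flow-DPO ( $\beta_{t}$ )	80.90	81.52	78.42	87.78	82.36	51.02	88.44	91.23	28.14	84.29	83.10	38.03
Flow-DPO (constant $\beta$ )	83.41	84.19	80.26	93.42	69.08	75.43	90.95	81.01	68.26	77.46	71.43	73.24

Key findings:

Flow-DPO with $\beta_{t}$ shows severe reward hacking: on VBench prompts the VQ/MQ win rates are very high (87.78% / 82.36%) while TA barely moves (51.02%), and TA falls well below 50% on VideoGen-Eval (28.14%) and TA-Hard (38.03%). VBench Total also drops to 80.90.
Flow-DPO with constant $\beta$ is balanced across dimensions: on VBench prompts VQ reaches 93.42% and TA 75.43%, and VBench Total is 83.41 (above the pretrained model's 83.19).
Overall ranking: Flow-DPO > SFT > Flow-RWR. SFT is a reasonable baseline but cannot use the information carried by the rejected samples.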

Single-dimension alignment (optimizing TA only)

Method	VBench Total	VBench Quality	VBench Semantic	VBench TA	VideoGen-Eval TA	TA-Hard TA
Pretrained	83.19	84.37	78.46	50.00	50.00	50.00
SFT	82.71	83.48	79.62	52.88	53.81	64.79
Flow-RWR	82.40	83.36	78.58	59.66	49.50	66.20
Flow-DPO ( $\beta_{t}$ )	82.35	83.00	79.75	63.67	55.95	71.83
Flow-DPO (constant $\beta$ )	83.38	84.28	79.80	69.09	65.49	84.51

Flow-DPO with constant $\beta$ reaches an 84.51% win rate on TA-Hard, well ahead of every other method (the next best is 71.83%).

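5.3 Human evaluation

Figure 5: Human evaluation of the Flow-DPO aligned model vs. the pretrained model on VideoGen-Eval. Overall, DPO wins 44.0% of comparisons and the pretrained model wins only 27.0%. By dimension: on Visual Quality, DPO leads 31.8% vs. 21.0% (47.2% ties); on Motion Quality, DPO leads narrowly, 35.5% vs. 32.2%; on Text Alignment, DPO leads clearly, 29.5% vs. 22.2% (48.2% ties). The human evaluation supports Flow-DPO's improvement on all three dimensions.

5.4 Reward guidance results

Figure 4: Generated videos before and after Flow-DPO alignment. Across the three examples, the Flow-DPO model improves both text alignment and visual quality: in the first, the horse moves more naturally and the color tone is more harmonious; in the second, the woman's motion is smoother and the lighting more natural; in the third, the semantics of "mushrooms glow and flowers whisper" are rendered better (the glowing effect of the mushrooms improves visibly).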

Flexibility of Flow-NRG (Table 5, VideoGen-Eval, guidance strength $w = 100$):

VQ:MQ:TA weights	VQ win rate	MQ win rate	TA win rate
0:0:1	60.56	46.48	70.42
0.1:0.1:0.8	66.50	63.73	60.86
0.1:0.1:0.6	68.94	67.59	53.28
0.5:0.5:0	86.43	93.23	26.65

By adjusting the weights, users can set a custom multi-objective trade-off at inference time.

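5.5 $\beta$ ablation

Figure 6: TA performance of the timestep-dependent $\beta_{t} = \beta (1 - t)^{2}$ (dashed lines) and the constant $\beta$ (solid lines) at different values of $\beta$. The x-axis is $\log \beta$ (with $\beta$ from 100 to 8000) and the y-axis is TA accuracy. On all three evaluation sets (TA-Hard, VideoGen-Eval, VBench), the constant $\beta$ consistently outperforms the $\beta_{t}$ variant, and the gap widens at larger $\beta$. This supports the analysis that a timestep-dependent KL constraint lets the model drift too freely from the reference policy at high-noise steps.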

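Code-to-Paper Mapping

Paper content	Code location (GitHub: KwaiVGI/VideoAlign)	Notes
VideoReward inference (Sec 3.2)	`inference.py`	Scores a single video on three dimensions
VideoReward training (Sec 3.2, BTT loss)	`train_reward.py`	Built on TRL + Qwen2-VL-2B
VideoGen-RewardBench evaluation	`eval_videogen_rewardbench.py`	Reproduces the Table 2 results
Token positioning template (Sec 3.2)	Appendix K (paper)	Full input template
Flow-DPO loss (Eq. 6)	Appendix C pseudocode (paper p.20)	PyTorch implementation open-sourced for T2I
Flow-NRG guidance (Eq. 11)	Appendix C pseudocode (paper p.20)	PyTorch implementation
DeepSpeed config	`ds_config/`	Distributed training support
Model weights	HuggingFace: KwaiVGI/VideoReward	Qwen2-VL-2B based
Flow-DPO T2I training	Provided in the repository (see README)	Text-to-image implementation
Flow-DPO T2V training	Not open-sourced	Uses an internal T2V model; relies on LoRA
Flow-RWR / Flow-NRG full training	Not open-sourced	Pseudocode provided in the paper

Note: the GitHub repository mainly open-sources the VideoReward model (training, inference, and evaluation) and the T2I implementation of Flow-DPO. The T2V alignment part is not fully released because it depends on an internal model, but Appendix C of the paper provides complete PyTorch pseudocode for Flow-DPO and Flow-NRG.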
