UNIVERSE: Adapting Vision-Language Models for Evaluating World Models

Authors: Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot
Affiliations: University of Oxford, Microsoft Research
Venue: NeurIPS LAW 2025 (Oral)
Code: not released (as of March 2026, no official GitHub repository was found)

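Figure 1 interpretation: The left and middle panels compare UNIVERSE (orange bars) with task-specific baselines on two tasks, Action Recognition (AR) and Character Recognition (CR), across three formats: binary, multiple-choice, and open-ended. UNIVERSE outperforms all baselines on AR and ranks third on CR. The right panel shows that UNIVERSE trains on far fewer samples than the task-specific models (only about 3,894 / 1,298 / 973 AR samples and 5,192 / 324 CR samples per epoch), which reflects its high data efficiency.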

Paper information:

Title: Adapting Vision-Language Models for Evaluating World Models
Authors: Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot
Affiliations: University of Oxford, Microsoft Research
Venue: NeurIPS LAW 2025 (Oral)
Link: arXiv:2506.17967
Code: not released (as of March 2026, no official GitHub repository was found)

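1. Motivation

Core Problem

World models are generative models that simulate environment dynamics. They are widely used in game engines (e.g., Dreamer, DIAMOND), autonomous driving (GAIA-1), and embodied AI. Evaluating the rollouts (simulated trajectories) that these models generate, however, poses fundamental challenges:

Existing metrics fall short: metrics such as FID, IS, and SSIM focus on low-level visual features and cannot capture action alignment or semantic consistency
Missing temporal dependence: video metrics such as FVD ignore timestep-level alignment with the conditioning actions
Blind spot of multimodal metrics: multimodal metrics such as Jayasumana et al. (2024) do not account for action conditioning
Human evaluation is costly: human evaluation is the gold standard, but it is expensive and hard to scale
General-purpose VLMs perform poorly: even GPT-5 answers 5 of 6 samples incorrectly on the binary AR task, which indicates that task-specific adaptation is needed

Challenges to Address

How can we build an automated evaluator that is fine-grained and temporally aware?
How can we adapt a VLM efficiently when data and compute are limited?
How can a single unified model replace multiple task-specific checkpoints?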

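2. Idea

Core Idea

The paper proposes UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), which casts world model rollout evaluation as a visual question answering (VQA) task and achieves lightweight, efficient, unified evaluation through three core design choices:

Partial Fine-tuning: fine-tune only the projection head ( $θ_{P}$ ), updating just 0.07% of the model's parameters (2.66M / ~3B)
Efficient Frame Sampling: uniformly sample $k = 8$ frames from each 14-frame rollout to preserve temporal coverage
Mixed Supervision: train on a hierarchical task + format mixture so that a single model covers 2 tasks × 3 formats

Evaluation Protocol Design

Two structured recognition tasks are defined:

Action Recognition (AR): decide whether the generated sequence accurately reflects the agent's actions
Character Recognition (CR): assess whether an entity's identity and appearance stay consistent over time

Each task comes in three question-answer formats:

Binary: “Is the character evading right? Answer yes or no”
Multiple-Choice: “What is the character doing? A) jumping B) evading…”
Open-Ended: “What is the character doing?”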
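3. Method

3.1 Formal Definition

A world model is defined as $W : (o_{< t}, a_{< t}) \to o_{t}$ , where $o_{t} \in O$ is the perceptual observation at timestep $t$ and $a_{< t}$ is the sequence of past actions.

The UNIVERSE evaluator is formalized as a function:

$$E : (V, Q) \to \hat{A}$$

where $V = (o_{t_{1}}, \dots, o_{t_{k}}) \in O^{k}$ is the sequence of frames sampled from the rollout, $Q \in L$ is a natural-language question, and $\hat{A} \in L$ is the predicted answer. Evaluation quality is measured by comparing $\hat{A}$ with the reference answer $A$ .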

3.2 Model Architecture

Built on PaliGemma 2 3B, the model consists of three modules:

Component	Architecture	Parameters
Vision Encoder ( $M_{V}$ )	SigLIP-So400m, 27-layer Transformer encoder	400M
Projection Head ( $M_{P}$ )	Linear(1152 → 2304)	2.66M
Language Decoder ( $M_{L}$ )	Gemma 2 2B, 26-layer Transformer decoder	~2B
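As a quick sanity check on the table, the projection head's parameter count follows directly from its shape. The sketch below assumes the linear layer carries a bias term (an assumption; the notes give only the shape), and `linear_param_count` is an illustrative helper name, not something from the paper.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in a dense (fully connected) layer."""
    weights = in_features * out_features
    return weights + (out_features if bias else 0)


# Shape taken from the table: Linear(1152 -> 2304).
# The bias term is an assumption; the notes do not state it explicitly.
proj_params = linear_param_count(1152, 2304)
print(f"{proj_params:,} parameters (~{proj_params / 1e6:.2f}M)")
```

Under this assumption the count rounds to the 2.66M listed for the projection head.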

Input format:

Image resolution: $224 \times 224$ pixels → 256 patches per frame
Input sequence: $S = \{S_{I}, S_{T}^{PREF}, S_{T}^{SUFF}\}$
- $S_{I}$ : visual tokens from the $k$ frames
- $S_{T}^{PREF}$ : <BOS>, answer en, <QUESTION>, <SEP> (task cue + question)
- $S_{T}^{SUFF}$ : <ANSWER>, <EOS>, <PAD>,... (expected answer, training only)
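The 256-patches-per-frame figure can be reproduced with a short calculation. This is a sketch under two assumptions that the notes do not state: a patch size of 14 pixels (which gives a 16 × 16 grid at 224 × 224), and visual tokens from the $k$ frames being concatenated without pooling. The function name is illustrative.

```python
def visual_token_count(image_size: int, patch_size: int, num_frames: int) -> int:
    """Visual tokens produced by tiling each square frame into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side * num_frames


# patch_size=14 is an assumption; it yields the 256 patches per frame quoted above.
tokens_per_frame = visual_token_count(224, 14, num_frames=1)
# Assumes per-frame tokens are simply concatenated for the k = 8 sampled frames.
tokens_per_clip = visual_token_count(224, 14, num_frames=8)
print(tokens_per_frame, tokens_per_clip)
```

With these assumptions, an 8-frame input contributes 2,048 visual tokens ahead of the text prefix, which is one reason the number of sampled frames matters for cost.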

3.3 Training Objective

The model optimizes a causal language modeling loss over the answer suffix tokens:

$$L(S) = - \sum_{t = 1}^{T_{SUFF}} \log P (s_{t}^{SUFF} \mid S_{< t^{'}})$$

where $s_{t}^{SUFF}$ is the $t$ -th token of the suffix and $t^{'} = T_{I} + T_{PREF} + t$ is its position in the flattened sequence.
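A minimal sketch of this objective: given per-position log-probabilities for the flattened sequence, only the positions after the image and prefix tokens contribute to the loss. The function name and input layout are illustrative assumptions, and padding tokens in the suffix are not handled here.

```python
import math


def suffix_lm_loss(token_log_probs, t_image, t_prefix):
    """Negative log-likelihood summed over suffix positions only.

    token_log_probs[i] is assumed to hold log P(s_i | s_<i) for position i of
    the flattened sequence (image tokens, then prefix, then suffix).
    """
    start = t_image + t_prefix  # first suffix position (0-indexed)
    suffix = token_log_probs[start:]
    if not suffix:
        raise ValueError("sequence contains no suffix tokens")
    return -sum(suffix)


# Toy sequence: 2 image tokens, 2 prefix tokens, 2 suffix tokens.
# The first four values are ignored because they lie outside the suffix.
log_probs = [-9.0, -9.0, -9.0, -9.0, math.log(0.5), math.log(0.25)]
loss = suffix_lm_loss(log_probs, t_image=2, t_prefix=2)
print(round(loss, 4))
```

The image and prefix positions are masked out entirely, so the model is only penalized for mispredicting the answer.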

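3.4 Exploring Adaptation Strategies

The paper systematically explores a design space along three dimensions:

(a) Fine-tuning configuration

Strategy	Updated parameters	Notes
Zero-shot	None	Direct prompting
Full FT ( $F_{all}$ )	$θ_{V} \cup θ_{P} \cup θ_{L}$	Full-parameter fine-tuning
Two-component	Two of three	e.g., $F_{V + P}$ , $F_{V + L}$ , $F_{P + L}$
Single-component	One of three	$F_{V}$ , $F_{P}$ , $F_{L}$
LoRA	Low-rank adapters	$W \leftarrow W + \frac{α}{r} AB$ , $α = 8$ , $r \in \{8, 16, 32, 48, 64\}$

Key finding: fine-tuning only the projection head ( $F_{P}$ ) achieves the second-best performance, behind only vision encoder tuning ( $F_{V}$ ), while cutting the share of updated parameters from ~11% to 0.07%.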
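To make the LoRA row concrete, the update $W \leftarrow W + \frac{α}{r} AB$ can be sketched with plain nested lists. This is an illustration of the formula only: the matrix shapes, helper names, and toy values are assumptions, and the notes do not say which weight matrices the adapters were attached to.

```python
def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_update(W, A, B, alpha):
    """Return W + (alpha / r) * (A @ B), where r is the shared inner dimension."""
    r = len(B)  # rank of the low-rank factorization
    scale = alpha / r
    AB = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, AB)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters added by one adapter pair: A is d_out x r, B is r x d_in."""
    return r * (d_out + d_in)


# Toy example with rank r = 1 and alpha = 8 (so the scale factor is 8).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[3.0, 4.0]]
W_new = lora_update(W, A, B, alpha=8)
print(W_new)
```

Because the scale is $α / r$, holding $α = 8$ fixed while increasing $r$ shrinks the per-rank contribution, which is worth keeping in mind when comparing the ranks in the table.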

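(b) Frame sampling strategy

Two sampling methods are compared:

First-n: take the first $n$ frames
Uniform-n: sample $n$ frames evenly from the whole sequence

Key finding: Uniform-n consistently outperforms First-n across all formats, with the largest advantage at low frame counts. With 2 frames (First-n → Uniform-n): Binary 84.42% → 90.47%, MC 65.53% → 83.93%, OE 65.38% → 82.68%.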
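The two strategies differ only in which frame indices they select. The sketch below is one plausible implementation, assuming Uniform-n spaces indices evenly and includes both the first and last frame; the notes do not specify the exact index rule, and the function names are illustrative.

```python
def first_n(num_frames: int, n: int) -> list:
    """Indices of the first n frames."""
    if n <= 0 or n > num_frames:
        raise ValueError("n must be between 1 and num_frames")
    return list(range(n))


def uniform_n(num_frames: int, n: int) -> list:
    """n evenly spaced indices spanning the whole clip (endpoints included).

    Endpoint inclusion is an assumption, not something stated in the notes.
    """
    if n <= 0 or n > num_frames:
        raise ValueError("n must be between 1 and num_frames")
    if n == 1:
        return [0]
    step = (num_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


print(first_n(14, 2), uniform_n(14, 2))
print(uniform_n(14, 8))
```

For a 14-frame clip with 2 frames, First-n sees indices 0 and 1 while this uniform variant sees 0 and 13, so only the latter observes how the scene changes across the clip.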

(c) Data mixture optimization

A three-stage hierarchical grid search:

Fix the format distribution to uniform and vary the task ratio $α_{A R}$ vs $α_{CR}$
Fix $α_{A R} = 0.8$ and vary the open-ended share $β_{OE}$
Fix $β_{OE} = 0.8$ and split the remainder between binary and MC

Best configuration:

Task ratio: $α_{A R} = 0.8$ , $α_{CR} = 0.2$
Format ratio: $β_{binary} = 0.15$ , $β_{MC} = 0.05$ , $β_{OE} = 0.8$
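One way to read these ratios is as a joint sampling distribution over (task, format) cells. The sketch below assumes the task and format ratios are applied independently, so each cell's probability is the product $α_{task} \cdot β_{format}$; that independence, and the helper names, are assumptions rather than details confirmed by the notes.

```python
import random
from itertools import product

ALPHA_TASK = {"AR": 0.8, "CR": 0.2}
BETA_FORMAT = {"binary": 0.15, "MC": 0.05, "OE": 0.8}


def joint_mixture(alpha, beta):
    """Probability of each (task, format) cell, assuming independent ratios."""
    return {(task, fmt): alpha[task] * beta[fmt] for task, fmt in product(alpha, beta)}


def sample_cells(mix, n, seed=0):
    """Draw n (task, format) cells according to the mixture weights."""
    rng = random.Random(seed)
    cells = list(mix)
    weights = [mix[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=n)


mix = joint_mixture(ALPHA_TASK, BETA_FORMAT)
for cell, prob in sorted(mix.items(), key=lambda item: -item[1]):
    print(cell, round(prob, 3))
```

Under this reading, open-ended AR questions make up 64% of the training stream, while multiple-choice CR questions make up only 1%.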

3.5 Final UNIVERSE Configuration

UNIVERSE Configuration:
├── Base Model: PaliGemma 2 3B
├── Adaptation: Partial fine-tuning (projection head only, 2.66M params, 0.07%)
├── Frame Sampling: Uniform-8 from 14-frame rollout
├── Training Mix:
│   ├── Task: α_AR = 0.8, α_CR = 0.2
│   └── Format: β_binary = 0.15, β_MC = 0.05, β_OE = 0.8
└── Evaluation: Exact Match (EM) + ROUGE-F1

3.6 Evaluation Metrics

Exact Match (EM):

$$EM = \mathbb{1} (\hat{A} = A)$$

ROUGE-F1:

$$P = \frac{| G \cap GT |}{| G |}, \quad R = \frac{| G \cap GT |}{| GT |}$$

$$ROUGE = 2 \times \frac{P \times R}{P + R}$$

Binary uses bigrams; MC and OE use trigrams.
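Both metrics are simple to compute. The sketch below is a hedged illustration, not the paper's scoring code: it assumes lowercase normalization, whitespace tokenization, word-level n-gram sets for $G$ and $GT$, and that $n$ shrinks for answers shorter than $n$ tokens. None of these choices are specified in the notes.

```python
def ngrams(tokens, n):
    """Set of n-grams; n is capped at the token count (an assumption)."""
    n = min(n, len(tokens))
    if n == 0:
        return set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())


def rouge_f1(pred: str, ref: str, n: int) -> float:
    """F1 over n-gram set overlap between prediction (G) and reference (GT)."""
    g = ngrams(pred.lower().split(), n)
    gt = ngrams(ref.lower().split(), n)
    overlap = len(g & gt)
    if overlap == 0:
        return 0.0
    precision = overlap / len(g)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Yes", "yes"))
print(rouge_f1("the character is jumping", "the character is evading", n=3))
```

EM gives no credit for near misses, while ROUGE-F1 rewards partial overlap, which is why the two are reported together for open-ended answers.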

3.7 Pseudocode

Dataset construction pipeline:

def DatasetCreation(S, L=14):
    """S: gameplay sessions, L: clip length in frames"""
    # Stage 1: Preprocessing
    S_valid = Preprocessing(S)  # filter corrupted sessions
    V = []
    for (v, c, m) in S_valid:  # video, controls, metadata
        V_s = Segment(v, c, m, L)  # segment into L-frame clips (L=14)
        V.extend(V_s)
 
    # Stage 2: Description Generation
    Z = []
    for (frames, controls, meta) in V:
        d = Describe(meta, controls)  # structured NL description
        Z.append((frames, d))
 
    # Stage 3: QA Pair Construction
    Y_AR, Y_CR = GetAnswerSpace(Z)  # action & character label sets
    D = []
    for (frames, d) in Z:  # iterate the described clips (Z), not the raw clips (V)
        QA = GenerateQAPairs(d, Y_AR, Y_CR)  # QA pairs for one clip (the dataset reports 6 per clip)
        for (Q, A) in QA:
            D.append((frames, Q, A))
    return D  # 194,718 train + 48,678 val QA pairs
 
def GenerateQAPairs(d, Y_AR, Y_CR):
    QA = []
    for task, Y_task in [("AR", Y_AR), ("CR", Y_CR)]:
        y = ExtractLabel(d, task)
        # Binary: positive + negative
        y_neg = SampleDistractor(Y_task - {y})  # any label other than the true one
        QA.append((FormatBinaryPrompt(task, y), "yes"))
        QA.append((FormatBinaryPrompt(task, y_neg), "no"))
        # Multiple-Choice
        O = FormatOptions(Y_task)
        QA.append((FormatMCPrompt(task, O), y))
        # Open-Ended
        QA.append((FormatOEPrompt(task), y))
    return QA

Adaptation and evaluation pipeline:

class AdaptationDatasetBuilder:
    def build(self, alpha_task, beta_format, context_length, sampling_strategy):
        """
        alpha_task: {AR: 0.8, CR: 0.2}
        beta_format: {binary: 0.15, MC: 0.05, OE: 0.8}
        context_length: k=8 frames
        sampling_strategy: 'uniform'
        """
        samples = stratified_sample(self.data, alpha_task, beta_format)
        dataset = []
        for sample in samples:
            frames = sample_visual_context(sample, context_length, sampling_strategy)
            dataset.append((frames, sample.question, sample.answer))
        return dataset
 
class VLMAdapter:
    def adapt(self, base_vlm, adaptation_data, num_steps, strategy='projection_only'):
        # projection_only: freeze everything except the projection head
        for name, param in base_vlm.named_parameters():
            param.requires_grad = (name in projection_head_params)
 
        for step in range(num_steps):
            batch = sample_batch(adaptation_data)
            loss = compute_loss(base_vlm, batch)  # causal LM loss on suffix
            loss.backward()
            update_model(base_vlm)  # optimizer step on the trainable parameters
 
class Universe:
    def evaluate_rollout(self, rollout, target, complexity):
        question = generate_question(target, complexity)  # AR/CR × binary/MC/OE
        # Majority vote over 5 greedy decodings
        responses = [self.adapted_vlm(rollout, question) for _ in range(5)]
        return majority_vote(responses)
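The `majority_vote` helper used in `evaluate_rollout` is not spelled out in the pseudocode. A minimal sketch, assuming answers are compared after trimming whitespace and lowercasing, and that ties go to the answer seen first (both are assumptions):

```python
from collections import Counter


def majority_vote(responses):
    """Return the most frequent response after light normalization."""
    if not responses:
        raise ValueError("responses must not be empty")
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    # most_common orders equal counts by first occurrence, so ties
    # resolve to the earliest answer in the list.
    return counts.most_common(1)[0][0]


print(majority_vote(["yes", "no", "Yes ", "yes", "no"]))
```

Using an odd number of votes, as the 5 decodings in the pseudocode do, avoids ties for binary questions.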

4. Experimental Setup

4.1 Dataset

Attribute	Value
Data source	Bleeding Edge gameplay recordings (Skygarden environment)
Training set	32,453 clips → 194,718 QA pairs
Validation set	8,113 clips → 48,678 QA pairs
Clip length	14 frames (60 FPS video)
Annotation	6 QA pairs / clip (2 tasks × 3 formats)
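The QA pair counts in the table follow from the clip counts and the 2 tasks × 3 formats layout. The short check below only reproduces that arithmetic; the variable names are illustrative.

```python
QA_PER_CLIP = 2 * 3  # 2 tasks x 3 formats

train_clips, val_clips = 32_453, 8_113
train_pairs = train_clips * QA_PER_CLIP
val_pairs = val_clips * QA_PER_CLIP

# Share of clips held out for validation.
val_fraction = val_clips / (train_clips + val_clips)

print(train_pairs, val_pairs, round(val_fraction, 3))
```

The clip counts work out to roughly an 80/20 train/validation split.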

4.2 Baselines

Zero-shot VLMs (7):

VideoLLaMA3: 2B, 7B
PaliGemma 1 3B, PaliGemma 2 3B, PaliGemma 2 10B

Fine-tuned PaliGemma 2 variants (8 strategies):

Single-component: $F_{V}$ , $F_{P}$ , $F_{L}$
Two-component: $F_{V + P}$ , $F_{V + L}$ , $F_{P + L}$
Full: $F_{all}$
LoRA: $r = 8$ , $α = 8$

4.3 Training Hyperparameters

Hyperparameter	Value
GPU	8× NVIDIA A100 (40GB)
Batch size	1 per device × 8 devices × 4 gradient-accumulation steps = 32 effective
Optimizer	AdamW ( $β_{1} = 0.9$ , $β_{2} = 0.999$ )
Learning rate	$5 \times 10^{-5}$ , cosine annealing
LR warmup	10% of training steps
Weight decay	$1 \times 10^{-6}$
Gradient clipping	Global norm, threshold 1.0
Epochs	1-10
Precision	bfloat16
Input resolution	$224 \times 224$
Total compute	5,154 GPU-days (14.12 GPU-years)
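The learning-rate schedule in the table (10% warmup followed by cosine annealing from a peak of $5 \times 10^{-5}$) can be sketched as a function of the step index. The linear warmup shape and decay to zero are assumptions; the notes specify only the warmup fraction and the cosine schedule.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.10):
    """Learning rate with linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Effective batch size: per-device batch x devices x accumulation steps.
effective_batch = 1 * 8 * 4

for s in (0, 99, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```

The effective batch size of 32 comes from 8 GPUs with a per-device batch of 1 and 4 accumulation steps.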

4.4 Human Evaluation Setup

Attribute	Value
Evaluation environments	8 (1 in-domain + 7 unseen)
World models	Large model 1.6B (300×180) + small model 140M (128×128)
Per setting	30 rollouts × 6 QA pairs
Total rollouts / QA instances	240 / 1,440
Annotators	2 primary annotators + 1 adjudicator
Rating scale	Correct (1.0), Partially Correct (0.5), Incorrect (0.0), Unclear
Agreement measure	Cohen’s $κ$
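The rating scale and agreement measure in this table can be sketched as follows. The label strings, the choice to exclude "Unclear" ratings from the average, and the function names are assumptions made for illustration; Cohen's $κ$ itself is the standard two-rater formula.

```python
from collections import Counter

SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def graded_accuracy(labels):
    """Mean score over rated items; 'unclear' items are skipped (an assumption)."""
    scored = [SCORES[label] for label in labels if label in SCORES]
    if not scored:
        raise ValueError("no scorable labels")
    return sum(scored) / len(scored)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


a = ["correct", "correct", "incorrect", "incorrect"]
b = ["correct", "correct", "incorrect", "correct"]
print(graded_accuracy(a), cohens_kappa(a, b))
```

Kappa corrects raw agreement for chance: in the toy example the raters agree on 3 of 4 items, but half of that agreement is expected by chance.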

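5. Experimental Results

5.1 Zero-shot Performance

Figure 16 interpretation: ROUGE-F1 of the three PaliGemma variants in the zero-shot setting, broken down by model, question type, and number of frames. Overall performance is limited (AR peaks at roughly 30%, CR is lower). PaliGemma 2 10B is slightly better in absolute terms but not far ahead of PaliGemma 2 3B. Adding frames gives AR a slight boost but yields diminishing returns for CR, which confirms the need for task-specific adaptation.

Key numbers:

VideoLLaMA3: AR ≤ 12.7%, CR ≤ 6.4%
PaliGemma 2 3B (best zero-shot): AR 29.7%, CR 17.2%
GPT-5: 5 of 6 samples answered incorrectly on binary AR

5.2 Comparison with Fine-tuned Baselines

Figure 2 interpretation: Left (AR): UNIVERSE performs best among all adaptation strategies. Right (CR): UNIVERSE ranks third, behind only vision encoder tuning ( $F_{V}$ ) and $F_{V + L}$ , but these update roughly 11% of the parameters or more and use over 5× as much task-specific CR data. A key finding is that LoRA performs very poorly in this setting, whereas fine-tuning only the projection head offers an excellent cost-performance trade-off.

Core comparison (Table 5, 1 epoch, 8 frames):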

策略	AR-Binary EM	AR-MC EM	AR-OE EM	CR-Binary EM	CR-MC EM	CR-OE EM
$F_{L}$	50.00	13.13	13.13	50.00	98.92	98.98
$F_{P}$ (UNIVERSE)	83.97	61.43	61.68	99.09	99.22	99.15
$F_{V}$	83.70	63.40	66.03	99.31	99.14	99.61
$F_{all}$	74.35	13.13	13.13	50.00	97.67	96.55
$F_{LoRA}$	44.66	0.02	0.00	48.76	0.00	0.00

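5.3 Supervision & Temporal Context Analysis

Figure 3 interpretation: AR performance improves jointly with the number of training epochs and the number of input frames. All three formats (Binary, MC, OE) increase monotonically along both dimensions, with the best results at 10 epochs + 8 frames. This indicates that AR needs both sufficient temporal context and sufficient training.

Figure 5 interpretation: Exact Match accuracy for AR (top) and CR (bottom) as a function of epoch. CR converges very quickly, exceeding 97% EM on MC and OE (and 91% on Binary) after about 12.5% of an epoch (~4K samples). AR needs more training and improves substantially only when temporal input is available (with 1 frame, even 10 epochs bring limited gains).

Key numbers:

CR: 0.125 epoch → 91.63% EM (Binary), 97.08% EM (MC), 97.28% EM (OE)
AR (10 epochs): Binary 85.11%, MC 62.88%, OE 66.82%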

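5.4 Frame Sampling Strategy

Figure 4 interpretation: Uniform-n (orange) outperforms First-n (blue) at every frame count and in every format. The gap is largest at low frame counts (about 6-18 percentage points with 2 frames) and narrows at 8 frames, though it does not disappear. This confirms the importance of uniform temporal coverage for AR.

Key numbers (2 frames):

Binary: First-n 84.42% → Uniform-n 90.47% (+6.05 pp)
MC: First-n 65.53% → Uniform-n 83.93% (+18.40 pp)
OE: First-n 65.38% → Uniform-n 82.68% (+17.30 pp)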

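5.5 Data Mixture Optimization

Figure 6 interpretation: A three-stage hierarchical ablation. Left: raising $α_{A R}$ to 0.8 substantially improves AR (especially MC), while CR stays largely stable. Middle: with $α_{A R} = 0.8$ fixed, raising $β_{OE}$ to 0.8 improves AR-OE performance. Right: with $β_{OE} = 0.8$ , results are fairly robust to the binary/MC split, and $β_{binary} = 0.15, β_{MC} = 0.05$ is slightly better.

Figure 7 interpretation: Default-Mix (equal proportions) vs Optimized-Mix. The optimized mixture yields large gains on AR (MC: 33.4% → 88.9%, OE: 33.5% → 89.6%) while CR performance remains competitive, which shows how much the data mixture matters.

Optimized vs Default Mix (4 epochs):

Format	AR Default	AR Optimized	CR Default	CR Optimized
Binary	99.1%	94.6%	99.1%	94.2%
MC	33.4%	88.9%	99.1%	97.1%
OE	33.5%	89.6%	99.3%	97.9%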

5.6 Human Evaluation

Figure 9 interpretation: Graded accuracy of UNIVERSE across the 8 evaluation settings. Settings 1-7 use the large model (1.6B, 300×180); Setting 8 uses the small model (140M, 128×128). Higher-resolution rollouts are evaluated more reliably. Performance stays stable across environments (settings 2-7 are unseen environments). Cohen’s $κ$ ranges from 0.59 to 0.91, which the notes characterize as “substantial” agreement.

Key numbers:

In-domain (Setting 1): AR graded ~75.0%, CR ~82.0%, Overall ~73.0%
Unseen envs (Settings 2-7): AR 68-85%, CR 90-99%
Low-res Setting 8: a clear overall drop (AR 35.7%, CR 10.7-60.7%), because upsampling 128×128 to 224×224 introduces blur
Cohen’s $κ$ : 0.59-0.91 (substantial agreement)

5.7 Compute Cost Breakdown

Experiment category	GPU-days
Zero-shot evaluation	136
Baseline fine-tuning	864
Analysis experiments	2,554
Human evaluation	1,125
Development/failed runs	1,599
Total	5,154 (14.12 GPU-years)

Code-to-Paper Mapping

Paper component	Implementation module	Notes
Section 3 - Dataset	`AdaptationDatasetBuilder`	Builds QA pairs from ground truth; supports `alpha_task`, `beta_format`, `context_length`, `sampling_strategy`
Section 3 - Training	`VLMAdapter`	Applies an adaptation strategy to the base VLM, controlled by the `strategy` parameter (projection_only for UNIVERSE)
Section 3 - Rollout Gen	`RolloutsGenerator`	Autoregressively samples world model rollouts, maintaining the `o_lt` and `a_lt` lists
Section 6 - Evaluation	`Universe`	Inference engine; the `evaluate_rollout` method takes a rollout + specification and builds the prompt via `generate_question`
Algorithm 1	`DatasetCreation`	Three-stage pipeline: Preprocessing → Description → QA Construction
Eq. (1)	`compute_loss`	Causal LM loss on answer suffix tokens
Table 3	Training config	AdamW, lr= $5 \times 10^{-5}$ , cosine, bfloat16, 8×A100
Figure 1	Evaluation comparison	`Universe.evaluate_rollout` on validation set

Note: as of March 2026, the paper has no public official code repository. The module names above come from the implementation description in the paper's Appendix C.1.
