RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Paper: arXiv:2605.10899v1

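1. Motivation

The core bottleneck of existing Deep Research Agents is not that they cannot search, but that open-ended long-report tasks are hard to train with RL. The final answer has no unique ground truth, the trajectory spans many rounds of planning, retrieval, evidence assessment, and writing, and with only a terminal score it is hard to tell whether an error came from Plan, Research, Review, or Answer.

The paper asks how, in long-horizon research tasks beyond verifiable rewards, the rubric can be upgraded from a final scoring sheet into a shared interface that runs through policy execution, judge feedback, and agent memory. Concretely, the agent first uses rubrics to organize its thinking, the judge then assigns dense credit under stage rubrics, and the judged experience is finally distilled into a retrievable reflection memory.

This matters because once such long-horizon agents without reference answers can be trained, RL is no longer limited to verifiable tasks such as math, code, or exact-answer QA. It can extend to research reports, scientific surveys, data analysis, tutoring, and complex tool chains, where quality is multi-dimensional and evaluation is semantic.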

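2. Idea

Core insight: a rubric should not only be an evaluator applied after the output; it should be the intermediate language shared by agent, judge, and memory. RubricEM first decomposes the long trajectory explicitly into four stages, Plan → Research → Review → Answer, and has the agent self-generate task-specific rubrics. It then uses a stagewise rubric judge to turn long-horizon credit assignment from a single terminal reward into stage-level semantic rewards. Finally, a shared-backbone reflection meta-policy converts judged trajectories into reusable guidance.

The key innovation can be summarized as "structure → assign → evolve": a rubric-guided scaffold exposes task structure, Stage-Structured GRPO assigns structured credit, and the reflection meta-policy plus rubric bank let experience update both the parameters and a textual memory.

Compared with end-to-end deep research RL such as DR Tulu, RubricEM does not merely train longer or add a final judge reward; it models the intermediate stages explicitly. DR Tulu relies mainly on terminal or relatively coarse-grained feedback, whereas RubricEM has Plan, Research, Review, and Answer judged, attributed, and reused separately, which makes it better suited to semantic credit assignment for open-ended long reports.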

3. Method

3.1 Overall framework

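Reading Figure 1: RubricEM's pipeline runs left to right in three parts. On the left, the Qwen3-8B task policy generates several stage-structured trajectories; each unfolds as Plan/Research/Review/Answer, with self-generated rubrics guiding later decisions. In the middle, an LLM judge scores each stage separately, and Stage-Structured GRPO computes stagewise advantages from the per-stage scores. On the right, the shared-backbone Reflection Meta-Policy samples reflection candidates from judged trajectories; those selected by the judge are written to the Rubric Bank and later feed back into new problems through cross-episode transfer or within-episode refinement.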

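The training target is a tool-augmented deep research agent. Given a query $x$, the agent interacts with a tool environment $\mathcal{E}$ and produces a trajectory

$$\tau = (a_1, o_1, a_2, o_2, \ldots, a_T, o_T),$$

where each $a_t$ is either a text segment or a structured tool call and $o_t$ is the corresponding tool return. The policy $\pi_\theta$ samples autoregressively:

$$a_t \sim \pi_\theta(\cdot \mid x, a_{<t}, o_{<t}).$$

On top of this, RubricEM adds three layers of structure: the rubric-guided scaffold, SS-GRPO, and reflection meta-policy training.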

3.2 Rubric-guided structured reasoning scaffold

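Reading Figure 3: the scaffold organizes the agent's output into four XML-like stages. Plan performs a deep analysis and produces rubrics and a research plan. Research loops call_tool → tool_output → state_evaluation and may revise the plan in light of the evidence. Review maps the evidence back onto the rubrics and forms a writing plan. Answer generates the final long-form answer. The key point is that rubrics are produced in the Plan stage but carry through Research, Review, and Answer, rather than being used only for scoring at the end.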
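A minimal sketch of how XML-like output of this kind could be split into stage blocks. The container tag names (`structured_plan`, `research`, `review`, `answer`) and both helper functions are assumptions made for illustration; the paper's actual tag schema may differ.

```python
import re

# Hypothetical stage container tags; the real schema may differ.
STAGE_TAGS = {
    "Plan": "structured_plan",
    "Research": "research",
    "Review": "review",
    "Answer": "answer",
}


def parse_stage_blocks(text):
    """Return {stage: inner text} for each stage tag found, in scaffold order."""
    blocks = {}
    for stage, tag in STAGE_TAGS.items():
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match:
            blocks[stage] = match.group(1).strip()
    return blocks


def missing_stages(blocks):
    """List stages the rollout skipped, e.g. to flag malformed trajectories."""
    return [stage for stage in STAGE_TAGS if stage not in blocks]


sample = (
    "<structured_plan><rubrics>cite primary studies</rubrics></structured_plan>"
    "<research><call_tool>search(sleep aging)</call_tool></research>"
    "<answer>Final report.</answer>"
)
parsed = parse_stage_blocks(sample)
print(list(parsed))            # stages that were found
print(missing_stages(parsed))  # stages that were skipped
```

Splitting by stage is what later allows each block to be scored and credited separately.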

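The paper motivates explicit stages with a stage-information theorem. Let $H$ be the history at a random decision point of the trajectory, $Z$ a compressed state representation of $H$, $S$ the current stage label, and $Q(h, a)$ the expected downstream value of taking action $a$ at history $h$ and then continuing the rollout. Define the best value achievable when the action may depend on $Z$ alone, and when it may also depend on $S$:

$$V_Z = \mathbb{E}\Big[\max_a \mathbb{E}\big[Q(H, a) \mid Z\big]\Big], \qquad V_{Z,S} = \mathbb{E}\Big[\max_a \mathbb{E}\big[Q(H, a) \mid Z, S\big]\Big].$$

If, with positive probability, the same compressed state $Z$ occurs under different stages whose optimal action sets are disjoint, then

$$V_{Z,S} > V_Z.$$

Intuitively, in long-horizon research the same local context can call for entirely different actions: Plan should decompose the problem first, Research should retrieve evidence, Review should check for gaps, and Answer should integrate and write. Without a stage label, the policy has to guess the current mode from local tokens and can easily confuse modes over a long trajectory. An explicit stage exposes the current decision mode as a conditioning variable.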
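The argument can be made concrete with a toy example. The sketch below assumes a single compressed state that is equally likely to occur in the Research or the Answer stage, with two actions whose values flip between stages. All numbers are invented for illustration and do not come from the paper.

```python
# Toy numbers, invented for illustration only.
STAGE_PROB = {"Research": 0.5, "Answer": 0.5}  # P(stage | same compressed state)
Q = {  # expected downstream value of each action in each stage
    "Research": {"search": 1.0, "write": 0.2},
    "Answer": {"search": 0.1, "write": 0.9},
}
ACTIONS = ["search", "write"]


def value_without_stage():
    """Best single action when the policy cannot see the stage label."""
    return max(
        sum(STAGE_PROB[s] * Q[s][a] for s in STAGE_PROB) for a in ACTIONS
    )


def value_with_stage():
    """Best action chosen separately for each stage."""
    return sum(STAGE_PROB[s] * max(Q[s].values()) for s in STAGE_PROB)


gap = value_with_stage() - value_without_stage()
print(f"without stage label: {value_without_stage():.2f}")
print(f"with stage label:    {value_with_stage():.2f}")
print(f"gap:                 {gap:.2f}")
```

Because the best action differs between the two stages, any stage-blind choice is wrong in one of them, which is where the strictly positive gap comes from.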

3.3 Stage-Structured GRPO (SS-GRPO)

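RubricEM's second step turns the scaffold into the unit of reward and advantage attribution. Given $G$ rollouts $\{\tau_i\}_{i=1}^{G}$ for the same query $x$, each trajectory is split into $K$ semantic stages. Let $y_{i,k}$ be the token block of rollout $i$ in stage $k$, and $s_{i,k}$ the score the LLM judge assigns to it under that stage's rubrics.

Rather than broadcasting the terminal score to every token, SS-GRPO defines a causal stage-dependence matrix $\Lambda = [\lambda_{k,j}] \in \mathbb{R}^{K \times K}$ with $\lambda_{k,j} = 0$ for $j < k$, and computes the stage return

$$R_{i,k} = \sum_{j=k}^{K} \lambda_{k,j}\, s_{i,j}.$$

Each stage keeps its own local score and also receives credit for its contribution to downstream stages. Returns are then normalized within each stage across the rollout group:

$$A_{i,k} = \frac{R_{i,k} - \operatorname{mean}_{i'}\big(R_{i',k}\big)}{\operatorname{std}_{i'}\big(R_{i',k}\big) + \epsilon}.$$

All tokens in the same stage block share $A_{i,k}$, and the objective is

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G} |y_i|} \sum_{i=1}^{G} \sum_{k=1}^{K} \sum_{t \in y_{i,k}} \min\Big(\rho_{i,t}\, A_{i,k},\ \operatorname{clip}\big(\rho_{i,t},\, 1-\eta,\, 1+\eta\big)\, A_{i,k}\Big)\right] - \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

where

$$\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}$$

is the token-level importance ratio, $\eta$ is the clipping range, and $\beta$ is the KL coefficient.

The paper also specifies the concrete $\Lambda$ used in its experiments.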
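A worked example may help here. In the sketch below, a stage's return is its own judge score plus weighted scores of later stages, and each stage's return is then normalized across the rollout group. The weights in `LAMBDA` are invented for illustration and are not the matrix used in the paper. Using the population standard deviation is likewise an arbitrary choice.

```python
from statistics import mean, pstdev

STAGES = ["Plan", "Research", "Review", "Answer"]

# Hypothetical weights: LAMBDA[k][j] is how much stage j's score counts toward
# stage k's return. Entries with j < k are zero (no credit from earlier stages).
LAMBDA = [
    [1.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]


def stage_returns(scores):
    """Map one rollout's per-stage judge scores to per-stage returns."""
    n = len(scores)
    return [sum(LAMBDA[k][j] * scores[j] for j in range(k, n)) for k in range(n)]


def stage_advantages(group_scores, eps=1e-6):
    """Normalize each stage's return across the rollout group."""
    returns = [stage_returns(s) for s in group_scores]
    advantages = [[0.0] * len(STAGES) for _ in returns]
    for k in range(len(STAGES)):
        column = [r[k] for r in returns]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            advantages[i][k] = (value - mu) / (sigma + eps)
    return advantages


# Two rollouts that differ only in their Answer score.
group = [[0.8, 0.6, 0.5, 1.0], [0.8, 0.6, 0.5, 0.0]]
for row in stage_advantages(group):
    print([round(a, 2) for a in row])
```

With these weights the rollout with the better answer also gets a positive Plan advantage, even though both Plan scores are equal. With an identity matrix the Plan advantage would be zero, so the off-diagonal weights are what carry downstream credit back to earlier stages.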

3.4 Evolving-rubric judge and reflection meta-policy

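Reading Figure 4: the top half shows two coupled judge-agent loops. In the task-policy loop, the LLM judge compares several stage-structured rollouts for the same query, generates or updates a discriminative rubric buffer for each stage, and then scores every stage of every rollout to form the SS-GRPO reward. In the reflection loop, the shared backbone generates several reflection candidates for one query-trajectory pair; the judge scores them using the task rollout's rubric scores and justifications, the top-scoring reflection is written to the Rubric Bank, and the reflection tokens are also used to update the shared policy with GRPO. The bottom half shows the asynchronous implementation: reflection generation, judging, and update run with a delay on the previous round's trajectories so they do not block the current task rollout.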

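The judge itself does not update any parameters. Instead it maintains a rubric buffer for each of the four stages (Plan, Research, Review, Answer): it discovers new failure modes from on-policy rollouts, keeps rubrics that separate good trajectories from bad ones, and drops rubrics that have lost discriminative power. The intermediate reward is therefore not a fixed template but evolves with the policy distribution.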
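One way to make "discriminative power" concrete is the spread of a rubric's verdicts across rollouts: a rubric that every rollout passes, or every rollout fails, cannot separate good trajectories from bad ones. The sketch below ranks rubrics by that spread. The variance criterion, the function name, and the data layout are assumptions for illustration; in the paper the LLM judge makes this decision.

```python
from statistics import pvariance


def prune_by_discrimination(rubric_verdicts, max_items):
    """Keep the rubrics whose verdicts vary most across rollouts.

    rubric_verdicts: {rubric text: [score per rollout, each in [0, 1]]}
    max_items: buffer cap for this stage
    """
    spread = {text: pvariance(v) for text, v in rubric_verdicts.items()}
    # Rubrics with zero spread no longer separate rollouts, so drop them.
    informative = [text for text in rubric_verdicts if spread[text] > 0]
    informative.sort(key=lambda text: spread[text], reverse=True)
    return informative[:max_items]


verdicts = {
    "Plan lists concrete sub-questions": [1, 1, 1, 1],  # every rollout passes
    "Plan states what evidence would change the answer": [1, 0, 0, 1],
    "Plan names at least one primary source type": [1, 1, 1, 0],
}
kept = prune_by_discrimination(verdicts, max_items=3)
print(kept)
```

The first rubric is dropped even though the cap would allow it, because all four rollouts satisfy it and it contributes no signal to a group-normalized advantage.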

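The Reflection Meta-Policy shares its backbone with the task policy. After the task rollouts are judged, the system samples one query-trajectory pair, keeps the original trajectory fixed as context, and backpropagates only through the reflection tokens. A privileged judge scores each reflection from two angles: whether it helps within-episode refinement on the same query, and whether it transfers to similar queries (cross-episode transfer). Accepted reflections enter the agent's rubric bank as natural-language memory.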
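The two uses of the bank can be sketched as an exact lookup keyed by the query, for within-episode refinement, and a similarity search over other queries, for cross-episode transfer. `ToyRubricBank`, the bag-of-words `embed` stand-in, and the choice to exclude the query itself from similarity search are all assumptions for illustration; a real bank would use a learned embedding model.

```python
import hashlib
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


class ToyRubricBank:
    def __init__(self, top_k=2):
        self.top_k = top_k
        self.items = {}  # query hash -> (query, reflection)

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def upsert(self, query, reflection):
        self.items[self._key(query)] = (query, reflection)

    def retrieve(self, query, mode):
        if mode == "exact":  # within-episode: the same query seen earlier
            hit = self.items.get(self._key(query))
            return [hit[1]] if hit else []
        # cross-episode: most similar other queries
        q_vec, q_key = embed(query), self._key(query)
        scored = [
            (cosine(q_vec, embed(other)), reflection)
            for key, (other, reflection) in self.items.items()
            if key != q_key
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for score, r in scored[: self.top_k] if score > 0]


bank = ToyRubricBank(top_k=1)
bank.upsert("how does sleep affect cognitive decline", "Separate causal from correlational studies.")
bank.upsert("best rust web frameworks", "Compare maintained projects only.")
print(bank.retrieve("how does sleep affect memory", mode="similar"))
print(bank.retrieve("best rust web frameworks", mode="exact"))
```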

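Reading Figure 2: the example, "how do sleep patterns affect aging cognitive decline", shows the full closed loop within a single RL step. Top right is the deep analysis, rubrics, and research plan produced in the Plan stage. On the left, the Research stage retrieves evidence with the search tool and performs state evaluation. Bottom left, the Review stage checks whether the evidence satisfies rubrics such as causal framing and search precision. Middle right, the Answer stage generates the final answer. At the bottom, Stagewise Rubrics shows that the judge can issue positive and negative verdicts per stage (Plan, Research, Answer, and so on). Bottom right, the reflection distills the experience from this trajectory into a retrospective rubric and a takeaway for later memory use.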

3.5 Windowed curriculum and asynchronous memory update

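To support both cross-episode and within-episode memory use, RubricEM adopts a windowed curriculum. In the experiments $K = 3$: within each window of $2K = 6$ steps, the first 3 steps sample fresh batches and use cross-episode retrieval, and the last 3 steps replay the same batches in the same order, using the reflections generated in the first pass for within-episode exact retrieval. The 3-step gap ensures that reflection generation, judge scoring, bank insertion, and at least one deferred training step have completed.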

| Step in window | Phase | Data | Retrieval mode |
| --- | --- | --- | --- |
| 1 | New | fresh | Cross-episode similar |
| 2 | New | fresh | Cross-episode similar |
| 3 | New | fresh | Cross-episode similar |
| 4 | Repeat | replay | Within-episode exact |
| 5 | Repeat | replay | Within-episode exact |
| 6 | Repeat | replay | Within-episode exact |
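The schedule in this table can be written as a small function from the global training step to the phase, the retrieval mode, and the batch being used. The function name and the return format are assumptions for illustration.

```python
def window_schedule(step, window_k=3):
    """Map a 1-based training step to (phase, retrieval mode, batch index).

    Steps 1..K of each 2K-step window draw fresh batches; steps K+1..2K replay
    them in the same order, so step K+i revisits the batch from step i.
    """
    if step < 1:
        raise ValueError("step is 1-based")
    window, pos = divmod(step - 1, 2 * window_k)  # pos runs over 0..2K-1
    slot = pos % window_k                          # which batch inside the window
    batch_index = window * window_k + slot         # index into the stream of fresh batches
    if pos < window_k:
        return "fresh", "cross-episode similar", batch_index
    return "replay", "within-episode exact", batch_index


for step in range(1, 8):
    print(step, *window_schedule(step))
```

Step 4 replays the batch first seen at step 1, which gives every reflection exactly three steps to be generated, judged, and inserted before it is retrieved.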

3.6 Algorithm-level pseudocode

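A code search found no open-source implementation. The pseudocode below is reconstructed from Algorithm 1, the method sections, and the appendix training pipeline in the arXiv source. It is meant to aid understanding of the paper's flow and is not released code.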

3.6.1 Rubric-guided rollout scaffold

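MAX_TOOL_CALLS = 10  # per-trajectory cap listed in the RL training config


def rubric_guided_rollout(policy, query, tools, retrieved_experience=None):
    # format_rubric_bank_items, has_sufficient_evidence, and parse_stage_blocks
    # are placeholder helpers, not functions from released code.
    history = [query]
    if retrieved_experience is not None:
        history.append(format_rubric_bank_items(retrieved_experience))

    # Plan: deep analysis, self-generated rubrics, research plan.
    plan = policy.generate(
        history,
        required_tags=["<structured_plan>", "<analysis>", "<rubrics>", "<research_plan>"]
    )
    history.append(plan)

    # Research: call_tool -> tool_output -> state_evaluation loop.
    evidence = []
    while not has_sufficient_evidence(history) and len(evidence) < MAX_TOOL_CALLS:
        tool_call = policy.generate(history, required_tags=["<call_tool>"])
        tool_output = tools.run(tool_call)
        # The state evaluation must see both the call and its output.
        history.extend([tool_call, tool_output])
        state_eval = policy.generate(history, required_tags=["<state_evaluation>"])
        history.append(state_eval)
        evidence.append(tool_output)

    # Review: map evidence back to the rubrics and draft a writing plan.
    review = policy.generate(
        history,
        required_tags=["<review>", "<rubrics_review>", "<writing_plan>"]
    )
    history.append(review)

    # Answer: final long-form report.
    answer = policy.generate(history, required_tags=["<answer>"])
    history.append(answer)
    return parse_stage_blocks(history)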

3.6.2 Stage-Structured GRPO update

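import torch


def compute_stage_advantages(stage_scores, lambda_matrix, eps=1e-6):
    # stage_scores: [num_rollouts, num_stages], values in [0, 1]
    # lambda_matrix[k, j]: weight of stage j's score in stage k's return
    # (zero for j < k, so a stage is credited only for itself and later stages)
    returns = stage_scores @ lambda_matrix.T
    mean = returns.mean(dim=0, keepdim=True)
    std = returns.std(dim=0, keepdim=True)
    return (returns - mean) / (std + eps)


def ss_grpo_loss(logp_new, logp_old, ref_logp, stage_ids, advantages,
                 mask=None, clip_eta=0.2, beta=0.001):
    # logp_*: [num_rollouts, seq_len] log-probs of the sampled tokens
    # stage_ids: [num_rollouts, seq_len] int64 stage index of each token
    # advantages: [num_rollouts, num_stages]
    # mask: optional [num_rollouts, seq_len], 1 for trainable tokens, 0 for padding
    ratio = torch.exp(logp_new - logp_old)
    token_adv = torch.gather(advantages, dim=1, index=stage_ids)
    unclipped = ratio * token_adv
    clipped = torch.clamp(ratio, 1 - clip_eta, 1 + clip_eta) * token_adv
    policy_loss = -torch.minimum(unclipped, clipped)
    # k3 KL estimator on sampled tokens: exp(d) - d - 1 with d = log(pi_ref / pi_new)
    log_ratio = ref_logp - logp_new
    kl = torch.exp(log_ratio) - log_ratio - 1
    per_token = policy_loss + beta * kl
    if mask is None:
        return per_token.mean()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)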

3.6.3 Evolving judge and rubric buffers

def update_stagewise_rubric_buffers(judge, query, rollouts, old_buffers, caps):
    # old_buffers: {stage: [rubric, ...]}; caps: {stage: max rubrics kept}
    # The judge proposes new rubrics by contrasting rollouts of the same query.
    proposed = judge.generate_discriminative_rubrics(
        query=query,
        rollouts=rollouts,
        active_buffers=old_buffers,
        stages=["Plan", "Research", "Review", "Answer"],
    )
    merged = {stage: old_buffers[stage] + proposed[stage] for stage in old_buffers}

    stage_scores = {}
    for stage in list(merged):
        # Score every rollout on this stage under the merged rubric set.
        stage_scores[stage] = judge.score_stage(
            query=query,
            rollouts=rollouts,
            stage=stage,
            rubrics=merged[stage],
        )
        # Keep only the most discriminative rubrics, up to the stage cap.
        merged[stage] = prune_by_discrimination(
            rubrics=merged[stage],
            scores=stage_scores[stage],
            max_items=caps[stage],
        )
    return merged, stage_scores

3.6.4 Reflection meta-policy and rubric bank

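import hashlib


def prepare_reflection_batch(policy, judge, query, rollouts, stage_scores, rubric_buffers, bank, m=8):
    # Reflect on one judged trajectory; the trajectory itself stays fixed context.
    trajectory = sample_uniform(rollouts)
    candidates = policy.generate_n(
        prompt=build_reflection_prompt(query, trajectory),
        n=m,
        temperature=0.7,
    )
    rewards = []
    for cand in candidates:
        score = judge.score_reflection(
            query=query,
            trajectory=trajectory,
            reflection=cand,
            stage_scores=stage_scores,
            rubric_buffers=rubric_buffers,
            criteria=["within_episode_refinement", "cross_episode_transfer"],
        )
        rewards.append(score)

    # Only the top-scoring reflection is a candidate for the bank.
    best = candidates[max(range(len(rewards)), key=rewards.__getitem__)]
    if is_valid_reflection(best):
        # hashlib.sha256 needs bytes, so encode the query first.
        query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
        bank.upsert(query_hash=query_hash, query=query, reflection=best)
    # All candidates, not just the best, become GRPO samples for the reflection branch.
    return build_grpo_samples(candidates, rewards)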

3.6.5 Full training loop

import torch

STAGES = ["Plan", "Research", "Review", "Answer"]
# Per-stage buffer caps 3, 2, 2, 3, assumed to follow the stage order above.
BUFFER_CAPS = {"Plan": 3, "Research": 2, "Review": 2, "Answer": 3}


def train_rubricem(policy, ref_policy, data, tools, judge, lambda_matrix, total_steps):
    bank = RubricBank(embedding_model="Qwen3-Embedding-0.6B", top_k=2)
    active_buffers = {}  # query -> {stage: rubric buffer}
    deferred_reflection_batch = None

    for step in range(1, total_steps + 1):
        queries, retrieval_mode = windowed_batch(data, step, window_k=3)
        retrieved = {q: bank.retrieve(q, mode=retrieval_mode) for q in queries}

        # Reflection updates lag one step behind so they never block rollouts.
        if deferred_reflection_batch is not None:
            policy.update_on_reflection_tokens(deferred_reflection_batch, objective="GRPO")

        all_task_samples = []
        async_reflection_jobs = []
        for q in queries:
            rollouts = [rubric_guided_rollout(policy, q, tools, retrieved[q]) for _ in range(8)]
            old_buffers = active_buffers.get(q) or init_stage_buffers(stages=STAGES)
            active_buffers[q], stage_scores = update_stagewise_rubric_buffers(
                judge, q, rollouts, old_buffers, caps=BUFFER_CAPS
            )
            # Assumes stage_scores[stage] is a list with one score per rollout;
            # stack into a [num_rollouts, num_stages] tensor.
            score_matrix = torch.tensor(
                [stage_scores[s] for s in STAGES], dtype=torch.float32
            ).T
            advantages = compute_stage_advantages(score_matrix, lambda_matrix)
            all_task_samples.extend(pack_stage_token_samples(q, rollouts, advantages))
            async_reflection_jobs.append(
                launch_async(prepare_reflection_batch, policy, judge, q, rollouts, stage_scores, active_buffers[q], bank)
            )

        policy.update_on_task_tokens(all_task_samples, ref_policy=ref_policy, objective="SS-GRPO")
        deferred_reflection_batch = collect_completed(async_reflection_jobs)
        if step % 10 == 0:
            bank.save_atomic(step)

3.7 Code search and paper-to-source mapping

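A code search found no open-source implementation. Checked: project and GitHub links in the arXiv HTML and source; web searches for "RubricEM GitHub", "RubricEM-8B GitHub", and "Stage-Structured GRPO GitHub"; and GitHub repository search. Only an unrelated automated-testing repository with the same name turned up, and no official author repo was found. These notes therefore set no github / github_ref field, and there are no differences between released code and the paper's formulas to compare.

Code reference: no open-source implementation found by code search (2026-05-16); the table below maps to arXiv source sections, not released code.

| Paper Concept | Source File | Key Class/Function |
| --- | --- | --- |
| Overall framework / trajectory notation | sections/method.tex; sections/Intro.tex | Not released; see pseudocode train_rubricem |
| Rubric-guided scaffold | sections/method_1_scaffolding.tex; sections/appen_scaffold.tex | Not released; see rubric_guided_rollout |
| SS-GRPO objective | sections/method_2_ssgrpo.tex; sections/appen_theory.tex | Not released; see compute_stage_advantages, ss_grpo_loss |
| Evolving-rubric judge | sections/method_2_ssgrpo.tex; sections/appen_evolving_judge.tex | Not released; see update_stagewise_rubric_buffers |
| Reflection meta-policy / rubric bank | sections/method_3_evolving.tex; sections/appen_async_detail.tex | Not released; see prepare_reflection_batch, RubricBank |
| Full procedure | sections/appen_algo.tex | Not released; Algorithm 1 |

Differences between the paper's formulas and released code: there is no released code, so it is not possible to check whether an implementation deviates from the formulas. All training configuration numbers come from the appendix training details in the arXiv source, not from a generic README or default config.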

4. Experimental Setup

Datasets and benchmarks

Long-form benchmarks:

| Benchmark | Scale | Evaluation |
| --- | --- | --- |
| HealthBench | 1000 medical questions (subsample) | GPT-4 LLM-as-judge scoring accuracy, completeness, context awareness, and communication against per-question rubrics; reports rubric satisfaction rate |
| ResearchQA | 756 scientific research questions | GPT-4 LLM-as-judge with expert-authored rubric items on a 5-point coverage scale; reports normalized coverage |
| DeepResearchBench (DRB) | 100 complex research questions | RACE scoring; a Gemini judge evaluates both report content quality and citation accuracy |
| ResearchRubrics | 101 open-ended deep research prompts | LLM-as-judge with expert prompt-specific rubrics covering factual grounding, reasoning, synthesis, relevance, clarity, and citation use; reports binary score |

Short-form out-of-domain benchmarks: SimpleQA, 2WikiMultihopQA (1000 multi-hop questions), WebWalker (680 web navigation questions), and DeepSearchQA (900 search-intensive questions). Short-form tasks are evaluated zero-shot with the same agent pipeline and search tools; no short-form data is used in the RL stage.

Training data: SFT queries come from the roughly 13K open-ended research questions used by DR Tulu, with stage-structured trajectories generated by a Gemini-3.1-Pro teacher. RL uses the same DR Tulu RL dataset, rl-research/dr-tulu-rl-data, with about 4.9K diverse deep research queries. Training queries are drawn from sources including realistic search conversations in SearchArena and research-oriented questions in OpenScholar.

Baselines

Baselines fall into three groups:

  • Closed Deep Research: Claude-Sonnet Search, Perplexity-Sonar (High), Perplexity Deep Research, Gemini Deep Research, Gemini 3.1 Pro + Search, GPT-5 + Search, OpenAI Deep Research.
  • Fixed Pipeline Deep Research: WebThinker QwQ-32B, WebThinker-32B-DPO, Ai2 ScholarQA - Claude Sonnet.
  • Open Deep Research Models: Search-R1-7B, WebExplorer-8B, Tongyi DeepResearch-30B-A3B, DR Tulu-8B (SFT), DR Tulu-8B (RL, 1900 steps).

Training config

SFT:

  • Base model / framework: Qwen3-8B, LLaMA-Factory, DeepSpeed ZeRO-3.
  • Optimization: 5 epochs, LR , cosine scheduler with 10% warmup, BF16, weight decay 0.0。
  • Batch / length: batch size per device 1, gradient accumulation 16, effective batch size 128 = 8 GPUs × 1 per device × 16 accumulation steps, max sequence length 16,384 tokens.
  • Hardware / masking: 8 NVIDIA H100 80GB; <tool_output> spans are masked, so no gradient flows through search-result tokens.

RL:

  • Algorithm / rollout: GRPO / SS-GRPO, 8 rollouts per prompt, 32 unique prompts per step, effective batch size 256.
  • Optimization: LR , constant scheduler, KL coefficient , KL estimator KL3, PPO-style clipping, temperature 1.0。
  • Length / tools: max response length 18,432 tokens, max prompt length 8,192 tokens, max total pack length 26,624 tokens, max tool calls per trajectory 10.
  • System components: ZeRO-3 with CPU offloading, gradient checkpointing enabled; the judge model and the rubric generation model are both Gemini Flash; rubric buffer cap per stage = 3, 2, 2, 3; the rubric bank uses Qwen3-Embedding-0.6B with retrieval top-k = 2, reflection trajectories sampled = 1, windowed curriculum K = 3, and the bank is saved every 10 steps.
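The batch and length figures in the SFT and RL configurations are mutually consistent, which a few lines of arithmetic confirm. The variable names are ad hoc; the numbers are the ones listed in the configuration.

```python
# SFT: effective batch = GPUs x per-device batch x gradient accumulation steps
sft_gpus, sft_per_device, sft_accum = 8, 1, 16
sft_effective_batch = sft_gpus * sft_per_device * sft_accum

# RL: effective batch = unique prompts per step x rollouts per prompt
rl_prompts, rl_rollouts = 32, 8
rl_effective_batch = rl_prompts * rl_rollouts

# RL: total pack length = max prompt length + max response length
max_prompt_len, max_response_len = 8_192, 18_432
max_pack_len = max_prompt_len + max_response_len

print(sft_effective_batch, rl_effective_batch, max_pack_len)  # 128 256 26624
```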

Ablation runs share a 600-step budget and 2 nodes and start from the same RubricEM-SFT checkpoint; the final RubricEM run uses 4 nodes and trains for 1400 RL steps. The reward uses only rubric-based judge signals, with no format reward, citation reward, or tool-use heuristics.

5. Experimental Results

Main long-form performance

RubricEM-8B (RL, 1400 steps) has the highest average among open / non-proprietary models, and at 8B parameters it approaches proprietary deep research systems. Main results:

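| Model | HealthBench | ResearchQA | DRB | ResearchRubrics | Average |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro + Search | 47.5 | 74.5 | 44.4 | 49.1 | 53.9 |
| GPT-5 + Search | 59.5 | 78.2 | 50.7 | 60.5 | 62.2 |
| OpenAI Deep Research | 53.8 | 79.2 | 46.9 | 59.7 | 59.9 |
| WebThinker-32B-DPO | 39.4 | 74.2 | 40.6 | 41.9 | 49.0 |
| Tongyi DeepResearch-30B-A3B | 46.2 | 66.7 | 40.6 | 49.5 | 50.8 |
| DR Tulu-8B (SFT) | 38.1 | 68.5 | 39.0 | 38.4 | 46.0 |
| DR Tulu-8B (RL, 1900 steps) | 50.2 | 74.3 | 43.4 | 46.4 | 53.6 |
| Qwen3-8B + Our Search | 24.5 | 58.4 | 28.2 | 24.5 | 33.9 |
| RubricEM-8B (SFT) | 39.0 | 71.8 | 43.0 | 42.8 | 49.2 |
| RubricEM-8B (RL, 1400 steps) | 49.3 | 74.5 | 47.8 | 50.3 | 55.5 |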

Key readings: from SFT to RL, RubricEM's average rises from 49.2 to 55.5 (+6.3). Against DR Tulu-8B (RL, 1900 steps) at an average of 53.6, RubricEM reaches 55.5 in 1400 steps. It is the best non-proprietary model on ResearchQA, DRB, and ResearchRubrics, but on HealthBench DR Tulu-8B RL's 50.2 is slightly higher than RubricEM's 49.3.
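The Average column appears to be the unweighted mean of the four benchmark scores, which can be checked from the reported numbers. The snippet below recomputes it for three rows; the dictionary layout is only for illustration.

```python
# Scores in column order: HealthBench, ResearchQA, DRB, ResearchRubrics.
rows = {
    "DR Tulu-8B (RL, 1900 steps)": [50.2, 74.3, 43.4, 46.4],
    "RubricEM-8B (SFT)": [39.0, 71.8, 43.0, 42.8],
    "RubricEM-8B (RL, 1400 steps)": [49.3, 74.5, 47.8, 50.3],
}
averages = {name: sum(scores) / len(scores) for name, scores in rows.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")

gain = averages["RubricEM-8B (RL, 1400 steps)"] - averages["RubricEM-8B (SFT)"]
print(f"RL gain over SFT: {gain:.3f}")
```

The recomputed means agree with the reported 53.6, 49.2, and 55.5 to within rounding, and the unrounded RL gain over SFT is slightly above the +6.3 obtained from the rounded averages.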

Ablation: RL training recipes

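Reading Figure 5: under the 600-step budget, Baseline RL, SS-GRPO, Meta-Policy, and RubricEM all start from the same SFT checkpoint. RubricEM's curves are the highest overall on average delta over SFT, DRB, HealthBench, and ResearchQA, indicating that SS-GRPO's stage-level credit and the Reflection Meta-Policy's experience reuse are complementary rather than substitutes. Adding only SS-GRPO or only the Meta-Policy each helps, but the full recipe is the most stable.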

Ablation: scaffold and experience reuse

Reading Figure 6: Panel (a) shows structured SFT versus unstructured SFT at 43.0 vs 39.8 on DRB, 39.0 vs 34.2 on HealthBench, and 71.8 vs 70.2 on ResearchQA, so the scaffold directly improves distillation quality. Panel (b) shows the gain curve of structured RL above that of unstructured RL, so the scaffold also makes subsequent RL more effective. Panel (c) fixes Gemini-3.1-Pro and the search backend and changes only the prompt: Our Scaffold reaches 43.5 on DRB, above ReAct's 39.9. Panel (d) shows RubricEM gaining +0.4 in cross-episode pass 1 and +0.7 in within-episode pass 2, while Baseline RL under the same retrieval setup changes by -0.5 and -0.6. Memory reuse is therefore not just extra context; it depends on the trained reflection meta-policy.

Short-form transfer

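| Model | SimpleQA | 2Wiki | WebWalker | DSQA | Avg. |
| --- | --- | --- | --- | --- | --- |
| DR Tulu-8B (SFT) | 75.5 | 66.5 | 31.9 | 5.3 | 44.8 |
| DR Tulu-8B (RL, 1900 steps) | 80.1 | 68.0 | 39.1 | 8.3 | 49.0 |
| Qwen3-8B + Our Search | 84.0 | 61.5 | 42.6 | 15.2 | 50.8 |
| RubricEM-8B (SFT) | 92.1 | 77.5 | 64.7 | 37.0 | 67.8 |
| RubricEM-8B (RL, 1400 steps) | 92.3 | 78.8 | 70.0 | 53.0 | 73.5 |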

Short-form takeaway: although RL uses only long-form queries, RubricEM-8B RL beats the DR Tulu baselines on all of SimpleQA, 2Wiki, WebWalker, and DSQA, averaging 73.5. The largest gain is on DSQA, which rises from 37.0 after SFT to 53.0 after RL.

Limitations

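The authors name three limitations. First, long-horizon agentic RL depends on search tool calls and external LLM judging, so training is more exposed than standard offline RL or SFT to API delays, network instability, and rollout-judge latency. During the main large-scale RL run the training servers shut down and restarted several times; training resumed from checkpoints, but this may have introduced staleness beyond the designed one-step lag in the asynchronous reflection branch, the rubric bank, and the judge-feedback pipeline. Second, Gemini Flash was chosen as a cost-effective judge; a stronger or more specialized judge might improve stage-level credit and reflection rewards, at higher cost and latency. Third, the rubric-guided meta-policy inherits the bias risks of its judge and rubrics: poor criteria can reinforce shallow preferences, biased standards, or overconfident synthesis, and memory can propagate these errors across tasks.

Overall conclusion: RubricEM shows that, for open-ended deep research tasks without verifiable rewards, an explicit rubric interface can serve structured policy execution, fine-grained credit assignment, and experience reuse at once. The final 8B model averages 55.5 on the four long-form benchmarks, ahead of comparable open models, and the ablations support the complementarity of the structure, assign, and evolve components.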