Today's in-depth reading: 1 paper
1. Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
- Score: 7.5/10 | arXiv: 2602.17616
- Summary: This paper proposes Variance Controlled Policy Optimization (VCPO), which explicitly controls the variance of policy gradients in asynchronous RL for LLMs via two complementary mechanisms: effective-sample-size (ESS) guided learning-rate scaling and a closed-form minimum-variance off-policy baseline. VCPO matches synchronous training performance while keeping the throughput advantage of asynchronous training, cutting long-context multi-turn training time by 2.5x (see the sketch after this list).
- Score breakdown: Novelty 6 | Relevance 8 | Technical depth 7 | Practical impact 8
- Assessment: VCPO addresses a practical and underexplored problem, variance amplification in asynchronous RL for LLMs, with a principled ESS-based diagnosis and a lightweight fix that achieves a 2.5x training speedup. It is directly relevant to RLHF/alignment and tool-use training, with strong empirical validation across diverse benchmarks.
- Reading notes | PDF
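
A minimal sketch of the two mechanisms as described in the summary, not the paper's exact formulation: `ess_scaled_lr` and `weighted_off_policy_baseline` are hypothetical names, and the specific scaling rule and baseline form are assumptions made for illustration.

```python
import torch


def ess_scaled_lr(log_probs_new, log_probs_old, base_lr, min_scale=0.1):
    """Sketch of ESS-guided learning-rate scaling (assumed rule, not VCPO's exact one).

    ESS = (sum w)^2 / sum(w^2) over per-sample importance weights w.
    ESS/N is 1.0 when behavior and target policies agree and drops toward 0
    as asynchronous (stale) data diverges from the current policy, so the
    learning rate is shrunk accordingly.
    """
    # Importance weights between current (target) and stale (behavior) policies.
    w = torch.exp(log_probs_new - log_probs_old)            # shape [N]
    ess = w.sum() ** 2 / (w.pow(2).sum() + 1e-8)            # effective sample size
    scale = (ess / w.numel()).clamp(min=min_scale, max=1.0)  # normalized to [min_scale, 1]
    return base_lr * scale.item()


def weighted_off_policy_baseline(rewards, w):
    """Sketch of a variance-reducing off-policy baseline: an importance-weighted
    mean of rewards. The paper's closed-form minimum-variance baseline may weight
    samples differently; this only illustrates the idea of a reward baseline
    adapted to off-policy data."""
    return (w * rewards).sum() / (w.sum() + 1e-8)
```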