Chinese Academy of Sciences Explains: Why Does Multi-Step Tool-Use RL Collapse? Supervisory Signals Offer a Fix
Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It. Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao. Core publishing institutions: Chinese Academy of Sciences, University of Chinese Academy of Scien…
2026/7/1 1:55:05