Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens