arxiv:2602.10693

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Published on Feb 11
· Submitted by floyed shen on Feb 23
#1 Paper of the day

Abstract

VESPO addresses training instability in LLM reinforcement learning with a variational formulation that builds in variance reduction, correcting for policy divergence without length normalization.

AI-generated summary

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
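The variance problem the abstract describes can be made concrete: the sequence-level importance weight is the product of per-token probability ratios, so even a small per-token mismatch between the behavior and current policies compounds with sequence length. The sketch below (an illustration, not code from the paper; the Gaussian log-ratio model is an assumption) simulates this blow-up:

```python
import math
import random

def seq_importance_weight(logp_cur, logp_beh):
    """Sequence-level importance weight: exp of the summed per-token log-ratios."""
    return math.exp(sum(c - b for c, b in zip(logp_cur, logp_beh)))

def simulate(seq_len, shift=0.1, n=2000):
    """Model each token's log-ratio as N(-shift^2/2, shift), so that E[w] = 1
    (an unbiased importance weight), and estimate the weight's variance."""
    weights = []
    for _ in range(n):
        log_ratios = [random.gauss(-shift**2 / 2, shift) for _ in range(seq_len)]
        weights.append(math.exp(sum(log_ratios)))
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return mean, var

random.seed(0)
mean_short, var_short = simulate(seq_len=8)
mean_long, var_long = simulate(seq_len=256)
# The mean stays near 1 in both cases, but the variance grows
# roughly exponentially in sequence length.
```

This is why naive sequence-level importance sampling collapses under staleness: the estimator is unbiased but its variance explodes with response length, which is the failure mode VESPO's reshaping kernel targets.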

Community

Paper author · Paper submitter

Training stability under off-policy conditions is a critical bottleneck for scaling RL-based LLM training — policy staleness from mini-batch splitting, asynchronous pipelines, and training-inference mismatches all cause importance weights to explode. Existing fixes like token-level clipping or length normalization are either lossy approximations or introduce bias.

We propose VESPO, which takes a fundamentally different approach: instead of designing heuristic weight transformations, we formulate variance reduction as a variational optimization problem over proposal distributions, yielding a closed-form reshaping kernel that operates directly on sequence-level importance weights — no length normalization, no token-level decomposition. VESPO maintains stable training under staleness ratios up to 64× and fully asynchronous execution, with consistent gains across both dense and MoE architectures on mathematical reasoning benchmarks. Code: https://github.com/FloyedShen/VESPO

