May 27, 2026

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Soft-RLVR improves language model instruction-following by decomposing each prompt into a checklist of atomic requirements and scoring responses item by item with an LLM verifier. The resulting partial-credit rewards are denser than holistic pass/fail verification and outperform it in controlled evaluations, though self-verifying variants need explicit stabilization to prevent reward inflation.

Authors

Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
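To make the checklist idea concrete, here is a minimal sketch of how item-level verifier judgments might be aggregated into a soft reward and compared with a pass/fail signal. The function names, the 0.5 threshold, and the use of an all-items-pass rule as a stand-in for holistic verification are assumptions made for illustration; the abstract states only that item-level judgments are averaged, and the paper's exact reward may differ.

```python
from typing import Sequence


def checklist_reward(item_scores: Sequence[float]) -> float:
    """Soft reward: the mean of per-item verifier judgments, each in [0, 1].

    Assumption: plain unweighted averaging over checklist items.
    """
    if not item_scores:
        raise ValueError("checklist must contain at least one item")
    return sum(item_scores) / len(item_scores)


def pass_fail_reward(item_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Sparse reward: 1.0 only if every item clears the threshold.

    Assumption: this all-or-nothing rule is a hypothetical stand-in for a
    holistic verifier, which in practice would judge the whole response at once.
    """
    if not item_scores:
        raise ValueError("checklist must contain at least one item")
    return 1.0 if all(score >= threshold for score in item_scores) else 0.0


# Hypothetical verifier output for a response that meets 3 of 4 requirements.
scores = [1.0, 1.0, 0.0, 1.0]
soft = checklist_reward(scores)    # partial credit for the three met items
sparse = pass_fail_reward(scores)  # no credit, since one item failed
```

The example also illustrates the tradeoff named in the abstract: the soft reward gives the policy a gradient toward the three satisfied requirements, but it also assigns substantial reward to a response that is still incomplete.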

Related works

Reverse Engineering Human Preferences with Reinforcement Learning

CIRCLE: A Framework for Evaluating AI from a Real-World Lens

Near-Optimal Distributionally Robust Reinforcement Learning with General Norms