"Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns." (Ding et al., arXiv:2510.22543)
You know that feeling when a student gets the right answer but their work is clearly wrong — they just guessed? RLVR training for LLMs has this problem at scale: the training signal only checks if the final answer is correct, so answer-guessing and logic shortcuts earn the same positive reward as fully reasoned solutions. The authors find this affects 20–40% of correct training rollouts across Qwen2.5-Math-7B-Base, Llama3.3-70B-Instruct, and Qwen3-1.7B. The ratio doesn't shrink with more training — it plateaus at ~30%, meaning your model keeps internalizing unreliable reasoning patterns even as its benchmark scores improve.
Think of it like a teacher who lets students use cheat sheets during an early learning phase, then gradually removes them as students get stronger. FAPO adds two components on top of standard RLVR. The first is a generative process reward model (GenRM) that reads each reasoning step and flags where the model guessed or made a logical jump. The second is a reward penalty term that only activates once fully correct rollouts outnumber failed rollouts, that is, once the ratio α/β of fully correct to failed rollouts crosses 1. The penalty uses a distance-sensitive formula: the further a wrong step is from where the model claims correctness, the larger the deduction. During warm-up, flawed positives still get rewarded because they help the model learn fast. Once the model becomes reliable, the penalty kicks in and steers optimization toward showing its work. The penalty weight λ = 1 is derived from a majority-guided rule, not tuned manually.
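One way to see why a threshold at α/β = 1 pairs naturally with λ = 1 is a small numeric sketch. It rests on assumptions that are not stated in this summary: outcome rewards of +1 for a correct answer and -1 for a wrong one, a constant deduction λ for any correct rollout the GenRM flags (the distance-sensitive weighting is left out), and a GRPO-style advantage measured against the group mean without variance normalization. The names `Rollout`, `shaped_reward`, and `flawed_positive_advantage` are hypothetical and do not come from the paper or from verl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    answer_correct: bool   # outcome verifier says the final answer matches
    flagged_flawed: bool   # hypothetical GenRM verdict on the reasoning


def shaped_reward(r: Rollout, lam: float = 1.0) -> float:
    """Assumed +1/-1 outcome reward, minus lam for a flagged correct rollout."""
    base = 1.0 if r.answer_correct else -1.0
    penalty = lam if (r.answer_correct and r.flagged_flawed) else 0.0
    return base - penalty


def flawed_positive_advantage(n_clean: int, n_flawed: int, n_failed: int,
                              lam: float = 1.0) -> float:
    """Advantage of one flawed positive relative to the group-mean reward."""
    group = (
        [Rollout(True, False)] * n_clean     # fully correct rollouts (alpha)
        + [Rollout(True, True)] * n_flawed   # right answer, flagged reasoning
        + [Rollout(False, False)] * n_failed # wrong answer (beta)
    )
    rewards = [shaped_reward(r, lam) for r in group]
    mean_reward = sum(rewards) / len(rewards)
    return shaped_reward(Rollout(True, True), lam) - mean_reward


# Warm-up: failures outnumber clean successes, flawed positives are still reinforced.
print(flawed_positive_advantage(n_clean=1, n_flawed=2, n_failed=5))  # 0.5
# Later: clean successes outnumber failures, flawed positives are pushed down.
print(flawed_positive_advantage(n_clean=5, n_flawed=2, n_failed=1))  # -0.5
```

Under these assumptions a flawed positive scores 0 and the group mean is proportional to the number of clean successes minus the number of failures, so the advantage changes sign exactly where the two counts are equal. This is an illustrative reading of the threshold, not a reproduction of the authors' implementation.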
If you train or fine-tune reasoning LLMs using RLVR (GRPO, PPO, or derivatives) and care about process reliability beyond benchmark accuracy, this paper is directly relevant. You need access to the verl training framework and the compute budget for 7B–32B models to apply the full method. Not usable yet for training reproduction: the recipe/fapo code directory in volcengine/verl returns HTTP 404 as of May 27, 2026.
Worth reading if you work on RLVR post-training: the finding that flawed positives persist at ~30% is concrete and documented across multiple model families and training runs. The method is elegant, with no new hyperparameters and less than 20% compute overhead. However, the training code is not yet publicly accessible (recipe/fapo returns HTTP 404), validation covers only math tasks, and the model has 7 Hugging Face downloads as of May 2026, which is far too early for any production evaluation.