R&D · advanced · 3 min read · May 27, 2026

Teaching AI to show its work, not just get lucky

“20–40% of correct outputs in RLVR training are flawed positives: lucky guesses or shortcut reasoning. Without intervention, that share plateaus around 30%, no matter how long you train.”

Source · fapo-rl.github.io

“Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns.” (Ding et al., arXiv:2510.22543)

You know that feeling when a student gets the right answer but their work is clearly wrong, because they just guessed? RLVR training for LLMs has this problem at scale: the training signal only checks whether the final answer is correct, so answer-guessing and logic shortcuts earn the same positive reward as fully reasoned solutions. The authors find this affects 20–40% of correct training rollouts across Qwen2.5-Math-7B-Base, Llama3.3-70B-Instruct, and Qwen3-1.7B. The ratio doesn't shrink with more training: it plateaus at ~30%, so the model keeps internalizing unreliable reasoning patterns even as its benchmark scores improve.

rlvr · reinforcement-learning · llm-training · process-reward-model · reasoning · iclr-2026 · open-source

Think of it like a teacher who lets students use cheat sheets during an early learning phase, then gradually removes them as students get stronger. FAPO adds two components on top of standard RLVR. The first is a generative process reward model (GenRM) that reads each reasoning step and flags where the model guessed or made a logical jump. The second is a reward penalty term that activates only once fully correct rollouts outnumber failed rollouts (the α/β ratio crosses 1). The penalty uses a distance-sensitive formula: the further a wrong step is from where the model claims correctness, the larger the deduction. During warm-up, flawed positives still get rewarded because they help the model learn fast. Once the model becomes reliable, the penalty kicks in and steers optimization toward showing its work. The penalty weight λ=1 is derived from a majority-guided rule, not tuned manually.
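The two-phase behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the `shaped_rewards` function, the ±1 base reward, and the reading of α as the count of fully correct rollouts and β as the count of wrong-answer rollouts are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    answer_correct: bool  # verdict of the rule-based answer checker
    flawed: bool          # hypothetical GenRM verdict: a reasoning step was flagged


def shaped_rewards(rollouts: List[Rollout], lam: float = 1.0) -> List[float]:
    """Assign rewards to one group of rollouts (illustrative sketch only)."""
    # Assumption: alpha counts fully correct rollouts, beta counts wrong answers.
    alpha = sum(r.answer_correct and not r.flawed for r in rollouts)
    beta = sum(not r.answer_correct for r in rollouts)
    # Same condition as alpha / beta > 1, written to avoid dividing by zero.
    penalty_active = alpha > beta

    rewards = []
    for r in rollouts:
        # Assumption: outcome reward is +1 for a correct answer, -1 otherwise.
        reward = 1.0 if r.answer_correct else -1.0
        if penalty_active and r.answer_correct and r.flawed:
            reward -= lam  # penalize flawed positives only after warm-up
        rewards.append(reward)
    return rewards


# Warm-up: failures outnumber fully correct rollouts, so no penalty is applied.
warmup = [Rollout(True, True), Rollout(False, False),
          Rollout(False, False), Rollout(True, False)]
# Later: fully correct rollouts outnumber failures, so the penalty is active.
later = [Rollout(True, True), Rollout(True, False),
         Rollout(True, False), Rollout(False, False)]
print(shaped_rewards(warmup))
print(shaped_rewards(later))
```

Under these assumptions, a flawed positive scores the same as a fully correct rollout during warm-up and lands at 0 once the penalty is active, midway between a fully correct rollout and a failure.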

01 Two-phase reward schedule: during warm-up, flawed positives still earn rewards to support early learning momentum; the penalty then activates automatically once reliable outputs outnumber failures, with no manual epoch or threshold to set
02 Distance-sensitive step-level penalty: the GenRM deducts proportionally more when the wrong step is far from where the model claims correctness (R_Process ∈ [-1, 0]), giving precise error localization rather than a binary pass/fail signal
03 Parameter-free penalty term: R_Δ uses λ=1 set by a majority-guided rule (α/β > 1), so you add zero new hyperparameters on top of your existing RLVR setup
04 4B GenRM that beats a 32B teacher: FAPO-GenRM-4B achieves F1 89.4 on FlawedPositiveBench vs. 87.8 for Qwen3-32B, delivering better process reward modeling at one-eighth the parameter count
05 Less than 20% training overhead: the GenRM runs asynchronously from policy rollouts in the verl framework, and the paper reports wall-clock overhead below 20%
06 AIME25 flawed-positive rate cut by 9.2 points: from 10.9% down to 1.7% at the 32B model scale, with the largest gains on pure math tasks
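Item 02 describes a distance-sensitive, step-level signal bounded in [-1, 0]. One simple form with that range is a normalized index distance, sketched below. The source does not spell out the formula, so the function name `process_reward`, the choice to compare a flagged step against a reference step, and the normalization by step count are assumptions for illustration only.

```python
def process_reward(flagged_step: int, reference_step: int, num_steps: int) -> float:
    """Distance-sensitive score in [-1, 0] (illustrative sketch only).

    Assumption: flagged_step is the step index the reward model points at,
    reference_step is the step index it is compared against, and both are
    zero-based positions in a solution with num_steps steps.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    for step in (flagged_step, reference_step):
        if not 0 <= step < num_steps:
            raise ValueError("step indices must lie in [0, num_steps)")
    # Zero when the two positions coincide; more negative as they move apart.
    return -abs(flagged_step - reference_step) / num_steps


print(process_reward(3, 3, 10))  # exact localization
print(process_reward(0, 5, 10))  # five steps apart in a ten-step solution
```

A graded score of this kind distinguishes a near miss from a distant one, which a binary pass/fail signal cannot do.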
Who it’s for

If you train or fine-tune reasoning LLMs using RLVR (GRPO, PPO, or derivatives) and care about process reliability beyond benchmark accuracy, this paper is directly relevant. To apply the full method you need access to the verl training framework and the compute budget for 7B–32B models. The training recipe cannot be reproduced yet: the recipe/fapo code directory in volcengine/verl returns HTTP 404 as of May 27, 2026.

Worth exploring

Worth reading if you work on RLVR post-training: the empirical finding that flawed positives persist at ~30% is concrete and documented across multiple model families and training runs. The method is elegant, with no new hyperparameters and less than 20% compute overhead. However, the training code is not yet publicly accessible (recipe/fapo returns HTTP 404), validation covers only math tasks, and the model has 7 Hugging Face downloads as of May 2026, far too early for any production evaluation.
