Teaching AI to show its work, not just get lucky

What problem does it solve?

"Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns." (Ding et al., arXiv:2510.22543)

You know that feeling when a student gets the right answer but their work is clearly wrong — they just guessed? RLVR training for LLMs has this problem at scale: the training signal only checks if the final answer is correct, so answer-guessing and logic shortcuts earn the same positive reward as fully reasoned solutions. The authors find this affects 20–40% of correct training rollouts across Qwen2.5-Math-7B-Base, Llama3.3-70B-Instruct, and Qwen3-1.7B. The ratio doesn't shrink with more training — it plateaus at ~30%, meaning your model keeps internalizing unreliable reasoning patterns even as its benchmark scores improve.

Tags: rlvr, reinforcement-learning, llm-training, process-reward-model, reasoning, iclr-2026, open-source

How it works

Think of it like a teacher who first lets students use cheat sheets during an early learning phase, then gradually removes them as students get stronger. FAPO adds two components on top of standard RLVR: a generative process reward model (GenRM) that reads each reasoning step and flags where the model guessed or made a logical jump, and a reward penalty term that only activates once fully-correct rollouts outnumber failed rollouts (the α/β ratio crosses 1). The penalty uses a distance-sensitive formula — the further a wrong step is from where the model claims correctness, the larger the deduction. During warm-up, flawed positives still get rewarded because they help the model learn fast. Once the model gets reliable, the penalty kicks in and steers optimization toward showing its work. The λ=1 penalty weight is derived from a majority-guided rule, not tuned manually.
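To see why a fixed penalty can behave like a two-phase schedule, here is a minimal sketch. It rests on assumptions that are not stated in this summary: outcome rewards of +1 and -1, a penalty of -λ applied only to flawed positives, and a GRPO-style advantage computed by subtracting the group mean reward. The function names are hypothetical and are not taken from the paper or from verl.

```python
from statistics import mean


def shaped_reward(answer_correct: bool, flawed: bool, lam: float = 1.0) -> float:
    """Outcome reward (+1 / -1, an assumption) plus -lam for flawed positives."""
    outcome = 1.0 if answer_correct else -1.0
    penalty = -lam if (answer_correct and flawed) else 0.0
    return outcome + penalty


def flawed_positive_advantage(
    n_correct: int, n_flawed: int, n_failed: int, lam: float = 1.0
) -> float:
    """Mean-centered advantage of one flawed-positive rollout in a group."""
    if n_flawed < 1:
        raise ValueError("need at least one flawed-positive rollout")
    rewards = (
        [shaped_reward(True, False, lam)] * n_correct    # fully correct
        + [shaped_reward(True, True, lam)] * n_flawed    # right answer, flawed steps
        + [shaped_reward(False, False, lam)] * n_failed  # wrong answer
    )
    # Group-mean baseline (assumed; std normalization would not change the sign).
    return shaped_reward(True, True, lam) - mean(rewards)


# Warm-up: failures outnumber fully correct rollouts, so the flawed
# positive still gets a positive advantage and is reinforced.
print(flawed_positive_advantage(n_correct=2, n_flawed=1, n_failed=5))  # 0.375

# Later: fully correct rollouts outnumber failures, so the same flawed
# positive now gets a negative advantage and is discouraged.
print(flawed_positive_advantage(n_correct=5, n_flawed=1, n_failed=2))  # -0.375
```

Under these assumptions, with λ=1 the sign of the flawed-positive advantage flips exactly when fully correct rollouts start to outnumber failed ones, which is one way to read the α/β > 1 condition.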

Key takeaways

01. Two-phase reward schedule: during warm-up, flawed positives still earn rewards to support early learning momentum, then the penalty activates automatically once reliable outputs outnumber failures, with no manual epoch or threshold to set.

02. Distance-sensitive step-level penalty: the GenRM deducts proportionally more when the wrong step is far from where the model claims correctness (R_Process ∈ [-1, 0]), giving precise error localization rather than a binary pass/fail signal.

03. Parameter-free penalty term: R_Δ uses λ=1, set by a majority-guided rule (α/β > 1), so you add zero new hyperparameters on top of your existing RLVR setup.

04. 4B GenRM that beats its 32B teacher: FAPO-GenRM-4B achieves F1 89.4 on FlawedPositiveBench vs. 87.8 for Qwen3-32B, delivering better process reward modeling at one-eighth the parameter count.

05. Less than 20% training overhead: the GenRM runs asynchronously from policy rollouts in the verl framework, and the paper reports wall-clock overhead below 20%.

06. AIME25 flawed-positive rate cut by 9.2 points: from 10.9% down to 1.7% at the 32B model scale, with the largest gains on pure math tasks.
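The distance-sensitive penalty in takeaway 02 can be illustrated with a small sketch. This summary gives only the range R_Process ∈ [-1, 0], not the formula, so the code assumes one plausible form: the gap between two step indices, normalized by the number of reasoning steps. The function name, both index arguments, and the normalization are hypothetical illustrations, not the paper's definition.

```python
def process_penalty(flagged_step: int, reference_step: int, n_steps: int) -> float:
    """Assumed distance-based penalty in [-1, 0].

    flagged_step:   step index the reward model points at (hypothetical name)
    reference_step: step index it is compared against (hypothetical name)
    n_steps:        total number of reasoning steps in the rollout
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    for idx in (flagged_step, reference_step):
        if not 0 <= idx < n_steps:
            raise ValueError("step index out of range")
    # A larger gap gives a larger deduction; zero gap gives no deduction.
    return -abs(flagged_step - reference_step) / n_steps


print(process_penalty(3, 3, 10))  # 0.0  (exact localization, no deduction)
print(process_penalty(2, 3, 10))  # -0.1 (off by one step)
print(process_penalty(1, 9, 10))  # -0.8 (far off, large deduction)
```

The point of a graded signal like this, compared with a binary pass/fail, is that a near miss in locating the error costs less than a distant one.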

Should you care?

Who it’s for

If you train or fine-tune reasoning LLMs using RLVR (GRPO, PPO, or derivatives) and care about process reliability beyond benchmark accuracy, this paper is directly relevant. You need access to the verl training framework and the compute budget for 7B–32B models to apply the full method. Not usable yet for training reproduction: the recipe/fapo code directory in volcengine/verl returns HTTP 404 as of May 27, 2026.

Worth exploring

Worth reading if you work on RLVR post-training: the empirical finding that flawed positives persist at ~30% across model types and training runs is concrete and documented across multiple model families. The method is elegant: no new hyperparameters, less than 20% compute overhead. However, the training code is not yet publicly accessible (recipe/fapo returns HTTP 404), validation covers only math tasks, and the model has 7 Hugging Face downloads as of May 2026, which is far too early for any production evaluation.

