GitHub Repos · Advanced · 3 min read · May 2, 2026

SIM2REASON: Scaling LLM Reasoning with Simulator-Generated QA Data

“SFT on 200K GPT-4 examples hurt IPhO by -3.9%. RL on 6,400 simulated problems improved it by +5.4%. The data that costs nothing beats the data that costs everything.”

Source · github.com

“"SFT yields only modest in-distribution gains on our synthetic evaluation and substantially degrades out-of-distribution performance (e.g., -3.9% on IPhO Mechanics). We hypothesize that this is driven by a large KL shift from the base Instruct model, which can induce catastrophi...”

You know that frustration when your LLM training data runs dry for a specific science domain? Physics is exactly that case: less than 1% of the 800K QA pairs in DeepSeek-R1's training data involve STEM topics (per Section 1 of the paper), so internet-scale scraping simply does not cover physics at the depth RLVR training needs. The obvious fixes, generating synthetic labels with GPT-4 or collecting 17,000 human-curated physics problems, turn out to make things worse: per Table 3, SFT on 200K GPT-4/o3 demonstrations degraded IPhO performance by 3.9% for the 32B model. Sim2Reason bypasses both paths by using physics simulators, where ground truth is free, exact, and deterministic; the reward signal requires neither a teacher model nor a human grader.

llm · physics · reinforcement-learning · research · mujoco · open-source · python

A YAML-based language describes physics scenes (a mass on a pulley, two charged plates, an orbiting body) and compiles them to MuJoCo for simulation. The simulator records exact numerical answers (velocities, forces, tensions), which become the ground truth for three question types: forward ('what is the velocity at t=3s?'), reverse ('what mass produces this velocity?'), and symbolic ('express velocity as a function of t'). Only about 15% of generated questions pass a quality filter that removes shortcut-solvable problems: if you can get the right answer while ignoring any entity in the scene, the question is discarded. The remaining ~6,400 QA pairs train the LLM via RLVR: the model earns reward when its answer falls within 5% of the simulator's recorded value, and nothing otherwise. After 200 RL steps on this data, the model improves on real IPhO problems it has never seen: zero-shot transfer from simulation to real competition questions.
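To make the filter and reward mechanics concrete, here is a minimal Python sketch of the two checks as the paragraph above describes them. The Scene dataclass, its without() helper, and the simulate callable are hypothetical stand-ins rather than the repo's actual interfaces; only the 5% relative tolerance and the remove-one-entity rule come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scene:
    """Hypothetical stand-in for a compiled simulator scene."""
    entities: tuple  # e.g., ("block", "pulley", "counterweight")

    def without(self, entity: str) -> "Scene":
        # Hypothetical helper: the same scene minus one entity.
        return Scene(tuple(e for e in self.entities if e != entity))

def rlvr_reward(pred: float, truth: float, rel_tol: float = 0.05) -> float:
    """Binary RLVR reward: 1.0 if the model's numeric answer lands within
    5% of the simulator's recorded value, 0.0 otherwise."""
    if truth == 0.0:
        # Assumption: fall back to an absolute check when the truth is zero.
        return float(abs(pred) <= rel_tol)
    return float(abs(pred - truth) / abs(truth) <= rel_tol)

def passes_shortcut_filter(scene: Scene, question: str, truth: float,
                           simulate) -> bool:
    """Reject shortcut-solvable questions: if deleting any single entity
    from the scene still yields the correct answer, the question never
    required reasoning about that entity, so it is discarded."""
    for entity in scene.entities:
        pred = simulate(scene.without(entity), question)
        if rlvr_reward(pred, truth) == 1.0:
            return False  # correct without this entity -> shortcut exists
    return True
```

In this sketch, simulate would re-run the reduced scene in MuJoCo and return a numeric answer; per the paper, only about 15% of generated questions survive this check.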

01 · Zero-cost ground truth via MuJoCo simulation: the simulator records exact numerical values as reward signals; you never need human annotators, LLM judges, or external answer keys.
02 · The shortcut-solution filter removes low-quality QA pairs: without it, the 3B model's IPhO gain drops from +7.5% to +1.46% (Table 4b), so this single component, sketched above, accounts for most of the performance gain.
03 · 6,400 synthetic pairs outperform 17,000 real ones: per Table 6, RL on Sim2Reason data scores 13.15% on IPhO versus 9.98% for RL on the DAPO-17K real problems; synthetic data efficiency beats real data volume.
04 · Skip SFT here: Table 3 shows SFT on 200K demonstrations costs the 32B model 3.9% on IPhO, while RLVR gains +5.4%; the repo gives you the RL pipeline that works, not the SFT shortcut that backfires.
05 · The YAML DSL admits new physics entity types without rewriting simulation code: adding spring-mass or orbital-mechanics scenes requires only a new YAML entity definition (see the sketch after this list).
06 · Cross-domain transfer is included: after physics-only RL training, the 32B model improves +17.90% on JEEBench, +3.12% on OlympiadBench, and +4.4% on MATH 500; the reasoning gains generalize beyond mechanics.
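To make point 05 concrete, here is an illustrative scene in the spirit of the paper's DSL. Every field name below is invented for this sketch and is not the repo's actual schema; the numbers describe a standard Atwood machine, so the forward answer is checkable by hand: a = (m2 - m1)g / (m1 + m2) ≈ 1.962 m/s², giving v(3 s) ≈ 5.886 m/s.

```yaml
# Hypothetical scene definition: field names are illustrative only,
# not the paper's actual DSL schema.
scene:
  gravity: 9.81                  # m/s^2
  entities:
    - {type: mass,   id: block,         kg: 2.0}
    - {type: mass,   id: counterweight, kg: 3.0}
    - {type: pulley, id: p1, connects: [block, counterweight]}
questions:
  - kind: forward                # "what is the velocity at t = 3 s?"
    query: velocity(block, t=3.0)
  - kind: reverse                # "what mass produces this velocity?"
    solve_for: counterweight.kg
    given: velocity(block, t=3.0) == 5.886
```

In this reading, a new entity type (a spring, an orbiting body) would slot in as one more entity variant that the compiler lowers to the corresponding MuJoCo elements, which is the extensibility claim in point 05.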
Who it’s for

If you work on LLM post-training — specifically RLVR pipelines — and you need non-math domains where synthetic data can replace scarce human-curated QA, this paper directly addresses your data bottleneck. Robotics and physics simulation engineers who want to connect MuJoCo-based workflows to language model training will find the DSL-to-MuJoCo pipeline reusable for their domain. This is not useful if you need a deployed model: no pre-trained weights are downloadable from the repo, no license is declared, and training requires a minimum of 8 H100/A100-class GPUs.

Worth exploring

Worth reading immediately if you work on RLVR data pipelines: the finding that shortcut filtering matters more than data volume applies to any synthetic QA generation project, not just physics. It is not ready for production use: no license is declared, the repo confirms no downloadable pretrained weights, training requires 8+ high-end GPUs, and the simulator DSL covers classical mechanics only. Read the paper; treat the code as a research artifact that requires hardware access and license verification before commercial use.
