“SFT yields only modest in-distribution gains on our synthetic evaluation and substantially degrades out-of-distribution performance (e.g., -3.9% on IPhO Mechanics). We hypothesize that this is driven by a large KL shift from the base Instruct model, which can induce catastrophi…”
You know that frustration when your LLM training data runs dry for a specific science domain? Physics is exactly that case: less than 1% of the 800K QA pairs in DeepSeek-R1's training data involve STEM topics (Section 1 of the paper), meaning internet-scale scraping simply does not cover physics at the depth needed for RLVR training. The obvious fixes, generating synthetic labels with GPT-4 or collecting 17,000 human-curated physics problems, turn out to make things worse: per Table 3 of the paper, SFT on 200K GPT-4/o3 demonstrations degraded IPhO performance by 3.9% for the 32B model. Sim2Reason bypasses both paths by using physics simulators, where ground truth is free, exact, and deterministic: the reward signal requires neither a teacher model nor a human grader.
A YAML-based language describes physics scenes (a mass on a pulley, two charged plates, an orbiting body) and compiles them to MuJoCo for simulation. The simulator records exact numerical answers — velocities, forces, tensions — which become the ground truth for three question types: forward ("what is the velocity at t=3s?"), reverse ("what mass produces this velocity?"), and symbolic ("express velocity as a function of t"). Only about 15% of generated questions pass a quality filter that removes shortcut-solvable problems: if the right answer can be obtained while ignoring any entity in the scene, the question is discarded. The remaining ~6,400 QA pairs train the LLM via RLVR: the model earns reward when its answer falls within 5% of the simulator's recorded value, and nothing otherwise. After 200 RL steps on this data, the model improves on real IPhO problems it has never seen — zero-shot transfer from simulation to real competition questions.
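The binary 5%-tolerance reward described above is simple enough to sketch in a few lines. This is a minimal illustration, not the paper's actual verifier; the function name, the zero-denominator handling, and the example numbers are all assumptions for clarity:

```python
def rlvr_reward(predicted: float, ground_truth: float, tol: float = 0.05) -> float:
    """Binary RLVR reward: 1.0 if the model's numeric answer is within
    `tol` relative error of the simulator's recorded value, else 0.0.
    (Illustrative sketch; the paper's exact verifier may differ, e.g.
    in how it handles a ground truth of exactly zero.)"""
    if ground_truth == 0.0:
        # Fall back to absolute tolerance when relative error is undefined.
        return 1.0 if abs(predicted) <= tol else 0.0
    rel_err = abs(predicted - ground_truth) / abs(ground_truth)
    return 1.0 if rel_err <= tol else 0.0

# Hypothetical forward question: "what is the velocity at t=3s?"
# Suppose the simulator recorded 12.40 m/s.
print(rlvr_reward(12.1, 12.40))  # ~2.4% relative error -> 1.0
print(rlvr_reward(9.0, 12.40))   # ~27% relative error  -> 0.0
```

The all-or-nothing shape matters: a graded (partial-credit) reward would leak signal to near-miss shortcut answers, while the hard 5% cutoff only pays out for a genuinely correct numeric solution.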
If you work on LLM post-training — specifically RLVR pipelines — and you need non-math domains where synthetic data can replace scarce human-curated QA, this paper directly addresses your data bottleneck. Robotics and physics simulation engineers who want to connect MuJoCo-based workflows to language model training will find the DSL-to-MuJoCo pipeline reusable for their domain. This is not useful if you need a deployed model: no pre-trained weights are downloadable from the repo, no license is declared, and training requires a minimum of 8 H100/A100-class GPUs.
Worth reading immediately if you work on RLVR data pipelines — the finding that shortcut filtering matters more than data volume is directly applicable to any synthetic QA generation project, not just physics. Not ready for production use: no license declared, no downloadable pretrained weights confirmed from the repo, training requires 8+ high-end GPUs, and the simulator DSL covers classical mechanics only. Read the paper; treat the code as a research artifact requiring hardware access and license verification before commercial use.