“SFT yields only modest in-distribution gains on our synthetic evaluation and substantially degrades out-of-distribution performance (e.g., -3.9% on IPhO Mechanics). We hypothesize that this is driven by a large KL shift from the base Instruct model, which can induce catastrophi…”
You know that frustration when your LLM training data runs dry for a specific science domain? Physics is exactly that case: less than 1% of the 800K QA pairs in DeepSeek-R1's training data involve STEM topics (Section 1 of the paper), meaning internet-scale scraping simply does not cover physics at the depth needed for RLVR training. The obvious fixes, generating synthetic labels with GPT-4 or collecting 17,000 human-curated physics problems, turn out to make things worse: per Table 3 of the paper, SFT on 200K GPT-4/o3 demonstrations degraded IPhO performance by 3.9% for the 32B model. Sim2Reason bypasses both paths by using physics simulators, where ground truth is free, exact, and deterministic: the reward signal requires neither a teacher model nor a human grader.
A YAML-based language describes physics scenes (a mass on a pulley, two charged plates, an orbiting body) and compiles them to MuJoCo for simulation. The simulator records exact numerical answers — velocities, forces, tensions — which become the ground truth for three question types: forward ("what is the velocity at t=3s?"), reverse ("what mass produces this velocity?"), and symbolic ("express velocity as a function of t"). Only about 15% of generated questions pass a quality filter that removes shortcut-solvable problems: if the right answer can be obtained while ignoring any entity in the scene, the question is discarded. The remaining ~6,400 QA pairs train the LLM via RLVR: the model earns reward when its answer falls within 5% of the simulator's recorded value, and nothing otherwise. After 200 RL steps on this data, the model improves on real IPhO problems it has never seen — zero-shot transfer from simulation to real competition questions.
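The binary 5%-tolerance reward described above is simple enough to sketch in a few lines. This is a minimal illustration, not the paper's actual verifier; the function name, the zero-denominator handling, and the example numbers are all assumptions for clarity:

```python
def rlvr_reward(predicted: float, ground_truth: float, tol: float = 0.05) -> float:
    """Binary RLVR reward: 1.0 if the model's numeric answer is within
    `tol` relative error of the simulator's recorded value, else 0.0.
    (Illustrative sketch; the paper's exact verifier may differ, e.g.
    in how it handles a ground truth of exactly zero.)"""
    if ground_truth == 0.0:
        # Fall back to absolute tolerance when relative error is undefined.
        return 1.0 if abs(predicted) <= tol else 0.0
    rel_err = abs(predicted - ground_truth) / abs(ground_truth)
    return 1.0 if rel_err <= tol else 0.0

# Hypothetical forward question: "what is the velocity at t=3s?"
# Suppose the simulator recorded 12.40 m/s.
print(rlvr_reward(12.1, 12.40))  # ~2.4% relative error -> 1.0
print(rlvr_reward(9.0, 12.40))   # ~27% relative error  -> 0.0
```

The all-or-nothing shape matters: a graded (partial-credit) reward would leak signal to near-miss shortcut answers, while the hard 5% cutoff only pays out for a genuinely correct numeric solution.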
If you work on LLM post-training — specifically RLVR pipelines — and you need non-math domains where synthetic data can replace scarce human-curated QA, this paper directly addresses your data bottleneck. Robotics and physics simulation engineers who want to connect MuJoCo-based workflows to language model training will find the DSL-to-MuJoCo pipeline reusable for their domain. This is not useful if you need a deployed model: no pre-trained weights are downloadable from the repo, no license is declared, and training requires a minimum of 8 H100/A100-class GPUs.
Worth reading immediately if you work on RLVR data pipelines — the finding that shortcut filtering matters more than data volume is directly applicable to any synthetic QA generation project, not just physics. Not ready for production use: no license declared, no downloadable pretrained weights confirmed from the repo, training requires 8+ high-end GPUs, and the simulator DSL covers classical mechanics only. Read the paper; treat the code as a research artifact requiring hardware access and license verification before commercial use.