R&D · Advanced · 3 min read · May 11, 2026

Robot VLA beats GPT-5 on 13 embodied-reasoning benchmarks

“Ai2 released a robot model that beats GPT-5 on spatial reasoning, free under Apache 2.0, while Physical Intelligence raises at an $11B valuation with closed weights.”

Source · huggingface.co


You know that feeling when every robotics AI either locks you into a proprietary vendor, demands expensive hardware, or falls apart the moment you ask it to do something slightly different from its training set? Frontier VLAs like Physical Intelligence's π₀.₅ are closed-weight and closed-data: you cannot adapt them without a commercial agreement, and the inference loop can run too slowly for reactive closed-loop control. Open-weight alternatives like OpenVLA score 0.36 versus 0.51 on third-party benchmarks and offer no bimanual manipulation support. If you need a robot that works out of the box on accessible hardware with a credible path to open fine-tuning, there has been no production-grade open option.

robotics · vision-language-action · vla · open-source · embodied-ai · robot-manipulation · research-paper

MolmoAct2 starts with a vision-language model backbone — Molmo2-ER — trained on 3.3 million examples of spatial reasoning tasks like identifying where objects are relative to each other and what hand position would grab a specific item. Think of this as the robot's eyes and brain wired together. That brain connects to a separate action expert that converts spatial understanding into precise motor commands; the connection happens at every transformer layer — not just the final output — so the motor control reads the full spatial context at every step, not a summarized version. During a task, the robot captures a camera frame, the VLM reasons about what to do, and the action expert generates a batch of 10–30 moves as a continuous trajectory via flow-matching, which the robot then executes. An optional variant called MolmoAct2-Think adds a 10×10 depth grid per frame and only recomputes cells where the scene actually changed, cutting latency proportional to how static the scene is.
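To make that loop concrete, here is a minimal PyTorch-style sketch of the two pieces: an action expert whose every layer cross-attends to the matching VLM layer's key/value cache, plus flow-matching integration from noise to a continuous action chunk. All class names, dimensions, and step counts are hypothetical illustrations of the idea, not the released MolmoAct2 API.

```python
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    """One action-expert layer that cross-attends to one VLM layer's KV cache."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, kv):
        # Query: action tokens. Key/value: the matching VLM layer's cache,
        # so motor control reads full spatial context, not a summary.
        x = x + self.cross_attn(x, kv, kv, need_weights=False)[0]
        return x + self.ff(x)

class ActionExpert(nn.Module):
    """Generates a continuous action chunk conditioned on per-layer VLM KV caches."""
    def __init__(self, dim: int, n_layers: int, act_dim: int, chunk: int = 16):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.inp = nn.Linear(act_dim + 1, dim)        # noisy action + flow time t
        self.layers = nn.ModuleList(ExpertLayer(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, act_dim)            # predicts flow velocity

    def velocity(self, noisy, t, kv_per_layer):
        h = self.inp(torch.cat([noisy, t.expand(*noisy.shape[:2], 1)], dim=-1))
        for layer, kv in zip(self.layers, kv_per_layer):
            h = layer(h, kv)                          # per-layer conditioning
        return self.out(h)

    @torch.no_grad()
    def sample(self, kv_per_layer, steps: int = 10):
        # Flow matching at inference: Euler-integrate from Gaussian noise
        # to a smooth trajectory of `chunk` consecutive motor commands.
        a = torch.randn(1, self.chunk, self.act_dim)
        for i in range(steps):
            t = torch.full((1,), i / steps)
            a = a + self.velocity(a, t, kv_per_layer) / steps
        return a                                      # (1, chunk, act_dim)

# Hypothetical usage: pretend a 24-layer VLM produced 576 visual tokens per layer.
kv_caches = [torch.randn(1, 576, 512) for _ in range(24)]
expert = ActionExpert(dim=512, n_layers=24, act_dim=7)
trajectory = expert.sample(kv_caches)   # execute, then capture the next frame
```

Conditioning on per-layer caches rather than a single final hidden state is exactly what item 02 below ablates.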

01
Molmo2-ER backbone — the spatial reasoning VLM behind the policy scores 63.8% across 13 embodied-reasoning benchmarks, beating GPT-5 at 57.9% and Gemini Robotics ER-1.5 at 61.3%; higher spatial IQ in the backbone directly raises action accuracy
02
Per-layer KV-cache conditioning — each action expert transformer layer reads key-value projections from the corresponding VLM layer instead of only the final output; ablation confirms 95.9% on LIBERO versus 94.0% for the hidden-state approach
03
Three open datasets totaling ~920 hours — MolmoAct2-BimanualYAM (720h, 34.5K demos, 28 tasks), SO-100/101 (~184h, 38K episodes from 1,222 community LeRobot datasets), and DROID (74.6K episodes, 17.7M frames); you can fine-tune without collecting your own data
04
OpenFAST action tokenizer — converts continuous robot actions across 5 embodiments into a shared 2,048-token discrete vocabulary trained on 1 million trajectories; you skip writing a custom tokenization pipeline and pre-train across embodiments (a simplified tokenization sketch follows this list)
05
MolmoAct2-Think adaptive depth — computes a 10×10 depth grid per frame and regenerates only changed cells via a cosine-similarity threshold of 0.996; adds +2.2% on LIBERO-Long at 95.4% while reducing latency in proportion to the scene-change rate (see the depth-grid sketch after this list)
06
Out-of-the-box multi-task performance — SO-100/101 (56.7%), bimanual YAM on sub-$6K hardware (50.1%), and Franka DROID (87.1%), all without writing task-specific training code
07
Apache 2.0 license — commercial use, fine-tuning, and redistribution are all permitted; Physical Intelligence's π₀.₅ and π₀.₇ carry no comparable open license
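Item 04 above claims you can skip writing a tokenization pipeline; as a point of reference, here is a deliberately oversimplified sketch of what discrete action tokenization means: normalize continuous actions, bin them into a fixed 2,048-id vocabulary, and decode back to bin centers. The actual OpenFAST tokenizer is trained on 1 million trajectories and learns its vocabulary; nothing below reflects its real scheme, only the interface.

```python
import numpy as np

VOCAB_SIZE = 2048   # shared discrete vocabulary size, per item 04

def tokenize(actions: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map continuous actions of shape (T, act_dim) to ids in [0, VOCAB_SIZE)."""
    norm = np.clip((actions - low) / (high - low), 0.0, 1.0)       # -> [0, 1]
    return np.minimum((norm * VOCAB_SIZE).astype(np.int64), VOCAB_SIZE - 1)

def detokenize(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Invert tokenize up to quantization error by decoding to bin centers."""
    norm = (tokens.astype(np.float64) + 0.5) / VOCAB_SIZE
    return low + norm * (high - low)

# Hypothetical 7-DoF arm with per-dimension bounds taken from training data.
low, high = np.full(7, -1.0), np.full(7, 1.0)
chunk = np.random.uniform(-1.0, 1.0, size=(16, 7))
ids = tokenize(chunk, low, high)
# Round-trip error is bounded by one bin width of the action range.
assert np.abs(detokenize(ids, low, high) - chunk).max() < 2.0 / VOCAB_SIZE
```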
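And the adaptive depth trick from item 05, in miniature: cache one feature vector per grid cell and re-run the expensive depth estimate only where cosine similarity against the cache falls below 0.996. Both `embed_cell` and `estimate_depth` are hypothetical stand-ins, not MolmoAct2-Think internals.

```python
import numpy as np

GRID, THRESH = 10, 0.996   # 10x10 grid, cosine-similarity threshold from the paper

def embed_cell(cell: np.ndarray) -> np.ndarray:
    # Stand-in feature: unit-normalized flattened pixels.
    v = cell.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def estimate_depth(cell: np.ndarray) -> float:
    return float(cell.mean())   # placeholder for the real depth head

def update_depth(frame: np.ndarray, prev_feats: list, depth: np.ndarray) -> int:
    """Recompute depth only for grid cells whose content actually changed."""
    h, w = frame.shape[0] // GRID, frame.shape[1] // GRID
    recomputed = 0
    for i in range(GRID):
        for j in range(GRID):
            cell = frame[i * h:(i + 1) * h, j * w:(j + 1) * w]
            feat = embed_cell(cell)
            cached = prev_feats[i][j]
            if cached is None or float(feat @ cached) < THRESH:
                depth[i, j] = estimate_depth(cell)   # changed cell: recompute
                prev_feats[i][j] = feat
                recomputed += 1
    return recomputed   # latency scales with this count

frame = np.random.rand(240, 320)
feats = [[None] * GRID for _ in range(GRID)]
depth = np.zeros((GRID, GRID))
print(update_depth(frame, feats, depth))   # 100: first frame computes every cell
print(update_depth(frame, feats, depth))   # 0: identical frame recomputes nothing
```

On a fully static frame the second call recomputes zero cells, which is why latency drops in proportion to how much of the scene stays unchanged.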
Who it’s for

If you do robotics research or robotics product development with SO-100, bimanual YAM, or Franka DROID hardware, MolmoAct2 gives you a capable inference baseline without building a training pipeline. If you study embodied AI and need open datasets, the 720-hour BimanualYAM dataset is the largest open bimanual dataset available. Not useful yet if you need to reproduce or modify the training process — the training code is not released as of May 2026.

Worth exploring

The weights and datasets are worth exploring now if you have SO-100, bimanual YAM, or Franka DROID hardware — real-world success rates of 87.1% on Franka tasks and 50.1% on bimanual manipulation are high enough to benchmark against your current baseline. Hold off on production commitments until training code ships, since you cannot reproduce paper results or fine-tune on new embodiments without it.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →