R&D · Advanced · 3 min read · May 11, 2026

Robot VLA beats GPT-5 on 13 embodied-reasoning benchmarks

“Ai2 released a robot model that beats GPT-5 on spatial reasoning, free under Apache 2.0, while Physical Intelligence raises at an $11B valuation with closed weights.”

Source · huggingface.co


You know that feeling when every robotics AI either locks you into a proprietary vendor, demands expensive hardware, or falls apart the moment you ask it to do something slightly different from its training set? Frontier VLAs like Physical Intelligence's π₀.₅ are closed-weight and closed-data: you cannot adapt them without a commercial agreement, and the inference loop can run too slowly for reactive closed-loop control. Open-weight alternatives like OpenVLA score 0.36 versus 0.51 on third-party benchmarks and offer no bimanual manipulation support. If you need a robot that works out of the box on accessible hardware with a credible path to open fine-tuning, there has been no production-grade open option.

robotics · vision-language-action · vla · open-source · embodied-ai · robot-manipulation · research-paper

MolmoAct2 starts with a vision-language model backbone — Molmo2-ER — trained on 3.3 million examples of spatial reasoning tasks like identifying where objects are relative to each other and what hand position would grab a specific item. Think of this as the robot's eyes and brain wired together. That brain connects to a separate action expert that converts spatial understanding into precise motor commands; the connection happens at every transformer layer — not just the final output — so the motor control reads the full spatial context at every step, not a summarized version. During a task, the robot captures a camera frame, the VLM reasons about what to do, and the action expert generates a batch of 10–30 moves as a continuous trajectory via flow-matching, which the robot then executes. An optional variant called MolmoAct2-Think adds a 10×10 depth grid per frame and only recomputes cells where the scene actually changed, cutting latency proportional to how static the scene is.
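To make that loop concrete, here is a minimal PyTorch-style sketch of the two pieces: an action expert whose every layer cross-attends to the matching VLM layer's key/value cache, plus flow-matching integration from noise to a continuous action chunk. All class names, dimensions, and step counts are hypothetical illustrations of the idea, not the released MolmoAct2 API.

```python
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    """One action-expert layer that cross-attends to one VLM layer's KV cache."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, kv):
        # Query: action tokens. Key/value: the matching VLM layer's cache,
        # so motor control reads full spatial context, not a summary.
        x = x + self.cross_attn(x, kv, kv, need_weights=False)[0]
        return x + self.ff(x)

class ActionExpert(nn.Module):
    """Generates a continuous action chunk conditioned on per-layer VLM KV caches."""
    def __init__(self, dim: int, n_layers: int, act_dim: int, chunk: int = 16):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.inp = nn.Linear(act_dim + 1, dim)        # noisy action + flow time t
        self.layers = nn.ModuleList(ExpertLayer(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, act_dim)            # predicts flow velocity

    def velocity(self, noisy, t, kv_per_layer):
        h = self.inp(torch.cat([noisy, t.expand(*noisy.shape[:2], 1)], dim=-1))
        for layer, kv in zip(self.layers, kv_per_layer):
            h = layer(h, kv)                          # per-layer conditioning
        return self.out(h)

    @torch.no_grad()
    def sample(self, kv_per_layer, steps: int = 10):
        # Flow matching at inference: Euler-integrate from Gaussian noise
        # to a smooth trajectory of `chunk` consecutive motor commands.
        a = torch.randn(1, self.chunk, self.act_dim)
        for i in range(steps):
            t = torch.full((1,), i / steps)
            a = a + self.velocity(a, t, kv_per_layer) / steps
        return a                                      # (1, chunk, act_dim)

# Hypothetical usage: pretend a 24-layer VLM produced 576 visual tokens per layer.
kv_caches = [torch.randn(1, 576, 512) for _ in range(24)]
expert = ActionExpert(dim=512, n_layers=24, act_dim=7)
trajectory = expert.sample(kv_caches)   # execute, then capture the next frame
```

Conditioning on per-layer caches rather than a single final hidden state is exactly what item 02 below ablates.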

01
Molmo2-ER backbone — the spatial reasoning VLM behind the policy scores 63.8% across 13 embodied-reasoning benchmarks, beating GPT-5 at 57.9% and Gemini Robotics ER-1.5 at 61.3%; higher spatial IQ in the backbone directly raises action accuracy
02
Per-layer KV-cache conditioning — each action expert transformer layer reads key-value projections from the corresponding VLM layer instead of only the final output; ablation confirms 95.9% on LIBERO versus 94.0% for the hidden-state approach
03
Three open datasets totaling ~920 hours — MolmoAct2-BimanualYAM (720h, 34.5K demos, 28 tasks), SO-100/101 (~184h, 38K episodes from 1,222 community LeRobot datasets), and DROID (74.6K episodes, 17.7M frames); you can fine-tune without collecting your own data
04
OpenFAST action tokenizer — converts continuous robot actions across 5 embodiments into a shared 2,048-token discrete vocabulary trained on 1 million trajectories; you skip writing a custom tokenization pipeline and pre-train across embodiments (a simplified tokenization sketch follows this list)
05
MolmoAct2-Think adaptive depth — computes a 10×10 depth grid per frame and regenerates only changed cells via a cosine-similarity threshold of 0.996; adds +2.2% on LIBERO-Long at 95.4% while reducing latency in proportion to the scene-change rate (see the depth-grid sketch after this list)
06
Out-of-the-box multi-task performance — SO-100/101 (56.7%), bimanual YAM on sub-$6K hardware (50.1%), and Franka DROID (87.1%), all without writing task-specific training code
07
Apache 2.0 license — commercial use, fine-tuning, and redistribution are all permitted; Physical Intelligence's π₀.₅ and π₀.₇ carry no comparable open license
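Item 04 above claims you can skip writing a tokenization pipeline; as a point of reference, here is a deliberately oversimplified sketch of what discrete action tokenization means: normalize continuous actions, bin them into a fixed 2,048-id vocabulary, and decode back to bin centers. The actual OpenFAST tokenizer is trained on 1 million trajectories and learns its vocabulary; nothing below reflects its real scheme, only the interface.

```python
import numpy as np

VOCAB_SIZE = 2048   # shared discrete vocabulary size, per item 04

def tokenize(actions: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map continuous actions of shape (T, act_dim) to ids in [0, VOCAB_SIZE)."""
    norm = np.clip((actions - low) / (high - low), 0.0, 1.0)       # -> [0, 1]
    return np.minimum((norm * VOCAB_SIZE).astype(np.int64), VOCAB_SIZE - 1)

def detokenize(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Invert tokenize up to quantization error by decoding to bin centers."""
    norm = (tokens.astype(np.float64) + 0.5) / VOCAB_SIZE
    return low + norm * (high - low)

# Hypothetical 7-DoF arm with per-dimension bounds taken from training data.
low, high = np.full(7, -1.0), np.full(7, 1.0)
chunk = np.random.uniform(-1.0, 1.0, size=(16, 7))
ids = tokenize(chunk, low, high)
# Round-trip error is bounded by one bin width of the action range.
assert np.abs(detokenize(ids, low, high) - chunk).max() < 2.0 / VOCAB_SIZE
```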
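And the adaptive depth trick from item 05, in miniature: cache one feature vector per grid cell and re-run the expensive depth estimate only where cosine similarity against the cache falls below 0.996. Both `embed_cell` and `estimate_depth` are hypothetical stand-ins, not MolmoAct2-Think internals.

```python
import numpy as np

GRID, THRESH = 10, 0.996   # 10x10 grid, cosine-similarity threshold from the paper

def embed_cell(cell: np.ndarray) -> np.ndarray:
    # Stand-in feature: unit-normalized flattened pixels.
    v = cell.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def estimate_depth(cell: np.ndarray) -> float:
    return float(cell.mean())   # placeholder for the real depth head

def update_depth(frame: np.ndarray, prev_feats: list, depth: np.ndarray) -> int:
    """Recompute depth only for grid cells whose content actually changed."""
    h, w = frame.shape[0] // GRID, frame.shape[1] // GRID
    recomputed = 0
    for i in range(GRID):
        for j in range(GRID):
            cell = frame[i * h:(i + 1) * h, j * w:(j + 1) * w]
            feat = embed_cell(cell)
            cached = prev_feats[i][j]
            if cached is None or float(feat @ cached) < THRESH:
                depth[i, j] = estimate_depth(cell)   # changed cell: recompute
                prev_feats[i][j] = feat
                recomputed += 1
    return recomputed   # latency scales with this count

frame = np.random.rand(240, 320)
feats = [[None] * GRID for _ in range(GRID)]
depth = np.zeros((GRID, GRID))
print(update_depth(frame, feats, depth))   # 100: first frame computes every cell
print(update_depth(frame, feats, depth))   # 0: identical frame recomputes nothing
```

On a fully static frame the second call recomputes zero cells, which is why latency drops in proportion to how much of the scene stays unchanged.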
Who it’s for

If you do robotics research or robotics product development with SO-100, bimanual YAM, or Franka DROID hardware, MolmoAct2 gives you a capable inference baseline without building a training pipeline. If you study embodied AI and need open datasets, the 720-hour BimanualYAM dataset is the largest open bimanual dataset available. Not useful yet if you need to reproduce or modify the training process — the training code is not released as of May 2026.

Worth exploring

The weights and datasets are worth exploring now if you have SO-100, bimanual YAM, or Franka DROID hardware — real-world success rates of 87.1% on Franka tasks and 50.1% on bimanual manipulation are high enough to benchmark against your current baseline. Hold off on production commitments until training code ships, since you cannot reproduce paper results or fine-tune on new embodiments without it.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →