"What looks elegant in a paper often becomes brittle in production, and the gap between academic robotics benchmarks and production annotation data shows up as a fundamentally different distribution of failures." (Scale Labs Physical AI team, labs.scale.com/blog/path-to-large-s...)
You know that feeling when 1,000 hours of robot video arrive every day and every subgoal needs an accurate caption before you can train a manipulation model? Manual annotation does not scale, and automated VLM captioning gives wildly inconsistent results depending on how you format the input. You try raw video and get 28.7% accuracy. You try prompting tricks and temporal markers, and they make things worse on smaller models. Which input representation actually works on production robot data, and why does adding more information sometimes hurt? Until now, that question had no published answer.
The study tests roughly 30 configurations against a fixed benchmark of 2,300 robot manipulation subgoal segments, changing one variable at a time: input format, sampling strategy, model class, or prompting technique. An LLM judge scores each output as acceptable (all rubric fields correct) or not acceptable. The core finding: when you arrange three keyframes (past, present, future) as a horizontal collage, the model reads temporal order from the left-to-right spatial layout without needing explicit timestamp tokens. Adding raw video, temporal markers, or intermediate object detection steps each introduces noise that degrades accuracy. Upgrading from a Flash-class to a Pro-class model adds 2.0 percentage points on top of the best representation strategy.
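To make the collage idea concrete, here is a minimal sketch in plain Python. It is not the study's implementation, since none is released. Frames are modeled as nested lists of pixel values rather than real images, and the names `pick_keyframes` and `horizontal_collage` are hypothetical. The choice of first, middle, and last frame as the past, present, and future triple is an assumption; the summary does not say how the study samples its keyframes.

```python
from typing import List, Sequence

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Frame = List[List[int]]


def pick_keyframes(num_frames: int) -> List[int]:
    """Return indices for a past / present / future triple.

    Assumption: first, middle, and last frame of the segment. The study's
    actual sampling rule is not given in this summary.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames for a past/present/future triple")
    return [0, num_frames // 2, num_frames - 1]


def horizontal_collage(frames: Sequence[Frame]) -> Frame:
    """Concatenate equally tall frames left to right into one image."""
    height = len(frames[0])
    if any(len(f) != height for f in frames):
        raise ValueError("all frames must have the same height")
    # Row r of the collage is row r of each frame, joined in temporal order,
    # so reading left to right corresponds to moving forward in time.
    return [[px for f in frames for px in f[r]] for r in range(height)]


# Toy segment: 9 frames of 2x2 pixels, each filled with its own frame index.
segment = [[[i, i], [i, i]] for i in range(9)]
indices = pick_keyframes(len(segment))
collage = horizontal_collage([segment[i] for i in indices])
```

In a real pipeline the same concatenation would be done with an imaging library on decoded video frames, and the single collage image would be sent to the VLM in place of the video. No timestamp tokens are added to the prompt, because the spatial order carries the temporal order.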
If you build VLM-based annotation pipelines for robot manipulation data, or you are choosing between input representation strategies for any video understanding task, this study hands you a ranked ablation table to start from instead of running your own experiments. It is also directly useful if you manage a Physical AI data budget and need to estimate headroom before hitting a structural quality wall. It is not useful if you need deployable code: no implementation is released, and all results are on Gemini models with an internal private dataset.
Yes, if you work on Physical AI data pipelines: the collage-over-video finding is immediately applicable to any VLM-based annotation workflow and costs nothing to test. The failure taxonomy (37/37/26 split) gives you a structured rubric for measuring caption quality. The main caveat: all results are single-run without confidence intervals on a private Gemini-only benchmark, so treat findings under 2 percentage points as directional rather than settled.
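A back-of-the-envelope check shows why differences under 2 percentage points deserve caution. The sketch below is an illustration, not part of the study. It treats each of the 2,300 segments as an independent pass/fail trial and uses the textbook binomial standard error; the helper name `acceptance_standard_error` is hypothetical.

```python
import math


def acceptance_standard_error(rate: float, n: int) -> float:
    """Binomial standard error of an acceptance rate measured on n segments."""
    return math.sqrt(rate * (1.0 - rate) / n)


N_SEGMENTS = 2300  # benchmark size stated in the summary

# 0.287 is the raw-video accuracy quoted above; 0.5 is the worst case for variance.
for rate in (0.287, 0.5):
    se = acceptance_standard_error(rate, N_SEGMENTS)
    half_width = 1.96 * se  # approximate 95% interval half-width
    print(f"rate={rate:.3f}  SE={100 * se:.2f} pp  95% half-width={100 * half_width:.2f} pp")
```

Under these assumptions, the sampling noise on a single configuration's acceptance rate is on the order of 1 percentage point, and the 95% half-width is close to 2 points. Because every configuration is scored on the same fixed benchmark, paired comparisons may be tighter than this suggests, but run-to-run variation in the model and the LLM judge is not captured at all, so the estimate is only a rough guide.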